Parser written in PHP is 5.6x faster than the same C++ program in a similar test (g++ 4.8.5)

Question

I'm absolutely dumbfounded by this. I was trying to demonstrate to myself how much faster C++ is than even modern PHP. I ran a simple CSV parsing program in both languages, each producing the same output. The CSV file has 40,194,684 lines, which are parsed down to 1,537,194 lines.

EDIT: This sparked a lot more conversation than I had anticipated. Here are the hardware stats for the machine both programs were run on (it's actually a VM running on a Nutanix server): CPU: Intel(R) Xeon(R) Silver 4215R CPU @ 3.20GHz, RAM: 16GB.

PHP code (runtime 42.750 s):

    <?php
    $i_fp = fopen("inFile.csv","r");
    $o_fp = fopen("outFile.csv","w");

    while(!feof($i_fp))
    {
        $line = fgets($i_fp);
        $split = explode(';',$line);
        if($split[3] == 'E' || $split[3] == 'T')
        {
            fwrite($o_fp,join(',',[ $split[0], $split[1], $split[3], $split[4], $split[5], $split[6],
                                    $split[10], $split[9],$split[11],$split[7],$split[32]])."\n");
        }
    }

    fclose($i_fp);
    fclose($o_fp);

C++ code (runtime 3 m 59.074s) (compiled using g++ parse.cpp -o parse -O2 -std=c++1y)

    #include <fstream>
    #include <stdlib.h>
    #include <string>
    #include <vector>

    using std::string;
    using std::vector;

    vector<string> splitStr(string line, const char delimiter = ',')
    {
        vector<string> splitLine;
        string buf;
        for(size_t i=0; i<line.length(); i++)
        {
            if(line[i] == delimiter)
            {
                splitLine.push_back(buf);
                buf.clear();
            }else{
                buf += line[i];
            }
        }
        return splitLine;
    }

    string makeCSVLine(vector<string> splitLine)
    {
        string line =
            splitLine[0] + ',' +
            splitLine[1] + ',' +
            splitLine[3] + ',' +
            splitLine[4] + ',' +
            splitLine[5] + ',' +
            splitLine[6] + ',' +
            splitLine[10] + ',' +
            splitLine[9] + ',' +
            splitLine[11] + ',' +
            splitLine[7] + ',' +
            splitLine[32] + '\n';
        return line;
    }

    int main(int argc, char* argv[])
    {
        if(argc < 3)
        {
            exit(EXIT_SUCCESS);
        }
        string inPath = argv[1];
        string outPath = argv[2];

        std::ifstream inFile;
        std::ofstream outFile;

        inFile.open(inPath.c_str());
        outFile.open(outPath.c_str());

        string line;
        while(std::getline(inFile,line))
        {
            vector<string> split = splitStr(line, ';');
            if(split[3][0] == 'E' || split[3][0] == 'T')
            {
                outFile << makeCSVLine(split);
            }
        }
        inFile.close();
        outFile.close();
    }

Both were run on Red Hat Linux 8. I'm sure that it's some mistake I'm making in terms of C++ efficiency (possibly somewhere in how I'm using strings and vectors, and whether they need to be re-sized repeatedly per loop), but I'm not sure what it could be. If anyone could help shed some light, that would be great.

EDIT: Unfortunately, I cannot provide the input file as it's a sensitive internal file.

Thanks to everyone for taking so much interest in this and all the advice provided. I've been extremely busy at work lately and unable to re-visit but look forward to doing so soon.

Got a chuckle from the title. :-) Can you share the input file? — Loki Astari, Commented Jul 30, 2020 at 16:46
"I was trying to demonstrate to myself how much faster c++ is than even modern PHP." - You have demonstrated something that is often overlooked, which is that languages don't have performance, programs written in those languages have performance within a particular compilation and runtime configuration. Note that PHP is a compiled language, and the OpCache extension can optimise that compiled code, even before PHP 8 with its JIT engine, so if you're compiling with a high optimisation level, make sure to configure OpCache for a fair PHP comparison. — IMSoP, Commented Jul 31, 2020 at 11:07
Re "PHP parser": Don't you mean the parsed PHP program (not the PHP parser itself)? — Peter Mortensen, Commented Jul 31, 2020 at 13:10
This is an awesome thread. I think it demonstrates something often overlooked. An ordinary competent programmer will, in 2020, get good performance with script languages (PHP, Python...). Getting better performance with traditional languages requires high-level knowledge that not everyone has. — Christian Lescuyer, Commented Aug 1, 2020 at 16:41

aki · Accepted Answer · 2020-08-01 17:03:12Z

Always Profile Optimized Code.

https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rper-measure
Use -O3 optimisation: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
Use a profiler: https://github.com/KDAB/hotspot
- https://en.wikipedia.org/wiki/List_of_performance_analysis_tools
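
Before reaching for a full profiler, a coarse stopwatch around a whole phase (reading, splitting, writing) can already show where the time goes. The following is a minimal sketch, not a substitute for the tools listed above; `seconds_taken` is a hypothetical helper name, and it only assumes the standard `<chrono>` header.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not from the thread): runs `work` once and returns
// the elapsed wall-clock time in seconds, measured with a monotonic clock.
template <typename F>
double seconds_taken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A call such as `seconds_taken([&] { /* parse the file */ })` should be made on an optimized build and repeated a few times, since single runs fluctuate.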

Reduce the duplication

    string inPath = argv[1];
    string outPath = argv[2];

    std::ifstream inFile;
    std::ofstream outFile;

    inFile.open(inPath.c_str());
    outFile.open(outPath.c_str());

to

    std::ifstream inFile(argv[1]);
    std::ofstream outFile(argv[2]);

Avoid string operations and allocations as much as possible. Prefer std::string_view if the string is only being read.
Remove string makeCSVLine(vector<string> splitLine) and use a formatter library like {fmt} (https://github.com/fmtlib/fmt). Just to make the code prettier in
```
outFile << makeCSVLine(split);
```
you're paying a significant time penalty. Or use the good old fprintf (discussed below) if that turns out to be faster. If there is no significant time gain, follow the guidelines and use fmt + streams.
```
fmt::print(<FILE*>, "{},{},{},{},{},{},{},{},{},{},{}\n",
           vec[0], vec[1], vec[3], vec[4], vec[5], vec[6],
           vec[10], vec[9], vec[11], vec[7], vec[32]);
```
Make it a macro or a lambda, or a function with the inline attribute set, if you want to use it with other answers but in a separate block of code.
See also the speed tests by fmt (source file).
vector<string> splitStr(string line, const char delimiter = ',')
Avoid returning the vector and pass it by reference to fill it inside the function ( return type will be void). This makes it Return Value Optimisation independent. All compilers will treat it the same way.
Also, consider using .reserve(), and/or .emplace_back() for the vector. reserve() has been tested to improve performance.
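
The points above (fill a caller-owned vector, reserve, emplace_back) can be sketched together as follows. This is an illustration under assumptions, not the code that was measured in this answer: `split_into` and its `expected_fields` hint are hypothetical names, and 33 is only a guess based on the highest index (32) used by the question's code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `line` on `delimiter` into the caller's vector.
// The vector is reused between calls, so its buffer is allocated only once.
void split_into(const std::string& line, char delimiter,
                std::vector<std::string>& out,
                std::size_t expected_fields = 33) {  // assumed field count
    out.clear();                   // drops old fields, keeps the vector's capacity
    out.reserve(expected_fields);  // does nothing once capacity is large enough
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = line.find(delimiter, start);
        if (pos == std::string::npos) {
            out.emplace_back(line, start);  // last field: the rest of the line
            return;
        }
        out.emplace_back(line, start, pos - start);  // builds the field in place
        start = pos + 1;
    }
}
```

Declaring the vector once outside the read loop and passing it to every call is what makes the reuse pay off; the individual field strings are still allocated per line.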

Use stringstream + getline with a delimiter. If you doubt that this is time costly, profile. Don't guess the performance results, measure them.

    // Parameter renamed from `string`, which would hide the type name
    // and stop `string word{}` from compiling. Needs <sstream>.
    void split_by_delim(const string &line, char delim, vector<string> &r_out)
    {
        std::stringstream ss(line);
        string word{}; // reserve space if you can guess it right.
        while (std::getline(ss, word, delim)) {
            if (!word.empty()) {
                r_out.push_back(word);
            }
        }
    }

Avoid fstreams iff the reader or writer is the biggest time sink. fprintf has been 40% faster in my tests with no loss in flexibility. I used it for writing ints and floats; it may vary with strings. (Edit: yes, it varied, and the gain is insignificant compared to the other benefits of streams, or of fmt.)
Re comments that Stream IO is as fast as printf family IO, take it from Herb Sutter & Bjarne Stroustrup:
It is often (and often correctly) pointed out that the printf() family has two advantages compared to iostreams: flexibility of formatting and performance. This has to be weighed against iostreams advantages of extensibility to handle user-defined types, resilient against security violations, implicit memory management, and locale handling.
If you need I/O performance, you can almost always do better than printf().
Emphasis mine.
- https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rio-streams
In the current code, the reading speed (getline()) is bound by splitting the string and the write speed. In other words, more lines cannot be read as long as file writer has not done its job. You're not using the disk's read speed to the full potential here.
Consider splitting them such that all reading is done at once and data is stored in memory and it is written out at once.
If you want to keep peak memory usage to minimum, make use of threads and separate the reader and the (splitter + writer) in asynchronous threads.
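
The "read everything first" variant can be sketched as below. It is only an illustration and assumes the whole file fits in memory, which may not hold for a 40-million-line input on a 16GB machine; `read_all` and `for_each_line` are hypothetical names, and the threaded alternative is not shown.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <string_view>

// Reads the entire stream into one string (assumes it fits in memory).
std::string read_all(std::istream& in) {
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Calls `on_line` with a view of each line in `data`, without copying.
// The views are valid only while the underlying string is alive.
template <typename F>
void for_each_line(std::string_view data, F&& on_line) {
    while (!data.empty()) {
        const std::size_t eol = data.find('\n');
        on_line(data.substr(0, eol));  // substr clamps npos to the end of data
        if (eol == std::string_view::npos) {
            break;                     // last line had no trailing newline
        }
        data.remove_prefix(eol + 1);
    }
}
```

Whether this beats line-by-line `std::getline` depends on the disk and the file size, so it should be measured rather than assumed.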

Addendum

Machine: MacBook Air 2017, macOS Mojave, MQD32LL (https://en.wikipedia.org/wiki/MacBook_Air#Technical_specifications_2)
Profiler: Instruments.app
Compiler: Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Flags: -Ofast (and linking with {fmt} where required)
PHP: 7.1.23 (cli) (built: Feb 22 2019 22:19:32) ( NTS )

Writer code to make CSV file, derived from Edward's answer for consistency: https://ideone.com/gkmUUN

Note that timings that are close to each other should be considered the same: run-to-run fluctuations are significant for results in the 3 to 5 second range.

Matthew's code with the knowledge of line length and word length: 2.6s

Matthew's code as of rev 1 : 2.7s

Edward's algorithm with char array storage & {fmt}: https://ideone.com/Kfkp90. This depends on the knowledge that the incoming CSV has a maximum line length of 300 characters and a maximum word length of 20 characters. 2.8s.

Martin's code as of rev 7: 2.8s

For less bug-prone code, and dealing with unknown length strings: https://ideone.com/siFp3A. This is Edward's algorithm which uses {fmt}. 4.1s.

gdate +%s.%3N && php parser.php && gdate +%s.%3N, where parser.php is the PHP version in Edward's answer as of rev 5: 4.4s

Edward's code as of rev 1: 4.75s

Edward's code with fprintf (https://ideone.com/0Lmr5P): 4.8s

OP's code with basic optimisations and fprintf (https://ideone.com/5ivw9R): 5.6s

OP's C++ code posted in the question: 6.2s

OP's C++ code with -O2: 6.4s

OP's C++ code with -O0: 45s

Comments are not for extended discussion; this conversation has been moved to chat. — Mathieu Guindon, Commented Aug 1, 2020 at 19:08
"Programs spend surprising amounts of time in the damnedest places" - Anonymous — Bob Jarvis - Слава Україні, Commented Aug 1, 2020 at 21:11

Loki Astari · Accepted Answer · 2020-08-01 03:55:23Z

Overview

Akki has done a fine job on the review. Some things I want to emphasize:

You pass things by value rather than using const references.

    vector<string> splitStr(string const& line, const char delimiter = ',')
                                   ^^^^^^
                                   otherwise you are copying the line.

    string makeCSVLine(vector<string> const& splitLine)
                                      ^^^^^^
                                      Copying a vector of strings that has to hurt.

Rather than building a string for output, have a formatter object that knows how to stream your object (that is more C++-like).

    std::cout << MyFormat(splitLine);

Now MyFormat is an object that simply keeps a reference to the splitLine.

    struct MyFormat
    {
        std::vector<std::string> const& data;
        MyFormat(std::vector<std::string> const& data) :data(data) {}
    };

But then you write an output formatter that knows how to stream the object:

    std::ostream& operator<<(std::ostream& str, MyFormat const& value)
    {
        return str << value.data[0] << ","
                   << value.data[22] << "\n";
    }

I refer you to my CSVIterator

How can I read and parse CSV files in C++?

Something that has turned up in this optimization battle: the use of string_view definitely helps in terms of performance (not really surprising).

But the nicest thing is that simply updating the interface to use string_view and re-compiling works, without changing the rest of the code.

This should work

    #include <iterator>
    #include <iostream>
    #include <fstream>
    #include <sstream>
    #include <vector>
    #include <string>
    #include <string_view>   // needed for std::string_view

    class CSVRow
    {
        using size_type = std::string::size_type;

        public:
            std::string_view operator[](std::size_t index) const
            {
                // Note the m_data[x] items point at where the ';' is.
                // So there is some extra +1 to move to the next item
                // and when calculating lengths.
                return std::string_view(&m_line[m_data[index] + 1], m_data[index + 1] - (m_data[index] + 1));
            }
            std::size_t size() const
            {
                // The m_data vector contains one more item
                // than there are elements.
                return m_data.size() - 1;
            }
            void readNextRow(std::istream& str)
            {
                std::getline(str, m_line);

                m_data.clear();
                m_data.emplace_back(-1);
                size_type pos = 0;
                while((pos = m_line.find(';', pos)) != std::string::npos)
                {
                    m_data.emplace_back(pos);
                    ++pos;
                }
                // This checks for a trailing comma with no data after it.
                pos = m_line.size();
                m_data.emplace_back(pos);
            }
        private:
            std::string             m_line;
            std::vector<size_type>  m_data;
    };

    std::istream& operator>>(std::istream& str, CSVRow& data)
    {
        data.readNextRow(str);
        return str;
    }

    class CSVIterator
    {
        public:
            typedef std::input_iterator_tag     iterator_category;
            typedef CSVRow                      value_type;
            typedef std::size_t                 difference_type;
            typedef CSVRow*                     pointer;
            typedef CSVRow&                     reference;

            CSVIterator(std::istream& str)  :m_str(str.good()?&str:NULL) { ++(*this); }
            CSVIterator()                   :m_str(NULL) {}

            // Pre Increment
            CSVIterator& operator++()               {if (m_str) { if (!((*m_str) >> m_row)){m_str = NULL;}}return *this;}
            // Post increment
            CSVIterator operator++(int)             {CSVIterator tmp(*this);++(*this);return tmp;}
            CSVRow const& operator*()   const       {return m_row;}
            CSVRow const* operator->()  const       {return &m_row;}

            bool operator==(CSVIterator const& rhs) {return ((this == &rhs) || ((this->m_str == NULL) && (rhs.m_str == NULL)));}
            bool operator!=(CSVIterator const& rhs) {return !((*this) == rhs);}
        private:
            std::istream*       m_str;
            CSVRow              m_row;
    };

    class CVSRange
    {
        std::istream&   stream;
        public:
            CVSRange(std::istream& str)
                : stream(str)
            {}
            CSVIterator begin() const {return CSVIterator{stream};}
            CSVIterator end()   const {return CSVIterator{};}
    };

    class ReFormatRow
    {
        CSVRow const&   row;
        public:
            ReFormatRow(CSVRow const& row)
                : row(row)
            {}
            friend std::ostream& operator<<(std::ostream& str, ReFormatRow const& data)
            {
                str << data.row[0] << ','
                    << data.row[1] << ','
                    << data.row[3] << ','
                    << data.row[4] << ','
                    << data.row[5] << ','
                    << data.row[6] << ','
                    << data.row[10] << ','
                    << data.row[9] << ','
                    << data.row[11] << ','
                    << data.row[7] << ','
                    << data.row[32] << '\n';
                return str;
            }
    };

Then the main becomes really simple:

    int main(int argc, char* argv[])
    {
        if (argc != 3) {
            std::cerr << "Bad Arguments\n";
            return -1;
        }

        std::ifstream       input(argv[1]);
        std::ofstream       output(argv[2]);

        for(auto& row : CVSRange(input))
        {
            if(row[3][0] == 'E' || row[3][0] == 'T')
            {
                output << ReFormatRow(row);
            }
        }
        return 0;
    }

Very elegant! One small detail is that the input file actually uses a semicolon as its delimiter. The other is that it would be good to have error checking in case a short line is read with fewer than 32 fields. — Edward, Commented Jul 30, 2020 at 18:32
Updated to use string_view (this removed the need to make copies of each element) on the line. I simply store the line and keep track of where each element ends thus allowing me to return a string_view of an element. — Loki Astari, Commented Jul 31, 2020 at 18:00

Edward · Accepted Answer · 2020-07-31 18:11:00Z

There are a number of things you can do to improve your code.

Use const references where practical

The parameters passed to the functions can be sped up by passing them as const references instead of by value. Doing so tells both the compiler and other readers of the code that the passed parameter will not be altered, and allows for additional optimizations by the compiler.
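
To make the cost of pass-by-value visible, the following sketch counts copies with an instrumented type. `Tracked`, `count_by_value` and `count_by_cref` are hypothetical names invented for this illustration; they are not part of the reviewed code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical instrumented type that counts how often it is copied.
struct Tracked {
    static int copies;
    std::vector<std::string> fields;
    Tracked() = default;
    Tracked(const Tracked& other) : fields(other.fields) { ++copies; }
};
int Tracked::copies = 0;

// Pass by value: the whole vector of strings is copied for the call.
std::size_t count_by_value(Tracked row) { return row.fields.size(); }

// Pass by const reference: no copy is made.
std::size_t count_by_cref(const Tracked& row) { return row.fields.size(); }
```

Calling `count_by_value` with an existing object increments `Tracked::copies`, while `count_by_cref` leaves it unchanged; in the question's code the same copy happens once per input line in both `splitStr` and `makeCSVLine`.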

Use `reserve` to improve speed

Since we know that the size of the vector must be at least 33 fields, it makes sense to use reserve to preallocate space.
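
As a small check of what `reserve` buys, the sketch below verifies that the vector's buffer does not move while it is filled up to the reserved size. The function name `fills_without_reallocation` is made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true if pushing `count` elements after reserve(count)
// never moved the vector's buffer, i.e. no reallocation took place.
bool fills_without_reallocation(std::size_t count) {
    std::vector<std::string> fields;
    fields.reserve(count);                        // one allocation up front
    const std::string* const buffer = fields.data();
    for (std::size_t i = 0; i < count; ++i) {
        fields.push_back("field");
    }
    return fields.data() == buffer && fields.capacity() >= count;
}
```

Without the `reserve` call, a typical implementation grows the buffer several times on the way to 33 elements, moving the already-stored strings each time.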

Avoid constructing temporary variables

Rather than creating a std::string temporarily to print the output, an alternative approach would be to create a function that outputs them directly to the output.

Avoid work if possible

While it sounds like it might be the life philosophy of Tom Sawyer, it's also a good idea for optimizing software for performance. For instance, since the code is looking for something specific in the fourth field, if that criterion is not met by the time the fourth field is parsed, there's no reason to continue to parse the line. One way to convey a value that may or not be there is via std::optional which was introduced in C++17.
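
One way to push this idea further, sketched here under assumptions rather than taken from the measured code, is to look at only the fourth field before splitting anything else. `nth_field` and `line_is_wanted` are hypothetical helpers built on `std::string_view`, so no field is copied.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns field number `index` (0-based) of `line`,
// or std::nullopt if the line has too few fields.
std::optional<std::string_view> nth_field(std::string_view line,
                                          std::size_t index, char delimiter) {
    for (std::size_t i = 0; i < index; ++i) {
        const std::size_t pos = line.find(delimiter);
        if (pos == std::string_view::npos) {
            return std::nullopt;              // line ended before field `index`
        }
        line.remove_prefix(pos + 1);          // skip one field and its delimiter
    }
    return line.substr(0, line.find(delimiter));
}

// True if the fourth field exists and starts with 'E' or 'T'.
bool line_is_wanted(std::string_view line) {
    const auto field = nth_field(line, 3, ';');
    return field && !field->empty() &&
           ((*field)[0] == 'E' || (*field)[0] == 'T');
}
```

Lines that fail `line_is_wanted` can then be skipped without ever building the full vector of fields.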

Results

csv.cpp

    #include <fstream>
    #include <string>
    #include <vector>
    #include <sstream>
    #include <optional>

    constexpr std::size_t minfields{33};

    std::optional<std::vector<std::string>> splitStr(const std::string& line, const char delimiter = ',')
    {
        std::vector<std::string> splitLine;
        splitLine.reserve(minfields);
        std::istringstream ss(line);
        std::string buf;
        unsigned field{0};
        while (std::getline(ss, buf, delimiter)) {
            splitLine.push_back(buf);
            if (field == 3 && buf[0] != 'E' && buf[0] != 'T') {
                return std::nullopt;
            }
            ++field;
        }
        if (splitLine.size() < minfields)
            return std::nullopt;
        return splitLine;
    }

    std::ostream& writeLine(std::ostream& out, const std::vector<std::string>& splitLine)
    {
        return out <<
            splitLine.at(0) << ',' <<
            splitLine.at(1) << ',' <<
            splitLine.at(3) << ',' <<
            splitLine.at(4) << ',' <<
            splitLine.at(5) << ',' <<
            splitLine.at(6) << ',' <<
            splitLine.at(10) << ',' <<
            splitLine.at(9) << ',' <<
            splitLine.at(11) << ',' <<
            splitLine.at(7) << ',' <<
            splitLine.at(32) << '\n';
    }

    void copy_selective(std::istream& in, std::ostream& out) {
        std::string line;
        while(std::getline(in,line))
        {
            auto split = splitStr(line, ';');
            if (split) {
                writeLine(out, split.value());
            }
        }
    }

    int main(int argc, char* argv[])
    {
        if(argc >= 3) {
            std::ifstream inFile(argv[1]);
            std::ofstream outFile(argv[2]);
            copy_selective(inFile, outFile);
        }
    }

I created a file with one million lines, of which 499980, or just under half, were lines meeting the criteria from the original code. Here are the timings for a million-line file on my machine (Fedora Linux, using GCC 10.1 with -O2 optimization):

$$ \begin{array}{l|c|c} \text{version} & \text{time (s)} & \text{relative to PHP} \\ \hline \text{original} & 2.161 & 1.17 \\ \text{akki} & 1.955 & 1.06 \\ \text{akki w/ writeLine} & 1.898 & 1.03 \\ \text{php} & 1.851 & 1.00 \\ \text{Edward w/ printf} & 1.483 & 0.80 \\ \text{Edward} & 1.456 & 0.79 \\ \text{Matthew} & 0.737 & 0.40 \\ \text{Martin York} & 0.683 & 0.37 \end{array} $$

For these timings, the code labeled akki is https://ideone.com/5ivw9R , akki w/ writeLine is the same code, but modified to use writeLine shown above, and Edward w/ printf is the code shown here but modified to use fprintf. In all cases on my machine, the fstream versions are faster than the corresponding fprintf versions.

Input file

I created a simple file, with one million total lines. As mentioned above, only 499980 have the requisite "E" or "T" in the fourth field. All lines were repetitions of one of these four lines:

    one;two;three;Efour;five;six;seven;eight;nine;ten;eleven;twelve;thirteen;fourteen;fifteen;sixteen;seventeen;eighteen;nineteen;twenty;twenty-one;twenty-two;twenty-three;twenty-four;twenty-five;twenty-six;twenty-seven;twenty-eight;twenty-nine;thirty;thirty-one;thirty-two;thirty-three;thirty-four
    one;two;three;Tfour;five;six;seven;eight;nine;ten;eleven;twelve;thirteen;fourteen;fifteen;sixteen;seventeen;eighteen;nineteen;twenty;twenty-one;twenty-two;twenty-three;twenty-four;twenty-five;twenty-six;twenty-seven;twenty-eight;twenty-nine;thirty;thirty-one;thirty-two;thirty-three;thirty-four
    one;two;three;four;five;six;seven;eight;nine;ten;eleven;twelve;thirteen;fourteen;fifteen;sixteen;seventeen;eighteen;nineteen;twenty;twenty-one;twenty-two;twenty-three;twenty-four;twenty-five;twenty-six;twenty-seven;twenty-eight;twenty-nine;thirty;thirty-one;thirty-two;thirty-three;thirty-four
    one;two;three;Xfour;five;six;seven;eight;nine;ten;eleven;twelve;thirteen;fourteen;fifteen;sixteen;seventeen;eighteen;nineteen;twenty;twenty-one;twenty-two;twenty-three;twenty-four;twenty-five;twenty-six;twenty-seven;twenty-eight;twenty-nine;thirty;thirty-one;thirty-two;thirty-three;thirty-four

Fixed PHP version

Because I was unable to run the originally posted PHP code (it aborted with an error and produced a 0 length file), I made what I intended to be the minimal possible changes to it to get it to compile and run. A PHP expert (I am not one) might be able to further improve it, but its performance is quite good without taking much effort. (Timings above were using PHP 7.4.8 with Zend Engine v3.4.0.)

    <?php
    $i_fp = fopen("million.in","r");
    $o_fp = fopen("sample.out","w") or die("Unable to open outfile");

    while(!feof($i_fp))
    {
        $line = fgets($i_fp);
        $split = explode(';',$line);
        if(count($split) > 33 && ($split[3][0] == 'E' || $split[3][0] == 'T')) {
            fwrite($o_fp,join(',',[ $split[0], $split[1], $split[3], $split[4], $split[5], $split[6],
                                    $split[10], $split[9],$split[11],$split[7],$split[32]])."\n");
        }
    }
    fclose($i_fp);
    fclose($o_fp);
    ?>

Comments are not for extended discussion; this conversation has been moved to chat. — Mathieu Guindon, Commented Aug 1, 2020 at 13:14

Matthew · Accepted Answer · 2020-07-31 12:52:04Z

Stop allocating memory:

Don't copy vectors around, pass by const ref instead
Don't make new strings when a string_view will do
Don't make new vectors when you can reuse the old one
Don't make a string from a char*, just to turn it back into a char* (this one is very minor since you only do it once)
Output directly to avoid creating a temporary string in makeCSVLine

With all that, here's what I came up with:

    #include <cstdlib>      // exit, EXIT_SUCCESS
    #include <fstream>
    #include <string>
    #include <string_view>
    #include <vector>

    using std::string;
    using std::string_view;
    using std::vector;

    void splitStr(string_view line, const char delimiter, vector<string_view>& splitLine)
    {
        splitLine.clear();
        for(;;) {
            std::size_t pos = line.find(delimiter);
            if (pos == string_view::npos) {
                splitLine.push_back(line);
                return;
            }

            splitLine.push_back(line.substr(0, pos));
            line = line.substr(pos+1, string_view::npos);
        }
    }

    template<typename T>
    void makeCSVLine(T& out, const vector<string_view>& splitLine)
    {
        out <<
            splitLine[0] << ',' <<
            splitLine[1] << ',' <<
            splitLine[3] << ',' <<
            splitLine[4] << ',' <<
            splitLine[5] << ',' <<
            splitLine[6] << ',' <<
            splitLine[10] << ',' <<
            splitLine[9] << ',' <<
            splitLine[11] << ',' <<
            splitLine[7] << ',' <<
            splitLine[32] << '\n';
    }

    int main(int argc, char* argv[])
    {
        if(argc < 3)
        {
            exit(EXIT_SUCCESS);
        }

        const char* inPath = argv[1];
        const char* outPath = argv[2];

        std::ifstream inFile;
        std::ofstream outFile;

        inFile.open(inPath);
        outFile.open(outPath);

        vector<string_view> split;
        string line;
        while(std::getline(inFile, line))
        {
            splitStr(line, ';', split);
            if(split[3][0] == 'E' || split[3][0] == 'T')
            {
                makeCSVLine(outFile, split);
            }
        }
        inFile.close();
        outFile.close();
    }

This could be made more durable by adding error checking. Adding a split.size() > 33 clause to the if in main covers one problem (short line in input file) without sacrificing any measurable performance. — Edward, Commented Jul 31, 2020 at 15:11
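
The guard suggested in that comment could look roughly like this. It is a sketch with assumed names (`row_is_wanted`, `min_fields`); it requires at least 33 fields because 32 is the highest index written, which is slightly looser than the `> 33` in the comment, and it also rejects an empty fourth field, since indexing an empty `string_view` is undefined behaviour.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Assumption: 33 fields are enough, because index 32 is the highest one used.
constexpr std::size_t min_fields = 33;

// Hypothetical guard for the `if` in main: true only for rows that are long
// enough and whose fourth field starts with 'E' or 'T'.
bool row_is_wanted(const std::vector<std::string_view>& fields) {
    if (fields.size() < min_fields) {
        return false;                 // short line: indexing [32] would be out of range
    }
    const std::string_view kind = fields[3];
    if (kind.empty()) {
        return false;                 // kind[0] on an empty view is undefined
    }
    return kind[0] == 'E' || kind[0] == 'T';
}
```

In main, the condition would then read `if (row_is_wanted(split))` before the call to `makeCSVLine`.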

Your Common Sense · Accepted Answer · 2020-08-02 15:57:59Z

Initially I wrote an answer related to the PHP part, suggesting the usage of the dedicated functions for reading and writing CSV, fgetcsv() and fputcsv() respectively, but I didn't test the code. Thanks to @akki, who pointed out some errors and the profiling results, I learned that these functions are dramatically slower, as explained in this answer. It looks like fgetcsv() is 40 times slower than fgets/explode. However, to parse proper CSV, with field delimiters and escaping, you have to use the proper function anyway.

Here is the code

    <?php

    $t = microtime(1);
    $i_fp = fopen("inFile.csv","r");
    while(!feof($i_fp)) {
        $line = fgets($i_fp);
        $split = explode(';',$line);
    }
    echo "fgets: ".round(microtime(1)-$t,2)."\n";
    $t = microtime(1);
    $i_fp = fopen("inFile.csv","r");
    while (($split = fgetcsv($i_fp, 512, ';')) !== FALSE)
    {
    }
    echo "fgetcsv: ".round(microtime(1)-$t,2)."\n";

outputs for me

    fgets: 2.1
    fgetcsv: 84.45

on a file with 1 million rows

Several problems: 1. It gives a syntax error since it has two opening braces after while(..){{ .. }. 2. It writes nothing to the outFile; it remains zero bytes. 3. It takes 9 seconds to complete (compare it with the timings in my answer). Run this quick script to make a file (ideone.com/gkmUUN) and run your program to test it. Command I used (if it matters): gdate +%s.%3N && php parser.php && gdate +%s.%3N — aki, Commented Aug 2, 2020 at 8:28
If I read it correctly (PHP noob here), it calls fgetcsv a million times and fputcsv 50k times. That will be very poor even in C++ due to system calls. Can PHP buffer file contents? — aki, Commented Aug 2, 2020 at 9:32
stackoverflow.com/questions/2749441/… Apparently OP had already used a fast way to read files in PHP. — aki, Commented Aug 2, 2020 at 10:05
Well, surely we can buffer at the expense of memory, but the code in the OP doesn't buffer either. It seems fgetcsv is incredibly slow, like 40 times slower than fgets/explode. fputcsv is better but still several times slower than join/fwrite — Your Common Sense, Commented Aug 2, 2020 at 10:25

jamesqf · Accepted Answer · 2020-07-31 23:57:39Z

The other answers do a good job of analyzing the code, but they miss the most obvious point. Don't write parsers in C++, or C for that matter. Use (f)lex if the input is reasonably simple, flex + yacc/bison if it's complicated. Or possibly some other toolset designed for the job, but these are the most common. Your input is simple enough for a standalone flex analyzer.

https://en.wikipedia.org/wiki/Flex_(lexical_analyser_generator)
https://en.wikipedia.org/wiki/GNU_Bison

I agree with the point of not writing parsers (at least initially) in C++. But in the case we are not really dealing with a gramer (in that it is way to simple to be a gramer it is simply a file format) and this is perfectly valid to be done by C++. By making this to a gramer and using FLEX/Yacc you would be making the resulting code more complicated. – Loki Astari, Commented Aug 1, 2020 at 0:03
@Martin York: No, if you have a grammar (like a programming language) you will need to use YACC/Bison. The OP's input appears to be something a simple scanner could handle. Flex handles patterns, much as e.g. grep or other regex programs do. The program would be much simpler, though of course there's a learning curve. Unless of course you're talking about the DFA program generated by flex, but you ordinarily don't, any more than you look at the machine code your compiler generates. – jamesqf, Commented Aug 1, 2020 at 15:43
I wrote a scanner for this using Flex and it was slower than the PHP version on my machine. If you write a faster one, please post it as a new question. – Edward, Commented Aug 1, 2020 at 17:23
@MartinYork: Spelling reminder: it's "grammar". – Peter Cordes, Commented Aug 1, 2020 at 17:37
I've posted my Flex version as a new question – Edward, Commented Aug 1, 2020 at 18:53
