There are a number of things you can do to improve your code.
Use const references where practical
Passing parameters as const references instead of by value can speed up function calls, because it avoids copying the arguments. Doing so also tells both the compiler and other readers of the code that the passed parameter will not be altered, and allows for additional optimizations by the compiler.
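As a minimal sketch of the difference (the function names here are hypothetical and not taken from the reviewed code), the two functions below compute the same result, but only the first copies the whole vector and every string in it on each call:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Pass by value: the vector and all of its strings are copied for every call.
std::size_t totalLengthByValue(std::vector<std::string> fields) {
    std::size_t total{0};
    for (const auto& f : fields) {
        total += f.size();
    }
    return total;
}

// Pass by const reference: nothing is copied, and the signature
// documents that the function will not modify its argument.
std::size_t totalLengthByRef(const std::vector<std::string>& fields) {
    std::size_t total{0};
    for (const auto& f : fields) {
        total += f.size();
    }
    return total;
}
```

Both versions return the same value for the same input; the difference is only in how much work happens at the call site.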
Use reserve to improve speed
Since we know that the vector must hold at least 33 fields, it makes sense to use reserve to preallocate space.
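To illustrate what reserve buys, here is a small sketch with a hypothetical helper, countReallocations, that counts how often the vector's capacity changes while fields are appended. It is only an illustration and not part of the reviewed program:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: appends `count` strings and reports how many times
// the vector had to grow its storage along the way.
std::size_t countReallocations(std::size_t count, bool useReserve) {
    std::vector<std::string> fields;
    if (useReserve) {
        fields.reserve(count);  // one allocation up front
    }
    std::size_t reallocations{0};
    auto capacity = fields.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        fields.push_back("x");
        if (fields.capacity() != capacity) {
            ++reallocations;
            capacity = fields.capacity();
        }
    }
    return reallocations;
}
```

With reserve, no growth happens during the loop; without it, the vector grows several times, and each growth moves the strings already stored.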
Avoid constructing temporary variables
Rather than creating a temporary std::string to print the output, an alternative approach is to create a function that writes the fields directly to the output stream.
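The contrast can be sketched as follows. Both function names and the column-list parameter are assumptions made for illustration: joinSelected builds the whole line in a temporary string first, while writeSelected sends each field straight to the stream.

```cpp
#include <cstddef>
#include <initializer_list>
#include <ostream>
#include <sstream>  // std::ostringstream, handy for trying this out in memory
#include <string>
#include <vector>

// Temporary-string approach: the complete line is assembled before any output.
std::string joinSelected(const std::vector<std::string>& fields,
                         std::initializer_list<std::size_t> columns) {
    std::string line;
    bool first{true};
    for (auto col : columns) {
        if (!first) {
            line += ',';
        }
        line += fields.at(col);
        first = false;
    }
    line += '\n';
    return line;
}

// Direct approach: each field is written to the stream with no intermediate string.
std::ostream& writeSelected(std::ostream& out,
                            const std::vector<std::string>& fields,
                            std::initializer_list<std::size_t> columns) {
    bool first{true};
    for (auto col : columns) {
        if (!first) {
            out << ',';
        }
        out << fields.at(col);
        first = false;
    }
    return out << '\n';
}
```

A call such as writeSelected(out, fields, {0, 3, 1}) produces the same text as out << joinSelected(fields, {0, 3, 1}), without allocating the temporary line.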
Avoid work if possible
While it sounds like it might be the life philosophy of Tom Sawyer, it's also a good idea for optimizing software for performance. For instance, since the code is looking for something specific in the fourth field, if that criterion is not met by the time the fourth field is parsed, there's no reason to continue to parse the line. One way to convey a value that may or may not be there is via std::optional, which was introduced in C++17.
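As a small, isolated sketch of the idea (the function wantedFourthField is hypothetical, and it assumes ';' as the delimiter and C++17 or later), the code below looks only as far as the fourth field and returns std::nullopt as soon as the line can be rejected:

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper: returns the fourth field if it starts with 'E' or 'T',
// otherwise std::nullopt. Nothing past the fourth field is examined.
std::optional<std::string> wantedFourthField(const std::string& line,
                                             char delimiter = ';') {
    std::size_t start{0};
    // Skip over the first three fields.
    for (int skipped = 0; skipped < 3; ++skipped) {
        const auto pos = line.find(delimiter, start);
        if (pos == std::string::npos) {
            return std::nullopt;  // fewer than four fields
        }
        start = pos + 1;
    }
    const auto end = line.find(delimiter, start);
    const auto length = (end == std::string::npos) ? std::string::npos : end - start;
    std::string field = line.substr(start, length);
    if (field.empty() || (field[0] != 'E' && field[0] != 'T')) {
        return std::nullopt;  // criterion not met, so stop here
    }
    return field;
}
```

The caller can test the result with if (auto field = wantedFourthField(line)) and only do further work when a value is present.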
Results
csv.cpp
    #include <fstream>
    #include <string>
    #include <vector>
    #include <sstream>
    #include <optional>

    constexpr std::size_t minfields{33};

    std::optional<std::vector<std::string>> splitStr(const std::string& line, const char delimiter = ',')
    {
        std::vector<std::string> splitLine;
        splitLine.reserve(minfields);
        std::istringstream ss(line);
        std::string buf;
        unsigned field{0};
        while (std::getline(ss, buf, delimiter)) {
            splitLine.push_back(buf);
            if (field == 3 && buf[0] != 'E' && buf[0] != 'T') {
                return std::nullopt;
            }
            ++field;
        }
        if (splitLine.size() < minfields)
            return std::nullopt;
        return splitLine;
    }

    std::ostream& writeLine(std::ostream& out, const std::vector<std::string>& splitLine)
    {
        return out <<
            splitLine.at(0) << ',' <<
            splitLine.at(1) << ',' <<
            splitLine.at(3) << ',' <<
            splitLine.at(4) << ',' <<
            splitLine.at(5) << ',' <<
            splitLine.at(6) << ',' <<
            splitLine.at(10) << ',' <<
            splitLine.at(9) << ',' <<
            splitLine.at(11) << ',' <<
            splitLine.at(7) << ',' <<
            splitLine.at(32) << '\n';
    }

    void copy_selective(std::istream& in, std::ostream& out) {
        std::string line;
        while(std::getline(in,line))
        {
            auto split = splitStr(line, ';');
            if (split) {
                writeLine(out, split.value());
            }
        }
    }

    int main(int argc, char* argv[])
    {
        if(argc >= 3) {
            std::ifstream inFile(argv[1]);
            std::ofstream outFile(argv[2]);
            copy_selective(inFile, outFile);
        }
    }
I created a file with one million lines, of which 499980, or just under half, were lines meeting the criteria from the original code. Here are the timings for a million-line file on my machine (Fedora Linux, using GCC 10.1 with -O2 optimization):
$$ \begin{array}{l|c|c} \text{version} & \text{time (s)} & \text{relative to PHP} \\ \hline \text{original} & 2.161 & 1.17 \\ \text{akki} & 1.955 & 1.06 \\ \text{akki w/ writeLine} & 1.898 & 1.03 \\ \text{php} & 1.851 & 1.00 \\ \text{Edward w/ printf} & 1.483 & 0.80 \\ \text{Edward} & 1.456 & 0.79 \\ \text{Matthew} & 0.737 & 0.40 \\ \text{Martin York} & 0.683 & 0.37 \end{array} $$
For these timings, the code labeled akki is https://ideone.com/5ivw9R , akki w/ writeLine is the same code, but modified to use writeLine shown above, and Edward w/ printf is the code shown here but modified to use fprintf. In all cases on my machine, the fstream versions are faster than the corresponding fprintf versions.
Input file
I created a simple file with one million total lines. As mentioned above, only 499980 have the requisite "E" or "T" in the fourth field. All lines were repetitions of one of these four lines:
    one;two;three;Efour;five;six;seven;eight;nine;ten;eleven;twelve;thirteen;fourteen;fifteen;sixteen;seventeen;eighteen;nineteen;twenty;twenty-one;twenty-two;twenty-three;twenty-four;twenty-five;twenty-six;twenty-seven;twenty-eight;twenty-nine;thirty;thirty-one;thirty-two;thirty-three;thirty-four
    one;two;three;Tfour;five;six;seven;eight;nine;ten;eleven;twelve;thirteen;fourteen;fifteen;sixteen;seventeen;eighteen;nineteen;twenty;twenty-one;twenty-two;twenty-three;twenty-four;twenty-five;twenty-six;twenty-seven;twenty-eight;twenty-nine;thirty;thirty-one;thirty-two;thirty-three;thirty-four
    one;two;three;four;five;six;seven;eight;nine;ten;eleven;twelve;thirteen;fourteen;fifteen;sixteen;seventeen;eighteen;nineteen;twenty;twenty-one;twenty-two;twenty-three;twenty-four;twenty-five;twenty-six;twenty-seven;twenty-eight;twenty-nine;thirty;thirty-one;thirty-two;thirty-three;thirty-four
    one;two;three;Xfour;five;six;seven;eight;nine;ten;eleven;twelve;thirteen;fourteen;fifteen;sixteen;seventeen;eighteen;nineteen;twenty;twenty-one;twenty-two;twenty-three;twenty-four;twenty-five;twenty-six;twenty-seven;twenty-eight;twenty-nine;thirty;thirty-one;thirty-two;thirty-three;thirty-four
Fixed PHP version
Because I was unable to run the originally posted PHP code (it aborted with an error and produced a zero-length file), I made what I intended to be the minimal possible changes to get it to run. A PHP expert (I am not one) might be able to improve it further, but its performance is quite good without much effort. (Timings above were using PHP 7.4.8 with Zend Engine v3.4.0.)
    <?php
    $i_fp = fopen("million.in","r");
    $o_fp = fopen("sample.out","w") or die("Unable to open outfile");
    while(!feof($i_fp))
    {
        $line = fgets($i_fp);
        $split = explode(';',$line);
        if(count($split) > 33 && ($split[3][0] == 'E' || $split[3][0] == 'T')) {
            fwrite($o_fp,join(',',[ $split[0], $split[1], $split[3], $split[4], $split[5], $split[6],
                                    $split[10], $split[9],$split[11],$split[7],$split[32]])."\n");
        }
    }
    fclose($i_fp);
    fclose($o_fp);
    ?>