
I needed to parse some positive integers with the following constraints:

  • If possible use C++ only (no C API)

  • Needs to be able to process parsed data after each newline

  • Should be reasonably fast

  • Memory usage should be kept low (i.e. don't read the entire file in one go)

  • CSV file is well formed so error checking can be mostly omitted

  • Unknown amount of lines in file

  • Should support both Unix (\n) and Windows (\r\n) line endings

This code runs fine and is faster than the standard approach of streaming via >>. On average it takes 290 ms to parse 10,000,000 lines, whereas the version with >> takes 1.1 s, which I think is not a bad improvement.
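For context, a minimal sketch of what the >> baseline being compared against might look like (the name parse_with_stream and the exact separator handling are my assumptions, not the author's actual benchmark code):

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical sketch of the ">>" baseline: extract each integer,
// then manually skip a trailing ',' so the next extraction does
// not fail on the separator. operator>> skips whitespace, so '\n'
// and '\r' need no special handling here.
std::vector<uint_fast16_t> parse_with_stream(std::istream& in) {
    std::vector<uint_fast16_t> numbers;
    uint_fast16_t value = 0;
    while (in >> value) {
        numbers.push_back(value);
        if (in.peek() == ',') {
            in.ignore();  // consume the ',' separator
        }
    }
    return numbers;
}
```

The per-character loop in the question avoids the locale and error-state machinery that each >> extraction goes through, which is where most of the speed difference comes from.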

However I have some concerns about this:

  • Are there any obvious mistakes that make this slower than it should be?

  • Are there any side effects or undefined behavior?

  • Is the template part applied correctly or is there a better way to do it?

  • Does the naming make sense?

Please also point out anything else you notice.

Sample input:

    1,22,333
    4444,55555,666666

Code:

    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    inline void parse_uints(
        const char* buffer_ptr,
        std::vector<uint_fast16_t>& numbers)
    {
        bool is_eol = false;
        while (!is_eol) {
            uint_fast16_t parsed_number = 0;
            for (;;) {
                if (*buffer_ptr == ',') {
                    break;
                }
                if (*buffer_ptr == '\r' || *buffer_ptr == '\0') {
                    is_eol = true;
                    break;
                }
                parsed_number = (parsed_number * 10) + (*buffer_ptr++ - '0');
            }
            // skip delimiter
            ++buffer_ptr;
            numbers.emplace_back(parsed_number);
        }
    }

    template<typename T>
    void read_line(
        const std::string& filename,
        const uint_fast8_t& line_length,
        const uint_fast8_t& values_per_line,
        T callback)
    {
        std::ifstream infile(filename);
        if (!infile.good()) {
            return;
        }

        std::vector<uint_fast16_t> numbers;
        numbers.reserve(values_per_line);

        std::string buffer;
        buffer.reserve(line_length);

        while (infile.good() && std::getline(infile, buffer)) {
            parse_uints(buffer.data(), numbers);
            callback(numbers);
            numbers.clear();
        }
    }

    int main(int argc, char** argv)
    {
        constexpr uint_fast8_t line_length = 25;
        constexpr uint_fast8_t values_per_line = 3;

        read_line(argv[1], line_length, values_per_line, [](auto& values) {
            // do something with the values here, for example get the max
            auto max = std::max_element(std::begin(values), std::end(values));
        });
    }
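The reason parse_uints checks for '\r': std::getline only consumes the '\n' delimiter, so a file with Windows \r\n endings leaves a trailing '\r' in the buffer. A quick illustration (first_line is just a demo helper, not part of the code under review):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// std::getline strips the trailing '\n' but keeps a preceding '\r',
// so CRLF input leaves '\r' at the end of the extracted line.
std::string first_line(const std::string& text) {
    std::istringstream in(text);
    std::string line;
    std::getline(in, line);  // default delimiter is '\n'
    return line;
}
```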
  • You can't rely on the '\r' character portably. You might need to expect a single '\n' as well.
    Commented Mar 9, 2018 at 17:26
  • @πάνταῥεῖ Are there issues with this when running on Windows? I only intend to run this on Linux (haven't tested it on Windows) and it worked fine when I tried it.
    – yuri, Mar 9, 2018 at 18:25
  • I'd expect issues the other way round. \r\n line endings are typical for Windows: cs.toronto.edu/~krueger/csc209h/tut/line-endings.html
    Commented Mar 9, 2018 at 18:30
  • @πάνταῥεῖ Yeah, my plan was to run this on Linux but also be able to parse CSV files coming from Windows with their exotic line ending.
    – yuri, Mar 9, 2018 at 18:37
  • Sorry, I didn't spot that you're using std::getline() to read the lines delimited with '\n' in the first place. The way you have it, your code is portable.
    Commented Mar 9, 2018 at 18:52

1 Answer


A couple of nits:

  1. Your function read_line does not do what its name suggests: it reads a whole file line by line. So you should find a more appropriate name.

  2. Your choice of uint_fast8_t/uint_fast16_t is interesting, given that your example CSV table holds some larger values. Have you measured the impact vs unsigned int or std::size_t? Be aware that modern CPUs can read multiple elements at once, so your choice might actually have a negative impact.

  3. Every line you clear numbers and build it up again. It would be beneficial to initialize it once via resize and then just overwrite the old values. That way you can omit the unnecessary clear.
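A sketch of that suggestion (parse_into is a hypothetical variant of the question's parse_uints; it assumes each line holds exactly numbers.size() values, so the vector is sized once with resize and its slots are overwritten per line):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Overwrite each pre-sized slot instead of clear()/emplace_back().
// Assumes a well-formed line with exactly numbers.size() values.
void parse_into(const char* p, std::vector<uint_fast16_t>& numbers) {
    for (auto& slot : numbers) {
        uint_fast16_t value = 0;
        while (*p != ',' && *p != '\r' && *p != '\0') {
            value = value * 10 + (*p++ - '0');
        }
        if (*p == ',') {
            ++p;  // skip the delimiter
        }
        slot = value;
    }
}
```

The caller would do numbers.resize(values_per_line) once before the read loop and drop the numbers.clear() call.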

  4. std::getline takes a third argument, which is the delimiter. getline also stops at the delimiter and at end of file, so you don't need all that error checking. Given that you know the number of entries per line, you can simply loop and accumulate:

    while (infile.good()) {
        for (auto&& elem : numbers) {
            std::getline(infile, buffer, ',');
            elem = 0;
            for (auto&& digit : buffer) {
                elem *= 10;
                elem += (digit - '0');
            }
        }
        callback(numbers);
    }

If you want to be sure, you can check getline and return if it is bad.

    while (infile.good()) {
        for (auto&& elem : numbers) {
            if (!std::getline(infile, buffer, ',')) {
                return;
            }
            elem = 0;
            for (auto&& digit : buffer) {
                elem *= 10;
                elem += (digit - '0');
            }
        }
        callback(numbers);
    }
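One pitfall worth noting with an explicit ',' delimiter: std::getline then no longer treats '\n' specially, so a field can span a line ending, as the comments below discuss (next_field is just a demo helper):

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// With an explicit ',' delimiter, std::getline reads until the next
// ',' (or EOF) and does not stop at '\n'.
std::string next_field(std::istream& in) {
    std::string field;
    std::getline(in, field, ',');
    return field;
}
```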
  • The values I pass are not actually the exact sizes I expect in the file but rather an estimate to avoid resizing of the buffers while it parses the file. If numbers is not cleared, it could possibly contain old values that won't be overwritten if the next line is shorter than the previous one. At least that was my reasoning.
    – yuri, Mar 9, 2018 at 19:37
  • Upon further testing, it seems supplying getline with a delimiter makes it read until that delimiter, even across line endings. So this code will always read the last value of a given line and the first value of the next line, because only then does it find the , delimiter again.
    – yuri, Mar 10, 2018 at 10:10
  • That is incorrect according to cplusplus.com/reference/string/string/getline. So it seems there is something wrong with your line endings.
    – miscco, Mar 10, 2018 at 12:27
  • Have you tested it? cplusplus.com has been wrong in the past before. Also, cppreference does not corroborate the behavior you describe.
    – yuri, Mar 10, 2018 at 12:58
