
I needed to parse some positive integers with the following constraints:

  • If possible use C++ only (no C API)

  • Needs to be able to process parsed data after each newline

  • Should be reasonably fast

  • Memory usage should be kept low (i.e. don't read the entire file in one go)

  • CSV file is well formed so error checking can be mostly omitted

  • Unknown amount of lines in file

  • Should support both Unix (\n) and Windows (\r\n) line endings

This code runs fine and is faster than the standard approach of streaming via >>. On average it takes 290 ms to parse 10,000,000 lines, whereas the version with >> takes 1.1 s, which I think is not a bad improvement.
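For context, a minimal sketch of what the >> baseline being compared against might look like (the name parse_with_stream and the exact separator handling are my assumptions, not the author's actual benchmark code):

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical sketch of the ">>" baseline: extract each integer,
// then manually skip a trailing ',' so the next extraction does
// not fail on the separator. operator>> skips whitespace, so '\n'
// and '\r' need no special handling here.
std::vector<uint_fast16_t> parse_with_stream(std::istream& in) {
    std::vector<uint_fast16_t> numbers;
    uint_fast16_t value = 0;
    while (in >> value) {
        numbers.push_back(value);
        if (in.peek() == ',') {
            in.ignore();  // consume the ',' separator
        }
    }
    return numbers;
}
```

The per-character loop in the question avoids the locale and error-state machinery that each >> extraction goes through, which is where most of the speed difference comes from.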

However I have some concerns about this:

  • Are there any obvious mistakes that make this slower than it should be?

  • Are there any side effects or undefined behavior?

  • Is the template part applied correctly or is there a better way to do it?

  • Does the naming make sense?

Please also point out anything else you notice.

Sample input:

    1,22,333
    4444,55555,666666

Code:

    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    inline void parse_uints(
        const char* buffer_ptr,
        std::vector<uint_fast16_t>& numbers)
    {
        bool is_eol = false;
        while (!is_eol) {
            uint_fast16_t parsed_number = 0;
            for (;;) {
                if (*buffer_ptr == ',') {
                    break;
                }
                if (*buffer_ptr == '\r' || *buffer_ptr == '\0') {
                    is_eol = true;
                    break;
                }
                parsed_number = (parsed_number * 10) + (*buffer_ptr++ - '0');
            }
            // skip delimiter
            ++buffer_ptr;
            numbers.emplace_back(parsed_number);
        }
    }

    template<typename T>
    void read_line(
        const std::string& filename,
        const uint_fast8_t& line_length,
        const uint_fast8_t& values_per_line,
        T callback)
    {
        std::ifstream infile(filename);
        if (!infile.good()) {
            return;
        }

        std::vector<uint_fast16_t> numbers;
        numbers.reserve(values_per_line);

        std::string buffer;
        buffer.reserve(line_length);

        while (infile.good() && std::getline(infile, buffer)) {
            parse_uints(buffer.data(), numbers);
            callback(numbers);
            numbers.clear();
        }
    }

    int main(int argc, char** argv)
    {
        constexpr uint_fast8_t line_length = 25;
        constexpr uint_fast8_t values_per_line = 3;

        read_line(argv[1], line_length, values_per_line, [](auto& values) {
            // do something with the values here, for example get the max
            auto max = std::max_element(std::begin(values), std::end(values));
        });
    }
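The reason parse_uints checks for '\r': std::getline only consumes the '\n' delimiter, so a file with Windows \r\n endings leaves a trailing '\r' in the buffer. A quick illustration (first_line is just a demo helper, not part of the code under review):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// std::getline strips the trailing '\n' but keeps a preceding '\r',
// so CRLF input leaves '\r' at the end of the extracted line.
std::string first_line(const std::string& text) {
    std::istringstream in(text);
    std::string line;
    std::getline(in, line);  // default delimiter is '\n'
    return line;
}
```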
  • You can't rely on the '\r' character portably. You might need to expect a single '\n' as well.
    Commented Mar 9, 2018 at 17:26
  • @πάνταῥεῖ Are there issues with this when running on Windows? I only intend to run this on Linux (haven't tested it on Windows) and it worked fine when I tried it.
    – yuri, Mar 9, 2018 at 18:25
  • I'd expect issues the other way round. \r\n line endings are typical for Windows: cs.toronto.edu/~krueger/csc209h/tut/line-endings.html
    Commented Mar 9, 2018 at 18:30
  • @πάνταῥεῖ Yeah, my plan was to run this on Linux but also be able to parse CSV files coming from Windows with their exotic line ending.
    – yuri, Mar 9, 2018 at 18:37
  • Sorry, I didn't spot that you're using std::getline() to read the lines delimited with '\n' in the first place. The way you have it, your code is portable.
    Commented Mar 9, 2018 at 18:52

1 Answer


A couple of nits:

  1. Your function read_line does not do what its name suggests: it reads a whole file line by line. So you should find a more appropriate name.

  2. Your choice of uint_fast8_t/uint_fast16_t is interesting, given that your example CSV table holds some larger values. Have you measured the impact vs unsigned int or std::size_t? Be aware that modern CPUs can read multiple elements at once, so your choice might actually have a negative impact.

  3. Every line you clear numbers and build it up again. It would be beneficial to initialize it once via resize and then just overwrite the old values. That way you can omit the unnecessary clear.
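A sketch of that suggestion (parse_into is a hypothetical variant of the question's parse_uints; it assumes each line holds exactly numbers.size() values, so the vector is sized once with resize and its slots are overwritten per line):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Overwrite each pre-sized slot instead of clear()/emplace_back().
// Assumes a well-formed line with exactly numbers.size() values.
void parse_into(const char* p, std::vector<uint_fast16_t>& numbers) {
    for (auto& slot : numbers) {
        uint_fast16_t value = 0;
        while (*p != ',' && *p != '\r' && *p != '\0') {
            value = value * 10 + (*p++ - '0');
        }
        if (*p == ',') {
            ++p;  // skip the delimiter
        }
        slot = value;
    }
}
```

The caller would do numbers.resize(values_per_line) once before the read loop and drop the numbers.clear() call.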

  4. std::getline takes a third argument, which is the delimiter. getline also stops at the delimiter and at end of file, so you don't need all that error checking. Given that you know the number of entries per line, you can simply loop and accumulate:

    while (infile.good()) {
        for (auto&& elem : numbers) {
            std::getline(infile, buffer, ',');
            elem = 0;
            for (auto&& digit : buffer) {
                elem *= 10;
                elem += (digit - '0');
            }
        }
        callback(numbers);
    }

If you want to be sure, you can check getline and return if it is bad.

    while (infile.good()) {
        for (auto&& elem : numbers) {
            if (!std::getline(infile, buffer, ',')) {
                return;
            }
            elem = 0;
            for (auto&& digit : buffer) {
                elem *= 10;
                elem += (digit - '0');
            }
        }
        callback(numbers);
    }
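One pitfall worth noting with an explicit ',' delimiter: std::getline then no longer treats '\n' specially, so a field can span a line ending, as the comments below discuss (next_field is just a demo helper):

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// With an explicit ',' delimiter, std::getline reads until the next
// ',' (or EOF) and does not stop at '\n'.
std::string next_field(std::istream& in) {
    std::string field;
    std::getline(in, field, ',');
    return field;
}
```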
  • The values I pass are not actually the exact sizes I expect in the file but rather an estimate to avoid resizing of the buffers while it parses the file. If numbers is not cleared, it could possibly contain old values that won't be overwritten if the next line is shorter than the previous one. At least that was my reasoning.
    – yuri, Mar 9, 2018 at 19:37
  • Upon further testing, it seems supplying getline with a delimiter makes it read until that delimiter, even across line endings. So this code will always read the last value of a given line and the first value of the next line, because only then does it find the , delimiter again.
    – yuri, Mar 10, 2018 at 10:10
  • That is incorrect according to cplusplus.com/reference/string/string/getline. So it seems there is something wrong with your line endings.
    – miscco, Mar 10, 2018 at 12:27
  • Have you tested it? cplusplus.com has been wrong in the past before. Also, cppreference does not corroborate the behavior you describe.
    – yuri, Mar 10, 2018 at 12:58
