Parsing a std::string to a dictionary (std::map)

Question

As always, the context first!

Preliminaries

Recently I needed a way to transport data from one environment to another. The proper way to do this would probably be to use a database (which I don't know how to work with), but I chose to write the data to a simple text file, which is then read and parsed when needed. Typical content of such a file (a real example, hence the alignment):

    Ticket: 2 Type: ORDER_TYPE_SELL_STOP State: ORDER_STATE_CANCELED Reason: ORDER_REASON_EXPERT Start: 1507032000000 End: 1507140000000
    Ticket: 3 Type: ORDER_TYPE_BUY_STOP State: ORDER_STATE_CANCELED Reason: ORDER_REASON_EXPERT Start: 1507723200000 End: 1507760140000
    Ticket: 4 Type: ORDER_TYPE_SELL_STOP State: ORDER_STATE_FILLED Reason: ORDER_REASON_EXPERT Start: 1508375100000 End: 1508389780000
    Ticket: 5 Type: ORDER_TYPE_BUY State: ORDER_STATE_FILLED Reason: ORDER_REASON_SL Start: 1508392000000 End: 1508392000000
    Ticket: 6 Type: ORDER_TYPE_SELL_STOP State: ORDER_STATE_CANCELED Reason: ORDER_REASON_EXPERT Start: 1508522400000 End: 1508524540000
    Ticket: 7 Type: ORDER_TYPE_BUY_STOP State: ORDER_STATE_CANCELED Reason: ORDER_REASON_EXPERT Start: 1509624000000 End: 1509638920000
    Ticket: 8 Type: ORDER_TYPE_BUY_STOP State: ORDER_STATE_CANCELED Reason: ORDER_REASON_EXPERT Start: 1509671100000 End: 1509732000000

where each line must be of the form:

     Type     :     ORDER_TYPE_SELL_STOP     State : ....
  ^   ^    ^  ^  ^           ^            ^    ^   ^
 WPs KEY  WPs D WPs        VALUE         WPs  KEY  D

where

    WPs        - any number of whitespace characters
    D          - delimiter character, separates KEY and VALUE
    KEY, VALUE - WORDs, which contain neither a whitespace character nor the delimiter

Note that one line may contain several KEY-VALUE pairs.

The Algorithm

    Take a string S
        Remove first WP characters if any (until the first non-WP is met) from S
        if MAX_INIT_WORD of S is NOT EMPTY
            KEY := MAX_INIT_WORD
            Remove MAX_INIT_WORD from S
        else
            break
        Remove first WP characters if any (until the first non-WP is met) from S
        if S starts with D
            Remove D from S
        else
            break
        Remove first WP characters if any (until the first non-WP is met) from S
        if MAX_INIT_WORD of S is NOT EMPTY
            VALUE := MAX_INIT_WORD
            Remove MAX_INIT_WORD from S

where MAX_INIT_WORD is a maximal prefix of S that doesn't contain WP or D.

The C++ code

    #include <map>
    #include <string>

    using dict_t = std::map<std::string, std::string>;

    dict_t GetDict( std::string s, char delim = ':', const std::string& white = " \n\t\v\r\f" )
    {
        dict_t m;
        if( white.empty() )
            return m;
        s += white[0];// necessary if s doesn't contain trailing spaces
        size_t pos = 0;
        auto removeLeading = [&](){
            if( (pos = s.find_first_not_of( white )) != std::string::npos)
                s.erase( 0, pos );
        };
        auto maxInitWord = [&]() -> std::string {
            std::string word;
            if( (pos = s.find_first_of( white + delim )) != std::string::npos )
            {
                word = s.substr( 0, pos );
                s.erase( 0, pos );
            }
            return word;
        };
        while( true )
        {
            std::string key;
            removeLeading();
            if( (key = maxInitWord()).empty() )
                break;
            removeLeading();
            if( s.empty() or s[0] != delim )
                break;
            s.erase( 0, 1 );
            removeLeading();
            if( (m[key] = maxInitWord()).empty() )
                break;
        }
        return m;
    }

Example (C++ interpreter used)

    root [1] GetDict( "....Am-I...:...Inside...A:...Simulation?", ':', "." )
    (dict_t) { "A" => "Simulation?", "Am-I" => "Inside" }

Evidently (thanks to @JDługosz), I should show the results on some "illegal" inputs. Though there are infinitely many of them, they can be divided into two groups: those which produce an empty map, and those which produce a map with an empty value:

    root [6] GetDict( "....:....an empty map", ':', "." )
    (dict_t) { }

and

    root [7] GetDict( "an empty value....:....", ':', "." )
    (dict_t) { "an empty value" => "" }

Of course, you can also get "partially" correct content:

    root [8] GetDict( "partially....:....correct...content...:....", ':', "." )
    (dict_t) { "content" => "", "partially" => "correct" }

JDługosz · Accepted Answer · 2021-12-17 21:13:39Z

Tip: Use string_view!
It has a remove_prefix that is perfect for this kind of scanning from the front. It just adjusts a pointer and does not have to keep re-copying the entire string after reading each token.
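To illustrate the tip, here is a minimal sketch of the same scanning loop over a `std::string_view` (C++17). The name `GetDictView` and the helper lambdas `skipWhite` and `takeWord` are made up for this example and are not part of the original code. It is meant to return the same map as `GetDict` for the inputs shown in the question, but treat it as a sketch rather than a drop-in replacement.

```cpp
#include <map>
#include <string>
#include <string_view>

using dict_t = std::map<std::string, std::string>;

// Hypothetical name: a string_view-based variant of GetDict from the question.
dict_t GetDictView(std::string_view s, char delim = ':',
                   std::string_view white = " \n\t\v\r\f")
{
    dict_t m;

    // Characters that terminate a word: whitespace plus the delimiter.
    std::string stop(white);
    stop += delim;

    // Drop leading whitespace by moving the start of the view forward.
    auto skipWhite = [&] {
        const auto pos = s.find_first_not_of(white);
        s.remove_prefix(pos == std::string_view::npos ? s.size() : pos);
    };

    // Return the longest prefix free of whitespace and delimiter, and consume it.
    auto takeWord = [&] {
        const auto pos = s.find_first_of(stop);
        const auto len = (pos == std::string_view::npos) ? s.size() : pos;
        const std::string_view word = s.substr(0, len);
        s.remove_prefix(len);
        return word;
    };

    while (true) {
        skipWhite();
        const std::string_view key = takeWord();
        if (key.empty()) break;              // no key: stop parsing

        skipWhite();
        if (s.empty() || s.front() != delim) break;
        s.remove_prefix(1);                  // consume the delimiter

        skipWhite();
        const std::string_view value = takeWord();
        m[std::string(key)] = std::string(value);
        if (value.empty()) break;            // key without value: keep it, then stop
    }
    return m;
}
```

Nothing is copied until a key or value is stored in the map, and the sentinel character appended to `s` in the original is no longer needed because the end of the view is handled explicitly.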

More traditionally, parsing would be done by advancing an iterator over the input, not manipulating the owning container (the string in your case).
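A rough sketch of what the iterator style could look like, using only standard algorithms. `ScanPairs` is a hypothetical name, and returning a vector of pairs (instead of a map) is an arbitrary choice made to keep the example short.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: scan [first, last) for KEY D VALUE pairs
// without modifying the underlying string.
std::vector<std::pair<std::string, std::string>>
ScanPairs(std::string::const_iterator first, std::string::const_iterator last,
          char delim = ':', const std::string& white = " \n\t\v\r\f")
{
    auto isWhite = [&](char c) { return white.find(c) != std::string::npos; };
    auto isStop  = [&](char c) { return c == delim || isWhite(c); };

    // Advance `first` past any whitespace.
    auto skipWhite = [&] { first = std::find_if_not(first, last, isWhite); };

    // Read one word and leave `first` just past it.
    auto takeWord = [&] {
        const auto end = std::find_if(first, last, isStop);
        std::string word(first, end);
        first = end;
        return word;
    };

    std::vector<std::pair<std::string, std::string>> pairs;
    while (true) {
        skipWhite();
        std::string key = takeWord();
        if (key.empty()) break;

        skipWhite();
        if (first == last || *first != delim) break;
        ++first;                             // step over the delimiter

        skipWhite();
        std::string value = takeWord();
        const bool done = value.empty();
        pairs.emplace_back(std::move(key), std::move(value));
        if (done) break;
    }
    return pairs;
}
```

The input is only read, never erased from, so the caller's string stays intact and each character is visited a small, bounded number of times.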

What do you do if there's a syntax error in the input?

How does #include <std::string> compile?

It's interesting how you used lambdas for helper functions, to prevent having to pass additional parameters into (separate) helper functions. That's something we don't see a lot of in C++ yet. But, it's also unusual to make something like this a single function call rather than an object with several separate calls, too. In particular, you are limited to reading the entire record with one string submission. If you had an object that maintained the dictionary and a member function to accept input, you could call it more than once and easily have a single dictionary that spanned multiple lines of input, or was split across separate sources (like a default string or environment string in addition to the explicit parameters). These helpers would be private member functions.
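As a sketch of that idea, assuming a class called `DictParser` with a `feed` member function (both names are invented here), the helpers become private members and the dictionary persists between calls:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical class: accumulates KEY D VALUE pairs across several feed() calls.
class DictParser {
public:
    explicit DictParser(char delim = ':', std::string white = " \n\t\v\r\f")
        : delim_(delim), white_(std::move(white)), stop_(white_ + delim_) {}

    // Parse one chunk of input and merge its pairs into the dictionary.
    // Returns the number of pairs stored from this chunk.
    std::size_t feed(std::string_view s)
    {
        std::size_t count = 0;
        while (true) {
            skipWhite(s);
            const std::string_view key = takeWord(s);
            if (key.empty()) break;
            skipWhite(s);
            if (s.empty() || s.front() != delim_) break;
            s.remove_prefix(1);
            skipWhite(s);
            const std::string_view value = takeWord(s);
            dict_[std::string(key)] = std::string(value);
            ++count;
            if (value.empty()) break;
        }
        return count;
    }

    const std::map<std::string, std::string>& dict() const { return dict_; }

private:
    // Move the start of the view past leading whitespace.
    void skipWhite(std::string_view& s) const
    {
        const auto pos = s.find_first_not_of(white_);
        s.remove_prefix(pos == std::string_view::npos ? s.size() : pos);
    }

    // Consume and return the longest prefix without whitespace or delimiter.
    std::string_view takeWord(std::string_view& s) const
    {
        const auto pos = s.find_first_of(stop_);
        const auto len = (pos == std::string_view::npos) ? s.size() : pos;
        const std::string_view word = s.substr(0, len);
        s.remove_prefix(len);
        return word;
    }

    char delim_;
    std::string white_;
    std::string stop_;   // whitespace characters plus the delimiter
    std::map<std::string, std::string> dict_;
};
```

With this shape, each line of the file (or a default string followed by explicit parameters) can be passed to `feed` in turn; a key seen again in a later chunk overwrites the earlier value.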

It would be better if the delim and the input were passed as string_view, so it could handle a wider variety of actual argument types efficiently. But as noted earlier, you are re-copying the string over and over and over as you parse it, so that would not help. If you switched to string_view you would gain an enormous performance increase.

Overall, the code is generally good and clean. So, keep it up! And good luck with your projects.

1) "Use string_view" - Cool suggestion. Didn't even know about this class :) I will. Thank you. 2) "... parsing would be done ... iterators" - Agreed. 3) "What do you do if there is a syntax error in the input?" - For now, I just don't mind. There are two possibilities in this case: either an empty map is returned or a map with an empty value. I should have added these examples. I will add them now. 4) "How does `#include <std::string>` compile?" - It's a typo, ofc :) - LRDPRDX, Commented Dec 18, 2021 at 10:35

Shigoto Shoujin · Accepted Answer · 2021-12-20 04:58:58Z

If your need is strictly to pass data around, such as saving data somewhere and then loading it somewhere else, then the specific format of the saved data should not matter. As you said, you could also use a database, but then it would be much less flexible than a simple file.

So unless you have a specific requirement for the save data to be readable, I would suggest simplifying the code a lot by not saving / reading in a less readable but simpler format.

This example defines a struct to hold your data and overloads the << and >> operators to allow for simple save/load.

It is also using enum class for your OrderType and OrderState, as well as converting those enums to/from int.

Requires C++20 (the code uses `auto` function parameters and designated initializers).

    #include <fstream>
    #include <string>   // std::string (missing in the original listing)
    #include <vector>

    namespace load_save_tools
    {
        // Write a templated value to a stream
        void write(std::ostream& out_stream, auto const& value)
        {
            out_stream.write(reinterpret_cast<char const*>(&value), sizeof(value));
        }

        // Write an std::string to a stream by writing first the length then the data
        void write(std::ostream& out_stream, std::string const& value)
        {
            auto length = value.length();
            write(out_stream, length);
            out_stream.write(value.c_str(), length);
        }

        // Read into buffer
        void read(std::istream& in_stream, char* buffer, size_t size)
        {
            in_stream.read(buffer, size);
        }

        // Read into the templated value
        void read(std::istream& in_stream, auto* value)
        {
            size_t size = sizeof(*value);
            std::vector<char> buffer(size);
            read(in_stream, buffer.data(), size);
            *value = *((decltype(value))buffer.data());
        }

        // Read into an std::string, first read the length then read the data.
        void read(std::istream& in_stream, std::string* value)
        {
            size_t length;
            read(in_stream, &length);
            std::vector<char> buffer(length + 1);   // extra byte stays '\0'
            read(in_stream, buffer.data(), length);
            value->assign(buffer.data());
        }

        // Convert an enum to/from an int
        template <typename T>
        union EnumInt
        {
            T ec;
            int value;
            EnumInt(T ec) : ec{ ec } {}
        };
    }

    using namespace load_save_tools;

    enum class OrderType : int { Buy, Sell, BuyStop, SellStop };
    enum class OrderState : int { Filled, Canceled };

    struct Order
    {
        int Ticket;
        OrderType Type;
        OrderState State;
        std::string Reason;
        long long Start;
        long long End;

        friend std::ostream& operator<<(std::ostream& lhs, Order const& rhs)
        {
            EnumInt<OrderType> type(rhs.Type);
            EnumInt<OrderState> state(rhs.State);
            write(lhs, rhs.Ticket);
            write(lhs, type.value);
            write(lhs, state.value);
            write(lhs, rhs.Reason);
            write(lhs, rhs.Start);
            write(lhs, rhs.End);
            return lhs;
        }

        friend std::istream& operator>>(std::istream& lhs, Order& rhs)
        {
            EnumInt<OrderType> type(rhs.Type);
            EnumInt<OrderState> state(rhs.State);
            read(lhs, &rhs.Ticket);
            read(lhs, &type.value);
            read(lhs, &state.value);
            read(lhs, &rhs.Reason);
            read(lhs, &rhs.Start);
            read(lhs, &rhs.End);
            rhs.Type = type.ec;
            rhs.State = state.ec;
            return lhs;
        }
    };

    void save_to_file(Order const& order)
    {
        std::ofstream ofs("order.dat", std::ios_base::out | std::ios_base::binary | std::ios_base::trunc);
        ofs << order;
    }

    Order load_from_file()
    {
        Order order;
        std::ifstream ifs("order.dat", std::ios_base::in | std::ios_base::binary);
        ifs >> order;
        return order;
    }

    int main()
    {
        Order order = {
            .Ticket = 2,
            .Type = OrderType::SellStop,
            .State = OrderState::Canceled,
            .Reason = "Hello World",
            .Start = 1507032000000,
            .End = 1507140000000
        };
        save_to_file(order);
        Order loaded_order = load_from_file();
        return 0;
    }
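A side note on the `EnumInt` union: in C++, reading a union member other than the one most recently written is formally undefined behavior, even though many compilers accept it. A possible alternative, sketched below with the hypothetical helper names `ToUnderlying` and `FromUnderlying`, is a plain `static_cast` through the enum's underlying type. Note that `FromUnderlying` does not check that the integer corresponds to a declared enumerator.

```cpp
#include <type_traits>

enum class OrderType : int { Buy, Sell, BuyStop, SellStop };

// Hypothetical helper: enum -> its underlying integer type.
template <typename E>
constexpr auto ToUnderlying(E e) noexcept
{
    static_assert(std::is_enum_v<E>, "E must be an enumeration type");
    return static_cast<std::underlying_type_t<E>>(e);
}

// Hypothetical helper: underlying integer -> enum (no range validation).
template <typename E>
constexpr E FromUnderlying(std::underlying_type_t<E> v) noexcept
{
    static_assert(std::is_enum_v<E>, "E must be an enumeration type");
    return static_cast<E>(v);
}
```

In the stream operators this would replace the union with something like `write(lhs, ToUnderlying(rhs.Type))` on output and a temporary `int` followed by `FromUnderlying<OrderType>(...)` on input.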

Very comprehensive answer and OFC +1. But a few corrections for now: 1) "I would suggest simplifying the code ... by ... not (?) saving/reading ...". Typo? - LRDPRDX, Commented Dec 20, 2021 at 9:55
I remember that designated initializers are more of a C feature than a C++ one. I don't know about C++20, but in the earlier standards AFAIK you cannot use both them and user-defined ctors. And even in C++20 I see that they are a restricted version of those in C. So I would not use them at all. - LRDPRDX, Commented Dec 20, 2021 at 10:15
A plain binary dump of simple types is certainly simple, and can be applied to whole structures too. It can be the fastest/easiest thing for quick and dirty saving and loading. But it has drawbacks for serious use. A serious version would be more like TIFF or PNG, at least tagging the types and possibly both type and field, and defining the endianness. I agree that such tagging does not have to be verbose text! - JDługosz, Commented Dec 20, 2021 at 14:50
If compiled with some other compiler or on a different enough architecture, this code will break. Many companies have found that they’ve accidentally created two incompatible data formats, and often, it’s not possible to tell automatically which one a file is saved in. (Or, worse, a disk partition.) So you definitely want to at least define the exact width of each numeric field, and the endianness. Note that you can just use the native endianness and int32_t or uint32_t for int. These have no performance penalty in the compiler you’re using, and make your code much more portable. - Davislor, Commented Dec 20, 2021 at 23:33
Good point on using int32_t and so on. Note that the file is not in binary format but in text, so endianness would be no issue. - Shigoto Shoujin, Commented Dec 21, 2021 at 1:01
