4
\$\begingroup\$

As part of a larger project I'm writing to learn C++, I'm trying to read through some almost-XML files generated by another program. Unfortunately this program uses its own custom escaping logic, so some of these files aren't actually valid XML. As far as I can tell, its escape logic works like this:

  • Attribute values are always surrounded with double quotes. The only escape sequence is \", evaluating to a double quote. Backslashes elsewhere are just interpreted literally; it's impossible for a value to end with a backslash.
  • Nested elements always have the opening outer tag on one line, each inner tag on its own line, and the closing outer tag on another.
  • Tags with text content are always entirely on a single line, and interpret all characters between the opening tag and closing tag literally. It is impossible for a copy of the closing tag to be in the text content, but other tags may be.

In practice the files have a bit more structure than this, but I've been trying to work off of just these assumptions to keep the code flexible.

As an example, this is a possible, invalid XML, input:

<outer attr="&amp;value\""> <empty /> <inner><notnested></outer></inner> </outer> 

Which is equivalent to this valid XML:

<outer attr="&amp;amp;value&quot;"> <empty /> <inner>&lt;notnested&gt;&lt;/outer&gt;</inner> </outer> 

My approach to dealing with this is to do some simplified parsing over these files to convert them into valid XML, before feeding them into a more traditional XML parser (leaning towards pugixml, but open to change if it makes things easier). Here are the relevant functions in a small demo program:

#include <cctype> #include <fstream> #include <iostream> #include <sstream> #include <string> static void put_xml_escaped(char c, std::stringstream& stream) { switch (c) { case '"': stream << "&quot;"; break; case '\'': stream << "&apos;"; break; case '<': stream << "&lt;"; break; case '>': stream << "&gt;"; break; case '&': stream << "&amp;"; break; default: stream << c; break; } } static std::stringstream preprocess_file(std::string filename) { enum tag_parse_state { // " <tag attr="val">content</tag> " before_tag, // ^^^^ tag_name, // ^^^^ tag_body, // ^^^^^^^ ^ attr, // ^^^^ content, // ^^^^^^^ closing_tag, // ^^^^^^ closed // ^^^^ }; std::ifstream file{filename}; std::stringstream processed{}; for (std::string line; std::getline(file, line); ) { tag_parse_state state = before_tag; size_t tag_name_start; size_t closing_tag_start, closing_tag_end; for (size_t i = 0; i < line.size(); i++) { auto c = line[i]; if (state == before_tag) { if (c == '<') { processed << c; state = tag_name; tag_name_start = i + 1; } else if (!isspace(c)) { throw std::runtime_error( "Failed to parse line (text before inital tag): " + line ); } } else if (state == tag_name) { processed << c; if (c == ' ' || c == '>') { // Nice little coincidence: if we're parsing over a closing tag, this will // double up on the '/' and thus not find itself auto closing_tag_str = ( "</" + line.substr(tag_name_start, i - tag_name_start) + ">" ); closing_tag_start = line.rfind(closing_tag_str); closing_tag_end = ( closing_tag_start == std::string::npos ? std::string::npos : closing_tag_start + closing_tag_str.size() - 1 ); if (c == ' ') { state = tag_body; } else { state = closing_tag_start == std::string::npos ? closed : content; } } } else if (state == tag_body) { processed << c; if (c == '"') { state = attr; } else if (c == '>') { state = closing_tag_start == std::string::npos ? closed : content; } } else if (state == attr) { if (c == '"') { processed << c; state = tag_body; } else if (c == '\\' && i + 1 < line.size() && line[i + 1] == '"') { put_xml_escaped('"', processed); i++; } else { put_xml_escaped(c, processed); } } else if (state == content) { put_xml_escaped(c, processed); if (i + 1 >= closing_tag_start) { state = closing_tag; } } else if (state == closing_tag) { processed << c; if (i >= closing_tag_end) { state = closed; } } else if (state == closed) { if (!isspace(c)) { throw std::runtime_error( "Failed to parse line (text after closing tag): " + line ); } } } if (state != closed) { throw std::runtime_error("Failed to parse line (missed closing tag): " + line); } } return processed; } int main() { std::cout << preprocess_file("example.xml").str() << "\n"; return 0; } 

I'm interested in feedback on:

  • General best practices (minus code style)
  • Edge cases I may have missed
  • Algorithmic improvements - I don't think you can do much better than O(n) when using a state machine like this, but maybe there's a completely different approach.
\$\endgroup\$
2
  • \$\begingroup\$This isn't really an XML question. It's about a proprietary data syntax that has some resemblances to XML, but you can't use any XML tooling, so this doesn't help you.\$\endgroup\$CommentedJul 30, 2021 at 16:31
  • \$\begingroup\$Fair enough, removed that tag.\$\endgroup\$
    – apple1417
    CommentedJul 30, 2021 at 22:20

1 Answer 1

1
\$\begingroup\$

Use enum class

Consider using enum class instead of a plain enum for a little more type safety.

I would also recommend you use all-caps for the names of the enum options, as this is commonly done and is a strong hint that those are constants and not variables.

Prefer using switch-statements when enums are involved

Use a switch-statement instead of a chain of if-else-statements when checking the value of an enum variable. Most compilers will then be able to emit warnings if you forget a case statement. It also makes the code a little bit more readable.

Make preprocess_file() take istream and ostreams as parameters

Your preprocess_file() has a filename as an argument, and returns a std::stringstream. However, that means that now it is responsible for opening the file, and returning the result as a std::stringstream is inefficient if you are just going to write it to std::out afterwards. Consider having this function take two parameters: one for the input stream, and one for the output stream, like so:

void preprocess(std::istream &file, std::ostream &processed) { ... } 

Then in main() you can write:

preprocess(std::ifstream("example.xml"), std::cout); std::cout << "\n"; 

If you want to be able to write std::cout << preprocess(...), that is possible too, but then you have to make it a class and add a friend operator<<(std::ostream &, preprocess &) function, which might be a bit overkill for your application.

Missing error checking

I/O errors can happen at any time, both when reading from and writing to a file. You could add checks after every I/O operation, but that would quickly lead to a lot of messy code. Luckily, any error state persists, so what you can do is just add checks at the end of preprocess_file(), like so:

if (file.fail()) { throw std::runtime_error("Error while reading input"); } if (processed.fail()) { throw std::runtime_error("Error while writing output"); } 

Performance

You are going over each character individually, most of the time just copying it to the output. For each state you only have a few characters you want to match. Consider using std::string::find_first_of() to get the position of the first interesting character; the standard library might do that in a more optimal way.

\$\endgroup\$
1
  • 1
    \$\begingroup\$Lot of good tips, thanks. I realized I didn't mention it in the post, but for what it's worth I was looking at feeding pugixml from a stream, hence why I left it outputting the stringstream. I can definitely see how having the parent function manage ownership would be better though.\$\endgroup\$
    – apple1417
    CommentedJul 30, 2021 at 22:29

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.