As part of a larger project I'm writing to learn C++, I'm trying to read through some almost-XML files generated by another program. Unfortunately this program uses its own custom escaping logic, so some of these files aren't actually valid XML. As far as I can tell, its escape logic works like this:
- Attribute values are always surrounded with double quotes. The only escape sequence is
\"
, evaluating to a double quote. Backslashes elsewhere are just interpreted literally; it's impossible for a value to end with a backslash. - Nested elements always have the opening outer tag on one line, each inner tag on its own line, and the closing outer tag on another.
- Tags with text content are always entirely on a single line, and interpret all characters between the opening tag and closing tag literally. It is impossible for a copy of the closing tag to be in the text content, but other tags may be.
In practice the files have a bit more structure than this, but I've been trying to work off of just these assumptions to keep the code flexible.
As an example, this is a possible, invalid XML, input:
<outer attr="&value\""> <empty /> <inner><notnested></outer></inner> </outer>
Which is equivalent to this valid XML:
<outer attr="&amp;value""> <empty /> <inner><notnested></outer></inner> </outer>
My approach to dealing with this is to do some simplified parsing over these files to convert them into valid XML, before feeding them into a more traditional XML parser (leaning towards pugixml, but open to change if it makes things easier). Here are the relevant functions in a small demo program:
#include <cctype> #include <fstream> #include <iostream> #include <sstream> #include <string> static void put_xml_escaped(char c, std::stringstream& stream) { switch (c) { case '"': stream << """; break; case '\'': stream << "'"; break; case '<': stream << "<"; break; case '>': stream << ">"; break; case '&': stream << "&"; break; default: stream << c; break; } } static std::stringstream preprocess_file(std::string filename) { enum tag_parse_state { // " <tag attr="val">content</tag> " before_tag, // ^^^^ tag_name, // ^^^^ tag_body, // ^^^^^^^ ^ attr, // ^^^^ content, // ^^^^^^^ closing_tag, // ^^^^^^ closed // ^^^^ }; std::ifstream file{filename}; std::stringstream processed{}; for (std::string line; std::getline(file, line); ) { tag_parse_state state = before_tag; size_t tag_name_start; size_t closing_tag_start, closing_tag_end; for (size_t i = 0; i < line.size(); i++) { auto c = line[i]; if (state == before_tag) { if (c == '<') { processed << c; state = tag_name; tag_name_start = i + 1; } else if (!isspace(c)) { throw std::runtime_error( "Failed to parse line (text before inital tag): " + line ); } } else if (state == tag_name) { processed << c; if (c == ' ' || c == '>') { // Nice little coincidence: if we're parsing over a closing tag, this will // double up on the '/' and thus not find itself auto closing_tag_str = ( "</" + line.substr(tag_name_start, i - tag_name_start) + ">" ); closing_tag_start = line.rfind(closing_tag_str); closing_tag_end = ( closing_tag_start == std::string::npos ? std::string::npos : closing_tag_start + closing_tag_str.size() - 1 ); if (c == ' ') { state = tag_body; } else { state = closing_tag_start == std::string::npos ? closed : content; } } } else if (state == tag_body) { processed << c; if (c == '"') { state = attr; } else if (c == '>') { state = closing_tag_start == std::string::npos ? closed : content; } } else if (state == attr) { if (c == '"') { processed << c; state = tag_body; } else if (c == '\\' && i + 1 < line.size() && line[i + 1] == '"') { put_xml_escaped('"', processed); i++; } else { put_xml_escaped(c, processed); } } else if (state == content) { put_xml_escaped(c, processed); if (i + 1 >= closing_tag_start) { state = closing_tag; } } else if (state == closing_tag) { processed << c; if (i >= closing_tag_end) { state = closed; } } else if (state == closed) { if (!isspace(c)) { throw std::runtime_error( "Failed to parse line (text after closing tag): " + line ); } } } if (state != closed) { throw std::runtime_error("Failed to parse line (missed closing tag): " + line); } } return processed; } int main() { std::cout << preprocess_file("example.xml").str() << "\n"; return 0; }
I'm interested in feedback on:
- General best practices (minus code style)
- Edge cases I may have missed
- Algorithmic improvements - I don't think you can do much better than O(n) when using a state machine like this, but maybe there's a completely different approach.