6
\$\begingroup\$

This code is to solve a Hackerrank problem called Attribute Parser. It's my first time dealing with regex in cpp and I wonder if I managed to be expressive enough to tell with the code.

The HRML explanation from Hackerrank:

We have defined our own markup language HRML. In HRML, each element consists of a starting and ending tag, and there are attributes associated with each tag. Only starting tags can have attributes. We can call an attribute by referencing the tag, followed by a tilde, '~' and the name of the attribute. The tags may also be nested.

The opening tags follow the format:

<tag-name attribute1-name = "value1" attribute2-name = "value2" ... > 

The closing tags follow the format:

< /tag-name > 

For example:

<tag1 value = "HelloWorld"> <tag2 name = "Name1"> </tag2> </tag1> 

The attributes are referenced as:

tag1~value tag1.tag2~name 

You are given the source code in HRML format consisting of lines. You have to answer queries. Each query asks you to print the value of the attribute specified. Print "Not Found!" if there isn't any such attribute.

I've decided to solve this using the cpp regex engine from the standard library.

//! Hackerrank HRML Attribute Parser //! This program reads from an input file that passes a HRML document as explained in //! in the Hackerrank challenge "Attribute Parser". //! The first line of the input file include n and q, where n is the number of lines //! of the HRML documment that follows, and q is the number of querries that follow the //! HRML documment #include <iostream> #include <fstream> #include <string> #include <regex> #include <unordered_map> int main() { std::ifstream ifile("input"); std::smatch result; if (ifile.is_open()) { int n,q; ifile >> n >> q; ifile.ignore(); std::string document; for (;n>0;--n) { std::string line; std::getline(ifile, line); document.append(line); } using Tag_name = std::string; using Attribute_name = std::string; std::unordered_map<Tag_name, std::unordered_map<Attribute_name, std::string>> tag_map; Tag_name tag_name{}; std::regex tag_regex(R"(<[^>]*)"); auto tag_matches_begin = std::sregex_iterator(document.begin(), document.end(), tag_regex); auto tag_matches_end = std::sregex_iterator(); for (auto tag_it = tag_matches_begin; tag_it != tag_matches_end; ++tag_it) { std::smatch match = *tag_it; auto match_string = match.str(); // if beginig of tag <tag ... if (std::regex_search(match_string, result, std::regex(R"(<\s*([^/]\w*))"))) { std::string new_tag_name = result[1].str(); if (tag_name.empty()) { tag_name = new_tag_name; } else { tag_name = tag_name + "." + new_tag_name; } std::string search_string = match_string; while (std::regex_search(search_string, result, std::regex(R"re(([^=\s]*)\s*=\s*"([^"]*))re"))) { std::string attribute_name = result[1].str(); std::string attribute_value = result[2].str(); tag_map[tag_name][attribute_name] = attribute_value; search_string = result.suffix(); } } // if end of tag </tag> else if (std::regex_search(match_string, result, std::regex(R"(</\s*(\w*))"))) { std::string end_tag_name = result[1].str(); tag_name = std::regex_replace(tag_name, std::regex(end_tag_name), ""); tag_name = std::regex_replace(tag_name, std::regex(R"(\.$)"), ""); } } for (;q>0;--q) { std::string line; std::getline(ifile, line); std::regex_search(line, result, std::regex(R"((.*)~(.*))")); std::string tag_name = result[1].str(); std::string attribute_name = result[2].str(); if (tag_map[tag_name].count(attribute_name) > 0 ) { std::cout << tag_map[tag_name][attribute_name] << "\n"; } else { std::cout << "Not Found!" << "\n"; } } std::cout << std::flush; } else { std::cout << "Unable to open input file" << std::endl;; } return 0; } 

I find my regex expressions a little bit cryptic. I wonder how it feels for a third person who reads the code. Any suggestion on style and other tips?

The code works fine, and as an example, the following input:

7 10 <a value = "GoodVal"> <b value = "BadVal" size = "10"> <c height = "auto"> <d size = "3"> <e strength = "200%"> <f a1 = "1" a2 = "2" a3 = "3"> </f> </e> </d> </c> </b> </a> a.b.c.d.e.f~a1 a.b.f~a1 a.b~size a.b.c.d.e.f~a2 a.b.c.d.e.f~a3 a.c~height a.b.d.e~strength a.b.c.d.e~strength d~sze a.b.c.d~size 

Produces the following output:

1 Not Found! 10 2 3 Not Found! Not Found! 200% Not Found! 3 
\$\endgroup\$
3
  • 2
    \$\begingroup\$I wouldn't use regex. I would write a recursive descent parser. en.wikipedia.org/wiki/Recursive_descent_parser\$\endgroup\$CommentedDec 25, 2017 at 21:30
  • \$\begingroup\$@RobertAndrzejuk Interesting, I don't know much about parsers, any special motivation for implementing that particular one?\$\endgroup\$
    – Blasco
    CommentedDec 25, 2017 at 21:52
  • \$\begingroup\$@RobertAndrzejuk, I believe it is possible to solve with tweaking std::ctype<char>, and then imbuing that into the line stream. It will be a little bit awkward, but will be extremely easy to implement given some time to test for corner cases.\$\endgroup\$CommentedDec 26, 2017 at 12:37

1 Answer 1

7
\$\begingroup\$

Here are some things that may help you improve your program.

Fix the bug

When an end tag is found, we have this code:

else if (std::regex_search(match_string, result, std::regex(R"(</\s*(\w*))"))) { std::string end_tag_name = result[1].str(); std::cout << "End tag = [" << end_tag_name << "]\n"; tag_name = std::regex_replace(tag_name, std::regex(end_tag_name), ""); tag_name = std::regex_replace(tag_name, std::regex(R"(\.$)"), ""); } 

That might work for some inputs, but it's almost certainly not what you intend. Consider the following valid input:

<a value="first"><beta value="second"><e value="third"></e></beta></a> 

The problem is that when we get to the closing </e> tag, tag_name = "a.beta.e" but the result of the regex_replace calls alters it to become "a.bta" which is clearly wrong. In this case, I'd be inclined to use a simple queue of tags, pushing and popping them off the stack as needed, and constructing the lookup string on the fly and completely eliminating this use of regex within the program. Alternatively, since we're only looking for the tag at the end of the extended tag, one could use a single expression:

tag_name = std::regex_replace(tag_name, std::regex(R"(\.)" + end_tag_name + R"($)"), ""); 

But be aware that using user-supplied data as part of a regex is potentially perilous.

Think of the user

Having a hard-coded file name is usually a bad idea. It limits the flexibility of the user of the program and makes it impossible, for example, to redirect input from another program to use as input to this one. For all of these reasons, I'd suggest that either the program should read from std::cin or that the file name should be provided as a command line parameter. Also, although I know it's not under your control, the format of the input file which requires two line counts, is similarly terrible design (although very common with such code contests). Rather than forcing the user to count lines, in the real world we'd probably prefer to use special delimiter and let the computer do the counting. That makes it much easier for a human to make changes to the input file.

Fix spelling errors

The code has documment instead of document and querries instead of queries. These kinds of typos don't bother the compiler at all, of course, but they will bother human readers of the code and make it a little more difficult to understand and maintain. It's worth the small extra bit of time to do a spell check on comments.

Use C++ idioms

The code has if (ifile.is_open()) at the top of the program. I'd write that instead as if (ifile) to make it easier to read and also to allow for

Break up the code into smaller functions

Everything is being done in main which makes it harder to read and understand. Rather than having everything in one long function, it would be easier to read and maintain if each discrete step were its own function. For example, the steps here are to parse and store the data and then to run queries against it. Here's one possible main with these things applied:

int main(int argc, char *argv[]) { if (argc < 2) { std::cout << "Usage: attparse filename\n"; return 0; } std::ifstream ifile(argv[1]); if (!ifile) { std::cout << "Unable to open input file\n"; return 1; } int n,q; ifile >> n >> q; ifile.ignore(std::numeric_limits<std::streamsize>::max(), '\n'); auto tag_map{parselines(ifile, n)}; queries(ifile, q, tag_map); return 0; } 

Consider more robust error checking

If we replace the first line of your test file with this one:

<a malformed value = "aval" tricky = "></a"> 

we get "Not Found!" for everything. The cause, of course, is the "tricky" attribute that contains tag delimiting characters. Also, the "malformed" tag has no value. However, the program does not complain about either of these. Perhaps it should! Similarly, if the input file is malformed, (e.g. the number of lines is specified as "foo" for example), the program will issue no complaint.

Understand standard containers

In the part the looks up tags, the code includes this line:

if (tag_map[tag_name].count(attribute_name) > 0 ) { 

That's not necessarily an error, but you should be aware that by using tag_map[tag_name], you're actually inserting tag_name into tag_map if it didn't already exist. See the documentation for std::unordered_map::operator[] for details.

Here's a way to write a function that takes a query line and a const reference to the "database" and returns the appropriate string:

std::string query(const std::string &line, const TagMap &tag_map) { std::smatch result; std::string resultstring{"Not Found!"}; std::regex_search(line, result, std::regex(R"((.*)~(.*))")); std::string tag_name = result[1].str(); std::string attribute_name = result[2].str(); auto tag = tag_map.find(tag_name); if (tag != tag_map.end()) { auto attr = tag->second.find(attribute_name); if (attr != tag->second.end()) { resultstring = attr->second; } } return resultstring; } 

Use const where practical

Each of the regex expressions in the code are constants, so I'd probably declare and name them as such. For example, instead of this:

while (std::regex_search(search_string, result, std::regex(R"re(([^=\s]*)\s*=\s*"([^"]*))re"))) { 

I'd advocate writing this:

const auto attr_regex{std::regex(R"re(([^=\s]*)\s*=\s*"([^"]*)")re")}; while (std::regex_search(search_string, result, attr_regex)) { 

(Note also the addition of the closing " for the attribute value.)

Recheck your regular expressions

The closing > on a close tag is not actually explicitly checked for in the regex.

Consider an alternative approach

As an exercise to learn how to use regex, this is not bad, but I'd probably use the standard flex and bison tools if I were actually writing a parser. These tools have a somewhat steep learning curve but result in very nice fast parsers and scanners and the time invested in learning them will be repaid many times over if you continue to write software.

\$\endgroup\$
4
  • \$\begingroup\$Great feedback! Thank you Edward. What would be a good approach to avoid inserting a new value with "tag_map[tag_name]"? I think I could try "at(tag_name)" and catch the exception if it doesn't exsist. Is there a better approach?\$\endgroup\$
    – Blasco
    CommentedDec 26, 2017 at 17:25
  • 1
    \$\begingroup\$I've added an example query function to show one way to do it. Generally, it's better for performance to avoid exceptions if possible.\$\endgroup\$
    – Edward
    CommentedDec 26, 2017 at 17:31
  • \$\begingroup\$I don't understand the purpose of ifile.ignore() in your suggested main function. Could you please explain that? As far as I understand it, it reads and discards all the content until the end of the file, but this doesn't make sense to me given that parselines(ifile, ...) and queries(ifile, ...) are used subsequently. (Besides, you use both ifile and infile, it should be only one of them.)\$\endgroup\$
    – mkrieger1
    CommentedJan 18, 2018 at 17:36
  • \$\begingroup\$The call to ifile.ignore() was only intended to read to the end of the line rather than the end of the file. I've fixed my answer. Thanks!\$\endgroup\$
    – Edward
    CommentedJan 18, 2018 at 18:28

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.