5
\$\begingroup\$

I am trying to write a utility for transforming xml trees using single pass and specifying node handlers.
I used pugixml prior to this, and the library provides a rather convenient API to traverse child nodes based on XPath strings.
However, I don't quite like the idea of multiple invocations because it means traversing a xml document multiple times. What I would like to have is a utility to help one traverse and transform a xml document like AST using dfs node walker.

The main idea for the utility is to have a simple declarative/reactive API to specify node handlers before transform's invocation.
This is a part of small compilable example (full code at the end):

#include <pugixml.hpp> #include <iostream> namespace dummy { struct data { size_t nodes; std::vector<std::string> countries; }; } // namespace dummy class printing_walker { public: explicit printing_walker(const node_link<pugi::xml_node> &nodeLink, size_t *indent = nullptr) // : dfs_(nodeLink), indent_(indent) { indent_ && (*indent_ += 2); } printing_walker(printing_walker &&other) // : dfs_(other.dfs_.node_link()), indent_(other.indent_) { other.indent_ = nullptr; } template<typename Factory> std::optional<std::runtime_error> operator()(const Factory &factory, dummy::data &data) const { ++data.nodes; const auto &nodeType = dfs_.node_link().node.type(); const std::string indent = indent_ ? std::string(*indent_, ' ') : ""; if (pugi::xml_node_type::node_element == nodeType) { std::cout << indent << '<' << traits::name(dfs_.node_link().node) << '>' << std::endl; auto err = dfs_(factory, data); std::cout << indent << "</" << traits::name(dfs_.node_link().node) << '>' << std::endl; return err; } if (pugi::xml_node_type::node_pcdata == nodeType) { std::cout << indent << '"' << dfs_.node_link().node.text().as_string() << '"' << std::endl; return std::nullopt; } return dfs_(factory, data); } ~printing_walker() { indent_ && (*indent_ -= 2); } private: basic_dfs_walker<pugi::xml_node> dfs_; size_t *indent_; }; class node_country { public: constexpr explicit node_country(const node_link<pugi::xml_node> &nodeLink, size_t *indent = nullptr) : nodeLink_(nodeLink), indent_(indent) {} template<typename FactoryT> std::optional<std::runtime_error> operator()(const FactoryT &, dummy::data &data) const { ++data.nodes; std::cout << (indent_ ? std::string(*indent_, ' ') : "") << '<' << nodeLink_.node.child_value() << '>' << std::endl; data.countries.template emplace_back(nodeLink_.node.child_value()); return std::nullopt; } private: const node_link<pugi::xml_node> &nodeLink_; size_t *indent_; }; constexpr const char sample[] = R"(<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type='text/xsl'?> <CATALOG> <CD> <TITLE>dill diya galla</TITLE> <ARTIST>Arijit singh</ARTIST> <COUNTRY>India</COUNTRY> <COMPANY>tseries</COMPANY> <PRICE>10.90</PRICE> <YEAR>2018</YEAR> </CD> <CD> <TITLE>Saiyara</TITLE> <ARTIST>Atif Aslam</ARTIST> <COUNTRY>Uk</COUNTRY> <COMPANY>Records</COMPANY> <PRICE>9.90</PRICE> <YEAR>2015</YEAR> </CD> <CD> <TITLE>Khairiyat</TITLE> <ARTIST>Sonu nigam</ARTIST> <COUNTRY>india</COUNTRY> <COMPANY>radio</COMPANY> <PRICE>9.90</PRICE> <YEAR>2019</YEAR> </CD> <CD> <TITLE>all is well</TITLE> <ARTIST>Amir Khan</ARTIST> <COUNTRY>pak</COUNTRY> <COMPANY>Virgin records</COMPANY> <PRICE>10.20</PRICE> <YEAR>2012</YEAR> </CD> <CD> <TITLE>rockstar</TITLE> <ARTIST>kk</ARTIST> <COUNTRY>india</COUNTRY> <COMPANY>Eros</COMPANY> <PRICE>9.90</PRICE> <YEAR>2014</YEAR> </CD> </CATALOG>)"; template<> struct traits::_children<pugi::xml_node> { decltype(auto) operator()(const pugi::xml_node &node) const { return node.children(); } }; int main(int, char **) { pugi::xml_document document{}; document.load_string(sample); size_t indent = 0; const auto &onDefault = [&indent](const auto &nodeLink) { return printing_walker(nodeLink, &indent); }; const auto &onCountry = [&indent](const auto &nodeLink) { return node_country(nodeLink, &indent); }; const auto &transform = make_transform<pugi::xml_node>(onDefault, // on("COUNTRY", onCountry)); dummy::data data{}; std::optional<std::runtime_error> err = transform(document.root(), data); if (!err) std::cout << "transformed " << data.nodes << " nodes" << std::endl; std::cout << "Countries:\n"; for (const auto &country : data.countries) std::cout << country << std::endl; return 0; } 

This example prints each node by default and invokes a customized handler for each COUNTRY node. For now there's a complete but naive implementation that uses hash table to find a specialized node handler based on the node name. I believe the complexity to be O(1) on average accroding to cppreference for std::unordered_map.

Here is a part of an API that I intend to implement, but am not quite sure if I am going in a right direction:

const auto &transform = make_transform<pugi::xml_node>( // default node handler onDefault, // specialized handlers, order of precedence matters. More specialized go after less specialized // on each node with name "COUNTRY" on("COUNTRY", onCountry), // on each node with name "COUNTRY" and attribute "language on("COUNTRY", attribute("language"), onLanguage), // like above, but for "english" specifically on("COUNTRY", attribute("language", "english"), onEnglishLanguage)); 

What I don't like:

  1. My code "hijacks" the control flow. It requires the user to explicilty state that he wants to continue the dfs traversal by inheriting/aggregating basic dfs node walker. Customizing traversal behavior can be benificial for many cases (for example if one wants to stop at some point), but for now one has to write a lot of code and explicitly manage the control flow. I would like to be able to write less code -> only the code that provides business logic. Like simple lambda.
  2. The API does not feel like std::transform. The standard library algorithm prevents the user from mingling with the control flow by requesting an UnaryPredicate that returns an object of a type convertible to output iterator. The problem is, I don't know whether it even possible to represent an output tree by an output iterator. So for now I pass Data & as an argument for a walker's operator().
    I am thinking of requiesting a convertible object or a modificating callable from the handler. Something like: return [&toAdd](auto &data){ data.someMember += toAdd; }.
  3. As a workaround for 1. I decided to pass a walker factory to the walker's operator(), so the node walker can fetch a new walker from the factory for its child nodes. It doesn't feel right! The only reason for a node walker to get an instance of a factory is to provide the means to continue traversal.
  4. I tried to write the implementation with optimization and copy elision in mind. But I am not sure that the code is optimization friendly. The only thing I am sure is that variadic_callable is not copied. That makes me a little happier.

Here is the complete code with implementation and example: https://godbolt.org/z/ejqjK35P9

 #include <cstddef> #include <functional> #include <type_traits> #include <utility> #include <optional> #include <stdexcept> #include <tuple> #include <variant> #include <string_view> #include <unordered_map> namespace detail { template<typename... Impl> class variadic_callable { public: template<typename Factory, typename... Args> constexpr explicit variadic_callable(Factory f, Args &&...args) // : varCallable_(std::in_place_type<_in_place<std::invoke_result_t<Factory, Args...>>>, [&]() { return f(std::forward<Args>(args)...); }) {} variadic_callable(const variadic_callable &) = delete; variadic_callable(variadic_callable &&) = delete; template<typename... Args> constexpr decltype(auto) operator()(Args &&...args) const { return std::visit( [&](const auto &wrapped) { return wrapped.callable(std::forward<Args>(args)...); }, varCallable_); } private: template<typename T> struct _in_place { T callable; template<typename Factory> _in_place(Factory factory) // : callable(factory()) {} }; private: std::variant<_in_place<Impl>...> varCallable_; }; } // namespace detail namespace traits { template<typename NodeT> struct _children { constexpr decltype(auto) operator()(const NodeT &node) const { return std::invoke(&NodeT::children, node); } }; template<typename NodeT> decltype(auto) children(const NodeT &node) { return _children<NodeT>{}(node); } template<typename T> struct _name { constexpr decltype(auto) operator()(const T &t) const { return std::invoke(&T::name, t); } }; template<typename T> constexpr decltype(auto) name(const T &t) { return _name<T>{}(t); } } // namespace traits template<typename NodeT> struct node_link { using node_type = NodeT; node_type node; size_t depth = 0; const node_link *parent = nullptr; }; template<typename DefaultWalkerConstructor, typename... WalkerConstructors> class walker_factory { public: explicit walker_factory(DefaultWalkerConstructor defaultWalkerConstructor, std::pair<std::string_view, WalkerConstructors>... pairs) // : defaultConstructor_(defaultWalkerConstructor) //, { // TODO: assert keys for uniqueness (constructorsMap_.emplace(std::get<0>(pairs), std::move(std::get<1>(pairs))), ...); } walker_factory(const walker_factory &) = delete; walker_factory(walker_factory &&) = delete; template<typename NodeT> [[nodiscard]] auto get(const node_link<NodeT> &nodeLink) const { using variadic_node_walker = detail::variadic_callable<constructed_walker_type<NodeT, DefaultWalkerConstructor>, constructed_walker_type<NodeT, WalkerConstructors>...>; auto found = constructorsMap_.find(traits::name(nodeLink.node)); if (found == constructorsMap_.cend()) return variadic_node_walker(defaultConstructor_, nodeLink); return std::visit( [&nodeLink](const auto &constructor) { return variadic_node_walker(constructor, nodeLink); }, found->second); } private: template<typename NodeT, typename WConstructor> using constructed_walker_type = std::invoke_result_t<WConstructor, node_link<NodeT>>; private: DefaultWalkerConstructor defaultConstructor_; std::unordered_map<std::string_view, // variant's list can't be empty, hence DefaultWalkerConstructor std::variant<DefaultWalkerConstructor, WalkerConstructors...>> constructorsMap_; }; template</*template<typename, typename> class Pair,*/ typename DefaultWalkerConstructor, typename... WalkerConstructors> walker_factory(DefaultWalkerConstructor defaultWalkerConstructor, /*Pair*/ std::pair<std::string_view, WalkerConstructors>... pairs) -> walker_factory<DefaultWalkerConstructor, WalkerConstructors...>; template<typename NodeT, typename DefaultWalkerConstructor, typename... Pairs> constexpr auto make_transform(DefaultWalkerConstructor defaultWalkerConstructor, Pairs... pairs) { return [walkerFactory = walker_factory(defaultWalkerConstructor, pairs...)](const NodeT &node, auto &parseData) { const auto &walker = walkerFactory.get(node_link<NodeT>{ node }); return walker(walkerFactory, parseData); }; } // for composition use template<typename NodeT> class basic_dfs_walker { public: using node_type = NodeT; constexpr explicit basic_dfs_walker(const node_link<node_type> &nodeLink) // : nodeLink_(nodeLink) {} template<typename WalkerFactory, typename DataT> std::optional<std::runtime_error> operator()(const WalkerFactory &walkerFactory, DataT &parseData) const { for (const node_type &child : traits::children(nodeLink_.node)) { auto err = on_child(walkerFactory, child, parseData); if (!err) continue; return err; } return std::nullopt; } [[nodiscard]] const auto &node_link() const { return nodeLink_; } ~basic_dfs_walker() {} private: template<typename WalkerFactory, typename DataT> [[nodiscard]] std::optional<std::runtime_error> on_child(const WalkerFactory &walkerFactory, const node_type &child, DataT &parseData) const { const auto &nodeWalker = walkerFactory.get(::node_link<node_type>{ child, nodeLink_.depth + 1, &nodeLink_ }); return nodeWalker(walkerFactory, parseData); } protected: static void operator delete(void *) {} private: const ::node_link<node_type> nodeLink_; }; template<typename NodeT, typename... Pairs> constexpr auto make_transform(Pairs &&...pairs) { return make_transform<NodeT>( [](const auto &nodeLink) { return basic_dfs_walker<NodeT>{ nodeLink }; }, std::forward<Pairs>(pairs)...); } template<typename T> constexpr auto on_default() { return [](const auto &nodeLink) { return T(nodeLink); }; } template<typename T> constexpr auto on(std::string_view name) { return std::make_pair(name, [](const auto &nodeLink) { return T(nodeLink); }); } template<typename Constructor> constexpr auto on(std::string_view name, Constructor &&constructor) { return std::make_pair(name, std::forward<Constructor>(constructor)); } #include <pugixml.hpp> #include <iostream> namespace dummy { struct data { size_t nodes; std::vector<std::string> countries; }; } // namespace dummy class printing_walker { public: explicit printing_walker(const node_link<pugi::xml_node> &nodeLink, size_t *indent = nullptr) // : dfs_(nodeLink), indent_(indent) { indent_ && (*indent_ += 2); } template<typename Factory> std::optional<std::runtime_error> operator()(const Factory &factory, dummy::data &data) const { ++data.nodes; const auto &nodeType = dfs_.node_link().node.type(); const std::string indent = indent_ ? std::string(*indent_, ' ') : ""; if (pugi::xml_node_type::node_element == nodeType) { std::cout << indent << '<' << traits::name(dfs_.node_link().node) << '>' << std::endl; auto err = dfs_(factory, data); std::cout << indent << "</" << traits::name(dfs_.node_link().node) << '>' << std::endl; return err; } if (pugi::xml_node_type::node_pcdata == nodeType) { std::cout << indent << '"' << dfs_.node_link().node.text().as_string() << '"' << std::endl; return std::nullopt; } return dfs_(factory, data); } ~printing_walker() { indent_ && (*indent_ -= 2); } private: basic_dfs_walker<pugi::xml_node> dfs_; size_t *indent_; }; class node_country { public: constexpr explicit node_country(const node_link<pugi::xml_node> &nodeLink, size_t *indent = nullptr) : nodeLink_(nodeLink), indent_(indent) {} template<typename FactoryT> std::optional<std::runtime_error> operator()(const FactoryT &, dummy::data &data) const { ++data.nodes; std::cout << (indent_ ? std::string(*indent_, ' ') : "") << '<' << nodeLink_.node.child_value() << '>' << std::endl; data.countries.template emplace_back(nodeLink_.node.child_value()); return std::nullopt; } private: const node_link<pugi::xml_node> &nodeLink_; size_t *indent_; }; constexpr const char sample[] = R"(<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type='text/xsl'?> <CATALOG> <CD> <TITLE>dill diya galla</TITLE> <ARTIST>Arijit singh</ARTIST> <COUNTRY>India</COUNTRY> <COMPANY>tseries</COMPANY> <PRICE>10.90</PRICE> <YEAR>2018</YEAR> </CD> <CD> <TITLE>Saiyara</TITLE> <ARTIST>Atif Aslam</ARTIST> <COUNTRY>Uk</COUNTRY> <COMPANY>Records</COMPANY> <PRICE>9.90</PRICE> <YEAR>2015</YEAR> </CD> <CD> <TITLE>Khairiyat</TITLE> <ARTIST>Sonu nigam</ARTIST> <COUNTRY>india</COUNTRY> <COMPANY>radio</COMPANY> <PRICE>9.90</PRICE> <YEAR>2019</YEAR> </CD> <CD> <TITLE>all is well</TITLE> <ARTIST>Amir Khan</ARTIST> <COUNTRY>pak</COUNTRY> <COMPANY>Virgin records</COMPANY> <PRICE>10.20</PRICE> <YEAR>2012</YEAR> </CD> <CD> <TITLE>rockstar</TITLE> <ARTIST>kk</ARTIST> <COUNTRY>india</COUNTRY> <COMPANY>Eros</COMPANY> <PRICE>9.90</PRICE> <YEAR>2014</YEAR> </CD> </CATALOG>)"; // somehow doesn't work without it template<> struct traits::_children<pugi::xml_node> { decltype(auto) operator()(const pugi::xml_node &node) const { return node.children(); } }; int main(int, char **) { pugi::xml_document document{}; document.load_string(sample); size_t indent = 0; const auto &onDefault = [&indent](const auto &nodeLink) { return printing_walker(nodeLink, &indent); }; const auto &onCountry = [&indent](const auto &nodeLink) { return node_country(nodeLink, &indent); }; const auto &transform = make_transform<pugi::xml_node>(onDefault, // on("COUNTRY", onCountry)); dummy::data data{}; std::optional<std::runtime_error> err = transform(document.root(), data); if (!err) std::cout << "transformed " << data.nodes << " nodes" << std::endl; std::cout << "Countries:\n"; for (const auto &country : data.countries) std::cout << country << std::endl; return 0; } 

I would be also glad to know how could I refactor this code to logically separate concerns:

  • basic architectural units
  • user-provided lightweigh customization points (node handlers)
  • points for "heavy" customization for the algorithmic part
\$\endgroup\$

    1 Answer 1

    3
    \$\begingroup\$

    Consider using a SAX parser

    I am trying to write a utility for transforming xml trees using single pass [...]

    But you are using a DOM parser library. This means it will parse the whole XML document into a tree structure, and then you get to traverse it once again. That means that you are doing at least two passes over the data.

    A SAX parser would allow you to hook into it and perform actions while it is parsing the XML document. For something simple as just reading all the <COUNTRY> tags, that is more than sufficient.

    Your design is too complicated

    1. My code "hijacks" the control flow. It requires the user to explicilty state that he wants to continue the dfs traversal by inheriting/aggregating basic dfs node walker.

    There are several ways of solving that in a less complicated manner than you did. Consider having the user-provided code return a bool that indicates whether they want to recurse further or not.

    I would like to be able to write less code -> only the code that provides business logic. Like simple lambda.

    That's a great idea. Start by writing the example code the way you want it to be. For example:

    pugi::xml_document document{}; document.load_string(sample); std::set<std::string> countries; traverse(document, on("COUNTRY", [](auto& node) { countries.emplace(node.child_value()); }) ); 

    Then write the rest of the code to support that. That's of course easier said than done, but starting with how you want your interface to look is better than first creating something complicated that then turns out to be unwieldy.

    1. The API does not feel like std::transform. The standard library algorithm prevents the user from mingling with the control flow [...]

    Again, you can have the callback return some value that will steer how the tree is traversed.

    I don't know whether it even possible to represent an output tree by an output iterator.

    There might be some issues with that, but actually your code doesn't need an output iterator. You implemented something that is closer to std::for_each() than std::transform(). The former just visits each element, but if elements are mutable, the callback can change those elements.

    1. As a workaround for 1. I decided to pass a walker factory to the walker's operator(), so the node walker can fetch a new walker from the factory for its child nodes. It doesn't feel right!

    Indeed, this is way too complicated. You shouldn't need factories and variants at all.

    1. I tried to write the implementation with optimization and copy elision in mind. But I am not sure that the code is optimization friendly. The only thing I am sure is that variadic_callable is not copied. That makes me a little happier.

    But the callable are stored in a std::variant<>, which makes a copy. Also, the constructor of walker_factory takes its arguments by value, thus making more copies. Consider what happens if walker constructors are passed that store data by value.

    Throw exceptions or return std::error_codes

    It's weird to see std::optional<std::runtime_error> being returned. std::runtime is an exception type, this is something you normally want to throw. If you want to return an error code, std::error_code is more appropriate.

    \$\endgroup\$
    3
    • \$\begingroup\$it is complicated, that is true. But those are internals, and I am trying to reduce that complexity. The complexity of the code should not exceed the compexity of the business logic. Moreover, I do try to come up with a convenient API first, but implementing a simple case has lead me to the code I have... Also I for C++ I am not aware of decent SAX parsers (or any at all TBH)\$\endgroup\$CommentedSep 19, 2022 at 21:02
    • \$\begingroup\$Why the whole factory pattern and the variants though? You could just use a std::unordered_map<std::string, std::function<void(pugi::xml_node&)>> to store the walkers. That would still be able to support your simple case. As for C++ SAX parsers: there are several C++ bindings for expat that you could have a look at.\$\endgroup\$CommentedSep 19, 2022 at 21:09
    • \$\begingroup\$I'd say it is premature compile-time-ization. std::function does not guarantee it will not allocate data on the heap + it can throw exceptions\$\endgroup\$CommentedSep 19, 2022 at 21:14

    Start asking to get answers

    Find the answer to your question by asking.

    Ask question

    Explore related questions

    See similar questions with these tags.