I am trying to write a utility for transforming xml trees using single pass and specifying node handlers.
I used pugixml
prior to this, and the library provides a rather convenient API to traverse child nodes based on XPath strings.
However, I don't quite like the idea of multiple invocations because it means traversing a xml document multiple times. What I would like to have is a utility to help one traverse and transform a xml document like AST using dfs node walker.
The main idea for the utility is to have a simple declarative/reactive API to specify node handlers before transform
's invocation.
This is a part of small compilable example (full code at the end):
#include <pugixml.hpp> #include <iostream> namespace dummy { struct data { size_t nodes; std::vector<std::string> countries; }; } // namespace dummy class printing_walker { public: explicit printing_walker(const node_link<pugi::xml_node> &nodeLink, size_t *indent = nullptr) // : dfs_(nodeLink), indent_(indent) { indent_ && (*indent_ += 2); } printing_walker(printing_walker &&other) // : dfs_(other.dfs_.node_link()), indent_(other.indent_) { other.indent_ = nullptr; } template<typename Factory> std::optional<std::runtime_error> operator()(const Factory &factory, dummy::data &data) const { ++data.nodes; const auto &nodeType = dfs_.node_link().node.type(); const std::string indent = indent_ ? std::string(*indent_, ' ') : ""; if (pugi::xml_node_type::node_element == nodeType) { std::cout << indent << '<' << traits::name(dfs_.node_link().node) << '>' << std::endl; auto err = dfs_(factory, data); std::cout << indent << "</" << traits::name(dfs_.node_link().node) << '>' << std::endl; return err; } if (pugi::xml_node_type::node_pcdata == nodeType) { std::cout << indent << '"' << dfs_.node_link().node.text().as_string() << '"' << std::endl; return std::nullopt; } return dfs_(factory, data); } ~printing_walker() { indent_ && (*indent_ -= 2); } private: basic_dfs_walker<pugi::xml_node> dfs_; size_t *indent_; }; class node_country { public: constexpr explicit node_country(const node_link<pugi::xml_node> &nodeLink, size_t *indent = nullptr) : nodeLink_(nodeLink), indent_(indent) {} template<typename FactoryT> std::optional<std::runtime_error> operator()(const FactoryT &, dummy::data &data) const { ++data.nodes; std::cout << (indent_ ? std::string(*indent_, ' ') : "") << '<' << nodeLink_.node.child_value() << '>' << std::endl; data.countries.template emplace_back(nodeLink_.node.child_value()); return std::nullopt; } private: const node_link<pugi::xml_node> &nodeLink_; size_t *indent_; }; constexpr const char sample[] = R"(<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type='text/xsl'?> <CATALOG> <CD> <TITLE>dill diya galla</TITLE> <ARTIST>Arijit singh</ARTIST> <COUNTRY>India</COUNTRY> <COMPANY>tseries</COMPANY> <PRICE>10.90</PRICE> <YEAR>2018</YEAR> </CD> <CD> <TITLE>Saiyara</TITLE> <ARTIST>Atif Aslam</ARTIST> <COUNTRY>Uk</COUNTRY> <COMPANY>Records</COMPANY> <PRICE>9.90</PRICE> <YEAR>2015</YEAR> </CD> <CD> <TITLE>Khairiyat</TITLE> <ARTIST>Sonu nigam</ARTIST> <COUNTRY>india</COUNTRY> <COMPANY>radio</COMPANY> <PRICE>9.90</PRICE> <YEAR>2019</YEAR> </CD> <CD> <TITLE>all is well</TITLE> <ARTIST>Amir Khan</ARTIST> <COUNTRY>pak</COUNTRY> <COMPANY>Virgin records</COMPANY> <PRICE>10.20</PRICE> <YEAR>2012</YEAR> </CD> <CD> <TITLE>rockstar</TITLE> <ARTIST>kk</ARTIST> <COUNTRY>india</COUNTRY> <COMPANY>Eros</COMPANY> <PRICE>9.90</PRICE> <YEAR>2014</YEAR> </CD> </CATALOG>)"; template<> struct traits::_children<pugi::xml_node> { decltype(auto) operator()(const pugi::xml_node &node) const { return node.children(); } }; int main(int, char **) { pugi::xml_document document{}; document.load_string(sample); size_t indent = 0; const auto &onDefault = [&indent](const auto &nodeLink) { return printing_walker(nodeLink, &indent); }; const auto &onCountry = [&indent](const auto &nodeLink) { return node_country(nodeLink, &indent); }; const auto &transform = make_transform<pugi::xml_node>(onDefault, // on("COUNTRY", onCountry)); dummy::data data{}; std::optional<std::runtime_error> err = transform(document.root(), data); if (!err) std::cout << "transformed " << data.nodes << " nodes" << std::endl; std::cout << "Countries:\n"; for (const auto &country : data.countries) std::cout << country << std::endl; return 0; }
This example prints each node by default and invokes a customized handler for each COUNTRY node. For now there's a complete but naive implementation that uses hash table to find a specialized node handler based on the node name. I believe the complexity to be O(1) on average accroding to cppreference for std::unordered_map
.
Here is a part of an API that I intend to implement, but am not quite sure if I am going in a right direction:
const auto &transform = make_transform<pugi::xml_node>( // default node handler onDefault, // specialized handlers, order of precedence matters. More specialized go after less specialized // on each node with name "COUNTRY" on("COUNTRY", onCountry), // on each node with name "COUNTRY" and attribute "language on("COUNTRY", attribute("language"), onLanguage), // like above, but for "english" specifically on("COUNTRY", attribute("language", "english"), onEnglishLanguage));
What I don't like:
- My code "hijacks" the control flow. It requires the user to explicilty state that he wants to continue the dfs traversal by inheriting/aggregating basic dfs node walker. Customizing traversal behavior can be benificial for many cases (for example if one wants to stop at some point), but for now one has to write a lot of code and explicitly manage the control flow. I would like to be able to write less code -> only the code that provides business logic. Like simple lambda.
- The API does not feel like
std::transform
. The standard library algorithm prevents the user from mingling with the control flow by requesting an UnaryPredicate that returns an object of a type convertible to output iterator. The problem is, I don't know whether it even possible to represent an output tree by an output iterator. So for now I passData &
as an argument for a walker'soperator()
.
I am thinking of requiesting a convertible object or a modificating callable from the handler. Something like:return [&toAdd](auto &data){ data.someMember += toAdd; }
. - As a workaround for 1. I decided to pass a walker factory to the walker's
operator()
, so the node walker can fetch a new walker from the factory for its child nodes. It doesn't feel right! The only reason for a node walker to get an instance of a factory is to provide the means to continue traversal. - I tried to write the implementation with optimization and copy elision in mind. But I am not sure that the code is optimization friendly. The only thing I am sure is that
variadic_callable
is not copied. That makes me a little happier.
Here is the complete code with implementation and example: https://godbolt.org/z/ejqjK35P9
#include <cstddef> #include <functional> #include <type_traits> #include <utility> #include <optional> #include <stdexcept> #include <tuple> #include <variant> #include <string_view> #include <unordered_map> namespace detail { template<typename... Impl> class variadic_callable { public: template<typename Factory, typename... Args> constexpr explicit variadic_callable(Factory f, Args &&...args) // : varCallable_(std::in_place_type<_in_place<std::invoke_result_t<Factory, Args...>>>, [&]() { return f(std::forward<Args>(args)...); }) {} variadic_callable(const variadic_callable &) = delete; variadic_callable(variadic_callable &&) = delete; template<typename... Args> constexpr decltype(auto) operator()(Args &&...args) const { return std::visit( [&](const auto &wrapped) { return wrapped.callable(std::forward<Args>(args)...); }, varCallable_); } private: template<typename T> struct _in_place { T callable; template<typename Factory> _in_place(Factory factory) // : callable(factory()) {} }; private: std::variant<_in_place<Impl>...> varCallable_; }; } // namespace detail namespace traits { template<typename NodeT> struct _children { constexpr decltype(auto) operator()(const NodeT &node) const { return std::invoke(&NodeT::children, node); } }; template<typename NodeT> decltype(auto) children(const NodeT &node) { return _children<NodeT>{}(node); } template<typename T> struct _name { constexpr decltype(auto) operator()(const T &t) const { return std::invoke(&T::name, t); } }; template<typename T> constexpr decltype(auto) name(const T &t) { return _name<T>{}(t); } } // namespace traits template<typename NodeT> struct node_link { using node_type = NodeT; node_type node; size_t depth = 0; const node_link *parent = nullptr; }; template<typename DefaultWalkerConstructor, typename... WalkerConstructors> class walker_factory { public: explicit walker_factory(DefaultWalkerConstructor defaultWalkerConstructor, std::pair<std::string_view, WalkerConstructors>... pairs) // : defaultConstructor_(defaultWalkerConstructor) //, { // TODO: assert keys for uniqueness (constructorsMap_.emplace(std::get<0>(pairs), std::move(std::get<1>(pairs))), ...); } walker_factory(const walker_factory &) = delete; walker_factory(walker_factory &&) = delete; template<typename NodeT> [[nodiscard]] auto get(const node_link<NodeT> &nodeLink) const { using variadic_node_walker = detail::variadic_callable<constructed_walker_type<NodeT, DefaultWalkerConstructor>, constructed_walker_type<NodeT, WalkerConstructors>...>; auto found = constructorsMap_.find(traits::name(nodeLink.node)); if (found == constructorsMap_.cend()) return variadic_node_walker(defaultConstructor_, nodeLink); return std::visit( [&nodeLink](const auto &constructor) { return variadic_node_walker(constructor, nodeLink); }, found->second); } private: template<typename NodeT, typename WConstructor> using constructed_walker_type = std::invoke_result_t<WConstructor, node_link<NodeT>>; private: DefaultWalkerConstructor defaultConstructor_; std::unordered_map<std::string_view, // variant's list can't be empty, hence DefaultWalkerConstructor std::variant<DefaultWalkerConstructor, WalkerConstructors...>> constructorsMap_; }; template</*template<typename, typename> class Pair,*/ typename DefaultWalkerConstructor, typename... WalkerConstructors> walker_factory(DefaultWalkerConstructor defaultWalkerConstructor, /*Pair*/ std::pair<std::string_view, WalkerConstructors>... pairs) -> walker_factory<DefaultWalkerConstructor, WalkerConstructors...>; template<typename NodeT, typename DefaultWalkerConstructor, typename... Pairs> constexpr auto make_transform(DefaultWalkerConstructor defaultWalkerConstructor, Pairs... pairs) { return [walkerFactory = walker_factory(defaultWalkerConstructor, pairs...)](const NodeT &node, auto &parseData) { const auto &walker = walkerFactory.get(node_link<NodeT>{ node }); return walker(walkerFactory, parseData); }; } // for composition use template<typename NodeT> class basic_dfs_walker { public: using node_type = NodeT; constexpr explicit basic_dfs_walker(const node_link<node_type> &nodeLink) // : nodeLink_(nodeLink) {} template<typename WalkerFactory, typename DataT> std::optional<std::runtime_error> operator()(const WalkerFactory &walkerFactory, DataT &parseData) const { for (const node_type &child : traits::children(nodeLink_.node)) { auto err = on_child(walkerFactory, child, parseData); if (!err) continue; return err; } return std::nullopt; } [[nodiscard]] const auto &node_link() const { return nodeLink_; } ~basic_dfs_walker() {} private: template<typename WalkerFactory, typename DataT> [[nodiscard]] std::optional<std::runtime_error> on_child(const WalkerFactory &walkerFactory, const node_type &child, DataT &parseData) const { const auto &nodeWalker = walkerFactory.get(::node_link<node_type>{ child, nodeLink_.depth + 1, &nodeLink_ }); return nodeWalker(walkerFactory, parseData); } protected: static void operator delete(void *) {} private: const ::node_link<node_type> nodeLink_; }; template<typename NodeT, typename... Pairs> constexpr auto make_transform(Pairs &&...pairs) { return make_transform<NodeT>( [](const auto &nodeLink) { return basic_dfs_walker<NodeT>{ nodeLink }; }, std::forward<Pairs>(pairs)...); } template<typename T> constexpr auto on_default() { return [](const auto &nodeLink) { return T(nodeLink); }; } template<typename T> constexpr auto on(std::string_view name) { return std::make_pair(name, [](const auto &nodeLink) { return T(nodeLink); }); } template<typename Constructor> constexpr auto on(std::string_view name, Constructor &&constructor) { return std::make_pair(name, std::forward<Constructor>(constructor)); } #include <pugixml.hpp> #include <iostream> namespace dummy { struct data { size_t nodes; std::vector<std::string> countries; }; } // namespace dummy class printing_walker { public: explicit printing_walker(const node_link<pugi::xml_node> &nodeLink, size_t *indent = nullptr) // : dfs_(nodeLink), indent_(indent) { indent_ && (*indent_ += 2); } template<typename Factory> std::optional<std::runtime_error> operator()(const Factory &factory, dummy::data &data) const { ++data.nodes; const auto &nodeType = dfs_.node_link().node.type(); const std::string indent = indent_ ? std::string(*indent_, ' ') : ""; if (pugi::xml_node_type::node_element == nodeType) { std::cout << indent << '<' << traits::name(dfs_.node_link().node) << '>' << std::endl; auto err = dfs_(factory, data); std::cout << indent << "</" << traits::name(dfs_.node_link().node) << '>' << std::endl; return err; } if (pugi::xml_node_type::node_pcdata == nodeType) { std::cout << indent << '"' << dfs_.node_link().node.text().as_string() << '"' << std::endl; return std::nullopt; } return dfs_(factory, data); } ~printing_walker() { indent_ && (*indent_ -= 2); } private: basic_dfs_walker<pugi::xml_node> dfs_; size_t *indent_; }; class node_country { public: constexpr explicit node_country(const node_link<pugi::xml_node> &nodeLink, size_t *indent = nullptr) : nodeLink_(nodeLink), indent_(indent) {} template<typename FactoryT> std::optional<std::runtime_error> operator()(const FactoryT &, dummy::data &data) const { ++data.nodes; std::cout << (indent_ ? std::string(*indent_, ' ') : "") << '<' << nodeLink_.node.child_value() << '>' << std::endl; data.countries.template emplace_back(nodeLink_.node.child_value()); return std::nullopt; } private: const node_link<pugi::xml_node> &nodeLink_; size_t *indent_; }; constexpr const char sample[] = R"(<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type='text/xsl'?> <CATALOG> <CD> <TITLE>dill diya galla</TITLE> <ARTIST>Arijit singh</ARTIST> <COUNTRY>India</COUNTRY> <COMPANY>tseries</COMPANY> <PRICE>10.90</PRICE> <YEAR>2018</YEAR> </CD> <CD> <TITLE>Saiyara</TITLE> <ARTIST>Atif Aslam</ARTIST> <COUNTRY>Uk</COUNTRY> <COMPANY>Records</COMPANY> <PRICE>9.90</PRICE> <YEAR>2015</YEAR> </CD> <CD> <TITLE>Khairiyat</TITLE> <ARTIST>Sonu nigam</ARTIST> <COUNTRY>india</COUNTRY> <COMPANY>radio</COMPANY> <PRICE>9.90</PRICE> <YEAR>2019</YEAR> </CD> <CD> <TITLE>all is well</TITLE> <ARTIST>Amir Khan</ARTIST> <COUNTRY>pak</COUNTRY> <COMPANY>Virgin records</COMPANY> <PRICE>10.20</PRICE> <YEAR>2012</YEAR> </CD> <CD> <TITLE>rockstar</TITLE> <ARTIST>kk</ARTIST> <COUNTRY>india</COUNTRY> <COMPANY>Eros</COMPANY> <PRICE>9.90</PRICE> <YEAR>2014</YEAR> </CD> </CATALOG>)"; // somehow doesn't work without it template<> struct traits::_children<pugi::xml_node> { decltype(auto) operator()(const pugi::xml_node &node) const { return node.children(); } }; int main(int, char **) { pugi::xml_document document{}; document.load_string(sample); size_t indent = 0; const auto &onDefault = [&indent](const auto &nodeLink) { return printing_walker(nodeLink, &indent); }; const auto &onCountry = [&indent](const auto &nodeLink) { return node_country(nodeLink, &indent); }; const auto &transform = make_transform<pugi::xml_node>(onDefault, // on("COUNTRY", onCountry)); dummy::data data{}; std::optional<std::runtime_error> err = transform(document.root(), data); if (!err) std::cout << "transformed " << data.nodes << " nodes" << std::endl; std::cout << "Countries:\n"; for (const auto &country : data.countries) std::cout << country << std::endl; return 0; }
I would be also glad to know how could I refactor this code to logically separate concerns:
- basic architectural units
- user-provided lightweigh customization points (node handlers)
- points for "heavy" customization for the algorithmic part