6
\$\begingroup\$

This is a followup to this question: link. As advised there, I've rewritten the State class, which also simplified some of the other code.

I'm writing a parser library. I would like to know if the basis of it is sound and whether it can be improved in any way. The entire code can be seen in this repo: link, in the frozen branch review-11-02-2018 (file core.py). I'll post relevant portions below.

The library is written around two things: a State class, and a notion of an effect. State objects represent current state of a parser chain, and keep track of the input that's left to parse and a portion of input parsed by the last parser. A parser is just a callable that takes a State object and returns a new one (State objects are immutable). Output of one parser is passed to the next parser in a chain. A parser may also fail by throwing a ParsingFailure exception. Any parser in the chain may register an effect - a callable that takes an arbitrary value as the first argument and a State object as the second. If the chain succeeds, all effects registered during the parsing run are applied sequentially to a seed (with the seed or return value of the previous effect being the first argument, and the state of the chain at the moment of effect registration being the second), and the return value of the last effect becomes the output of entire chain, along with the final state. Is the concept sane? It works, but is it a reasonable way to do this?

State class is just a named tuple with some extra methods and is defined as follows:

class State(namedtuple("State", "string effect left_start left_end parsed_start parsed_end")): """ State objects represent current state of a parser chain (or an individual parser). State objects provide two views over the input string: 'left', which spans a substring between 'left_start' and 'left_end' and represents unparsed input left from the previous parser, and 'parsed', which spans a substring between 'parsed_start' and 'parsed_end' and represents a portion of input the previous parser has parsed. Windows may overlap, and usually 'parsed_end' and 'left_start' are the same, but not always. A State object is immutable and has following fields: * string (str): the input the parser chain is supposed to parse. * effect ((value, state) -> value): if the chain is successful, this will be called in sequence with other effects from the chain to form the chain's output value. * left_start, left_end (int): see above about the 'left' window. * parsed_start, parser_end (int): see above about the 'parsed' window. State objects are just named tuples, so they support a very convenient '_replace' method. !Note!: to avoid duplicating effects accidentally, '_replace' treats lack of 'effect' in its arguments as 'effect=None'. So if you want to copy an effect from another parser, you have to do it explicitly. State objects' constructor takes the following arguments: 1. string - the input. 2. effect=None - the effect, transformation to be performed on success of the last parser. 3. start=0 - will be translated into 'left_start' 4. end=None - will be translated into 'left_end'. If set to None, 'left_end' will be set to the length of the input. State objects created via this constructor have both 'parsed_start' and 'parsed_end' set to 'left_start'. State objects have several properties: * left - returns a slice of input that's left to parse. * left_len - returns the length of the above slice without computing the slice itself. * parsed - returns a slice of input that's been parsed. * parsed_len - returns the length of the above slice, again without computing the slice. Finally, State objects have following public methods: * consume(how_many) - move 'how_many' characters from the left window into the parsed window. Raise ValueError if more input was consumed than left. * split(at) - split the State in two (and return them). The first keeps the input up to, but not including, 'at' as its 'left' window, the second gets the rest. Both have their 'parsed' windows reset to an empty string. The first gets 'effect' of the original, the second gets None. """ __slots__ = [] def __new__(cls, string, effect=None, start=0, end=None): if end is None: end = len(string) assert 0 <= start <= end <= len(string) return super().__new__(cls, string, effect, start, end, start, start) def _replace(self, **kwargs): if "effect" not in kwargs: return super()._replace(effect=None, **kwargs) return super()._replace(**kwargs) def consume(self, how_many): """ Return a new State object with 'how_many' characters consumed and moved to the 'parsed' window. Raise ValueError if 'how_many' is negative or if consuming more characters than left in the 'left' window. """ if how_many < 0: raise ValueError("Negative number of consumed characters") left_start = self.left_start + how_many parsed_start = self.left_start parsed_end = parsed_start + how_many if left_start > self.left_end: raise ValueError("Consumed more characters than fits in the 'left' window") return self._replace(left_start=left_start, parsed_start=parsed_start, parsed_end=parsed_end) def split(self, at): """ Split the State in two. The first one keeps a portion of input up to 'at'th character (exclusive), the second one gets the rest. Both have 'parsed' window reset to an empty string. First one gets the effect of the original, the second one gets None. """ split_point = self.left_start + at first = self._replace(effect=self.effect, left_end=split_point, parsed_start=self.left_start, parsed_end=self.left_start) second = self._replace(effect=None, left_start=split_point, parsed_start=split_point, parsed_end=split_point) return first, second @property def left(self): """ Return the portion of input the last parser hasn't consumed. """ return self.string[self.left_start:self.left_end] @property def left_len(self): """ Return the length of the portion of input the last parser hasn't consumed. """ return self.left_end - self.left_start @property def parsed(self): """ Return the string parsed by the last parser. """ return self.string[self.parsed_start:self.parsed_end] @property def parsed_len(self): """ Return the length of the string parsed by the last parser. """ return self.parsed_end - self.parsed_start 

Hopefully docstrings are descriptive enough. If any further clarification is needed, please tell me, I'll edit it in.

Another important thing is parse function, which the user is supposed to call with a parser instead of calling parsers directly. Here it is:

def parse(seed, state_or_string, parser, verbose=False): """ Run a given parser on a given state object or a string, then apply combined chain or parser's effects to 'seed' and return a tuple (seed after effects, final state). On failure, return None unless 'verbose' is truthy, in which case return the ParsingFailure exception that has terminated the parsing process. """ if isinstance(state_or_string, str): state = State(state_or_string) else: state = state_or_string try: after = parser(state) if after.effect is not None: return after.effect(seed, after), after return seed, after except ParsingFailure as failure: if verbose: return failure return None except ParsingEnd as end: if end.state.effect is not None: return end.state.effect(seed, end.state), end.state return seed, end.state 

Another important thing is chain parser generator, which performs the chaining logic described above, but I don't want to post it here, because a) the question is bloated already, b) it also deals with lookahead, which seems to me a matter worth asking a separate question.

If you've read this far, thank you! Any suggestions on improving the library?

\$\endgroup\$

    1 Answer 1

    1
    \$\begingroup\$

    Following up on the previous comments they made about the parser, I will focus on code legibility

    You may consider moving some paragraphs of the class docstring to the relevant part of the code, like the paragraph about A State object is immutable and has following fields may suit better on the __new__ method, documenting as well the parameter

    Also, don't forget to document the parameters of methods/functions, will help a lot knowing what your parser is doing when invoking them

    Parameters like at might be better named as index which is universally used for it.

    Parameters like how_many may be better named as characters_count or consumed_characters_count or similar, simply by reading it gives better insight on what the parameter refers to

    if verbose: return failure return None 

    This can be translated to ternary operator

    return failure if verbose else None 

    For the following small methods, I would consider better naming as well, like this one

    def left(self): """ Return the portion of input the last parser hasn't consumed. """ 

    Why not calling it portion_not_consumed or something on that way?

    Just imagine you don't know anything about a library, and you find a method called left_len. Unless you read the docstring, is hard to tell what is doing

    You can as well consider putting some spacing between the lines of a same function, to improve a bit readability between the sections of it

    Keep the good work!

    \$\endgroup\$

      Start asking to get answers

      Find the answer to your question by asking.

      Ask question

      Explore related questions

      See similar questions with these tags.