\$\begingroup\$

As the title suggests, I'm building a MySQL parser in Python. The intention is to better support application development by managing migrations in a more declarative way. You can read about the underlying project at its GitHub page if desired.

My immediate focus is on parsing CREATE TABLE, INSERT, and comments, with the goal of having the parser find any combination of those commands in the data it is passed (flagging syntax errors for anything it can't parse). Obviously I may expand it to cover a larger subset of the MySQL dialect later (and I'm leaving room for other dialects of SQL).

The overall idea for my parser is to break everything down into small units that can be matched by simple rules. Rules are combined to match specific parts of queries, and then those rules are combined to parse the actual queries. There are four basic rules:

  1. Literal: rule_literal looks for an exact match at the current parsing location
  2. Regexp: rule_regexp attempts to match a regular expression at the current parsing location
  3. Delimited: rule_delimited looks for lists of values separated by a delimiter and optionally enclosed by a quote character
  4. Children: rule_children attempts to match any number of given rules against the current parsing location.

With these four building blocks I was able to build up a parser that can process some complicated CREATE TABLE commands, so I think I'm on the right track. I'm not entirely certain if these rules would be enough to parse complicated SELECT commands (think JOINS, UNIONS, sub-queries, etc), but since I have no need of SELECT commands for my immediate goals, that isn't too high on my list of concerns. At this point in time I'm looking for any and all feedback on the system, so let me have it!
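For a sense of the shape, here is a hedged sketch of what a rule list in this style could look like for a different statement (the DROP TABLE statement and rule names here are illustrative, not taken from the project):

```python
# A hypothetical rule list in the same declarative style, written for a
# statement like "DROP TABLE users;" (illustrative only, not project code):
drop_table_rules = [
    {'type': 'literal', 'value': 'DROP TABLE'},
    {'type': 'regexp', 'name': 'table', 'value': r'\S+'},
    {'type': 'literal', 'value': ';', 'optional': True, 'name': 'closing_semicolon'},
]

# every rule declares a type; named rules expose their match as a value
assert all('type' in rule for rule in drop_table_rules)
```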

On to some actual code! Here are two of the rule parsers (the others follow the same basic concepts; docblocks removed for brevity):

Rules

rule_literal.py

from .rule_base import rule_base

class rule_literal( rule_base ):

    # names are not required for literals
    require_name = False

    literal = ''

    def __init__( self, parser, rule, next_rule ):
        super().__init__( parser, rule, next_rule )
        self.literal = self.rule['value']

        # use the actual literal value as the name if we don't have one
        if not self.name:
            self.name = self.literal

    def parse( self, string ):
        # easy to check for a literal match at the beginning of string
        val_len = len( self.literal )
        if string[:val_len].lower() != self.literal.lower():
            self.result = ''
            self.leftovers = string
            return False

        # if we matched then clean!
        self.leftovers = string[val_len:].strip()
        self.result = self.literal
        return True

rule_regexp.py

from .rule_base import rule_base
import re

class rule_regexp( rule_base ):

    regexp = ''

    def __init__( self, parser, rule, next_rule ):
        super().__init__( parser, rule, next_rule )
        self.regexp = self.rule['value']

    def parse( self, string ):
        # apply the regular expression!
        result = re.match( self.regexp, string, re.IGNORECASE )

        # if it didn't match then nothing changed
        if not result:
            self.leftovers = string
            self.result = ''
            return False

        # otherwise we have a match and can return as such
        self.result = result.group(0)
        cleaned = string[len(self.result):].strip()

        # if there was a group in the regular expression then just keep that part
        if result.groups():
            self.result = result.group(1)

        self.leftovers = cleaned
        return True

And this is the rule base that both of those extend:

rule_base.py

class rule_base( object ):

    require_name = True
    require_value = True
    require_next = False

    parser_class = ''
    rule = {}
    next_rule = {}
    name = ''
    result = ''
    leftovers = ''

    def __init__( self, parser, rule, next_rule ):
        # we keep parser class around just for reference and better errors
        self.parser_class = parser.__class__

        # store rule and next_rule
        self.rule = rule
        self.next_rule = next_rule
        self.name = self.rule['name'] if 'name' in self.rule else ''

        # name is required more often than not
        if self.require_name and not self.name:
            raise ValueError( "name required for rule %s in class %s" % ( self.rule, self.parser_class ) )

        # ditto rule
        if self.require_value:
            if not 'value' in self.rule or not self.rule['value']:
                raise ValueError( 'missing value in rule %s for class %s' % ( self.rule, self.parser_class ) )

    def parse( self, string ):
        pass
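To see how these pieces fit together in isolation, here is a minimal, self-contained sketch (stand-in classes, not the project's actual modules) that chains a literal rule into a regexp rule by feeding one rule's leftovers into the next:

```python
import re

# stand-in versions of rule_base / rule_literal / rule_regexp,
# trimmed down so the example runs on its own
class rule_base(object):
    def __init__(self, rule):
        self.rule = rule
        self.result = ''
        self.leftovers = ''

class rule_literal(rule_base):
    def parse(self, string):
        literal = self.rule['value']
        if string[:len(literal)].lower() != literal.lower():
            self.result, self.leftovers = '', string
            return False
        self.result = literal
        self.leftovers = string[len(literal):].strip()
        return True

class rule_regexp(rule_base):
    def parse(self, string):
        match = re.match(self.rule['value'], string, re.IGNORECASE)
        if not match:
            self.result, self.leftovers = '', string
            return False
        # keep only the captured group if the pattern has one
        self.result = match.group(1) if match.groups() else match.group(0)
        self.leftovers = string[len(match.group(0)):].strip()
        return True

# chain the two rules: each rule consumes its match, leaving the rest
key = rule_literal({'value': 'KEY'})
name = rule_regexp({'value': r'[^\(\s\)]+'})
assert key.parse('KEY account_id (account_id)')
assert name.parse(key.leftovers)
assert (name.result, name.leftovers) == ('account_id', '(account_id)')
```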

Parsers

Actual parsers are then made from a combination of these rules. For instance, the parser that reads a KEY line (e.g. KEY jobhub_jobs_account_id (account_id),) from a CREATE TABLE command looks like this:

index_key.py

from mygrations.core.parse.parser import parser

class index_key( parser ):

    definition_type = 'index'

    name = ''
    has_comma = False
    columns = []

    # KEY account_id (account_id,name)
    rules = [
        { 'type': 'literal', 'value': 'KEY' },
        { 'type': 'regexp', 'name': 'name', 'value': '[^\(\s\)]+' },
        { 'type': 'literal', 'value': '(' },
        { 'type': 'delimited', 'name': 'columns', 'separator': ',', 'quote': '`' },
        { 'type': 'literal', 'value': ')' },
        { 'type': 'literal', 'value': ',', 'optional': True, 'name': 'ending_comma' }
    ]

    def __init__( self, rules = [] ):
        super().__init__( rules )
        self.columns = []

    def process( self ):
        self.name = self._values['name'].strip().strip( '`' )
        self.columns = self._values['columns']
        self.has_comma = True if 'ending_comma' in self._values else False

        if len( self.name ) > 64:
            self.errors.append( 'Key name %s is too long' % ( self.name ) )

The parser base which these classes extend takes their list of rules and generates rule objects from the configuration and type. When given a string to match, it iteratively applies the list of rules, starting at the beginning of the string and working toward the end, until it either hits a required rule it can't match (in which case it stops) or matches all of its rules (in which case it returns whatever is left unmatched). It stores the results on the object and then calls self.process() so the extending parser can condense the matched data into something more accessible for the rest of the system.

Finally, the parser for each command is built from these same parts. For instance, this is the parser for the CREATE TABLE command (I haven't yet written the process method for this one):

create_parser.py

from mygrations.core.parse.parser import parser
from .parsers import *

class create_parser( parser ):

    string = ''

    rules = [
        { 'type': 'literal', 'value': 'CREATE TABLE' },
        { 'type': 'regexp', 'value': '\S+', 'name': 'table' },
        { 'type': 'literal', 'value': '(' },
        { 'type': 'children', 'name': 'definitions', 'classes': [
            index_primary, index_key, index_unique, constraint_foreign,
            type_character, type_numeric, type_decimal, type_text,
            type_enum, type_plain
        ] },
        { 'type': 'literal', 'value': ')' },
        { 'type': 'children', 'name': 'table_options', 'classes': [ table_options ], 'optional': True },
        { 'type': 'literal', 'value': ';', 'optional': True, 'name': 'closing_semicolon' }
    ]

Finally, the parser class is what brings it all together. create_parser.py and its children all extend the parser class:

parser.py

import re

from .rule_children import rule_children
from .rule_delimited import rule_delimited
from .rule_literal import rule_literal
from .rule_regexp import rule_regexp

class parser( object ):

    rule_types = {
        'children': rule_children,
        'delimited': rule_delimited,
        'literal': rule_literal,
        'regexp': rule_regexp
    }

    num_rules = 0;
    rules = []
    _values = {}
    matched = False
    errors = []
    warnings = []

    def __init__( self, rules = [] ):
        self._values = {}
        self.errors = []
        self.warnings = []

        # rules should be defined by the subclass
        if rules:
            self.rules = rules
        if not self.rules:
            raise ValueError( "Cannot extend parser without providing rules in %s" % ( self.__class__ ) )

        for ( rule_index, rule ) in enumerate( self.rules ):
            # we always need a type
            if not 'type' in rule:
                raise ValueError( 'Missing type for rule %s in %s' % ( rule, self.__class__ ) )

        self.num_rules = len( self.rules )

    def get_rule_parser( self, rule, next_rule ):
        rule_type = rule['type']

        # keeping this simple for now
        if not rule_type in self.rule_types:
            raise ValueError( 'Unknown rule type %s for class %s' % ( rule_type, self.__class__ ) )

        return self.rule_types[rule_type]( self, rule, next_rule )

    def parse( self, string = '' ):
        # first thing first, some initial string cleaning. Clean spaces
        # from start and end and replace any multi-spaces with a single space.
        string = re.sub( '\s+', ' ', string ).strip()

        for ( rule_index, rule ) in enumerate( self.rules ):
            # do we have a next rule?
            next_rule = self.rules[rule_index+1] if rule_index < self.num_rules-1 else False

            # now we can parse
            rule_parser = self.get_rule_parser( rule, next_rule )

            # does it match? Check for a lack of match and deal with that first
            if not rule_parser.parse( string ):
                # if this rule wasn't optional then we just don't match
                if not 'optional' in rule or not rule['optional']:
                    self.matched = False
                    return string

                # otherwise just keep going
                continue

            # we did match! Yeah!
            self._values[rule_parser.name] = rule_parser.result
            string = rule_parser.leftovers

            # we are all done if we have nothing left
            if not string:
                break

        # if we got here then we got to the end, but we may not be done. If we have more required
        # rules left that haven't been matched, then we don't match.
        # did we check every required rule?
        for check_index in range( rule_index+1, self.num_rules ):
            rule = self.rules[check_index]
            if not 'optional' in rule or not rule['optional']:
                self.matched = False
                return string

        # if we got here then we didn't match everything, but we fulfilled all of our
        # required rules. As a result, we are done!
        self.process()
        self.matched = True
        return string

    def process( self ):
        """ parser.process()

        Processes the results of the parsing process. Only called if a match is found.

        No input or output as it modifies the parser object in place, populating
        attributes as needed.
        """
        pass
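As a sanity check on the matching loop above, the same iterate-and-consume strategy can be sketched standalone (simplified stand-ins for the literal and regexp rule types only; delimited and children rules, and the trailing required-rule check, are omitted):

```python
import re

def parse_with_rules(rules, string):
    """Apply rules left to right, consuming matched text as we go,
    mirroring the required/optional handling of the parser class."""
    string = re.sub(r'\s+', ' ', string).strip()
    values = {}
    for rule in rules:
        if rule['type'] == 'literal':
            matched = string.lower().startswith(rule['value'].lower())
            result = rule['value'] if matched else ''
        else:  # 'regexp' (delimited/children omitted from this sketch)
            m = re.match(rule['value'], string, re.IGNORECASE)
            matched = bool(m)
            result = m.group(0) if m else ''
        if not matched:
            if not rule.get('optional'):
                return (False, values, string)
            continue
        # literals use their own value as the name, like rule_literal does
        values[rule.get('name', result)] = result
        string = string[len(result):].strip()
    return (True, values, string)

# roughly the same KEY line the index_key parser handles, with the
# delimited column list approximated by a regexp
rules = [
    {'type': 'literal', 'value': 'KEY'},
    {'type': 'regexp', 'name': 'name', 'value': r'[^\(\s\)]+'},
    {'type': 'literal', 'value': '('},
    {'type': 'regexp', 'name': 'columns', 'value': r'[^\)]+'},
    {'type': 'literal', 'value': ')'},
    {'type': 'literal', 'value': ',', 'optional': True, 'name': 'ending_comma'},
]

ok, values, leftovers = parse_with_rules(rules, 'KEY account_id (account_id,name),')
assert ok and values['name'] == 'account_id'
assert values['columns'] == 'account_id,name' and leftovers == ''
```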

I haven't included all of the files involved in parsing. I left out the source for the children and delimited rule parsers simply because they follow the same basic patterns and there is already a lot of code here. Also, the create_parser currently uses 11 different child parsers to do its job, and I included only one of them. Again, they all work the same way, so they don't change the big picture, and I obviously can't fit much code here.

I know it's a lot, so thanks to anyone who works through even part of it for me! I'm happy to hear everything from code nitpicks to "why don't you just use this other thing that does exactly this already?". For reference on the problem I'm trying to solve with this, and my motivation behind this project, just read the readme on the github page.

\$\endgroup\$

    2 Answers

    1
    \$\begingroup\$

    Code nitpicks

    • Your whitespace is way off. Keep it to a minimum and separate logical pieces of code with a blank line.

    • Leading and/or trailing spaces like these:

      def __init__( self, rules=[] ):

      Should be removed:

      def __init__(self, rules=[]):

      The same goes for these parts:

      for ( rule_index, rule ) in enumerate( self.rules ):

      class create_parser( parser ):

      .. amongst others ..

    • Don't align assignment operators (= or :) to match other lines:

       rule_types = {
           'children':  rule_children,
           'delimited': rule_delimited,
           'literal':   rule_literal,
           'regexp':    rule_regexp
       }

      Should become:

       rule_types = {
           'children': rule_children,
           'delimited': rule_delimited,
           'literal': rule_literal,
           'regexp': rule_regexp
       }

    Other

    • Avoid using mutable objects (like the list in your parser constructor) as default argument values. This can have very nasty consequences if you don't know what you're doing. You can easily fix this by putting in place a 'guard':

      def __init__(self, rules=None):
          if rules is None:
              rules = []
          (...)

      See also this StackOverflow post.

    • Go easy on commenting. A lot of your code would be much easier to understand if you left out the comments, and instead added some useful docstrings (see PEP257).

    • Python does not require semicolons at the end of statements and using them is frowned upon.
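    The mutable-default warning above can be demonstrated directly; a short, self-contained sketch:

```python
def append_rule(rules=[]):          # BAD: the default list is created once
    rules.append('literal')
    return rules

first = append_rule()
second = append_rule()
assert first is second              # both calls share the same list object
assert second == ['literal', 'literal']

def append_rule_safe(rules=None):   # GOOD: guard against the shared default
    if rules is None:
        rules = []
    rules.append('literal')
    return rules

assert append_rule_safe() == ['literal']
assert append_rule_safe() == ['literal']
```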

    \$\endgroup\$
    5
    • 1
      \$\begingroup\$I have run into that issue with mutable objects in other contexts (namely class parameters) but hadn't run into it for keyword argument defaults before. Good to know, thanks! I don't actually need that argument anymore anyway, and was planning on killing it soon. I also hadn't seen my semicolons. I specifically avoid them in python, but I spend most of my day writing in languages that require semicolons, so sometimes they leak in. Thanks!\$\endgroup\$ – Commented Jul 5, 2017 at 11:58
    • 1
      \$\begingroup\$When it comes to whitespace, I am apparently old and set in my ways. I am aware of the PEP recommendations, but hate looking at code like that myself. I always keep to the coding conventions of whatever project I'm working on, but when it is my project I figure I get to set the rules :) My standards are much closer to those for typescript/angular, which is what I mainly do these days. I may have to get over myself on this one though...\$\endgroup\$ – Commented Jul 5, 2017 at 12:04
    • 1
      \$\begingroup\$Steering away from PEP-8 isn't always a bad thing, but it really makes it a lot easier for others to skim through your code if you do stick to it.\$\endgroup\$
      – Daniel
      Commented Jul 5, 2017 at 14:25
    • 1
      \$\begingroup\$I had to give up tabs before. Don't make me give up my excessive spacing too!!!! :)\$\endgroup\$ – Commented Jul 5, 2017 at 14:31
    • \$\begingroup\$See, tabs are visually no different and most editors will automatically convert / detect them- whitespace on the other hand (...)\$\endgroup\$
      – Daniel
      Commented Jul 5, 2017 at 15:31
    1
    \$\begingroup\$

    The standard way to write a parser is to split it into two pieces: (i) a lexical analyzer converting the input to a stream of tokens; (ii) a parser converting a stream of tokens into a parse tree (possibly according to a formal grammar).

    The code in the post shows no clear separation between lexical analysis and parsing, and there don't seem to be data structures representing tokens or parse trees. This makes it hard to understand.
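    For illustration (this is not the OP's code), phase (i) might look like this minimal tokenizer sketch, which handles identifiers and a few punctuation characters:

```python
import re

# one regex with a named group per token kind; \s* skips leading spaces
TOKEN_RE = re.compile(r"\s*(?:(?P<name>[A-Za-z_]\w*)|(?P<punct>[(),;]))")

def tokenize(sql):
    """Phase (i): turn raw text into a stream of (kind, value) tokens."""
    tokens, pos = [], 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise SyntaxError('bad character at position %d' % pos)
        pos = m.end()
        kind = m.lastgroup
        tokens.append((kind, m.group(kind)))
    return tokens

tokens = tokenize('KEY account_id (account_id, name)')
assert tokens[0] == ('name', 'KEY')
assert ('punct', ',') in tokens
```

    A parser (phase ii) would then consume this token list instead of raw text, which keeps the grammar logic free of whitespace and quoting concerns.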

    Take a look at these three answers for some ideas on how to develop hand-written parsers in Python.

    Alternatively, instead of writing your own, you could use one of the many third-party parsing modules. I've used PyParsing, which results in very terse code (see this answer for an example) but it can be quite tricky to get exactly the behaviour you want.

    \$\endgroup\$
    • \$\begingroup\$You like parsers, don't you :) I took a quick look at them and will have to go through it in more detail later, thanks.\$\endgroup\$ – Commented Jul 5, 2017 at 17:04
