As the title suggests, I'm building a MySQL parser in Python. The intention is to better support application development by managing migrations in a more declarative way. You can read about the underlying project at its GitHub page if desired.
My immediate focus is on parsing `CREATE TABLE`, `INSERT`, and comments, with the goal of having the parser find any combination of those commands in the data it is passed (flagging syntax errors for anything it can't parse). Obviously I may expand it to cover a larger subset of the MySQL dialect later (and I'm leaving room for other dialects of SQL).
The overall idea for my parser is to break everything down into small units that can be matched by simple rules. Rules are combined to match specific parts of queries, and then those rules are combined to parse the actual queries. There are four basic rules:
- Literal: `rule_literal` looks for an exact match at the current parsing location
- Regexp: `rule_regexp` attempts to match a regular expression at the current parsing location
- Delimited: `rule_delimited` looks for lists of values separated by a delimiter and optionally enclosed by a quote character
- Children: `rule_children` attempts to match any number of given rules against the current parsing location
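To make the building blocks concrete, here is a rough standalone sketch of the first two rule types as plain functions (my own simplified stand-ins for illustration, not the project's classes): each rule either fails or consumes a prefix of the input and hands back the stripped leftovers for the next rule.

```python
import re

def match_literal( value, string ):
    # case-insensitive exact match at the current parsing location
    if string[:len( value )].lower() != value.lower():
        return None, string
    return value, string[len( value ):].strip()

def match_regexp( pattern, string ):
    # regular expression match at the current parsing location
    match = re.match( pattern, string, re.IGNORECASE )
    if not match:
        return None, string
    return match.group( 0 ), string[len( match.group( 0 ) ):].strip()

# chaining the two rules reads the start of a KEY definition
keyword, rest = match_literal( 'KEY', 'KEY account_id (account_id,name)' )
name, rest = match_regexp( r'[^\(\s\)]+', rest )
```

Here `keyword` ends up as `'KEY'`, `name` as `'account_id'`, and `rest` as `'(account_id,name)'`, ready for the next rule in the chain.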
With these four building blocks I was able to build a parser that can process some complicated `CREATE TABLE` commands, so I think I'm on the right track. I'm not entirely certain whether these rules would be enough to parse complicated `SELECT` commands (think JOINs, UNIONs, sub-queries, etc.), but since I have no need of `SELECT` commands for my immediate goals, that isn't too high on my list of concerns. At this point I'm looking for any and all feedback on the system, so let me have it!
On to some actual code! Here are two of the rule parsers (the others follow the same basic concepts, and docblocks have been removed, all for brevity):
Rules
rule_literal.py
```python
from .rule_base import rule_base

class rule_literal( rule_base ):

    # names are not required for literals
    require_name = False
    literal = ''

    def __init__( self, parser, rule, next_rule ):
        super().__init__( parser, rule, next_rule )
        self.literal = self.rule['value']

        # use the actual literal value as the name if we don't have one
        if not self.name:
            self.name = self.literal

    def parse( self, string ):
        # easy to check for a literal match at the beginning of string
        val_len = len( self.literal )
        if string[:val_len].lower() != self.literal.lower():
            self.result = ''
            self.leftovers = string
            return False

        # if we matched then clean!
        self.leftovers = string[val_len:].strip()
        self.result = self.literal
        return True
```
rule_regexp.py
```python
import re

from .rule_base import rule_base

class rule_regexp( rule_base ):

    regexp = ''

    def __init__( self, parser, rule, next_rule ):
        super().__init__( parser, rule, next_rule )
        self.regexp = self.rule['value']

    def parse( self, string ):
        # apply the regular expression!
        result = re.match( self.regexp, string, re.IGNORECASE )

        # if it didn't match then nothing changed
        if not result:
            self.leftovers = string
            self.result = ''
            return False

        # otherwise we have a match and can return as such
        self.result = result.group( 0 )
        cleaned = string[len( self.result ):].strip()

        # if there was a group in the regular expression then just keep that part
        if result.groups():
            self.result = result.group( 1 )
        self.leftovers = cleaned
        return True
```
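One subtlety worth highlighting in `rule_regexp.parse`: when the pattern contains a capture group, the leftovers advance past the *full* match (`group(0)`), while the stored result is only `group(1)`. A standalone sketch of just that behaviour (my own helper function, for illustration):

```python
import re

def regexp_result( pattern, string ):
    # mirrors the behaviour above: consume group(0), keep group(1) if present
    match = re.match( pattern, string, re.IGNORECASE )
    if not match:
        return None, string
    leftovers = string[len( match.group( 0 ) ):].strip()
    result = match.group( 1 ) if match.groups() else match.group( 0 )
    return result, leftovers

# the group captures the name, but the whole match (including the paren) is consumed
result, leftovers = regexp_result( r'(\w+)\s*\(', 'account_id (name)' )
```

So `result` is `'account_id'` while `leftovers` is `'name)'`: the opening paren was consumed but never stored.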
And this is the rule base that both of those extend:
rule_base.py
```python
class rule_base( object ):

    require_name = True
    require_value = True
    require_next = False
    parser_class = ''
    rule = {}
    next_rule = {}
    name = ''
    result = ''
    leftovers = ''

    def __init__( self, parser, rule, next_rule ):
        # we keep the parser class around just for reference and better errors
        self.parser_class = parser.__class__

        # store rule and next_rule
        self.rule = rule
        self.next_rule = next_rule
        self.name = self.rule['name'] if 'name' in self.rule else ''

        # name is required more often than not
        if self.require_name and not self.name:
            raise ValueError( "name required for rule %s in class %s" % ( self.rule, self.parser_class ) )

        # ditto value
        if self.require_value:
            if 'value' not in self.rule or not self.rule['value']:
                raise ValueError( 'missing value in rule %s for class %s' % ( self.rule, self.parser_class ) )

    def parse( self, string ):
        pass
```
Parsers
Actual parsers are then made from a combination of these rules. So, for instance, the parser that reads a `KEY` line (e.g. `KEY jobhub_jobs_account_id (account_id),`) from a `CREATE TABLE` command looks like this:
index_key.py
```python
from mygrations.core.parse.parser import parser

class index_key( parser ):

    definition_type = 'index'
    name = ''
    has_comma = False
    columns = []

    # KEY account_id (account_id,name)
    rules = [
        { 'type': 'literal', 'value': 'KEY' },
        { 'type': 'regexp', 'name': 'name', 'value': r'[^\(\s\)]+' },
        { 'type': 'literal', 'value': '(' },
        { 'type': 'delimited', 'name': 'columns', 'separator': ',', 'quote': '`' },
        { 'type': 'literal', 'value': ')' },
        { 'type': 'literal', 'value': ',', 'optional': True, 'name': 'ending_comma' }
    ]

    def __init__( self, rules = [] ):
        super().__init__( rules )
        self.columns = []

    def process( self ):
        self.name = self._values['name'].strip().strip( '`' )
        self.columns = self._values['columns']
        self.has_comma = 'ending_comma' in self._values
        if len( self.name ) > 64:
            self.errors.append( 'Key name %s is too long' % self.name )
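Since `rule_delimited`'s source isn't shown, readers have to infer what the `columns` rule produces. Based on the description earlier, I'd expect something along these lines (a guess at the behaviour, not the project's code; the stop character is hard-coded here, where the real rule presumably consults `next_rule`):

```python
def parse_delimited( string, separator = ',', quote = '`' ):
    # hypothetical sketch: consume values up to the next rule's literal
    # (hard-coded as ')' here), split on the separator, strip the quotes
    end = string.find( ')' )
    if end < 0:
        return None, string
    values = [value.strip().strip( quote ) for value in string[:end].split( separator )]
    return values, string[end:].strip()

columns, leftovers = parse_delimited( '`account_id`,name) ENGINE=InnoDB' )
```

This would yield `['account_id', 'name']` for `columns`, leaving `') ENGINE=InnoDB'` for the closing-paren rule, which matches what `index_key.process` expects: a ready-made list of unquoted column names.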
The parser base which these extend takes their list of rules and generates rule objects based on the configuration and type. When given a string to match, it iteratively applies the list of rules, starting at the beginning of the string and working its way toward the end, until it either hits a required rule that it can't match (in which case it stops) or matches all of its rules (in which case it returns whatever is unmatched). It stores the results in the object and then calls `self.process()` so the extending parser can condense the matched data into something more accessible for the rest of the system.

Finally, the parser for each command is similarly built from these same parts. For instance, this is the parser for the `CREATE TABLE` command (I haven't yet written the `process` method for this one):
create_parser.py
```python
from mygrations.core.parse.parser import parser
from .parsers import *

class create_parser( parser ):

    string = ''

    rules = [
        { 'type': 'literal', 'value': 'CREATE TABLE' },
        { 'type': 'regexp', 'name': 'table', 'value': r'\S+' },
        { 'type': 'literal', 'value': '(' },
        { 'type': 'children', 'name': 'definitions', 'classes': [
            index_primary, index_key, index_unique, constraint_foreign,
            type_character, type_numeric, type_decimal, type_text,
            type_enum, type_plain
        ] },
        { 'type': 'literal', 'value': ')' },
        { 'type': 'children', 'name': 'table_options', 'classes': [ table_options ], 'optional': True },
        { 'type': 'literal', 'value': ';', 'optional': True, 'name': 'closing_semicolon' }
    ]
```
Finally, the parser class is what brings it all together. `create_parser` and its children all extend the parser class:
parser.py
```python
import re

from .rule_children import rule_children
from .rule_delimited import rule_delimited
from .rule_literal import rule_literal
from .rule_regexp import rule_regexp

class parser( object ):

    rule_types = {
        'children': rule_children,
        'delimited': rule_delimited,
        'literal': rule_literal,
        'regexp': rule_regexp
    }

    num_rules = 0
    rules = []
    _values = {}
    matched = False
    errors = []
    warnings = []

    def __init__( self, rules = [] ):
        self._values = {}
        self.errors = []
        self.warnings = []

        # rules should be defined by the subclass
        if rules:
            self.rules = rules
        if not self.rules:
            raise ValueError( "Cannot extend parser without providing rules in %s" % self.__class__ )

        # we always need a type
        for rule in self.rules:
            if 'type' not in rule:
                raise ValueError( 'Missing type for rule %s in %s' % ( rule, self.__class__ ) )

        self.num_rules = len( self.rules )

    def get_rule_parser( self, rule, next_rule ):
        rule_type = rule['type']

        # keeping this simple for now
        if rule_type not in self.rule_types:
            raise ValueError( 'Unknown rule type %s for class %s' % ( rule_type, self.__class__ ) )

        return self.rule_types[rule_type]( self, rule, next_rule )

    def parse( self, string = '' ):
        # first things first, some initial string cleaning: strip spaces from the
        # start and end and replace any run of whitespace with a single space
        string = re.sub( r'\s+', ' ', string ).strip()

        for ( rule_index, rule ) in enumerate( self.rules ):
            # do we have a next rule?
            next_rule = self.rules[rule_index + 1] if rule_index < self.num_rules - 1 else False

            # now we can parse
            rule_parser = self.get_rule_parser( rule, next_rule )

            # does it match?  Check for a lack of match and deal with that first
            if not rule_parser.parse( string ):
                # if this rule wasn't optional then we just don't match
                if not rule.get( 'optional' ):
                    self.matched = False
                    return string
                # otherwise just keep going
                continue

            # we did match!  Yeah!
            self._values[rule_parser.name] = rule_parser.result
            string = rule_parser.leftovers

            # we are all done if we have nothing left
            if not string:
                break

        # if we got here then we got to the end, but we may not be done.  If there
        # are required rules left that haven't been matched, then we don't match.
        for check_index in range( rule_index + 1, self.num_rules ):
            if not self.rules[check_index].get( 'optional' ):
                self.matched = False
                return string

        # if we got here then we may not have matched every rule, but we
        # fulfilled all of our required rules.  As a result, we are done!
        self.process()
        self.matched = True
        return string

    def process( self ):
        """ Processes the results of the parsing process.

        Only called if a match is found.  No input or output, as it modifies
        the parser object in place, populating attributes as needed.
        """
        pass
```
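As an end-to-end illustration of that control flow, here is a compressed standalone version of the same loop (plain functions instead of the rule classes, and a plain regexp in place of the delimited rule, so it runs on its own) applied to the earlier `KEY` example:

```python
import re

def literal( value ):
    def rule( string ):
        if string[:len( value )].lower() != value.lower():
            return None, string
        return value, string[len( value ):].strip()
    return rule

def regexp( pattern ):
    def rule( string ):
        match = re.match( pattern, string, re.IGNORECASE )
        if not match:
            return None, string
        return match.group( 0 ), string[len( match.group( 0 ) ):].strip()
    return rule

def run_rules( rules, string ):
    # same control flow as parser.parse: stop on a failed required rule,
    # skip failed optional rules, collect named results as we go
    string = re.sub( r'\s+', ' ', string ).strip()
    values = {}
    for ( name, rule, optional ) in rules:
        result, leftovers = rule( string )
        if result is None:
            if not optional:
                return None, string
            continue
        values[name] = result
        string = leftovers
    return values, string

rules = [
    ( 'key', literal( 'KEY' ), False ),
    ( 'name', regexp( r'[^\(\s\)]+' ), False ),
    ( 'open', literal( '(' ), False ),
    ( 'columns', regexp( r'[^\)]+' ), False ),
    ( 'close', literal( ')' ), False ),
    ( 'ending_comma', literal( ',' ), True ),
]
values, leftovers = run_rules( rules, 'KEY  account_id (account_id,name)' )
```

The run consumes the whole string, records `name` and `columns`, and silently skips the optional trailing comma, which is exactly the happy path the real `parser.parse` takes before handing `self._values` to `process()`.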
I haven't included all of the files involved in parsing. I left out the source for the children and delimited rules simply because they follow the same basic patterns and there is already a lot of code here. Also, the create_parser currently uses 11 different child parsers to do its job, and I only included one of those. Again, they all work the same way, so they don't change the big picture, and I obviously can't fit much code here.
I know it's a lot, so thanks to anyone who works through even part of it for me! I'm happy to hear everything from code nitpicks to "why don't you just use this other thing that does exactly this already?". For reference on the problem I'm trying to solve and my motivation behind this project, just read the readme on the GitHub page.