I recently learned Python using Mark Summerfield's book, and I am now doing a few katas of mine, as I always do when learning a new language. This helps me get a feel for how the language works, and so I wrote a simple lexer for Apache logs.
Here are a few example log records:
    64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
    64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
    64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
and the program I wrote reads all values as tokens and outputs them using | as a separator, as in:
    64.242.88.10|-|-|07/Mar/2004:16:05:49 -0800|GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1|401|12846
    64.242.88.10|-|-|07/Mar/2004:16:06:51 -0800|GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1|200|4523
    64.242.88.10|-|-|07/Mar/2004:16:10:02 -0800|GET /mailman/listinfo/hsdivision HTTP/1.1|200|6291
The lexer could be used for more general purposes, though.
Here is my code; I would love to read your comments and suggestions on how to improve it. I have also raised a few specific questions below.
    import collections
    import io
    import sys

    LEXER_SKIP_WHITESPACE=0
    LEXER_READ_QUOTED_STRING=1
    LEXER_READ_BRACKETED_STRING=2
    LEXER_READ_WORD=3

    class Location:
        def __init__(self, name=None, pos=0, line=1, col=0):
            self.name = name or "<input>"
            self.line = line
            self.col = col
            self.pos = pos

        def update(self, c):
            self.pos += 1
            if c == "\n":
                self.line += 1
                self.col = 0
            else:
                self.col += 1

        def __repr__(self):
            return str.format("Location({}, {}, {}, {})",
                              repr(self.name), repr(self.pos),
                              repr(self.line), repr(self.col))

        def __str__(self):
            return str.format("{}: {}: line {}, column {}",
                              self.name, self.pos, self.line, self.col)

    def readchar(inputchannel, location):
        while True:
            maybechar = inputchannel.read(1)
            if maybechar == '':
                return None
            else:
                location.update(maybechar)
                yield maybechar

    def readtoken(inputchannel, location):
        state = LEXER_SKIP_WHITESPACE
        token = ''
        for nextchar in readchar(inputchannel, location):
            if state is LEXER_SKIP_WHITESPACE:
                if nextchar == "\n":
                    yield "\n"
                    continue
                elif nextchar.isspace():
                    continue
                elif nextchar == '"':
                    state = LEXER_READ_QUOTED_STRING
                    continue
                elif nextchar == '[':
                    state = LEXER_READ_BRACKETED_STRING
                    continue
                else:
                    state = LEXER_READ_WORD
                    token += nextchar
                    continue
            elif state is LEXER_READ_QUOTED_STRING:
                if nextchar == '"':
                    yield token
                    token = ''
                    state = LEXER_SKIP_WHITESPACE
                    continue
                else:
                    token += nextchar
                    continue
            elif state is LEXER_READ_BRACKETED_STRING:
                if nextchar == ']':
                    yield token
                    token = ''
                    state = LEXER_SKIP_WHITESPACE
                    continue
                else:
                    token += nextchar
                    continue
            elif state is LEXER_READ_WORD:
                if nextchar == "\n":
                    yield token
                    token = ''
                    state = LEXER_SKIP_WHITESPACE
                    yield "\n"
                    continue
                elif nextchar.isspace():
                    yield token
                    token = ''
                    state = LEXER_SKIP_WHITESPACE
                    continue
                else:
                    token += nextchar
                    continue
            else:
                raise Error("Impossible lexer state.")
        if state is LEXER_SKIP_WHITESPACE:
            return None
        elif state is LEXER_READ_QUOTED_STRING:
            raise Error("End of character stream in quoted string.")
        elif state is LEXER_READ_BRACKETED_STRING:
            raise Error("End of character stream in quoted string.")
        elif state is LEXER_READ_WORD:
            yield token
            return None
        else:
            raise Error("Impossible lexer state.")

    class Lexer:
        def __init__(self, inputchannel, _location=None):
            self.location = _location or Location("<input>", 0, 1, 0)
            self.inputchannel = inputchannel
            self.buf = ''
            self.state = LEXER_SKIP_WHITESPACE

        def __iter__(self):
            return readtoken(self.inputchannel, self.location)

    if __name__ == "__main__":
        sep = ''
        for token in Lexer(sys.stdin):
            if token == '\n':
                sys.stdout.write(token)
                sep = ''
            else:
                sys.stdout.write(sep)
                sys.stdout.write(token)
                sep = '|'
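For reference, I run the script on a log file by piping it through stdin, with something like `python3 apachelexer.py < access_log` (the script name is just what I happened to call it).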
Question 1 – How to write lexers in Python?
I am well aware that there are tools similar to lex and yacc for Python that could have been used here, but that would defeat the purpose of learning how to write such a program in Python. I found it surprisingly hard.
My first misfortune is that Python does not do tail-call elimination, so it is essentially forbidden to write a lexer as a set of mutually recursive functions. Mutually recursive functions are one of my favourite tools for this, because each recursive function clearly singles out a specific state of the lexer and the transitions from that state to the other states, which makes it easy to write test cases for each transition individually.
Since maybe not everybody is familiar with lexers based on mutually recursive functions, here is the equivalent of the readtoken generator in OCaml. The beginning of read_token reads like "if you read a double quote, discard it and go to read_quotedstring", and that function, defined later, does what one expects: it aggregates characters in a buffer until the next double quote and blesses the result as a token.
    let rec read_token f ((ax,buffer) as state) s =
      match Stream.peek s with
      | Some('"') -> Stream.junk s; read_quotedstring f state s
      | Some('[') -> Stream.junk s; read_timestamp f state s
      | Some(' ') | Some('\t') -> Stream.junk s; read_token f state s
      | Some(c) -> read_simpleword f state s
      | None -> ax
    and read_simpleword f ((ax,buffer) as state) s =
      match Stream.peek s with
      | Some('"') | Some('[') | Some(' ') | Some('\t') ->
          let current_token = Buffer.contents buffer in
          Buffer.clear buffer;
          read_token f (f ax current_token, buffer) s
      | Some(c) ->
          Buffer.add_char buffer c;
          Stream.junk s;
          read_simpleword f state s
      | None -> ax
    and read_quotedstring f ((ax,buffer) as state) s =
      match Stream.peek s with
      | Some('"') ->
          Stream.junk s;
          let current_token = Buffer.contents buffer in
          Buffer.clear buffer;
          read_token f (f ax current_token, buffer) s
      | Some(c) ->
          Stream.junk s;
          Buffer.add_char buffer c;
          read_quotedstring f state s
      | None -> failwith "End of stream within a quoted string"
    and read_timestamp f ((ax,buffer) as state) s =
      match Stream.peek s with
      | Some(']') ->
          Stream.junk s;
          let current_token = Buffer.contents buffer in
          Buffer.clear buffer;
          read_token f (f ax (timestamp_to_iso current_token), buffer) s
      | Some(c) ->
          Stream.junk s;
          Buffer.add_char buffer c;
          read_timestamp f state s
      | None -> failwith "End of stream within a bracketed string"
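Translated naively into Python, this style looks roughly like the sketch below (a toy version restricted to words and quoted strings, with made-up names). It works on a short line, but without tail-call elimination every consumed character adds a stack frame, so a long record quickly hits the recursion limit. This is why I fell back to an explicit state machine.

    # Naive Python transcription of the mutually recursive style.
    # Each character consumed adds a stack frame, so long input fails
    # with RecursionError. All names here are invented for illustration.
    def read_token(chars, i, tokens):
        if i >= len(chars):
            return tokens
        c = chars[i]
        if c == '"':
            return read_quoted(chars, i + 1, tokens, '')
        if c.isspace():
            return read_token(chars, i + 1, tokens)
        return read_word(chars, i, tokens, '')

    def read_quoted(chars, i, tokens, buf):
        # An unterminated quote would raise IndexError; ignored in this sketch.
        if chars[i] == '"':
            return read_token(chars, i + 1, tokens + [buf])
        return read_quoted(chars, i + 1, tokens, buf + chars[i])

    def read_word(chars, i, tokens, buf):
        if i >= len(chars) or chars[i].isspace():
            return read_token(chars, i, tokens + [buf])
        return read_word(chars, i + 1, tokens, buf + chars[i])

    print(read_token('GET /index.html "ok"', 0, []))
    # -> ['GET', '/index.html', 'ok']
    # read_token('x ' * 10000, 0, [])  # raises RecursionError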
My Python version defines a few states as symbolic constants, as I would have done in C, and then builds one huge loop keeping track of the transitions. I am not pleased with that implementation, because it does not give me any tool to manage the complexity of the lexer, and working with functions is very unpleasant because of the details of variable scoping in Python. So what would be the idiomatic way to break the lexer down into small, testable pieces, which would be important if I decided to write a more complicated parser? Could the idea of representing lexer states by objects be interesting?
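To make that last question concrete, here is roughly what I imagine by "lexer states as objects": a rough, simplified sketch (all names invented, bracketed strings omitted) where each state object consumes one character and returns the follow-up state, so that every transition could be unit-tested on its own.

    # Rough sketch of "states as objects": each state handles one character
    # and returns the next state. All names are invented for the example.
    class SkipWhitespace:
        def feed(self, lexer, c):
            if c == '"':
                return ReadQuoted()
            if c.isspace():
                return self
            lexer.buffer.append(c)
            return ReadWord()

    class ReadQuoted:
        def feed(self, lexer, c):
            if c == '"':
                lexer.emit()
                return SkipWhitespace()
            lexer.buffer.append(c)
            return self

    class ReadWord:
        def feed(self, lexer, c):
            if c.isspace():
                lexer.emit()
                return SkipWhitespace()
            lexer.buffer.append(c)
            return self

    class StateLexer:
        def __init__(self, text):
            self.text = text
            self.buffer = []
            self.tokens = []

        def emit(self):
            self.tokens.append(''.join(self.buffer))
            self.buffer.clear()

        def run(self):
            state = SkipWhitespace()
            for c in self.text:
                state = state.feed(self, c)
            if self.buffer:
                self.emit()
            return self.tokens

    assert StateLexer('GET /index "a b"').run() == ['GET', '/index', 'a b']

Is this kind of design considered idiomatic in Python, or is there a better-established pattern?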
Question 2. Is it right to treat stdin as a character stream?
Obviously I somehow did that, but Python has no real character type and represents characters as strings of length 1. It feels to me that reading my input in chunks into "circular expandable buffers" and making copies of substrings of these chunks to generate my tokens would fit much more nicely in the language. Am I right?
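What I have in mind is something along these lines (a rough sketch; the chunk size and the helper name are arbitrary choices of mine): read the stream in fixed-size blocks and slice complete lines out of the accumulated string, instead of building tokens one character at a time.

    import io

    def read_lines_in_chunks(stream, chunksize=8192):
        """Yield complete lines from *stream*, reading it block by block."""
        pending = ''
        while True:
            chunk = stream.read(chunksize)
            if not chunk:
                break
            pending += chunk
            # Everything up to the last newline is safe to split; the tail
            # may be an incomplete line, so carry it over to the next chunk.
            complete, sep, pending = pending.rpartition('\n')
            if sep:
                for line in complete.split('\n'):
                    yield line
        if pending:
            yield pending

    # Quick check with an in-memory stream (chunksize forced small on purpose):
    assert list(read_lines_in_chunks(io.StringIO("a b\nc d\n"), chunksize=3)) == ['a b', 'c d']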
Question 3. What is the all-purpose exception in Python?
This seems like a rather basic question, but I am really failing to find a suitable answer in the documentation. The Exception class seems a poor choice, since it is very general and makes error identification and error handling rather complicated.
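My current workaround, which may or may not be idiomatic, is to define a small module-level exception of my own instead of raising a generic one:

    # A module-specific exception class, so that callers can catch lexer
    # failures without catching every possible Exception.
    class LexerError(Exception):
        """Raised when the lexer meets malformed input or an impossible state."""

    # e.g. raise LexerError("End of character stream in quoted string.")

Is there a standard built-in exception that is meant for this kind of situation, or is a dedicated subclass the usual answer?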
Question 4. Do generators return None?
Is it good style to return None in generators? Would pass also do?
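To make the question concrete, I mean the difference between these two endings (a small self-contained example):

    # Both generators yield the same values; the question is whether the
    # explicit "return None" at the end is good style or just noise.
    def countdown_explicit(n):
        while n > 0:
            yield n
            n -= 1
        return None          # explicit end of the generator

    def countdown_implicit(n):
        while n > 0:
            yield n
            n -= 1            # simply falling off the end

    assert list(countdown_explicit(3)) == list(countdown_implicit(3)) == [3, 2, 1]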