
I recently learned Python from Mark Summerfield's book, and I am now working through a few katas of mine, as I do whenever I learn a new language. This helps me get a feel for how the language works, and so I wrote a simple lexer for Apache logs.

Some example log records are

64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291

and the program I wrote reads all values as tokens and outputs them using | as a separator, as in

64.242.88.10|-|-|07/Mar/2004:16:05:49 -0800|GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1|401|12846
64.242.88.10|-|-|07/Mar/2004:16:06:51 -0800|GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1|200|4523
64.242.88.10|-|-|07/Mar/2004:16:10:02 -0800|GET /mailman/listinfo/hsdivision HTTP/1.1|200|6291

The lexer can be put to more general use, though.

Here is my code and I would love to read your comments and suggestions about how to improve it. I have also raised a few specific questions on certain topics.

import collections
import io
import sys

LEXER_SKIP_WHITESPACE = 0
LEXER_READ_QUOTED_STRING = 1
LEXER_READ_BRACKETED_STRING = 2
LEXER_READ_WORD = 3

class Location:
    def __init__(self, name=None, pos=0, line=1, col=0):
        self.name = name or "<input>"
        self.line = line
        self.col = col
        self.pos = pos

    def update(self, c):
        self.pos += 1
        if c == "\n":
            self.line += 1
            self.col = 0
        else:
            self.col += 1

    def __repr__(self):
        return str.format("Location({}, {}, {}, {})",
                          repr(self.name), repr(self.pos),
                          repr(self.line), repr(self.col))

    def __str__(self):
        return str.format("{}: {}: line {}, column {}",
                          self.name, self.pos, self.line, self.col)

def readchar(inputchannel, location):
    while True:
        maybechar = inputchannel.read(1)
        if maybechar == '':
            return None
        else:
            location.update(maybechar)
            yield maybechar

def readtoken(inputchannel, location):
    state = LEXER_SKIP_WHITESPACE
    token = ''
    for nextchar in readchar(inputchannel, location):
        if state is LEXER_SKIP_WHITESPACE:
            if nextchar == "\n":
                yield "\n"
                continue
            elif nextchar.isspace():
                continue
            elif nextchar == '"':
                state = LEXER_READ_QUOTED_STRING
                continue
            elif nextchar == '[':
                state = LEXER_READ_BRACKETED_STRING
                continue
            else:
                state = LEXER_READ_WORD
                token += nextchar
                continue
        elif state is LEXER_READ_QUOTED_STRING:
            if nextchar == '"':
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                continue
            else:
                token += nextchar
                continue
        elif state is LEXER_READ_BRACKETED_STRING:
            if nextchar == ']':
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                continue
            else:
                token += nextchar
                continue
        elif state is LEXER_READ_WORD:
            if nextchar == "\n":
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                yield "\n"
                continue
            elif nextchar.isspace():
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                continue
            else:
                token += nextchar
                continue
        else:
            raise Error("Impossible lexer state.")
    if state is LEXER_SKIP_WHITESPACE:
        return None
    elif state is LEXER_READ_QUOTED_STRING:
        raise Error("End of character stream in quoted string.")
    elif state is LEXER_READ_BRACKETED_STRING:
        raise Error("End of character stream in quoted string.")
    elif state is LEXER_READ_WORD:
        yield token
        return None
    else:
        raise Error("Impossible lexer state.")

class Lexer:
    def __init__(self, inputchannel, _location=None):
        self.location = _location or Location("<input>", 0, 1, 0)
        self.inputchannel = inputchannel
        self.buf = ''
        self.state = LEXER_SKIP_WHITESPACE

    def __iter__(self):
        return readtoken(self.inputchannel, self.location)

if __name__ == "__main__":
    sep = ''
    for token in Lexer(sys.stdin):
        if token == '\n':
            sys.stdout.write(token)
            sep = ''
        else:
            sys.stdout.write(sep)
            sys.stdout.write(token)
            sep = '|'

Question 1 – How to write lexers in Python?

I am well aware that there are tools similar to lex and yacc for Python that could have been used here, but that would defeat the purpose of learning how to write such a program in Python. I found it surprisingly hard.

My first misfortune is that Python does not do tail-call elimination, so it is essentially forbidden to write a lexer as a set of mutually recursive functions. Mutually recursive functions are one of my favourite tools for this, because they clearly single out a specific state of the lexer (the recursive function it is in) and the transitions from this state to the other states, and they make it easy to test each of the transitions individually.

Since not everybody may be familiar with lexers based on mutually recursive functions, here is the equivalent of the readtoken generator in OCaml. The beginning of read_token reads like "If you read a double quote, discard it and do read_quotedstring", and that function itself is defined later and does what one expects: it aggregates characters in a buffer until the next double quote and blesses the result as a token.

let rec read_token f ((ax,buffer) as state) s =
  match Stream.peek s with
  | Some('"') -> Stream.junk s; read_quotedstring f state s
  | Some('[') -> Stream.junk s; read_timestamp f state s
  | Some(' ') | Some('\t') -> Stream.junk s; read_token f state s
  | Some(c) -> read_simpleword f state s
  | None -> ax
and read_simpleword f ((ax,buffer) as state) s =
  match Stream.peek s with
  | Some('"') | Some('[') | Some(' ') | Some('\t') ->
      let current_token = Buffer.contents buffer in
      Buffer.clear buffer;
      read_token f (f ax current_token, buffer) s
  | Some(c) ->
      Buffer.add_char buffer c;
      Stream.junk s;
      read_simpleword f state s
  | None -> ax
and read_quotedstring f ((ax,buffer) as state) s =
  match Stream.peek s with
  | Some('"') ->
      Stream.junk s;
      let current_token = Buffer.contents buffer in
      Buffer.clear buffer;
      read_token f (f ax current_token, buffer) s
  | Some(c) ->
      Stream.junk s;
      Buffer.add_char buffer c;
      read_quotedstring f state s
  | None -> failwith "End of stream within a quoted string"
and read_timestamp f ((ax,buffer) as state) s =
  match Stream.peek s with
  | Some(']') ->
      Stream.junk s;
      let current_token = Buffer.contents buffer in
      Buffer.clear buffer;
      read_token f (f ax (timestamp_to_iso current_token), buffer) s
  | Some(c) ->
      Stream.junk s;
      Buffer.add_char buffer c;
      read_timestamp f state s
  | None -> failwith "End of stream within a bracketed string"

My Python version defines a few states as symbolic constants, as I would have done in C, and then builds a huge loop keeping track of the transitions. I am not pleased with that implementation, because it does not give me any tool to manage the complexity of the lexer. Working with nested functions is very unpleasant because of the details of variable scope in Python. So what would be the idiomatic way to break the lexer down into small, testable pieces, which would be important if I decided to write a more complicated parser? Could representing lexer states by objects be an interesting idea?

Question 2. Is it right to treat stdin as a character stream?

Obviously I somehow did that, but Python has no real character type and represents characters as strings of length 1. It feels to me that reading my input in chunks into “circular expandable buffers” and copying substrings out of these chunks to generate my tokens would fit the language much more nicely. Am I right?
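To make the question concrete, here is a minimal sketch of the chunked approach I have in mind; the chunk size and the name readchunks are made up for the example:

CHUNK_SIZE = 8192  # made-up figure, not tuned

def readchunks(inputchannel):
    """Yield successive chunks of the input instead of single characters."""
    while True:
        chunk = inputchannel.read(CHUNK_SIZE)
        if chunk == '':
            return
        yield chunk

The lexer would then slice tokens out of these chunks instead of accumulating characters one by one.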

Question 3. What is the all-purpose exception in Python?

This seems like a rather basic question, but I am really failing to find a suitable answer in the documentation. The Exception class seems a poor choice, since it is very general and makes error identification and error handling rather complicated.

Question 4. Do generators return None?

Is it good style to return None in generators? Would pass also do?


2 Answers


    In addition to the other answers, I wanted to add my own thoughts on Q1 and Q4.

    Question 1

    Break up readtoken into multiple functions.

Right now, you are using a state machine in one large function to read the different types of tokens. As mentioned in another answer, recursion is not very "pythonic". Additionally, it is hard to follow the logic in this function, since the state can change on every iteration.

This logic would be easier to understand if you break the different states up into separate functions, as it appears you have done in your OCaml example. You can even use similar high-level function names. We can then change our definition of readtoken to route to other helper functions based on the first character we read from the stream.

def readtoken(inputchannel, location):
    for nextchar in readchar(inputchannel, location):
        if nextchar == '\n':
            yield '\n'
        elif nextchar.isspace():
            continue
        elif nextchar == '"':
            yield read_quoted_string(inputchannel, location)
        elif nextchar == '[':
            yield read_bracketed_string(inputchannel, location)
        else:
            yield read_token(nextchar, inputchannel, location)

    Note how the function immediately becomes easier to read. We are delegating our responsibilities out to other named functions, just like in your OCaml example. The purpose of readtoken is now simply that of a router, where we are delegating work out to other functions based on the first character we read from the stream.

Each of these helper functions now has a single responsibility: read its given token type from the stream and return the result. Note that these functions should actually return a result, not yield it - these functions are not generators; the only generator here is the readtoken function.

    Helper function implementations:

def read_quoted_string(inputchannel, location):
    token = ''
    for nextchar in readchar(inputchannel, location):
        if nextchar == '"':
            return token
        else:
            token += nextchar
    raise Exception("End of character stream in quoted string.")

def read_bracketed_string(inputchannel, location):
    token = ''
    for nextchar in readchar(inputchannel, location):
        if nextchar == ']':
            return token
        else:
            token += nextchar
    raise Exception("End of character stream in bracketed string.")

def read_token(token, inputchannel, location):
    for nextchar in readchar(inputchannel, location):
        if nextchar.isspace():
            return token
        else:
            token += nextchar
    return token  # if we reach the end of the stream

    Take note of a few things with this implementation:

    • Each function becomes much more readable. We no longer have to trace through a single monolithic function and keep track of the state it is in.
    • We can now see that read_quoted_string and read_bracketed_string have a lot of common logic. Perhaps we should combine these into a single function read_string with a string_delimiter as a parameter (a sketch follows this list).
    • read_token takes an initial token as its first parameter - this is the character that we read from the top-level readtoken function. If we don't like this implementation, we could try to solve it by using peek at the top-level instead.
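For instance, the combined helper might look like this (a sketch only; the name read_string and its delimiter parameter are suggestions, not code from the question):

def read_string(delimiter, inputchannel, location):
    # Accumulate characters until the closing delimiter is found.
    token = ''
    for nextchar in readchar(inputchannel, location):
        if nextchar == delimiter:
            return token
        token += nextchar
    raise Exception("End of character stream in delimited string.")

read_quoted_string then becomes read_string('"', ...) and read_bracketed_string becomes read_string(']', ...).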

    Don't make readtoken a generator function.

In your example, the Lexer class has implemented __iter__, which makes it an iterable. This means that your readtoken does not need to be a generator function (i.e. you can replace all yield statements with return), because the Lexer class is already an iterable that wraps around it.

    Question 4

When a generator function exits, it automatically raises StopIteration. for-loops and while-loops will automatically handle this StopIteration by ending the loop and continuing on.

    Therefore, the "normal" way to end a generator is to simply end the function. return None does accomplish this, but a more traditional method would be to simply reach the end of the function and return naturally. You can accomplish this by breaking out of the while-loop in your readchar function.
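For example, readchar could fall off the end of its loop instead of returning None (a sketch based on the code from the question):

def readchar(inputchannel, location):
    while True:
        maybechar = inputchannel.read(1)
        if maybechar == '':
            break  # end of stream: the generator finishes naturally here
        location.update(maybechar)
        yield maybechar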


Note: I'm assuming that this code works and you seek only 'code quality' improvements. Also, you seem like an experienced programmer, so I will try not to explain trivial things, but please do ask if you think I didn't state my opinion clearly enough.

      Code notes:

      General

There are no docs on any method or class. While the method names are descriptive, one might still have trouble with, for example, what is being updated in Location.update - it might be a name update as well.

In my opinion both readchar and readtoken should be wrapped into some class - a Reader maybe? The advantages of that are (at least) two: you do not clutter the global namespace with functions, and you could store inputchannel, state and location as instance variables.
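A rough sketch of what I mean (the name Reader and the exact layout are only a suggestion):

class Reader:
    """Wraps an input channel and keeps the location and lexer state together."""

    def __init__(self, inputchannel, location=None):
        self.inputchannel = inputchannel
        self.location = location or Location()
        self.state = LEXER_SKIP_WHITESPACE

    def readchar(self):
        ...  # same body as the free readchar function, using self.inputchannel

    def readtoken(self):
        ...  # same body as the free readtoken function, using self.state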

      Imports

import collections
import io
import sys

I wouldn't import whole packages. Try using the from X import Y pattern: it is both more descriptive (the reader can see why you are using those packages) and saves you from repeating yourself (as you do with sys.stdout). However, this is just a matter of preference.
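For example:

from sys import stdin, stdout

# later: stdout.write(token) instead of sys.stdout.write(token)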

      Global (file) variables

I tend to move them into a separate file and import them, mainly because of the Zen of Python:

      Namespaces are one honking great idea -- let's do more of those!

Moving your constants away from their usage helps in the future if you need to make them configurable - it separates the 'implementation' from the usage.

Also (if you happen to use Python 3.4+), it might be beneficial to use enums instead. In addition to creating a new 'namespace', they provide a very important feature: your globals will actually be constant.
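For example (a sketch, assuming Python 3.4+; the name LexerState is my suggestion):

from enum import Enum, unique

@unique
class LexerState(Enum):
    SKIP_WHITESPACE = 0
    READ_QUOTED_STRING = 1
    READ_BRACKETED_STRING = 2
    READ_WORD = 3

# usage: state = LexerState.SKIP_WHITESPACE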

      Location's init

I must say that I love the fact that you used name or "<input>"; people tend to check for None with if too much.

      Location's update

The c variable is not very descriptive; what you probably meant was 'character' or 'char'?

      Location's repr and str

I personally don't like it when people use the str.format method: it has no advantage over "Location({}, {}, {}, {})".format(...) but is longer. However, if you target your classes at Python 3.6+, why not use something even more readable: string interpolation (f-strings).
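For example, the __str__ method could become (Python 3.6+ only):

def __str__(self):
    return f"{self.name}: {self.pos}: line {self.line}, column {self.col}"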

readchar

Instead of returning None I'd raise StopIteration, because that is the usual thing to do when you want to stop a generator.

As @Maarten Fabré pointed out, this is no longer valid. Instead, just use return (see "Explanation of generators, iterators, and StopIteration").

      readtoken

      • In my opinion, this method is too long. I'd split it based on state, so that for each state you'd call a different function.
      • Depending on your needs, it might be useful to use os.linesep instead of "\n".
      • Do not raise a generic Exception; more on that later.
      • As previously, do not return None - raise an exception instead.

      Lexer's init

      • In this case documentation would be lovely, particularly to understand what inputchannel is and what API it must comply with.

      • In Location("<input>", 0, 1, 0) it would be good to name the parameters that you pass, so the reader wouldn't need to search for the signature: Location(name="<input>", pos=0, line=1, col=0).

      • What does for token in Lexer(sys.stdin): mean? Will the reader understand what it means to 'iterate over a lexer'? My preference would be to rearrange your class into something like for token in lexer.Tokenize(sys.stdin).

      Questions

      Q1

You are right: recursion is generally discouraged in Python. From what I have seen in similar use cases, people do tend to scan char by char as you did. I don't have much experience in writing lexers, so I am not sure if my advice will be at all helpful, but let's try:

      • Small functions are your friend - the more logic you abstract away, the more complexity you can put in. For example, instead of doing x == '[' try isLeftBracket(x); in the future you might introduce isBracket, which will call isLeftBracket or isRightBracket. This is just a silly example, but I hope you get the point.
      • Instead of recursion, how about using a stack of states? It is a general practice to turn recursive code into iterative code.
      • You could abstract your class into a state machine (basically wrapping everything as I mentioned at the beginning); this will help in managing the state and lets you include many events as functions.
      • First consume and then validate multiple characters given a state. For example, if you know that when the current state is A and you encounter the character [, the next few characters are going to be some specific data,

      you could do this:

if state == A and current_character == '[':
    consume_dataX()

def consume_dataX():
    a = ''
    while condition(current_character):
        a += current_character
        current_character = readchar()
    validate(a)

A real-life example would be parsing a function (foo(int x, int y)):

if state == ParsingFunction and current_character == "(":
    argument_vector = consume_until(')')
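Here consume_until is hypothetical; one possible sketch of it, reusing the readchar generator from the question (with the channel and location passed explicitly):

def consume_until(closing, inputchannel, location):
    # Hypothetical helper: gather characters up to (but not including) `closing`.
    consumed = ''
    for nextchar in readchar(inputchannel, location):
        if nextchar == closing:
            return consumed
        consumed += nextchar
    raise Exception("End of character stream before %r" % closing)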

      Q2

It is completely fine to think of characters as length-1 strings. You can think of it like this: for a given string a, the [] operator is a slicing operator, which means that a[1] slices the string and is short for a[1:2]. Also, keep in mind that in reality a char type is just a byte (or something similar), so there is no need for special treatment.
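For example:

a = "word"
print(a[1])            # 'o', a string of length 1
print(a[1] == a[1:2])  # True: indexing is just a one-character slice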

      Q3

I cannot stress this enough, in my opinion: DO NOT RAISE GENERIC EXCEPTIONS, unless you have absolutely no idea what happened.

A person using your class will have to do the following:

try:
    # some code
except Exception as ex:
    # handle

which is inconvenient for many reasons. If you want to know more, see this SO question.
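A sketch of a more specific exception for this lexer (the name LexerError is only a suggestion):

class LexerError(Exception):
    """Raised when the lexer meets malformed or truncated input."""

# In the lexer:     raise LexerError("End of character stream in quoted string.")
# At the call site: except LexerError as ex: ...  (callers need not catch everything)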

      Q4

      Already answered above.

      • About the StopIteration, see PEP 479: "This PEP proposes a change to generators: when StopIteration is raised inside a generator, it is replaced with RuntimeError. (More precisely, this happens when the exception is about to bubble out of the generator's stack frame.)" – Maarten Fabré, Sep 18, 2018 at 8:36
      • @MaartenFabré thank you for that, I guess we learn something new every day :) – MaLiN2223, Sep 18, 2018 at 10:17
