Simple Python text file Parser and Converter

Question

I wrote a simple text parser and a converter using Python 3.12.1.

The purpose is to read a text file that contains 20000 words one per line, and create an output file that contains all the queries.

Here is a snippet of the input file with fake values:

aaaaaaaa bbbbbbbb cccccccc dddddddd

I got this file from an external source and I need to save those words in a database. For that reason, I need to read the input file and create the SQL queries. I can directly run the queries on the DB, but for simplicity, here I published a version with a text file for output.

I tried to implement the separation of concerns and the single responsability concept.

Here's my code:

#!/usr/bin/env python3 from abc import ABC, abstractmethod class ParserInterface(ABC): @abstractmethod def parse(self) -> None: pass class ConverterInterface(ABC): @abstractmethod def convert(self, tablename: str = '', columnname: str = '') -> None: pass @abstractmethod def writeOnFile(self, filename: str = '') -> None: pass class Parser(ParserInterface): def __init__(self, filename: str = '') -> None: if not filename: raise Exception('Filename cannot be empty!') self.filename = filename self.tokens = set() def parse(self) -> None: with open(self.filename, 'r') as file: while line := file.readline(): self.tokens.add(line.rstrip()) class TextToMySQLConverter(ConverterInterface): def __init__(self) -> None: self.tokens = set() self.queries = set() def setTokens(self, tokens: set = set()) -> None: self.tokens = tokens def convert(self, tablename: str = '', columnname: str = '') -> None: if not tablename: raise Exception('Tablename cannot be empty!') if not columnname: raise Exception('Columnname cannot be empty!') for token in self.tokens: query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token) self.queries.add(query) def writeOnFile(self, filename: str = '') -> None: if not filename: raise Exception('Filename cannot be empty!') with open(filename,'w') as file: for query in self.queries: file.write('{0}\n'.format(query)) class Manager: def __init__(self, parser: ParserInterface = None, converter: ConverterInterface = None) -> None: if not parser: raise Exception('Parser cannot be None!') if not converter: raise Exception('Converter cannot be None!') self.parser = parser self.converter = converter self.outputFilename = '' self.tableName = '' self.columnName = '' def setOutputFilename(self, filename: str = '') -> None: if not filename: raise Exception('Filename cannot be empty!') self.outputFilename = filename def setTableName(self, tableName: str = '') -> None: if not tableName: raise Exception('Filename cannot be empty!') self.tableName = tableName def setColumnName(self, columnName: str = '') -> None: if not columnName: raise Exception('Filename cannot be empty!') self.columnName = columnName def process(self): self.parser.parse() tokens = self.parser.tokens converter.setTokens(tokens) converter.convert(self.tableName, self.columnName) converter.writeOnFile(self.outputFilename) if __name__ == "__main__": parser = Parser('original.txt') converter = TextToMySQLConverter() # We can implement a strategy here manager = Manager(parser, converter) manager.setOutputFilename('queries.txt') manager.setTableName("tablename") manager.setColumnName("columname") manager.process()

As I wrote in the description, the input file is a file that contains 20000 words one per line. — friday-json, CommentedFeb 13, 2024 at 15:48

C.Nivs · Accepted Answer · 2024-02-13 20:51:32Z

Naming

Method/function names, parameters, and variable names should be snake_case

`Parser.parse`

This doesn't have to be a while loop, you can just use set.update on an iterable:

class Parser(ParserInterface): ~snip~ def parse(self) -> None: # no need for the 'r' mode, as it's the default with open(self.filename) as fh: self.tokens.update((line.rstrip() for line in fh))

`TextToMySQLConverter.write_on_file`

To write the set to the file, simply use str.join:

 def write_on_file(self, filename: str = '') -> None: if not filename: raise Exception('Filename cannot be empty!') with open(filename, 'w') as file: file.write('\n'.join(self.queries))

Bug: `converter`

I think you meant to have self.converter in Manager.process, otherwise it's relying on a global:

 def process(self): self.parser.parse() tokens = self.parser.tokens # Here converter.setTokens(tokens) converter.convert(self.tableName, self.columnName) converter.writeOnFile(self.outputFilename)

Getters and Setters

The Manager class should be able to take output_filename, table_name, and column_name as parameters, since you seem to know these ahead of time. This way you don't need the getter and setter methods:

class Manager: def __init__( self, output_filename: str, table_name: str, column_name: str, parser: ParserInterface, # Make these required converter: ConverterInterface ) -> None: self.parser = parser self.converter = converter self.output_filename = output_filename self.table_name = table_name self.column_name = column_name def process(self): self.parser.parse() self.converter.tokens = self.parser.tokens self.converter.convert(self.table_name, self.column_name) self.converter.write_file(self.output_filename)

However, now we have a class with two methods, one of which is __init__. This means that we should be able to safely convert this into a standalone function:

def parse_and_write( parser: ParserInterface, converter: ConverterInterface, output_filename: str, table_name: str, column_name: str ): parser.parse() converter.tokens = parser.tokens converter.convert(table_name, column_name) converter.write_file(output_filename)

`converter.tokens = parser.tokens`

IMO, parser.tokens should just be an argument to converter.convert:

class TextToMySQLConverter(ConverterInterface): def __init__(self): self.tokens = set() self.queries = set() def convert(self, tokens: set[str], tablename: str, columnname: str) -> None: for token in tokens: query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token) self.queries.add(query)

Now, this can also be refactored in the same way. Instead of returning None, return queries:

def convert_tokens(self, tokens: set[str], tablename: str, columnname: str) -> set[str]: queries = set() for token in tokens: query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token) queries.add(query) return queries

What about write_to_file? That is a function as well, we aren't carrying around state, and it's just as easy to provide queries and output_filename as arguments:

def write_queries(file: str, queries: set[str]) -> None: with open(file, 'w') as fh: fh.write('\n'.join(queries))

`Parser`

As above, so below. Parser is also just a single function:

def parse(filename: str) -> set[str]: with open(filename) as file: return set((line.rstrip() for line in file))

`main` function

Now, your process function is really just a main:

def parse(filename: str) -> set[str]: with open(filename, 'r') as file: return set((line.rstrip() for line in file)) def convert_tokens(tokens: set[str], tablename: str, columnname: str) -> set[str]: queries = set() for token in tokens: query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token) queries.add(query) return queries def write_queries(file: str, queries: set[str]) -> None: with open(file, 'w') as fh: fh.write('\n'.join(queries)) def main(input_file: str, output_file: str, table_name: str, column_name: str): tokens = parse(input_file) queries = convert_tokens(tokens) write_queries(queries=queries, file=output_file) return if __name__ == "__main__": main( input_file='original.txt', output_file='queries.txt', table_name='tablename', column_name='columnname' ) ```

J_H · Accepted Answer · 2024-02-15 23:33:38Z

pass

class ParserInterface(ABC): @abstractmethod def parse(self) -> None: pass

At the end there we had to write pass, something we generally avoid unless it's syntactically necessary.

You twice passed up the opportunity to """describe""" the responsibility of the class or method. Now for the class, no big deal, it seems pretty self explanatory. But the method is a little surprising.

Typically I would expect a "parse" verb to consume some ugly text and return it as a beautiful data structure. But here the helpful annotation is explaining that we must be evaluating for side effects. Ok, fair enough. But then explain those side effects in a docstring, please. And having done that, there's no longer a necessity for that pass.

Similar remarks for convert().

default args

 def convert(self, tablename: str = '', columnname: str = '') -> None: ... def writeOnFile(self, filename: str = '') -> None:

The type annotation is helpful, thank you. (I am sad the file name is of str rather than the more natural Path.)

I can't imagine how those empty-string defaults would be helpful to a caller. Seems more likely that someone would use your Public API incorrectly, then be lulled into a sense of false confidence in the results due to lack of diagnostic exception being raised. Consider leaving them as type str, with no default, so caller is forced to supply a value.

Similarly with the Parser ctor. Testing with if not filename: and then raising is just silly, elide that part. What you might sensibly do is insist on Path, and raise unless filename.exists().

Similarly down in TextToMySQLConverter and Manager.

And if you're going to raise, either define an app-specific exception or choose an appropriate one such as ValueError. Avoid raising Exception, as that prevents callers from focusing a try block on an expected issue that they know how to recover from.

lint

Pep-8 asks that you name it write_on_file().

Similarly for set_tokens().

open() defaults

 with open(self.filename, 'r') as file:

Using a context handler for auto-close is beautiful, thank you.

nit: The 'r' mode is default and might be elided. If you're keen to tack on open() options, you might consider , encoding='utf8'. Personally I wouldn't, but some folks assert it should always be specified for fear of the system default being Latin1 or some crazy thing. The systems I work on have a sensible unicode system default, so the concern seldom arises.

 while line := file.readline(): self.tokens.add(line.rstrip())

Sure, it works, and it's walrussy. But please don't do that. Stick to idiomatic consumption of the iterable:

 for line in file: ...

Hmmm, we have a lot of calls to .add():

 self.tokens.add(line.rstrip())

Consuming the iterable could simply be done by set():

 with open(self.filename) as file: self.tokens = set(map(str.rstrip, file))

mutable default parameter

Please don't do this:

 def setTokens(self, tokens: set = set()) -> None:

This is valid syntax, but it has surprising results likely to trip up a future maintenance engineer. Therefore we don't write code like this, not ever, without adding a helpful # comment that offers a warning to the unwary.

SO supplies answers describing the effect.

The idiomatic way to write this would be:

 def set_tokens(self, tokens: set | None = None) -> None: self.tokens = tokens or set()

So instead of evaluating set() just once, "at compile time" when the class is being parsed into bytecode, for N calls we will evaluate set() N times, "at run time". And yes, I know, the Optional aspect is annoying and ugly, sorry about it, that's life.

little Bobby Tables

 query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token)

Ugghh! Say it isn't so!

Maybe you have a Requirements Document that patiently explains your input file shall only contain lowercase "a" .. "z" characters and newlines. I don't care. Eventually some maintenance engineer will trip over this, hard. Don't do it. Use bind variables, as the DB vendor intended. Do not attempt to merge this feature branch to main in its current state.

(BTW, ever wonder why so many URLs include base64 encoded args? Because someone's requirements document said that inputs shall never include & ampersand, ? question mark, # hash, and similar punctuation. Until things changed. So we defensively build for robustness, anticipating changes.)

nit: An INSERT is a DML "statement" with side effects, not a side effect free "query". If you don't like statement, then a vague but accurate sql identifier would suffice.

print

 file.write('{0}\n'.format(query))

nit: That's a curiously long way to write print(query, file=file). And when you feel the need to .format(), usually it's better to write an f-string: `f"{query}\n"

setters

 def setOutputFilename(self, filename: str = '') -> None: ... self.outputFilename = filename

I don't understand the rationale, here. This isn't java, we don't need to define a set_output_filename() routine, when we have a publicself.output_filename attribute. Just assign to it.

If we change our mind, we can always interpose a setter routine later, perhaps for logging or for rejecting non-existent filenames.

Similarly for those other public attributes.

Aaannnddd, given that we can't really do anything with attributes missing, wouldn't you like the __init__ ctor to insist that caller provides them from the get go?

batch size

It wasn't obvious to me how often you issue a COMMIT to the DB backend, and the Review Context didn't comment on performance.

Sometimes a single BEGIN; INSERT; ... INSERT; END transaction suffices. For a large ETL task we sometimes prefer to send a thousand or ten thousand rows at a time, with a COMMIT between batches.

What you definitely want to avoid is an ETL that does COMMIT after each INSERT. That will slow things down without offering any app-level benefits. An auto-commit setting of True will sometimes arrange for more COMMITs than a code author intended.

SQLAlchemy

There's more than one ORM that could help you with this INSERT task. Consider using SQL Alchemy for it.

summary

Write injection-free code, always.
Avoid mutable default parameters.
Think carefully about whether a class or method is truly so well-named, so self-explanatory, that it needs no docstring. (Sometimes we don't need one, but less often than you think.)
Gather mandatory parameter values up front.
Embrace Path.
Choose snake_case identifiers.
Only write {get,set}ters for _private attributes.
Raise specific errors rather than generic Exception.

This codebase appears to accomplish many of its design goals.

There are enough design issues and code defects in the code that I would not be willing to delegate or accept maintenance tasks for it in its current form.

Stack Exchange Network

Simple Python text file Parser and Converter

2 Answers 2

Naming

`Parser.parse`

`TextToMySQLConverter.write_on_file`

Bug: `converter`

Getters and Setters

`converter.tokens = parser.tokens`

`Parser`

`main` function

pass

default args

lint

open() defaults

mutable default parameter

little Bobby Tables

print

setters

batch size

SQLAlchemy

summary

Hot Network Questions

Simple Python text file Parser and Converter

2 Answers 2

Naming

Parser.parse

TextToMySQLConverter.write_on_file

Bug: converter

Getters and Setters

converter.tokens = parser.tokens

Parser

main function

pass

default args

lint

open() defaults

mutable default parameter

little Bobby Tables

print

setters

batch size

SQLAlchemy

summary

Related

Hot Network Questions

`Parser.parse`

`TextToMySQLConverter.write_on_file`

Bug: `converter`

`converter.tokens = parser.tokens`

`Parser`

`main` function