3
\$\begingroup\$

I wrote a simple text parser and a converter using Python 3.12.1.

The purpose is to read a text file that contains 20000 words one per line, and create an output file that contains all the queries.

Here is a snippet of the input file with fake values:

aaaaaaaa bbbbbbbb cccccccc dddddddd 

I got this file from an external source and I need to save those words in a database. For that reason, I need to read the input file and create the SQL queries. I can directly run the queries on the DB, but for simplicity, here I published a version with a text file for output.

I tried to implement the separation of concerns and the single responsability concept.

Here's my code:

#!/usr/bin/env python3 from abc import ABC, abstractmethod class ParserInterface(ABC): @abstractmethod def parse(self) -> None: pass class ConverterInterface(ABC): @abstractmethod def convert(self, tablename: str = '', columnname: str = '') -> None: pass @abstractmethod def writeOnFile(self, filename: str = '') -> None: pass class Parser(ParserInterface): def __init__(self, filename: str = '') -> None: if not filename: raise Exception('Filename cannot be empty!') self.filename = filename self.tokens = set() def parse(self) -> None: with open(self.filename, 'r') as file: while line := file.readline(): self.tokens.add(line.rstrip()) class TextToMySQLConverter(ConverterInterface): def __init__(self) -> None: self.tokens = set() self.queries = set() def setTokens(self, tokens: set = set()) -> None: self.tokens = tokens def convert(self, tablename: str = '', columnname: str = '') -> None: if not tablename: raise Exception('Tablename cannot be empty!') if not columnname: raise Exception('Columnname cannot be empty!') for token in self.tokens: query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token) self.queries.add(query) def writeOnFile(self, filename: str = '') -> None: if not filename: raise Exception('Filename cannot be empty!') with open(filename,'w') as file: for query in self.queries: file.write('{0}\n'.format(query)) class Manager: def __init__(self, parser: ParserInterface = None, converter: ConverterInterface = None) -> None: if not parser: raise Exception('Parser cannot be None!') if not converter: raise Exception('Converter cannot be None!') self.parser = parser self.converter = converter self.outputFilename = '' self.tableName = '' self.columnName = '' def setOutputFilename(self, filename: str = '') -> None: if not filename: raise Exception('Filename cannot be empty!') self.outputFilename = filename def setTableName(self, tableName: str = '') -> None: if not tableName: raise Exception('Filename cannot be empty!') self.tableName = tableName def setColumnName(self, columnName: str = '') -> None: if not columnName: raise Exception('Filename cannot be empty!') self.columnName = columnName def process(self): self.parser.parse() tokens = self.parser.tokens converter.setTokens(tokens) converter.convert(self.tableName, self.columnName) converter.writeOnFile(self.outputFilename) if __name__ == "__main__": parser = Parser('original.txt') converter = TextToMySQLConverter() # We can implement a strategy here manager = Manager(parser, converter) manager.setOutputFilename('queries.txt') manager.setTableName("tablename") manager.setColumnName("columname") manager.process() 
\$\endgroup\$
1
  • \$\begingroup\$As I wrote in the description, the input file is a file that contains 20000 words one per line.\$\endgroup\$CommentedFeb 13, 2024 at 15:48

2 Answers 2

2
\$\begingroup\$

Naming

Method/function names, parameters, and variable names should be snake_case

Parser.parse

This doesn't have to be a while loop, you can just use set.update on an iterable:

class Parser(ParserInterface): ~snip~ def parse(self) -> None: # no need for the 'r' mode, as it's the default with open(self.filename) as fh: self.tokens.update((line.rstrip() for line in fh)) 

TextToMySQLConverter.write_on_file

To write the set to the file, simply use str.join:

 def write_on_file(self, filename: str = '') -> None: if not filename: raise Exception('Filename cannot be empty!') with open(filename, 'w') as file: file.write('\n'.join(self.queries)) 

Bug: converter

I think you meant to have self.converter in Manager.process, otherwise it's relying on a global:

 def process(self): self.parser.parse() tokens = self.parser.tokens # Here converter.setTokens(tokens) converter.convert(self.tableName, self.columnName) converter.writeOnFile(self.outputFilename) 

Getters and Setters

The Manager class should be able to take output_filename, table_name, and column_name as parameters, since you seem to know these ahead of time. This way you don't need the getter and setter methods:

class Manager: def __init__( self, output_filename: str, table_name: str, column_name: str, parser: ParserInterface, # Make these required converter: ConverterInterface ) -> None: self.parser = parser self.converter = converter self.output_filename = output_filename self.table_name = table_name self.column_name = column_name def process(self): self.parser.parse() self.converter.tokens = self.parser.tokens self.converter.convert(self.table_name, self.column_name) self.converter.write_file(self.output_filename) 

However, now we have a class with two methods, one of which is __init__. This means that we should be able to safely convert this into a standalone function:

def parse_and_write( parser: ParserInterface, converter: ConverterInterface, output_filename: str, table_name: str, column_name: str ): parser.parse() converter.tokens = parser.tokens converter.convert(table_name, column_name) converter.write_file(output_filename) 

converter.tokens = parser.tokens

IMO, parser.tokens should just be an argument to converter.convert:

class TextToMySQLConverter(ConverterInterface): def __init__(self): self.tokens = set() self.queries = set() def convert(self, tokens: set[str], tablename: str, columnname: str) -> None: for token in tokens: query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token) self.queries.add(query) 

Now, this can also be refactored in the same way. Instead of returning None, return queries:

def convert_tokens(self, tokens: set[str], tablename: str, columnname: str) -> set[str]: queries = set() for token in tokens: query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token) queries.add(query) return queries 

What about write_to_file? That is a function as well, we aren't carrying around state, and it's just as easy to provide queries and output_filename as arguments:

def write_queries(file: str, queries: set[str]) -> None: with open(file, 'w') as fh: fh.write('\n'.join(queries)) 

Parser

As above, so below. Parser is also just a single function:

def parse(filename: str) -> set[str]: with open(filename) as file: return set((line.rstrip() for line in file)) 

main function

Now, your process function is really just a main:

def parse(filename: str) -> set[str]: with open(filename, 'r') as file: return set((line.rstrip() for line in file)) def convert_tokens(tokens: set[str], tablename: str, columnname: str) -> set[str]: queries = set() for token in tokens: query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token) queries.add(query) return queries def write_queries(file: str, queries: set[str]) -> None: with open(file, 'w') as fh: fh.write('\n'.join(queries)) def main(input_file: str, output_file: str, table_name: str, column_name: str): tokens = parse(input_file) queries = convert_tokens(tokens) write_queries(queries=queries, file=output_file) return if __name__ == "__main__": main( input_file='original.txt', output_file='queries.txt', table_name='tablename', column_name='columnname' ) ```
\$\endgroup\$
    1
    \$\begingroup\$

    pass

    class ParserInterface(ABC): @abstractmethod def parse(self) -> None: pass 

    At the end there we had to write pass, something we generally avoid unless it's syntactically necessary.

    You twice passed up the opportunity to """describe""" the responsibility of the class or method. Now for the class, no big deal, it seems pretty self explanatory. But the method is a little surprising.

    Typically I would expect a "parse" verb to consume some ugly text and return it as a beautiful data structure. But here the helpful annotation is explaining that we must be evaluating for side effects. Ok, fair enough. But then explain those side effects in a docstring, please. And having done that, there's no longer a necessity for that pass.

    Similar remarks for convert().

    default args

     def convert(self, tablename: str = '', columnname: str = '') -> None: ... def writeOnFile(self, filename: str = '') -> None: 

    The type annotation is helpful, thank you. (I am sad the file name is of str rather than the more natural Path.)

    I can't imagine how those empty-string defaults would be helpful to a caller. Seems more likely that someone would use your Public API incorrectly, then be lulled into a sense of false confidence in the results due to lack of diagnostic exception being raised. Consider leaving them as type str, with no default, so caller is forced to supply a value.

    Similarly with the Parser ctor. Testing with if not filename: and then raising is just silly, elide that part. What you might sensibly do is insist on Path, and raise unless filename.exists().

    Similarly down in TextToMySQLConverter and Manager.

    And if you're going to raise, either define an app-specific exception or choose an appropriate one such as ValueError. Avoid raising Exception, as that prevents callers from focusing a try block on an expected issue that they know how to recover from.

    lint

    Pep-8 asks that you name it write_on_file().

    Similarly for set_tokens().

    open() defaults

     with open(self.filename, 'r') as file: 

    Using a context handler for auto-close is beautiful, thank you.

    nit: The 'r' mode is default and might be elided. If you're keen to tack on open() options, you might consider , encoding='utf8'. Personally I wouldn't, but some folks assert it should always be specified for fear of the system default being Latin1 or some crazy thing. The systems I work on have a sensible unicode system default, so the concern seldom arises.

     while line := file.readline(): self.tokens.add(line.rstrip()) 

    Sure, it works, and it's walrussy. But please don't do that. Stick to idiomatic consumption of the iterable:

     for line in file: ... 

    Hmmm, we have a lot of calls to .add():

     self.tokens.add(line.rstrip()) 

    Consuming the iterable could simply be done by set():

     with open(self.filename) as file: self.tokens = set(map(str.rstrip, file)) 

    mutable default parameter

    Please don't do this:

     def setTokens(self, tokens: set = set()) -> None: 

    This is valid syntax, but it has surprising results likely to trip up a future maintenance engineer. Therefore we don't write code like this, not ever, without adding a helpful # comment that offers a warning to the unwary.

    SO supplies answers describing the effect.

    The idiomatic way to write this would be:

     def set_tokens(self, tokens: set | None = None) -> None: self.tokens = tokens or set() 

    So instead of evaluating set() just once, "at compile time" when the class is being parsed into bytecode, for N calls we will evaluate set() N times, "at run time". And yes, I know, the Optional aspect is annoying and ugly, sorry about it, that's life.

    little Bobby Tables

     query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token) 

    Ugghh! Say it isn't so!

    Maybe you have a Requirements Document that patiently explains your input file shall only contain lowercase "a" .. "z" characters and newlines. I don't care. Eventually some maintenance engineer will trip over this, hard. Don't do it. Use bind variables, as the DB vendor intended. Do not attempt to merge this feature branch to main in its current state.

    (BTW, ever wonder why so many URLs include base64 encoded args? Because someone's requirements document said that inputs shall never include & ampersand, ? question mark, # hash, and similar punctuation. Until things changed. So we defensively build for robustness, anticipating changes.)

    nit: An INSERT is a DML "statement" with side effects, not a side effect free "query". If you don't like statement, then a vague but accurate sql identifier would suffice.

    print

     file.write('{0}\n'.format(query)) 

    nit: That's a curiously long way to write print(query, file=file). And when you feel the need to .format(), usually it's better to write an f-string: `f"{query}\n"

    setters

     def setOutputFilename(self, filename: str = '') -> None: ... self.outputFilename = filename 

    I don't understand the rationale, here. This isn't java, we don't need to define a set_output_filename() routine, when we have a publicself.output_filename attribute. Just assign to it.

    If we change our mind, we can always interpose a setter routine later, perhaps for logging or for rejecting non-existent filenames.

    Similarly for those other public attributes.

    Aaannnddd, given that we can't really do anything with attributes missing, wouldn't you like the __init__ ctor to insist that caller provides them from the get go?

    batch size

    It wasn't obvious to me how often you issue a COMMIT to the DB backend, and the Review Context didn't comment on performance.

    Sometimes a single BEGIN; INSERT; ... INSERT; END transaction suffices. For a large ETL task we sometimes prefer to send a thousand or ten thousand rows at a time, with a COMMIT between batches.

    What you definitely want to avoid is an ETL that does COMMIT after each INSERT. That will slow things down without offering any app-level benefits. An auto-commit setting of True will sometimes arrange for more COMMITs than a code author intended.

    SQLAlchemy

    There's more than one ORM that could help you with this INSERT task. Consider using SQL Alchemy for it.

    summary

    1. Write injection-free code, always.
    2. Avoid mutable default parameters.
    3. Think carefully about whether a class or method is truly so well-named, so self-explanatory, that it needs no docstring. (Sometimes we don't need one, but less often than you think.)
    4. Gather mandatory parameter values up front.
    5. Embrace Path.
    6. Choose snake_case identifiers.
    7. Only write {get,set}ters for _private attributes.
    8. Raise specific errors rather than generic Exception.

    This codebase appears to accomplish many of its design goals.

    There are enough design issues and code defects in the code that I would not be willing to delegate or accept maintenance tasks for it in its current form.

    \$\endgroup\$

      Start asking to get answers

      Find the answer to your question by asking.

      Ask question

      Explore related questions

      See similar questions with these tags.