pass
class ParserInterface(ABC): @abstractmethod def parse(self) -> None: pass
At the end there we had to write pass
, something we generally avoid unless it's syntactically necessary.
You twice passed up the opportunity to """describe""" the responsibility of the class or method. Now for the class, no big deal, it seems pretty self explanatory. But the method is a little surprising.
Typically I would expect a "parse" verb to consume some ugly text and return it as a beautiful data structure. But here the helpful annotation is explaining that we must be evaluating for side effects. Ok, fair enough. But then explain those side effects in a docstring, please. And having done that, there's no longer a necessity for that pass
.
Similar remarks for convert()
.
default args
def convert(self, tablename: str = '', columnname: str = '') -> None: ... def writeOnFile(self, filename: str = '') -> None:
The type annotation is helpful, thank you. (I am sad the file name is of str
rather than the more natural Path
.)
I can't imagine how those empty-string defaults would be helpful to a caller. Seems more likely that someone would use your Public API incorrectly, then be lulled into a sense of false confidence in the results due to lack of diagnostic exception being raised. Consider leaving them as type str
, with no default, so caller is forced to supply a value.
Similarly with the Parser
ctor. Testing with if not filename:
and then raising is just silly, elide that part. What you might sensibly do is insist on Path
, and raise
unless filename.exists()
.
Similarly down in TextToMySQLConverter
and Manager
.
And if you're going to raise
, either define an app-specific exception or choose an appropriate one such as ValueError
. Avoid raising Exception
, as that prevents callers from focusing a try
block on an expected issue that they know how to recover from.
lint
Pep-8 asks that you name it write_on_file()
.
Similarly for set_tokens()
.
open() defaults
with open(self.filename, 'r') as file:
Using a context handler for auto-close is beautiful, thank you.
nit: The 'r'
mode is default and might be elided. If you're keen to tack on open() options, you might consider , encoding='utf8'
. Personally I wouldn't, but some folks assert it should always be specified for fear of the system default being Latin1 or some crazy thing. The systems I work on have a sensible unicode system default, so the concern seldom arises.
while line := file.readline(): self.tokens.add(line.rstrip())
Sure, it works, and it's walrussy. But please don't do that. Stick to idiomatic consumption of the iterable:
for line in file: ...
Hmmm, we have a lot of calls to .add()
:
self.tokens.add(line.rstrip())
Consuming the iterable could simply be done by set()
:
with open(self.filename) as file: self.tokens = set(map(str.rstrip, file))
mutable default parameter
Please don't do this:
def setTokens(self, tokens: set = set()) -> None:
This is valid syntax, but it has surprising results likely to trip up a future maintenance engineer. Therefore we don't write code like this, not ever, without adding a helpful #
comment that offers a warning to the unwary.
SO supplies answers describing the effect.
The idiomatic way to write this would be:
def set_tokens(self, tokens: set | None = None) -> None: self.tokens = tokens or set()
So instead of evaluating set()
just once, "at compile time" when the class is being parsed into bytecode, for N calls we will evaluate set()
N times, "at run time". And yes, I know, the Optional aspect is annoying and ugly, sorry about it, that's life.
little Bobby Tables
query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token)
Ugghh! Say it isn't so!
Maybe you have a Requirements Document that patiently explains your input file shall only contain lowercase "a" .. "z" characters and newlines. I don't care. Eventually some maintenance engineer will trip over this, hard. Don't do it. Use bind variables, as the DB vendor intended. Do not attempt to merge this feature branch to main
in its current state.
(BTW, ever wonder why so many URLs include base64 encoded args? Because someone's requirements document said that inputs shall never include &
ampersand, ?
question mark, #
hash, and similar punctuation. Until things changed. So we defensively build for robustness, anticipating changes.)
nit: An INSERT is a DML "statement" with side effects, not a side effect free "query". If you don't like statement
, then a vague but accurate sql
identifier would suffice.
print
file.write('{0}\n'.format(query))
nit: That's a curiously long way to write print(query, file=file)
. And when you feel the need to .format()
, usually it's better to write an f-string: `f"{query}\n"
setters
def setOutputFilename(self, filename: str = '') -> None: ... self.outputFilename = filename
I don't understand the rationale, here. This isn't java, we don't need to define a set_output_filename()
routine, when we have a publicself.output_filename
attribute. Just assign to it.
If we change our mind, we can always interpose a setter routine later, perhaps for logging or for rejecting non-existent filenames.
Similarly for those other public attributes.
Aaannnddd, given that we can't really do anything with attributes missing, wouldn't you like the __init__
ctor to insist that caller provides them from the get go?
batch size
It wasn't obvious to me how often you issue a COMMIT to the DB backend, and the Review Context didn't comment on performance.
Sometimes a single BEGIN; INSERT; ... INSERT; END transaction suffices. For a large ETL task we sometimes prefer to send a thousand or ten thousand rows at a time, with a COMMIT between batches.
What you definitely want to avoid is an ETL that does COMMIT after each INSERT. That will slow things down without offering any app-level benefits. An auto-commit setting of True will sometimes arrange for more COMMITs than a code author intended.
SQLAlchemy
There's more than one ORM that could help you with this INSERT task. Consider using SQL Alchemy for it.
summary
- Write injection-free code, always.
- Avoid mutable default parameters.
- Think carefully about whether a class or method is truly so well-named, so self-explanatory, that it needs no docstring. (Sometimes we don't need one, but less often than you think.)
- Gather mandatory parameter values up front.
- Embrace
Path
. - Choose snake_case identifiers.
- Only write {get,set}ters for
_private
attributes. - Raise specific errors rather than generic
Exception
.
This codebase appears to accomplish many of its design goals.
There are enough design issues and code defects in the code that I would not be willing to delegate or accept maintenance tasks for it in its current form.