
I've implemented a minimal external sort of a text file using Python's heapq module.

On the few tests I ran it seems to work well, but I would like some advice on making the code cleaner and faster. I don't know much about good practices and I want to learn (I may wish to go from academia to industry one day). All remarks, advice and suggestions are warmly welcome.

There are 3 functions: one that splits the big file into smaller files, one that does the merge, and one main function.

```python
import os
import tempfile
import heapq
import sys
import shutil

# Algorithm based on
# https://github.com/melvilgit/external-Merge-Sort


def split_large_file(starting_file, my_temp_dir, max_line=1000000):
    """
    :param starting_file: input file to be splitted
    :param my_temp_dir: temporary directory
    :param max_line: number of line to put in each smaller file (ram usage)
    :return: a list with all TemporaryFile
    """
    liste_file = []
    line_holder = []
    cpt = 0
    with open(starting_file, 'rb') as f_in:
        for line in f_in:
            line_holder.append(line)
            cpt += 1
            if cpt % max_line == 0:
                cpt = 0
                line_holder.sort(key=lambda x: x.split(b"\t")[0])
                temp_file = tempfile.NamedTemporaryFile(dir=my_temp_dir, delete=False)
                temp_file.writelines(line_holder)
                temp_file.seek(0)
                line_holder = []
                liste_file.append(temp_file)
    if line_holder:
        line_holder.sort(key=lambda x: x.split(b"\t")[0])
        temp_file = tempfile.NamedTemporaryFile(dir=my_temp_dir, delete=False)
        temp_file.writelines(line_holder)
        temp_file.seek(0)
        liste_file.append(temp_file)
    return liste_file


def merged(liste_file, out_file, col):
    """
    :param liste_file: a list with all temporary file opened
    :param out_file: the output file
    :param col: the column where to perform the sort, being minimal the script will fail if one column is shorter than this value
    :return: path to output file
    """
    my_heap = []
    for elem in liste_file:
        line = elem.readline()
        spt = line.split(b"\t")
        heapq.heappush(my_heap, [int.from_bytes(spt[col], "big"), line, elem])
    with open(out_file, "wb") as out:
        while True:
            minimal = my_heap[0]
            if minimal[0] == sys.maxsize:
                break
            out.write(minimal[1])
            file_temp = minimal[2]
            line = file_temp.readline()
            if not line:
                my_heap[0] = [sys.maxsize, None, None]
                os.remove(file_temp.name)
            else:
                spt = line.split(b"\t")
                my_heap[0] = [int.from_bytes(spt[col], "big"), line, file_temp]
            heapq.heapify(my_heap)
    return out_file


def main(big_file, outfile, tmp_dir=None, max_line=1000000, column=0):
    if not tmp_dir:
        tmp_dir = os.getcwd()
    with tempfile.TemporaryDirectory(dir=tmp_dir) as my_temp_dir:
        temp_dir_file_list = split_large_file(big_file, my_temp_dir, max_line)
        print("splitted")
        merged(liste_file=temp_dir_file_list, out_file=outfile, col=column)
        print("file merged, sorting done")
```

1 Answer


    Nit: running isort would organize your imports, to reduce git merge conflicts.

    The split_large_file() signature might default my_temp_dir. Consider renaming it to simply temp_dir. For compatibility with /usr/bin/sort, os.environ['TMPDIR'] would be a fair default.
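A sketch of that signature with the suggested renames; the gettempdir() fallback is my addition for portability when TMPDIR is unset:

```python
import os
import tempfile

def split_large_file(in_file, temp_dir=None, max_line=1000000):
    # Fall back to $TMPDIR, like /usr/bin/sort, then to the platform default.
    if temp_dir is None:
        temp_dir = os.environ.get('TMPDIR', tempfile.gettempdir())
    ...
```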

    Consider renaming starting_file to in_file (or input_file).

    Docstrings start with an English sentence, e.g. "Splits a file into pieces." If the intent is "...into sorted pieces," then say so, and make the function name match. Splitting a file is by itself a valuable service, even without sorting.
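For instance (the wording is just a suggestion):

```python
def split_large_file(in_file, temp_dir, max_line=1000000):
    """Split a file into sorted pieces.

    Each piece holds at most max_line lines, sorted on the first
    TAB-separated column, and is spilled to a named temporary file.
    """
```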

    Please say "a list of" rather than "a list with all".

    It appears liste_file has a typo, plus it should simply be files. Simplify line_holder to lines.

    algorithm

    1. line budget vs RAM budget
    2. chunking
    3. loop style

    You impose a million-line max on each piece. But the true goal was to cap RAM consumption, which is indirectly related through the average chars-per-line for a given input file. Consider adding up line lengths rather than incrementing by one for each line.
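A sketch of the byte-budget idea; max_bytes is a hypothetical parameter replacing max_line, and write_chunk is a hypothetical helper sketched under the loop-style point below:

```python
def split_large_file(in_file, temp_dir, max_bytes=64 * 1024 * 1024):
    """Split in_file into sorted chunks of roughly max_bytes bytes each."""
    files, lines, size = [], [], 0
    with open(in_file, 'rb') as f_in:
        for line in f_in:
            lines.append(line)
            size += len(line)          # budget bytes, not lines
            if size >= max_bytes:
                write_chunk(lines, temp_dir, files)
                lines, size = [], 0
    if lines:                          # final partial chunk
        write_chunk(lines, temp_dir, files)
    return files
```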

Related to this, the line-by-line iteration is on the slow side, especially for input files composed of short lines. Consider doing one or more binary .read()s, then reading up to the next newline so that only complete lines go into each chunk. (Or scan backwards through the chunk to find a newline, then seek() to that position before issuing the next read.)
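A sketch of the first approach; CHUNK is an assumed constant, and the trailing readline() completes the final partial line so each block ends on a newline:

```python
CHUNK = 1 << 24  # 16 MiB per read; tune to the RAM budget

def read_chunks(in_file):
    """Yield lists of complete lines, reading CHUNK bytes at a time."""
    with open(in_file, 'rb') as f_in:
        while True:
            block = f_in.read(CHUNK)
            if not block:
                return
            block += f_in.readline()           # extend to the next newline
            yield block.splitlines(keepends=True)
```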

The `if line_holder:` line is a code smell; it suggests that you didn't quite find a form of while loop that matched your application. Perhaps a boolean done flag would help to remove the copy-and-pasted 2nd chunk of code. If for some reason you feel it still needs to remain, then definitely extract both copies into a small helper function. DRY: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
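Such a helper might look like this; write_chunk is a name I'm introducing, and the body is the sort-and-spill block lifted from your code:

```python
import tempfile

def write_chunk(lines, temp_dir, files):
    """Sort one chunk on the first TAB-separated column and spill it."""
    lines.sort(key=lambda x: x.split(b"\t")[0])
    temp_file = tempfile.NamedTemporaryFile(dir=temp_dir, delete=False)
    temp_file.writelines(lines)
    temp_file.seek(0)
    files.append(temp_file)
```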

    This is correct, but weird:

```python
if cpt % max_line == 0:
    cpt = 0
```

If the line counter cpt (I don't know what that stands for) hits the limit, we reset it and emit a chunk. Computing a modulo % is more expensive than a simple comparison. Usually modulo would be used if we're not going to reset the counter.
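The same reset written with a comparison, as a sketch:

```python
cpt += 1
if cpt >= max_line:   # plain comparison instead of cpt % max_line == 0
    cpt = 0
    # ...sort and spill the chunk as before
```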

    Three comments about the lambda:

    1. This is the first we've heard about the input file being a .tsv. Improve the docstring.
    2. The lambda is fine, but consider breaking it out as a named function.
    3. The .split() copies out the entire line, but it only needs to copy as far as the first TAB (see the sketch below).
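A possible named key covering points 2 and 3; first_field is a name I'm introducing, and find() locates the first TAB so only that prefix is copied:

```python
def first_field(line):
    """Sort key: the bytes before the first TAB (the whole line if none)."""
    tab = line.find(b"\t")
    return line[:tab] if tab >= 0 else line

line_holder.sort(key=first_field)
```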

    We evaluate merged() for side effects, so we're looking for a verb name rather than an adjective. Just call it merge(), or perhaps external_merge(). Consider defaulting col=0 in the signature.

    Each docstring should start with an English sentence.

The comment "being minimal the script will fail if one column is shorter than this value" is accurate. But it's better to rewrite it as a precondition on valid inputs: "must exist in each input row."

```python
my_heap = []
```

    Yes, it is your heap. But that's not what is interesting about it. Better to just call it heap.

```python
for elem in liste_file:
```

This would more naturally be: `for file in files:`

You used a hardcoded column zero in split_large_file(), but merged() keys on col, so whenever col is positive the chunks were sorted on a different column than the merge key and you are apparently processing effectively unsorted input.
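One way to reconcile the two phases, as a sketch; the col parameter on split_large_file() and both helper names are my additions. Sorting each chunk with the very key merged() compares on means the two orderings cannot disagree:

```python
def merge_key(line, col):
    """The key merged() uses; reuse it when sorting each chunk."""
    return int.from_bytes(line.split(b"\t")[col], "big")

def sort_chunk(line_holder, col):
    """Replacement for the hardcoded column-zero sort in split_large_file(..., col=0)."""
    line_holder.sort(key=lambda line: merge_key(line, col))
```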

Please assign `sentinel = sys.maxsize`, if that is how you're trying to use it, or at least add a comment. (It is 0x7FFFFFFFFFFFFFFF; is that what you want?) The break apparently should be raising a fatal exception instead, since the input file apparently is not allowed to contain that sentinel (not allowed to contain arbitrary contents).

The references to minimal[0], [1], and [2] are obscure. Please use tuple unpacking to put them in val, line, file_temp or similarly informative variables.
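A sketch of merged()'s inner loop with both suggestions applied; SENTINEL is the named constant proposed above, and the rest is your code rearranged:

```python
SENTINEL = sys.maxsize  # 0x7FFFFFFFFFFFFFFF; valid keys must stay below this

with open(out_file, "wb") as out:
    while True:
        val, line, file_temp = my_heap[0]
        if val == SENTINEL:
            break                          # every input file is drained
        out.write(line)
        next_line = file_temp.readline()
        if not next_line:
            my_heap[0] = [SENTINEL, None, None]
            os.remove(file_temp.name)
        else:
            key = int.from_bytes(next_line.split(b"\t")[col], "big")
            my_heap[0] = [key, next_line, file_temp]
        heapq.heapify(my_heap)
```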

The early explicit .remove() is fine. If you had specified delete=True (or let it default), the temp files would instead be deleted as each one is closed, at the latest when the interpreter does a normal exit.

    Depending on platform, defaulting tmp_dir to something like /tmp might be more appropriate than to CWD (which might be unwritable).
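One way to express that default; tempfile.gettempdir() already encodes the platform convention:

```python
import tempfile

def main(big_file, outfile, tmp_dir=None, max_line=1000000, column=0):
    if not tmp_dir:
        tmp_dir = tempfile.gettempdir()   # honors TMPDIR; /tmp on most Unixes
    ...
```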

    Overall, looks pretty good, pretty readable. One can always improve.

"Thank you a lot! It's exactly what I needed. There are a lot of small things to improve that I never really paid attention to." – RomainL., May 29, 2019 at 11:07
