
I am writing a script that takes in many sorted .dat files. I have included sample data, but the real dataset is quite large. The desired outcome is one file containing an alphabetically sorted list of words.

I want to know if I am missing anything with regard to handling large numbers of words: mainly memory usage, catching errors, logging, and tests, while still continuing to iterate over the files without interruption.

Multiplying the sample data (the * 100000 below) lets me sort 11,000,000 or so lines with no problem. When dealing with the larger sets I want to sort them, but without crashing.

Does Python's sort() work well for this operation, or should I be looking at anything else that could be faster or more efficient?
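For context on that question: CPython's built-in sort is Timsort, which detects and merges runs that are already sorted, so concatenating pre-sorted files is close to its best case. A quick synthetic check (the sizes and alphabet here are arbitrary, chosen just for illustration):

```python
import random
import time

# Build several pre-sorted runs, mimicking sorted .dat files (sizes arbitrary).
runs = []
for _ in range(3):
    run = sorted(random.choices('abcdefghijklmnop', k=100000))
    runs.append(run)

# Concatenate the runs into one list, as if the files were read back to back.
combined = [w for run in runs for w in run]

start = time.perf_counter()
combined.sort()  # Timsort finds the existing sorted runs and merges them
elapsed = time.perf_counter() - start
print('sorted', len(combined), 'items in', round(elapsed, 4), 's')
```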

Is it worth employing the multiprocessing module to help handle these tasks? If so, what is the best implementation? While researching, I found this article, which, whilst dealing with large files, might describe a process similar to sorting the large list at the end of the function.

The 'cost' of these operations is very important for the task.

REPL

```python
dat1 = 'allotment', 'amortization', 'ampules', 'antitheses', 'aquiline', 'barnacle', 'barraged', 'bayonet', 'beechnut', 'bereavements', 'billow', 'boardinghouses', 'broadcasted', 'cheeseburgers', 'civil', 'concourse', 'coy', 'cranach', 'cratered', 'creameries', 'cubbyholes', 'cues', 'dawdle', 'director', 'disallowed', 'disgorged', 'disguise', 'dowries', 'emissions', 'epilogs', 'evict', 'expands', 'extortion', 'festoons', 'flexible', 'flukey', 'flynn', 'folksier', 'gave', 'geological', 'gigglier', 'glowered', 'grievous', 'grimm', 'hazards', 'heliotropes', 'holds', 'infliction', 'ingres', 'innocently', 'inquiries', 'intensification', 'jewelries', 'juicier', 'kathiawar', 'kicker', 'kiel', 'kinswomen', 'kit', 'kneecaps', 'kristie', 'laggards', 'libel', 'loggerhead', 'mailman', 'materials', 'menorahs', 'meringues', 'milquetoasts', 'mishap', 'mitered', 'mope', 'mortgagers', 'mumps', 'newscasters', 'niggling', 'nowhere', 'obtainable', 'organization', 'outlet', 'owes', 'paunches', 'peanuts', 'pie', 'plea', 'plug', 'predators', 'priestly', 'publish', 'quested', 'rallied', 'recumbent', 'reminiscence', 'reveal', 'reversals', 'ripples', 'sacked', 'safest', 'samoset', 'satisfy', 'saucing', 'scare', 'schoolmasters', 'scoundrels', 'scuzziest', 'shoeshine', 'shopping', 'sideboards', 'slate', 'sleeps', 'soaping', 'southwesters', 'stubbly', 'subscribers', 'sulfides', 'taxies', 'tillable', 'toastiest', 'tombstone', 'train', 'truculent', 'underlie', 'unsatisfying', 'uptight', 'wannabe', 'waugh', 'workbooks', 'allotment', 'amortization', 'ampules', 'antitheses', 'aquiline', 'barnacle', 'barraged', 'bayonet', 'beechnut', 'bereavements', 'billow', 'boardinghouses', 'broadcasted', 'cheeseburgers', 'civil', 'concourse', 'coy', 'cranach', 'cratered', 'creameries', 'cubbyholes', 'cues', 'dawdle', 'director', 'disallowed', 'disgorged', 'disguise', 'dowries', 'emissions', 'epilogs', 'evict', 'expands', 'extortion', 'festoons', 'flexible', 'flukey', 'flynn', 'folksier', 'gave', 'geological', 'gigglier', 'glowered', 'grievous', 'grimm', 'hazards', 'heliotropes', 'holds', 'infliction', 'ingres', 'innocently', 'inquiries', 'intensification', 'jewelries', 'juicier', 'kathiawar', 'kicker', 'kiel', 'kinswomen', 'kit', 'kneecaps', 'kristie', 'laggards', 'libel', 'loggerhead', 'mailman', 'materials', 'menorahs', 'meringues', 'milquetoasts', 'mishap', 'mitered', 'mope', 'mortgagers', 'mumps', 'newscasters', 'niggling', 'nowhere', 'obtainable', 'organization', 'outlet', 'owes', 'paunches', 'peanuts', 'pie', 'plea', 'plug', 'predators', 'priestly', 'publish', 'quested', 'rallied', 'recumbent', 'reminiscence', 'reveal', 'reversals', 'ripples', 'sacked', 'safest', 'samoset', 'satisfy', 'saucing', 'scare', 'schoolmasters', 'scoundrels', 'scuzziest', 'shoeshine', 'shopping', 'sideboards', 'slate', 'sleeps', 'soaping', 'southwesters', 'stubbly', 'subscribers', 'sulfides', 'taxies', 'tillable', 'toastiest', 'tombstone', 'train', 'truculent', 'underlie', 'unsatisfying', 'uptight', 'wannabe', 'waugh', 'workbooks'

dat2 = 'abut', 'actuators', 'advert', 'altitude', 'animals', 'aquaplaned', 'battlement', 'bedside', 'bludgeoning', 'boeing', 'bubblier', 'calendaring', 'callie', 'cardiology', 'caryatides', 'chechnya', 'coffey', 'collage', 'commandos', 'defensive', 'diagnosed', 'doctor', 'elaborate', 'elbow', 'enlarged', 'evening', 'flawed', 'glowers', 'guested', 'handel', 'homogenized', 'husbands', 'hypermarket', 'inge', 'inhibits', 'interloper', 'iowan', 'junco', 'junipers', 'keen', 'logjam', 'lonnie', 'louver', 'low', 'marcelo', 'marginalia', 'matchmaker', 'mold', 'monmouth', 'nautilus', 'noblest', 'north', 'novelist', 'oblations', 'official', 'omnipresent', 'orators', 'overproduce', 'passbooks', 'penalizes', 'pisses', 'precipitating', 'primness', 'quantity', 'quechua', 'rama', 'recruiters', 'recurrent', 'remembrance', 'rumple', 'saguaro', 'sailboard', 'salty', 'scherzo', 'seafarer', 'settles', 'sheryl', 'shoplifter', 'slavs', 'snoring', 'southern', 'spottiest', 'squawk', 'squawks', 'thievish', 'tightest', 'tires', 'tobacconist', 'tripling', 'trouper', 'tyros', 'unmistakably', 'unrepresentative', 'waviest'

dat3 = 'administrated', 'aggressively', 'albee', 'amble', 'announcers', 'answers', 'arequipa', 'artichoke', 'awed', 'bacillus', 'backslider', 'bandier', 'bellow', 'beset', 'billfolds', 'boneless', 'braziers', 'brick', 'budge', 'cadiz', 'calligrapher', 'clip', 'confining', 'coronets', 'crispier', 'dardanelles', 'daubed', 'deadline', 'declassifying', 'delegating', 'despairs', 'disembodying', 'dumbly', 'dynamically', 'eisenhower', 'encryption', 'estes', 'etiologies', 'evenness', 'evillest', 'expansions', 'fireproofed', 'florence', 'forcing', 'ghostwritten', 'hakluyt', 'headboards', 'hegel', 'hibernate', 'honeyed', 'hope', 'horus', 'inedible', 'inflammation', 'insincerity', 'intuitions', 'ironclads', 'jeffrey', 'knobby', 'lassoing', 'loewi', 'madwoman', 'maurois', 'mechanistic', 'metropolises', 'modified', 'modishly', 'mongols', 'motivation', 'mudslides', 'negev', 'northward', 'outperforms', 'overseer', 'passport', 'pathway', 'physiognomy', 'pi', 'platforming', 'plodder', 'pools', 'poussin', 'pragmatically', 'premeditation', 'punchier', 'puncture', 'raul', 'readjusted', 'reflectors', 'reformat', 'rein', 'relives', 'reproduces', 'restraining', 'resurrection', 'revving', 'rosily', 'sadr', 'scolloped', 'shrubbery', 'side', 'simulations', 'slashing', 'speculating', 'subsidization', 'teaser', 'tourism', 'transfers', 'transnationals', 'triple', 'undermining', 'upheavals', 'vagina', 'victims', 'weird', 'whereabouts', 'wordiness'

# lines = open(combined_file_name + file_type, 'r').readlines()
# output = open("intermediate_alphabetical_order.dat", 'w')
# for line in sorted(lines, key=lambda line: line.split()[0]):
#     output.write(line)
# output.close()

import datetime
from functools import wraps
from time import time

datetme = datetime.datetime.now()
date = datetme.strftime('%d %b %Y %H:%M:%S ').upper()

# tuples are used to read in the data to save cost of memory usage
combined_dat = dat1, dat2, dat3  # * 100000
results = []
log = ()  # TUPLE


# decorator for speed test.
def speed_test(f):
    @wraps(f)
    def wrapper(*a, **kw):
        start = time()
        result = f(*a, **kw)
        end = time()
        print('Elapsed time: {} s'.format(round(end - start, 8)))
        return result
    return wrapper


@speed_test
def merge_sort_lists(list_of_lists, *a, **kw):
    """takes in a list of lists/tuples and returns a sorted list"""
    try:
        for f in list_of_lists:
            try:
                for c in f:
                    # recursion for lists in the list of lists...
                    if isinstance(c, list):
                        merge_sort_lists([c])
                    else:
                        results.append(c)
            except:
                datetme, ":: Item: {} not added".format(c)  # Logging
                # log.append("file {} not found".format(f))
    except:
        "file {} not found".format(f)  # Logging
        # log.append("file {} not found".format(f))
    results.sort()
    with open('log.txt', 'a') as f:
        for line in log:
            f.write(line)


merge_sort_lists(combined_dat)


# Tests
def combined_length(combined):
    """calculates the length of a list of lists"""
    com_len = 0
    for i in combined:
        # if isinstance(i, list):
        #     combined_length(i)
        # else:
        com_len += int(len(i))
    return com_len


com_length = combined_length(combined_dat)
res_len = len(results)
print('\nResult Length: ', res_len, '\nCombined Lists: ', com_length)
assert(res_len == com_length)
```
Comments:

  • "Timsort is remarkably fast for merge operations." – aghast, Feb 14, 2018 at 3:43
  • "Is there likely redundancy within each single file? Would a unique-ify stage make sense?" – aghast, Feb 14, 2018 at 3:44
  • "dat1, dat2, dat3 are typical file sizes.. my main issue will be processing the sort. Excuse me for not understanding.. but what is a unique-ify stage?" – johnashu, Feb 14, 2018 at 3:50
  • "I was wondering if the individual files might contain repeats of the same word: dog, dog, dog, dog, elephant, elephant, elephant." – aghast, Feb 14, 2018 at 3:53
  • "@AustinHastings I just checked and timsort is the algorithm used in python sort()" – johnashu, Feb 14, 2018 at 3:58

1 Answer


It occurs to me that your problem description doesn't match your code.

You state the problem as "takes in many sorted .dat files" and needing to "have one file with an alphabetically sorted list of words".

This is a merge, not even a merge sort. You appear to be trying to process all of the data in memory, which is also not necessary. However, Timsort is scary-fast, so it may be that loading everything into memory, sorting it, and writing it back out is the fastest option. If the total data size is below 1 GB or below available RAM, whichever is smaller, this is likely true.

Option 1: Timsort FTW!

```python
def merge_files(outfile, *filenames):
    words = []
    for filename in filenames:
        with open(filename) as f:
            words.extend(f.readlines())
    words.sort()  # NB: in-place sort! Don't call sorted(words)!
    with open(outfile, 'w') as out:
        out.writelines(words)  # map(out.write, words) would be lazy in Python 3
```
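A self-contained check of this approach (the function is re-stated here so the snippet runs on its own, and writelines is used because map() is lazy in Python 3; the file names and word data are made up for the demo):

```python
import os
import tempfile

def merge_files(outfile, *filenames):
    words = []
    for filename in filenames:
        with open(filename) as f:
            words.extend(f.readlines())
    words.sort()  # in-place Timsort over all lines
    with open(outfile, 'w') as out:
        out.writelines(words)

# Hypothetical demo files: two small sorted word lists.
tmpdir = tempfile.mkdtemp()
paths = []
for name, data in [('a.dat', 'ant\nbee\ncow\n'), ('b.dat', 'ape\nbat\nyak\n')]:
    path = os.path.join(tmpdir, name)
    with open(path, 'w') as f:
        f.write(data)
    paths.append(path)

out_path = os.path.join(tmpdir, 'merged.dat')
merge_files(out_path, *paths)
with open(out_path) as f:
    merged = f.read().split()
print(merged)  # ['ant', 'ape', 'bat', 'bee', 'cow', 'yak']
```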

Option 2: ZOMG! How much data do you have??!!?!!

```python
def merge(*iters):
    if len(iters) == 1:
        yield from iters[0]  # a bare "return iters[0]" in a generator would yield nothing
        return
    iterlist = []
    values = []
    # Initialize values[] and also convert iters tuple to list (for del)
    for i, it in enumerate(iters):
        try:
            values.append(next(it))
            iterlist.append(it)  # Order matters: next() might throw.
        except StopIteration:
            pass
    iters = iterlist
    while values:
        nextvalue = min(values)
        yield nextvalue
        try:
            i = values.index(nextvalue)
            values[i] = next(iters[i])
        except StopIteration:
            del values[i], iters[i]


def merge_files(outfile, *filenames):
    iters = [iter(open(fn)) for fn in filenames]
    with open(outfile, 'w') as out:
        out.writelines(merge(*iters))  # unpack the list; map(out.write, ...) would be lazy in Python 3
```

This looks like it might actually make a nice function for itertools. It just needs a key= option, I guess.
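As it happens, the standard library already provides this as heapq.merge, which does a k-way merge of pre-sorted iterables and has accepted a key= argument since Python 3.5. A small sketch with made-up word lists:

```python
import heapq

# Two already-sorted inputs, as the .dat files would be.
a = ['ant', 'bee', 'cow']
b = ['ape', 'bat', 'yak']

# heapq.merge lazily yields the combined sorted sequence.
merged = list(heapq.merge(a, b))
print(merged)  # ['ant', 'ape', 'bat', 'bee', 'cow', 'yak']

# key= (Python 3.5+) allows e.g. case-insensitive merging.
case_merged = list(heapq.merge(['Ant', 'cow'], ['bee'], key=str.lower))
print(case_merged)  # ['Ant', 'bee', 'cow']
```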

Comments:

  • "Interesting, I did not know that the try block would leak variables defined inside. Not only to the except block (where you are able to access i for the del), but also outside of it..." – Graipher, Feb 14, 2018 at 7:08
  • "For merging multiple sorted iterables, Python has heapq.merge. You might also want to look at this answer." – Gareth Rees, Feb 14, 2018 at 9:14
  • "@GarethRees Good to know. I had flagged to myself that min() was likely a time sink, and perhaps the values could be stored in a heapq. Nice that someone already wrote that code for me!" – aghast, Feb 14, 2018 at 9:55
