I am writing a script that takes in many sorted .dat files and produces a single file containing an alphabetically sorted list of words. I have included sample data below, but the real dataset is much larger.
I want to know whether I am missing anything when dealing with large numbers of words, mainly around memory usage, error handling, logging, and tests, while still continuing to iterate over the files without interruption.
Multiplying the sample data by 100000 lets me sort roughly 11,000,000 lines without a problem; with the larger sets I want to do the same sort without crashing.
Does Python's sort() work well for this operation, or should I be looking at something else that could be faster or more efficient?
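For example, since the input files are already sorted, I have been wondering whether a streaming k-way merge with heapq.merge would be more memory friendly than building one giant list and calling sort(). A rough sketch of what I mean (the file names are placeholders, and it assumes one word per line in each sorted input file):

import heapq
from contextlib import ExitStack


def merge_sorted_files(paths, out_path):
    """Lazily merge already-sorted word files into one sorted output file."""
    with ExitStack() as stack:
        files = [stack.enter_context(open(p)) for p in paths]
        with open(out_path, 'w') as out:
            # heapq.merge pulls one line at a time from each file,
            # so the whole dataset never has to fit in memory at once
            for line in heapq.merge(*files):
                out.write(line)


# placeholder file names
merge_sorted_files(['words1.dat', 'words2.dat', 'words3.dat'], 'combined_alphabetical.dat')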
Is it worth employing the multiprocessing module to help handle these tasks, and if so, what is the best implementation? While researching I found this article which, although it deals with large files, may describe a similar process to sorting the large list at the end of my function.
The 'cost' of these operations is very important for the task.
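To make the multiprocessing question concrete, this is the kind of approach I had in mind: split the combined data into chunks, sort each chunk in a worker process, then merge the sorted chunks. The chunk size and worker count here are made-up values, not something I have benchmarked:

import heapq
from multiprocessing import Pool


def sort_chunk(chunk):
    # each worker process sorts its own slice of the data
    return sorted(chunk)


def parallel_sort(words, workers=4, chunk_size=1_000_000):
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    with Pool(workers) as pool:
        sorted_chunks = pool.map(sort_chunk, chunks)
    # k-way merge of the already-sorted chunks back into one sorted list
    return list(heapq.merge(*sorted_chunks))

My sample data and the current script are below.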
dat1 = (
    'allotment', 'amortization', 'ampules', 'antitheses', 'aquiline', 'barnacle', 'barraged', 'bayonet',
    'beechnut', 'bereavements', 'billow', 'boardinghouses', 'broadcasted', 'cheeseburgers', 'civil',
    'concourse', 'coy', 'cranach', 'cratered', 'creameries', 'cubbyholes', 'cues', 'dawdle', 'director',
    'disallowed', 'disgorged', 'disguise', 'dowries', 'emissions', 'epilogs', 'evict', 'expands',
    'extortion', 'festoons', 'flexible', 'flukey', 'flynn', 'folksier', 'gave', 'geological', 'gigglier',
    'glowered', 'grievous', 'grimm', 'hazards', 'heliotropes', 'holds', 'infliction', 'ingres',
    'innocently', 'inquiries', 'intensification', 'jewelries', 'juicier', 'kathiawar', 'kicker', 'kiel',
    'kinswomen', 'kit', 'kneecaps', 'kristie', 'laggards', 'libel', 'loggerhead', 'mailman', 'materials',
    'menorahs', 'meringues', 'milquetoasts', 'mishap', 'mitered', 'mope', 'mortgagers', 'mumps',
    'newscasters', 'niggling', 'nowhere', 'obtainable', 'organization', 'outlet', 'owes', 'paunches',
    'peanuts', 'pie', 'plea', 'plug', 'predators', 'priestly', 'publish', 'quested', 'rallied',
    'recumbent', 'reminiscence', 'reveal', 'reversals', 'ripples', 'sacked', 'safest', 'samoset',
    'satisfy', 'saucing', 'scare', 'schoolmasters', 'scoundrels', 'scuzziest', 'shoeshine', 'shopping',
    'sideboards', 'slate', 'sleeps', 'soaping', 'southwesters', 'stubbly', 'subscribers', 'sulfides',
    'taxies', 'tillable', 'toastiest', 'tombstone', 'train', 'truculent', 'underlie', 'unsatisfying',
    'uptight', 'wannabe', 'waugh', 'workbooks',
    'allotment', 'amortization', 'ampules', 'antitheses', 'aquiline', 'barnacle', 'barraged', 'bayonet',
    'beechnut', 'bereavements', 'billow', 'boardinghouses', 'broadcasted', 'cheeseburgers', 'civil',
    'concourse', 'coy', 'cranach', 'cratered', 'creameries', 'cubbyholes', 'cues', 'dawdle', 'director',
    'disallowed', 'disgorged', 'disguise', 'dowries', 'emissions', 'epilogs', 'evict', 'expands',
    'extortion', 'festoons', 'flexible', 'flukey', 'flynn', 'folksier', 'gave', 'geological', 'gigglier',
    'glowered', 'grievous', 'grimm', 'hazards', 'heliotropes', 'holds', 'infliction', 'ingres',
    'innocently', 'inquiries', 'intensification', 'jewelries', 'juicier', 'kathiawar', 'kicker', 'kiel',
    'kinswomen', 'kit', 'kneecaps', 'kristie', 'laggards', 'libel', 'loggerhead', 'mailman', 'materials',
    'menorahs', 'meringues', 'milquetoasts', 'mishap', 'mitered', 'mope', 'mortgagers', 'mumps',
    'newscasters', 'niggling', 'nowhere', 'obtainable', 'organization', 'outlet', 'owes', 'paunches',
    'peanuts', 'pie', 'plea', 'plug', 'predators', 'priestly', 'publish', 'quested', 'rallied',
    'recumbent', 'reminiscence', 'reveal', 'reversals', 'ripples', 'sacked', 'safest', 'samoset',
    'satisfy', 'saucing', 'scare', 'schoolmasters', 'scoundrels', 'scuzziest', 'shoeshine', 'shopping',
    'sideboards', 'slate', 'sleeps', 'soaping', 'southwesters', 'stubbly', 'subscribers', 'sulfides',
    'taxies', 'tillable', 'toastiest', 'tombstone', 'train', 'truculent', 'underlie', 'unsatisfying',
    'uptight', 'wannabe', 'waugh', 'workbooks'
)

dat2 = (
    'abut', 'actuators', 'advert', 'altitude', 'animals', 'aquaplaned', 'battlement', 'bedside',
    'bludgeoning', 'boeing', 'bubblier', 'calendaring', 'callie', 'cardiology', 'caryatides', 'chechnya',
    'coffey', 'collage', 'commandos', 'defensive', 'diagnosed', 'doctor', 'elaborate', 'elbow',
    'enlarged', 'evening', 'flawed', 'glowers', 'guested', 'handel', 'homogenized', 'husbands',
    'hypermarket', 'inge', 'inhibits', 'interloper', 'iowan', 'junco', 'junipers', 'keen', 'logjam',
    'lonnie', 'louver', 'low', 'marcelo', 'marginalia', 'matchmaker', 'mold', 'monmouth', 'nautilus',
    'noblest', 'north', 'novelist', 'oblations', 'official', 'omnipresent', 'orators', 'overproduce',
    'passbooks', 'penalizes', 'pisses', 'precipitating', 'primness', 'quantity', 'quechua', 'rama',
    'recruiters', 'recurrent', 'remembrance', 'rumple', 'saguaro', 'sailboard', 'salty', 'scherzo',
    'seafarer', 'settles', 'sheryl', 'shoplifter', 'slavs', 'snoring', 'southern', 'spottiest',
    'squawk', 'squawks', 'thievish', 'tightest', 'tires', 'tobacconist', 'tripling', 'trouper', 'tyros',
    'unmistakably', 'unrepresentative', 'waviest'
)

dat3 = (
    'administrated', 'aggressively', 'albee', 'amble', 'announcers', 'answers', 'arequipa', 'artichoke',
    'awed', 'bacillus', 'backslider', 'bandier', 'bellow', 'beset', 'billfolds', 'boneless', 'braziers',
    'brick', 'budge', 'cadiz', 'calligrapher', 'clip', 'confining', 'coronets', 'crispier',
    'dardanelles', 'daubed', 'deadline', 'declassifying', 'delegating', 'despairs', 'disembodying',
    'dumbly', 'dynamically', 'eisenhower', 'encryption', 'estes', 'etiologies', 'evenness', 'evillest',
    'expansions', 'fireproofed', 'florence', 'forcing', 'ghostwritten', 'hakluyt', 'headboards', 'hegel',
    'hibernate', 'honeyed', 'hope', 'horus', 'inedible', 'inflammation', 'insincerity', 'intuitions',
    'ironclads', 'jeffrey', 'knobby', 'lassoing', 'loewi', 'madwoman', 'maurois', 'mechanistic',
    'metropolises', 'modified', 'modishly', 'mongols', 'motivation', 'mudslides', 'negev', 'northward',
    'outperforms', 'overseer', 'passport', 'pathway', 'physiognomy', 'pi', 'platforming', 'plodder',
    'pools', 'poussin', 'pragmatically', 'premeditation', 'punchier', 'puncture', 'raul', 'readjusted',
    'reflectors', 'reformat', 'rein', 'relives', 'reproduces', 'restraining', 'resurrection', 'revving',
    'rosily', 'sadr', 'scolloped', 'shrubbery', 'side', 'simulations', 'slashing', 'speculating',
    'subsidization', 'teaser', 'tourism', 'transfers', 'transnationals', 'triple', 'undermining',
    'upheavals', 'vagina', 'victims', 'weird', 'whereabouts', 'wordiness'
)

# lines = open(combined_file_name + file_type, 'r').readlines()
# output = open("intermediate_alphabetical_order.dat", 'w')
# for line in sorted(lines, key=lambda line: line.split()[0]):
#     output.write(line)
# output.close()

import datetime
from functools import wraps
from time import time

datetme = datetime.datetime.now()
date = datetme.strftime('%d %b %Y %H:%M:%S ').upper()

# tuples are used to read in the data to save cost of memory usage
combined_dat = dat1, dat2, dat3  # * 100000
results = []
log = ()  # TUPLE


# decorator for speed test.
def speed_test(f):
    @wraps(f)
    def wrapper(*a, **kw):
        start = time()
        result = f(*a, **kw)
        end = time()
        print('Elapsed time: {} s'.format(round(end - start, 8)))
        return result
    return wrapper


@speed_test
def merge_sort_lists(list_of_lists, *a, **kw):
    """takes in a list of lists/tuples and returns a sorted list"""
    try:
        for f in list_of_lists:
            try:
                for c in f:
                    # recursion for lists in the list of lists...
                    if isinstance(c, list):
                        merge_sort_lists([c])
                    else:
                        results.append(c)
            except:
                datetme, ":: Item: {} not added".format(c)  # Logging
                # log.append("file {} not found".format(f))
    except:
        "file {} not found".format(f)  # Logging
        # log.append("file {} not found".format(f))
    results.sort()
    with open('log.txt', 'a') as f:
        for line in log:
            f.write(line)


merge_sort_lists(combined_dat)


# Tests
def combined_length(combined):
    """calculates the length of a list of lists"""
    com_len = 0
    for i in combined:
        # if isinstance(i, list):
        #     combined_length(i)
        # else:
        com_len += int(len(i))
    return com_len


com_length = combined_length(combined_dat)
res_len = len(results)
print('\nResult Length: ', res_len, '\nCombined Lists: ', com_length)
assert(res_len == com_length)
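On the testing side, besides the length check above I am considering also asserting that the merged result really is in order, something like:

# every element should be <= its successor if the result is sorted
assert all(results[i] <= results[i + 1] for i in range(len(results) - 1))

An equivalent check would be assert results == sorted(results), but that builds a second full copy of the list in memory.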