I am working on a practice problem found here: https://www.hackinscience.org/exercises/sequence-mining
Overview
Sequence mining is useful for analyzing lists of events when order matters. It's used, for example, in DNA and natural-language analysis.
Here, you'll implement an algorithm to extract common patterns among a set of sequences.
Exercise
You must provide a function seq_mining that takes as arguments:
A list of strings (the sequences), such as: ['CFEDS', 'SKDJFGKSJDFG', 'OITOER']
The minimum proportion of sequences that must contain a pattern for it to be taken into account (a float between 0 and 1): if 0.34, at least a third of the sequences must contain a given pattern for the function to return it.
The maximum pattern length to consider (an int).
If a given pattern occurs several times in a sequence, it must be counted only once.
The function seq_mining must return a Counter containing:
The found patterns as keys, and the number of sequences containing each pattern as values.
In ["ABC", "BCD"] there are three patterns common to both sequences:
B, C, and BC
(A and AB occur only in the first string, and D and CD only in the second.)
So seq_mining(["ABC", "BCD"], 0.66, 2) (searching for patterns of length 2 at most that appear in at least 66% of the sequences) should return:
Counter({'B': 2, 'C': 2, 'BC': 2})
(because with only two sequences, 66% already means both of them),
while seq_mining(["ABC", "BCD"], 0.33, 2) (patterns of length 2 at most, appearing in at least 33% of the sequences) should return:
Counter({'C': 2, 'B': 2, 'BC': 2, 'A': 1, 'D': 1, 'AB': 1, 'CD': 1})
This tells us that BC is found in two sequences while AB is found in only one (with only two sequences, 33% allows a pattern that appears in a single sequence).
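To pin down what the spec asks for, here is a deliberately naive reference sketch I wrote for myself (the name seq_mining_reference and its structure are my own, not part of the exercise); it reproduces both example outputs above:

```python
from collections import Counter

def seq_mining_reference(sequences, prop, max_len):
    # Collect the DISTINCT substrings (length 1..max_len) of each sequence
    # in a set, so a pattern repeated within one sequence counts only once.
    counts = Counter()
    for seq in sequences:
        patterns = {seq[i:i + n]
                    for n in range(1, max_len + 1)
                    for i in range(len(seq) - n + 1)}
        counts.update(patterns)
    # Keep only patterns present in at least prop * len(sequences) sequences.
    threshold = prop * len(sequences)
    return Counter({p: c for p, c in counts.items() if c >= threshold})

print(seq_mining_reference(["ABC", "BCD"], 0.66, 2))
# == Counter({'B': 2, 'C': 2, 'BC': 2})
```

Using a set per sequence handles the "counted only once per sequence" rule directly, and updating one Counter avoids a separate counting pass.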
In this problem, I need to build a function that goes over a list of strings and finds patterns, up to a specific length, that are present in at least a specific proportion of the given strings. The function takes 3 arguments: a list of strings, a float giving the required proportion of strings a pattern must appear in to be counted, and the maximum length of the possible patterns. The function must return a Counter() object.
I built a function that seems to work, although I know it's not pretty or efficient. When I initially submitted the solution, it failed because the run time was too long. I'm not too familiar with optimizing code yet because I'm still a beginner, but I did some reading and tried to implement some changes, such as swapping a for loop for a while loop and using list comprehensions where possible. I think this made the code much less readable, and it seems to run even slower. I think my function is poorly designed overall, but before I start over I'd like to know if there are any obvious changes I'm not seeing. I suspect I'm heading in the wrong direction with this code and could use some expertise here.
```python
import itertools, collections

def seq_mining(data, prop, max_int):
    list_of_sequences = enumerate(data)
    pos_dict = collections.defaultdict(list)
    for seq_index, seq in list_of_sequences:
        for step in range(1, max_int + 1):
            for sub_seq in range(0, len(seq)):
                if len(seq[sub_seq:sub_seq + step]) >= step:
                    pos_dict[seq_index].append(seq[sub_seq:sub_seq + step])
    num_occurrence = {k: sum(1 for l in pos_dict.values() if k in l)
                      for k in set(sum(pos_dict.values(), []))}
    final_dict = {}
    for l, t in num_occurrence.items():
        if t / len(data) >= float(prop):
            final_dict[l] = t
    c = collections.Counter(final_dict)
    return c
```
The above code was my initial submission, which failed. I did some reading and attempted to improve it, but I think I made it worse. My second attempt is below:
```python
def seq_mining(data, prop, max_int):
    list_of_sequences = enumerate(data)
    pos_dict = collections.defaultdict(list)
    for seq_index, seq in list_of_sequences:
        step = 1
        while step <= max_int:
            [pos_dict[seq_index].append(seq[sub_seq:sub_seq + step])
             for sub_seq in range(0, len(seq))
             if len(seq[sub_seq:sub_seq + step]) >= step]
            step += 1
    num_occurrence = {k: sum(1 for l in pos_dict.values() if k in l)
                      for k in set(sum(pos_dict.values(), []))}
    final_dict = dict([(l, t) for l, t in num_occurrence.items()
                       if t / len(data) >= float(prop)])
    c = collections.Counter(final_dict)
    return c
```
This is significantly more confusing to me. I read about a module called cProfile, which can help locate areas that can be improved. The output below is what I got after running the second attempt.
```
958 function calls in 0.001 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 __init__.py:581(__init__)
        1    0.000    0.000    0.000    0.000 __init__.py:649(update)
        1    0.000    0.000    0.000    0.000 abc.py:96(__instancecheck__)
       10    0.000    0.000    0.000    0.000 scratch_5.py:12(<listcomp>)
        1    0.000    0.000    0.001    0.001 scratch_5.py:14(<dictcomp>)
      254    0.000    0.000    0.000    0.000 scratch_5.py:14(<genexpr>)
        1    0.000    0.000    0.000    0.000 scratch_5.py:15(<listcomp>)
        1    0.000    0.000    0.001    0.001 scratch_5.py:6(pos_func)
        1    0.000    0.000    0.000    0.000 {built-in method _abc._abc_instancecheck}
        1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      293    0.000    0.000    0.000    0.000 {built-in method builtins.len}
      124    0.000    0.000    0.000    0.000 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {function Counter.update at 0x000001E9D0C0F430}
      140    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
      124    0.000    0.000    0.000    0.000 {method 'values' of 'dict' objects}
```
The cProfile output below is for my first attempt at the function.
```
953 function calls (951 primitive calls) in 0.001 seconds

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 __init__.py:581(__init__)
        1    0.000    0.000    0.000    0.000 __init__.py:649(update)
        2    0.000    0.000    0.000    0.000 _collections_abc.py:405(__subclasshook__)
      2/1    0.000    0.000    0.000    0.000 abc.py:100(__subclasscheck__)
        1    0.000    0.000    0.000    0.000 abc.py:96(__instancecheck__)
        1    0.000    0.000    0.000    0.000 scratch_4.py:13(<dictcomp>)
      254    0.000    0.000    0.000    0.000 scratch_4.py:13(<genexpr>)
        1    0.000    0.000    0.001    0.001 scratch_4.py:5(pos_func)
        1    0.000    0.000    0.000    0.000 {built-in method _abc._abc_instancecheck}
      2/1    0.000    0.000    0.000    0.000 {built-in method _abc._abc_subclasscheck}
        1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      293    0.000    0.000    0.000    0.000 {built-in method builtins.len}
      124    0.000    0.000    0.000    0.000 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {function Counter.update at 0x000001F12206D430}
      140    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
      124    0.000    0.000    0.000    0.000 {method 'values' of 'dict' objects}
```
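For reference, these profiles were produced roughly like the sketch below. The workload function here is a hypothetical stand-in of my own (toy_workload is not from my real code); the point is only to show the profiling harness:

```python
import cProfile
import io
import pstats

def toy_workload(n):
    # Hypothetical stand-in; swap in a call to seq_mining when profiling it.
    return sum(len(str(i)) for i in range(n))

# Collect a profile of one call, then format its statistics table.
profiler = cProfile.Profile()
profiler.enable()
toy_workload(10_000)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

Sorting by "cumulative" surfaces the functions whose subtrees cost the most, which is usually more useful than the default alphabetical order when hunting for a bottleneck.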
Why does my function that returns a Counter() object have such a long run time? Looking at these results with an inexperienced eye, I would think the len() function is being run too many times and is slowing me down a lot. Is this correct?
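I also tried timing membership tests on a list versus a set, since my dictcomp does `k in l` over lists; this is a toy comparison with made-up data, not my real sequences, but it shows the relative cost I was curious about:

```python
import timeit

# Hypothetical pattern pool, just for the comparison.
patterns_list = [f"P{i}" for i in range(10_000)]
patterns_set = set(patterns_list)

# Worst case for the list: the item we look up sits at the end,
# so `in` has to scan all 10,000 strings each time.
t_list = timeit.timeit(lambda: "P9999" in patterns_list, number=1_000)
t_set = timeit.timeit(lambda: "P9999" in patterns_set, number=1_000)
print(f"list membership: {t_list:.4f}s, set membership: {t_set:.6f}s")
```

The set lookup is constant-time on average, while the list lookup scans linearly, so the gap grows with the pool size.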