
I am working on a practice problem found here: https://www.hackinscience.org/exercises/sequence-mining

Overview

Sequence mining is useful for analyzing lists of events when order matters. It is used, for example, for DNA and natural language analysis.

Here, you'll implement an algorithm to extract common patterns among a set of sequences.

Exercise

You must provide a function seq_mining that takes as argument:

A list of strings (representing the sequences), such as: ['CFEDS', 'SKDJFGKSJDFG', 'OITOER']

The minimum proportion of sequences that must contain a pattern for it to be taken into account (a float between 0 and 1): if 0.34, at least a third of the sequences must contain a given pattern for the function to return it.

The maximum pattern length that must be considered (int)

If a given pattern occurs several times in a sequence, it must be counted only once.

The function seq_mining must return a Counter containing:

The found patterns as keys, and the number of sequences containing each pattern as values.

In ["ABC", "BCD"] there are three patterns common to both sequences:

B, C, and BC

(A and AB occur only in the first string, and D and CD only in the last one).

So seq_mining(["ABC", "BCD"], 0.66, 2) (searching for patterns of length 2 at most, which must appear in at least 66% of the sequences) should return:

Counter({'B': 2, 'C': 2, 'BC': 2})

(because we have only two sequences, so 66% already means both of them)

while seq_mining(["ABC", "BCD"], 0.33, 2) (searching for patterns of length 2 at most, which must appear in at least 33% of the sequences) should return:

Counter({'C': 2, 'B': 2, 'BC': 2, 'A': 1, 'D': 1, 'AB': 1, 'CD': 1})

This tells us that BC is found in two sequences while AB is found in a single one (because with only two sequences, 33% allows a pattern to be found in just one of them).
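The exercise's two examples can be reproduced with a minimal brute-force sketch (the function name seq_mining_reference is hypothetical, not the graded solution; the set comprehension is what guarantees each pattern is counted at most once per sequence):

```python
from collections import Counter

def seq_mining_reference(data, prop, max_len):
    # Hypothetical brute-force sketch: collect the distinct substrings of
    # each sequence, then count how many sequences contain each pattern.
    counts = Counter()
    for seq in data:
        # a set, so each pattern is counted at most once per sequence
        patterns = {seq[i:i + w]
                    for i in range(len(seq))
                    for w in range(1, max_len + 1)
                    if i + w <= len(seq)}
        counts.update(patterns)
    threshold = len(data) * prop
    return Counter({p: n for p, n in counts.items() if n >= threshold})

print(seq_mining_reference(["ABC", "BCD"], 0.66, 2))
```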

In this problem, I need to build a function that goes over a list of strings and finds patterns of up to a specific length that are present in at least a specific proportion of the given strings. The function takes three arguments: a list of strings, a float giving the required proportion of strings a pattern must appear in to be counted, and the maximum length of the possible patterns. It must return a Counter() object.

I built a function which seems to work, although I know it's not pretty or efficient. When I initially submitted the solution, it failed because the run time was too long. I'm not too familiar with optimizing code yet because I'm still a beginner, but I did some reading and tried to implement some changes, such as swapping a for loop for a while loop and using list comprehensions where possible. I think this made the code much less readable, and it seems to run even slower. I think my function is poorly designed overall, but before I start over I'd like to know if there are any obvious changes I'm not seeing. I just think I'm heading in the wrong direction with the code and could use some expertise here.

    import itertools, collections

    def seq_mining(data, prop, max_int):
        list_of_sequences = enumerate(data)
        pos_dict = collections.defaultdict(list)
        for seq_index, seq in list_of_sequences:
            for step in range(1, max_int + 1):
                for sub_seq in range(0, len(seq)):
                    if len(seq[sub_seq:sub_seq + step]) >= step:
                        pos_dict[seq_index].append(seq[sub_seq:sub_seq + step])
        num_occurrence = {k: sum(1 for l in pos_dict.values() if k in l)
                          for k in set(sum(pos_dict.values(), []))}
        final_dict = {}
        for l, t in num_occurrence.items():
            if t / len(data) >= float(prop):
                final_dict[l] = t
        c = collections.Counter(final_dict)
        return c

The above code was my initial submission, which failed. I did some reading and attempted to make it better, but I think I made it worse. My second attempt is below:

    def seq_mining(data, prop, max_int):
        list_of_sequences = enumerate(data)
        pos_dict = collections.defaultdict(list)
        for seq_index, seq in list_of_sequences:
            step = 1
            while step <= max_int:
                [pos_dict[seq_index].append(seq[sub_seq:sub_seq + step])
                 for sub_seq in range(0, len(seq))
                 if len(seq[sub_seq:sub_seq + step]) >= step]
                step += 1
        num_occurrence = {k: sum(1 for l in pos_dict.values() if k in l)
                          for k in set(sum(pos_dict.values(), []))}
        final_dict = dict([(l, t) for l, t in num_occurrence.items()
                           if t / len(data) >= float(prop)])
        c = collections.Counter(final_dict)
        return c

This is significantly more confusing to me. I read about a module called cProfile which can help locate areas for improvement. The output below is from running the second attempt.

    958 function calls in 0.001 seconds

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    0.001    0.001 <string>:1(<module>)
            1    0.000    0.000    0.000    0.000 __init__.py:581(__init__)
            1    0.000    0.000    0.000    0.000 __init__.py:649(update)
            1    0.000    0.000    0.000    0.000 abc.py:96(__instancecheck__)
           10    0.000    0.000    0.000    0.000 scratch_5.py:12(<listcomp>)
            1    0.000    0.000    0.001    0.001 scratch_5.py:14(<dictcomp>)
          254    0.000    0.000    0.000    0.000 scratch_5.py:14(<genexpr>)
            1    0.000    0.000    0.000    0.000 scratch_5.py:15(<listcomp>)
            1    0.000    0.000    0.001    0.001 scratch_5.py:6(pos_func)
            1    0.000    0.000    0.000    0.000 {built-in method _abc._abc_instancecheck}
            1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
            1    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
          293    0.000    0.000    0.000    0.000 {built-in method builtins.len}
          124    0.000    0.000    0.000    0.000 {built-in method builtins.sum}
            1    0.000    0.000    0.000    0.000 {function Counter.update at 0x000001E9D0C0F430}
          140    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
            1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
          124    0.000    0.000    0.000    0.000 {method 'values' of 'dict' objects}

The cProfile output below is for my first attempt at the function.

    953 function calls (951 primitive calls) in 0.001 seconds

       Ordered by: standard name

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    0.001    0.001 <string>:1(<module>)
            1    0.000    0.000    0.000    0.000 __init__.py:581(__init__)
            1    0.000    0.000    0.000    0.000 __init__.py:649(update)
            2    0.000    0.000    0.000    0.000 _collections_abc.py:405(__subclasshook__)
          2/1    0.000    0.000    0.000    0.000 abc.py:100(__subclasscheck__)
            1    0.000    0.000    0.000    0.000 abc.py:96(__instancecheck__)
            1    0.000    0.000    0.000    0.000 scratch_4.py:13(<dictcomp>)
          254    0.000    0.000    0.000    0.000 scratch_4.py:13(<genexpr>)
            1    0.000    0.000    0.001    0.001 scratch_4.py:5(pos_func)
            1    0.000    0.000    0.000    0.000 {built-in method _abc._abc_instancecheck}
          2/1    0.000    0.000    0.000    0.000 {built-in method _abc._abc_subclasscheck}
            1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
            1    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
          293    0.000    0.000    0.000    0.000 {built-in method builtins.len}
          124    0.000    0.000    0.000    0.000 {built-in method builtins.sum}
            1    0.000    0.000    0.000    0.000 {function Counter.update at 0x000001F12206D430}
          140    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
            1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
          124    0.000    0.000    0.000    0.000 {method 'values' of 'dict' objects}

Why does my function that returns a Counter() object have a very long run time? Looking at these results with an inexperienced eye, I would think the len() function is being run too many times and is slowing me down a lot. Is this correct?
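One quick way to test that hypothesis is to micro-benchmark len() with timeit (a rough sketch; exact numbers vary by machine):

```python
import timeit

# len() on a str is a constant-time C-level call, so a few hundred
# calls cannot account for seconds of runtime.
s = "SKDJFGKSJDFG"
per_call = timeit.timeit(lambda: len(s), number=100_000) / 100_000
print(f"~{per_call * 1e9:.0f} ns per len() call")
```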

  • Just a general remark: for me the code is just too dense to understand easily. My mind starts to complain when trying to read stuff like [pos_dict[seq_index].append(seq[sub_seq:sub_seq + step]) for sub_seq in range(0, len((seq))) if len(seq[sub_seq:sub_seq + step]) >= step]. If I need to start puzzling to see what the code should be doing... – Commented Sep 12, 2021 at 15:13
  • I honestly completely agree. That's why I was asking for help: my attempt to decrease the use of for loops left me with increasingly uglier code, haha. I was able to figure out a way to get this to work without such a messy answer, though. I think I just needed to step away from it for a while and recheck with fresh eyes.
    – Tommyp
    Commented Sep 12, 2021 at 16:51
  • I pasted your code and seq_mining(['CFEDS', 'SKDJFGKSJDFG', 'OITOER']) from the question. It doesn't run. Can you post the full code needed to run your example?
    – C. Harley
    Commented Sep 12, 2021 at 17:27

3 Answers


Any time you see nested for loops like this, you can visually clean them up (or at least separate them) with generators:

    import collections
    from collections import defaultdict

    def create_substrings(data, max_int):
        # This function cleans up the seq mining function
        # by just creating your key, slice pairs
        for seq_index, seq in enumerate(data):
            for step in range(1, max_int + 1):
                for sub_seq in range(len(seq)):
                    slice_ = seq[sub_seq:sub_seq + step]
                    if len(slice_) < step:
                        continue
                    yield seq_index, slice_

    def seq_mining(data, prop, max_int):
        positions = defaultdict(list)
        # Now this looks a lot less crowded
        for seq, slice_ in create_substrings(data, max_int):
            positions[seq].append(slice_)
        num_occurrence = {k: sum(1 for l in positions.values() if k in l)
                          for k in set(sum(positions.values(), []))}
        final_dict = {}
        for l, t in num_occurrence.items():
            if t / len(data) >= float(prop):
                final_dict[l] = t
        c = collections.Counter(final_dict)
        return c

It doesn't necessarily give you a performance benefit, but it makes the code look less dense, and is easier to read and digest later.

There's a big red flag here. The num_occurrence dictionary comprehension is probably your biggest bottleneck. The sum(1 for l in values if k in l) is slow because the k in l test on a list is O(N)... nested inside a loop. That's the worst case, but it's still slow. You're also iterating over all the values again when building set(sum(...)). To break this up, let's consider the problem first:

The minimum proportion of sequences that must contain a pattern for it to be taken into account (a float between 0 and 1): if 0.34, at least a third of the sequences must contain a given pattern for the function to return it. The maximum pattern length that must be considered (int). If a given pattern occurs several times in a sequence, it must be counted only once.

With this in mind, your default dictionary is inside-out. You wind up trying to count the number of times the unique values in the dictionary occur instead of what you should be doing, which is counting how many sequences share a sub-sequence. Let's use a defaultdict(set) for that instead, where the sets contain the indices of the parent sequences:

    from collections import Counter, defaultdict

    def create_substrings(data, max_int):
        # This function cleans up the seq mining function
        # by just creating your key, slice pairs
        for seq_index, seq in enumerate(data):
            for step in range(1, max_int + 1):
                # The (step - 1) will make it so that you don't have
                # to test the length of the slice anymore
                for sub_seq in range(len(seq) - (step - 1)):
                    slice_ = seq[sub_seq:sub_seq + step]
                    yield seq_index, slice_

    def seq_mining(data, prop, max_int):
        positions = defaultdict(set)
        for i, seq in create_substrings(data, max_int):
            # each member now gets counted only once
            positions[seq].add(i)
        # data doesn't change size, so compute its length once
        data_len = len(data)
        # don't cast this, just make it a counter to begin with
        counter = Counter()
        for l, t in positions.items():
            t_len = len(t)
            if t_len / data_len < prop:
                continue
            counter[l] = t_len  # store the count of sequences, not the set
        return counter

A few things I've cleaned up:

  1. data_len. Don't re-compute constants over and over. Look them up, define them as constants, and move on
  2. counter: Avoid unnecessary re-casting. Counter is a pretty similar object to a dictionary, so use it like one
  3. positions: Is now a defaultdict(set) which allows you to just see how many sequences share the dna pattern, regardless of how many times it actually shows up.
  4. if t_len...: continue: The eye has to stay closer to the left side of the screen if we explicitly say what we want to ignore. It implies the else and leads to less deeply-nested loop structures
  5. for sub_seq in range(len(seq) - (step - 1)): This eliminates the slice-length testing

Now, this certainly looks better, and even has a boost in performance, but it's still missing some pieces to be complete.
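The list-versus-set membership cost mentioned above can be illustrated with a rough micro-benchmark (sizes and timings are illustrative only):

```python
import timeit

# `x in list` scans elements one by one; `x in set` is a single hash lookup.
big_list = list(range(100_000))
big_set = set(big_list)

worst_case = 99_999  # the last element: the list must scan everything
list_time = timeit.timeit(lambda: worst_case in big_list, number=100)
set_time = timeit.timeit(lambda: worst_case in big_set, number=100)
print(f"list: {list_time:.4f}s, set: {set_time:.6f}s")
```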

Checking proportion shared

Rather than comparing to a float that we calculate in every iteration, it would be much better to compare to a pre-defined integer. You know how many sequences need to share a subsequence for it to be considered valid, and that can be defined outside your loop. Compute the proportion and use a ceiling function to make it an integer:

    from math import ceil

    def common_subsequences(sequences, sequence_len, proportion):
        positions = defaultdict(set)
        for i, seq in create_substrings(sequences, sequence_len):
            positions[seq].add(i)
        # this is the number of samples that need to share the subsequence
        num_required = ceil(len(sequences) * proportion)

Now, putting this into a loop to fill a Counter object:

    def common_subsequences(sequences, sequence_len, proportion):
        positions = defaultdict(set)
        for i, seq in create_substrings(sequences, sequence_len):
            # each member now gets counted only once
            positions[seq].add(i)
        # this is the number of samples that need to share the subsequence
        num_required = ceil(len(sequences) * proportion)
        counter = Counter()
        for subsequence, members in positions.items():
            num_members = len(members)
            # now we're testing an integer, much simpler
            if num_members < num_required:
                continue
            counter[subsequence] = num_members
        return counter
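The ceiling arithmetic can be sanity-checked against the question's examples:

```python
from math import ceil

# 2 sequences: 66% means both must match, 33% means one is enough
assert ceil(2 * 0.66) == 2
assert ceil(2 * 0.33) == 1
# 3 sequences at 0.34: 1/3 is about 0.333 < 0.34, so 2 sequences are required
assert ceil(3 * 0.34) == 2
print("thresholds match the question's examples")
```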

Docstrings

Docstrings would help very much in terms of readability for any user. As it stands, it's tough to reason out what the code is actually doing. The names aren't clear, so a docstring giving a helpful description and some examples would go an exceptionally long way.

Type Hints

These are useful for communicating the types of your variables to the reader/user/maintainer.

    from collections import Counter
    from typing import List, Iterator, Tuple

    def create_substrings(dna_sequences: List[str], sequence_len: int = 1) -> Iterator[Tuple[int, str]]:
        """Creates slices of each member of `dna_sequences` up to `sequence_len` in size.

        Yields tuples of each slice's sequence number and the slice itself.
        """

    def common_subsequences(sequences: List[str], sequence_len: int, proportion: float) -> Counter:
        """Finds common DNA patterns shared between a list of strands.

        Returns a Counter of patterns up to `sequence_len` in length that are
        shared by at least `proportion` of the population.
        """

Naming

l, t, prop, and max_int are not clear variable names. They are certainly valid Python, but they are shorthand that is only useful to the person writing the code. What happens when you come back a month or two from now to update it? It might be much more difficult to figure out what each of these names means. Pick descriptive (but not overly verbose) names; they communicate what the values are.


    I submitted your first code there and got this failure message:

    seq_mining(['ABCD', 'ABABC', 'BCAABCD'], 0.34, 1000000) is too slow (took 4.203785117715597s).

    Seeing that 1000000 is far larger than useful, I reduced it to the largest actual string length by adding this at the top of your function:

    max_int = min(max_int, max(map(len, data))) 

    Submitting again gave this success message:

    Nice!! Your exercise is OK.

    Moral of the story: Don't ignore feedback.
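The effect of the clamp is easy to see on the failing test's input (a small sketch):

```python
data = ['ABCD', 'ABABC', 'BCAABCD']
max_int = 1_000_000

# No pattern can be longer than the longest input string,
# so cap the search width at that length
max_int = min(max_int, max(map(len, data)))
assert max_int == 7  # len('BCAABCD')
print(max_int)
```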

    • Yep, this is exactly what I realized and how I fixed it: I just limited it to the maximum string length. Thank you!
      – Tommyp
      Commented Sep 13, 2021 at 11:18

    Some observations:

    import itertools, collections 

    I don't see itertools used; don't import it.

    list_of_sequences=enumerate(data) 

    enumerate() returns an iterator that yields a sequence of tuples. It doesn't return a list, so list_of_sequences is a misnomer.
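A small demonstration of the point: enumerate yields (index, item) tuples lazily.

```python
e = enumerate(['CFEDS', 'OITOER'])

# enumerate() returns a lazy iterator of (index, item) tuples, not a list
assert not isinstance(e, list)
assert next(e) == (0, 'CFEDS')
assert next(e) == (1, 'OITOER')
print("enumerate yields tuples lazily")
```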

    for seq_index, seq in list_of_sequences: 

    To understand the for loop, you need to remember that list_of_sequences is an enumeration. It would be more readable to just put enumerate(data) in the for loop.

    Solving the problem does not require keeping track of which sequence had which subsequences. The index isn't needed, so the enumerate isn't needed either.

    for step in range(1, (max_int+1)): 

    substring_length or width might be a more descriptive name than step.

    sub_seq is the starting index. If the loop ends at len(seq) - step + 1, the next if statement isn't needed.
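That bound can be verified with a short sketch: stopping the start index at len(seq) - step + 1 produces only full-width slices, so no length check is needed.

```python
seq = "ABCD"
step = 2

# Only full-width slices are produced; the `if len(...) >= step` test goes away
subs = [seq[i:i + step] for i in range(len(seq) - step + 1)]
assert subs == ['AB', 'BC', 'CD']
print(subs)
```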

    pos_dict[seq_index].append(seq[sub_seq:sub_seq+step]) 

    This line builds lists of all subsequences of a sequence.

    num_occurrence={k: sum(1 for l in pos_dict.values() if k in l) for k in set(sum(pos_dict.values(), []))} 

    Then set(sum(pos_dict.values(), [])) builds a set of all subsequences so that sum(1 for l in pos_dict.values() if k in l) can count the number of sequences containing each subsequence. The line is too complicated and does unnecessary work.

    When you need to get a collection of items without duplicates, consider using a set. When you need to count things, consider using Counter().
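Counter.update() accepts any iterable, so updating it with a per-sequence set counts each pattern at most once per sequence:

```python
from collections import Counter

c = Counter()
c.update({'A', 'B', 'C', 'AB', 'BC'})  # the distinct patterns of "ABC"
c.update({'B', 'C', 'D', 'BC', 'CD'})  # the distinct patterns of "BCD"
assert c['BC'] == 2 and c['AB'] == 1
print(c['BC'], c['AB'])
```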

        def seq_mining(data, ratio, maxlen):
            sub_seq_counts = Counter()
            for seq in data:
                # collect the set of distinct subsequences of this sequence
                patterns = set()
                for start in range(len(seq)):
                    for width in range(1, min(len(seq) - start, maxlen) + 1):
                        pattern = seq[start:start + width]
                        patterns.add(pattern)
                # update the count of sequences that contain each subsequence
                sub_seq_counts.update(patterns)
            # filter the subsequences that meet the minimum ratio of matches
            threshold = len(data) * ratio
            result = Counter({k: v for k, v in sub_seq_counts.items() if v >= threshold})
            return result
