
I'm writing a script for a bioinformatics application in Python that iterates through several sequences, looking for amino acids at specific positions to calculate their frequency relative to a genomic database. However, I'm new to the area and have been learning while developing the project, so a lot of what I've done is poorly optimized. For a task I consider relatively simple, the script takes half a minute on some files and much longer on larger ones.

Can anyone suggest how I might get around this problem? Apologies if it's trivial; I'm a biologist just starting to get to grips with programming.

def main(self):
    # Process input data and extract necessary information
    mutation_count, mutation_subjects, mutation_positions, alignment_file, seq_ID = self.process_input()

    # Initialize a dictionary to count amino acids at each mutation position
    aa_counts = {pos: {} for pos in mutation_positions}

    # Read the alignment file using Bio.AlignIO
    alignment = AlignIO.read(alignment_file, 'clustal')

    # Find the reference sequence in the alignment
    ref_seq = None
    for record in alignment:
        if record.id == seq_ID:
            ref_seq = record
            break

    # Convert sequences in the alignment to strings for easier processing
    ref_seq_str = str(ref_seq.seq)
    alignment_str = {record.id: str(record.seq) for record in alignment}

    # Count amino acids at each position in the alignment
    for seq in alignment_str.values():
        pos_in_ref_seq = 0
        for pos, aa in enumerate(seq):
            if ref_seq_str[pos] != '-':
                pos_in_ref_seq += 1
                if pos_in_ref_seq in aa_counts:
                    aa_counts[pos_in_ref_seq][aa] = aa_counts[pos_in_ref_seq].get(aa, 0) + 1

    # If a specific position is provided, calculate and print the amino acid frequencies at that position
    if self.args.position:
        position = self.args.position
        total_count = sum(aa_counts[position].values())
        for aa, count in aa_counts[position].items():
            freq = (count / total_count * 100) if total_count > 0 else 0
            print(f"Amino Acid: {aa} | Frequency: {freq:.2f}% | Count: {count}")
        return

    # Analyze mutations and calculate frequencies
    mutation_info = {}
    for mutation, count in sorted(mutation_count.items(), key=lambda x: int(x[0][1:-1])):
        seq_pos = int(mutation[1:-1])
        query_aa = mutation[0]
        subject_aa = mutation[-1]
        if query_aa == '-' or subject_aa == '-':
            continue
        real_count = aa_counts[seq_pos].get(subject_aa, 0)
        total_count = sum(aa_counts[seq_pos].values())
        query_aa_frequency = (aa_counts[seq_pos].get(query_aa, 0) / total_count * 100) if total_count > 0 else 0
        subject_aa_frequency = (aa_counts[seq_pos].get(subject_aa, 0) / total_count * 100) if total_count > 0 else 0
        if subject_aa_frequency <= 10 and real_count > 2:
            mutation_info[mutation] = {
                'query_aa_frequency': query_aa_frequency,
                'subject_aa_frequency': subject_aa_frequency,
                'real_count': real_count,
            }

    # Identify strains with specific mutations
    strains_with_mutations = {}
    for mutation in mutation_count:
        query_aa, pos, subject_aa = mutation[0], int(mutation[1:-1]), mutation[-1]
        strains_with_this_mutation = []
        for record in alignment:
            strain_name = record.id[:3]
            sequence = str(record.seq)
            pos_in_ref_seq = 0
            for i, aa in enumerate(sequence):
                if ref_seq.seq[i] != '-':
                    pos_in_ref_seq += 1
                    if pos_in_ref_seq == pos:
                        if aa == subject_aa:
                            strains_with_this_mutation.append(strain_name[:3])
                        break
        strains_with_mutations[mutation] = strains_with_this_mutation

    # Write mutation information to a text file
    with open('gapslist.txt', 'w') as f:
        for mutation, info in mutation_info.items():
            f.write(f"Mutation: {mutation} | {mutation[0]} Frequency: {info['query_aa_frequency']:.2f}% | {mutation[-1]} Frequency: {info['subject_aa_frequency']:.2f}% | Count: {info['real_count']}\n\n")

    # Write mutation and strain information to a CSV file
    with open('rare_mutations.csv', 'w', newline='') as csvfile:
        fieldnames = ['Mutation', 'Strains']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for mutation, info in mutation_info.items():
            writer.writerow({
                'Mutation': mutation,
                'Strains': ';'.join(strains_with_mutations.get(mutation, []))
            })

Profiling the code, I realized that the biggest bottleneck is the number of times the __getitem__ method of the Seq class in Bio.Seq is called:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 69006091   25.002    0.000   99.135    0.000 Seq.py:470(__getitem__)
 69098925    6.311    0.000    6.311    0.000 SeqRecord.py:334(<lambda>)
 69008225   10.165    0.000   49.965    0.000 <frozen abc>:117(__instancecheck__)
 69006091   10.330    0.000   22.310    0.000 <frozen abc>:121(__subclasscheck__)

I've been looking into ways of dealing with this, but nothing seems to work, or it only marginally reduces execution time.


    1 Answer


    new to the area and I've been learning while developing

    Great, welcome! Software is almost as interesting as biology. And it's much less messy.

    measurements

    much longer for larger files

    “When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science.” --Lord Kelvin

    You're saying that "app-level throughput is low". That would be a more useful metric if it were expressed in "rows per second" or similar units that are meaningful to the application.

    I am delighted to see that you found python -m cProfile! Good job, keep using that. But follow the advice below to make it more useful to you. I see a measurement of 99 seconds getting sequence items, but that happens in more than one place so I don't really know where your app-level hot spot is. Do pay attention to the 25 s versus 99 s on total versus cumulative. That suggests that three quarters of the time spent de-referencing x[i] is consumed by still lower layers, which surprises me slightly.

    Here's an idiom that will help you narrow down app-level bottlenecks.

    from time import time
    ...
    t0 = time()
    do_some_time_consuming_stuff()
    print("elapsed seconds:", time() - t0)

    Record such timings, author refactored functions, and do a bake-off to measure whether the refactoring is much of a win.

    When authors introduce the concept of a python @decorator, they often will start with an elapsed time example. You may find that helpful, as well.
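    As a minimal sketch of that pattern (the decorated helper name here is a hypothetical stand-in for one of your own functions):

    from functools import wraps
    from time import time

    def timed(fn):
        """Print how long each call to fn takes."""
        @wraps(fn)
        def wrapper(*args, **kwargs):
            t0 = time()
            result = fn(*args, **kwargs)
            print(f"{fn.__name__} elapsed seconds: {time() - t0:.3f}")
            return result
        return wrapper

    @timed
    def count_amino_acids(alignment):  # hypothetical helper extracted from main()
        ...

    Decorate a suspect function, run the script once, and you get a per-call timing without sprinkling t0 variables everywhere.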


    helper functions

    You don't have any.

    Now, main() is a good looking function, with many helpful identifiers. But it is entirely too long. In particular, we must scroll vertically if we are to visually take it all in. If there's one software engineering skill you should learn this year, it is how to Extract Helper functions. Fortunately it's not hard, it just takes practice.

    One reason for breaking out helpers is coupling of variables. All of your local variables are essentially global. Having many such variables makes it harder to correctly reason about the code, since so many variables and side effects can affect how the current line of code behaves.

    Your code is nicely organized into distinct processing chunks, and you wrote many helpful comments.

     # Find the reference sequence in the alignment
     ...
     # Count amino acids at each position in the alignment
     ...
     # If a specific position is provided, calculate and print the amino acid frequencies at that position
     ...
     # Analyze mutations and calculate frequencies
     ...
     # Identify strains with specific mutations
     ...
     # Write mutation information to a text file
     ...
     # Write mutation and strain information to a CSV file
     ...

    Every single one of those comments is telling me: "Here comes a new function!" Listen to the code. Break them out as small functions. The English sentence you wrote is practically giving you the function name, e.g. def write_mutation_and_strain_to_csv( ... ):. It's not always necessary, but feel free to copy-paste those very nice sentences into a """docstring""" for each helper function.
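    For example, the first chunk might become (a sketch; the name is mine, not gospel):

    def find_reference_sequence(alignment, seq_ID):
        """Find the reference sequence in the alignment."""
        for record in alignment:
            if record.id == seq_ID:
                return record
        return None

    Then main() shrinks to a sequence of readable calls such as ref_seq = find_reference_sequence(alignment, seq_ID), and the loop variables stop leaking into the rest of the function.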


    review context

    You stripped out the import statements, so it's hard to see exactly what your deps are. But it appears, for example, that at least one of them is Biopython's AlignIO module.

    In a way it doesn't matter too much, since we don't have a reproducible example. If I suggest a change, it's not like I can run and re-run the modified code to measure size-of-effect.

    Consider offering more context if you post any subsequent questions about refactored code.


    test suite

    Writing automated unit tests is a good way to instill confidence in your code's correctness and to notice any regressions introduced by refactoring. Some would say that it "gives us the courage to refactor" safely.

    In your requirements document you wrote down more than one thing that each chunk of code must do in order to be Correct. For example, it must not only find the proper strains or write a file, it must do so fast. So a unit test might simply verify that a file was processed in less than a minute. And then that Green bar turns Red if a refactor introduces surprisingly large delays.
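    A minimal sketch of such a test, assuming a hypothetical process_file() entry point and a small fixture file:

    import time

    def test_file_is_processed_quickly():
        """Turn Red if a refactor introduces a surprisingly large delay."""
        t0 = time.time()
        process_file("example_alignment.aln")  # hypothetical entry point and fixture
        assert time.time() - t0 < 60  # seconds

    Wall-clock assertions can be flaky on a busy machine, so pick a generous threshold.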


    big-Oh complexity

     for mutation in mutation_count:
         ...
         for record in alignment:
             ...
             for i, aa in enumerate(sequence):

    I don't know why your code is slow -- it is large and complex. But I notice nested loops in several places. We're looping over mutation_count, alignment, and sequence. Call their lengths m, a, s. If this chunk was broken out as a helper function, you might note in the """docstring""" or in a # comment that time complexity is O(m × a × s). If you know the rough magnitudes of those inputs, that can help guide you toward places where algorithm improvements are needed.
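    One sketch of such an algorithm improvement, reusing the ref_seq_str and alignment_str plain strings your code already builds (untested against your data): map each reference position to its alignment column once, so the innermost O(s) scan disappears and the chunk drops to roughly O(s + m × a):

    # Build once: reference position -> alignment column. O(s).
    col_of_ref_pos = {}
    pos_in_ref_seq = 0
    for col, aa in enumerate(ref_seq_str):
        if aa != '-':
            pos_in_ref_seq += 1
            col_of_ref_pos[pos_in_ref_seq] = col

    # Each (mutation, record) check is now O(1) string indexing,
    # which also sidesteps the Seq.__getitem__ hot spot in the profile.
    strains_with_mutations = {}
    for mutation in mutation_count:
        pos, subject_aa = int(mutation[1:-1]), mutation[-1]
        col = col_of_ref_pos.get(pos)
        strains = []
        if col is not None:
            for rec_id, seq in alignment_str.items():
                if seq[col] == subject_aa:
                    strains.append(rec_id[:3])
        strains_with_mutations[mutation] = strains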

    • "@greybeard, thank you for helping me to sharpen my quill, which I plucked from the copy-pasta monster and dipped once again into the ink well. I remember being dissatisfied with where the commas were placed as I read it, which I think made me not even re-read it to the end. I will cite my reference; revised text comes from an improved source." – J_H, Nov 28, 2023 at 21:24
    • "Lord Kelvin (1824–1907), British scientist: 'When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.' Often quoted as 'If you cannot measure it, then it is not science.' Popular Lectures and Addresses vol. 1 (1889), 'Electrical Units of Measurement', delivered 3 May 1883." – pippo1980, Nov 29, 2023 at 16:38
