Simple program to split DNA sequences based on ambiguous content

Question

The problem

I have a program that takes a file of DNA sequences as input and writes each entry to one of two output files: one file for sequences containing only canonical characters (A, C, G, and T), and another file for sequences containing non-canonical (or ambiguous) characters.

The data

DNA sequencing machines produce gobs of data, with each entry adhering to the following format.

    @uniqueID free-form description and/or metadata
    ACGTACGTACGTACGT
    +
    BBBBBBBBBBBBBBBB

The first line contains an ID and metadata, the second line contains DNA sequence (this is what the program filters on), the third line is stupid and can be ignored, and the fourth line contains ASCII-encoded quality values corresponding to each nucleotide in the DNA. Here are 3 entries from a real file.

    @ERR899705.3060 HWI-D00574:77:C6A74ANXX:5:1101:2907:2432/1
    GCCTTCCAGAAAAATGTTCCCAGTTCATTGATGGCAAACTCAAACAGGTTTGAAAAAAAATCTCAATATTAAATATGGATATGATGCATCAGTTCAAGGTGATTTCTGGTTAGCTGTGTGCTGACA
    +
    BCCCCGGGEGGGGGGGGGGGGGGGGGFGGGGGGGGGGGEGGGGGGGGE1>FCFGF1F<DDDGGGGDGGCGGGGGGGGGGGFGGGGGGEGEGG>FCEG@@0FGGGGGCEDGCG>DCGGGGG=0D=GG
    @ERR899705.3061 HWI-D00574:77:C6A74ANXX:5:1101:2869:2437/1
    CAAATAAATAAATAAATATTTAGTTACTGAAATAATAATGTTGATTGGTAAAATAAGAATGTATTCTCTCTAAAAACTAGTTAGTTTGTACTAAAAAATGGGCATAATTATTATCTCAAAGATTAG
    +
    BBBBBEGGGGGGGGGGGGGGGGGGGGGGD@GGGGGGGGEGC@FGFG>F:CGGGGGGFGGGCFCBDC11:F>FGGGGBEGGGGGGGGGC@F@FGGGGGG>000F@FGGGGGGEGGG00B0CFGGE>0
    @ERR899705.3062 HWI-D00574:77:C6A74ANXX:5:1101:2786:2438/1
    TTGGAAAACATTTATTACATTTTCTTCTAATTTCTTAGAATCATTTACTTGTAATATTTAGATACTCGGCGATCCTTTCAGGTATTCACTGCATATCCCCAAATATTGCATCAAACTCTCCCACGT
    +
    AABB0111@F111F111=1;@@;111E1?111:111=1=1111<11111<:11:C111<1EBG11:>F/////0/1:<F>C1:11=0:=0;@0>=0000==8>00000=8000<0000008000;0

When a DNA sequencing machine cannot reliably determine the molecule at a given position, it will use an N character instead of an ACGT character. Also, programs for analyzing DNA sequences will sometimes use other non-ACGT characters (IUPAC notation) to represent uncertainty about the content of DNA sequences.

The constraints

I need to handle sequences containing ambiguous characters separately from sequences containing only canonical characters. The input files are huge, so I want something as performant as possible, but I'm also trying to keep it as simple and easy to read as possible.

The code does no sanity checking on the input data (assumes the number of lines in the input file is a multiple of 4, assumes there are no blank lines, etc). I am ok with this for my current purposes.
It abuses stderr as a second output stream, although I'm not sure it's worth the extra clutter to parse command-line args, provide a usage statement, etc.

The code

The following minimal C++ program solves the problem as stated. Keeping the program simple, are there any ways I can improve performance or clarity of the code?

I'd welcome any feedback, but I'm specifically interested in the following.

How should I format the if statement in the non_canon function? Everything I've tried looks hideous.
I'm using the ternary operator to conditionally assign the output stream. My typical pattern in Python is to assign, check, and then reassign if needed, but std::ostream references cannot be reassigned in C++. Is there a clearer way of doing this?

    #include <algorithm>
    #include <iostream>
    #include <string>

    char non_canon(char bp) {
        if ( bp != 'A' && bp != 'a' &&
             bp != 'C' && bp != 'c' &&
             bp != 'G' && bp != 'g' &&
             bp != 'T' && bp != 't' ) {
            return true;
        }
        return false;
    }

    int main() {
        std::string defline, seq, qual, buffer;
        while (getline(std::cin, defline)) {
            getline(std::cin, seq);
            getline(std::cin, buffer);
            getline(std::cin, qual);
            bool non_acgt = std::any_of(seq.begin(), seq.end(), non_canon);
            std::ostream& outstream = non_acgt ? std::cerr : std::cout;
            outstream << defline << '\n' << seq << "\n+\n" << qual << std::endl;
        }
        return 0;
    }

As someone who specialises in low-latency C++ I can honestly say the answers are not the best for performance. Firstly you need to change your input file to be binary and map it to memory if the file is large. Now you can read binary data (much faster). Why are you considering lower case letters when your input is only upper case? Prefer bitwise operators over boolean, the latter require branches. Avoid memory look-ups if you can compute values. Avoid IF statements too. Avoid strings altogether. — user997112, Commented Jun 21, 2019 at 0:20

Loki Astari · Accepted Answer · 2017-04-27 14:39:25Z

Well one simple change will make it do lines in groups of four.

    while (getline(std::cin, defline) &&
           getline(std::cin, seq) &&
           getline(std::cin, buffer) &&
           getline(std::cin, qual)
          ) {
        // Only enter the loop if you correctly read all four lines.
    }

We can also improve the test for a specific character. Note I am changing the return type to bool.

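    // Requires <cctype>.
    bool canon(char bp) {
        // Cast first: std::toupper is undefined for negative char values.
        bp = static_cast<char>(std::toupper(static_cast<unsigned char>(bp)));
        return bp == 'A' || bp == 'C' || bp == 'G' || bp == 'T';
    }

That will work. But it's a bit slow if you get a lot of 'T', because 'T' is the last comparison in the chain. We can optimize that if you want with a lookup table: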

bool canon(unsigned char bp) { static bool const is_cannon[] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, // A C G 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // T 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, // a c g 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // t 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; return is_cannon[bp]; }

It's a bit of a micro-optimization, but it may be worth it if your files are huge. Now that @Edward has published an answer, I like his switch statement better.
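The table also doesn't have to be typed out by hand. As a rough sketch (the names `make_canon_table` and `is_canon` are made up for illustration and are not part of the answer above), the same 256 entries can be filled in by a small helper that runs once, on the first call. With C++17 the helper could probably be made `constexpr` as well, but that is not assumed here.

```cpp
#include <array>

// Builds a 256-entry table with true at the positions of A, C, G, T (either case).
inline std::array<bool, 256> make_canon_table() {
    std::array<bool, 256> table{};  // value-initialized: every entry starts false
    for (const char* p = "ACGTacgt"; *p != '\0'; ++p) {
        table[static_cast<unsigned char>(*p)] = true;
    }
    return table;
}

// Looks the character up in a table that is built once, on first use.
inline bool is_canon(unsigned char bp) {
    static const std::array<bool, 256> table = make_canon_table();
    return table[bp];
}
```

Taking `unsigned char` (as the hand-written version does) keeps the index in the range 0 to 255 no matter what byte shows up in the input.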

Prefer '\n' over std::endl. It (the endl) will only slow your code down. It forces a flush, and this flush is unnecessary. The streams will automatically flush when they need to.

By default, the C++ standard streams are synchronized with the C stdio streams, and keeping the two in sync slows your C++ code down. If you don't use the C streams, you can improve performance by turning that synchronization off.
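If you want to see the flushing for yourself, here is a small sketch that counts how often a stream buffer is asked to flush. It is not from the original post: `CountingBuf`, `flushes_with_endl`, and `flushes_with_newline` are made-up names, and the buffer is an in-memory one, so it shows only how often a flush is requested, not what a flush costs on a real file or pipe.

```cpp
#include <ostream>
#include <sstream>

// An in-memory buffer that counts how often it is asked to synchronize (flush).
struct CountingBuf : std::stringbuf {
    int syncs = 0;
    int sync() override {
        ++syncs;
        return 0;  // report success
    }
};

// Writes n lines terminated by std::endl; returns the number of flush requests.
int flushes_with_endl(int n) {
    CountingBuf buf;
    std::ostream out(&buf);
    for (int i = 0; i < n; ++i) out << "line" << std::endl;
    return buf.syncs;
}

// Writes n lines terminated by '\n'; returns the number of flush requests.
int flushes_with_newline(int n) {
    CountingBuf buf;
    std::ostream out(&buf);
    for (int i = 0; i < n; ++i) out << "line" << '\n';
    return buf.syncs;
}
```

With `std::endl` there is one flush request per line; with `'\n'` the stream is left to flush whenever its buffer decides to.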

std::ios::sync_with_stdio(false);

Also, reading from standard input can cause slowdowns (compared to reading from a file), because std::cin is tied to std::cout: the output is flushed before each read from the input (so that you see any prompt you are supposed to respond to). Since you are not waiting for user input, you can disable this behavior. In your case it probably will not make any difference, but it's worth adding just to be pedantic:

std::cin.tie(nullptr);

Questions

How should I format the if statement in the non_canon function? Everything I've tried looks hideous.

Don't use an if statement. My code above just returns the expression.

I'm using the ternary operator to conditionally assign the output stream. My typical pattern in Python is to assign, check, and then reassign if needed, but std::ostream references cannot be reassigned in C++. Is there a clearer way of doing this?

I think your code looks fine. But you don't need an intermediate variable.

(non_acgt ? std::cerr : std::cout) << defline << '\n' << seq << "\n+\n" << qual << "\n";

Or you can put it in a function.

    inline std::ostream& getOutputStream(bool non_acgt)
    {
        return non_acgt ? std::cerr : std::cout;
    }

    getOutputStream(non_acgt) << defline << '\n' << seq << "\n+\n" << qual << "\n";

Holy I/O batman!!! I did some benchmarking on a small test file containing 1 million records. The program I originally posted runs consistently in about 10 minutes. Disabling stream syncing AND replacing std::endl with newlines consistently brings this to 30 seconds! Neither of these two changes has a dramatic effect on runtime without the other. And none of the other changes have a measurable effect on performance either. But some clean up the code nicely. Thanks! — Daniel Standage, Commented Apr 27, 2017 at 7:17
@DanielStandage I would rephrase that as: std::ios::sync_with_stdio(false); dramatically increases the speed of C++ streams. While using std::endl causes excessive flushing and thus slows the C++ stream down. — Loki Astari, Commented Apr 27, 2017 at 14:41
Instead of a lookup table, one could use a std::map< char, bool >. No idea if it's as efficient though — Yk Cheese, Commented Apr 27, 2017 at 19:51
@YkCheese std map has log n lookup. I would suggest something like: — enedil, Commented Apr 27, 2017 at 23:48
std::vector<bool> matches(128, false); const char t[] = "ACTGactg"; for (int i = 0; i < 8; ++i) matches[t[i]] = true; — enedil, Commented Apr 27, 2017 at 23:50

Edward · Accepted Answer · 2017-04-27 11:41:19Z

The other review hits most of the important bits, but if you're writing software to classify DNA strings, I would bet that it's not the only processing you're ever going to need to do. With that in mind, I'd suggest that using C++ objects would not only make the code clearer, but provide a reusable object from which subsequent programs might be written to deal with the same data (e.g. searching for particular sequences within a sequence). Here's how that might look:

    #include <algorithm>
    #include <iostream>
    #include <string>

    class DNA
    {
    public:
        friend std::istream& operator>>(std::istream &in, DNA &dna) {
            getline(in, dna.defline);
            getline(in, dna.seq);
            getline(in, dna.buffer);
            getline(in, dna.qual);
            return in;
        }
        friend std::ostream& operator<<(std::ostream &out, const DNA &dna) {
            return out << dna.defline << '\n'
                       << dna.seq << "\n+\n"
                       << dna.qual << '\n';
        }
        void classify(std::ostream &canon, std::ostream& noncanon) const {
            (isCanonical() ? canon : noncanon) << *this;
        }
        bool isCanonical() const {
            return std::all_of(seq.begin(), seq.end(), is_canon);
        }
    private:
        static bool is_canon(char bp) {
            switch(bp) {
                case 'A':
                case 'C':
                case 'G':
                case 'T':
                case 'a':
                case 'c':
                case 'g':
                case 't':
                    return true;
            }
            return false;
        }
        std::string defline, seq, buffer, qual;
    };

Now the main routine is extremely simple:

    int main()
    {
        std::ios::sync_with_stdio(false);
        for (DNA dna; std::cin >> dna; ) {
            // canonical sequences in the first file; others in the second
            dna.classify(std::cout, std::cerr);
        }
    }

Note that by writing the character classification as a switch, we enable some compiler optimizations that aren't available in other ways (e.g. using a jump table). Also, by creating generic input and output stream operators, we can more clearly see the underlying logic.

A switch statement won't be as fast as return (bp == 65) | (bp == 67) | (bp == 71) | (bp == 84), where the integers are 'A', 'C', 'G' and 'T' represented in integer form. Cost is about 2-3 CPU cycles. — user997112, Commented Jun 21, 2019 at 0:23
@user997112 It depends on the compiler and the target machine architecture. On my machine, the switch is faster, and I determined that by actually measuring program speed with representative input data. — Edward, Commented Jun 21, 2019 at 5:30
@Edward A switch statement is more dependent on the compiler and architecture. You don't know if a jump table is being created or not. If there's a jump table you have branch misprediction. In contrast equalities and bitwise ORs don't have multiple code possibilities or branch prediction. You can even separate the equality into two parts, then combine the two results and potentially remove data dependencies so it executes in 1-2 CPU cycles. — user997112, Commented Jun 21, 2019 at 12:55
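
For anyone who wants to try both approaches from the comments above, here is a sketch of the two variants side by side. The function names are made up for this example. Nothing here says which one is faster: as noted in the comments, that depends on the compiler and the target machine, so the only reliable way to find out is to measure with representative input.

```cpp
// Comparison-and-OR variant: each comparison yields 0 or 1, and the results
// are combined with bitwise OR, so there is no short-circuit evaluation.
inline bool is_canon_bitwise(char bp) {
    return (bp == 'A') | (bp == 'C') | (bp == 'G') | (bp == 'T') |
           (bp == 'a') | (bp == 'c') | (bp == 'g') | (bp == 't');
}

// Switch variant, equivalent to the is_canon member in the answer above.
inline bool is_canon_switch(char bp) {
    switch (bp) {
        case 'A': case 'C': case 'G': case 'T':
        case 'a': case 'c': case 'g': case 't':
            return true;
    }
    return false;
}
```

Both functions return the same result for every possible input byte, so either can be passed to `std::all_of` without changing the program's output.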

Emily L. · Accepted Answer · 2017-04-27 22:51:38Z

In addition to the two excellent answers you have already received I would like to point out another way of simplifying non_canon:

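    bool non_canon(char bp){
        // A std::string generally cannot be declared constexpr like this,
        // so use a function-local static const instead.
        static const std::string canon_alphabet {"actgACTG"};
        return std::string::npos == canon_alphabet.find(bp);
    }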
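
If C++17 is available, a variant of the same idea can keep the alphabet as a compile-time constant by using `std::string_view`. This is a sketch under that assumption, not part of the answer as posted:

```cpp
#include <string_view>

// Returns true if bp is not one of A, C, G, T (either case). Requires C++17.
bool non_canon(char bp) {
    constexpr std::string_view canon_alphabet{"actgACTG"};
    return canon_alphabet.find(bp) == std::string_view::npos;
}
```

Because a `std::string_view` only refers to the string literal, there is no allocation and no static initialization to worry about.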
