The problem
I have a program that takes a file of DNA sequences as input and writes each entry to one of two output files: one file for sequences containing only canonical characters (A
, C
, G
, and T
), and another file for sequences containing non-canonical (or ambiguous) characters.
The data
DNA sequencing machines produce gobs of data, with each entry adhering to the following format.
@uniqueID free-form description and/or metadata ACGTACGTACGTACGT + BBBBBBBBBBBBBBBB
The first line contains an ID and metadata, the second line contains DNA sequence (this is what the program filters on), the third line is stupid and can be ignored, and the fourth line contains ASCII-encoded quality values corresponding to each nucleotide in the DNA. Here are 3 entries from a real file.
@ERR899705.3060 HWI-D00574:77:C6A74ANXX:5:1101:2907:2432/1 GCCTTCCAGAAAAATGTTCCCAGTTCATTGATGGCAAACTCAAACAGGTTTGAAAAAAAATCTCAATATTAAATATGGATATGATGCATCAGTTCAAGGTGATTTCTGGTTAGCTGTGTGCTGACA + BCCCCGGGEGGGGGGGGGGGGGGGGGFGGGGGGGGGGGEGGGGGGGGE1>FCFGF1F<DDDGGGGDGGCGGGGGGGGGGGFGGGGGGEGEGG>FCEG@@0FGGGGGCEDGCG>DCGGGGG=0D=GG @ERR899705.3061 HWI-D00574:77:C6A74ANXX:5:1101:2869:2437/1 CAAATAAATAAATAAATATTTAGTTACTGAAATAATAATGTTGATTGGTAAAATAAGAATGTATTCTCTCTAAAAACTAGTTAGTTTGTACTAAAAAATGGGCATAATTATTATCTCAAAGATTAG + BBBBBEGGGGGGGGGGGGGGGGGGGGGGD@GGGGGGGGEGC@FGFG>F:CGGGGGGFGGGCFCBDC11:F>FGGGGBEGGGGGGGGGC@F@FGGGGGG>000F@FGGGGGGEGGG00B0CFGGE>0 @ERR899705.3062 HWI-D00574:77:C6A74ANXX:5:1101:2786:2438/1 TTGGAAAACATTTATTACATTTTCTTCTAATTTCTTAGAATCATTTACTTGTAATATTTAGATACTCGGCGATCCTTTCAGGTATTCACTGCATATCCCCAAATATTGCATCAAACTCTCCCACGT + AABB0111@F111F111=1;@@;111E1?111:111=1=1111<11111<:11:C111<1EBG11:>F/////0/1:<F>C1:11=0:=0;@0>=0000==8>00000=8000<0000008000;0
When a DNA sequencing machine cannot reliably determine the molecule at a given position, it will use an N
character instead of an ACGT
character. Also, programs for analyzing DNA sequences will sometimes use other non-ACGT
characters (IUPAC notation) to represent uncertainty about the content of DNA sequences.
The constraints
I need to handle sequences containing ambiguous characters separately from sequences containing only canonical characters. The input files are huge, so I want something as performant as possible, but I'm also trying to keep it as simple and easy to read as possible.
- The code does no sanity checking on the input data (assumes the number of lines in the input file is a multiple of 4, assumes there are no blank lines, etc). I am ok with this for my current purposes.
- It abuses
stderr
as a second output stream, although I'm not sure it's worth the extra clutter to parse command-line args, provide a usage statement, etc.
The code
The following minimal C++ program solves the problem as stated. Keeping the program simple, are there any ways I can improve performance or clarity of the code?
I'd welcome any feedback, but I'm specifically interested in the following.
- How should I format the
if
statement in thenon_canon
function? Everything I've tried looks hideous. - I'm using the ternary operator to conditionally assign the output stream. My typical pattern in Python is to assign, check, and then reassign if needed, but
std::ostream
references cannot be reassigned in C++. Is there a clearer way of doing this?
#include <algorithm> #include <iostream> #include <string> char non_canon(char bp) { if ( bp != 'A' && bp != 'a' && bp != 'C' && bp != 'c' && bp != 'G' && bp != 'g' && bp != 'T' && bp != 't' ) { return true; } return false; } int main() { std::string defline, seq, qual, buffer; while (getline(std::cin, defline)) { getline(std::cin, seq); getline(std::cin, buffer); getline(std::cin, qual); bool non_acgt = std::any_of(seq.begin(), seq.end(), non_canon); std::ostream& outstream = non_acgt ? std::cerr : std::cout; outstream << defline << '\n' << seq << "\n+\n" << qual << std::endl; } return 0; }