
I am looking for an algorithm (pseudocode or R code) for my problem. This looks like a lot to read, but it is actually very easy to follow; I am just being verbose.

I have a list (say coords) with 8 million elements. They look like this:

    > head(coords)
       cstart     cend
    1   35085    35094
    2   35089    35098
    3   35374    35383
    4   35512    35521
    5   35513    35522
    6   35514    35523
    ...
    98  12600309 12600318

These elements represent coordinates on a DNA string; i.e., 35374 refers to a position on the DNA string. I don't care whether it's an A, C, G, or T, for what it's worth. AND ALL coords have length 10, so the cend field can be calculated. (I don't actually use cend in my code.)

I have a secondary list (say alignments) (~100 elements) that also has a start and end coordinate. This defines an area of interest. It looks like this:

    chr_name chr_strand start_index end_index
    "chr1"   "+"        "12600311"  "12600325"

The columns chr_name and chr_strand are not important. In this case, our "area of interest" could be longer than 10.


My problem

Now, what I want to do is go through all 8 million records and see if each record is contained in the areas of interest (with at least 80% overlap; I will explain this below). For the above example, the first 6 elements are nowhere "near" the area of interest.

Row 98 starts 2 "base pairs" before our area of interest, BUT overlaps it by 80%. I.e., 80% of the "site" is contained in the area of interest.
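For a concrete check of that 80% figure: since a site is a fixed-length interval, its overlap with an area of interest can be computed with a little arithmetic rather than by walking every position. A minimal sketch (the helper name `overlap_fraction` is mine, not from the post):

```r
# O(1) overlap fraction between a fixed-length site and an area of interest.
# overlap = size of the intersection of [cstart, cend] and [aoe_start, aoe_end]
overlap_fraction <- function(cstart, aoe_start, aoe_end, site_len = 10) {
  cend    <- cstart + site_len - 1
  overlap <- max(0, min(cend, aoe_end) - max(cstart, aoe_start) + 1)
  overlap / site_len
}

overlap_fraction(12600309, 12600311, 12600325)  # row 98: 8/10 = 0.8
overlap_fraction(35085,    12600311, 12600325)  # row 1: no overlap, 0
```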

Result

Basically, I am looking for a list (I can preallocate it) where each element is TRUE/FALSE. So from the above, I'll have a list like the following:

FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ..., TRUE (97'th index)

My current R code

I have documented the R code; it is very easy to read:

    ## mdply is a function that acts essentially like a loop.
    ## .data is the data I am working with: the 8 million records.
    ## For each row, it executes a function whose signature
    ## matches the column names.
    ## The best part about this .. IS THAT IT'S PARALLELIZED
    ## and I can use up to 24 cores to do this.
    b <- mdply(.data = coords, function(cstart, cend) {
      # the function is executed for each row

      # loop through our "areas of interest"
      for (j in 1:nrow(alignments)) {
        chr_info <- extract_information_from_alignment(alignments[j, ])

        interesting   <- 0
        uninteresting <- 0
        hits_index_start <- cstart
        hits_length      <- 10  # the length of our site

        aoe_start <- as.numeric(chr_info[["start_index"]])
        aoe_end   <- as.numeric(chr_info[["end_index"]])

        # go through EACH INDEX AND CALCULATE THE OVERLAP
        # (note: the inner index must not reuse j, which is the alignment index)
        for (k in hits_index_start:(hits_index_start + hits_length - 1)) {
          if ((aoe_start <= k) && (k <= aoe_end)) {
            interesting <- interesting + 1
          } else {
            uninteresting <- uninteresting + 1
          }
        } # end for - overlap calc

        substr_score <- interesting / (interesting + uninteresting)
        if (substr_score >= substr_score_threshold) {
          return(TRUE)  # an early return is only safe on a hit;
        }               # otherwise keep checking the remaining alignments
      } # end for over alignments
      FALSE
    }, .parallel = TRUE)

I ran this code last night for 13 hours and it hadn't completed. I estimate it takes about 16 hours.

Ideas

I think the bottleneck is that for each row there is a for loop (100 elements), and within that for loop a second one that calculates the overlap.
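That inner overlap loop can be eliminated entirely: comparing one site against every alignment at once is a couple of vectorized `pmin`/`pmax` calls. A sketch under the question's assumptions (site length 10, 80% threshold); the helper name `is_hit` is mine:

```r
# Check one site [cstart, cstart + site_len - 1] against ALL alignments at once.
# aoe_start and aoe_end are numeric vectors, one entry per alignment.
is_hit <- function(cstart, aoe_start, aoe_end, site_len = 10, threshold = 0.8) {
  # per-alignment overlap in bases, clamped at zero when the intervals are disjoint
  overlap <- pmax(0, pmin(cstart + site_len - 1, aoe_end) - pmax(cstart, aoe_start) + 1)
  any(overlap / site_len >= threshold)
}

is_hit(12600309, c(100, 12600311), c(200, 12600325))  # row 98: TRUE
is_hit(35085,    c(100, 12600311), c(200, 12600325))  # row 1: FALSE
```

This removes the 10-iteration inner loop and the 100-iteration alignment loop per row, though it still scans every row; the binary-search idea below avoids even that.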

But I am thinking I don't need to go through EACH alignment. I should first test whether my cstart is even close to the area of interest.

Maybe some sort of binary search?

1 Answer


As you've noticed, nested loops can get expensive. 8 million iterations of the outer loop, times 100 iterations of the middle loop, times 10 iterations of the inner loop (the site length) adds up to a lot of processing.

    I think you're on the right track though - try to eliminate as many of those 8 million coords as possible before doing any processing on them. Binary Search is definitely your friend.

The approach I'd take would look roughly like this:

    sort the coords list                  // done once, use a library function, fast
    create your output array, initialize all to 'false'
    for each alignment in alignments      // this loop should iterate 100x
        // the first coord start that would fit your 80% coverage
        // would be alignment.start_index - 2 (since all coords have length 10
        // and you need an 80% overlap)
        let searchStart = alignment.start_index - 2
        binary-search for searchStart in the sorted coords list
        // you'll either find the first coord in the area of interest, or the
        // location in the list where that first one would go
        // do the same for the search end (since the alignment appears to be
        // variable length), because iterating over the list of coords from
        // the starting point would probably be slower
        mark every coord between the two found positions as 'true'
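The pseudocode above might look like the following in R, using `findInterval`, which performs a binary search on a sorted numeric vector. The example data mirrors the rows shown in the question; `site_len = 10`, the 80% threshold, and areas of interest at least `site_len` bases long are assumptions carried over from the post:

```r
# toy data shaped like the question's coords and alignments
coords <- data.frame(cstart = c(35085, 35089, 35374, 35512, 35513, 35514, 12600309))
alignments <- data.frame(start_index = "12600311", end_index = "12600325",
                         stringsAsFactors = FALSE)

site_len    <- 10
min_overlap <- ceiling(0.8 * site_len)   # 8 bases

# sort the starts once; remember the permutation so results map back
ord <- order(coords$cstart)
cstart_sorted <- coords$cstart[ord]

hit <- logical(nrow(coords))             # preallocated output, all FALSE

for (j in seq_len(nrow(alignments))) {
  aoe_start <- as.numeric(alignments$start_index[j])
  aoe_end   <- as.numeric(alignments$end_index[j])

  # a site [cstart, cstart + 9] overlaps >= 8 bases exactly when
  # aoe_start - 2 <= cstart <= aoe_end - 7
  lo <- aoe_start - (site_len - min_overlap)
  hi <- aoe_end - min_overlap + 1

  # findInterval is a binary search over the sorted starts
  i_lo <- findInterval(lo - 1, cstart_sorted) + 1  # first index with cstart >= lo
  i_hi <- findInterval(hi, cstart_sorted)          # last index with cstart <= hi

  if (i_lo <= i_hi) hit[ord[i_lo:i_hi]] <- TRUE
}

hit  # FALSE for the first six sites, TRUE for the last (the question's row 98)
```

Each alignment costs two O(log n) searches plus marking the hits, so the 100-alignment loop is trivial compared with 8 million row-wise checks.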

    I would expect this to run non-parallel fairly quickly - certainly not requiring hours.

All it's doing is calling binary search on an 8-million-entry list 100 or 200 times (depending on whether you search for the end_index or iterate to it).
