I am looking for an algorithm (pseudocode or R code) for my problem here. This looks like a lot to read, but it is actually very easy to follow; I am just trying to be verbose.
I have a list (say `coords`) with 8 million elements. They look like this:

```
> head(coords)
     cstart     cend
1     35085    35094
2     35089    35098
3     35374    35383
4     35512    35521
5     35513    35522
6     35514    35523
...
98 12600309 12600318
```
These elements represent coordinates on a DNA string, i.e. `35374` refers to a position on the DNA string. I don't care whether it's an A, C, G, or T, for what it's worth. AND ALL coords are length 10, so the `cend` field can be calculated. (I don't actually use this `cend` in my code.)
I have a secondary list (say `alignments`, ~100 elements) whose elements also have a `start` and `end` coordinate. Each one defines an area of interest. It looks like this:

```
chr_name chr_strand start_index  end_index
"chr1"   "+"        "12600311"   "12600325"
```

The columns `chr_name` and `chr_strand` are not important. In this case, our "area of interest" could be larger than 10.
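For reference, a tiny sample of both structures can be built like this (the values are copied from the excerpts above; in practice `coords` would of course have 8 million rows):

```r
# Hypothetical miniature of the two structures described above
coords <- data.frame(
  cstart = c(35085, 35089, 35374, 35512, 35513, 35514, 12600309),
  cend   = c(35094, 35098, 35383, 35521, 35522, 35523, 12600318)
)

alignments <- data.frame(
  chr_name    = "chr1",
  chr_strand  = "+",
  start_index = "12600311",  # stored as character, as shown above
  end_index   = "12600325",
  stringsAsFactors = FALSE
)
```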
My problem
Now, what I want to do is go through all 8 million records and see if each record is contained in the `alignments` list (up to 80% overlap, which I will explain). For the above example, the first 6 elements are nowhere "near" the area of interest. Row 98 starts 2 "base pairs" before our area of interest, BUT overlaps it by 80%, i.e. 80% of the "site" is contained in the area of interest.
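Incidentally, this overlap does not need per-position counting: the number of positions shared by two closed integer intervals [a, b] and [c, d] is max(0, min(b, d) - max(a, c) + 1). A minimal sketch (the helper name is my own):

```r
# Positions shared by the closed integer intervals [a, b] and [c, d]
overlap_length <- function(a, b, c, d) {
  pmax(0, pmin(b, d) - pmax(a, c) + 1)
}

# Row 98: site [12600309, 12600318] vs. area of interest [12600311, 12600325]
overlap_length(12600309, 12600318, 12600311, 12600325) / 10  # 0.8, i.e. 80%
```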
Result
Basically, I am looking for a list (which I can preallocate) where each element is TRUE/FALSE. So from the above I'll have a list like the following:

```
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ..., TRUE (98th index)
```
My current R code

I have documented R code that is very easy to read:

```r
## mdply acts essentially like a loop:
## .data is the data I am working with -- the 8 million records.
## For each row, it executes a function whose signature
## matches the column names.
## The best part about this: IT IS PARALLELIZED,
## and I can use up to 24 cores to do this.
b <- mdply(.data = coords, function(cstart, cend){
  # the function is executed for each row

  # loop through our "areas of interest"
  for(j in 1:nrow(alignments)){
    chr_info <- extract_information_from_alignment(alignments[j, ])

    interesting   <- 0
    uninteresting <- 0
    hits_index_start <- cstart
    hits_length      <- 10   # the length of our site

    aoe_start <- as.numeric(chr_info[["start_index"]])
    aoe_end   <- as.numeric(chr_info[["end_index"]])

    # go through EACH INDEX and calculate the overlap
    # (the inner loop needs its own variable -- reusing j here was a bug)
    for(k in hits_index_start:(hits_index_start + hits_length - 1)){
      if((aoe_start <= k) && (k <= aoe_end)){
        interesting <- interesting + 1
      } else {
        uninteresting <- uninteresting + 1
      }
    } # end for - interesting calc

    substr_score <- interesting / (interesting + uninteresting)
    if(substr_score >= substr_score_threshold){
      return(TRUE)   # this site overlaps this area of interest enough
    }
    # otherwise keep checking the remaining alignments
    # (returning FALSE here, as before, skipped all later alignments)
  } # end for over alignments
  return(FALSE)
}, .parallel = TRUE)
```
I ran this code last night for 13 hours and it wasn't complete. I estimate it would take about 16 hours.
Ideas

I think the bottleneck is that for each row there is a for loop (~100 elements), and within that for loop a second one that calculates the overlap.
But I am thinking: I don't need to go through EACH alignment. I should first test whether my `cstart` is even close to an area of interest. Maybe some sort of binary search?
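To sketch that idea: base R's `findInterval` does exactly this binary search, and combined with the closed-form interval intersection it removes both inner loops. This is only a sketch under assumptions of mine: the areas of interest are numeric, non-overlapping, each at least as long as a site, and sorted by start coordinate (then checking the found candidate plus its left neighbour suffices, since a length-10 site can straddle at most one area boundary):

```r
# Example area of interest from the question
alignments <- data.frame(start_index = "12600311", end_index = "12600325",
                         stringsAsFactors = FALSE)

# Sort areas of interest by start coordinate (required by findInterval)
ord       <- order(as.numeric(alignments$start_index))
aoe_start <- as.numeric(alignments$start_index)[ord]
aoe_end   <- as.numeric(alignments$end_index)[ord]

site_hits <- function(cstart, site_len = 10, threshold = 0.8) {
  cend <- cstart + site_len - 1
  # findInterval = binary search: index of the last area starting at or
  # before the site's end (0 if the site ends before every area starts)
  idx <- findInterval(cend, aoe_start)
  hit <- rep(FALSE, length(cstart))
  for (off in 0:1) {                  # candidate area and its left neighbour
    j  <- idx - off
    ok <- j >= 1
    ov <- pmax(0, pmin(cend[ok], aoe_end[j[ok]]) -
                  pmax(cstart[ok], aoe_start[j[ok]]) + 1)
    hit[ok] <- hit[ok] | (ov / site_len >= threshold)
  }
  hit
}

# Vectorized over all sites at once, e.g. b <- site_hits(coords$cstart)
site_hits(c(35085, 12600309))  # FALSE for row 1, TRUE for row 98
```

With this, the 8 million rows are processed in a handful of vectorized passes instead of 8 million function calls. The Bioconductor IRanges package (`findOverlaps`, with its `minoverlap` argument) also solves exactly this class of problem.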