Codility's GenomicRangeQuery solution in C#

Question

I've been learning C# for a few months now, and I've done some basic things in java last year. I made a solution for the problem the title states, this is the task description.

Task description
A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T have impact factors of 1, 2, 3 and 4, respectively. You are going to answer several queries of the form: What is the minimal impact factor of nucleotides contained in a particular part of the given DNA sequence?
The DNA sequence is given as a non-empty string S = S[0]S[1]...S[N-1] consisting of N characters. There are M queries, which are given in non-empty arrays P and Q, each consisting of M integers. The K-th query (0 ≤ K < M) requires you to find the minimal impact factor of nucleotides contained in the DNA sequence between positions P[K] and Q[K] (inclusive).
For example, consider string S = CAGCCTA and arrays P, Q such that:
P[0] = 2 Q[0] = 4 P[1] = 5 Q[1] = 5 P[2] = 0 Q[2] = 6 
The answers to these M = 3 queries are as follows:
The part of the DNA between positions 2 and 4 contains nucleotides G and C (twice), whose impact factors are 3 and 2 respectively, so the answer is 2. The part between positions 5 and 5 contains a single nucleotide T, whose impact factor is 4, so the answer is 4. The part between positions 0 and 6 (the whole string) contains all nucleotides, in particular nucleotide A whose impact factor is 1, so the answer is 1. Write a function:
class Solution { public int[] solution(string S, int[] P, int[] Q); }
that, given a non-empty string S consisting of N characters and two non-empty arrays P and Q consisting of M integers, returns an array consisting of M integers specifying the consecutive answers to all queries.
Result array should be returned as an array of integers.
For example, given the string S = CAGCCTA and arrays P, Q such that:
P[0] = 2 Q[0] = 4 P[1] = 5 Q[1] = 5 P[2] = 0 Q[2] = 6 
the function should return the values [2, 4, 1], as explained above.
Write an efficient algorithm for the following assumptions:
N is an integer within the range [1..100,000]; M is an integer within the range [1..50,000]; each element of arrays P, Q is an integer within the range [0..N − 1]; P[K] ≤ Q[K], where 0 ≤ K < M; string S consists only of upper-case English letters A, C, G, T.

This is my 100% task score solution in C#. What things can be improved?

/// <summary> /// Find the minimum impact factor for every query in a given chain /// </summary> /// <returns> /// An array of ints, every int represents the minimum impact factor found /// within ranges in a given chain /// </returns> public int[] GetMinimumImpactFactors(string S, int[] P, int[] Q) { //nuc(s) is short for nucleotide(s) Dictionary<char, int> nucsDict = new Dictionary<char, int>(){{'A',1},{'C',2},{'G',3},{'T',4}}; int[] result = new int[P.Length]; int[] nucsInOrder= new int[S.Length]; int nucsIndex = 0; //iterate through dictionary, looking for A's, then C's, and so on... foreach ( var searchedImpactPair in nucsDict ) { //Appearances of <searchedImpactPair.Key> have a given position in <S> chain, those positions are saved for (int currNucPos=0; currNucPos < S.Length; currNucPos++ ) { if( S[currNucPos] == searchedImpactPair.Key ) { nucsInOrder[nucsIndex++] = currNucPos ; } } } int start, end, minImpactValueFound, minNucFound; char nucleotide; for(int positionWithinRange = 0; positionWithinRange < P.Length; positionWithinRange++) { start = P[positionWithinRange]; end = Q[positionWithinRange]; for ( int nucInOrder = 0; nucInOrder < nucsInOrder.Length; nucInOrder++ ) { minNucFound = nucsInOrder[nucInOrder]; if ( minNucFound >= start && nucsInOrder[nucInOrder] <= end) { nucleotide = S[minNucFound]; minImpactValueFound = nucsDict[nucleotide]; result[positionWithinRange] = minImpactValueFound; break; } } } return result; }

Skary · Accepted Answer · 2022-08-27 07:43:53Z

I have rewritten the code avoiding nested loops and using linq with enumeration to simplify the code.

I have commented your code and added comments in the format //--> with annotations on why I have commented your code and why the substituted code is equivalent.

As you can see you only need 3 steps :

Normalize dna input to (string) to a sequence of integer
Normalize cutting point (P,Q) from a independant data structure to a "tuple" like datastructure
Enumerate each cutting point in the collection and calculate the score
As stated in the comment i have not made any validation over the input that are always assumed as correct.

I also admit that English is not my native language, so it's very likely I made mistakes here in the description but also in the comments (inside the program). If something is not clear, please let me know in the comments below.

 /// <summary> /// Find the minimum impact factor for every query in a given chain /// </summary> /// <returns> /// An array of ints, every int represents the minimum impact factor found /// within ranges in a given chain /// </returns> public int[] GetMinimumImpactFactors(string dna, int[] startingPoints, int[] endingPoints) //--> variables should be lower case and meaningful (no P,S,Q but instead dna,startinPoints, endingPoints) //--> because starting point and ending points are bounded, consider to use a datastructure (struc or class) to represent the pair { //--> As explained below converrt the S string (an array of character) to an array of actual integer value is simpler //nuc(s) is short for nucleotide(s) //Dictionary<char, int> nucsDict = new Dictionary<char, int>() { { 'A', 1 }, { 'C', 2 }, { 'G', 3 }, { 'T', 4 } }; //--> use a list, may be simpler, so you can just add result pieces as you get them. But in this case may be simpler because you can get pieces on the fly as you enumerate over "transformed" genome //int[] result = new int[P.Length]; //--> Double loop it's very inefficient loop once and substitute character representation with integer value //int[] nucsInOrder = new int[S.Length]; //int nucsIndex = 0; ////iterate through dictionary, looking for A's, then C's, and so on... //foreach (var searchedImpactPair in nucsDict) //{ // //Appearances of <searchedImpactPair.Key> have a given position in <S> chain, those positions are saved // for (int currNucPos = 0; currNucPos < S.Length; currNucPos++) // { // if (S[currNucPos] == searchedImpactPair.Key) // { // nucsInOrder[nucsIndex++] = currNucPos; // } // } //} //--> use linq to simply represent a loop and nested ternary operator to encode a switch-case (inline a switch case is good for performance if you have few condition, eg < 5) // then materialize the enumeration returned by linq "loop" to an actual array. This improve the previous two loop with just a single loop (abstracted by linq) int[] nucleotidesValues = dna.ToCharArray().Select(crt => crt == 'A' ? 1 : (crt == 'C' ? 2 : (crt == 'G' ? 3 : 4))) .ToArray(); //--> As stated in the method signature definition is a better idea to pass an array of datastrucutre that actually represent a pair of integer (start and end) but maybe this is a constraint of you assignement // So i buil the pail here, for simplicity it's not start-end tuple but start-length var cuttingPoints = startingPoints.Zip(endingPoints, (start, end) => new int[] { start, end - start + 1 }) .ToArray(); return cuttingPoint.Select(cutting => nucleotidesValues.Skip(cutting[0]).Take(cutting[1]).SpecialPourposeMin(absoluteMin : 1)).ToArray(); //--> again, nested loop are not good. For each cuttingPoint we simply cut nucleotidesValues and then implicitly loop on the sub array //int start, end, minImpactValueFound, minNucFound; //char nucleotide; //for (int positionWithinRange = 0; positionWithinRange < startingPoints.Length; positionWithinRange++) //{ // start = startingPoints[positionWithinRange]; // end = endingPoints[positionWithinRange]; // for (int nucInOrder = 0; nucInOrder < nucsInOrder.Length; nucInOrder++) // { // minNucFound = nucsInOrder[nucInOrder]; // if (minNucFound >= start && nucsInOrder[nucInOrder] <= end) // { // nucleotide = S[minNucFound]; // minImpactValueFound = nucsDict[nucleotide]; // result[positionWithinRange] = minImpactValueFound; // break; // } // } //} //return result; }

Here the special pourpose implementation of a Linq Min function that will exit as soon as it reach the absolute minimum (optimize long sequence situation) :

public static int SpecialPourposeMin(this IEnumerable<int> source, int absoluteMin) { if (source == null) throw new ArgumentNullException("source"); int min = int.MaxValue; foreach (int item in source) { if (item < min) min = item; if (min == absoluteMin) return absoluteMin; } return min; }

I have created a little test, with 2048 DNA sting and 1024 start/end pair of at least of length 100.

I get the result in ~~~10ms~~ ~2ms (after optimized min function). Here is the test code :

static void Main(string[] args) { //i will not validate the DNA string List<int> start = new List<int>(); List<int> end = new List<int>(); Random rng = new Random(); int pairCount = 1024; for(int index = 0; index < pairCount; index++) { int indexBase = rng.Next(0, 500); start.Add(rng.Next(indexBase, indexBase + 500)); end.Add(rng.Next(indexBase + 600, indexBase + 1000)); } int[] result = Program.GetMinimumImpactFactors(_DNA, start.ToArray(), end.ToArray()); Console.WriteLine(string.Join(", ", result)); Console.ReadKey(); }

Stack Exchange Network

Codility's GenomicRangeQuery solution in C#

1 Answer 1

Hot Network Questions

Codility's GenomicRangeQuery solution in C#

1 Answer 1

Related

Hot Network Questions