
A sorting algorithm I have written.
Pros
- Gives each element a position once.
- Few comparisons needed.
Cons
- For certain sets of data, a large amount of extra storage is needed.
- Only works with integers.

The main reason I am posting the algorithm here is to find out whether it already exists or not. From the research I have done I can't find anything similar. Though I am sceptical, it would be fun to know whether it is new. What I mostly want to know, however, is whether the algorithm could be useful and whether my implementation of it is reasonable.

Algorithm:
Basic idea: When you divide the number you want to sort by the largest number of the set, you get a percentage. This percentage gives a rough position of where in the sorted list that number should go.

Because it is difficult to know how the data is distributed, the mean/average is used. By dividing the mean value by the number of unique values in the set, you can decide how much of the data set lies above and below the mean value.
Then, by applying the above idea to all the elements, each element gets a position and can be put in a new, sorted list.

Some lists with a high amount of clustering, or a small number of elements with large values, can produce the same calculated position for different elements. Therefore all the elements are first put in a larger array to reduce the number of collisions, and if a collision occurs the element is moved slightly up or down. Because the new array is larger, elements can be moved up and down without causing problems with the insertion of new ones.
Duplicates are handled by counting the collisions of each element.

When all elements have been placed in the larger array it is then collapsed into an array of the correct size.

Example:
[4, 20, 6, 4, 4]
min = 4, max = 20, mean = 8

Calculating the possible number of unique values:
(max - min + 1) = (20 - 4 + 1) = 17
Calculating a percentage for the mean:
(1 - (mean-min)/unique) = (1 - (8-4)/17) = 0.76
This tells us roughly how large a portion of the elements lies below the mean. This makes sense because 4 out of 5 elements are lower than 8.

If for this example we create an array 20 times larger (100 elements), we know that roughly the first 80 elements are for the 4 elements below 8. Thus each element has 20 places to use for collisions.
First two elements:
4 => (element - min) * (places per element) = (4 - 4) * 20 = 0
6 => (6 - 4) * 20 = 40
Here 4 gets position 0 and 6 gets position 40. Just as 6 lies between 4 and 8, 40 lies between 0 and 80. After doing this for all elements and then removing all the empty space in the large array, the array has been sorted.
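
To make the arithmetic concrete, here is a small C# sketch that only reproduces the numbers of this example (assuming the usual using System; it is not the full algorithm):

// Reproduces the example above: data = [4, 20, 6, 4, 4], size = 20.
static void exampleWalkthrough()
{
    int min = 4, max = 20, mean = 8, size = 20;
    int unique = max - min + 1;                              // 17 possible values
    double meanPercent = 1 - (mean - min) / (double)unique;  // 1 - 4/17 ≈ 0.76
    // Elements below the mean each get "size" places in the large array:
    Console.WriteLine((4 - min) * size);   // 4 -> position 0
    Console.WriteLine((6 - min) * size);   // 6 -> position 40
    Console.WriteLine(meanPercent);        // ≈ 0.76, i.e. roughly the first 80 of the 100 places
}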

Testing:
I have timed a Python version of this code by hard-coding different scenarios and
comparing the time with quicksort and counting sort. Out of a lack of experience I
can't draw any real conclusions, but these are my results:

With a set of data larger than roughly 500 elements, spacesort took half the time quicksort did.
Generally, counting sort is faster, but certain sets of data, such as a completely homogeneous set, can be sorted faster with spacesort after a few code changes. And of course sets with a large range of values.
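
For reference, a rough C# timing sketch (a hypothetical harness, not the Python script I actually used; it assumes the spaceSort function from the code further down) could look like this:

// Rough timing comparison between spaceSort and the built-in Array.Sort.
// Results will vary a lot with the data distribution and the chosen size.
static void timeSorts(int n, int size)
{
    var rng = new Random(42);
    int[] data = new int[n];
    for (int i = 0; i < n; i++) data[i] = rng.Next(0, 10 * n);

    var sw = System.Diagnostics.Stopwatch.StartNew();
    int[] bySpaceSort = spaceSort((int[])data.Clone(), size);
    sw.Stop();
    Console.WriteLine($"spaceSort:  {sw.ElapsedMilliseconds} ms");

    int[] reference = (int[])data.Clone();
    sw.Restart();
    Array.Sort(reference);
    sw.Stop();
    Console.WriteLine($"Array.Sort: {sw.ElapsedMilliseconds} ms");
}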

Good to know:
The large empty array created is called "space" in the code.

In addition to the data array, the function needs an argument "size"; this constant multiplied by the length of the data array gives the length of the space array. E.g. the example above used size = 20.

How large the empty array needs to be depends a lot on the set of data. The smaller the size, the faster the algorithm and the lower the memory usage. A completely homogeneous set needs only the same size as the original data set; the less homogeneous the data, the larger it needs to be. A good starting point might be 10 times larger.

struct spaceElement
{
    public int data;
    public int count; // Number of additional copies (duplicates) of this value.
    public bool set;  // True if a value has been set to this element.
}

/* Takes two arguments, data = set of data to be sorted and size,
   then returns the sorted list. */
static int[] spaceSort(int[] data, int size)
{
    int min, max, mean;
    minMaxMean(data, out min, out max, out mean);
    spaceElement[] space = insertElements(data, size, min, max, mean);
    return collapseSpace(space, data.Length);
}

// Arguments: list of data, size constant for space, min-value, max-value, mean-value.
// Returns the space array with sorted elements.
static spaceElement[] insertElements(int[] data, int size, int min, int max, int mean)
{
    int lowMax, highMin;
    lowMaxhighMin(data, min, max, mean, out lowMax, out highMin);

    // Largest possible number of different elements, e.g. 4,20,6,4,4 has 20-4+1 = 17 elements.
    int numDiffElements = max - min + 1;
    // Determines what percentage of elements are below the mean, e.g. 4,5,6,7,20 => 1 - (7 - 4)/17 = 0.82.
    double meanPercent = 1 - (mean - min) / (double)numDiffElements;

    // Creates a large array where the sorted elements can be inserted according to their size.
    int spaceLength = data.Length * size;
    spaceElement[] space = new spaceElement[spaceLength + size * 2 + 1];

    // Determine how much space each element gets if it lies below or above the mean.
    double lowMult = meanPercent * data.Length * size / (lowMax - min + 1);
    double highMult = (1 - meanPercent) * data.Length * size / (max - highMin + 1);

    foreach (int x in data)
    {
        bool done = false;
        int index = 0;
        int lastStep = size;

        // Determine the initial position of the element in the space array.
        if (x <= mean)
        {
            index = (int)Math.Round((x - min) * lowMult + size / 2);
            lastStep = (int)Math.Round(lowMult);
        }
        else
        {
            index = (int)Math.Round((x - highMin) * highMult + size + meanPercent * spaceLength);
            lastStep = (int)Math.Round(highMult);
        }

        /* First checks if the position is empty; if it is, put the element there.
           Else, if the value equals the already stored value, increment count on that element.
           If the value is not equal to the already stored value, jump up/down half of the last step. */
        while (!done)
        {
            if (!space[index].set)
            {
                space[index].data = x;
                space[index].set = true;
                done = true;
            }
            else if (space[index].data == x)
            {
                space[index].count += 1;
                done = true;
            }
            else if (space[index].data > x)
            {
                lastStep = (int)Math.Round(lastStep / 2.0d);
                index -= lastStep;
            }
            else
            {
                lastStep = (int)Math.Round(lastStep / 2.0d);
                index += lastStep;
            }
        }
    }
    return space;
}

/* Arguments: list of data.
   Out: min-value, max-value and mean/average-value. */
static void minMaxMean(int[] data, out int min, out int max, out int mean)
{
    min = data[0];
    max = data[0];
    long sum = 0;
    foreach (int x in data)
    {
        sum += x;
        if (x < min) min = x;
        if (x > max) max = x;
    }
    mean = (int)Math.Round(sum / (double)data.Length);
}

/* Arguments: list of data, min-value, max-value, mean/average-value.
   Returns the highest value less than or equal to the mean-value
   and the lowest value greater than the mean-value. */
static void lowMaxhighMin(int[] data, int min, int max, int mean, out int lowMax, out int highMin)
{
    lowMax = min;
    highMin = max;
    foreach (int x in data)
    {
        if (x > lowMax && x <= mean) lowMax = x;
        if (x < highMin && x > mean) highMin = x;
    }
}

// Creates an array of the same size as the original data array and populates it with the sorted values.
static int[] collapseSpace(spaceElement[] space, int length)
{
    int[] newArray = new int[length];
    int count = 0;
    foreach (spaceElement x in space)
    {
        if (x.set)
        {
            for (int y = 0; y <= x.count; y++)
            {
                newArray[count] = x.data;
                count++;
            }
        }
    }
    return newArray;
}
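
A quick way to sanity-check the implementation (a hypothetical helper that just compares the output against the built-in sort on random data):

// Sanity check: compare spaceSort's output with Array.Sort on random input.
static bool matchesBuiltInSort(int n, int size)
{
    var rng = new Random();
    int[] data = new int[n];
    for (int i = 0; i < n; i++) data[i] = rng.Next(0, 1000);

    int[] expected = (int[])data.Clone();
    Array.Sort(expected);

    int[] actual = spaceSort(data, size);
    for (int i = 0; i < n; i++)
        if (actual[i] != expected[i]) return false;
    return true;
}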
  • Can you guarantee this to give a correct sort? You make two passes just to get lowMax and highMin.
    – paparazzo
    Commented Mar 22, 2017 at 18:26
  • Even though I have tested the code quite a bit, I have only tested hard-coded scenarios that I could think of and checked the results. To give such a guarantee I suppose I would need some kind of mathematical proof that it works for every possible distribution. I do not know how to do that, so no, I unfortunately cannot guarantee it will always give a correct sort.
    – Stahl
    Commented Mar 22, 2017 at 21:30

2 Answers


So you are using a function mapping each element to the interval (0, 1) to sort them. If you wanted to make this fully generic, here is the way to do that:

First, assume arbitrary precision. Your doubles only allow about \$2^{53}\$ different values to be represented, which means that you will probably run into problems with maliciously-constructed 64-bit integer inputs which end up mapping a large range of numbers to the same double. For example, take an array of 10 elements where 5 of them are the integers 0-4 and the other 5 are, say, \$2^{61}\$ to \$2^{61}+4\$: probably only the first 5 get properly sorted, because for the second 5 the doubles are likely all equal even though the integers are not.
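
A quick C# check makes the problem concrete (the constants are just \$2^{61}\$ and \$2^{61}+4\$; this is only an illustration):

static void precisionDemo()
{
    long a = 1L << 61;                          // 2305843009213693952
    long b = a + 4;                             // 2305843009213693956
    Console.WriteLine((double)a == (double)b);  // True: adjacent doubles are 512 apart at this magnitude
}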

Second, array indexing is going to mess you up. It is not a constant-time operation with regard to the length of the array, so it probably creates a secret \$O(n^2)\$ scaling, and the people who care about the speed of sorting algorithms usually have gigabytes or more of data to sort; they cannot easily just say "oh, I will allocate an array that is substantially bigger than the amount of data I need to sort."

I think when you look at both of these criteria together, you realize that what you're really interested in is a sort of high-fanout tree: and the first place to start with that is a low-fanout tree (like a binary tree, where each bit in the binary expansion tells you whether to go into the left or right subtree) for "correctness", and then to "optimize for speed".

At the end of this analysis you'll be taking N bits from the binary expansion at a time, let's say 6, and then using them to index into a 64-element array. Then, to handle equal elements, we need stacks at the bottom and some way to determine whether the arbitrary-precision numbers are equal, etc. So for the pathological 5+5 example that you couldn't sort, we can just divide the 64-bit ints into bitstrings with bit-shifts and masks in the obvious way. In JSON (but not JS, which doesn't have 64-bit integers!) you could write the result, without trailing nulls, as:

[ [[[[[[[[[[ 0, 1, 2, 3, 4 ]]]]]]]]]], null, [[[[[[[[[[ 2305843009213693952, 2305843009213693953, 2305843009213693954, 2305843009213693955, 2305843009213693956 ]]]]]]]]]] ] 

In fact it seems like you would do best to use a trie data structure to reduce this overhead a little more; base64-encoding the keys, I think this is the trie:

{ "AAAAAAAAAAA": ["A", "B", "C", "D", "E"], "CAAAAAAAAAA": ["A", "B", "C", "D", "E"] } 

Now that you see that this is generally what you are doing, I can tell you that this class of algorithms is called "radix sort". Something similar can be read in Haskell code if you look at the implementation of the standard container Data.IntSet. The direct generalization, where you say "wait, now that I see that I really need this tree structure, what happens if I just let the data direct the shaping of the tree?", is called a "self-balancing" search tree, so you might want to start by looking up the red-black tree and go from there.
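
As a rough illustration of the idea, here is a minimal LSD radix sort sketch in C# that consumes 6 bits of the binary expansion per pass into 64 buckets (non-negative ints only, nothing production-ready):

// Minimal LSD radix sort: 6 bits per pass, 64 buckets, non-negative ints only.
static int[] radixSort(int[] data)
{
    const int bits = 6, buckets = 1 << bits;
    int[] src = (int[])data.Clone();
    int[] dst = new int[src.Length];
    for (int shift = 0; shift < 32; shift += bits)
    {
        // Counting sort on the current 6-bit digit (stable, so earlier passes are preserved).
        int[] count = new int[buckets + 1];
        foreach (int x in src) count[((x >> shift) & (buckets - 1)) + 1]++;
        for (int i = 0; i < buckets; i++) count[i + 1] += count[i];
        foreach (int x in src) dst[count[(x >> shift) & (buckets - 1)]++] = x;
        var tmp = src; src = dst; dst = tmp;
    }
    return src;
}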


    Using a few statistics such as the mean/average to get an approximate position in the interval (0, 1) has already been done in the algorithm AdaptiveSort described by Daniel M. Chapiro in Sorting by Recursive Partitioning. However, the rest of the algorithm is different: it uses this information to put the numbers into a number of partitions (which depends on the size of the collection to sort). Anyway, I always find the idea kind of fun :)


    A few things to improve your algorithm:

    • You can avoid extraneous comparisons in the foreach loop of minMaxMean by not performing a comparison with max if the value is already less than min:

      foreach (int x in data)
      {
          sum += x;
          if (x < min) min = x;
          else if (x > max) max = x;
      }

      This extra else reduces the maximum number of comparisons from \$2n\$ to at most \$\max(\lfloor \frac{3}{2}(n-1) \rfloor, 0)\$ comparisons, where \$n\$ is the number of elements to sort.

    • You can even improve the minMaxMean to also compute whether the collection is already sorted without extra work. I recently wrote a function to do more or less that. It's C++, but can be easily translated to C#.
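
      A rough sketch of what that combined function could look like in C# (untested, just to illustrate the idea; it is not a translation of the linked C++ code):

      // Variant of minMaxMean that also reports whether the data is already sorted.
      static void minMaxMeanSorted(int[] data, out int min, out int max, out int mean, out bool sorted)
      {
          min = data[0];
          max = data[0];
          sorted = true;
          long sum = data[0];
          for (int i = 1; i < data.Length; i++)
          {
              int x = data[i];
              sum += x;
              if (x < data[i - 1]) sorted = false;
              if (x < min) min = x;
              else if (x > max) max = x;
          }
          mean = (int)Math.Round(sum / (double)data.Length);
      }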

    • Be careful with the summation of many elements: overflows happen faster than you might expect. One way to mitigate the precision vs. overflow trade-off in a mean function would be an implementation along these lines (pseudocode):

      function mean(collection):
          var res = 0.0
          var accumulator = 0
          for value in collection:
              if (value + accumulator) would overflow:
                  res += accumulator / size(collection)
                  accumulator = 0
              accumulator += value
          return res + accumulator / size(collection)

      Note that the algorithm is untested, but you get the idea: accumulate as long as you can, and divide when you have to. Since you end up mostly dividing huge numbers, the resulting precision should be good enough most of the time.
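
      Translated into C# (equally untested; it accumulates in a long and flushes into the double result before an overflow could happen):

      // Overflow-aware mean: accumulate in a long, flush into the double result before overflowing.
      static double safeMean(int[] data)
      {
          double res = 0.0;
          long accumulator = 0;
          foreach (int value in data)
          {
              if ((value > 0 && accumulator > long.MaxValue - value) ||
                  (value < 0 && accumulator < long.MinValue - value))
              {
                  res += accumulator / (double)data.Length;
                  accumulator = 0;
              }
              accumulator += value;
          }
          return res + accumulator / (double)data.Length;
      }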

