Why is processing a sorted array slower than an unsorted array?

Question

Reader's note: this is NOT a duplicate of the similar question "Why is processing a sorted array faster than processing an unsorted array?"; the two questions start from different premises and thus have different explanations.

I have a list of 500000 randomly generated Tuple<long,long,string> objects on which I am performing a simple "between" search:

    var data = new List<Tuple<long,long,string>>(500000);
    ...
    var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x);

When I generate my random array and run my search for 100 randomly generated values of x, the searches complete in about four seconds. Knowing of the great wonders that sorting does to searching, however, I decided to sort my data (first by Item1, then by Item2, and finally by Item3) before running my 100 searches. I expected the sorted version to perform a little faster because of branch prediction: my thinking was that once we get to the point where Item1 == x, all further checks of t.Item1 <= x would be predicted correctly as "not taken", speeding up the tail portion of the search. Much to my surprise, the searches took twice as long on a sorted array!
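The branch-prediction reasoning above can be illustrated with a small sketch. It assumes that the number of times the outcome of t.Item1 <= x changes from one element to the next is a rough proxy for how hard that branch is to predict; the helper name CountFlips is hypothetical and not part of the original test. On data sorted by Item1 the outcome is true for a prefix and false afterwards, so it flips at most once; on unsorted data it flips many times.

```csharp
using System;
using System.Collections.Generic;

static class BranchFlips
{
    // Hypothetical helper: counts how often the outcome of (item1 <= x)
    // changes between consecutive elements when the list is scanned in order.
    // Fewer flips suggests a branch that is easier to predict.
    public static int CountFlips(IList<long> item1Values, long x)
    {
        int flips = 0;
        for (int i = 1; i < item1Values.Count; i++)
        {
            bool previous = item1Values[i - 1] <= x;
            bool current = item1Values[i] <= x;
            if (previous != current) flips++;
        }
        return flips;
    }
}
```

For example, with x = 3 the sorted sequence -7, -3, 1, 2, 4, 5, 8, 9 flips once, while the same values in the order 5, -3, 8, 1, 9, -7, 4, 2 flip at every step. This models only the branch; it says nothing about the cost of the memory accesses.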

I tried switching around the order in which I ran my experiments, and used a different seed for the random number generator, but the effect was the same: searches in an unsorted array ran nearly twice as fast as the searches in the same array, but sorted!

Does anyone have a good explanation of this strange effect? The source code of my tests follows; I am using .NET 4.0.

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    class Program
    {
        private const int TotalCount = 500000;
        private const int TotalQueries = 100;

        // Builds a random 64-bit value from 8 random bytes.
        private static long NextLong(Random r)
        {
            var data = new byte[8];
            r.NextBytes(data);
            return BitConverter.ToInt64(data, 0);
        }

        // Orders by Item1, then Item2, then Item3 (ordinal string comparison).
        private class TupleComparer : IComparer<Tuple<long,long,string>>
        {
            public int Compare(Tuple<long,long,string> x, Tuple<long,long,string> y)
            {
                var res = x.Item1.CompareTo(y.Item1);
                if (res != 0) return res;
                res = x.Item2.CompareTo(y.Item2);
                return (res != 0) ? res : String.CompareOrdinal(x.Item3, y.Item3);
            }
        }

        static void Test(bool doSort)
        {
            var data = new List<Tuple<long,long,string>>(TotalCount);
            var random = new Random(1000000007);
            var sw = new Stopwatch();
            sw.Start();
            for (var i = 0 ; i != TotalCount ; i++)
            {
                var a = NextLong(random);
                var b = NextLong(random);
                if (a > b) { var tmp = a; a = b; b = tmp; } // ensure a <= b
                var s = string.Format("{0}-{1}", a, b);
                data.Add(Tuple.Create(a, b, s));
            }
            sw.Stop();
            // Sorting happens after sw.Stop(), so "Populated in" excludes the sort.
            if (doSort)
            {
                data.Sort(new TupleComparer());
            }
            Console.WriteLine("Populated in {0}", sw.Elapsed);
            sw.Reset();
            var total = 0L;
            sw.Start();
            for (var i = 0 ; i != TotalQueries ; i++)
            {
                var x = NextLong(random);
                var cnt = data.Count(t => t.Item1 <= x && t.Item2 >= x); // "between" search
                total += cnt;
            }
            sw.Stop();
            Console.WriteLine("Found {0} matches in {1} ({2})", total, sw.Elapsed, doSort ? "Sorted" : "Unsorted");
        }

        static void Main()
        {
            Test(false);
            Test(true);
            Test(false);
            Test(true);
        }
    }

    Populated in 00:00:01.3176257
    Found 15614281 matches in 00:00:04.2463478 (Unsorted)
    Populated in 00:00:01.3345087
    Found 15614281 matches in 00:00:08.5393730 (Sorted)
    Populated in 00:00:01.3665681
    Found 15614281 matches in 00:00:04.1796578 (Unsorted)
    Populated in 00:00:01.3326378
    Found 15614281 matches in 00:00:08.6027886 (Sorted)

@jalf I expected the sorted version to perform a little faster because of branch prediction. My thinking was that once we get to the point where Item1 == x, all further checks of t.Item1 <= x would predict the branch correctly as "no take", speeding up the tail portion of the search. Obviously, that line of thinking has been proven wrong by the harsh reality :) — Sergey Kalinichenko, Commented Dec 24, 2012 at 17:20
Interestingly, for TotalCount around 10,000 or less, the sorted version does perform faster (of course, trivially faster at those small numbers) (FYI, your code might want to have the initial size of var data List = new List<Tuple<long, long, string>>(500000) bound against TotalCount instead of hard-coding the capacity) — Chris Sinclair, Commented Dec 24, 2012 at 17:37
I'd like to add that the slowdown is specifically connected to filtering the list. Performing data.Where() shows the same slowdown, as does anything else that iterates over the sorted list. Operating on the sorted and unsorted lists without any filter takes the same time. — Bobson, Commented Dec 24, 2012 at 17:43
Not related to your question, but you create a class TupleComparer but that is entirely unnecessary since Comparer<Tuple<long, long, string>>.Default already has this behavior (from the IComparable implementation of Tuple<,,>). So you can just use data.Sort() with no arguments. — Jeppe Stig Nielsen, Commented Aug 9, 2013 at 20:49

usr · Accepted Answer · 2012-12-24 17:58:47Z

When you use the unsorted list, all tuples are accessed in memory order. They were allocated consecutively in RAM. CPUs love accessing memory sequentially because they can speculatively request the next cache line, so it is already present when needed.

When you sort the list, the order in which the tuples are visited becomes random with respect to where they sit in memory, because your sort keys are randomly generated. This means that the memory accesses to tuple members are unpredictable. The CPU cannot prefetch memory, and almost every access to a tuple is a cache miss.
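To make the "random order" point concrete, here is a minimal sketch that tags each object with its allocation sequence number. It assumes, as the answer does but the runtime does not guarantee, that consecutively allocated objects end up next to each other on the heap, so the sequence number stands in for the memory address. The names Entry, Seq, Build and CountSequentialNeighbours are hypothetical. The sketch counts how many neighbouring elements in the list were also allocated one after another.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical object: Seq is the allocation sequence number, used here as a
// stand-in for the object's position on the heap.
sealed class Entry
{
    public readonly long Key;
    public readonly int Seq;
    public Entry(long key, int seq) { Key = key; Seq = seq; }
}

static class LocalityDemo
{
    // Allocates n objects with random keys, in list order (like the question's loop).
    public static List<Entry> Build(int n, int seed)
    {
        var random = new Random(seed);
        var list = new List<Entry>(n);
        for (int i = 0; i < n; i++)
            list.Add(new Entry(random.Next(), i));
        return list;
    }

    // Counts adjacent list entries whose allocation sequence numbers are consecutive,
    // i.e. list neighbours that were (probably) also memory neighbours.
    public static int CountSequentialNeighbours(List<Entry> list)
    {
        int count = 0;
        for (int i = 1; i < list.Count; i++)
            if (list[i].Seq == list[i - 1].Seq + 1) count++;
        return count;
    }
}
```

Before sorting, every one of the n - 1 neighbour pairs is sequential. After sorting by the random keys, a random permutation has on average only about one such pair, so a scan over the sorted list jumps around the heap in a pattern the prefetcher cannot follow.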

This is a nice example of a specific advantage of GC memory management: data structures that are allocated together and used together perform very well, because they have great locality of reference.

The penalty from cache misses outweighs the saved branch prediction penalty in this case.

Try switching to a struct tuple. This will restore performance because no pointer dereference needs to occur at runtime to access tuple members.
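A minimal sketch of the struct-tuple suggestion, assuming a hand-written struct (LongRange and StructSearch are hypothetical names; .NET 4.0 has no built-in value tuple). The two long keys are stored inline in the List's backing array, so the "between" search reads them sequentially without following a pointer per element. The string is still a reference, but the search never reads it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical value-type replacement for Tuple<long,long,string>.
// Item1 and Item2 live inline in the List's backing array.
struct LongRange
{
    public readonly long Item1;
    public readonly long Item2;
    public readonly string Item3; // still a reference, but not touched by the search

    public LongRange(long item1, long item2, string item3)
    {
        Item1 = item1;
        Item2 = item2;
        Item3 = item3;
    }
}

static class StructSearch
{
    // Same "between" predicate as in the question, over the struct list.
    public static int CountBetween(List<LongRange> data, long x)
    {
        return data.Count(t => t.Item1 <= x && t.Item2 >= x);
    }
}
```

The only change to the test is the element type; the query itself stays a plain LINQ Count.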

Chris Sinclair notes in the comments that "for TotalCount around 10,000 or less, the sorted version does perform faster". This is because a small list fits entirely into the CPU cache: the memory accesses might be unpredictable, but the target is always in cache. I believe there is still a small penalty, because even a load from cache takes some cycles. But that seems not to be a problem, because the CPU can juggle multiple outstanding loads, thereby increasing throughput. Whenever the CPU has to wait for memory, it will still speed ahead in the instruction stream to queue as many memory operations as it can. This technique is used to hide latency.

This kind of behavior shows how hard it is to predict performance on modern CPUs. The fact that we are only 2x slower when going from sequential to random memory access tells me how much is going on under the covers to hide memory latency. A memory access can stall the CPU for 50-200 cycles. Given that number, one could expect the program to become >10x slower when introducing random memory accesses.

Good reason why everything you learn in C/C++ doesn't apply verbatim to a language like C#! — user541686, Commented Dec 24, 2012 at 17:48
You can confirm this behavior by manually copying the sorted data into a new List<Tuple<long,long,string>>(500000) one-by-one before testing that new list. In this scenario, the sorted test is just as fast as the unsorted one, which matches with the reasoning on this answer. — Bobson, Commented Dec 24, 2012 at 17:52
Excellent, thank you very much! I made an equivalent Tuple struct, and the program started behaving the way I predicted: the sorted version was a little faster. Moreover, the unsorted version became twice as fast! So the numbers with struct are 2s unsorted vs. 1.9s sorted. — Sergey Kalinichenko, Commented Dec 24, 2012 at 21:31
So can we deduce from this that cache misses hurt more than branch mispredictions? I think so, and always thought so. In C++, std::vector almost always performs better than std::list. — Sarfaraz Nawaz, Commented Dec 25, 2012 at 5:57
@Mehrdad: No. This is true for C++ also. Even in C++, compact data structures are fast. Avoiding cache misses is as important in C++ as in any other language. std::vector vs std::list is a good example. — Sarfaraz Nawaz, Commented Dec 25, 2012 at 6:00
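Bobson's experiment can be sketched as a hypothetical helper, CopyWithFreshTuples. It assumes the copy allocates new Tuple objects: copying only the references into a new List would leave the tuples where they already are on the heap, since the List's own backing array was sequential all along. Allocating fresh tuples in sorted order makes list neighbours allocation neighbours again, which (under the usual, but not guaranteed, behaviour of the allocator) restores the sequential access pattern.

```csharp
using System;
using System.Collections.Generic;

static class Relocate
{
    // Hypothetical helper: rebuilds the list with freshly allocated tuples,
    // in the list's current (sorted) order, so that objects adjacent in the
    // list were allocated one after another.
    public static List<Tuple<long, long, string>> CopyWithFreshTuples(
        List<Tuple<long, long, string>> sorted)
    {
        var copy = new List<Tuple<long, long, string>>(sorted.Count);
        foreach (var t in sorted)
            copy.Add(Tuple.Create(t.Item1, t.Item2, t.Item3)); // new object, not a shared reference
        return copy;
    }
}
```

The copy compares equal element by element (Tuple equality is structural), but none of its elements are the same objects as in the original list.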

Emperor Orionii · Answer · 2012-12-25 15:43:35Z

LINQ doesn't know whether your list is sorted or not.

Since Count with a predicate parameter is an extension method for all IEnumerable<T>s, I think it doesn't even know whether it's running over a collection with efficient random access. So it simply checks every element, and usr explained why performance got worse.

To exploit the performance benefits of a sorted array (such as binary search), you'll have to do a little more coding.
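One way to do that for this particular query, sketched under the assumption that the list is sorted by Item1 (as the question's comparer guarantees): binary-search for the first element whose Item1 is greater than x, then test Item2 only on the prefix before it. The names SortedSearch, UpperBoundByItem1 and CountBetweenSorted are hypothetical; this is a hand-written search, not something LINQ does for you.

```csharp
using System;
using System.Collections.Generic;

static class SortedSearch
{
    // Returns the index of the first element with Item1 > x
    // (data.Count if there is none). Requires data sorted by Item1.
    static int UpperBoundByItem1(List<Tuple<long, long, string>> data, long x)
    {
        int lo = 0, hi = data.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (data[mid].Item1 <= x) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Same result as data.Count(t => t.Item1 <= x && t.Item2 >= x),
    // but only the elements in the Item1 <= x prefix are examined.
    public static int CountBetweenSorted(List<Tuple<long, long, string>> data, long x)
    {
        int end = UpperBoundByItem1(data, x);
        int count = 0;
        for (int i = 0; i < end; i++)
            if (data[i].Item2 >= x) count++;
        return count;
    }
}
```

The Item1 test drops to O(log n) comparisons, but the Item2 checks over the prefix are still linear, so this is not a fully logarithmic query. A hand-written upper bound is used instead of List<T>.BinarySearch because the latter returns an arbitrary index among equal keys.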

I think you misunderstood the question: of course I wasn't hoping that Count or Where would "somehow" pick up on the idea that my data is sorted, and run a binary search instead of a plain "check everything" search. All I was hoping for was some improvement due to the better branch prediction (see the link inside my question), but as it turns out, locality of reference trumps branch prediction big time. — Sergey Kalinichenko, Commented Dec 25, 2012 at 16:12
