Tuesday, January 3, 2012

.Net's Sort Is Not Secure. Don't Use It. Here's a Better One.

.Net's Array.Sort (up to at least version 4.0) has serious weaknesses:

1. It is insecure, and using it makes you vulnerable to a malicious attacker. .Net uses an ordinary quicksort with the pivot selected by the median-of-three method. It is easy to provoke quicksort's worst-case (quadratic) behavior and increase running times by multiple orders of magnitude. An attacker will be happy to exploit this as an effective denial-of-service attack.

2. It is inflexible. It does not allow you to provide a delegate for the swap function, so it cannot sort data structures where data synchronization is required to maintain consistency as items are moved. Also, it can only sort items on the .Net heap, so sorting unmanaged memory is impossible.

3. It is slower than it should be even in the absence of an attacker.

Zimbry.Introsort addresses each of these problems.

1. It is secure. It is based on David Musser's Introsort algorithm. Introsort is essentially a quicksort that monitors its own recursion depth and, should quicksort start to degrade, falls back to a secure heapsort.

2. It is flexible. Both the compare and swap operations are provided by the user. You can use it to sort anything.

3. It is faster. This wasn't an explicit objective but it's nice that we don't have to trade away performance to get a secure and flexible sort.
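To make the flexibility point concrete, here is a minimal sketch of what an index-based compare/swap design enables: sorting a key array while a parallel payload array stays in sync. The `Compare` and `Swap` delegate shapes (taking element indexes as `long`) are inferred from the wrapper code quoted in the comments; the `Sort` below is a stand-in insertion sort for illustration, not the actual Zimbry.Introsort entry point.

```csharp
using System;

class ParallelSortDemo
{
    // Delegate shapes inferred from the wrapper code in the comment
    // thread: both operations receive element indexes, never elements.
    public delegate int Compare(long i, long j);
    public delegate void Swap(long i, long j);

    // Stand-in for the library's index-based entry point: a plain
    // insertion sort driven entirely by the two delegates.
    public static void Sort(Compare cmp, Swap swap, long start, long count)
    {
        for (long i = start + 1; i < start + count; i++)
            for (long j = i; j > start && cmp(j - 1, j) > 0; j--)
                swap(j - 1, j);
    }

    static void Main()
    {
        int[] keys = { 5, 1, 4, 2, 3 };
        string[] payload = { "e", "a", "d", "b", "c" };

        // The swap delegate keeps both arrays consistent as items move,
        // something a comparison delegate alone cannot express.
        Sort(
            (i, j) => keys[i].CompareTo(keys[j]),
            (i, j) =>
            {
                (keys[i], keys[j]) = (keys[j], keys[i]);
                (payload[i], payload[j]) = (payload[j], payload[i]);
            },
            0, keys.Length);

        Console.WriteLine(string.Join(",", keys));     // 1,2,3,4,5
        Console.WriteLine(string.Join(",", payload));  // a,b,c,d,e
    }
}
```

(Array.Sort does offer a two-array keys/items overload, but the delegate approach generalizes to any number of parallel structures, and to data that doesn't live on the .Net heap at all.)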

Let's look at the worst-case of dealing with an adversary.

It takes .Net over 26 minutes to sort one million integers when they are provided by an adversary. Zimbry.Introsort does it in half a second.

Those are the worst-case results. We can disable the adversary and benchmark it again:


Zimbry.Introsort is twice as fast in the average case and rarely less than 13% faster in any case.

(Each test was run only once so the timings for small arrays contain noticeable sampling noise. A more robust benchmark would filter multiple samples.)

I am releasing the source under the MIT license.

Some notes on the source:

You'll find many alternative sort algorithms in the Zimbry.Sort.OtherSorts project. I experimented with these along the way. You can enable them in the benchmark if you have a great deal of patience.

The class in QuicksortAdversary.cs was derived from Doug McIlroy's paper, A Killer Adversary for Quicksort. Be careful. It will beat up quicksort and steal its lunch money.
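For the curious, the heart of McIlroy's construction fits in a dozen lines. The sketch below is an illustration of the technique, not the actual contents of QuicksortAdversary.cs: every value starts as indistinguishable "gas" and only solidifies when a comparison forces a commitment, and it is pitted against a deliberately naive first-element-pivot quicksort so the damage is easy to count.

```csharp
using System;

class KillerAdversary
{
    // Values start as indistinguishable "gas" and are frozen to concrete
    // ranks only when a comparison forces a commitment. Quicksort keeps
    // picking pivots that turn out to be the smallest remaining value.
    static int[] val;
    static int gas, nsolid, candidate, ncmp;

    static void Freeze(int x) { val[x] = nsolid++; }

    static int Cmp(int x, int y)
    {
        ncmp++;
        if (val[x] == gas && val[y] == gas)
            Freeze(x == candidate ? x : y);
        if (val[x] == gas) candidate = x;
        else if (val[y] == gas) candidate = y;
        return val[x].CompareTo(val[y]);
    }

    // A deliberately naive first-element-pivot quicksort over an index array.
    static void Quicksort(int[] a, int lo, int hi)
    {
        if (lo >= hi) return;
        int p = lo;
        for (int i = lo + 1; i <= hi; i++)
            if (Cmp(a[i], a[lo]) < 0) { p++; int t = a[p]; a[p] = a[i]; a[i] = t; }
        int s = a[lo]; a[lo] = a[p]; a[p] = s;
        Quicksort(a, lo, p - 1);
        Quicksort(a, p + 1, hi);
    }

    // Returns the number of comparisons the adversary extracts for n items.
    public static int Attack(int n)
    {
        val = new int[n];
        gas = n; nsolid = 0; candidate = 0; ncmp = 0;
        for (int i = 0; i < n; i++) val[i] = gas;
        var a = new int[n];
        for (int i = 0; i < n; i++) a[i] = i;
        Quicksort(a, 0, n - 1);
        return ncmp;
    }

    static void Main()
    {
        Console.WriteLine(Attack(256));  // 32640 = 256*255/2, fully quadratic
    }
}
```

McIlroy's full construction defeats median-of-three and median-of-nine pivot selection the same way; the adversary never needs to know which quicksort variant it is attacking, only to answer its comparisons.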

Zimbry.Introsort contains four sort algorithms layered together:

1. Quicksort with pivot selected by median-of-nine: For large partitions.

2. Quicksort with pivot selected by median-of-five: For small partitions.

3. Heapsort as a fall-back when quicksort recurses too deeply: Heapsort is slower than quicksort in the best case, but it has no quadratic behavior to exploit, so it provides effective protection against an adversary.

4. Insertion sort: For tiny partitions where quicksort is inefficient.

Using these four algorithms lets us enjoy the performance advantage of quicksort for the typical case with protection against a malicious attacker in the worst case.
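The way the four layers hand off to one another follows the standard introsort pattern. The skeleton below uses an illustrative cutoff and a plain single-pivot partition rather than the tuned choices in Zimbry.Introsort, but it shows the key move: a depth budget of roughly 2*log2(n) that converts quicksort's quadratic worst case into heapsort's O(n log n).

```csharp
using System;

class IntrosortSkeleton
{
    const int InsertionCutoff = 16;  // illustrative threshold, not the library's

    static void Swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    public static void Sort(int[] a)
    {
        if (a.Length < 2) return;
        // Depth budget of ~2*log2(n): exhausting it means quicksort is being
        // driven toward its quadratic case, so that range falls back to heapsort.
        int depthLimit = 2 * (int)Math.Log(a.Length, 2);
        SortRange(a, 0, a.Length - 1, depthLimit);
        InsertionSort(a, 0, a.Length - 1);  // final pass cleans up the tiny partitions
    }

    static void SortRange(int[] a, int lo, int hi, int depth)
    {
        while (hi - lo >= InsertionCutoff)
        {
            if (depth-- == 0) { Heapsort(a, lo, hi); return; }
            int p = Partition(a, lo, hi);
            SortRange(a, lo, p - 1, depth);
            lo = p + 1;  // recurse on one side, loop on the other
        }
        // Partitions under the cutoff are left for the final insertion-sort pass.
    }

    static int Partition(int[] a, int lo, int hi)
    {
        Swap(a, lo + (hi - lo) / 2, hi);  // middle-element pivot (the real code uses median-of-five/nine)
        int pivot = a[hi], p = lo;
        for (int i = lo; i < hi; i++)
            if (a[i] < pivot) Swap(a, i, p++);
        Swap(a, p, hi);
        return p;
    }

    static void Heapsort(int[] a, int lo, int hi)
    {
        int n = hi - lo + 1;
        for (int i = n / 2 - 1; i >= 0; i--) SiftDown(a, lo, i, n);
        for (int end = n - 1; end > 0; end--)
        {
            Swap(a, lo, lo + end);
            SiftDown(a, lo, 0, end);
        }
    }

    static void SiftDown(int[] a, int lo, int root, int n)
    {
        while (2 * root + 1 < n)
        {
            int child = 2 * root + 1;
            if (child + 1 < n && a[lo + child + 1] > a[lo + child]) child++;
            if (a[lo + root] >= a[lo + child]) return;
            Swap(a, lo + root, lo + child);
            root = child;
        }
    }

    static void InsertionSort(int[] a, int lo, int hi)
    {
        for (int i = lo + 1; i <= hi; i++)
            for (int j = i; j > lo && a[j - 1] > a[j]; j--)
                Swap(a, j - 1, j);
    }

    static void Main()
    {
        var rnd = new Random(42);
        var a = new int[1000];
        for (int i = 0; i < a.Length; i++) a[i] = rnd.Next(100);
        Sort(a);
        Console.WriteLine(a[0] <= a[499] && a[499] <= a[999]);  // True
    }
}
```

Leaving sub-cutoff partitions for one final insertion-sort pass is the classic introsort trick: after the quicksort phase every element is within a cutoff-sized window of its final position, so the pass runs in O(n * cutoff).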

Both quicksorts use Bentley & McIlroy's "fat-pivot" partitioning method from their paper, Engineering a Sort Function, for better performance. This is a big part of why it performs better than .Net's quicksort in many tests.
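The "fat pivot" effect is easiest to see in the Dutch-national-flag formulation sketched below; Bentley & McIlroy's actual scheme reaches the same layout differently, parking equal keys at the array's ends during the scan and swapping them into the middle afterwards. Either way, keys equal to the pivot end up in a band that is already in its final position, so inputs heavy with duplicates get dramatically cheaper.

```csharp
using System;

class FatPivotDemo
{
    static void Swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    // Three-way partition: on return, a[lo..lt-1] < pivot,
    // a[lt..gt] == pivot, and a[gt+1..hi] > pivot.
    public static (int lt, int gt) Partition3(int[] a, int lo, int hi)
    {
        int pivot = a[lo + (hi - lo) / 2];
        int lt = lo, i = lo, gt = hi;
        while (i <= gt)
        {
            if (a[i] < pivot) Swap(a, lt++, i++);
            else if (a[i] > pivot) Swap(a, i, gt--);
            else i++;
        }
        return (lt, gt);
    }

    public static void Sort(int[] a, int lo, int hi)
    {
        if (lo >= hi) return;
        var (lt, gt) = Partition3(a, lo, hi);
        Sort(a, lo, lt - 1);
        // The equal band a[lt..gt] is already in its final position.
        Sort(a, gt + 1, hi);
    }

    static void Main()
    {
        int[] a = { 4, 1, 4, 2, 4, 3, 4, 0, 4 };
        Sort(a, 0, a.Length - 1);
        Console.WriteLine(string.Join(",", a));  // 0,1,2,3,4,4,4,4,4
    }
}
```

On an all-equal input this quicksort finishes after a single partitioning pass, where a naive two-way partition can degrade badly.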

While this is an improvement, it is far from the last word in sorting. Some ideas to consider:

Better performance may be found with Vladimir Yaroslavskiy's dual-pivot quicksort.

It really needs specialized versions for known data types (avoiding the compare and swap delegates where they aren't required). This would give a significant speed improvement.

There's more room for performance tuning. I tried to leave the code in a fairly readable state and some sacrifices could be made to buy a little more performance.

It would be nice to add support for stable sorting.

30 comments:

  1. Array.Sort has a weakness; what about the LINQ sort? Is that one safe?

    It should be reported to Microsoft using their public ticket system, where everyone can vote it up. If the released code is under an acceptable license for Microsoft, it could probably even be updated in a future version of the .NET Framework.

  2. "The class in QuicksortAdversary.cs was derived from David Musser's paper, A Killer Adversary for Quicksort"

    Sorry no. That paper is by Doug McIlroy

  3. Another way to harden quicksort is to select the elements used in the pivot calculation randomly. This makes it nearly impossible for an attacker to cause worst-case performance, and it doesn't require you to also implement multiple sort algorithms.

  4. Well, if it's that good you should definitely report it to Microsoft; maybe they will include it in a future version.

  5. Jc, Good question. I don't know if LINQ uses the same underlying sort. It will be easy enough to find out, though. I'll run some tests and post what I learn.

  6. Stephen, Thanks for the catch! I have made the correction.

  7. Eric, That's worthy of a post in itself. If you look in Zimbry.Sort.OtherSorts you'll find two variations of quicksort that choose a pivot at random: QuicksortClassicR1 and QuicksortFatPivotR1. I'll include these in the benchmark, run it, and post what I find.

  8. Array.Sort might be insecure, might be inflexible, but I don't agree that it's slower.

    I created a C# Windows application and built an array of randomly generated integers (from 1 to 1 million). The results I got:
    1. For 1 million records it takes 124 milliseconds
    2. For 10 million records it takes 1310 milliseconds
    3. For 100 million records it takes 13979 milliseconds

    So I 100% disagree that "It takes .Net over 80 minutes to sort one million integers".

    My computer configuration is:
    Windows 7 64-bit Enterprise, i7, 8GB RAM


    Here is sample code:
    private void StartProcess_Click(object sender, EventArgs e)
    {
        long len = long.Parse(txtLength.Text);

        long[] arrForArraySort = new long[len];

        Random r = new Random();

        DateTime dt1 = DateTime.Now;
        for (long i = 0; i < len; i++)
        {
            arrForArraySort[i] = r.Next(1, 1000000);
        }

        // Array sort
        DateTime dt2 = DateTime.Now;
        Array.Sort(arrForArraySort);
        DateTime dt3 = DateTime.Now;

        lblFillArray.Text = dt2.Subtract(dt1).TotalMilliseconds.ToString();
        lblArraySort.Text = dt3.Subtract(dt2).TotalMilliseconds.ToString();
    }

  9. Satya, please read it again. It says:

    > It takes .Net over 80 minutes to sort
    > one million integers when they are
    > provided by an adversary.

    The last part is very important. It takes a long time when the integers are supplied by an adversary, not when they are chosen at random.

    .Net's sort is fast when sorting random integers.

    Read this paper by Doug McIlroy:

    A Killer Adversary for Quicksort
    http://www.cs.dartmouth.edu/~doug/mdmspe.pdf

  10. Strange usage of the term "insecure": because some user can invoke a quadratic-time response? That's not what the term "insecure" means.
    Can't supply a delegate? You supply an object that implements IComparer or Comparison to do specific sorting.

  11. > Strange usage of the term "insecure":
    > because some user can invoke a quadratic-time
    > response? That's not what the term "insecure" means.

    Yes, it is insecure. It is a resource starvation attack.

    > Can't supply a delegate? You supply an
    > object that implements IComparer or
    > Comparison to do specific sorting.

    Please read it again. You can supply the comparison delegate. It does not provide for supplying the swap delegate.

  12. "It is a resource starvation attack."
    It's always possible to define some data that invokes the worst case. That's not insecure, just potentially inefficient; that's why sorting algorithms declare a worst case. It's even stated in the MSDN documentation that it's O(n^2), which I think for quicksort happens when the data starts in reverse order.

    Insecure, to me, means you can access sensitive information, like passwords stored as plain text.

    If we use your definition of insecure, your bank's website is insecure; it just means it's slow.

    "... sorting data structures where data synchronization is required to maintain consistency as items are moved is impossible. ..."

    Then why are you using Array.Sort() to do the sort? The data structure (class) itself should provide the sort (if it is required). An array is just a simple collection of "simple" types.

    What would have been clever is if the sort detected the type of the array being passed in and, where possible, did a radix/counting sort on it.

  13. > Insecure to me, means you can access
    > sensitive information like storing
    > password as plain text.

    Insecure means it's vulnerable to an attack. You don't have to take my word for it. Here's Wikipedia:

    http://en.wikipedia.org/wiki/Denial-of-service_attack#Asymmetry_of_resource_utilization_in_starvation_attacks

    > Then why are you using Array.Sort() to do
    > the sort? The data structure (class) itself
    > should provide the sort (if it is required).
    > An array is just a simple collection of
    > "simple" types.

    Yes, you can create a unique sort function for each of your data structures.

    Or, you can use one common sort function on many different kinds of data.

    One sort or many. Which is better?

  14. I wonder if the reason the .Net framework uses quicksort rather than introsort is that framework development had progressed too far for it to be modified. If they had waited a little longer, it could have had generics from the start.

  15. Wow, blast from the past. Where to begin...

    As a general rule, arrays run like crap compared to the generic collections in .NET. The generic, LINQ-enabled collection classes only begin performing less than optimally when you start using the sorting and query methods in particular. Even so, the interim collection performance makes up for it.

    Why do I bring this up? Because javanubs usually come in, bitter that they're having to switch, and begin immediately espousing the glory of simple arrays when they see generics, assuming they're the same as Java generics. The rule being: "if you don't like the platform, the easiest comparisons to assault are its performance or its security." This smells like a permutation of that attack vector, and it's an argument I haven't heard brought up since 2007, when it became largely irrelevant.

    So, the simple answer to your original problem: don't use arrays, save when it's absolutely mandated by a core class.

  16. David, it's a nice article, keep it up

  17. Mallah Diction,

    > So, the simple answer to your original
    > problem: don't use arrays, save when
    > it's absolutely mandated by a core class.

    Have you tested this?

    If the Sort method on the generic List<> collection can also be provoked into quadratic behavior then it is also insecure and vulnerable to an attack.

    I'll run benchmarks overnight and report what I find tomorrow.

  18. Thanks, it's nice that you've provided this. I felt it would be handy if there were also some "standard" Sort overloads, so that this implementation can be easily swapped in for the existing .NET Sort methods.

    I've added a couple of them; feel free to use them or modify them, no license attached:

    public static void Sort<T>
        (IList<T> list, int start, int count, Comparison<T> comparison)
    {
        Compare c = (i, j) => comparison(list[(int)i], list[(int)j]);

        Swap s = (i, j) =>
        {
            T tmp = list[(int)i];
            list[(int)i] = list[(int)j];
            list[(int)j] = tmp;
        };

        Sort(c, s, start, count);
    }

    public static void Sort<T>
        (IList<T> list, int start, int count, IComparer<T> comparer)
    {
        Sort(list, start, count, comparer.Compare);
    }

    public static void Sort<T>(IList<T> list, IComparer<T> comparer)
    {
        Sort<T>(list, 0, list.Count, comparer);
    }

    public static void Sort<T>(IList<T> list)
    {
        Sort(list, Comparer<T>.Default);
    }

    public static void Sort<T>(IList<T> list, Comparison<T> comparison)
    {
        Sort(list, 0, list.Count, comparison);
    }

    ------

    With these in place, you can simply write:

    IList<T> list = SomeList();

    // using the default comparer for 'T'
    Introsort.Sort(list);

    // using a delegate
    Introsort.Sort(list, (x, y) => (x - y));

    // using a custom implementation of IComparer<T>
    Introsort.Sort(list, new CustomComparer());

  19. @Mallah Diction: I don't see your point. Array.Sort<T> and List<T>.Sort both use the same QuickSort algorithm. And generic collections all use plain arrays under the hood, which are incredibly fast to allocate and manipulate.

    The only issue with arrays is that their design is poor and feels like ANSI C legacy compared to interface-based collections and syntactic sugar brought by LINQ extension methods. But in sorting context, there is nothing "closer to the metal" than arrays.

  20. Thanks, Groo. What I really need to do is build type-aware versions of the sort. Forcing it to always use delegates is expensive. It's not just due to the delegate call overhead but also because some algorithmic optimizations are not available to a pure comparison sort.

    1. I am not sure what you meant by "to always use delegates is expensive". Does anyone frequently need to sort a plain array of integers? Or did I misunderstand you? Most use cases for sorting revolve around comparison sorts using delegates (if you wanted to implement a radix sort, fine, but you would still use a delegate to project a certain property to an integer).

    2. Groo, your question deserves a thoughtful answer. It's going to be longer than I would prefer so please bear with me. There are three parts to the answer:

      Part 1: The way I use delegates isn't the same as the way .Net is using them. .Net lets you provide an IComparer for an object while I simply call a delegate with array indexes. There's a big difference between the two approaches. .Net can store items outside of the collection being sorted, leaving it briefly in an inconsistent state, while I must always work with items within the array itself.

      This has two consequences:

      1. For quicksort: .Net's quicksort can take the pivot item, store it in a local variable (it's just a pointer to an object), and compare it against items in the array. My sort has to first swap it to the beginning of the array, compare against that with an index and then swap it back when the partitioning is complete. Quicksort has to do this often and it's expensive. For heapsort: Floyd's optimization is not available to me which is about a 25% penalty.

      2. In my code every comparison and swap incurs two array reference penalties.

      Part 2: .Net does not route swaps through a delegate, which improves its performance.

      Part 3: .Net provides special versions of its sort to handle native data types. You wondered how often this would be useful. I don't know but it's probably often enough to be worth handling. Worse, if you don't do it, anyone with a simple benchmark is likely to draw the wrong conclusions.

      Going with the pure delegate approach offers maximum flexibility but it comes at both a mechanical cost (the overhead of indirecting through an array index) and an algorithmic cost (choices that are made unavailable.)

      I can't say for certain what the final performance cost is without building the alternative, but I suspect it's significant. There's a lot of performance left lying on the table with the code in its current state.

    3. Right, now I get what you wanted to say. I completely forgot that abstracting the swap operation prevents all access to the actual elements. When you mentioned delegates, it confused me, because calling IComparer&lt;T&gt;.Compare(A,B) or simply the delegate Comparison&lt;T&gt;(A,B) is more or less the same, performance-wise, but you were referring to a completely different issue.

      That makes perfect sense, as I was also initially confused by the approach that you decided to take, given that the point of the article seemed to be to provide a replacement for QuickSort (with performance benchmarks).

      Not to mention that the "non-delegate" version would additionally improve benchmark results in your favor.

  21. Your test jig has an error in the ForceScrambledOrder method:
    array[i] = swap;
    should be:
    array[i] = array[swap];

  22. I compared this to a .Net implementation of TimSort from http://timsort4net.codeplex.com/releases/view/73969 and, sadly, TimSort kicks this guy's butt (though there is an outlier for the "all unique, partial ascending order" test case, which is how I discovered the error in the partial shuffling code). If you want the test jig changes, I'll zip them up and you can delve further. Admittedly, we're not playing fair because of your extra swapper features, but it IS a big difference: TimSort takes 54% of Introsort's time on 10,000 entries.

  23. Marc,

    Thank you very much for the bug report! I'll make the correction and rerun the benchmarks.

    Please send me the TimSort code. I would like to give it a try.

  24. @Marc: It is worth noting that TimSort has O(n) space complexity, compared to O(log n) for quicksort and O(1) for heapsort.

  25. I think Mallah's mistake was to assume that C# arrays were non-generic and required boxing/unboxing, which would make them very inefficient. But .NET arrays are generic.
