Though big enough to be worthwhile at scale, it's notable how relatively small the difference is between the out-of-the-box Rust unstable sort and the tuned radix sort.
Lukas Bergdoll and Orson Peters did some really important heavy lifting to get this stuff from "If you're an expert you could probably tune a general purpose sort to have these nice properties" to "Rust's general purpose sort has these properties out of the box" and that's going to make a real difference to a lot of software that would never get hand tuned.
I (Orson Peters) am also here if anyone has any questions :)
Orson, what motivated you to tune Rust's out-of-the-box sort in this way?
Well, years back I released an unstable sort called pdqsort in C++. Then stjepang ported it to the Rust standard library. So at first... nothing. Someone else did it.
A couple of years later I was doing my PhD and I spent a lot of time optimizing a stable sort called glidesort. Around the same time Lukas Bergdoll started work of his own and began providing candidate PRs to improve the standard library sort. I reached out to him and we agreed to collaborate instead of compete, and it ended up working out nicely I'd say.
Ultimately I like tinkering with things and making them fast. I actually really like reinventing the wheel, finding out why it has the shape that it does, and seeing if there's anything left to improve.
But it feels a bit sad to do all that work only for it to disappear into the void. It makes me the happiest if people actually use the things I build, and there's no broader path to getting things in people's hands than if it powers the standard library.
For the sparse-matrix-multiplication example given, wouldn't it be better to pre-sort each vector and do n^2 walks on pairs of presorted vectors, rather than (as I think was implied) doing n^2 hashed sorting operations? (For that matter, I wonder why sparse matrices wouldn't already be represented in sorted-adjacency-list form in the first place.)
EDIT: ah no, I'm being dense. You'd aggregate the union of all the set-column indices across rows and the union of the set-row indices across the columns, keeping track of the source locations, and do the hashed sorting on those union vectors to find all the collision points. You could still get a small win, I think, by sorting the row-aggregation and column-aggregation separately though?
Interesting article. It’s actually very strange that the dataset needs to be “big” for the O(n log n) algorithm to beat the O(n). Usually you’d expect the big O analysis to be “wrong” for small datasets.
I expect that in this case, like in all cases, as the datasets become galactically large, the O(n) algorithm will start winning again.
The hash-based algorithm is only O(n) because the entry size has a limit. In a more general case, it would be something more like O(m(n * e)). Here n is the number of entries, e is the maximum entry size and m is a function describing how caching and other details affect the computation. With small enough data, the hash is very fast due to CPU caches, even if it takes more steps, as the time taken by a step is smaller. The article explains this topic in a less handwavey manner.
Also, memory access is constant time only to some upper limit allowed by the hardware, which requires significant changes to the implementation when the data does not fit the system memory. So, the hash algorithm will not stay O(n) once you go past the available memory.
The sorting algorithms do not suffer from these complexities quite as much, and similar approaches can be used with data sets that do not fit a single system's memory. The sorting-based algorithms will likely win in the galactically large cases.
Edit: Also, once the hash table would need to grow beyond what the hash function can describe (e.g. beyond 64-bit integers), you need to grow the function's data type. This is essentially a hidden log(n) factor, as the required width of the data type grows with the log of the maximum data size.
Interestingly, you need a hash big enough to be unique for all data points with high probability; it doesn't take much to point out that this is at least O(log(n)) bits if all items are unique.
Also, if items take up k bytes, then computing the hash must typically be O(k), and both the hashing and radix sort approaches are O(n k).
Really radix sort should be considered O(N) where N is the total amount of data in bytes. It can beat the theoretical limit because it sorts lexicographically, which is not always an option.
Radix sort isn't a comparison-based sort, so it isn't beholden to the O(n log n) speed limit in the same way. It's basically O(n log k), where k is the number of possible different values in the dataset. If we're using machine data types (TFA is discussing 64-bit integers) then k is a constant factor and drops out of the analysis. Comparison-based sorts assume, basically, that every element in the input could in principle be distinct.
Basically, the hashed sorting approach is effectively O(n), and is so for the same reason that the "length of hash table" approach is. The degenerate case is counting sort, which is a single pass with a radix base of k. (Which is analogous to pointing out that you don't get hash collisions when the hash table is as big as the hashed key space.)
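For concreteness, here is a minimal sketch of that degenerate counting case (my illustration, not code from the article): when the key space is small enough, one pass over a direct-indexed table finds every distinct value with no comparisons and no possibility of hash collisions.

```rust
// Counting-style dedup for a tiny key space (u16): the "radix base == key
// space" degenerate case. One pass, O(n + k), no comparisons, no collisions.
fn count_distinct_u16(data: &[u16]) -> usize {
    let mut seen = vec![false; 1 << 16]; // one slot per possible key
    let mut distinct = 0;
    for &x in data {
        if !seen[x as usize] {
            seen[x as usize] = true;
            distinct += 1;
        }
    }
    distinct
}

fn main() {
    assert_eq!(count_distinct_u16(&[3, 7, 3, 42, 7, 7]), 3);
}
```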
One detail most comments seem to be missing is that the O(1) complexity of get/set in hash tables depends on memory access being O(1). However, if you have a memory system operating in physical space, that's just not possible (you'd have to break the speed of light). Ultimately, the larger your dataset, the more time it is going to take (on average) to perform random access on it. The only reason we haven't really "noticed" this much in practice is that we mostly grow memory capacity by making it more compact (the same as CPU logic), not by adding more physical chips/RAM slots/etc. Still, memory latency has been slowly rising since the 2000s, so even shrinking can't save us indefinitely.
One more fun fact: this is also the reason why Turing machines are a popular complexity model. The tape on a Turing machine does not allow random access, so it simulates the act of "going somewhere to get your data". And as you might expect, hash table operations are not O(1) on a Turing machine.
The key here is in the cache lines.
Sorting (really scanning through an array) is very efficient in cache lines, while hash tables are very inefficient.
In the example, every memory access in hashing gets 8 byte of useful data, while every memory access in sorting gets 64 bytes of useful data.
So you have to be 8x more efficient to win out.
For this problem, the radix sort is log_1024(n) passes, which for a key size of 64 bits can’t ever get above 7 passes.
If you increase the key size then the efficiency win of sorting drops (128 bit keys means you have to run less than 4 passes to win).
Essentially the size of the key compared to the cache line size is a (hidden) part of the big O.
No, because radix sort is not a comparison-based sort and is not O(n log n).
That is what baffles me. The difference in big O complexity should be more visible with size, but that's where it loses to the "worse" algorithm.
I could imagine the hash table wins again beyond an even greater threshold. Like what about 120GB and beyond?
As others have pointed out, radix sort is O(64N) = O(N) for a fixed key length of uint64s as in this article
So it comes down to the cost of hashing, hash misses, and other factors as discussed in the article.
I remember learning that radix sort is (almost?) always fastest if you're sorting integers.
A related fun question is to analyze hash table performance for strings where you have no bound on string length.
Sorting wins for unbounded strings because it only needs a prefix of each string.
The analysis in the "Why does sorting win?" section of this article gave us a way to estimate that threshold.
Here's my attempt at it, based on that analysis:
Suppose each item key requires s bytes
For the hash table, assuming s <= 64 (our cache line size), we need to read one cache line and write one cache line per item.
The bandwidth to sort one key is p(N) * 2 * s, where p(N) is the number of passes of 1024-bucket radix sort required to sort N elements, and the factor of 2 comes from 1 read + 1 write per radix sort pass.
p(N) = ceil(log2(N) / log2(buckets))
Suppose N is the max number of items we can distinguish with an s=8 byte item key, so N = 2^64
then p(N) = ceil(64 / 10) = 7 passes
7 radix passes * (1 read + 1 write) * 8 bytes per item key = 112 bytes of bandwidth consumed, which is still less than the hash table's 2 * 64 = 128 bytes.
We haven't found the threshold yet.
We need to either increase the item count N or increase the item key size s beyond 8 bytes to find a threshold where this analysis estimates that the hash map uses less bandwidth. But we can't increase the item count N without first increasing the key size s. Suppose we just increase the key size and leave N as-is.
Assuming N=2^64 items and a key size of s=10 bytes gives an estimate of 140 bytes of bandwidth for sorting vs 128 bytes for the hash table. We expect sorting to be slower for these values, and increasing s or N further should make it even worse.
(The bandwidth required for the hash map hasn't increased, as our 10-byte key is still less than the size of a cache line.)
Edit: the above is wrong, as I'm getting mixed up with elements vs items. Elements are not required to be unique items; if every element were a unique item, the problem would be trivial. So N is not bounded by the key size, and we can increase N beyond 2^64 without increasing the key size s beyond 8 bytes.
Keeping the key size s fixed at 8 bytes per item suggests the threshold is at N=2^80 items,
given by solving for N at the point where the bandwidth estimate for sorting equals the bandwidth estimate for the hash table:
2 * 8 * (log2(N) / log2(1024)) = 2 * 64
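The same back-of-envelope model, as a sketch (the 64-byte cache line, the read+write factor of 2, and the 10 bits per radix pass are the assumptions from the comment above, not measurements):

```rust
// Bytes of memory traffic per element under the model above: a 1024-bucket
// radix sort moving the full key on every pass vs. a hash table touching one
// cache line for the read and one for the write. All constants are assumed.
fn sort_bytes_per_elem(log2_n: f64, key_bytes: f64) -> f64 {
    let passes = (log2_n / 10.0).ceil(); // 1024 buckets => 10 bits per pass
    passes * 2.0 * key_bytes // 1 read + 1 write per pass
}

fn hash_bytes_per_elem(cache_line_bytes: f64) -> f64 {
    2.0 * cache_line_bytes // 1 cache-line read + 1 cache-line write
}

fn main() {
    for log2_n in [32.0, 64.0, 80.0, 96.0] {
        println!(
            "log2(N) = {}: sort ~ {} B/elem, hash ~ {} B/elem",
            log2_n,
            sort_bytes_per_elem(log2_n, 8.0),
            hash_bytes_per_elem(64.0)
        );
    }
}
```

Under this model the two lines cross at log2(N) = 80, i.e. the N=2^80 threshold above.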
The big O complexity makes assumptions that break down in this case. E.g. it "ignores" memory access cost, which seems to be a key factor here.
[edit] I should have said "basic big O complexity" makes assumptions that break down. You can of course decide to model memory access as part of "big O", which is a mathematical model.
Radix sort is not a comparison-based sort and is not O(n log n).
Why is it strange? The case where asymptotically better = concretely worse is usually the exception, not the norm.
We expect asymptotically better algorithms to be, well, asymptotically better. Even if they lose out for small N, they should win for larger N - and we typically expect this "larger N" to not be that large - just big enough so that data doesn't fit in any cache or something like that.
However, in this particular case, the higher-complexity sorting algorithm gets better than the hash algorithm as N gets larger, even up to pretty large values of N. This is the counter-intuitive part. No one is surprised if an O(n log n) algorithm beats an O(n) algorithm for n = 10. But if the O(n log n) algorithm wins for n = 10 000 000, that's quite surprising.
As others have likely pointed out by now, radix sort is not O(n log n).
It's O(k * n), where k is a constant determined by the key size.
For 64-bit keys with 1024 buckets, k == 7 passes, so it's O(7 * n) = O(n)
> Hash tables win the interview O(n) vs O(n log n),
If the original table has n entries, the hash must have at least log(n) bits so that entries usually get different hashes. I don't know the current state of the art, but I think the recommendation was to have at most a 95% filled array, so there are not too many collisions. So both methods are O(n log n), one under the hood and the other explicitly.
I guess you have to distinguish between the # of elements vs the max bit-width of an element. But if you're leaving the constant-memory-access model (where you can assume each element fits in memory and all arithmetic on them is constant time), then doesn't it take O(n log k) (k is the numeric value of the maximum element, n is the total number of elements) just to write out the array? You physically can't do O(n) [ignoring the k].
Maybe relevant comments [1, 2, 3]
[1] https://news.ycombinator.com/item?id=44854520 [2] https://news.ycombinator.com/item?id=43005468 [3] https://news.ycombinator.com/item?id=45214106
N bits doesn't mean N operations.
Probably N/64 operations, depending; perhaps fewer if you can use SIMD, but I'm not sure that's possible.
A way to give radix sort a performance boost is to allocate uninitialized blocks of memory for each bucket that are of size n. It exploits the MMU, and you don't have to worry about managing anything related to buckets potentially overflowing into each other. The MMU is doing all the bookkeeping for you, and the OS actually allocates memory only on page access.
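A minimal sketch of what that could look like on Linux/macOS, using the libc crate (my illustration, not code from the article): each bucket reserves a worst-case-sized anonymous mapping, and physical pages are only committed for the slots a bucket actually touches.

```rust
// Reserve a worst-case-sized, lazily-backed region per radix bucket.
// Anonymous mmap hands out address space immediately; the OS only attaches
// physical pages when a bucket first writes to them.
use std::{mem, ptr, slice};

unsafe fn reserve_bucket(capacity: usize) -> &'static mut [u64] {
    let bytes = capacity * mem::size_of::<u64>();
    let p = libc::mmap(
        ptr::null_mut(),
        bytes,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
        -1,
        0,
    );
    assert_ne!(p, libc::MAP_FAILED, "mmap failed");
    slice::from_raw_parts_mut(p as *mut u64, capacity)
}

fn main() {
    let n = 1 << 20; // in the worst case, every key could land in one bucket
    let mut buckets: Vec<&mut [u64]> =
        (0..1024).map(|_| unsafe { reserve_bucket(n) }).collect();
    // Writing into a bucket is what actually faults its pages in.
    buckets[0][0] = 42;
}
```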
Very interesting and cool article, if you love low-level optimisation (like myself!)
Interestingly, I've recently been thinking that Big-O notation is essentially a scam, in particular the log(N) part.
For small values of N, log(N) is essentially a constant, <= 32, so we can just disregard it, making sorting simply O(N).
For large values, even so-called linear algorithms (e.g. linear search) are actually O(N log(N)), as the storage requirements for a single element grow with log(N) (i.e. to store distinct N=2^32 elements, you need N log(N) = 2^32 * 32 bits, but to store N=2^64 elements, you need 2^64 * 64 bits).
Cache locality considerations make this effect even more pronounced.
> For small values of N, log(N) is essentially a constant, <= 32, so we can just disregard it, making sorting simply O(N). For large values, even so-called linear algorithms (e.g. linear search) are actually O(N log(N)), as the storage requirements for a single element grow with log(N) (i.e. to store distinct N=2^32 elements, you need N log(N) = 2^32 * 32 bits, but to store N=2^64 elements, you need 2^64 * 64 bits).
How can I unread this?
If the person who taught you big-O notation didn't point out that for all practical values of N, log(N) is tiny, then they did you a disservice. It is indeed "practically constant" in that sense.
However, when someone says an operation is O(1) vs O(log N), it still tells you something important. Very broadly speaking (tons of caveats depending on problem domain, of course) O(log N) usually implies some kind of tree traversal, while O(1) implies a very simple operation or lookup. And with tree traversal, you're chasing pointers all over memory, making your cache hate you.
So, like, if you have a binary tree with 65000 elements in it, we're talking a height of 15 or 16, something like that. That's not that much, but it is 15 or 16 pointers you're chasing, possibly cache-missing on a significant amount of them. Versus a hash-table lookup, where you do a single hash + one or two pointer dereferences. If this is in a hot path, you're going to notice a difference.
Again, lots of caveats, this article provides a good exception. In this case, the sorting has much more beneficial cache behavior than the hash table, which makes sense. But in general, log(N) hints at some kind of tree, and that's not always what you want.
But yes, don't be afraid of log(N). log(N) is tiny, and log(N) operations are very fast. log(N) is your friend.
I've given the following exercise to developers in a few workplaces:
What's the complexity of computing the nth fibonacci number? Make a graph of computation time with n=1..300 that visualizes your answer.
There are those that very quickly reply linear but admit they can't get a graph to corroborate it, and there are those that very quickly say linear and even produce the graph! (Though not with correct Fibonacci numbers...)
Binet's (or Euler's or de Moivre's) formula for the nth Fibonacci number is (((sqrt(5)+1)/2)^n-(-(sqrt(5)+1)/2)^-n)/sqrt(5). Additions are O(n). Multiplications vary depending on the algorithm; the best known (Harvey–van der Hoeven) is O(n log(n)) but comes with some nasty constants that probably mean classic Karatsuba or Toom-Cook multiplication (O(n^1.X) for some X) will be faster for the range to be graphed, but I'll stick with O(n log(n)) for this. The exponentials will be O(n log(n)^2) in the best known case IIRC. That factor dominates, so I'd say it's probably O(n log(n)^2), but a graph probably won't start cooperating quickly, since the algorithms picked have factors that are ignored by big-O notation for the small inputs given.
However you do it, it probably can't be linear, since multiplication is probably at best O(n log(n)), though that lower bound hasn't been proven. A naive recursive calculation will be even worse, since that has exponential complexity.
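One quick way to see why "linear" breaks down, sketched with the num-bigint crate (my assumption; the exercise doesn't prescribe a language): the loop runs n times, but the numbers themselves grow by roughly 0.69 bits per step, so each addition gets steadily more expensive once they outgrow a machine word.

```rust
// Iterative Fibonacci on arbitrary-precision integers (num-bigint crate).
// n iterations, but each addition costs O(n) bit operations for large n,
// so the total work grows roughly quadratically, not linearly.
use num_bigint::BigUint;

fn fib(n: u64) -> BigUint {
    let (mut a, mut b) = (BigUint::from(0u32), BigUint::from(1u32));
    for _ in 0..n {
        let next = &a + &b;
        a = b;
        b = next;
    }
    a
}

fn main() {
    for n in [100u64, 200, 300] {
        println!("fib({n}) has {} bits", fib(n).bits());
    }
}
```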
> i.e. to store distinct N=2^32 elements, you need N log(N) = 2^32 * 32 bits, but to store N=2^64 elements, you need 2^64 * 64 bits
N is the number of bits, not the number of elements, so no.
It is helpful to use N=# of elements, since the elements are often fixed/limited size. If elements aren't a fixed size, it's necessary to drop down to # of bits.
Seems like the author is missing a trick?
You can use a simple hash table without probing to classify each number as either (1) first instance of a number in the table, (2) known duplicate of a number in the table, or (3) didn't fit in the table.
If you use a hash table with n buckets and one item per bucket and the input is random, then at least 63% of the distinct numbers in the input will fall into cases 1 or 2 in expectation. Increasing table size or bucket size improves this part of the math.
On really big inputs, it's probably still healthy to do a pass or two of radix sort so that the hash table accesses are cheap.
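A sketch of how I read that (my own illustration; the names and the multiplicative hash are placeholders, not the article's code): one slot per bucket, no probing, and anything that collides with a different value just falls through to a slower fallback pass such as the sort.

```rust
// Single-probe filter: classify each number as first instance, known
// duplicate, or "didn't fit" (slot already taken by a different value).
#[derive(PartialEq)]
enum Class {
    FirstInstance,
    KnownDuplicate,
    DidNotFit,
}

fn classify(x: u64, table: &mut [Option<u64>]) -> Class {
    // Placeholder mixing function; anything that spreads keys evenly works.
    let slot = (x.wrapping_mul(0x9E37_79B9_7F4A_7C15) as usize) % table.len();
    match table[slot] {
        None => {
            table[slot] = Some(x);
            Class::FirstInstance
        }
        Some(y) if y == x => Class::KnownDuplicate,
        Some(_) => Class::DidNotFit,
    }
}

fn main() {
    let mut table = vec![None; 1 << 16];
    let data = [1u64, 2, 1, 3, 2, 99];
    let leftovers: Vec<u64> = data
        .iter()
        .copied()
        .filter(|&x| classify(x, &mut table) == Class::DidNotFit)
        .collect();
    println!("{} values fell through to the fallback pass", leftovers.len());
}
```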
Interesting article, I particularly like the reference to real-world workloads.
Do I understand correctly, that the data being tested is fully random? If so I'd be curious to see how the results change if a Zipfian distribution is used. I don't have hard data for sorting specifically, but I suspect the majority of real-world data being sorted - that isn't low cardinality or already nearly or fully sorted - to follow a Zipfian distribution rather than true randomness. The Rust std lib sort_unstable efficiently filters out common values using the pdqsort repeat pivot flip partition algorithm.
It's nice that we can write relatively performant code with the batteries included in these languages and no hand tuning. I only wish it were as easy as that to write code that is secure by default.
I would be happy with code that does what I intended it to do by default.
Do-What-I-Mean isn't possible. What Rust does give you is Do-What-I-Say which mostly leaves the problem of saying what you mean, a significant task but one that is hopefully what software engineering was training you to be most effective at.
One important trick is ruling out cases where what you said is nonsense. That can't be what you meant so by ruling it out we're helping you fall into the pit of success and Rust does an admirable job of that part.
Unfortunately because of Rice we must also rule out some cases where what you said wasn't nonsense when we do this. Hopefully not too many. It's pretty annoying when this happens, however, it's also logically impossible to always be sure, Rice again - maybe what you wrote was nonsense after all. So I find that acceptable.
All programming languages do what you say, the question is more about how easy it is to say something inappropriate.
For me Rust rarely hits the sweet spot, since there are easier languages for high-level programs (Python, C#, Swift...) and for low-level programs you usually want to do unsafe stuff anyway.
I can see useful applications only where you cannot have a garbage collector AND need a safe high-level language. Even there, Swift's ARC could be used.
What is "Rice" referring to?
Rice's Theorem: https://en.wikipedia.org/wiki/Rice%27s_theorem
Or the "why we can't have nice things" theorem. Colloquially, anything interesting about a Turing-complete program is unprovable. You may have proved specific instance of this if you went through a formal computer science program and reduced problems like "this program never uses more than X cells on the tape" to the halting problem.
Great article. Informative and well written.
> First, extremely sparse unstructured matrix multiplication with e.g. sparsity of one in a billion and one scalar per nonzero.
Why does this have an equivalent inner loop, and why is it an important task?
Could you reduce the bandwidth requirement of the hash table by using an atomic increment instruction instead of a read-modify-write? It still performs a read-modify-write, but further down in the memory hierarchy, where more parallelism is available.
This sounds like a spherical cow problem.
The one time in your career you run into it, the next day your boss will add the requirement that entries are going to be inserted and deleted all the time, and your sorting approach is fucked.
If the entries can change all the time, we can use two hash tables, U and D. U maintains the set of unique items at all times, D maintains duplicates. An item is never in both at the same time. In D, it is associated with a count that is at least 2.
A new item is inserted into U. The first duplicate insertion removes the item from U and adds it into D with a count of 2. Subsequent insertions increment the count for that item in D.
A deletion first tries to decrement a count in D, and when that hits below 2, the item is removed from D. If it's not in D, it is removed from U.
At any time we can walk through U to visit all the unique items, without having to skip over the non-unique ones.
That has implications for complexity. For instance suppose that for whatever reason, the number of unique items is bounded to sqrt(N). Then iterating over them is O(sqrt(N)), whereas if we just had one hash table, it would be O(N) to iterate over all items and skip the non-unique ones.
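A sketch of that U/D scheme (my own reading of the description, using std HashSet/HashMap): the one detail the description leaves implicit is that when a count in D drops back to 1, the surviving copy moves back into U so it counts as unique again.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct UniqueTracker {
    u: HashSet<u64>,      // items currently appearing exactly once
    d: HashMap<u64, u32>, // items appearing 2+ times, with their counts
}

impl UniqueTracker {
    fn insert(&mut self, x: u64) {
        if let Some(c) = self.d.get_mut(&x) {
            *c += 1; // already a known duplicate
        } else if self.u.remove(&x) {
            self.d.insert(x, 2); // first duplicate: move U -> D
        } else {
            self.u.insert(x); // brand new item
        }
    }

    fn delete(&mut self, x: u64) {
        if let Some(c) = self.d.get_mut(&x) {
            *c -= 1;
            if *c < 2 {
                self.d.remove(&x);
                self.u.insert(x); // one copy left: unique again
            }
        } else {
            self.u.remove(&x);
        }
    }

    fn uniques(&self) -> &HashSet<u64> {
        &self.u
    }
}

fn main() {
    let mut t = UniqueTracker::default();
    for x in [1u64, 2, 2, 3] {
        t.insert(x);
    }
    t.delete(2);
    assert_eq!(t.uniques().len(), 3); // 1, 2 and 3 are each unique now
}
```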
I imagine it must make a bigger difference in languages where allocations are more costly too. E.g. in Go sorting to count uniques is most certainly much faster than using a map + it saves memory too
This doesn't concern in-language allocations at all. There is only one large allocation for both approaches, done up-front.
The slowdown has to do with cache misses and CPU cache capacity, i.e. optimisations the CPU does when executing.
Granted, a language like Go may have more of the CPU cache used up by various runtime checks.
Basically, I think this analysis is largely language-agnostic.
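For reference, here's roughly what "one large allocation up front" looks like for both approaches in Rust (an illustration; the article's actual benchmark code may differ in details):

```rust
use std::collections::HashSet;

// Hash approach: reserve all buckets before the hot loop, so no
// reallocation or rehashing happens while inserting.
fn count_unique_hash(data: &[u64]) -> usize {
    let mut set = HashSet::with_capacity(data.len());
    set.extend(data.iter().copied());
    set.len()
}

// Sort approach: one allocation for the scratch copy, then sort + dedup
// run allocation-free.
fn count_unique_sort(data: &[u64]) -> usize {
    let mut v = data.to_vec();
    v.sort_unstable();
    v.dedup();
    v.len()
}

fn main() {
    let data = [5u64, 1, 5, 2, 2, 9];
    assert_eq!(count_unique_hash(&data), count_unique_sort(&data));
}
```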
It's not language agnostic in the sense that some languages don't have the necessary functionality to allocate everything up front and avoid allocations in the hot loop.
For example, if you were to benchmark this in Java, the HashMap<> class allocates (twice!) on every insertion. Allocations are a bit cheaper with the GC than they would be via malloc/friends, but still we'd expect to see significant allocator and hence GC overhead in this benchmark
If we used C++, the standard library's hash set type std::unordered_set doesn't have the all-in-one reserve mechanic. We can call reserve, but that just sizes the bucket array; the linked-list nodes it uses to store values are allocated separately each time one is created.
I mean, that type is also awful, so early in tuning for this application you'd pick a different map type, but it is the standard in their language.
The author discusses finding “unique” values, but I think they meant “distinct”.
Note that "fixes bad distributions" is the same thing as "causes bad distributions" if your input is untrusted (or even if you're just unlucky). Especially when you are deliberately choosing a bijective function and it is trivial for an attacker to generate "unfortunate" hashes.
The usual trie tricks can avoid this problem without letting the worst case happen. But as often, adding the extra logic can mean worse performance for non-worst-case input.
You could also use a random salt (for i64, add it with wrap-around overflow) with minimal overhead.
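A minimal sketch of that idea (bucket_of here is a stand-in for whatever bijective mixing function the table actually uses; the constants are arbitrary):

```rust
// Mix a per-run random salt into each key before bucketing, so an attacker
// can't precompute keys that all land in the same bucket.
fn bucket_of(x: u64, buckets: usize) -> usize {
    ((x.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as usize) % buckets
}

fn salted_bucket(key: u64, salt: u64, buckets: usize) -> usize {
    bucket_of(key.wrapping_add(salt), buckets)
}

fn main() {
    // In real code the salt would come from a proper RNG at process start.
    let salt: u64 = 0x243F_6A88_85A3_08D3;
    println!("{}", salted_bucket(12345, salt, 1024));
}
```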
Isn't radix sorting somewhat like hashing?
Not at all, but it can be used to sort by hashes rather than by "natural" data.
While the main conclusion of the article isn't unexpected to me, I am very surprised a well-optimized quicksort isn't faster than radix sort. My experience in C is that you can significantly speed up quick sort by removing function calls and doing insertion sorts on small arrays at the end. I was never able to get radix sort even close to that, performance-wise. I don't know Rust at all, so it would be pointless for me to try to code an optimized quick sort in it. Hopefully the author knows how to optimize quick sort and is not missing out on huge gains there.
Just in case someone feels like porting it to Rust, here is a link to a good and simple quick sort implementation that significantly outperforms standard library one: https://www.ucw.cz/libucw/doc/ucw/sort.html
> it would be pointless for me to try to code an optimized quick sort in it.
Perhaps more so than you've realised. Hint: Rust's old unstable sort was its take on the pattern-defeating quicksort. The article is talking about the new unstable sort, which has better performance for most inputs.
Rust uses monomorphization and its function types are unique, so the trick you're used to with a macro is just how everything works all the time in Rust.
The trick is not about macros but about removing function calls altogether. Macros are there only to generate functions for specific types. The way the trick works is to implement the stack manually and put indexes there instead of passing them as function arguments. This removes a lot of overhead and makes the implementation I linked to significantly faster than C's and C++'s standard library implementation.
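A sketch of the shape of that trick (not the libucw code, just an illustration): ranges to sort are pushed onto a local stack as index pairs instead of being passed through recursive calls.

```rust
// Quicksort with an explicit stack of [lo, hi) ranges instead of recursion.
// A tuned version would pick better pivots and switch to insertion sort on
// small ranges, as described above; this only shows the explicit stack.
fn quicksort(v: &mut [u64]) {
    let mut stack = vec![(0usize, v.len())];
    while let Some((lo, hi)) = stack.pop() {
        if hi - lo < 2 {
            continue;
        }
        // Simple Lomuto partition around the last element.
        let pivot = v[hi - 1];
        let mut i = lo;
        for j in lo..hi - 1 {
            if v[j] < pivot {
                v.swap(i, j);
                i += 1;
            }
        }
        v.swap(i, hi - 1);
        stack.push((lo, i));     // left part
        stack.push((i + 1, hi)); // right part
    }
}

fn main() {
    let mut data = vec![5u64, 3, 8, 1, 9, 2, 2];
    quicksort(&mut data);
    assert_eq!(data, vec![1, 2, 2, 3, 5, 8, 9]);
}
```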