The author uses writing techniques like those given by Joel Spolsky:
- Rule 1: Be Funny
- Rule 2: Writing a spec is like writing code for a brain to execute
- Rule 3: Write as simply as possible
- Rule 4: Review and reread several times
The author isn't quite as adept at weaving in the humor as seamlessly as Joel does, yet it's interesting to see how effective the style is, even for someone still learning it. I commend them for making the topic more accessible. It was probably fun to write and was definitely fun to read!
https://www.joelonsoftware.com/2000/10/15/painless-functiona...
I too thought of it (in a good way) when "and an iron will" raised a chuckle from me.
I second the recommendation and often nudge colleagues towards that article.
In "Appendix A: Supporting proofs for approximations" the author justifies the approximation of 1-x with e^(-x) by computing lim_{x→0} (1-x)/e^(-x) = 1.
This is wrong. Consider how this criterion is also satisfied by 1-100000x, since lim_{x→0} (1-x)/(1-100000x) = 1. But 1-100000x is clearly not a good first-order approximation of 1-x around 0.
The proper justification for replacing 1-x with e^(-x) around 0 comes from examining the first two terms of their Taylor expansions, in other words, the functions' values at 0 and their first derivatives at 0. Since these match for 1-x and e^(-x), they are good first-order approximations of each other around 0.
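To make that concrete, here is a quick numerical sketch (mine, not from the article): both candidates pass the limit test, but only e^(-x) tracks 1-x once you move away from 0.

    import math

    # Both e^-x and 1-100000x satisfy lim_{x->0} (1-x)/f(x) = 1, but only
    # e^-x matches 1-x to first order (same value and same derivative at 0).
    for x in (1e-1, 1e-3, 1e-6):
        exact = 1 - x
        good = math.exp(-x)        # error ~ x^2/2
        bad = 1 - 100000 * x       # error ~ 99999*x
        print(f"x={x:g}  |err(e^-x)|={abs(exact - good):.3g}  "
              f"|err(1-100000x)|={abs(exact - bad):.3g}")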
The quickest way to work around the numeric overflow issues is to use the Stirling approximation of the logarithm of the factorial,
https://en.wikipedia.org/wiki/Stirling's_approximation#Speed...
You can build on that to get good, composable approximations of any of the standard combinatorial functions in log-space (and recover the approximations you want by simply exponentiating back). For example, if you've implemented an approximate ln-fac(n) ~ ln(fac(n)), you immediately get ln-choose, the logarithm of n!/(k!(n-k)!), as simply ln-fac(n) - ln-fac(k) - ln-fac(n-k). Fully composable: if ln-fac() is a numerically good approximation, then so is any reasonable sum or difference.
Or: the log of the binomial distribution PDF is simply ln-choose(n,k) + k*ln(p) + (n-k)*ln(1-p).
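A minimal sketch of that composability (my code, not the parent's; it uses math.lgamma for ln-fac, which is the library shortcut for the same Stirling-style idea):

    import math

    def ln_fac(n):
        # ln(n!) via the log-gamma function
        return math.lgamma(n + 1)

    def ln_choose(n, k):
        # ln of n!/(k!(n-k)!), composed purely from ln_fac
        return ln_fac(n) - ln_fac(k) - ln_fac(n - k)

    def ln_binom_pmf(n, k, p):
        # log of the binomial PMF, staying in log-space throughout
        return ln_choose(n, k) + k * math.log(p) + (n - k) * math.log1p(-p)

    # e.g. P(exactly 3 successes in 10^6 trials at p = 10^-6) ~ e^-1 / 3! ~ 0.0613
    print(math.exp(ln_binom_pmf(10**6, 3, 1e-6)))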
A more naive but less complex way to avoid overflows would be to use the original factorial expression but work with logs - in a rough experiment with python for k=10^8 and N=10^17 (numbers plucked out of thin air), the calculation takes ~20s on an M1 MacBook Pro.
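For reference, a sketch of that log-sum approach (my reconstruction, not the commenter's script); with k = 10^8 the pure-Python loop is what eats the ~20 seconds:

    import math

    def collision_prob_logsum(k, N):
        # P(no collision) = prod_{i=0}^{k-1} (N - i)/N; summing logs avoids
        # ever forming the astronomically large/small factorials.
        log_p_none = sum(math.log1p(-i / N) for i in range(k))
        return -math.expm1(log_p_none)   # 1 - exp(log_p_none)

    # k = 10^8 draws from N = 10^17 buckets, as in the experiment above
    print(collision_prob_logsum(10**8, 10**17))   # ~0.0488, close to 1 - e^(-k^2/2N)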
This didn't feel too egregious until I looked at the space of inputs and outputs for SHA-256, which is 2^256 possible outputs and an absolutely stonking 2^(2^64) - 1 possible inputs.
It feels obvious but I'm still surprised by the efficiency of approximations.
The fun part is when you take that approximation and apply it to 2^n.
If you have n bits, your collision probability is 0.5 at generating 2^(n/2) values.
Or put differently: a 128bit uuid gives you a "good enough" 64bit distributed autoincrement.
Good rule of thumb there
There are several ways to arrive at the same bound besides the explicit birthday paradox calculation:
* nC2 pairs of objects, each of which has a 1/k chance of matching. Since expectation is linear and it's also known that E[X] = p(X>=1) + p(X>=2) + ..., we have that p(X>=1) = E[X] - epsilon (since we can assume the probability of two or more collisions is small), and so the probability of collision is ~n^2 / 2k.
* Alternatively, you can use the union bound to see that the probability of at least one match is bounded above by the sum of the probabilities of the individual pair events, so it is <= n^2 / 2k, with the goodness of the approximation given by the fact that the probability of multiple events occurring simultaneously is low.
(The two are effectively the same proof, as indicator random variables can be used to show P(union of indicators) <= E[X], which is effectively a special case of Markov's inequality.)
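A toy check of that bound (my sketch, not the parent's): n objects into k buckets, exact product versus the nC2/k estimate. The estimate is tight while collisions are rare and turns into a loose upper bound once they aren't.

    import math

    def exact_collision_prob(n, k):
        # 1 - prod_{i=0}^{n-1} (k - i)/k, computed via logs
        return -math.expm1(sum(math.log1p(-i / k) for i in range(n)))

    def pair_bound(n, k):
        # expected number of colliding pairs = union bound = C(n,2) / k
        return n * (n - 1) / (2 * k)

    k = 10**6
    for n in (100, 1000, 5000):
        print(n, round(exact_collision_prob(n, k), 5), round(pair_bound(n, k), 5))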
I think in practice the simplest approximation is probably always good enough. The reason is that I don't think we care about collision probabilities larger than 50%, and the value of k that gives p=0.5 is sqrt(n) according to the simplest approximation (p = k^2/2n), while p=0.39 for k=sqrt(n) according to the exponential approximation (p = 1 - e^(-k^2/2n)). The difference at the most extreme value we care about (p=0.5) is small enough (~11%) that I think the simplest approximation should always suffice for practical applications.
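A quick check of those numbers (mine), using the same notation, n buckets and k = sqrt(n) items:

    import math

    n = 2**64                    # any large bucket count gives the same picture
    k = math.isqrt(n)            # k = sqrt(n), the p = 0.5 point of the simple formula
    simple = k**2 / (2 * n)                  # -> 0.5
    expo = -math.expm1(-k**2 / (2 * n))      # 1 - e^(-0.5) -> 0.3935
    print(simple, expo, simple - expo)       # gap of ~0.11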
Honest question.
How does one write something like this?
I get the interest, and the review process. What I mean is: is this a hobby where someone is passionate about something, or do some employers allow people to work on side projects like this?
I feel my life is mostly about direct value, and I don't really understand where I went wrong on the path to meaningful knowledge.
Any philosophical help will be very welcome, as you correctly guessed I'm a bit lost.
It sounds like you're too focused on outcome and not enough on experience.
It is a miserable life to treat everything like a chore done to earn some known, expected, concrete reward.
I suspect the author got curious, did some reading, realized they understood something, and thought it would be fun to write up the result. Likely all in their free time.
I remember R. Feynman wrote that at some point in his life he felt he had reached the end of his achievements, and it was a pretty sad time. For many years he couldn't produce anything he considered valuable anymore. So over time he gave up trying and just kept on living, doing things for fun rather than for value. One day he saw someone toss a kitchen plate into the air, spinning and wobbling. He got interested in why the plate wobbles at exactly half the rate it spins. He started computing it, just for fun, because he already felt like a failure, so who cares. Over time those pointless kitchen-plate computations grew into the quantum calculations for which he was much later awarded a Nobel Prize.
Almost certainly a hobby. An employer would probably want something like this on the company blog so that they get the benefit.
Unless you're supremely lucky, this kind of stuff is a hobby. One wishes that weren't the case, but capitalism is what it is...
I'd encourage you to generally ignore whether something has direct value or not because that's not how knowledge works. For example, I once spent well over a month implementing the NthWeekday function using nothing but basic arithmetic operations. This would allow us to calculate all federal holidays for a given year at runtime instead of precalculating the values and storing them in a table (which I hated, because it meant that someone had to maintain that table). This hyper-specific problem has near-zero direct value, but it was THE project that sparked my passion for maths.
I've had the privilege (?) of experiencing one in the wild only once. I was hashing a few columns to create a hash key for a dimensional table in Snowflake, and two wholly different rows clashed.
By default, Snowflake doesn't seem to use any cryptographic hash algorithm, so that's not completely unexpected. Based on this link (1), it uses a proprietary 64-bit hash, and has this tidbit: "Do not use HASH to create unique keys."
(1) https://docs.snowflake.com/en/sql-reference/functions/hash
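For scale, some back-of-the-envelope birthday numbers for a 64-bit hash (my arithmetic, not from the Snowflake docs):

    import math

    BINS = 2.0**64   # 64-bit hash space

    def collision_prob(rows, bins=BINS):
        # standard birthday approximation: 1 - exp(-k^2 / (2N))
        return -math.expm1(-rows * rows / (2 * bins))

    for rows in (10**6, 10**8, 10**9, 5 * 10**9):
        print(f"{rows:>13,} rows -> p(collision) ~ {collision_prob(rows):.4%}")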
How many values can a UUID v4 take?
How many do you have to generate before a collision becomes a remote possibility?
A UUID v4 is a 128-bit number, but 4 bits are reserved to specify the version and 2 more bits are reserved to specify the variant, which leaves 122 bits of randomness. That means it can take on about 5 x 10^36 possible values. Following the birthday math, you'd have to generate about 103 trillion UUIDs to have a one-in-a-billion chance of a collision.
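That figure follows directly from inverting the small-p approximation p ~ k^2/(2N) (my arithmetic):

    import math

    N = 2.0**122                     # distinct v4 UUID values
    p_target = 1e-9                  # one-in-a-billion collision chance
    k = math.sqrt(2 * N * p_target)  # invert p ~ k^2 / (2N)
    print(f"{k:.3g}")                # ~1.03e14, i.e. about 103 trillion UUIDs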
UUIDv4 has 2^122 values.
The heuristic commonly used for things that matter (e.g. high-security/high-assurance systems) is that the probability of collision should be less than 2^-32 assuming uniform distribution[0]. From this you can compute that the largest set of keys that can be used with UUIDv4 that satisfies this constraint is roughly 100 trillion.
This is a pretty high limit that will work for most applications. Some large data models can exceed this number of records, so you can't use probabilistic UUID naively in these cases e.g. one for every unique record. In data models that approach the 100T limit, UUID-like identifiers are typically generated deterministically to avoid this issue entirely.
[0] There have been many cases of UUIDv4 systems breaking in the wild because people naively or accidentally use weak entropy sources. This turns out to be a recurring footgun such that use of UUIDv4 is prohibited in some applications because you can't rely on people to implement it properly.
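Assuming the heuristic means the total collision probability over the whole key set stays below 2^-32, the same inversion gives the ceiling (a rough check, my numbers, which land in the same ballpark as the figure above):

    import math

    N = 2.0**122                             # distinct v4 UUID values
    p_budget = 2.0**-32                      # collision-probability budget
    max_keys = math.sqrt(2 * N * p_budget)   # k such that k^2/(2N) = 2^-32
    print(f"{max_keys:.3g}")                 # ~5.0e13, i.e. tens of trillions of keys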
The question becomes, how bad is your random.
If your random is not uniformly distributed, you might get duplication from bias.
If your random is set up wrong and you get the same seed multiple times, you might get duplication from that.
If your random is good, the birthday math should hold.
UUID v4 has 122 random bits giving 2^122 possible values (~5.3×10^36). Using the birthday paradox, you'd need to generate about 2^61 UUIDs (~2.3×10^18) for a 50% collision probability, which is well beyond any practical system's capacity.
> ...which is well beyond any practical system's capacity.
Well beyond a single server but not single systems. Large analytical data models run into the tens of exabytes in practical systems already. It isn't hypothetical and probabilistic identifiers become inadvisable[0] in those systems.
Not everything is a web app and UUIDv4 does not scale infinitely.
[0] Technically you can use probabilistic identifiers if you widen them beyond 128-bits, but at that scale the compactness of unique identifiers has a large impact on performance and cost, so 128-bit deterministic identifiers are preferable.
A uuid has 122 bits of payload.
Depends what you consider “a remote possibility” to be (the birthday attack wiki page has a table for various p and powers of 2)
You can do bigints or you can do approximations. I've tried about 8 different implementations in smhasher (for 32-bit to 256-bit). See this massacre:
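    // Naive multiplication, no accuracy at all
    static double ExpectedNBCollisions_Slow ( const double nbH, const double nbBits )
    {
      long balls = nbH;
      long double bins = nbBits;
      long double result = 1.0;
      for (long i = 1; i < balls / 2; i++) {
        // take a pair from the front and the end to minimize errors
        result *= ((bins - i) / bins) * ((bins - (nbH - i)) / bins);
      }
      return (double)(nbH * result);
    }

    // TODO This only works for a low number of collisions
    static inline double ExpectedCollisions ( const double balls, const double bins )
    {
      return balls - (bins * (1 - pow((bins - 1)/bins, balls)));
    }

    // Still too inaccurate: https://preshing.com/20110504/hash-collision-probabilities/
    static double EstimateNbCollisions_Taylor(const double nbH, const double nbBits)
    {
      const long double k = nbH;
      const long double b = nbBits;
      return (double)(k * (1.0 - expl(-0.5 * k * (k - 1.0) / b)));
    }

    // demerphq: (double(count) * double(count-1)) / pow(2.0,double(sizeof(hashtype) * 8 + 1));
    // the very same as our calc. pow 2 vs exp2. Just the high cutoff is missing here.
    static double EstimateNbCollisions_Demerphq(const double nbH, const double nbBits)
    {
      return (nbH * (nbH - 1)) / pow(2.0, nbBits + 1);
    }

    // GNU R: qbirthday. rough estimate. FIXME
    static double EstimateNbCollisions_R(const double nbH, const double nbBits)
    {
      return ceil(exp(((log(nbH) + lgamma(3) + log(-log1p(-0.5)))) / 2));
    }

    // GNU R: pbirthday. FIXME
    /*
    static double EstimateNbCollisions_Rp(const double c)
    {
      return (1 - prod((c:(c-0.5+1))/rep(2, 0.5)));
    }
    */

    // The previous best calculation, highly prone to inaccuracies with low results (1.0 - 10.0)
    // TODO: return also the error.
    static double EstimateNbCollisions_previmpl(const double nbH, const double nbBits)
    {
      double exp = exp2(nbBits); // 2 ^ bits
      double result = (nbH * (nbH-1)) / (2.0 * exp);
      if (result > nbH)
        result = nbH;
      // improved floating point accuracy
      if (result <= exp || nbBits > 32)
        return result;
      return result - exp;
    }

    static double EstimateNbCollisions_fwojcik(const double nbH, const int nbBits)
    {
      // If the probability that there are 1 or more collisions (p(C >=
      // 1)) is not much higher than the probability of exactly 1
      // collision (p(C == 1)), then the classically-good approximation
      // of the probability of any collisions is also a good estimate
      // for the expected number of collisions.
      //
      // If there are 2**n buckets and 2**(n-r) hashes, then the ratio
      // of p(C >= 1)/p(C == 1) is about 1/(1-2**(n-2r-1)). This uses
      // the new estimator if that ratio is > 1 + 2**-8. That cutoff
      // minimizes the error around the values we care about.
      if (nbBits - 2.0*log2(nbH) >= 8 - 1) {
        return nbH * (nbH - 1) * exp2(-nbBits-1);
      }
      // The probability that any given hash bucket is empty after nbH
      // insertions is:
      //    pE = ((2**nbBits - 1)/(2**nbBits))**nbH
      // so we compute:
      //    ln(pE) = nbH * ln((2**nbBits - 1)/(2**nbBits))
      //           = nbH * ln(1 - 1/2**nbBits)
      //           = nbH * ln(1 - 2**(-nbBits))
      //           = nbH * ln(1 + -(2**(-nbBits)))
      // This means the probability that any given hash bucket is
      // occupied after nbH insertions is:
      //    pF = 1 - pE
      //       = 1 - exp(ln(pE))
      //       = -(exp(ln(pE)) - 1)
      //       = -expm1(ln(pE))
      // And the expected number of collisions is:
      //    C = m - n + n * pE
      //      = m - n * (1 - pE)
      //      = n * (m/n - 1 + pE)
      //      = n * (m/n - (1 - pE))
      //      = n * (m/n - pF)
      //      = n * (m/n - (-expm1(ln(pE))))
      //      = n * (m/n + expm1(ln(pE)))
      // Since the format of floats/doubles is k*2**n, multiplying by
      // exp2(x) doesn't lose any precision, and this formulation keeps
      // m/n and pF at the same general orders of magnitude, so it tends
      // to have very good precision. At low hash occupancy, pF is too
      // close to m/n for this formula to work well.
      double logpE = (double)nbH * log1p(-exp2(-nbBits));
      double result = exp2(nbBits) * (exp2(-nbBits) * (double)nbH + expm1(logpE));
      return result;
    }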
So which is the best and why?
The last by fwojcik, because it's the most accurate for our sizes, and pretty fast also.
Thanks.
By the way, this is extremely off-topic, but would you mind correcting me were I wrong anywhere here https://news.ycombinator.com/item?id=44359539?
I probably should have sent an e-mail.
Not to appear overly Bayesian here, but shouldn't the prior probabilities of the objects themselves be factored in?
I mean, in most real-life situations you probably won't know ahead of time which objects you're going to have to hash (or else you could hash them perfectly anyway).
In theory, you can treat the objects as i.i.d. giving them a uniform prior.
In practice, of course, only the actual inputs matter, and if you were to take that into account then you could avoid collisions, but as you say, you can't really know that ahead of time.
Unless you can, such as when there is structure to the inputs or some other source of non-uniformity that you can account for. The topic of learned indexes is worth investigating if you're interested:
https://arxiv.org/abs/1712.01208
https://arxiv.org/abs/2012.12501
Good to know, thanks