Austin Appleby's superb MurmurHash3 is one of the best general purpose hash functions available today. Still, as good as it is, there are a couple of minor things about it that make me uneasy. I want to talk about one of those things today.

Here is the 64-bit finalizer from MurmurHash3:

UInt64 MurmurHash3Mixer( UInt64 key ) { key ^= (key >> 33); key *= 0xff51afd7ed558ccd; key ^= (key >> 33); key *= 0xc4ceb9fe1a85ec53; key ^= (key >> 33); return key; }

The goal of a finalizer (sometimes called a "bit mixer") is to take an intermediate hash value that may not be thoroughly mixed and increase its entropy to obtain both better distribution and fewer collisions among hashes. Small differences in input values, as you might get when hashing nearly identical data, should result in large differences in output values after mixing.

Ideally, flipping a single bit in the input key results in all output bits changing with a probability of 0.5. This is called the Avalanche effect. Cryptographically-secure hashes are willing to spend an impressive number of computer cycles getting this probability close to 0.5. Hash functions that don't have to be cryptographically-secure trade away some avalanche accuracy in favor of performance. MurmurHash3 gets close to 0.5 with a remarkable economy of effort: Just two multiplies, three shifts and three exclusive-ors. Austin reports MurmurHash3 avalanches all bits to within 0.25% bias.

The overall structure of the finalizer (shifts and multiplies) was designed by Austin. The constants were chosen by "a simple simulated-annealing algorithm" (his description.) The part that troubles me is the simulation was driven with random numbers. That doesn't seem right. After all, if the inputs were truly random what would be the point of mixing? It seems to me that a good mixer should be able to take a series of, say, counting numbers and produce output that is virtually indistinguishable from random.

First question: Could I build a better mixer by training the simulation on low-entropy inputs such as counting numbers and high-likelihood bit patterns? Here are the results for MurmurHash3 and the top 14 mixers my simulation found:

Mixer | Maximum error | Mean error |
---|---|---|

MurmurHash3 mixer | 0.005974634384355 | 0.000265202074121 |

Mix01 | 0.000791851287732 | 0.000190873655461 |

Mix02 | 0.000885192071844 | 0.000197317606821 |

Mix03 | 0.000828116139837 | 0.000191444267889 |

Mix04 | 0.000991926125199 | 0.000199204707592 |

Mix05 | 0.000932171539344 | 0.000195036565653 |

Mix06 | 0.000819874127995 | 0.000199337714668 |

Mix07 | 0.000805656657567 | 0.000194049828213 |

Mix08 | 0.000906415252337 | 0.000191828700595 |

Mix09 | 0.000927020281943 | 0.000194149835046 |

Mix10 | 0.000877774261186 | 0.000194585176663 |

Mix11 | 0.000955867323390 | 0.000194238221367 |

Mix12 | 0.000932377589640 | 0.000193450189656 |

Mix13 | 0.000789996835068 | 0.000190778327016 |

Mix14 | 0.000800917500758 | 0.000191264175101 |

The maximum error tends to be about seven times lower. Mean error is lower too but only slightly. The answer, then, is yes, it does get better results with low-entropy keys. But that leads to the second question: Does training on low-entropy keys result in better or worse performance with high-entropy keys? To find out, I tested it with 100 million cryptographic-quality random numbers:

Mixer | Maximum error | Mean error |
---|---|---|

MurmurHash3 mixer | 0.000212380000000 | 0.000040445117187 |

Mix01 | 0.000177410000000 | 0.000040054211426 |

Mix02 | 0.000179150000000 | 0.000039797316895 |

Mix03 | 0.000170070000000 | 0.000040068117676 |

Mix04 | 0.000185470000000 | 0.000039775007324 |

Mix05 | 0.000192510000000 | 0.000039626535645 |

Mix06 | 0.000195660000000 | 0.000040216433105 |

Mix07 | 0.000193810000000 | 0.000039834248047 |

Mix08 | 0.000196590000000 | 0.000039063793945 |

Mix09 | 0.000174280000000 | 0.000039541943359 |

Mix10 | 0.000181790000000 | 0.000039569926758 |

Mix11 | 0.000181140000000 | 0.000039501779785 |

Mix12 | 0.000183920000000 | 0.000039622690430 |

Mix13 | 0.000175610000000 | 0.000039437180176 |

Mix14 | 0.000182000000000 | 0.000040092158203 |

Yes, we can do slightly better than MurmurHash3 even on random test keys.

On average this is a small but measurable improvement that comes at no performance cost. For the worst-case, low-entropy keys that concern us the most, it provides a significant improvement.

Here's a handy tip: A great use for a mixer like this is to use it to "purify" untrustworthy input hash keys. If a calling function provides poor-quality hashes (even counting numbers!) as input to your code then purifying it with one of these mixers will insure it does no harm.

Here are the parameters for the mixers I tested:

Mixer | Mixer operations | ||||
---|---|---|---|---|---|

MurmurHash3 mixer | 33 | 0xff51afd7ed558ccd | 33 | 0xc4ceb9fe1a85ec53 | 33 |

Mix01 | 31 | 0x7fb5d329728ea185 | 27 | 0x81dadef4bc2dd44d | 33 |

Mix02 | 33 | 0x64dd81482cbd31d7 | 31 | 0xe36aa5c613612997 | 31 |

Mix03 | 31 | 0x99bcf6822b23ca35 | 30 | 0x14020a57acced8b7 | 33 |

Mix04 | 33 | 0x62a9d9ed799705f5 | 28 | 0xcb24d0a5c88c35b3 | 32 |

Mix05 | 31 | 0x79c135c1674b9add | 29 | 0x54c77c86f6913e45 | 30 |

Mix06 | 31 | 0x69b0bc90bd9a8c49 | 27 | 0x3d5e661a2a77868d | 30 |

Mix07 | 30 | 0x16a6ac37883af045 | 26 | 0xcc9c31a4274686a5 | 32 |

Mix08 | 30 | 0x294aa62849912f0b | 28 | 0x0a9ba9c8a5b15117 | 31 |

Mix09 | 32 | 0x4cd6944c5cc20b6d | 29 | 0xfc12c5b19d3259e9 | 32 |

Mix10 | 30 | 0xe4c7e495f4c683f5 | 32 | 0xfda871baea35a293 | 33 |

Mix11 | 27 | 0x97d461a8b11570d9 | 28 | 0x02271eb7c6c4cd6b | 32 |

Mix12 | 29 | 0x3cd0eb9d47532dfb | 26 | 0x63660277528772bb | 33 |

Mix13 | 30 | 0xbf58476d1ce4e5b9 | 27 | 0x94d049bb133111eb | 31 |

Mix14 | 30 | 0x4be98134a5976fd3 | 29 | 0x3bc0993a5ad19a13 | 31 |

Thanks for sharing this.

ReplyDeleteI was wondering : Could the finalization mix in 32 bit murmurhash3 be improved in a similar fashion?

Cheers!

The 32-bit mixer has the same structure as the 64-bit mixer and both were generated by simulated annealing so, while I can't be certain without trying it, I strongly suspect the answer is yes.

ReplyDeletenice information :)

ReplyDeletebest regards

thomas from kuechengeraete-22.blogspot.com

versicherung-wv.blogspot.com | versicherung-wv.de | applicate-soft.de

That's very interesting, thanks for sharing it! Can you provide more details how you found those new mixer?

ReplyDeleteNice work!

ReplyDeleteCoincidentally, I have been using a quite similar construction that seems to behave more like a pseudo-random permutation. At least I can so far not find any obvious ways to do measurements that clearly will distinguish it from a PRP:

http://mostlymangling.blogspot.com/2018/07/mathjax.html

This seems to have become a "go to" page referenced even in Java 8 source code. I am interested to see more in regards to mix function properties. Specifically I'd like to know if it is a 1-to-1 function and whether the function can produce collisions e.g. assuming y=mix(x) can it produce the same "y" for two different "x"?

ReplyDeleteSince the individual steps are bijective, the whole function is bijective or 1-to-1.

DeleteSee http://mostlymangling.blogspot.com/2018/07/on-mixing-functions-in-fast-splittable.html for a longer analysis of MurmurHash3, Variant 13 (above) and an attempt to remedy some of the obvious (statistical) shortcomings of MurmurHash3 and friends. Included is the inverse for a similar function.

Stupid me. Indeed. This would strike me as obvious some 20 years ago but one thing about math is if you don't use it - you lose it :)

DeleteLooking at your table Mix13 seem to perform the best in both scenarios. Then I recognized that I had seen those constants before, in implementations of SplitMix64. https://github.com/lemire/testingRNG/blob/master/source/splitmix64.h

ReplyDeleteSplitMix64 paper http://dx.doi.org/10.1145/2714064.2660195 cites this David's blog post

DeleteVcein lklo1 23rd346689097r5,04ww2

ReplyDeleteTo be honest, I have been accidentally landed on your site because I was looking at Cheap Assignment Writing Service Uk to write my assignment. however, I know your post is not useful to me regarding my assignment, but I found it really helpful and informative, and I promise I will also share this information with others

ReplyDeleteYour website is very good and nice information was provided in your site, thanks for sharing.

ReplyDeleteartificial intelligence internship | best final year projects for cse | internship certificate online | internship for mba finance students | internship meaning in tamil

This is really interesting, you are such a great blogger. Visit media foster for creative and professional website design and Digital Marketing Company in Mohali and Also get Digital Marketing Course in Mohali

ReplyDeleteTOP IT Company in Mohali

best SEO Company in Mohali

We provide very affordable packages on car services in Chandigarh.

ReplyDeleteluxury car rental in chandigarh

luxury wedding cars in chandigarh

금정구출장샵

ReplyDelete연제구출장샵

수영구출장샵

사상구출장샵

수성구출장샵

달서구출장샵

How do you look for these numbers (multipliers and shifts)?

ReplyDeleteI noticed that variant 13, which seems to be the best, uses non-prime numbers (multipliers). Would it be better if it did? And, obviously, how to find then?

Step into a realm of digital brilliance with the industry's top-rated Mobile and Web Development Firm, recognized among the Best Web Development Companies. Our passion for innovation and commitment to perfection redefine your online presence, making us the trusted choice for transformative and unparalleled web development solutions.

ReplyDeleteUnlock opportunities and connect effortlessly with our platform's free classified ads feature. Post free ads with ease, whether you're buying, selling, or exploring services. Experience a seamless and empowering way to engage with a community of possibilities. Join us today and redefine your online experience.

ReplyDelete