Hash Code generation in .NET: a deep dive
A while back there was some debate on a .NET app at $WORK about its various implementations of Object.GetHashCode.
My initial vibe after looking at the implementation was to yell at everyone how bad these hash code implementations were, and how
badly they would perform. But after being asked about it, I realized that these were just that: vibes, not facts.
So naturally, I had to dive in and actually do the research on hash codes. This blog post is an updated and expanded version
of an internal issue I made. While doing more research I also made some interesting discoveries about hash code implementations in other languages,
so while this article started out as a .NET hash code deep dive, it morphed into a more general story about hash code generation and how hash tables
need some help from you, the programmer, to efficiently function. This is also where I discovered how staggeringly similar Java and .NET’s hash code
rules and requirements are.
The reason for this is that .NET is “Microsoft Java” and you cannot convince me otherwise.
The purpose of hash codes
Before we can talk about the properties of a hash code, we first have to properly define what Object.GetHashCode’s actual
purpose is. Why do we need such a fundamental method on the root Object type in .NET at all? Well, Microsoft explains
it rather clearly in the article
Supplemental API remarks for Object.GetHashCode:
A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value.
That tells us that hash codes are necessary for certain collection operations to be fast.
Specifically, System.Collections.Generic.HashSet<T> and System.Collections.Generic.Dictionary<K, V> are
collections that are based on hash tables. Even more interesting,
HashSet has the following note in its source code:
public class HashSet<T>
{
    // This uses the same array-based implementation as Dictionary<TKey, TValue>.
    // [...]
}
HashSet<T> is just a Dictionary<K, V> in a trenchcoat!
Wikipedia has a great article on hash tables and their workings, so if you want to know the nitty-gritty details: that page explains it way better than I ever could.
However, we do need to establish some fundamental theory before we continue:
- Hash tables need some hash function to select a bucket
- Buckets can hold multiple different values. This means that a situation where two different values x ≠ y for which hash(x) == hash(y) holds need not be fatal.
- For a given user-defined object .NET cannot know which properties
meaningfully contribute to object equality. For example, a
Lock that protects your class from concurrent modification will always differ between objects and thus should not be considered for object equality. But only you, the programmer, know this fact.
Now, if we were only interested in equality we could just ask the programmer to implement Object.Equals or IEqualityComparer for
that type. But that only answers a yes/no question of whether the object is equal, it does not produce a value that we can reproduce
later for fast lookup.
And that is where Object.GetHashCode comes in: it is a part of the hashing function used by Dictionary and HashSet to select buckets
to store your object. Note that I said part: Object.GetHashCode by itself is not the whole hash function! The exact internals of
Dictionary are out of scope for this article, but the internal implementation uses the value produced by Object.GetHashCode to select
a bucket from a range. For our purposes, let's just say that for our object o in a table with n buckets, we can define the bucket selection as follows:

bucket(o) = GetHashCode(o) mod n
It may seem pedantic, but I want to be very clear that while Object.GetHashCode is important for hash table based collections, it does
not by itself determine where your object ends up in the collection.
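To make that concrete, here is a minimal Python sketch of the kind of bucket selection a hash table performs internally (the bucket count of 16 and the plain modulo are illustrative assumptions; real implementations pick their sizes and reductions more cleverly):

```python
# Sketch: how a hash table might turn a hash code into a bucket index.
# hash() plays the role of Object.GetHashCode; the *table* owns the
# reduction of that value into its current bucket range.
def select_bucket(obj, num_buckets=16):
    return hash(obj) % num_buckets

# Equal objects always land in the same bucket...
assert select_bucket((1, 2)) == select_bucket((1, 2))

# ...but the same hash code maps to a different bucket once the table
# resizes, which is why GetHashCode alone doesn't determine placement.
print(select_bucket(12345, num_buckets=16))  # 9 (on CPython, ints hash to themselves)
print(select_bucket(12345, num_buckets=32))  # 25
```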
Requirements for GetHashCode
Given that Object.GetHashCode is used in hash functions, are there any hard requirements on what Object.GetHashCode does?
Well, both Java and .NET have only two real requirements:
- if two objects are equal, Object.GetHashCode should return the exact same value
- if Object.GetHashCode is invoked multiple times on an object whose values are unchanged, then Object.GetHashCode should return the exact same value.
Then there are also two “unofficial official” requirements for Object.GetHashCode: a formally correct
implementation doesn’t need to adhere to these, but if you don’t, performance is abysmal and normal operation
of hash tables goes out of the window:
- if you call Object.GetHashCode, and subsequently change the object, the next invocation of Object.GetHashCode should return a different hash code.
- if a mutable property is used in computing Object.GetHashCode, then that property should not change as long as the object is contained in any hash table.
The first requirement has to do with performance: two unequal objects may return the same hash code, and because of the pigeonhole principle collisions are inevitable if you have more objects than possible hash codes. But collisions should still happen as rarely as possible to keep the hash table performant.
The second requirement is much more important, because it influences correctness:
if a property that is used to compute Object.GetHashCode is changed while the containing object is stored in a hash table, then
that object is now stored in the wrong bucket! This can lead to subtle bugs if you try to look up an object whose hash code has changed in the meantime.
For example, HashSet.Contains could report false while your object is contained in the collection.
Finally, a hash code is a runtime value: the produced hash code does not need to be consistent between program invocations. This is why .NET has a giant warning not to store hash codes in persistent places: any re-run or reboot will invalidate those hash codes. Most implementations even seed their hash codes with a per-run random seed, mainly to prevent hash flooding attacks (we’ll get back to that later). This virtually guarantees that any attempt to use hash codes out of band will fail spectacularly, which doubles as a neat anti-Hyrum’s law move1.
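Python makes this per-run seeding easy to observe, since its hash seed can be pinned through an environment variable (a sketch; .NET randomizes its string hashes the same way, it just doesn't let you pin the seed):

```python
import os
import subprocess
import sys

def string_hash_with_seed(seed):
    # Spawn a fresh interpreter with a pinned hash seed and hash the
    # same string; each seed effectively simulates a new program run.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('hello'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# Same input, different "runs": the hash codes differ, which is exactly
# why you should never persist them anywhere.
print(string_hash_with_seed(1) != string_hash_with_seed(2))  # True
```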
Playing around with some examples
// suppose we have two 🌈 magical 🌈 objects of the same type.
// let's run some tests!
class Point
{
    public int X;
    public int Y;

    public override bool Equals(object? other)
    {
        if (other is not Point point)
        {
            return false;
        }
        return X == point.X && Y == point.Y;
    }

    public override int GetHashCode()
    {
        // now what would we put here?
        throw new NotImplementedException();
    }
}
Point object1 = GetRandomPoint();
Point object2 = GetRandomPoint();
if (object1.Equals(object2))
{
    // the objects are equal, so their hash codes _must_ also be equal
    Assert(object1.GetHashCode() == object2.GetHashCode());
}
else
{
    // if objects aren't equal, we basically can't say anything meaningful about their hash codes
    // they can be unequal, but they don't have to be!
}

// if we didn't modify the object, then the hash shouldn't change
int hash1 = object1.GetHashCode();
Sleep(10);
int hash2 = object1.GetHashCode();
Assert(hash1 == hash2);

object1.Y = 5;
int hash3 = object1.GetHashCode();
// this next assert is _technically_ illegal. But if you
// want your app to perform well, this should hold, since we changed the object's contents
Assert(hash2 != hash3);
// pop quiz: what happens if you do this?
var set = new HashSet<Point>();
set.Add(object1);
object1.X = 5;

// uh-oh, this is the worst: will the assert fire? Who knows! Since
// we modified object1 we _don't know_
Assert(set.Contains(object1));

set.Add(object1);
// this may print 2, or 1, depending on a variety of factors
Console.WriteLine(set.Count);
So what would that empty GetHashCode method contain? Well, a fully complete and working
implementation could be:
public override int GetHashCode()
{
    return 4; // chosen by fair dice roll
}

// or, if we want to at least prevent Hyrum's law
private static readonly int _hashCode = Random.Shared.Next();

public override int GetHashCode()
{
    return _hashCode;
}
This is terrible because it reduces the performance of dictionaries and sets to that of linked lists.
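The degradation is easy to measure (a Python sketch; the exact timings are machine-dependent, but the O(n²) blowup is not):

```python
import time

class GoodKey:
    def __init__(self, n):
        self.n = n
    def __eq__(self, other):
        return isinstance(other, GoodKey) and self.n == other.n
    def __hash__(self):
        return hash(self.n)

class BadKey(GoodKey):
    def __hash__(self):
        return 4  # chosen by fair dice roll

def build_set(cls, n):
    # Building a set of n distinct keys is ~O(n) with a decent hash,
    # but O(n^2) when every key collides into the same bucket.
    start = time.perf_counter()
    s = {cls(i) for i in range(n)}
    assert len(s) == n
    return time.perf_counter() - start

n = 2000
good = build_set(GoodKey, n)
bad = build_set(BadKey, n)
print(bad > 2 * good)  # True: the constant hash is dramatically slower
```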

Desirable properties for a GetHashCode hash function
Clearly, we also need some hashing theory. For our purposes, the following properties are important:
- speed
- avalanche effect
- uniform output distribution
- collision resistance
Collision resistance used to be more of a nice-to-have than a hard requirement, but given that in this day and age hash tables are effectively wired directly to the internet, it has become a more important factor (note, however, that speed gains still outweigh collision resistance, otherwise we’d all use SHA3 and call it a day).
First up, the most important property of a good hashing function for use in GetHashCode is speed: we’re not trying to
cryptographically protect data, so we can trade security for speed in this case. It’s also worth noting that a hash table implementation
might query the object's hash code multiple times. Any slowdown in GetHashCode directly slows down the hash table performance.
The avalanche effect is important for small key spaces (such as enums or small strings). The general idea is that if you change even a single bit of the input, the hash output should be drastically different. This is mostly a performance concern since real-life keys may be closely related such as monotonically incrementing integer keys. You don’t want these keys to all land in the same bucket, since you’ve just reduced your dictionary to a linked list in that case.
While the avalanche effect as stated doesn't impose hard requirements (one could call it the avalanche vibe if so inclined), there is a formalization that does: the Strict Avalanche Criterion (SAC). SAC mandates that if a single bit of the input is flipped, each of the output bits changes with a 50% probability. So while conforming to SAC might not be essential, it is desirable for our hash function.
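You can estimate how close a function gets to SAC by flipping input bits and counting output bit flips. A Python sketch, using MurmurHash3's 32-bit finalizer (fmix32) as a known-good mixer and a single times-33 step as the strawman:

```python
import random

def fmix32(x):
    # MurmurHash3's finalizer: designed specifically for good avalanche.
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & 0xFFFFFFFF
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & 0xFFFFFFFF
    x ^= x >> 16
    return x

def times33_step(x):
    # one Bernstein-style multiply-add step on a 32-bit word
    return (33 * x + 1) & 0xFFFFFFFF

def avalanche(h, trials=200):
    # Average probability that an output bit flips when one input bit
    # flips; SAC demands ~0.5 for every bit position.
    rng = random.Random(42)
    flipped = total = 0
    for _ in range(trials):
        x = rng.getrandbits(32)
        for bit in range(32):
            diff = h(x) ^ h(x ^ (1 << bit))
            flipped += bin(diff).count("1")
            total += 32
    return flipped / total

print(0.45 < avalanche(fmix32) < 0.55)  # True: close to SAC
print(avalanche(times33_step) < 0.35)   # True: multiply-add alone mixes poorly
```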
A uniform distribution is important for the load factor of a hash table: if a hash function doesn’t generate a uniform distribution of outputs then this means some buckets will be filled with more entries than others which impacts performance, so our hash function should absolutely approximate a uniform distribution.
Lastly, collision resistance is a stronger form of the uniform distribution requirement (remember, GetHashCode does not select a bucket on its own!).
This property is important for defending against hash flooding attacks that could turn a simple lookup cache into a denial of service attack. However, unlike
things such as HMAC generation, not having a strong variant of this requirement does not undermine our entire theoretical basis.
Generating hash codes according to the internet
Now that we know why we need them, let’s look at how the almighty internet (StackOverflow) thinks we should generate a hash code for a given object. Jon Skeet says to go with Joshua Bloch’s item 11 in Effective Java:
// https://stackoverflow.com/a/263416
// CC-BY-SA 4.0 Jon Skeet
public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        // Suitable nullity checks etc, of course :)
        hash = hash * 23 + field1.GetHashCode();
        hash = hash * 23 + field2.GetHashCode();
        hash = hash * 23 + field3.GetHashCode();
        return hash;
    }
}
He also mentions FNV hashing, which basically replaces the addition with an XOR operation but changes precious little else:
// https://stackoverflow.com/a/263416
// CC-BY-SA 4.0 Jon Skeet
// Duck's note: the reason Jon calls this "not quite FNV" is
// because FNV only operates on the byte representation of its input,
// which is a lot harder to do in high-level languages
// Note: Not quite FNV!
public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = (int) 2166136261;
        // Suitable nullity checks etc, of course :)
        hash = (hash * 16777619) ^ field1.GetHashCode();
        hash = (hash * 16777619) ^ field2.GetHashCode();
        hash = (hash * 16777619) ^ field3.GetHashCode();
        return hash;
    }
}
This first suggested form was what started the entire search into hash codes in the first place, because to me this looks weird and, to be completely honest, amateurish. Even the reasoning of giants such as Jon Skeet and Joshua Bloch as to why you should do it this way boils down to “it’s good enough, I guess”. Really, Joshua Bloch states the following in Effective Java (3rd ed.):
“They are comparable in quality to the hash functions found in the Java platform libraries’ value types and are adequate for most uses.”
Okay, I can accept the “good enough” part, but that’s not a proper answer. We’re doing math here, we can (and should!) prove things! Just saying “well, it’s what Java does, so it’s fine” isn’t the same as a proper evaluation of the hash function. The best part is that once you’ve seen this style of hash, you’ll spot it everywhere. It has spread as a sort of magic incantation that everybody uses with the caveat “it works, don’t ask me why”. That raises the question: what exactly is this type of hash function, and more importantly: does it work as advertised?
Luckily for us, Jon did link a (now defunct) blog post by Julienne Walker
introducing us to the principles of hash functions.
And if you search for a bit you’ll find a section called “Bernstein hash” after the legendary djb.
This specific version is also known as DJBX33A2 (Daniel J. Bernstein, Times 33 with Addition):
unsigned djb_hash ( void *key, int len )
{
    unsigned char *p = key;
    unsigned h = 0;
    int i;

    for ( i = 0; i < len; i++ )
        h = 33 * h + p[i];

    return h;
}
Well that looks awfully familiar doesn’t it? Once you have a name that’s not “that hash code computey thingy” searching becomes a lot easier. Let’s un-boomer that code a bit and give things a better name:
uint DjbHash(byte[] value)
{
    const uint initial = 0;
    const uint factor = 33;

    uint hash = initial;
    foreach (byte b in value)
    {
        hash = (factor * hash) + b;
    }
    return hash;
}
I chose a regular ol’ byte array since that’s the closest thing to a void pointer that C# has.
There are basically two important components to a Bernstein hash:
- an initial value to start the multiplication sequence
- a constant factor to multiply each result with
As Walker explains, Bernstein posted this to comp.lang.c but “its theory is not up to snuff”.
The constant 33 for example, was never explained by Bernstein other than “any good multiplier works”.
However, as much as I’d like to just take this and dunk on Joshua Bloch, I can’t really do that, because
the dismissal of the Bernstein hash comes with exactly as little explanation as its introduction.
If I’m flaming Jon Skeet and Joshua Bloch for not being mathematically thorough, then I can’t really
throw the same “but somebody on the interwebs said it was bad” back at them.
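As a sanity check on the mechanics (and because small examples keep us honest), here is the same loop in Python, verified by hand on a two-byte input. The initial value 0 and factor 33 are Walker's; the 32-bit mask stands in for C's unsigned wraparound:

```python
def djb_hash(data: bytes) -> int:
    h = 0
    for b in data:
        h = (33 * h + b) & 0xFFFFFFFF  # wrap around like a C unsigned
    return h

# By hand: h = 33 * 0 + 97 ('a') = 97, then h = 33 * 97 + 98 ('b') = 3299
print(djb_hash(b"ab"))  # 3299
```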
Mixing in some math
If you trawl StackOverflow a bit more, you can discover that other people have in fact scrutinized this hash, or a derivative of it. One interesting observation is that the Bernstein hash closely mimics a Linear Congruential Generator (LCG) of the form:

X_{n+1} = (a * X_n + c) mod m

where m is usually 2^32 or 2^64. Or, if you don’t speak italics:

unsigned Xn1 = (a * Xn + c) % m;
Gee whiz, that looks an awful lot like what Bernstein does if you omit the modulo, doesn’t it? The thing is, however, that LCGs are for generating sequences of values! Since we are hashing, we are faced with the problem that c is the input for our hash, not a constant that we can tactically select. But does this actually matter? Well, if you think about it: no. Recall that there are a few properties that are important for our type of hash functions:
- S P E E D
- uniform distribution
- strong avalanche effect
Crucially, we do not need strong cryptographic properties such as:
- pre-image resistance
- second pre-image resistance
- collision resistance (but protection against hash flooding is not a theoretical concern anymore)
Mapping this to an LCG, that basically means that the ideal property of its period (the number of iterations before it restarts its sequence) would be equal to m, which would guarantee a perfectly uniform distribution! Enter the Hull–Dobell theorem, which states that an LCG will have a period that is equal to m if, and only if:
- m and c are coprime
- a - 1 is divisible by all prime factors of m
- a - 1 is a multiple of 4 if m is a multiple of 4

We can almost conform to this theorem, since we control a and m. The only thing we do not control is c, and thus we can never guarantee that c and m are coprime. But that’s the thing: if you conform to the Hull–Dobell theorem, you get a perfect LCG. And as stated before, for what we’re after, “pretty decent” results are actually sufficient. Given that we pick m to be 2^32, we can pick a so that everything except that first condition holds. Can you guess which a that would be? Right! a = 33 suddenly makes a whole lot of sense!
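You don't have to take the theorem's word for it: with a toy modulus the full period is easy to check directly. A Python sketch (m = 2^8 so the loop stays tiny; c = 1 stands in for some fixed input byte):

```python
def cycle_length(a, c, m, start=0):
    # Iterate x -> (a * x + c) mod m until a state repeats, then return
    # the length of the cycle the sequence fell into.
    seen = {}
    x, step = start, 0
    while x not in seen:
        seen[x] = step
        x = (a * x + c) % m
        step += 1
    return step - seen[x]

m = 2 ** 8
# a = 33 satisfies Hull-Dobell for c = 1: gcd(1, m) = 1, and
# a - 1 = 32 is divisible by 2 (m's only prime factor) and by 4.
print(cycle_length(33, 1, m) == m)  # True: full period, every value visited
# a = 34 breaks the conditions (a - 1 = 33 is odd), so the period collapses.
print(cycle_length(34, 1, m) < m)   # True
```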
Now, remember that as Bernstein says: “any good multiplier will do”, and the most important properties we can deduce from our LCG are that a must be odd and, if possible, prime.
Also recall that Jon Skeet mentioned the Fowler-Noll-Vo (FNV) hash. FNV is stated to be “better” than the Bernstein hash, but it is mainly used because, just like the Bernstein hash, its implementation fits on the back of an envelope:
unsigned fnv_hash(void *key, int len)
{
    unsigned char *p = key;
    unsigned h = 2166136261;
    int i;

    for (i = 0; i < len; i++)
    {
        h = (h * 16777619) ^ p[i];
    }

    return h;
}
If you squint, you’ll see an LCG-ish construct appearing here as well.
So now we know why these hashes work, and why they became popular. But that doesn’t answer a fundamental question: are they actually good, and are there better alternatives?
Code is cheap, show me the benchmarks
It turns out that experts have not been idle and the state of the art has advanced quite a bit:
- SipHash
- SpookyHash
- CityHash and its successor, FarmHash
- MurmurHash
- xxHash
- wyhash and its successor, rapidhash
These are all non-cryptographic hashing functions: the only reason we’re using them over SHA3 and friends is S P E E D. That raises the question: is there a faster algorithm? Well, luckily, Aras did all the hard work for us and benchmarked a bunch of non-cryptographic hashes. What’s interesting is that SipHash is faster than FNV and Bernstein hashing, but not by a whole lot. Another interesting observation is how slow Bernstein and FNV hashing are compared to the more optimized variants. Granted, this is most likely due to the modern hashes using SSE and AVX to speed up throughput, but the whole reason people stick with Bernstein is that it’s “easy and fast”. Well, the current state of the art has produced some fancy results that make that vibe no longer true.
So that begs the question: what do languages use for their hashing needs? Well, after some trawling, you end up with the following table:
| language | algorithm |
|---|---|
| PHP | heavily optimized Bernstein (Yes, really) |
| Java | Bernstein (Yes, really) |
| JavaScript (V8) | rapidhash |
| Ruby | SipHash-1-3 |
| Python | SipHash, dependent on configuration |
| Rust | SipHash-1-3 |
| Swift | SipHash-2-4 and SipHash-1-3 |
| C++ (libc++) | Murmur2 or CityHash |
| Go | AES (but not true AES) |
| .NET | xxHash32 |
| Zig | wyhash |
Several notes:
- Tracking down Ruby’s hash implementation was the worst. I pride myself on being able to locate things in large codebases reasonably quickly, but Ruby uses so much type indirection: it is typedef after function pointer after typedef, and it took me a solid 30 minutes before I hit paydirt. Mind you, that is 20 minutes longer than searching the veritable quagmire that is V8 for its hashing implementation!
- If anyone asks what JavaScriptCore or SpiderMonkey use for their hashing: I don’t know. V8 was already hell to trawl through, I’m not going to repeat that exercise for two more giant quagmires of C++ template indirection. Besides, V8 is the de-facto JavaScript implementation and the only reason JSC has any market share at all is because Apple is a monopolist that doesn’t want other engines in its walled garden3
- SipHash seems to be the most popular, and the reason for that is most likely the wave of HashDoS attacks that hit several languages in late 2011. One of SipHash’s stated goals is protection against hash flooding attacks, so it makes sense that languages that changed their default hash function moved to SipHash.
- Python’s hash() builtin is actually configurable! Try running print(__import__("sys").hash_info) on your system to see what the default configuration is.
- Weirdly enough, the implementers of Python’s hash feared SipHash’s initialization routine would be too slow on small inputs, so they opted to fall back to… fucking Bernstein.
- Rust’s default hasher is SipHash, but it explicitly warns users that the exact implementation is unspecified and should not be relied upon. Luckily for Rust, nobody reads docs so that will be fun when they try to change it.
- Google developed both CityHash and its successor FarmHash to be optimized for their datacenter hardware. I guess at Google’s scale shaving milliseconds off of your hash calculations saves you quite a bit of money.
- wyhash and rapidhash are very light on details. You basically get a tarball and some SMHasher benchmarks, but that’s about it.
- Go is the weirdest of the bunch: it uses AES hardware intrinsics for speed, but it doesn’t really use AES? It’s weird, let’s keep it at that. For environments without AES hardware intrinsics the runtime falls back to a wyhash inspired hash. And I thought my NIH was bad.
- PHP and Java seem to not mind keeping Bernstein alive for so long. Never underestimate institutional inertia coupled with maintainer indifference. I don’t think improving hash codes is something that nets people their bonuses, and there is a lot of userland code riding on these implementations.
- PHP at least made an effort to improve the throughput of Bernstein, as it is a hand-vectorized implementation. You gotta give ‘em props for that.
- .NET’s System.HashCode uses xxHash32, but record types that automatically generate their own GetHashCode still use a Bernstein hash. In an even weirder turn, .NET’s string type uses an entirely custom hashing scheme called Marvin32, which was patented by Microsoft in response to ASP.NET hash flooding attacks and doesn’t appear to be used anywhere else.
So… what do I use?
Well, the basic answer is “not Java”, but that’s been true for as long as that language has existed. In order of “engineering prudence”, the actual answers are:
- Whatever your language provides you as the default, since that is what the ecosystem expects and relies on and is (hopefully) well optimized by the language implementation.
- If the default is not good enough, for example with .NET’s autogenerated record equality that still generates a Bernstein hash, you should first look if the language has a default facility that answers the “I don’t care about hashing theory, give me some good hash” question. .NET’s System.HashCode and C++’s std::hash are neat examples. The advantage of this is that if the language decides to switch to a better variant at a later time, your code will automatically benefit from that when you upgrade to the new language version.
- If you still have problems with hash quality or throughput, then you should think about using a library yourself, such as xxHash or rapidhash. Remember that a lot of the speed gains of these hashes come from hand-rolled SIMD implementations and clever CPU optimization tricks, so reimplementing rapidhash in e.g. vanilla Ruby is probably not going to give you a big speed gain over Ruby’s builtin C version of SipHash. The big takeaway is that at this stage, benchmarks are a must and you should profile everything before you commit to such a big departure from the defaults. Plus, the maintenance burden falls solely on you for checking if the algorithm you’re using is still good and doesn’t have any vulnerabilities.
As for our earlier .NET example, we can use the aforementioned System.HashCode struct to compute our hash codes.
Our Point example from earlier can easily be rewritten to:
public override int GetHashCode()
{
    return HashCode.Combine(X, Y);
}
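For comparison, Python's idiomatic answer to the same "just give me a good combined hash" question is delegating to the built-in tuple hash (itself xxHash-flavored since CPython 3.8):

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # like HashCode.Combine: let the runtime do the mixing; equal
        # (x, y) pairs are guaranteed equal hash codes
        return hash((self.x, self.y))

p, q = Point(3, 4), Point(3, 4)
print(p == q and hash(p) == hash(q))  # True
```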
This is the ideal solution to me: .NET gets high quality hash codes from you, can update their chosen algorithm whenever they want to, and you do not need to become a PhD in cryptography to figure all this stuff out by yourself.
Unless you’re a masochist like me of course.
Further reading
- https://orlp.net/blog/breaking-hash-functions/
- https://github.com/mignon-p/jsw-libs/blob/master/Tutorials/jsw_tut_hashing.md
- https://learn.microsoft.com/en-us/dotnet/fundamentals/runtime-libraries/system-object-gethashcode
- https://peps.python.org/pep-0456/
- https://www.npopov.com/2014/12/22/PHPs-new-hashtable-implementation.html
Footnotes
1. Completely unrelated, but people relying on behavior with a giant DO NOT RELY ON THIS sign above it is why Go’s map iteration order is randomized. Reading? DOCUMENTATION? What kind of madman do you take me for?! ↩
2. Depending on how old the version is you find, you may encounter a version that writes (h * 33) as ((h << 5) + h) + p[i], which is exactly the same, but optimized to use shifts instead of multiplication:

   # remember that left shifting a by x
   # is effectively the same as saying (a * (2 ** x))
   ((h << 5) + h) == h * 33
   ((h * (2 ** 5)) + h) == h * 33
   ((h * 32) + h) == h * 33
   h * 33 == h * 33

   ↩
3. No, Bun and Firefox’s SpiderMonkey don’t count: they are dying embers of a VC AI pump ‘n dump and Google’s “look FTC, we’re not a browser monopoly” scheme respectively. ↩