Intro

In languages like Python, Java, or C++, values are hashed by calling a "hash me" method on them, implemented by the type author. This fixed-size hash is then immediately used by the hash table or what have you. This design suffers from some obvious problems, like:
- How do you hash an integer? If you use a no-op hasher (booo), DoS attacks on hash tables are inevitable. If you hash it thoroughly, consumers that only cache hashes to optimize equality checks lose out on performance.
- How do you mix hashes? You can use x ^ y (indeed, both arguments are pseudo-random, what could possibly go wrong?) or a * x + y (cue CVEs because people used mix(x, mix(y, z)) instead of mix(mix(x, y), z); see the sketch right after this list).
- What if the input data is already random? Then you're just wasting cycles.
- What guarantees do you provide regarding the hash values?
- Is your hash function seeded?
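To see why the nesting order of a * x + y matters, here's a tiny self-contained sketch (the constant a is arbitrary; the point is that the wrong nesting is symmetric in its first two inputs):

const A: u64 = 0x9E3779B97F4A7C15; // arbitrary odd multiplier

fn mix(x: u64, y: u64) -> u64 {
    A.wrapping_mul(x).wrapping_add(y)
}

fn main() {
    let (x, y, z) = (1u64, 2, 3);
    // mix(x, mix(y, z)) expands to a*x + a*y + z, which is symmetric in x and y:
    // swapping the first two fields of a struct never changes the hash.
    assert_eq!(mix(x, mix(y, z)), mix(y, mix(x, z)));
    // mix(mix(x, y), z) expands to a*a*x + a*y + z and stays order-sensitive.
    assert_ne!(mix(mix(x, y), z), mix(mix(y, x), z));
}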
In Rust

Rust learnt from these mistakes by splitting the responsibilities:
- Objects implement the Hash trait, which lets them write their underlying data into a Hasher.
- Hash consumers (hash maps and the like) provide a type implementing the Hasher trait, which hashes the data written by Hash objects.
Objects turn the structured data into a stream of integers; hashers turn the stream into a numeric hash.
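Here's roughly what that split looks like in practice (Point is a made-up example type; the impl below is also roughly what #[derive(Hash)] would generate for it):

use std::hash::{Hash, Hasher};

struct Point {
    x: i32,
    label: String,
}

// The type describes *what* to feed into the hasher; the hasher decides *how*
// to turn that stream into a 64-bit value.
impl Hash for Point {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.x.hash(state);     // ends up as a single write_i32 call
        self.label.hash(state); // writes the string bytes plus a delimiter
    }
}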
On paper, this is a good solution: the hashing algorithm can be swapped to fit the use case (speed, DoS resistance, portability), and each type only needs a single Hash implementation. Surely this enables optimal and performant hashing in practice, right?
NoLet’s take a look at the Hasher
API:
pub trait Hasher {
// Required methods
fn finish(&self) -> u64;
fn write(&mut self, bytes: &[u8]);
// Provided methods
fn write_u8(&mut self, i: u8) { ... }
fn write_u16(&mut self, i: u16) { ... }
fn write_u32(&mut self, i: u32) { ... }
fn write_u64(&mut self, i: u64) { ... }
fn write_u128(&mut self, i: u128) { ... }
fn write_usize(&mut self, i: usize) { ... }
fn write_i8(&mut self, i: i8) { ... }
fn write_i16(&mut self, i: i16) { ... }
fn write_i32(&mut self, i: i32) { ... }
fn write_i64(&mut self, i: i64) { ... }
fn write_i128(&mut self, i: i128) { ... }
fn write_isize(&mut self, i: isize) { ... }
fn write_length_prefix(&mut self, len: usize) { ... }
fn write_str(&mut self, s: &str) { ... }
}
This API is tuned to streaming hashes, like the polynomial hash and its various knock-offs. But just like in encryption, hashing is block-wise these days.
Block hashes have some internal state that iteratively “absorbs” input blocks of a fixed length. When the data runs out, the last block is padded with length and absorbed as a fixed-length block too. A finalization step then reduces the internal state to 64 bits (or more, depending on the use case).
That’s how SHA-2 and many other cryptographic hashes work, but you might be surprised to know that the top hashes in the SMHasher list all use the same approach.
The block-wise design is objectively superior to streaming. Consuming as much data as possible at once reduces the amortized avalanche cost, enabling safer hash functions at greater speed than streaming hashes can achieve. Block-wise hashes also have lower latency, since latency accumulates per block rather than per individual input.
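As a minimal illustration of the absorb/pad/finalize shape described above (a toy sketch, not any specific real hash):

struct ToyBlockHash {
    state: u64,
}

impl ToyBlockHash {
    // Absorb one fixed-length block into the internal state.
    fn absorb(&mut self, block: [u8; 8]) {
        let block = u64::from_le_bytes(block);
        self.state = (self.state ^ block).wrapping_mul(0x9E3779B97F4A7C15);
    }

    // Pad the final partial block, mix in the total length, and finalize.
    fn finalize(mut self, tail: &[u8], total_len: u64) -> u64 {
        assert!(tail.len() < 8);
        let mut block = [0u8; 8];
        block[..tail.len()].copy_from_slice(tail);
        self.absorb(block);
        self.absorb(total_len.to_le_bytes());
        // Reduce the internal state to the output size.
        self.state ^ (self.state >> 32)
    }
}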
The Hasher API makes no effort to suit block hashes. The hasher is not informed of the length of the data or of its structure. It must always be able to absorb just one more u8, bro, I promise. There are only two ways to deal with this:
- pad every small write out to a full block and absorb it immediately, or
- accumulate writes in a temporary buffer and only absorb the buffer once it fills up.
Let’s see what’s wrong with these approaches.
PaddingLet’s consider a very simple block-wise hash:
// K is an arbitrary odd constant; the disassembly later in this post happens to use 5512829513697402577.
const K: u64 = 5512829513697402577;

fn absorb(state: &mut u64, block: &[u8; 8]) {
let block = u64::from_ne_bytes(*block);
*state = state.wrapping_mul(K).wrapping_add(block);
}
This is just a multiplicative hash, not unlike FNV-1, but consuming eight bytes at a time instead of one.
Now what happens if you try to hash two 32-bit integers with this hash? With padding, that will compile to two multiplications even though one would work. This halves throughput and increases latency.
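To make the cost concrete, here's a sketch of both strategies on top of the same multiplicative absorb step (the helper names are made up):

const K: u64 = 5512829513697402577;

fn absorb_u64(state: &mut u64, block: u64) {
    *state = state.wrapping_mul(K).wrapping_add(block);
}

// Padding: each u32 becomes its own 8-byte block, costing two multiplications.
fn hash_two_u32_padded(state: &mut u64, a: u32, b: u32) {
    absorb_u64(state, a as u64);
    absorb_u64(state, b as u64);
}

// Packing: both values share one block, so a single multiplication suffices.
fn hash_two_u32_packed(state: &mut u64, a: u32, b: u32) {
    absorb_u64(state, (a as u64) << 32 | b as u64);
}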
Practical hashes use much larger blocks: rapidhash absorbs 48 bytes at a time (see its kernel in the next section), and ahash and meowhash are likewise built around wide blocks. These are some of the fastest non-cryptographic hashes in the world. Do you really want to nuke their performance by padding every small write to a full block?
Chains

Okay, but what if we cheated and modified the hash functions to absorb small data somewhat more efficiently than absorbing a full block? Say, the rapidhash kernel is effectively this:
fn absorb(state: &mut [u64; 3], seed: &[u64; 3], block: &[u64; 6]) {
for i in 0..3 {
state[i] = mix(block[i] ^ state[i], block[i + 3] ^ seed[i]);
}
}
That’s three independent iterations, so surely we can absorb a smaller 64-bit block like this instead:
fn absorb_64bit(state: &mut [u64; 3], seed: &[u64; 3], block: u64) {
state[0] = mix(block ^ state[0], seed[0]);
}
Surely this is going to reduce the cost of small inputs? Not so fast. Why does rapidhash even use three independent chains in the first place? That's right, latency! mix is a full-width multiplication with a multi-cycle latency, and the three independent chains let consecutive absorptions overlap in the pipeline. Funnel every small write through a single chain and each write has to wait for the previous one to finish, bringing back exactly the latency those chains were added to hide.
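For reference, mix here is the usual widening-multiply fold; this is a sketch of the general shape rather than rapidhash's exact code:

// Multiply into a full 128-bit product, then fold the two halves back together.
// One such call costs a handful of cycles of latency, which the three chains
// above are designed to overlap.
fn mix(a: u64, b: u64) -> u64 {
    let product = (a as u128) * (b as u128);
    (product as u64) ^ ((product >> 64) as u64)
}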
Accumulation

Okay, so padding is a terrible idea. Can we accumulate a buffer instead? How many hashes I had to scroll through in the SMHasher list before I found even one Rust implementation that takes this approach is a warning bell in itself.
The implementation I found, of course, stores a Vec<u8> and passes it to the underlying hasher in finish. I believe I don't need to explain why allocating inside a hash function is not the brightest idea.
Let’s consider another implementation that stores a fixed-size buffer instead. Huh, that’s a lot of if
s and for
s. I wonder what Godbolt will say about this. Let’s try something very simple:
struct StreamingHasher {
block_hasher: BlockHasher,
buffer: [u8; 8],
length: usize,
}
impl StreamingHasher {
fn write(&mut self, input: &[u8]) {
// If the input fits in the free space in the buffer, just copy it.
let rest = unsafe { self.buffer.get_unchecked_mut(self.length..) };
if input.len() < rest.len() {
rest[..input.len()].copy_from_slice(input);
self.length += input.len();
return;
}
// Otherwise, copy whatever fits and hash the chunk.
let (head, tail) = input.split_at(rest.len());
rest.copy_from_slice(head);
self.block_hasher.feed(self.buffer);
// Split the rest of the input into blocks and hash them individually, move the last one
// to the buffer.
let chunks = tail.array_chunks();
let remainder = chunks.remainder();
self.buffer[..remainder.len()].copy_from_slice(remainder);
self.length = remainder.len();
for chunk in chunks {
self.block_hasher.feed(*chunk);
}
}
}
Surely this will compile to good code? :ferrisClueless:
Here’s what writing 1 (one) byte into this hasher compiles to:
write_u8:
push r15
push r14
push r13
push r12
push rbx
sub rsp, 16
mov rbx, rdi
mov byte ptr [rsp + 15], sil
mov r14, qword ptr [rdi + 16]
add rdi, r14
add rdi, 8
cmp r14, 6
ja .LBB0_2
mov byte ptr [rdi], sil
mov r14, qword ptr [rbx + 16]
inc r14
jmp .LBB0_3
.LBB0_2:
lea r15, [rbx + 8]
mov edx, 8
sub rdx, r14
lea r12, [rsp + rdx]
add r12, 15
add r14, -7
lea rsi, [rsp + 15]
mov r13, qword ptr [rip + memcpy@GOTPCREL]
call r13
movabs rax, 5512829513697402577
imul rax, qword ptr [rbx]
add rax, qword ptr [rbx + 8]
mov qword ptr [rbx], rax
mov rsi, r14
and rsi, -8
add rsi, r12
mov rdi, r15
mov rdx, r14
call r13
.LBB0_3:
mov qword ptr [rbx + 16], r14
add rsp, 16
pop rbx
pop r12
pop r13
pop r14
pop r15
ret
Waow, what happened? That's right, copy_from_slice did! LLVM cannot compile a variable-length copy into anything other than memcpy. Did you write a loop with a guaranteed bound on the iteration count by hand? Too bad, that goes in the memcpy hole.
SipHasher

So crates in the wild do this wrong. How does the built-in Rust hasher handle this? It conveniently doesn't define write_* at all; by design, because this important optimization leads to a small increase in compile time. Riiiiiight.
The siphasher crate, though, optimizes the short-length memcpy with bitwise operations. Let's try it out:
fn write(&mut self, input: u64, input_len: usize) {
assert!(input_len <= 8);
if input_len != 8 {
assert!(input >> (8 * input_len) == 0);
}
// Consume as many inputs as fit.
let old_length = self.length;
self.buffer |= input << (8 * self.length);
self.length += input_len;
// On overflow, feed the buffer block hasher and initialize the buffer with the tail.
if self.length > 8 {
self.block_hasher.feed(self.buffer);
self.buffer = input >> (8 * (8 - old_length));
self.length -= 8;
}
}
write_u8:
mov rax, qword ptr [rdi + 16]
movzx ecx, sil
lea edx, [8*rax]
lea rsi, [rax + 1]
shlx rdx, rcx, rdx
or rdx, qword ptr [rdi + 8]
mov qword ptr [rdi + 8], rdx
mov qword ptr [rdi + 16], rsi
cmp rsi, 9
jb .LBB0_2
movabs rsi, 5512829513697402577
imul rsi, qword ptr [rdi]
add rsi, rdx
mov edx, eax
add rax, -7
neg dl
mov qword ptr [rdi], rsi
shl dl, 3
shrx rcx, rcx, rdx
mov qword ptr [rdi + 8], rcx
mov qword ptr [rdi + 16], rax
.LBB0_2:
ret
This is kind of better? Now let's try hashing (u8, u8) like Rust would do:
#[no_mangle]
fn write_u8_pair(&mut self, pair: (u8, u8)) {
    // Two one-byte writes through the bit-twiddling `write` defined above.
    self.write(pair.0 as u64, 1);
    self.write(pair.1 as u64, 1);
}
write_u8_pair:
mov r8, qword ptr [rdi + 16]
movzx r9d, sil
movabs rax, 5512829513697402577
lea ecx, [8*r8]
shlx rsi, r9, rcx
or rsi, qword ptr [rdi + 8]
lea rcx, [r8 + 1]
cmp rcx, 9
jb .LBB1_2
mov rcx, qword ptr [rdi]
imul rcx, rax
add rcx, rsi
mov qword ptr [rdi], rcx
mov ecx, r8d
add r8, -7
neg cl
shl cl, 3
shrx rsi, r9, rcx
mov rcx, r8
.LBB1_2:
lea r8d, [8*rcx]
movzx edx, dl
shlx r8, rdx, r8
or r8, rsi
lea rsi, [rcx + 1]
mov qword ptr [rdi + 8], r8
mov qword ptr [rdi + 16], rsi
cmp rcx, 8
jb .LBB1_4
imul rax, qword ptr [rdi]
add rax, r8
mov qword ptr [rdi], rax
mov eax, ecx
add rcx, -7
neg al
shl al, 3
shrx rax, rdx, rax
mov qword ptr [rdi + 8], rax
mov qword ptr [rdi + 16], rcx
.LBB1_4:
ret
Waow. So elegant. What went wrong?
In retrospect, the reason is obvious. The two writes can “tear” if the first write fills the buffer to the end. The optimizer does not realize the writes can be combined, so we’re left with this monstrosity.
More generally, the problem is that the write_* methods cannot predict the current state of the buffer, so the branches and variable-index accesses cannot be optimized out. And if writes forced the state to a fixed one? Well, that's equivalent to padding the data to a full block. Eugh.
Inlining

Okay, but hear me out, surely the state can be predicted if the hash and write_* calls are inlined? Here:
fn hash_u8_pair(pair: (u8, u8)) -> u64 {
let mut hasher = Self::new();
hasher.write_u8(pair.0);
hasher.write_u8(pair.1);
hasher.finish()
}
hash_u8_pair:
movzx eax, sil
movzx ecx, dil
shl eax, 8
or eax, ecx
ret
That’s a nice argument, but let me introduce to you: variable-length collections. Vec<T>
is hashed by writing the length and then hashing the elements one by one. Even if the element hashing is somehow vectorized (it’s not, LLVM is a dumdum), nothing after this variable-length collection can be hashed efficiently.
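For reference, this is roughly the code path slices and Vec take (a paraphrase, not the literal std source):

use std::hash::{Hash, Hasher};

fn hash_slice_like<T: Hash, H: Hasher>(data: &[T], state: &mut H) {
    // std uses the (unstable) write_length_prefix here; write_usize is the stable
    // stand-in. The length goes first so that ([1], [2, 3]) and ([1, 2], [3]) differ.
    state.write_usize(data.len());
    for element in data {
        // One `hash` call, and thus at least one `write_*` call, per element.
        element.hash(state);
    }
}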
std

Surely someone thought of this problem before? C'mere, take a look at how slices of integers are hashed:
#[inline]
fn hash_slice<H: Hasher>(data: &[$ty], state: &mut H) {
let newlen = mem::size_of_val(data);
let ptr = data.as_ptr() as *const u8;
// SAFETY: `ptr` is valid and aligned, as this macro is only used
// for numeric primitives which have no padding. The new slice only
// spans across `data` and is never mutated, and its total size is the
// same as the original `data` so it can't be over `isize::MAX`.
state.write(unsafe { slice::from_raw_parts(ptr, newlen) })
}
So that’s good.
Meanwhile, newtypes are crying in the corner: #[derive(Hash)] understandably does not apply this optimization to them (nor to structs with multiple fields, nor to tuples), so a slice of newtypes is hashed one element at a time, and the built-in hasher handles each of those tiny writes through its generic byte-oriented path.
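Here's roughly what the derive expands to for a newtype (a sketch of the generated code, not its literal output):

use std::hash::{Hash, Hasher};

struct NewType(i32);

impl Hash for NewType {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Just forwards to the field; correct, but opaque to the slice fast path.
        self.0.hash(state);
    }
    // Crucially, there is no hash_slice override, so hashing a [NewType]
    // falls back to one `hash` call (and one small write) per element.
}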
It can

Shall we benchmark some code?
use std::any::type_name;
use std::hash::{DefaultHasher, Hash, Hasher};
use std::time::Instant;
fn time<T: Hash>(obj: T) {
let start = Instant::now();
let mut hasher = DefaultHasher::new();
obj.hash(&mut hasher);
let h = hasher.finish();
println!("{}: {:?} (-> {h:0x})", type_name::<T>(), start.elapsed());
}
#[derive(Hash)]
struct NewType(i32);
fn main() {
let n = 100000000;
time((0..n).collect::<Vec<i32>>());
time((0..n).map(NewType).collect::<Vec<NewType>>());
}
Hashing [i32] transmutes the slice into [u8] and performs a single write call, while hashing [NewType] hashes the elements one by one. This benchmark thus measures the cost of individual calls. Note also that we hash almost half a gigabyte of data per run.
alloc::vec::Vec<i32>: 117.756736ms (-> 1984796e743a33f5)
alloc::vec::Vec<ruined_portal::NewType>: 469.774204ms (-> 1984796e743a33f5)
Huh, literally 1984.
We get the following from the siphasher crate:
alloc::vec::Vec<i32>: 196.330253ms (-> 95697c476562afec)
alloc::vec::Vec<ruined_portal::NewType>: 243.031408ms (-> 95697c476562afec)
That’s better, though admittedly a
rapidhash:
alloc::vec::Vec<i32>: 54.224434ms (-> 1908e25736ad8479)
alloc::vec::Vec<ruined_portal::NewType>: 278.101368ms (-> 949efa02155c336a)
ahash:
alloc::vec::Vec<i32>: 56.262629ms (-> 217325aa736f75a8)
alloc::vec::Vec<ruined_portal::NewType>: 177.900032ms (-> 4ae6133ab0e0fe9f)
highway:
alloc::vec::Vec<i32>: 53.843217ms (-> f2e68b031ff10c02)
alloc::vec::Vec<ruined_portal::NewType>: 547.520541ms (-> f2e68b031ff10c02)
That’s not good. Note that all hashers have about the same performance on Vec<i32>
. That’s about the speed of RAM. For small arrays that fits in cache, the difference is even more prominent. (I didn’t verify this, but I am the smartest person in the room and thus am obviously right.)
(Kinda)

What I really want is a general-purpose hash that's good for most practical purposes and kinda DoS-resistant, but not necessarily cryptographic. It needs to perform fast on short inputs, so it can't be a "real" block hash, but rather something close to rapidhash.
We want the hasher to see whole fixed-size values, and ideally whole blocks, at once. Right, Rust doesn't support this. Okay, let's try another relatively well-known scheme that might be easier to implement. It's parallel, surely that'll help?
To hash a sequence of 64-bit words x1, ..., xn, compute mix(x1 ^ k1, x2 ^ k2) ^ mix(x3 ^ k3, x4 ^ k4) ^ ..., where k1, ..., kn is a key stream derived from the seed. This is a combination of certain well-known primitives. The problem here is that rustc only ever sees one write_* call at a time, so it can't lay the independent mixes out side by side.
Unfortunately, Rust forces each hasher to handle all possible inputs, including inputs of different lengths, so this scheme can't work. The hasher isn't even parametrized by the type of the hashed object. Four well-laid-out 64-bit integers that can easily be mixed together with just two full-width multiplications? Nah, write_u64 goes brrrrrrrrrrrr-
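For the sake of argument, here's the kind of code those two multiplications could boil down to if the hasher were allowed to see the whole fixed-size value at once (the seed is an arbitrary placeholder):

fn mix(a: u64, b: u64) -> u64 {
    let product = (a as u128) * (b as u128);
    (product as u64) ^ ((product >> 64) as u64)
}

// Two independent full-width multiplications, no buffering, no branches.
fn hash_four_u64(x: [u64; 4], seed: [u64; 4]) -> u64 {
    mix(x[0] ^ seed[0], x[1] ^ seed[1]) ^ mix(x[2] ^ seed[2], x[3] ^ seed[3])
}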
Stop bitchingI’ve been designing fast hash-based data structures for several months before realizing they are almost unusable because of these design decisions. Surely something that isn’t a problem in C++ and Python won’t be a problem in Rust, I thought. I deserve a little bitching, okay?
Actually how

The obvious way forward is to bring the structure of the data back into the picture. If the hasher knew it was hashing fixed-size data, it could use a kernel tailored to that size (say, the parallel scheme above) instead of pretending the input is an endless byte stream.
In my opinion, Hasher and Hash are the wrong abstraction. Instead of Hash driving the Hasher insane, it should be the other way round: Hash providing introspection facilities and Hasher navigating the hashed objects recursively. As a bonus, this could enable (opt-in) portable hashers.
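Purely as a strawman, an introspection-flavored API might look something like this (every name is invented; this is an illustration of "the hasher navigates the object", not a proposal):

// The hashed type only describes its structure...
trait Describe {
    fn describe<H: StructuredHasher>(&self, hasher: &mut H);
}

// ...and the hasher decides how to consume fixed-size fields and collections.
trait StructuredHasher: Sized {
    fn absorb_u64(&mut self, value: u64);

    fn absorb_slice<T: Describe>(&mut self, items: &[T]) {
        // Default strategy; a block hasher could override this to batch items.
        self.absorb_u64(items.len() as u64);
        for item in items {
            item.describe(self);
        }
    }
}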
What this API should look like, and whether it can be shoehorned into the existing interfaces, remains to be seen. I have not started work on the design yet, and perhaps this article is a bit premature, but I'd love to hear your thoughts on how I missed something really obvious (or, indeed, on how Rust is fast enough and no one cares).