Suppose that you’re writing a static analyzer and you want to write a diagnostic for match arms with equal bodies:
```rust
match number {
    1 => { // <--
        let x = 1;
        f(x)
    }
    2 => f(g(h())),
    3 => "",
    4 => { // <--
        let x = 1;
        f(x)
    }
    _ => unreachable!(),
}
```
Well, that looks simple enough: serialize each arm into a string and throw the strings into a hash map. Then someone renames a variable:
```rust
match number {
    1 => { // <--
        let x = 1;
        f(x)
    }
    2 => f(g(h())),
    3 => "",
    4 => { // <--
        let y = 1;
        f(y)
    }
    _ => unreachable!(),
}
```
Now the strings no longer match, but the arms are still clearly equivalent. Scary! It’s not immediately obvious how to handle this correctly, let alone efficiently.
It turns out that this problem has interesting connections to the theory of computation and algebra. This is what this post is about: how odd tricks come together to tackle this challenge.
## Problem

We’ll limit ourselves to syntactic analysis: snippets are considered equivalent if they match exactly, up to variable names. A smarter algorithm would realize that `a + b` and `b + a` match too, but that’s a rabbit hole (are they even always equivalent? what if it’s an overloaded operator?), so we’ll focus on something more rigid. There’s still plenty to talk about.
We can’t just ignore all variable names, since that’s going to trigger false positives. But as we’ve just seen, we can’t use them directly either.
One possibility is to abolish hash tables. Looking at two arms individually, we can record the correspondence between variables declared in one arm and the other, and then validate that all variable accesses match that correspondence:
| Snippet A | Snippet B | Correspondence |
|---|---|---|
| `let x = 1;` (defines `x`) | `let y = 1;` (defines `y`) | `x` maps to `y` |
| `f(x)` (uses `x`) | `f(y)` (uses `y`) | consistent |
But that means that in a situation with, say, 20 arms, which is exactly the kind of situation where this diagnostic would be useful to catch copy-paste errors, we’re going to need 200-ish comparisons to verify that all arms are unique. That’s quite slow, even for a static analyzer. So how can we integrate this idea into the original hash table-based approach?
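To make this concrete, here’s a sketch of that pairwise check in Python, assuming each arm has been flattened into a list of tokens that distinguish definitions from uses (this token format is made up for the example, not a real analyzer API):

```python
Token = tuple[str, str]  # ("def", name) | ("use", name) | ("lit", text)

# Compare two arms token by token, building the variable correspondence on the
# fly. (A production version would also check that the mapping is one-to-one.)
def arms_equivalent(a: list[Token], b: list[Token]) -> bool:
    if len(a) != len(b):
        return False
    mapping: dict[str, str] = {}  # local variable in A -> its counterpart in B
    for (kind_a, text_a), (kind_b, text_b) in zip(a, b):
        if kind_a != kind_b:
            return False
        if kind_a == "lit" and text_a != text_b:
            return False
        if kind_a == "def":
            mapping[text_a] = text_b  # record the correspondence at the definition
        if kind_a == "use" and mapping.get(text_a, text_a) != text_b:
            return False  # uses must respect the correspondence; free variables must match by name
    return True

# The renamed arms from the example above compare as equivalent:
arm1 = [("def", "x"), ("lit", "= 1;"), ("lit", "f("), ("use", "x"), ("lit", ")")]
arm2 = [("def", "y"), ("lit", "= 1;"), ("lit", "f("), ("use", "y"), ("lit", ")")]
assert arms_equivalent(arm1, arm2)
```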
Luckily, we’re not alone in our struggles. We don’t have to solve this problem ourselves – or, at least, this specific variation – because that’s already been done ages ago. The notion we’ve been struggling to formulate is called *alpha-equivalence*.
Let’s take a detour, shall we?
## Side quest

Back at the dawn of computation, when people like Alan Turing were only starting their research, it was not yet clear what the word “computation” was even supposed to mean.
Almost every system to date, including set theory and natural numbers, was based on techniques that were unusable for formalizing computation. Mathematicians often use first-order logic quantifiers like $\forall$ and $\exists$, which can assert that an object exists without giving any way to actually construct it.
So someone had to introduce a new framework: a model of a real-world computer that could be studied in isolation. One such model was the Turing machine, which you’ve likely already heard about. But it wasn’t the only one. Another candidate, distinct but equal in computational power, was lambda calculus.
If the Turing machine is an esoteric imperative language, lambda calculus is an esoteric functional one: its programs are built from nothing but anonymous single-argument functions and their applications.

So why do we care about lambda calculus here? Because anonymous functions are the one construct that made it into virtually every modern language: Python says `lambda x: ...`, C++ says `[](auto x) { ... }`, and Rust says `|x| ...`.
It may seem surprising that this mechanism is powerful enough to implement anything, given that the only type we have is “function”, and the only way to create named objects is to capture them as function arguments. That’s right – you can’t even define globals here!
I’m not going to teach you lambda calculus in depth, but here’s a taste of how such a tiny language manages to express familiar concepts.

Note that a function can stand in for data. Take `true` and `false`: define `true` as `λx. λy. x` and `false` as `λx. λy. y`, i.e. functions that take two arguments (one at a time) and return the first or the second one, respectively. Then, instead of the ternary `cond ? a : b`, you can write `cond a b`: if `cond` is `true`, this will evaluate to `a`, and if it’s `false`, to `b`.

In effect, programs in lambda calculus represent data by its behavior: a boolean *is* the two-way choice it performs.
With makeshift ternaries, you can implement boolean logic (e.g. `a && b` is `a ? b : false`), and with pairs (encoded as functions in the same spirit), you can implement linked lists. That gives you numbers, arrays, and all finite computation. Loops and recursion are also possible to implement, but are ultimately off-topic for this article, so I’ll wrap this up for now.
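If you want to play with these encodings, they transcribe directly into Python’s lambdas (a toy transliteration of the lambda-calculus terms, nothing more):

```python
# Church encodings transcribed into Python: a boolean picks one of its two
# arguments; a pair feeds its two components to a consumer function.
true  = lambda x: lambda y: x
false = lambda x: lambda y: y

# `cond a b` acts as the ternary `cond ? a : b`, so `a && b` is `a b false`:
and_ = lambda a: lambda b: a(b)(false)

pair   = lambda x: lambda y: lambda f: f(x)(y)
first  = lambda p: p(lambda x: lambda y: x)
second = lambda p: p(lambda x: lambda y: y)

assert and_(true)(false) is false
assert first(pair(true)(false)) is true
assert second(pair(true)(false)) is false
```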
## Representations

Only being able to “declare” a variable by defining a function is restrictive, but on the flip side, it simplifies code analysis.
Since a variable can only be accessed within the function that binds it, every access can be matched to its binder unambiguously: just walk outwards from the access until you hit the lambda that introduced the name.

Any access can therefore be described by a number instead of a name: which lambda on that walk is the binder – 1 for the innermost enclosing lambda, 2 for the one above it, and so on. This number is called the *de Bruijn index* of the access.

The variable names can be removed from the binders as well, since the indices now carry all the information: `λx. λy. x y` becomes `λ. λ. 2 1`.

Note that, with this scheme, the same variable can be seen with different indices from different scopes: in `λx. x (λy. x y)`, the outer access to `x` has index 1, while the inner one, sitting under one more lambda, has index 2.

Of course, not all variables are local, i.e. defined within the snippet. For example, we can only rewrite an access whose binder is itself part of the snippet; free variables, defined outside, keep their names.
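Here’s a minimal sketch of the conversion, with nested tuples standing in for a real AST (this term encoding is invented for the example):

```python
# Terms are nested tuples: ("var", name), ("lam", name, body), ("app", f, a).
# Bound accesses become indices (1 = innermost enclosing lambda); free
# variables keep their names.
def to_de_bruijn(term, binders=()):
    match term:
        case ("var", name):
            return ("var", binders.index(name) + 1) if name in binders else term
        case ("lam", name, body):
            return ("lam", to_de_bruijn(body, (name, *binders)))
        case ("app", f, a):
            return ("app", to_de_bruijn(f, binders), to_de_bruijn(a, binders))

# λx. x (λy. x y): the two accesses to x get different indices (1 and 2).
term = ("lam", "x", ("app", ("var", "x"),
                     ("lam", "y", ("app", ("var", "x"), ("var", "y")))))
print(to_de_bruijn(term))
# ("lam", ("app", ("var", 1), ("lam", ("app", ("var", 2), ("var", 1)))))
```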
## Lint

Back to the original problem: writing a lint for repeating match arms. We don’t actually need to convert the imperative code to lambda calculus – it’s enough to borrow the trick and treat each variable declaration as if it opened a nested lambda-like scope. For example, this snippet:
```rust
// `a` is defined outside
let x = 2 + 2;
let y = "hello";
if x == a {
    let z = "world";
    println!("{y} {z}");
}
```
…could be interpreted as:
```
let x {
    let y {
        x = 2 + 2;
        y = "hello";
        if x == a {
            let z {
                z = "world";
                println!("{y} {z}");
            }
        }
    }
}
```
…and then converted to:
```
let {
    let {
        _2 = 2 + 2;
        _1 = "hello";
        if _2 == a {
            let {
                _1 = "world";
                println!("{_2} {_1}");
            }
        }
    }
}
```
…which can be directly inserted into a hash map.
## Simpler path?

Oh, but couldn’t we do something simpler earlier? Couldn’t we scan each match arm for a list of variable definitions and rewrite local variable accesses into indices from that list, resulting in something like:
```rust
let _1 = 2 + 2;
let _2 = "hello";
if _1 == a {
    let _3 = "world";
    println!("{_2} {_3}");
}
```
Yes, yes we could. But you see, I too have a superpower, called “lying”.
We aren’t just trying to find repetitions among the arms of a single match. We’re doing that over the whole program. Maybe we’re writing more lints. Maybe we’re trying to compress code. Maybe we’re a theorem prover trying to reduce the size of the graph. Maybe we’re writing a Java decompiler (ahem).
Either way, we’re doing this at scale. And that means that we can’t just insert the full representations of all subtrees into a single hash map: we’re going to run out of memory really fast. Barring that, this process is going to take an embarrassing amount of time, since we’ll be constantly iterating over deeply nested nodes.
And that’s where the gamble pays out. What we want is some clever way to compute the nameless encoding of a large block from the encodings of its constituents, so that we don’t have to scan subtrees repeatedly. Preferably, we want simple concatenation to work:

`repr(f a) = "(" + repr(f) + repr(a) + ")"`
With de Bruijn indices, that’s exactly the case. For example, the encoding of `λy. x y` on its own is `\ ( x 1 )`, and the encoding of `x (λy. x y)` is `( x \ ( x 1 ) )` – the subterm’s encoding embedded verbatim. A de Bruijn index only counts the lambdas between an access and its binder, and wrapping the subterm in a larger expression changes nothing between the two, so concatenation just works.
Meanwhile, had we tried to use indices into a list of local variables, that wouldn’t work nearly as well. We’d get the same representations for accesses to completely different variables whenever those variables happen to occupy the same position in their respective lists, and merging two snippets would force us to rescan and renumber one of them.
1-0 for learning math.
The only trouble is that function calls are only one part of the story – there’s also function declarations. It sure seems like

`repr(λx. body) = "\" + repr(body)`

…should work too, but that’s not quite true. Consider `λx. x`: the representation of the body is the bare name `x` (on its own, the body binds nothing, so `x` is free there), while the representation of the whole term is `\ 1`, where the access is replaced by its de Bruijn index. So besides concatenating, we have to patch the body’s encoding, overwriting each access to the newly bound variable with its index.
Note that you don’t have to iterate over the subexpression to find the locations to change. Before analyzing the program, you can accumulate the positions at which each variable is mentioned, and then only mutate those locations with random access. So it’s going to take linear time in total.
This is a lot to take in, so let me put it in code so that we’re on the same page:
```python
from dataclasses import dataclass

# AST node classes (a minimal stand-in for whatever your analyzer uses).
# eq=False keeps identity semantics, so two structurally equal subtrees
# are still distinct keys in `range_of_expr`.
@dataclass(frozen=True, eq=False)
class Variable:
    name: str

@dataclass(frozen=True, eq=False)
class Function:
    param: str
    body: "Expr"

@dataclass(frozen=True, eq=False)
class Call:
    f: "Expr"
    a: "Expr"

Expr = Variable | Function | Call
VariableName = str

range_of_expr: dict[Expr, tuple[int, int]] = {}
variable_nesting: dict[VariableName, int] = {}
variable_accesses: dict[VariableName, list[tuple[int, int]]] = {}  # name -> [(location, de Bruijn index)]
current_location: int = 0

def collect_locations(expr: Expr, nesting: int):
    global current_location
    start = current_location
    match expr:
        case Variable(x):
            # x
            current_location += 1
            if x in variable_nesting:  # bound accesses are recorded; free variables keep their names
                variable_accesses[x].append((start, nesting - variable_nesting[x]))
        case Function(x, body):
            # \, body
            # (assumes no shadowing: each name is bound at most once; a real
            # implementation would key the accesses by the binder node instead)
            current_location += 1
            variable_nesting[x] = nesting
            variable_accesses[x] = []
            collect_locations(body, nesting + 1)
        case Call(f, a):
            # (, f, a, )
            current_location += 1
            collect_locations(f, nesting)
            collect_locations(a, nesting)
            current_location += 1
    end = current_location
    range_of_expr[expr] = (start, end)

# Example term to run the pipeline on: λx. x (λy. x y).
root = Function("x", Call(Variable("x"),
                          Function("y", Call(Variable("x"), Variable("y")))))

collect_locations(root, 0)
output = [None] * current_location

def calculate_representations(expr: Expr):
    start, end = range_of_expr[expr]
    match expr:
        case Variable(x):
            output[start] = x
        case Function(x, body):
            output[start] = "\\"
            calculate_representations(body)
            for location, de_bruijn_index in variable_accesses[x]:
                output[location] = de_bruijn_index
        case Call(f, a):
            output[start] = "("
            calculate_representations(f)
            calculate_representations(a)
            output[end - 1] = ")"
    print("The representation of", expr, "is", output[start:end])

calculate_representations(root)
```
## Hashing

Ironically, the only non-linear part of the algorithm is the `output[start:end]` slice in the print statement, since the total length of all the representations can be quadratic in the size of the expression (every node reappears in the representation of each of its ancestors).
However, we can still hash the representations improperly. Here’s what I mean by that. Suppose that we used the following – very stupid, I know – hash function:

$$h(s) = s_0 \oplus s_1 \oplus \dots \oplus s_{n-1}$$
This formula makes it easy to compute the hash of the concatenation of two hashed strings, without knowing their exact contents:

$$h(s + t) = h(s) \oplus h(t)$$
But more importantly, it can handle replacements: if the character $c$ at any position is replaced with $c'$, the new hash is

$$h' = h \oplus c \oplus c'$$
Together, this allows us to efficiently compute hashes of all representations, without even storing those representations in memory. That’s great, except for the fact that XOR makes for a terrible hash.
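Both properties – and the fatal flaw – are easy to check (toy code):

```python
from functools import reduce

# The toy hash: XOR of all character codes.
def xor_hash(s: str) -> int:
    return reduce(lambda h, c: h ^ ord(c), s, 0)

# Concatenation: h(s + t) == h(s) ^ h(t).
assert xor_hash("hello" + "world") == xor_hash("hello") ^ xor_hash("world")
# Replacement: swapping "h" for "j" updates the hash without rescanning.
assert xor_hash("jello") == xor_hash("hello") ^ ord("h") ^ ord("j")
# ...and the flaw: any permutation of the string collides.
assert xor_hash("hello") == xor_hash("olleh")
```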
But there are better hashes that still support these operations. The most common one is the polynomial hash, defined as follows:

$$h(s) = (s_0 + s_1 b + s_2 b^2 + \dots + s_{n-1} b^{n-1}) \bmod p$$

Here, $b$ is a base, ideally chosen at random, and $p$ is a large prime modulus.
To concatenate strings, we only need to know their hashes and lengths:

$$h(s + t) = \left(h(s) + h(t) \cdot b^{|s|}\right) \bmod p$$
To replace a character, we need to know the original hash, the position $i$ of the affected character, and what is replaced with what ($c$ with $c'$):

$$h' = \left(h + (c' - c) \cdot b^i\right) \bmod p$$
All in all, this allows the hashes of all subexpressions to be computed in linear time:
```python
import random

# Parameters of the polynomial hash: a large prime modulus and a random base.
p = (1 << 61) - 1
b = random.randrange(2, p)

powers_of_b: list[int] = [1]

# Computes `h * b ** count % p` in amortized constant time.
def shift(h: int, count: int) -> int:
    while len(powers_of_b) <= count:
        powers_of_b.append(powers_of_b[-1] * b % p)
    return h * powers_of_b[count] % p

# Not shown: a function capable of hashing variable names, de Bruijn indices, and characters
# \, (, ) without collisions.
def hash_small(x) -> int: ...

def calculate_hashes(expr: Expr) -> int:
    start, end = range_of_expr[expr]
    match expr:
        case Variable(x):
            h = hash_small(x)
        case Function(x, body):
            h = hash_small("\\") + shift(calculate_hashes(body), 1)
            for location, de_bruijn_index in variable_accesses[x]:
                h += shift(hash_small(de_bruijn_index) - hash_small(x), location - start)
            h %= p
        case Call(f, a):
            h = (
                hash_small("(")
                + shift(calculate_hashes(f), 1)
                + shift(calculate_hashes(a), range_of_expr[a][0] - start)
                + shift(hash_small(")"), end - 1 - start)
            )
            h %= p
    print("The hash of", expr, "is", h)
    return h

calculate_hashes(root)
```
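To see this in action, we can plug in a stand-in for the elided `hash_small` – Python’s built-in `hash`, which is decidedly not the collision-free function the algorithm assumes, but good enough for a smoke test – and confirm that alpha-equivalent terms hash identically:

```python
# Stand-in for the elided `hash_small`, based on Python's built-in `hash`.
# Define it before the `calculate_hashes(root)` call above runs.
def hash_small(x) -> int:
    return hash(("small", x)) % p

def hash_of(term: Expr) -> int:
    collect_locations(term, 0)
    return calculate_hashes(term)

# λx. x y  and  λz. z y  differ only in the bound name: the hashes must match.
t1 = Function("x", Call(Variable("x"), Variable("y")))
t2 = Function("z", Call(Variable("z"), Variable("y")))
assert hash_of(t1) == hash_of(t2)

# λx. x y  and  λx. x z  differ in a free variable: the hashes should differ
# (barring an unlucky collision).
t3 = Function("x", Call(Variable("x"), Variable("z")))
assert hash_of(t1) != hash_of(t3)
```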
## Summing up

We’ve figured out how to compare any two subtrees for equality in constant time, at the cost of some linear-time precalculation. So that’s good.
Unfortunately, we had to use hashes, which means that there’s a small chance of a false positive: we might consider two distinct trees identical if their hashes match by accident. What can we do about it?
1. Sometimes, such a false positive is perfectly acceptable. If the chosen prime $p$ is large enough (say, on the order of $2^{64}$), two distinct representations of length $n$ collide with probability around $n/p$, and the worst consequence for a lint is a single spurious warning.
2. We can avoid false positives by comparing the subexpressions for equality if the hashes match (sketched below, after this list). This gives us the ability to discard checks that we know would fail, and only focus on successes. Depending on your goal, this may either be an asymptotic slow-down back to quadratic time, or a negligible cost, if genuine matches are rare.
3. If you’re drawn to constant-time comparison, there is a way to validate the obtained hashes for lack of collisions without giving up the near-linear total running time.
4. If you don’t like randomness, there’s a deterministic algorithm producing collision-free hashes in guaranteed near-linear time.
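Here’s roughly what option 2 looks like as a generic helper, where `exact_equal` is some structural alpha-equivalence check – for example, the correspondence test from the beginning of this post:

```python
from collections import defaultdict
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

# Bucket items by hash and run the exact (slow) equality check only within
# buckets, so most non-matches are discarded in constant time.
def find_duplicates(items: Iterable[T], hash_of: Callable[[T], int],
                    exact_equal: Callable[[T, T], bool]) -> list[tuple[T, T]]:
    buckets: defaultdict[int, list[T]] = defaultdict(list)
    dups: list[tuple[T, T]] = []
    for item in items:
        bucket = buckets[hash_of(item)]
        dups += [(old, item) for old in bucket if exact_equal(old, item)]
        bucket.append(item)
    return dups
```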
The algorithms in 3. and 4. are tricky, and this post is already quite long, so I’ll cover them in a later, more technical post. Subscribe to my RSS if you don’t want to miss it.
## References

This post is partially based on the paper *Hashing Modulo Alpha-Equivalence*, though the algorithm described in the paper is slightly different.