I’m making progress on the Java decompiler I’ve mentioned in a previous post, and I want to share the next couple of tricks I’m using to speed it up.
Java bytecode is a stack-based language, and so data flow is a bit cursed, especially when the control flow is complicated. I need to analyze data flow globally for expression inlining and some other stuff. Static single assignment produces basically everything I need as a byproduct… but it’s not very fast.
For one thing, it typically mutates the IR instead of returning data separately, and the resulting IR has imperative code mixed with functional code, which is a little unpleasant to work with. SSA has multiple implementations with very different performance characteristics and conditions, and each of them forces me to make a tradeoff I’m not positive about.
SSA is not a bad choice by any means, but I was thinking that maybe I could integrate ideas from SSA implementations into algorithms my decompiler actually needs, without computing SSA itself.
Motivation

Java decompilation involves translating the JVM’s stack operations to Java’s operations on variables. This is quite simple: map the stack element at height n to the variable an:
push(1);
push(2);
add();
// translates to:
a0 = 1;
a1 = 2;
a0 = a0 + a1;
To collapse such long chains into simpler ones, like a0 = 1 + 2, I need to track, for each stack read, which write produced the corresponding value. This seems very easy until you realize that branching exists:
if (cond) {
push(1);
} else {
push(2);
}
push(3);
add();
// translates to:
if (cond) {
a0 = 1;
} else {
a0 = 2;
}
a1 = 3;
a0 = a0 + a1; // where does a0 in RHS come from here?
If you wanted to keep things simple, you’d recurse into the preceding basic blocks and merge the sources of a0 they return, but that has horrible time complexity and quickly gets out of hand.
Independence

There isn’t a way to efficiently track, for each use, precisely which definitions that use can see, without SSA. But I don’t need that kind of precision: I’m not writing an optimizing compiler that would benefit from such analysis. I can make do with two pieces of information: for each use, whether it can see exactly one definition (and if so, which one), and which mentions of a variable have to refer to the same variable in the output.

The former is necessary for inlining expressions. The latter would benefit from an example. Suppose that I have code like this:
a0 = f();
g(a0, a0);
a0 = h();
i(a0, a0);
I obviously can’t inline f() into g(...), so I have to retain a0 as a variable. But as f() and h() can have different return types, I want the first two and the last two lines to access different variables, e.g.:
a0v1 = f();
g(a0v1, a0v1);
a0v2 = h();
i(a0v2, a0v2);
The key idea is: all definitions visible to a given use need to access the same instance of the variable. A straightforward implementation would iterate over uses, query which definitions each use sees, and then merge them via union-find. The individual components would map to independent variables.
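To make the later snippets concrete, here’s a minimal union-find sketch of the kind I have in mind (path halving, no union by size, and a parent-only representation, which is also where the “one word per node” figure mentioned later comes from):

// A minimal union-find over mention IDs: `find` returns the component
// representative, and `merge` joins two components.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn add_node(&mut self) -> usize {
        self.parent.push(self.parent.len());
        self.parent.len() - 1
    }

    fn find(&mut self, mut node: usize) -> usize {
        // Path halving keeps the chains short.
        while self.parent[node] != node {
            self.parent[node] = self.parent[self.parent[node]];
            node = self.parent[node];
        }
        node
    }

    fn merge(&mut self, a: usize, b: usize) {
        let (a, b) = (self.find(a), self.find(b));
        if a != b {
            self.parent[a] = b;
        }
    }
}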
Approach

Let’s discuss how to do this enumeration and querying efficiently.
All def-use chains within a single basic block can be resolved statically, perhaps even while populating the basic block. We’re left with tracking def-use chains across basic blocks.
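Here’s a sketch of what that could look like, which also shows where the active_defs_at_end table used below comes from. The Var type, the mention IDs, and the instruction accessors are all assumptions about the IR, not its real shape:

use std::collections::HashMap;

// While scanning a basic block, remember the latest definition of each
// variable. A use either resolves to that definition immediately or is
// recorded as crossing a block boundary. Whatever is still defined at the
// end of the block becomes `active_defs_at_end[bb]`.
let mut last_def: HashMap<Var, usize> = HashMap::new();
for insn in &blocks[bb].insns {
    for use_ in insn.uses() {
        match last_def.get(&use_.var) {
            // Resolved statically, within the block.
            Some(&def) => union_find.merge(use_.node, def),
            // Deferred to the cross-block pass below.
            None => cross_block_uses.push((bb, use_.var, use_.node)),
        }
    }
    if let Some(def) = insn.definition() {
        last_def.insert(def.var, def.node);
    }
}
active_defs_at_end[bb] = last_def;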
Consider a graph whose vertices are (basic block, variable name) pairs. Entering a vertex (bb, var) indicates that we’re interested in finding the definitions of var that are visible on entry to bb. For each predecessor pred of bb, we can either find the definition of var within pred itself, or recurse to (pred, var). The algorithm starts with a DFS over such a graph.
Here’s how I accumulate the located definitions. We create a node in union-find for each mention of a variable, i.e. for both uses and definitions. For each use use_ of var that doesn’t have a matching definition in its basic block bb, we enter the vertex (bb, var) with a request for the answer to be merged into the component of use_ in union-find. This “output” argument is propagated recursively and cached as the answer for each visited vertex. Whenever any given vertex (bb, var) is entered for the second time, the current and the cached output arguments are merged, and the recursive traversal is skipped.
The implementation via iterative DFS fits on a napkin:
// We could just store `bb` on the stack, but `var` and `use_` will come in handy a bit later.
stack.push((bb, var, use_));
while let Some((bb, var, use_)) = stack.pop() {
    match cached.entry((bb, var)) {
        // Second visit: merge the two output arguments and stop recursing.
        Entry::Occupied(entry) => {
            union_find.merge(use_, *entry.get());
            continue;
        }
        // First visit: cache the output argument *before* recursing.
        Entry::Vacant(entry) => {
            entry.insert(use_);
        }
    }
    for &pred in &predecessors[bb] {
        if let Some(&def) = active_defs_at_end[pred].get(&var) {
            // `pred` defines `var`: merge that definition into the use's component.
            union_find.merge(use_, def);
        } else {
            // No definition in `pred`; keep searching through its predecessors.
            stack.push((pred, var, use_));
        }
    }
}
Note that we save use_ into the cache before recursing. This allows the algorithm to work correctly on cyclic CFGs: even though vertices of a strongly connected component may cache different node IDs, those nodes will correspond to the same component in union-find.
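For completeness, this is roughly how I picture the pass being driven, reusing the cross_block_uses list from the earlier intra-block sketch; all_mentions is another hypothetical helper:

// Seed the stack with every use that has no matching definition in its own
// basic block, run the DFS loop above once, and then read the answer off:
// mentions in the same union-find component share one output variable.
for (bb, var, use_) in cross_block_uses {
    stack.push((bb, var, use_));
}
// ...the `while let` loop from above runs here...
for mention in all_mentions() {
    let variable = union_find.find(mention);
    // Mentions with equal `variable` get the same name, e.g. `a0v1`.
}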
The worst-case time complexity is O((n_uses + n_vars · n_CFG_edges) · α), where α is the near-constant union-find factor: each vertex (bb, var) is expanded at most once, and each expansion performs one lookup or merge per predecessor of bb.
However, the important part is that the big-O constant is quite low because it’s a single DFS. Moreover, this bound can be tightened in some common cases, because only basic blocks lying along the paths between the uses and the definitions are visited. This means, for instance, that in code structured like
var0 = ...;
// <bb boundary>
f(var0);
// <bb boundary>
var1 = ...;
// <bb boundary>
f(var1);
// <bb boundary>
var2 = ...;
// <bb boundary>
f(var2);
// ...
…where the number of basic blocks between each use and its definition is bounded, the time complexity is quasi-linear.
Note that the graph is implicit and never manifests in memory. I’ve considered replacing union-find with an offline DFS scan, but I think it’d actually be slower. First, the time lost allocating data structures to store the graph would probably outweigh the speedup from not having to touch union-find. Second, union-find allocates one word per node, while building the graph dynamically requires nested vectors and stores two words per edge, so DFS would have extremely questionable memory locality.
Dead stores

You can extend this algorithm to eliminate dead definitions within the same pass. Dead definitions are definitions whose values no side effect uses, even transitively. The required modifications are:

- Only enter vertices (bb, var) if the corresponding use_ comes from a side effect, and
- whenever a definition is located through active_defs_at_end, push every use inside that definition to stack, as in the sketch below. (var and use_ will come from the uses inside the definition, not the defined variable itself, which is why stack doesn’t just contain a bb field.)

Any definition that was left untouched is a dead store.
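Concretely, here’s a sketch of those modifications grafted onto the loop from earlier. side_effect_uses, uses_inside, and live are hypothetical names, and for brevity uses_inside is assumed to yield only uses without an intra-block definition (intra-block chains would be followed directly, as before):

// Modification 1: seed the DFS only from the uses side effects perform.
for (bb, var, use_) in side_effect_uses() {
    stack.push((bb, var, use_));
}
while let Some((bb, var, use_)) = stack.pop() {
    match cached.entry((bb, var)) {
        Entry::Occupied(entry) => {
            union_find.merge(use_, *entry.get());
            continue;
        }
        Entry::Vacant(entry) => {
            entry.insert(use_);
        }
    }
    for &pred in &predecessors[bb] {
        if let Some(&def) = active_defs_at_end[pred].get(&var) {
            union_find.merge(use_, def);
            // Modification 2: the definition is transitively used by a side
            // effect, so it is live; its own operands must be traced too.
            if live.insert(def) {
                for (use_bb, use_var, inner_use) in uses_inside(def) {
                    stack.push((use_bb, use_var, inner_use));
                }
            }
        } else {
            stack.push((pred, var, use_));
        }
    }
}
// Every definition never inserted into `live` is a dead store.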
Inlining

With this approach in mind, let’s return to inlining. For each use, we want to determine whether it comes from just a single definition. But because the algorithm described above only gives us information about components, we can only learn the component of the definition, and that component can contain definitions that this particular use does not see (but some others do).
This analysis is similar to the one described above on a basic level. We can navigate the same graph recursively, but instead of accumulating definitions in union-find, DFS should return one of three answers: “undefined”, “defined at X”, or “many definitions”:
// Assumes `Definition: Copy + PartialEq`.
#[derive(Clone, Copy, PartialEq)]
enum Source {
    Undefined,
    DefinedAt(Definition),
    ManyDefinitions,
}

impl Source {
    fn merge(&mut self, other: Source) {
        // The natural lattice join: `Undefined` is the identity, and two
        // distinct definitions collapse into `ManyDefinitions`.
        *self = match (*self, other) {
            (Source::Undefined, x) | (x, Source::Undefined) => x,
            (x, y) if x == y => x,
            _ => Source::ManyDefinitions,
        };
    }
}

fn visit(bb: usize, var: usize) -> Source {
    let mut source = match cached.entry((bb, var)) {
        // Already visited: reuse the cached answer.
        Entry::Occupied(entry) => return *entry.get(),
        // Cache a preliminary answer before recursing so that cycles terminate.
        Entry::Vacant(entry) => *entry.insert(Source::Undefined),
    };
    for &pred in &predecessors[bb] {
        if let Some(&def) = active_defs_at_end[pred].get(&var) {
            source.merge(Source::DefinedAt(def));
        } else {
            source.merge(visit(pred, var));
        }
    }
    *cached.get_mut(&(bb, var)).unwrap() = source;
    source
}
There’s only one problem: if the CFG is cyclic, only the first node of a strongly connected component that the DFS enters is guaranteed to see all the sources. Whichever node we enter a cycle through finishes last; the other nodes of the cycle cache their answers before the traversal has gone all the way around, so they can miss definitions.

We could condense the graph, but that would be a separate pass. How else could we spread source among the whole strongly connected component? Tarjan’s algorithm comes to the rescue: it’s based on DFS as well, it tells you when you’re exiting a strongly connected component, and it provides you with the list of nodes in that component. The implementation grows a bit larger, but it’s still a single, very fast pass:
// The per-vertex state: Tarjan’s low link plus the accumulated answer.
#[derive(Clone, Copy)]
struct DfsNodeState {
    low_link: usize,
    source: Source,
}

fn visit(bb: usize, var: usize) -> DfsNodeState {
    let index = tarjan_stack.len();
    let mut state = match cached.entry((bb, var)) {
        // Already finished, or still on the Tarjan stack: return the cached
        // state. For an on-stack node, `low_link` is its stack index, which
        // is exactly what Tarjan’s algorithm wants back edges to report.
        Entry::Occupied(entry) => return *entry.get(),
        Entry::Vacant(entry) => *entry.insert(DfsNodeState {
            low_link: index,
            source: Source::Undefined,
        }),
    };
    tarjan_stack.push((bb, var));
    for &pred in &predecessors[bb] {
        if let Some(&def) = active_defs_at_end[pred].get(&var) {
            state.source.merge(Source::DefinedAt(def));
        } else {
            let pred_state = visit(pred, var);
            state.source.merge(pred_state.source);
            state.low_link = state.low_link.min(pred_state.low_link);
        }
    }
    // If no edge reached an earlier node that is still on the stack, this
    // vertex is the root of its SCC: pop the whole component and give every
    // node in it the complete accumulated `source`.
    let is_scc_root = state.low_link == index;
    if is_scc_root {
        for scc_node in tarjan_stack.drain(index..) {
            *cached.get_mut(&scc_node).unwrap() = DfsNodeState {
                // A finished node must never lower an ancestor’s low link.
                low_link: usize::MAX,
                source: state.source,
            };
        }
    } else {
        *cached.get_mut(&(bb, var)).unwrap() = state;
    }
    state
}
This is a solid implementation that can also be extended to track values through copies like a = b, or to verify that a use can never read undefined memory, etc.
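For instance, a copy-tracking extension might look roughly like this. rhs_of and Rhs are assumed shapes of the IR, and cyclic copy chains would need a visited set that I’m omitting here:

// Keep chasing unique definitions while they turn out to be plain copies
// like `a = b`. Assumes the copy’s operand has no intra-block definition
// (otherwise intra-block resolution would have answered directly).
fn resolve_through_copies(bb: usize, var: usize) -> Source {
    let mut source = visit(bb, var).source;
    while let Source::DefinedAt(def) = source {
        match rhs_of(def) {
            // The operand of a copy is itself a use; restart from its block.
            Rhs::Copy { src_bb, src_var } => source = visit(src_bb, src_var).source,
            _ => break,
        }
    }
    source
}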
Outro

So that’s what I’m working with at the moment. I have an optimized (but not necessarily inlined) IR, I know how to structure control flow, I know how to inline expressions and detect common control flow structures; now I just need to glue all of this together. And also figure out exceptions.
When decompiling a Minecraft server, the passes I’ve already implemented take
Hopefully I’ll get something else working by the time I publish the next post on this topic.