I’m working on a Java decompiler because I’m not satisfied with the performance of other solutions. I’ve always heard that decompiling JVM bytecode is a solved problem, but I’ve concluded that the decompilation methods used by CFR and Vineflower are hacky, inefficient, and sometimes don’t even work. The existing solutions are haphazard and inadequate compared to alternative approaches.
Specifically, I have beef with the control flow extraction strategies employed by most decompilers. I haven’t tackled decompilation as a whole yet, but I’ve found an approach to control flow recovery that works in isolation, is quite modular, and addresses common frustrations. I don’t claim to be the first person to think of this method, but I haven’t seen it mentioned anywhere, so this post describes it hoping that it’s useful to someone else.
Disclaimer: The approach I describe is tuned for bytecode-based languages, such as Java and Python. There’s a chance that it might apply to native code with modifications, but I have not researched that.
## Setting the scene

Most decompilers seem to follow one of these two control flow extraction strategies:

- Match bytecode patterns, translating them to `if` or `while` directly, or
- Build a control flow graph and recover constructs from its shape.

Here are the problems with these approaches.
## Just match bytecode

Despite what people might tell you, javac is an optimizing compiler – kind of, anyway. While it doesn’t perform abstract execution, it can still thread jumps together and track reachability. Java bytecode rarely contains instruction sequences like `goto a; ... a: goto b`, by the magic of javac rewriting `goto a` to `goto b`. This means that
```java
if (cond1) {
    if (cond2) {
        f();
    } else {
        g();
    }
} else {
    h();
}
```
is compiled to
```
if (!cond1) goto outer_else;
if (!cond2) goto inner_else;
f();
goto outside;
inner_else: g();
goto outside;
outer_else: h();
outside:
```
…and not the composition of two `if`s, which would look like
```
if (!cond1) goto outer_else;
if (!cond2) goto inner_else;
f();
goto inner_outside;
inner_else: g();
inner_outside: goto outer_outside;
outer_else: h();
outer_outside:
```
This gets even more complicated when you recall that constructs like `do { .. } while (false);` exist, which don’t produce any bytecode, but enable unlimited forward `goto`s with multi-level `break`s.
Trust me, there are a ton of special cases in real-world Java bytecode. I’ve tried to write a decompiler based on this approach, and it simply falls apart on complicated code.
## Just build a CFG

The CFG is effectively a high-level overview of the control flow situation, and the general approach here is to find simple patterns inside the graph. For example, an `if` conditional amounts to a node splitting into two nodes, which are then reunited. No more javac-specific hard-coding! …or that’s how the story was supposed to go.
The CFG is messy. You can’t recover constructs from the inside out, because an `if` might contain a `continue` and you’d have no idea how to handle such a split, as you don’t even know if it’s a `continue` or a different control flow construct yet. But if you attempt to parse the CFG in the opposite order, you’ll quickly realize that you need to track connectivity and similar properties. This means that you have to re-run the analysis each time you find a new construct, and suddenly your decompiler is frustratingly slow.
## My idea

My approach is as follows. Using the CFG directly is a) slow, because it’s a graph without a simple structure, and b) bad, because it’s unrelated to the original program order. Let’s bring the order back and hope that linearizing the graph allows us to use faster algorithms.
Suppose that we have something that looks like an `if`. We can only translate it to a true `if` if there are no cross-jumps between the `then` and `else` branches. In a CFG, this is a connectivity query; in a linear order, this is an “is there an arrow starting in one range and ending in another” query.
Of course, reading sequential bytecode, building a control flow graph, and then linearizing it back would be bollocks. If we can trust control flow not to be obfuscated, javac output is already a good instruction order, and our attempts to find a better one will likely produce worse code in the end. Not only does trusting javac provide a better result, but it’s also closer to the source.
This might sound like the “just match bytecode” approach, which I’ve just criticized, so let me explain the difference. The “just match bytecode” gang attempts to immediately match `if`, `while`, and other control flow constructs. My goal is to build a very abstract AST, where nodes aren’t labeled with any specific high-level constructs, and then refine this abstract AST to a concrete one. This closely mirrors the CFG approach but uses a tree rather than a graph.
## Abstract AST

We’re now deep into unexplored territory. A few papers come close, but there’s nothing specifically like this, so let me introduce some terminology. You might be familiar with some of it from this work, which we’ll build upon in a moment:
Lyle Ramshaw. 1988. Eliminating go to’s while preserving program structure. J. ACM 35, 4 (Oct. 1988), 893–920. https://doi.org/10.1145/48014.48021
An input program is a list of statements, where each statement is either an atom (i.e. an unsplittable instruction without any control flow) or a (possibly conditional) `goto`:
```rust
enum InputStatement {
    Atom(Atom),
    Goto(Condition, StatementId),
}
```
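For concreteness, here is the flattened nested-`if` example from earlier encoded in this representation. This is only a sketch: the real `Atom` and `Condition` types are opaque, so I stand them in with plain strings, and `program` is a hypothetical helper.

```rust
// Sketch: the `outer_else` bytecode example from above, as an input program.
// Atom payloads and conditions are plain strings purely for illustration.
type StatementId = usize;

enum InputStatement {
    Atom(&'static str),
    // Jump to `StatementId` if the condition holds ("true" = unconditional).
    Goto(&'static str, StatementId),
}

fn program() -> Vec<InputStatement> {
    use InputStatement::*;
    vec![
        Goto("!cond1", 6), // 0: if (!cond1) goto outer_else
        Goto("!cond2", 4), // 1: if (!cond2) goto inner_else
        Atom("f()"),       // 2
        Goto("true", 7),   // 3: goto outside
        Atom("g()"),       // 4: inner_else
        Goto("true", 7),   // 5: goto outside
        Atom("h()"),       // 6: outer_else
    ]                      // 7: outside (one past the end)
}
```

Note how jump targets are statement indices, not labels: a gap-based arrow set falls out of this directly.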
An output program is a list of statements: either an atom, or a labeled block containing other statements, or a (possibly conditional, possibly multi-level) `continue` or `break` from a block:
```rust
enum OutputStatement {
    Atom(Atom),
    Block(BlockId, Vec<OutputStatement>),
    Continue(Condition, BlockId),
    Break(Condition, BlockId),
}
```
When executed, a block executes all its children and exits. It does not form a loop, and it’s not a functional control flow construct by itself, but it forms the basis of other constructs. Specifically, `break`ing a block jumps to the statement after the block, and `continue`ing a block jumps to the beginning of the block. For example,
```java
while (x != 10) {
    x++;
}
```
can be implemented as
```
block #1 {
    if (x == 10) break #1;
    x++;
    continue #1;
}
```
Blocks kind of mirror CFG nodes, in that they contain statements and can be branched to, while `break`s and `continue`s are ways to simulate structured `goto`.
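These semantics are small enough to pin down with a toy interpreter (my own sketch, not part of any decompiler): `Condition`s and `Atom`s are folded into plain functions over a single integer of state, and `Stmt`/`Exit` are names I made up.

```rust
// Toy interpreter for the output language sketched above.
type BlockId = usize;

enum Stmt {
    Atom(fn(&mut i64)),                  // unsplittable instruction
    Break(fn(&i64) -> bool, BlockId),    // conditional multi-level break
    Continue(fn(&i64) -> bool, BlockId), // conditional multi-level continue
    Block(BlockId, Vec<Stmt>),
}

enum Exit {
    Fallthrough,
    Break(BlockId),
    Continue(BlockId),
}

fn run(stmts: &[Stmt], state: &mut i64) -> Exit {
    for stmt in stmts {
        match stmt {
            Stmt::Atom(f) => f(state),
            Stmt::Break(cond, id) if cond(state) => return Exit::Break(*id),
            Stmt::Continue(cond, id) if cond(state) => return Exit::Continue(*id),
            Stmt::Break(..) | Stmt::Continue(..) => {} // condition was false
            Stmt::Block(id, body) => loop {
                match run(body, state) {
                    // `continue #id` restarts the block from the top...
                    Exit::Continue(target) if target == *id => continue,
                    // ...while `break #id` jumps just past it.
                    Exit::Break(target) if target == *id => break,
                    Exit::Fallthrough => break, // implicit exit at the end
                    // Multi-level exits propagate to an enclosing block.
                    other => return other,
                }
            },
        }
    }
    Exit::Fallthrough
}
```

Running the `block #1` lowering of the `while (x != 10)` loop from above under this interpreter leaves the state at 10 and falls through, matching the loop.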
A gap is the space between consecutive statements in the input program. An arrow is a directed link between two gaps.
## Blocks

I’m using blocks instead of any other construct because they do not introduce any control flow that hasn’t been present in the bytecode. I’m also not distinguishing between blocks that only support `break` and blocks that only support `continue`, as the paper does: I acknowledge that real-world programs often have what amounts to both `continue` and `break` in the same loop, and that making this unnatural split is counterproductive.
Let’s talk about the end goal. If we manage to produce a minimal (in some sense) abstract AST, we can then pattern-match it to well-known constructs. For example,
```
block #n {
    if (cond) break #n;
    <something not containing any mention of #n>
}
```
…is certainly just an `if`, and
```
block #n {
    if (cond) break #n;
    <statements...>
    <a divergent statement>
}
```
…is equivalent to a `while`. Note how tricky this second example is: we can parse a `while` without hard-coding `continue #n` at the end. `continue #n` is necessarily divergent, but so is an `if` statement with `continue #n` in one branch and `return` in the other one.
We’ll get into the specifics more soon, but you can see that although it seems like we’re just hard-coding control flow structures, the patterns we match are much more general than bytecode matching could ever achieve. By designing later passes carefully, you can gracefully recover all control flow structures from the original code, regardless of their complexity or code optimization.
Atypical blocks can be lowered to `while (true) { ... break; }`. In reality, this is a special case of the `while` lowering (or should I say upping?), but it’s useful to understand that there’s always a simple fallback if push comes to shove.
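Those patterns can be sketched as code. Everything below is my simplification for illustration: conditions are elided, and “divergent” is approximated by “ends in a jump”, whereas a real implementation must also recognize, e.g., an `if` whose branches both diverge.

```rust
// Sketch of recovering `if`/`while` from blocks (simplified, not the real
// pattern set).
type BlockId = usize;

#[derive(Debug, PartialEq)]
enum Recovered {
    If,    // block { if (cond) break #n; <no mention of #n> }
    While, // block { if (cond) break #n; ...; <divergent> }
    Loop,  // fallback: while (true) { ...; break; }
}

enum Stmt {
    Atom,
    Break(BlockId),    // stands for `if (...) break #id;`
    Continue(BlockId), // stands for `if (...) continue #id;`
    Block(BlockId, Vec<Stmt>),
}

// Does any statement in `stmts` (recursively) mention block `id`?
fn mentions(stmts: &[Stmt], id: BlockId) -> bool {
    stmts.iter().any(|s| match s {
        Stmt::Break(t) | Stmt::Continue(t) => *t == id,
        Stmt::Block(_, body) => mentions(body, id),
        Stmt::Atom => false,
    })
}

// Simplification: treat any trailing jump as divergent.
fn divergent(s: &Stmt) -> bool {
    matches!(s, Stmt::Break(_) | Stmt::Continue(_))
}

fn recover(id: BlockId, body: &[Stmt]) -> Recovered {
    if let Some((Stmt::Break(t), rest)) = body.split_first() {
        if *t == id {
            if !mentions(rest, id) {
                return Recovered::If; // nothing else targets this block
            }
            if rest.last().is_some_and(divergent) {
                return Recovered::While;
            }
        }
    }
    Recovered::Loop
}
```

Note that the `while` case never looks for a literal `continue #n` at the end, only for divergence, mirroring the discussion above.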
## Ramshaw’s approach

To construct blocks, we’ll first need to build an arrow set. An arrow set succinctly represents the structure of all jumps a program can take. The presence of an arrow indicates a jump from the gap at its tail to the gap at its head: we use the targets of the `goto`s as the heads and insert the shortest arrows containing the `goto` statements themselves.
Ramshaw’s paper makes use of these two core facts: nested arrows can be implemented directly (if an arrow points forward, `break` does the trick, and if it points backward, `continue` does), and arrows can be extended until they nest.

The paper explains precisely how to expand the arrows’ tails to make them form a tree, but this is where we need to depart from the classics. This approach ignores that not all arrows represent logical blocks. `continue` and `break` statements will be parsed as arrows, and if those are treated equivalently to natural control flow construct arrows, the generated code will get chaotic even if you try to optimize them out after arrow extension. Specifically, consider the following two cases:
There’s no local way to decide which arrow represents `continue`/`break` and which one represents a control flow construct, so at least in one of those cases, the arrow corresponding to the CF statement will have its tail extended, and thus the block won’t be parsed as a high-level construct. This might make that arrow intersect another arrow not represented here, triggering more and more extensions like ripples in the water, which will be a nightmare to roll back if and when we decide the `break` and `continue` arrows can be optimized out.
This is where the Relooper and Stackifier approaches fundamentally get stuck. You can see from the linked article that Stackifier cannot restore good-quality high-level control flow because they attempt to create blocks that do not and cannot exist in the source.
## Alternative

Here’s an alternative approach.

We construct the blocks from the outside. This allows us to ignore arrows that can be implemented with `continue` or `break`, as all the blocks that could have satisfied them have already been built. Here’s specifically how we do that.
We define a recursive procedure
In the latter case, we have to create a block covering the entire range, satisfy the backward and forward arrows with `continue` and `break`, respectively, and remove all the corresponding arrows from the arrow set.
In most cases, new split gaps will appear at this point, allowing us to recurse, and that’s the whole algorithm.
## Irreducibility

But sometimes, no new split gaps appear. To be specific, as we haven’t removed forward arrows from
The core of the problem is the head-to-head collision. It closely resembles irreducible control flow from the CFG world. Ramshaw’s paper proves that such conflicts cannot be resolved with `break`, `continue`, and blocks alone: no tail extension can make such arrows nest, but head extensions are impossible because they change the target of the jump… or are they?
There’s a non-zero-cost approach to this: we can introduce a dispatcher at the start of the block, extend the backward arrow’s head to point to the dispatcher, and then add an arrow from the dispatcher to the original head:
The “jump to dispatcher” arrow sets a synthetic variable and emits `continue`. The dispatcher, a synthetic statement inserted at the start of the block, then uses `break` to jump to the target location. For example,
```
if (cond) goto b;
a: f();
b: g();
goto a;
```
…can be decompiled as:
```java
Target target = Target.Fallthrough;
while (true) {
    switch (target) {
        case Fallthrough:
            if (cond) {
                break; // `goto b`: skip f()
            }
            // fall through to label `a`
        case A:
            target = Target.Fallthrough;
            f();
    }
    g();
    target = Target.A;
}
```
…and while it’s somewhat ugly, a trivial dispatcher is at least a viable starting point. When you think about it, it’s similar to how people implement finite automata in high-level languages: a variable for the current state, a `switch` over the state inside `while (true)`, and state changes implemented by modifying the variable and restarting the `switch`. Certain decompilers handle methods with irreducible control flow like this globally, but this approach allows us to localize the impact to the exact block that causes trouble.
Anyway, the main point here is that there’s a simple way to fix this in terms of arrows. Find the first split gap inside the block, taking only forward arrows into account; satisfy the backward arrows crossing it with a jump to the dispatcher via `continue`; and create new forward dispatch arrows in their place. This makes the gap valid and recursion possible:
To reiterate, dispatchers are only necessary as a fallback for complicated control flow. javac does not (typically) generate such code, and handling this case is mostly only required for correctness and completeness, as JVM bytecode doesn’t forbid it.
## In a nutshell

Phew. Here’s a bird’s eye view of what we’ve done.
Given a range
We then created a block for
We then found the leftmost split gap taking only forward arrows into account, found all backward arrows crossing this gap (again, such arrows are necessarily within the block; there can also be no such arrows, and that’s fine too), marked them as satisfied with a jump to dispatch, and removed them from the arrow set. We added forward dispatch arrows to the arrow set. After that, we recursed, as we know there’s now a valid split gap.
A key property of the resulting AST is that each block has at least one arrow that couldn’t be satisfied by any other block, making it minimal in a certain sense. Each jump is satisfied by the outermost valid block, not the innermost, as your passes might expect, but this is a good thing. As javac threads `goto`s, it can easily implement a `break` from an internal loop as a `continue` of an outer loop, and the nested `if` example from the beginning of the post still applies, so forcing the outermost block as the target reduces the number of cases that must be handled.
The blocks are also minimal in the sense that either the first statement of the block uses `break` or someone inside the block uses `continue`, and that either the last statement uses `continue` or someone inside the loop uses `break`. This means that, among other things, a block without `continue` will necessarily start with an `if (...) break;`, indicating that this block can be implemented with an `if`; and vice versa, an `if` can only be parsed as a block starting with `if (...) break;`, not merely containing such a statement somewhere in the middle. This makes patterns easy to enumerate.
## Implementation

At the beginning of the post, I mentioned that CFG-style approaches are not only sometimes bad at code generation but also slow, so it’s time to explain how to improve the time complexity of this approach.
First, we introduce a data structure over arrows that supports the following queries in logarithmic time:
It’s quite simple – a segment tree (the one supporting range queries, not from computational geometry) over gaps suffices:
This can be used to split the
The search for backward arrows crossing a given gap can be straightforwardly implemented with an interval tree.
Finally, simple arrays can be used to find forward and backward arrows ending at a certain point.
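To give a flavor of this machinery, here is one of the simpler queries – “how many arrows cross gap `g`” – sketched with a Fenwick tree over a difference array rather than the interval tree mentioned above; the names are mine, and this version only counts crossing arrows, it cannot report which ones they are.

```rust
// Sketch: count arrows crossing a given gap in O(log n) per update/query.
// An arrow with endpoint gaps `from < to` crosses every gap g with
// from <= g < to, so inserting an arrow is a range +1, and a query is a
// point read – the classic Fenwick-tree difference-array trick.
struct ArrowIndex {
    tree: Vec<i64>, // 1-indexed Fenwick tree over gaps
}

impl ArrowIndex {
    fn new(gaps: usize) -> Self {
        ArrowIndex { tree: vec![0; gaps + 2] }
    }

    fn add(&mut self, mut i: usize, delta: i64) {
        i += 1; // shift to 1-indexed
        while i < self.tree.len() {
            self.tree[i] += delta;
            i += i & i.wrapping_neg();
        }
    }

    fn prefix_sum(&self, i: usize) -> i64 {
        let mut i = (i + 1).min(self.tree.len() - 1);
        let mut sum = 0;
        while i > 0 {
            sum += self.tree[i];
            i -= i & i.wrapping_neg();
        }
        sum
    }

    fn insert(&mut self, from: usize, to: usize) {
        self.add(from, 1);
        self.add(to, -1);
    }

    fn remove(&mut self, from: usize, to: usize) {
        self.add(from, -1);
        self.add(to, 1);
    }

    /// Number of arrows currently crossing gap `g`.
    fn crossing(&self, g: usize) -> i64 {
        self.prefix_sum(g)
    }
}
```

Removal is just the inverse update, which matters here because the algorithm repeatedly deletes satisfied arrows from the arrow set.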
## Conclusion

So. That’s how you implement a high-quality control flow restoration algorithm with quasilinear performance. Q&A:
Why does decompilation performance matter this much to you?
Because I have a use case that requires a fast and slim Java decompiler, and so I started doing some research, and suddenly three weeks have already passed, and this is where I found myself.
Why no realistic samples?
Because you need a ton of passes after control flow structuring to make the code somewhat readable. This is just one part of the process, and while its implementation is interesting from a technical point of view, the results themselves are boring. I’ll draw the rest of the fucking owl someday, but it’s harder than it seems.
Why not wait until you get further results?
Because I might stumble upon some other problem I don’t know how to resolve and not finish the project. This is my way of sharing knowledge while I still can.
If I understand the above, can I make a Java decompiler?
Maybe, but don’t hold your breath. This is just the beginning – you still need to write passes to convert blocks to `if`s and `while`s, you need to handle complicated control flow with `&&` and `||` in `if`/`while` conditions, you need to handle exception handlers and `synchronized` monitors, and so on. Decompiling Java is not simple at all unless you don’t care about correctness, efficiency, and code quality, which… many decompilers don’t, and that’s valid, but it’s not something I’d be proud of.
Can this be applied to other languages?
If we’re talking about bytecode-based languages, then yes, the core of the work should be applicable as is. For native code, it’s probably not that easy, as the compiler often reorders code for efficiency. This is somewhat a problem for obfuscated JVM bytecode as well – it’s just out of scope for me at the moment – but you’ll likely get acceptable results if you reorder statements according to DFS order and a few heuristics. No guarantees, though.