This post has been sitting in my drafts for a year because I never finished benchmarking things to a standard I'd consider thorough enough for release. I'm trying to deal with my backlog, and I think this post has tricks that some of you might enjoy anyway. Treat it as a director's cut sort of thing.
Minecraft generates a bedrock floor at the bottom of the world from random noise. Since it's random, it can contain naturally generated inescapable regions – prisons. While small prisons are common, larger ones are hard to find – a Minecraft world is about 60 million by 60 million blocks, so there's a lot of ground to cover.
After keeping my PC running for a couple days, here’s the largest prison I found:


I find this problem a good exercise for performance optimization because it has a limited scope, but also covers lots of topics. And it’s quite a head-scratcher due to its large scale! In this post, I’ll cover various surprising approaches and tricks I used, and hopefully you’ll be able to apply them to your projects.
Setting

The bedrock noise is generated by a boolean function is_bedrock(x, y, z), which we'll dissect later.

The simplest bedrock prison we're searching for looks like this. The player cannot perform a two-block-high jump under normal conditions, so once they get into this box, they'll never be able to escape it (without damage ticks, anyway).

Large boxes of this exact kind are incredibly rare, so even with optimizations we’ll have to look for more complex structures.
Reducing complexity

While we could perform an exact analysis in the 3D space, that would be incredibly difficult to compute performantly. Instead, we can reduce this to a 2D problem that can be handled much more efficiently.
Let's separate the world's columns into three types: interiors, walls, and hazards (columns that might let the player escape).

Any connected component of interior blocks that doesn't border hazards is a prison. This doesn't cover some trickier valid prisons, but here's my thought process.
Since worlds are gigantic and there are many seeds, we'll never run out of search space. So it doesn't matter if we skip difficult candidates – there will always be simpler ones elsewhere. The only thing that matters is how many prisons we can locate in a second, and solving the problem exactly wouldn't compensate for the decreased performance nearly enough.
That is, compared to an exact solution, “incomplete” and “incorrect” checks can produce better results for the same amount of resources spent.
This tradeoff also applies to many problems outside brute-force search.
Basic algorithm

Let's start writing some code. We have a 2D grid and want to find a maximum region of “interior” cells surrounded by “wall” cells, bordering no “hazard” cells. This is typically implemented by invoking DFS from each cell:
// Pseudocode, has borrowck errors -- don't think too much about it.
let mut visited = HashSet::new();
let mut region_size = 0;
let mut dfs = |node: (i32, i32)| -> Result<(), ()> {
    visited.insert(node);
    match get_column_type(node) {
        ColumnType::Interior => {
            // Keep iterating to neighbors.
        }
        ColumnType::Wall => {
            // Can't navigate through walls.
            return Ok(());
        }
        ColumnType::Hazard => {
            // Encountered a hazard -- abort immediately.
            return Err(());
        }
    }
    region_size += 1;
    for (dx, dz) in [(-1, 0), (1, 0), (0, -1), (0, 1)] {
        let node1 = (node.0 + dx, node.1 + dz);
        if !visited.contains(&node1) {
            dfs(node1)?;
        }
    }
    Ok(())
};

for z in -WORLD_BORDER..=WORLD_BORDER {
    for x in -WORLD_BORDER..=WORLD_BORDER {
        if !visited.contains(&(x, z)) {
            region_size = 0;
            if dfs((x, z)).is_ok() {
                // Yay, a new valid region of size `region_size`!
            }
        }
    }
}
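This pseudocode won't compile as written – a Rust closure can't call itself – but the same structure works as a free function with explicit &mut parameters. Here's a runnable toy version on a hardcoded 5×5 grid; the Cell enum, the grid, and this get_column_type are stand-ins for the real noise-based classification:

```rust
use std::collections::HashSet;

#[derive(Clone, Copy)]
enum Cell { Interior, Wall, Hazard }
use Cell::{Hazard as H, Interior as I, Wall as W};

// Toy 5x5 map standing in for the real noise-derived column types.
const GRID: [[Cell; 5]; 5] = [
    [W, W, W, W, H],
    [W, I, I, W, H],
    [W, I, I, W, W],
    [W, W, W, I, W],
    [H, W, W, W, W],
];

fn get_column_type(x: i32, z: i32) -> Cell {
    // Everything outside the toy map counts as a wall.
    if !(0..5).contains(&x) || !(0..5).contains(&z) {
        Cell::Wall
    } else {
        GRID[z as usize][x as usize]
    }
}

// Returns Err(()) if the region borders a hazard; otherwise counts its size.
fn dfs(node: (i32, i32), visited: &mut HashSet<(i32, i32)>, size: &mut u32) -> Result<(), ()> {
    visited.insert(node);
    match get_column_type(node.0, node.1) {
        Cell::Interior => {}
        Cell::Wall => return Ok(()),    // can't navigate through walls
        Cell::Hazard => return Err(()), // abort immediately
    }
    *size += 1;
    for (dx, dz) in [(-1, 0), (1, 0), (0, -1), (0, 1)] {
        let next = (node.0 + dx, node.1 + dz);
        if !visited.contains(&next) {
            dfs(next, visited, size)?;
        }
    }
    Ok(())
}

fn largest_region() -> u32 {
    let mut visited = HashSet::new();
    let mut best = 0;
    for z in 0..5 {
        for x in 0..5 {
            if !visited.contains(&(x, z)) {
                let mut size = 0;
                if dfs((x, z), &mut visited, &mut size).is_ok() {
                    best = best.max(size);
                }
            }
        }
    }
    best
}

fn main() {
    // The 2x2 interior block at (1..=2, 1..=2) is the largest valid prison here.
    assert_eq!(largest_region(), 4);
    println!("largest valid region: {}", largest_region());
}
```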
As the graph is enormous, visited will run out of space really quickly, so this doesn't quite work. But we can fix this – algorithms are not set in stone. By persisting only the last row worth of data, we retain enough context for DFS (the retained cells act as a boundary between dropped cells and remaining cells, so dropped cells are never accessed) while reducing the maximum visited size to a single row of the world:
for z in -WORLD_BORDER..=WORLD_BORDER {
    for x in -WORLD_BORDER..=WORLD_BORDER {
        if !visited.contains(&(x, z)) {
            region_size = 0;
            if dfs((x, z)).is_ok() {
                // Yay, a new valid region of size `region_size`!
            }
        }
        visited.remove(&(x, z - 1)); // <- this is new
    }
}
Note that I don't populate an entire row and then drop the previous row once it's fully complete. Instead, I remove the cell above as soon as I process the current cell. So at any point in time, the present cells are actually split between two neighboring rows. This is known as a broken profile, as opposed to just a normal straight profile, and it's occasionally used to optimize dynamic programming.
Now we can run the code and get the first estimates for how much time this will take:
...
[8.294270481s / 0.0000011695244307304885%] found 11 at (12102852, -29999984)
...
So it takes about 22 years to scan a whole world at this rate (extrapolating from the progress counter above). We clearly need to do better.
High-level opts

There's a lot of low-hanging fruit here. For example, we're using the default hashset, even though a bitset indexed by the coordinates would be a better fit.
A quick check shows that there’s a lot of prisons, most of them small, so there’s no point in saving or even finding all of them. Let’s reframe the problem as only finding “large enough” components.
A simple way to benefit from this is by invoking DFS in a checkerboard pattern: by starting search only from every other square, we can find all prisons of size 2 and larger while halving the number of DFS invocations.

The radius can be chosen arbitrarily. This combination can only miss components small enough to fit in the gaps between the DFS starting points.
for coords in diagonals() {
    if !visited.contains(&coords) {
        region_size = 0;
        if dfs(coords).is_ok() {
            // Yay, a new valid region of size `region_size`!
        }
    }
}
A computer cannot perform an unbounded amount of work per second, so at some point the only way to get faster is to do less work, not just faster work.
No code is better than fast code. This is incredibly broad advice, so I'll give some examples you might take inspiration from:

- Instead of copying data around with memcpy, you use pointers (although there are obviously exceptions, e.g. small objects and cache locality). Optimizing primitives is incredibly useful, but only after you reduce the amount of times they're used.
- You iterate over the set bits of a word directly (with x &= x - 1;) instead of checking if each individual bit is set.

Another way to skip small prisons is to probe evenly distributed coordinates:

But this can also miss narrow large components. It will find all prisons containing a large enough square, but a long and thin one can slip between the probes.
I like separating pure optimizations from the ones that change the problem. This saves you the headache of worrying about what-ifs: what if the PRNG seed was unlucky? What if we haven't found an optimal solution just because of arbitrary choices in the implementation? If you avoid heuristics, you avoid all this trouble.
visited

Unfortunately, we're once again in trouble due to memory usage: the visited.remove(&(x, z - 1)); trick doesn't work with diagonals.
Another way to fix this issue is to clear visited before every DFS invocation. Typically, this would increase the total time complexity, since the same component can now be re-explored from many different starting cells – but almost all components here are tiny, so the extra work is negligible:
for coords in diagonals() {
    visited.clear();
    region_size = 0;
    if dfs(coords).is_ok() {
        // Yay, a new valid region of size `region_size`!
    }
}
Surprisingly, this improves performance: clearing keeps the HashSet small – and now suddenly a bitset doesn't seem like a good solution, as it won't fit in cache as well as the tiny visited does!
[445.198978ms / 0.000001923848745618791%] found 11 at (-29990069, -29998131)
[2.17199706s / 0.000009889538106497513%] found 12 at (-29991087, -29982201)
[2.798458445s / 0.000012758756604408164%] found 11 at (-29997794, 29971862)
[2.798479765s / 0.000012758756731392425%] found 11 at (-29997793, 29971863)
[2.79848252s / 0.000012758756858376685%] found 11 at (-29997792, 29971864)
...
It’s weird, right? Some optimizations work well individually and lead to pessimizations when combined. Even if you’ve found all the tricks, you still need to figure out which ones to combine. I’ve written about this in more detail earlier.
Vectorization

Recursive calls are slow, and we currently invoke dfs for all cells of the rhombi, even though only a small percentage will be interior cells. By lifting such a check out of dfs, we can avoid costly function invocations:
for coords in diagonals() {
    visited.clear();
    region_size = 0;
    if is_interior(coords) && dfs(coords).is_ok() {
        // Yay, a new valid region of size `region_size`!
    }
}
Recall that get_column_type might need to check three blocks to see if the column is a hazard or a wall, whereas checking whether a column is interior only needs to compare two blocks. Checking for bedrock is slow, so this is a great optimization. But now that is_interior is invoked for consecutive cells, we can also easily vectorize the check:
// coords is now [(i32, i32); 8], not (i32, i32)
for coords in diagonals() {
    // is_interior now returns a bitmask
    for coords_index in BitIter::from(is_interior(coords)) {
        let coords = coords[coords_index];
        visited.clear();
        region_size = 0;
        if dfs(coords).is_ok() {
            // Yay, a new valid region of size `region_size`!
        }
    }
}
BitIter is used to speed up iteration over set bits, as we only expect a small portion of bits to be 1s. It uses the classical x &= x - 1 trick under the hood.
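For reference, here's a minimal sketch of such an iterator. The name and From-based construction mirror the snippets above; the real implementation may differ:

```rust
// Iterates over the indices of set bits, lowest first.
struct BitIter(u64);

impl From<u64> for BitIter {
    fn from(bits: u64) -> Self {
        BitIter(bits)
    }
}

impl Iterator for BitIter {
    type Item = u32;
    fn next(&mut self) -> Option<u32> {
        if self.0 == 0 {
            return None; // no set bits left
        }
        let index = self.0.trailing_zeros();
        self.0 &= self.0 - 1; // clear the lowest set bit
        Some(index)
    }
}

fn main() {
    // Bits 0, 3 and 5 are set in 0b101001.
    let indices: Vec<u32> = BitIter::from(0b101001u64).collect();
    assert_eq!(indices, vec![0, 3, 5]);
    println!("{indices:?}");
}
```

The loop runs once per set bit rather than once per bit, which is exactly what we want when only a few lanes pass the is_interior check.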
A sequential is_interior function could benefit from short circuiting: if one of the blocks is bedrock, we don’t need to check the other one. Vectorization makes us abandon short circuiting, but increased throughput counteracts the performance loss.
[14.82663952s / 0.00019748246707681585%] found 15 at (-29900419, -29980309)
[14.826951506s / 0.0001974862524776069%] found 15 at (-29900418, -29980310)
[29.012100521s / 0.0003862456819715604%] found 16 at (-29997419, -29835789)
[38.530737716s / 0.000512744426557512%] found 17 at (-29899478, -29908354)
[38.53121425s / 0.0005127505261194575%] found 17 at (-29899477, -29908355)
...
This is great news: the vectorized implementation now takes HashSet!
Floats

Let's discuss the vectorized implementation in more detail. To do that, we need to take a look at the formulas used for bedrock generation first.
Video games are notorious for using floats all over the place. Minecraft's bedrock noise is no exception. Minecraft generates bedrock between levels y = -64 and y = -59, with the probability decreasing linearly from 1 at the bottom to 0 at the top:
fn is_bedrock(x: i32, y: i32, z: i32) -> bool {
    let mut prng = Prng::new(hash_coordinates(x, y, z));
    let probability = lerp(lerp_inv(y as f32, -64.0, -59.0), 1.0, 0.0);
    f64::from(prng.next_f32()) < probability
}
While floats are often good enough, they are a nuisance here due to imprecision. Luckily, when checking the type of a column (x, z), y is a constant, so probability can be precomputed, and we only need to deal with prng.next_f32() and the cast to f64.
next_f32 is implemented by generating 24 random bits and dividing them by 2^24. This computation is exact, and the conversion from f32 to f64 is always exact as well, so the last line of is_bedrock is equivalent to:
prng.next_bits(24) as f64 < probability * 2f64.powi(24)
…which can in turn be rewritten as:
prng.next_bits(24) < (probability * 2f64.powi(24)).ceil() as u64
…where the right-hand side is a constant, so we can avoid all runtime float operations.
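The equivalence can be sanity-checked by brute force over all 2^24 possible outputs of next_bits(24). The probability below is a made-up value, not Minecraft's actual per-level constant:

```rust
fn main() {
    // A made-up per-level probability; the real one comes from the lerp in is_bedrock.
    let probability = f64::from(0.4_f32);
    // The float-free threshold, computed once.
    let threshold = (probability * 2f64.powi(24)).ceil() as u64;

    for bits in 0u64..1 << 24 {
        // Original check: widen the 24 random bits to a float in [0, 1).
        let float_check = f64::from(bits as f32 / 2f32.powi(24)) < probability;
        // Rewritten check: a pure integer comparison.
        let int_check = bits < threshold;
        assert_eq!(float_check, int_check, "mismatch at bits = {bits}");
    }
    println!("all 2^24 values agree; threshold = {threshold}");
}
```

Every step on the float side is exact (integers below 2^24 fit in an f32, and dividing by a power of two doesn't round), which is why the integer comparison can match it bit-for-bit.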
Note the use of .ceil() and the excessive focus on which floating-point operations are exact. This ensures that the simulation doesn't use slightly incorrect data, which is certain to cause problems somewhere over the span of an entire world.
A shameless plug: while we could optimize out the as f64 cast here, that's not always possible, so you might be interested in my snippets for faster int to float conversion.
AVX-512

hash_coordinates and prng.next_bits are not very interesting – it's just an ad-hoc hash and Xoroshiro:
fn hash_coordinates(x: i32, y: i32, z: i32) -> u64 {
    let mut l = (x.wrapping_mul(3129871) as i64) ^ (y as i64) ^ ((z as i64) * 116129781);
    l = l.wrapping_mul(l.wrapping_mul(42317861).wrapping_add(11));
    (l >> 16) as u64
}

impl Prng {
    fn new(seed: u64) -> Self {
        Self {
            low: seed ^ FLOOR_SEED_LOW,
            high: FLOOR_SEED_HIGH,
        }
    }

    fn next(&mut self) -> u64 {
        let Self { low, mut high } = *self;
        let mid = low.wrapping_add(high).rotate_left(17).wrapping_add(low);
        high ^= low;
        self.low = low.rotate_left(49) ^ high ^ (high << 21);
        self.high = high.rotate_left(28);
        mid
    }

    fn next_bits(&mut self, bits: usize) -> u64 {
        self.next() >> (64 - bits)
    }
}
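To see the pieces fit together, here's a self-contained version of the two functions. FLOOR_SEED_LOW and FLOOR_SEED_HIGH below are placeholder constants, not the real world-seed-derived values:

```rust
// Stand-ins for the real seed constants derived from the world seed.
const FLOOR_SEED_LOW: u64 = 0x9E3779B97F4A7C15;
const FLOOR_SEED_HIGH: u64 = 0x6A09E667F3BCC909;

fn hash_coordinates(x: i32, y: i32, z: i32) -> u64 {
    let mut l = (x.wrapping_mul(3129871) as i64) ^ (y as i64) ^ ((z as i64) * 116129781);
    l = l.wrapping_mul(l.wrapping_mul(42317861).wrapping_add(11));
    (l >> 16) as u64
}

#[derive(Clone, Copy)]
struct Prng {
    low: u64,
    high: u64,
}

impl Prng {
    fn new(seed: u64) -> Self {
        Self { low: seed ^ FLOOR_SEED_LOW, high: FLOOR_SEED_HIGH }
    }

    fn next(&mut self) -> u64 {
        let Self { low, mut high } = *self;
        let mid = low.wrapping_add(high).rotate_left(17).wrapping_add(low);
        high ^= low;
        self.low = low.rotate_left(49) ^ high ^ (high << 21);
        self.high = high.rotate_left(28);
        mid
    }

    fn next_bits(&mut self, bits: usize) -> u64 {
        self.next() >> (64 - bits)
    }
}

fn main() {
    // hash_coordinates(0, 0, 0) is 0: every term vanishes.
    assert_eq!(hash_coordinates(0, 0, 0), 0);
    // The stream is deterministic per column: same input, same bits.
    let a = Prng::new(hash_coordinates(12, -60, 34)).next_bits(24);
    let b = Prng::new(hash_coordinates(12, -60, 34)).next_bits(24);
    assert_eq!(a, b);
    // next_bits(24) keeps only the top 24 bits of the 64-bit output.
    assert!(a < 1 << 24);
    println!("sample 24-bit output: {a}");
}
```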
They're both nasty because of the 64-bit multiplications and rotations, which are awkward to vectorize on most SIMD instruction sets.
AVX-512’s orthogonality significantly simplifies both algorithm design and code: this whole project has no unsafe and relies exclusively on portable SIMD, and the codegen is nearly optimal, unlike in many other situations when I’ve tried this.
Obviously, AVX-512 is highly non-portable, so that’s still not an option in many cases, but it’s more than fine for this project: it’s not like such brute-force even has a chance to work well on old hardware.
That’s vectorization done. What else comes to your mind?
Coarse filtering

Many algorithms include fast paths for special cases – like memcpy for small n. This is typically done if you can quickly check whether the fast path is applicable before applying it. But if the performance benefit is high enough, you can actually use this approach even if the correctness of the fast path result is only known post factum.
Here's what I mean. We currently spend time checking columns for hazards, even if the component turns out to be too small. Let's split the DFS run into two: first we compute the size of the interior, treating hazards as walls, and then, if it's large enough, we check for hazards. This removes a short-circuit check, but since most components are eliminated faster, it boosts performance in the end.
// Pseudocode again: the closure can't actually call itself.
let mut fast_path_dfs = |node: (i32, i32)| {
    region_size += 1;
    for (dx, dz) in [(-1, 0), (1, 0), (0, -1), (0, 1)] {
        let node1 = (node.0 + dx, node.1 + dz);
        if visited.insert(node1) && is_interior(node1) {
            fast_path_dfs(node1);
        }
    }
};
let mut slow_path_dfs = /* old dfs implementation */;
// ...
fast_path_dfs((x, z));
if region_size >= CUTOFF && slow_path_dfs((x, z)).is_ok() {
    // Yay, a new valid region of size `region_size`!
}
A lower-level example of this optimization is DFS itself. Tracking visited nodes in a HashSet is slow even with a custom hash, but as most components are small, we can use a bitset with just enough bits for cells close to the starting point. If we stay in bounds, we get good cache locality. If we occasionally get out of bounds, we restart DFS with a HashSet.
fn insert(set: &mut [u32; 32], (x, z): (i32, i32)) -> bool {
    let row = &mut set[z as usize & 31];
    let bit = x as usize & 31;
    let is_new = (*row >> bit) & 1 == 0;
    *row |= 1 << bit;
    is_new
}
The row can be computed as set + (z & 31) * 4, which is an addressing mode x86 supports directly.
In fact, we can avoid the per-cell overflow check entirely. Due to masking, memory accesses never cause UB, so the worst thing that can happen is skipping some cells on wrap-around. If DFS computes a component size large enough that the component might not have fit in the window, we restart it with a HashSet. Otherwise, we can trust the count.
Alternatively, we can guarantee no wrap-around if there’s an empty row and a column. This check fails more rarely, but is more expensive, so it’s not the best choice.
Such a bitset with occasional false positives is tangentially related to Bloom filters. Just putting this here for completeness – look it up if you haven’t heard of it before.
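The wrap-around aliasing is easy to demonstrate. This standalone sketch reuses the insert function from above:

```rust
// The masked 32x32 bitset from the post: coordinates wrap modulo 32.
fn insert(set: &mut [u32; 32], (x, z): (i32, i32)) -> bool {
    let row = &mut set[z as usize & 31];
    let bit = x as usize & 31;
    let is_new = (*row >> bit) & 1 == 0;
    *row |= 1 << bit;
    is_new
}

fn main() {
    let mut set = [0u32; 32];
    assert!(insert(&mut set, (3, 5)));  // fresh cell
    assert!(!insert(&mut set, (3, 5))); // already visited
    // (35, 37) aliases (3, 5): both coordinates are masked with & 31, so the
    // cell is falsely reported as visited -- the "skipped cells" from the text.
    assert!(!insert(&mut set, (35, 37)));
    // Negative coordinates also wrap without UB.
    assert!(insert(&mut set, (-1, -1)));
    println!("aliasing works as described");
}
```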
DFS

The only thing left to optimize is the DFS itself, and I left it for last because it's a broad topic. The first step is to rewrite DFS to be iterative (structured like a BFS, but with a stack) rather than recursive:
// Initialization:
let mut stack = Vec::new();

// Main loop:
visited.clear();
visited.insert(coords);
region_size = 0;
stack.clear();
stack.push(coords);
while let Some(coords) = stack.pop() {
    if !is_interior(coords) {
        // Can't navigate through walls.
        continue;
    }
    region_size += 1;
    for (dx, dz) in [(-1, 0), (1, 0), (0, -1), (0, 1)] {
        let coords1 = (coords.0 + dx, coords.1 + dz);
        if visited.insert(coords1) {
            stack.push(coords1);
        }
    }
}
But be careful: in my experience, iterative DFS is often slower than recursive DFS. This seems to be a common experience in the competitive programming community, for reasons I don't really understand – I have a few hypotheses, but nothing conclusive.
In this case, though, iterative DFS enables another optimization. We don't necessarily need depth-first order, and, surprisingly, this enables vectorization! Instead of popping one cell on each iteration, we can pop up to eight, invoke the vectorized is_interior just once, and then handle only the rare interior cells sequentially:
// Pseudocode, obviously
while !stack.is_empty() {
    let coords = stack.pop_at_most_eight();
    for coords_index in BitIter::from(is_interior(coords)) {
        let coords = coords[coords_index];
        region_size += 1;
        for (dx, dz) in [(-1, 0), (1, 0), (0, -1), (0, 1)] {
            let coords1 = (coords.0 + dx, coords.1 + dz);
            if visited.insert(coords1) {
                stack.push(coords1);
            }
        }
    }
}
For narrow graphs, we'd often have fewer than eight cells on the stack, so the full 8x speedup isn't guaranteed – but on average the batching still wins.
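To illustrate the batched structure without nightly portable SIMD, here's a scalar sketch where is_interior_batch plays the role of the vectorized check and the grid is a toy plus-shaped region (all names here are illustrative, not from the real project):

```rust
use std::collections::HashSet;

// Toy interior predicate: a plus-shaped region around the origin.
fn is_interior_one(x: i32, z: i32) -> bool {
    (x.abs() <= 2 && z == 0) || (x == 0 && z.abs() <= 2)
}

// Scalar stand-in for the SIMD check: one mask bit per popped cell.
fn is_interior_batch(coords: &[(i32, i32)]) -> u32 {
    let mut mask = 0;
    for (i, &(x, z)) in coords.iter().enumerate() {
        if is_interior_one(x, z) {
            mask |= 1 << i;
        }
    }
    mask
}

fn main() {
    let start = (0, 0);
    let mut visited = HashSet::from([start]);
    let mut stack = vec![start];
    let mut region_size = 0;
    while !stack.is_empty() {
        // Pop up to eight cells and classify them in one batch.
        let n = stack.len().min(8);
        let batch: Vec<(i32, i32)> = stack.split_off(stack.len() - n);
        let mask = is_interior_batch(&batch);
        for i in 0..n {
            if mask & (1 << i) == 0 {
                continue; // wall: don't expand
            }
            let (x, z) = batch[i];
            region_size += 1;
            for (dx, dz) in [(-1, 0), (1, 0), (0, -1), (0, 1)] {
                let next = (x + dx, z + dz);
                if visited.insert(next) {
                    stack.push(next);
                }
            }
        }
    }
    // The plus shape contains 9 interior cells.
    assert_eq!(region_size, 9);
    println!("region size: {region_size}");
}
```

split_off plays the role of pop_at_most_eight here; in the real code the popped coordinates would go straight into SIMD registers.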
Finally, we can apply unrolling. The first iteration always pops the interior cell on the diagonal and pushes its four neighbors, and the second iteration immediately pops them, so let’s just start DFS from that point:
// Pretend this array concatenation compiles.
let mut coords = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    .map(|(dx, dz)| (coords.0 + dx, coords.1 + dz))
    + [(i32::MAX, i32::MAX); 4];
region_size = 1;
stack.clear();
loop {
    for coords_index in BitIter::from(is_interior(coords)) {
        let coords = coords[coords_index];
        region_size += 1;
        for (dx, dz) in [(-1, 0), (1, 0), (0, -1), (0, 1)] {
            let coords1 = (coords.0 + dx, coords.1 + dz);
            if visited.insert(coords1) {
                stack.push(coords1);
            }
        }
    }
    if stack.is_empty() {
        break;
    }
    coords = stack.pop_at_most_eight();
}
In the common case, when the interior cell is surrounded on all sides, this lets us avoid touching stack, which removes the delay from failed store-to-load forwarding.
Stack layout

Now that high-level algorithms are fixed, we can optimize minor details, such as the stack data structure. Since is_interior wants X and Z coordinates in two separate vectors, we'll actually need two stacks, one for X and one for Z.
Since vectors have a fixed size (i32x8, in this case), pop_at_most_eight needs to somehow handle the case when there are fewer than eight elements on the stack. Instead of branching, we can pad the bottom of the stack with sentinels, so that loading a full i32x8 always works correctly:
// Storage (reused between iterations):
let mut x_stack = [SENTINEL; 7 + 32 * 32];
let mut z_stack = [SENTINEL; 7 + 32 * 32];

// To pop 8 elements:
let x = i32x8::from_slice(&x_stack[7 + stack_size - 8..7 + stack_size]);
let z = i32x8::from_slice(&z_stack[7 + stack_size - 8..7 + stack_size]);
stack_size = stack_size.saturating_sub(8);
Once initialized, sentinels are never overwritten, as the stack size saturates at zero and pushes only ever write to index 7 or above.
(i32::MIN, i32::MIN), which is outside the world border and thus non-interior, works as a sentinel that we don’t need to special-case. If there wasn’t such a natural sentinel, we could still search for some non-interior cell greedily and then use its coordinates.
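Here's the sentinel scheme in isolation, with a scalar array load standing in for the i32x8 read (the stack contents are made up for the demo):

```rust
const SENTINEL: i32 = i32::MIN;

fn main() {
    // 7 sentinel slots in front guarantee a full 8-wide window is always readable.
    let mut x_stack = [SENTINEL; 7 + 32 * 32];
    let mut stack_size = 0usize;

    // Push three elements; live data occupies indices 7..7 + stack_size.
    for x in [10, 20, 30] {
        x_stack[7 + stack_size] = x;
        stack_size += 1;
    }

    // Pop "eight": the window covers 5 sentinels plus the 3 real elements.
    let window: [i32; 8] = x_stack[7 + stack_size - 8..7 + stack_size].try_into().unwrap();
    stack_size = stack_size.saturating_sub(8);

    assert_eq!(&window[..5], &[SENTINEL; 5][..]);
    assert_eq!(&window[5..], &[10, 20, 30][..]);
    assert_eq!(stack_size, 0);
    println!("{window:?}");
}
```

Since the sentinel coordinate is never interior, the extra lanes are filtered out by the same is_interior mask as real walls, with no special-casing.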
Conclusion

And that about sums it up. I could add parallelization, but it's much easier to just run the single-threaded program on different seeds in parallel.
The code incorporating most of these optimizations can be found on my GitHub. The final project searches through the whole world in about a month in single-threaded mode. That's about 1.4 billion columns per second.
To put this into perspective, that’s roughly the speed of summing up integers with scalar code. Here’s the list of tricks we’ve used to optimize a non-trivial analysis of a procedurally generated map this much:
visited as [u32; 32]).

This bullet list can act as a checklist for your projects: see if you've utilized all of these approaches, and if you can't instantly name examples from your own experience where you've applied a given trick, that's a sign to focus on improving your skills in that area.