Last year, Lemire wrote about an optimized variation of the Euclidean algorithm for computing the greatest common divisor of two numbers, called the binary Euclidean algorithm, or Stein's algorithm. It's a best-in-class implementation, though it's currently only used by libc++.
The post also briefly mentions the extended Euclidean algorithm, a related algorithm most often used to compute the modular multiplicative inverse (given a remainder a and a modulus m, find x such that a * x ≡ 1 (mod m)):
> There is also a binary version of the extended Euclidean algorithm[,] although it is quite a bit more involved and it is not clear that it […] can be implemented at high speed, leveraging fast instructions, when working on integers that fit in general-purpose registers. […]
>
> My implementation of the binary extended Euclidean algorithm is quite a bit slower and not recommended. I expect that it should be possible to optimize it further.
>
> – Lemire
That's a big shame, because the extended Euclidean algorithm can be optimized in a very similar manner, and the underlying ideas were described in a 2020 paper by Thomas Pornin. They are probably not well-known because the paper focuses on constant-time evaluation and long arithmetic, so people might have assumed they're irrelevant to general-purpose code.
I'm hoping to do justice to the extended Stein's algorithm with this post. I'll cover how the algorithm works, its limitations, some optimizations on top of Pornin's paper, and potential further improvements.
My implementation is available on GitHub as part of a Rust modular arithmetic library.
**Disclaimer:** The textbook algorithm can be used not only to compute inverses, but also to solve linear Diophantine equations. I will focus on the former in this post, since that's where the optimizations shine. I'll briefly cover the general case at the end of the post.
I won't make claims about exact performance, because something strange is going on with Lemire's benchmarking results and I don't want to add to the mess. I've measured that my implementation of the binary algorithm is faster than the extended Euclidean one on almost every machine I've tried, but the margins vary too much to quote a single number.
Lemire's benchmark seems to be skewed by the choice of the compiler (GCC vs Clang), its version (Clang 18 vs Clang 21), optimization flags (`-O2` vs `-O3`), the microarchitecture (Haswell vs Ice Lake vs Zen 2), and minutiae of the benchmarking code. The results don't make much sense mathematically and look disproportionately affected by microarchitectural conditions. If you want to get the fastest implementation, I suggest you inspect the assembly more closely than I did, because I have no idea what's going on.
Nevertheless, here is some raw data for transparency. The benchmark measures the time per inversion in nanoseconds; the cell format is "Stein's algorithm / Euclidean algorithm (relative difference)".
| CPU | 8 bits | 16 bits | 32 bits | 64 bits |
|---|---|---|---|---|
| Haswell | 11.38 / 19.21 (-41%) | 17.48 / 33.96 (-49%) | 29.76 / 61.69 (-52%) | 67.18 / 152.19 (-56%) |
| Alder Lake | 8.20 / 10.19 (-20%) | 13.77 / 16.87 (-18%) | 21.47 / 31.00 (-31%) | 50.38 / 69.57 (-28%) |
| Zen 5 | 7.77 / 10.56 (-26%) | 9.43 / 14.80 (-36%) | 13.96 / 23.98 (-42%) | 34.58 / 49.24 (-30%) |
| M1 | 14.58 / 13.05 (+12%) | 11.48 / 18.63 (-38%) | 19.74 / 35.47 (-44%) | 43.14 / 71.14 (-39%) |
| M2 | 8.93 / 10.26 (-13%) | 11.00 / 17.90 (-39%) | 19.38 / 33.78 (-43%) | 41.33 / 68.03 (-39%) |
| M4 | 5.28 / 8.60 (-39%) | 8.07 / 14.77 (-45%) | 13.63 / 28.05 (-51%) | 28.68 / 56.22 (-49%) |
| Cortex-A72 | 29.80 / 33.48 (-11%) | 38.30 / 49.36 (-22%) | 61.28 / 83.63 (-27%) | 162.55 / 151.77 (+7%) |
| Snapdragon 8 Gen 3 | 9.72 / 12.13 (-20%) | 14.97 / 21.91 (-32%) | 28.51 / 39.89 (-29%) | 70.11 / 75.46 (-7%) |
| Kryo 485 | 15.08 / 19.36 (-22%) | 21.54 / 30.41 (-29%) | 33.63 / 50.96 (-34%) | 90.32 / 94.76 (-5%) |
## GCD

Let's start with the algorithm for computing the GCD of two numbers a and b, at least one of which is odd. It relies on the identities gcd(a, b) = gcd(a - b, b) and gcd(2a, b) = gcd(a, b) for odd b: strip the factors of two from a, subtract the smaller number from the larger one, and repeat until a hits zero.
The implementation is very short:
while a != 0 {
a >>= a.trailing_zeros();
if a < b {
(a, b) = (b, a);
}
a -= b;
}
return b;
If the initial a and b may both be even, the common power of two needs to be factored out and reapplied separately:
let shift = (a | b).trailing_zeros(); // == min(a.trailing_zeros(), b.trailing_zeros())
b >>= b.trailing_zeros();
/* loop from the previous snippet */
return b << shift;
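Putting the two snippets together, a complete function might look like this (a sketch; the u64 type and the zero-input guards are my additions):

// Binary GCD (Stein's algorithm) for arbitrary unsigned inputs.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    // trailing_zeros() returns 64 for a zero input, so handle zeros upfront.
    if a == 0 { return b; }
    if b == 0 { return a; }
    // Factor out the common power of two and make b odd.
    let shift = (a | b).trailing_zeros();
    b >>= b.trailing_zeros();
    while a != 0 {
        a >>= a.trailing_zeros();
        if a < b {
            (a, b) = (b, a);
        }
        a -= b;
    }
    // Reattach the common power of two.
    b << shift
}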
But for modular inversion, the modulus is usually odd, so I won’t dwell on this.
## Optimizations

This covers the general structure of the algorithm, but some optimizations are crucial for getting good performance.
The conditional swap should be compiled to branchless code to avoid branch misprediction. Compiler hints like `__builtin_unpredictable` or `core::hint::select_unpredictable` may be useful.
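For instance, the swap can be routed through a helper like this (a sketch; `cswap` is my name for it, and `select_unpredictable` is stable in recent Rust):

// Conditionally swap two values, hinting to the compiler that the
// condition is unpredictable, so a cmov is preferred over a branch.
#[inline]
fn cswap(c: bool, x: u64, y: u64) -> (u64, u64) {
    (
        core::hint::select_unpredictable(c, y, x),
        core::hint::select_unpredictable(c, x, y),
    )
}

The swap in the loop then becomes `(a, b) = cswap(a < b, a, b)`.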
The loop has a high latency because `trailing_zeros`, `>>=`, `if`, and `-=` are computed sequentially. But since `(-a).trailing_zeros() == a.trailing_zeros()`, `a.trailing_zeros()` can in principle be computed before the swap on the previous iteration:
let mut q = a.trailing_zeros();
while a != 0 {
a >>= q;
    q = a.wrapping_sub(b).trailing_zeros();
if a < b {
(a, b) = (b - a, a);
} else {
(a, b) = (a - b, b);
}
}
This brings the latency down to 3 operations: `>>=`; `a - b` and `b - a` computed in parallel; `trailing_zeros` and `if` computed in parallel. It also slightly increases the number of operations (computing both `b - a` and `a - b` and only using one), but the tradeoff pays off.
Pay close attention to `trailing_zeros` if you're implementing this in C. The algorithm can invoke it with a zero input on the last iteration. This is well-defined in Rust, but the C equivalent `__builtin_ctz(0)` is UB. Use `__builtin_ctzg`, which takes an optional fallback value for zero inputs, to avoid issues. In C++, `std::countr_zero(0)` is well-defined.
GCC documents `__builtin_ctz(0)` as having an "undefined result", so I initially assumed that means an indeterminate value. In reality, GCC maintainers consider it UB, and LLVM documents it as UB… but the optimizers seem to model it exactly like an indeterminate value? (e.g. LLVM considers `@llvm.cttz(0)` to produce `poison`) This is frankly ridiculous, someone do something about it.
## Extending

You might be wondering how this algorithm is related to modular inversion.

The trick is to express the values of a and b as multiples of the initial value a0 modulo the initial value b0: we maintain coefficients u and v such that a ≡ u * a0 and b ≡ v * a0 (mod b0), starting with u = 1, v = 0.

If the loop terminates with b = 1, then v * a0 ≡ 1 (mod b0), i.e. v is exactly the inverse we're looking for.

That is, all we need is to mirror every operation on the values with the same operation on the coefficients:

- When a is divided by 2^q, u needs to be divided by 2^q.
- When a and b are swapped, u and v need to be swapped.
- When b is subtracted from a, v needs to be subtracted from u.

In other words, whatever we do to a and b, we repeat on u and v.
## Limitations

Implementation attempts quickly reveal a problem: the coefficients are not necessarily divisible by 2^q, so we can't mirror the shift of a with a shift of u.
This is actually a core difference between Stein's algorithm and the textbook Euclidean algorithm, which is implemented as a division-based loop.
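Here's a minimal sketch of that loop for reference (my reconstruction; the signed 64-bit types are an assumption):

// Returns (gcd(a0, m), u) such that u * a0 ≡ gcd(a0, m) (mod m).
// Invariant: u * a0 ≡ a (mod m) and v * a0 ≡ b (mod m).
fn extended_euclid(a0: i64, m: i64) -> (i64, i64) {
    let (mut a, mut b) = (a0, m);
    let (mut u, mut v) = (1, 0);
    while b != 0 {
        let q = a / b;
        (a, b) = (b, a - q * b);
        (u, v) = (v, u - q * v);
    }
    (a, u) // u may be negative; add m to taste
}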
The Euclidean algorithm uses division (`q = a / b`), but only to compute constant factors. The values are updated with subtraction and multiplication alone (`a -= b * q`). Stein's algorithm divides the values themselves (`a /= 2^q`), causing non-integer coefficients.
This is likely why the extended Stein's algorithm is unpopular. We'll use tricks tailored to modular inversion; the general-purpose case, covered at the end of the post, essentially boils down to "compute the modular inverse and post-process". I believe it can still be faster than the textbook implementation, but I haven't tested it.
## Fractions

We can track the coefficients as fractions to stay in integers. The most efficient approach keeps both coefficients over the same denominator: we store numerators u and v plus a shared exponent p, representing the real coefficients u / 2^p and v / 2^p.
We start with u = 1, v = 0, p = 0. Dividing u by 2^q then amounts to incrementing p by q while multiplying the numerator v by 2^q (so that v's value stays put); swaps and subtractions apply to the numerators directly.
This seems pointless at first, since we need to know the inverse of 2^p modulo b0 to recover the final coefficient, and that's just another modular inversion. The way out is that p is bounded, so we can pad the denominator to a fixed 2^62 at the end (v / 2^p = (v << (62 - p)) / 2^62) and only ever divide by that one constant, which, as we'll see shortly, is cheap.
The multitude of variables is getting confusing, so let's simplify it. We're looking for the inverse of the initial a modulo the initial b, which I'll call b0:
// Example for 32-bit inputs (k = 32).
let mut u: i64 = 1;
let mut v: i64 = 0;
let mut p: u32 = 0;
let mut q = a.trailing_zeros();
while a != 0 {
a >>= q;
v <<= q;
p += q;
    q = a.wrapping_sub(b).trailing_zeros();
if a < b {
(a, b) = (b - a, a);
(u, v) = (v, u);
} else {
(a, b) = (a - b, b);
}
u -= v;
}
assert!(b == 1, "not invertible");
// Pad the denominator from 2^p to 2^62, then divide by 2^62 just once.
v <<= 62 - p;
return (v * inverse_of_2p62) % b0;
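To make the bookkeeping concrete, here's a hand-computed trace of inverting a = 3 modulo b0 = 7, showing (a, b, u, v, p) after each iteration:

- (4, 3, -1, 1, 0): the values were swapped, so the coefficients were too;
- (2, 1, 5, -1, 2): a was shifted by 2, so v was multiplied by 4;
- (0, 1, 7, -2, 3): b reached 1, so the loop exits.

The answer is v / 2^p = -2 * 8^-1 ≡ -2 ≡ 5 (mod 7), and indeed 3 * 5 ≡ 1 (mod 7).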
We don't apply the latency-reducing trick here: computing both u - v and v - u would most likely reduce performance, since we're already pushing the CPU's limit of parallel operations.
## Types

It's easy to prove by induction that at the beginning of each iteration, the numerators u and v take up roughly p bits (plus a sign), and p never exceeds 62 for 32-bit inputs, so i64 coefficients are enough.

For 64-bit inputs, the same argument requires i128 arithmetic, which slows down the algorithm considerably. We'll discuss what to do about it in a bit.
## Montgomery

Before we do this, though, let's finish the 32-bit case: we still need to compute v * 2^-62 (mod b0).
On the face of it, this is one multiplication and one reduction, but Montgomery multiplication demonstrates that these operations can be performed faster together.
Assume for a moment that v > 0. Precompute j = b0^-1 (mod 2^62) (it exists because b0 is odd) and consider x = v * j (mod 2^62).

This is equivalent to x * b0 ≡ v (mod 2^62): the product x * b0 has the bits of v at the bottom, so x * b0 - v is divisible by 2^62.

We've just found that (x * b0 - v) / 2^62 is an integer congruent to -v * 2^-62 (mod b0), and it's simply the top bits of the product x * b0 (the code below nudges it into the high 64 bits of a 64×64-bit multiplication by working with 4x instead of x).

This number is in range [0, b0), so subtracting it from b0 gives v * 2^-62 mod b0. A negative v runs the same computation on |v| and skips the final flip, and v = 2^62 (possible after the padding) is special-cased:
fn redc62(v: i64) -> u32 {
if v == (1 << 62) {
1
} else {
        // j = b0^{-1} mod 2^62, precomputed. The high 64 bits of
        // (4 * (|v| * j mod 2^62)) * b0 equal ((|v| * j mod 2^62) * b0) >> 62.
        let x = v.unsigned_abs().wrapping_mul(j << 2).widening_mul(b0 as u64).1 as u32;
if v > 0 { b0 - x } else { x }
}
}
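The constant j (the inverse of b0 modulo 2^62, per my reading of the code above) is cheap to precompute with Newton/Hensel iteration; a sketch:

// Inverse of an odd m modulo 2^64.
fn inv_mod_2_64(m: u64) -> u64 {
    debug_assert!(m % 2 == 1);
    // m * m ≡ 1 (mod 8), so x = m is correct to 3 bits; every step
    // doubles the number of correct bits: 3 → 6 → 12 → 24 → 48 → 96.
    let mut x = m;
    for _ in 0..5 {
        x = x.wrapping_mul(2u64.wrapping_sub(m.wrapping_mul(x)));
    }
    x
}

j can then be taken as inv_mod_2_64(b0 as u64); masking it down to 62 bits is unnecessary, since the top two bits are discarded by the `j << 2` anyway.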
That's it for the 32-bit case.
## 64-bit inputs

For 64-bit inputs, the coefficients need to be i128. This makes each operation twice as slow. We can reduce them back to machine words by restarting the coefficients periodically.
Hmm. Notice that at the beginning of the algorithm, the coefficients are tiny, and they only outgrow i64 once p crosses 62.
Just like u and v track how the current a and b relate to their initial values, we can add coefficients that track how the current u and v relate to a snapshot of them.
The trick is to save u and v at a checkpoint (say u0 and v0) and work with small coefficients f0, g0, f1, g1 such that u = f0 * u0 + g0 * v0 and v = f1 * u0 + g1 * v0 (with the usual 2^p denominator).
When the coefficients are about to outgrow i64 (i.e. when p is about to exceed 62), we apply them to u0 and v0 modulo b0 via redc62 (its 64-bit analogue, here) and reset them to the identity:
let mut u0 = 1;
let mut v0 = 0;
let mut q = a.trailing_zeros();
while a != 0 {
// The coefficients relating (u, v) to (u0, v0).
    let (mut f0, mut g0) = (1i64, 0i64);
    let (mut f1, mut g1) = (0i64, 1i64);
let mut p = 0;
// Run the algorithm until p reaches the limit.
while a != 0 && p + q <= 62 {
a >>= q;
f1 <<= q;
g1 <<= q;
p += q;
        q = a.wrapping_sub(b).trailing_zeros();
if a < b {
(a, b) = (b - a, a);
(f0, f1) = (f1, f0);
(g0, g1) = (g1, g0);
} else {
(a, b) = (a - b, b);
}
f0 -= f1;
g0 -= g1;
}
// This section means different things depending on the reason the loop stopped:
// - If we ran out of precision, this performs as much of the last action as possible and
// adjusts `q` so that the operation completes on the next iteration.
// - If `a = 0`, this effectively raises the precision of f1/g1 to 62. It doesn't adjust
// `f0, g0` correctly, but this doesn't matter because `u` is not read on the exit path.
a >>= 62 - p;
f1 <<= 62 - p;
g1 <<= 62 - p;
q -= 62 - p;
// Apply the coefficients.
let f0 = redc62(f0);
let g0 = redc62(g0);
let f1 = redc62(f1);
let g1 = redc62(g1);
(u0, v0) = ((f0 * u0 + g0 * v0) % b0, (f1 * u0 + g1 * v0) % b0);
}
assert!(b == 1, "not invertible");
return v0;
## Vectorization

The astute among you might realize this doesn't improve much, since we went from updating two i128 coefficients to updating four i64 ones, which is about the same amount of work.
We can't use SIMD because x86 doesn't have cmov for vector registers, but we can decrease the coefficient length to 31 bits and pack each (f, g) pair into a single 64-bit word, SWAR-style: c = f + (g << 32).
This simplifies the inner loop to:
while a != 0 && p + q <= 30 {
a >>= q;
c1 <<= q;
p += q;
    q = a.wrapping_sub(b).trailing_zeros();
if a < b {
(a, b) = (b - a, a);
(c0, c1) = (c1, c0);
} else {
(a, b) = (a - b, b);
}
c0 -= c1;
}
Just like before, when the budget runs out, we pad the precision and apply the accumulated coefficients to u0 and v0.
Only the epilogue changes: the packed words need to be split back into individual coefficients first.
// 31 would be 30 without this optimization
a >>= 31 - p;
c1 <<= 31 - p;
q -= 31 - p;
let (f0, g0) = parse_coefficients(c0);
let (f1, g1) = parse_coefficients(c1);
let f0 = redc31(f0);
let g0 = redc31(g0);
let f1 = redc31(f1);
let g1 = redc31(g1);
(u0, v0) = ((f0 * u0 + g0 * v0) % b0, (f1 * u0 + g1 * v0) % b0);
Note that the inner loop is still limited by p + q <= 30, so the outer loop runs roughly twice as often as with 62-bit coefficients; the cheaper iterations are what pays for it.
Parsing the coefficients out of a packed word takes a bit of care: subtractions borrow across the 32-bit halves, so the high half needs to be corrected by the sign of the low half (see the sketch below).
This assumes that the individual coefficients stay below 2^31 in magnitude, which the p + q <= 30 budget guarantees.
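Under the packing above (c = f + (g << 32)), parse_coefficients might look like this (a sketch):

// Unpack two signed coefficients from one 64-bit word. A negative f
// borrows one from the high half, so the borrow has to be added back.
fn parse_coefficients(c: i64) -> (i64, i64) {
    let f = c as i32 as i64; // low half, sign-extended
    let g = (c >> 32) + (f < 0) as i64; // high half, borrow undone
    (f, g)
}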
## Symmetry

With packed coefficients, the inner loop is nearly identical to the unoptimized loop over u and v, differing only in the initial values and the bit budget. So we can reuse one loop for both phases: on the very first chunk, c0 and c1 hold u and v themselves with a 63-bit budget (reduced directly by redc63), and on later chunks they hold packed coefficients with a 31-bit budget.
The final implementation looks something like this:
let mut u0 = 1;
let mut v0 = 0;
let mut q = a.trailing_zeros();
let mut is_first_iteration = true;
while a != 0 {
// Either coefficients in SWAR format, or the values u/v, depending on the iteration.
let mut c0 = 1;
let mut c1 = if is_first_iteration { 0 } else { 1 << 32 };
let mut p_left = if is_first_iteration { 63 } else { 31 };
while a != 0 && q < p_left { // < instead of <= is load-bearing
a >>= q;
c1 <<= q;
p_left -= q;
        q = a.wrapping_sub(b).trailing_zeros();
if a < b {
(a, b) = (b - a, a);
(c0, c1) = (c1, c0);
} else {
(a, b) = (a - b, b);
}
c0 -= c1;
}
a >>= p_left;
c1 <<= p_left;
q -= p_left;
if is_first_iteration {
u0 = redc63(c0);
v0 = redc63(c1);
} else {
        let (f0, g0) = parse_coefficients(c0);
        let (f1, g1) = parse_coefficients(c1);
let f0 = redc31(f0);
let g0 = redc31(g0);
let f1 = redc31(f1);
let g1 = redc31(g1);
        (u0, v0) = ((f0 * u0 + g0 * v0) % b0, (f1 * u0 + g1 * v0) % b0);
}
is_first_iteration = false;
}
assert!(b == 1, "not invertible");
return v0;
We store `p_left` instead of `p` so that `p_left -= q` and `q < p_left` can be computed with a single instruction.
The strict `q < p_left` comparison is what lets one loop serve both phases: with `p_left = 63 - p` it's equivalent to a `p + q <= 62` budget, and with `p_left = 31 - p` to the `p + q <= 30` one.
`redc31(x)` can be implemented as `redc63(x << 32)`, since (x << 32) * 2^-63 = x * 2^-31.
And that's it! You now know a cool way to compute modular inverses.
## General case

To support variable inputs beyond inverses modulo an odd number, a bit more work is required.
To replace the extended Euclidean algorithm, we need to find integers x and y such that a * x + b * y = gcd(a, b).
Luckily, our v already satisfies v * a ≡ g (mod b) for g = gcd(a, b), so we can take x = v and recover y = (g - a * x) / b.
Since this division is exact, it can be calculated with a multiplication by the inverse of b modulo 2^64 instead of an actual division instruction.
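As a sketch, reusing the inv_mod_2_64 helper from the Montgomery section (for an even b, shift the shared trailing zeros out of both operands first; exactness guarantees they're shared):

// Exact division: if n is a multiple of an odd d (possibly a negative
// multiple in two's complement), wrapping multiplication by d^-1 mod 2^64
// yields the true quotient n / d.
fn exact_div(n: u64, d: u64) -> u64 {
    debug_assert!(d % 2 == 1);
    n.wrapping_mul(inv_mod_2_64(d))
}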
Despite this complexity, I believe this method can be faster than the extended Euclidean algorithm, since the auxiliary logic takes constant time, except for computing the inverse of b modulo 2^64, and even that is just a handful of multiplications.
## Outro

As a reminder, you can find my code on GitHub. The source of the latency-optimized GCD is this post. Using coefficients to reset the bit lengths of u and v is from Pornin's paper.
Thanks to many friends of mine for contributing to the benchmarking results, to Ian Qvist for the motivation to complete this post and editorial comments, and to Yuki for saving me from going insane over unexplainable performance phenomena.