Bernstein-Yang's Inverse
Presentation: https://youtu.be/t0_iII-nKNw
Last updated
Presentation: https://youtu.be/t0_iII-nKNw
Last updated
GCD-based algorithms are widely used as an efficient method for computing modular inverses. However, most of them involve input-dependent branching, which makes them vulnerable to side-channel attacks.
For example, an implementation of in OpenSSL was shown to be susceptible to an RSA key extraction attack. This kind of attack demonstrates that the execution path of conditional branches can leak secret information.
A simple countermeasure is to use , which computes the inverse as . This method can be implemented in constant time, but it is generally less efficient.
This raises the question: Is there a way to compute the modular inverse that is both constant-time secure and faster than the Fermat-based approach?
The Binary GCD algorithm (also known as Stein’s Algorithm) involves branching based on the parity of the input integers or comparisons between them, which causes the execution path to vary. For example:
If is even → right shift
If is even → right shift
If →
Branching based on whether or is even is relatively safe since it only inspects the least significant bit. However, branching on comparisons like often requires examining multiple bits (or limbs), which makes it difficult to ensure constant-time execution.
The original paper also covers the case of computing inverses of polynomials, but for simplicity, we will restrict ourselves to the case of integers. The explanation closely follows the implementation described in , while adhering to the original paper's formulation of the divstep
update as much as possible.
divstep
The divstep
operation can be viewed as a composition of two functions:
Conditional Swap
Elimination
Always performed
This division by 2 is always valid because:
This logic is rooted in the classical Euclidean algorithm:
divstep
The behavior of divstep
can be more concretely classified as:
As previously discussed, this operation ultimately returns the GCD.
divstep
Formally, after a sufficient number of iterations, we reach:
A single divstep
can be represented using a matrix multiplication. The one-step update is expressed as:
Example: For the BN254 curve where the prime has 254 bits:
To do this, we first compute:
modinv
: Combines all components to compute the modular inverse
The original modinv
function already involved a modulus operation, but with the introduction of batched divstep
, expensive modular operations now occur at every loop iteration. In this section, we aim to eliminate those costly modulus computations.
div2n
✅ Version 1: Allowing Range Expansion
🔍 Range Analysis of Version 1
⚠️ Drawback
After each divstep
, the range expands:
This increasing range leads to:
Unnecessary arithmetic overhead
Larger intermediate values
Potential memory inefficiencies
✅ Version 2: Pre-Clamping Strategy
modinv
if v < 0: v += p
if sign == -1: v = -v
if v < 0: v += p
Traditional GCD-based modular inverse algorithms are fast but suffer from a major drawback: they are vulnerable to side-channel attacks due to conditional branches and comparison operations. This vulnerability can be critical in environments where sensitive key material is involved, such as RSA or ECC.
This article introduced a robust alternative: the Bernstein–Yang divstep
-based modular inverse algorithm. This approach offers several compelling advantages:
✅ Constant-time execution: Designed to ensure execution time does not vary based on input values
✅ High performance: Faster than Fermat’s method and comparable to (or better than) classical GCD-based methods
Ultimately, the divstep
-based algorithm strikes an excellent balance between performance and security. It is especially well-suited for environments that demand resistance to timing attacks, such as cryptographic protocol implementations or hardware acceleration contexts.
When is an odd integer and is any integer, the divstep
function is defined as follows:
This operation is designed to execute in constant-time, relying only on fixed-bit values like and the least significant bit of to determine behavior, ensuring that only a single bit needs to be checked.
Condition: and is odd
Operation:
TODO(chokobole): explain the reason why helps to reduce to 0.
Purpose: to reduce the size of
Operation:
Since is always odd, is guaranteed, allowing simplification:
If is odd, then is even (odd - odd = even)
If is even, the result is trivially divisible by 2
Thus, divstep
preserves the GCD while continually reducing the size of . Eventually, reaches 0, and the GCD remains in .
The idea of using divstep
to compute the modular inverse becomes most clear when we assume that the modulus p
is an odd prime. In this section, we explain the procedure for computing the inverse of an integer , i.e., .
We initialize the variables as , , , and , and maintain the following invariants throughout:
At each iteration, we apply a divstep
transformation to the pair . Since is a prime and , we know that . This guarantees that repeated applications of divstep
will eventually reduce to , and to either or .
At this point, because the algorithm maintains the invariant , we conclude:
Since , multiplying by gives the correct modular inverse of .
The divstep
function branches into three cases depending on and the parity of . We analyze how the invariants are preserved in each case.
We use the rule for Case 1. Accordingly, the values of and are updated as:
Since is guaranteed to be odd, the subtraction is always even, making the division by 2 valid.
We apply the same update rule to and as follows:
We use the rule for Case 2 Accordingly, the values of and are updated as:
Since is guaranteed to be odd, the subtraction is always even, making the division by 2 valid.
We apply the same update rule to and as follows:
We use the rule for Case 3 Accordingly, the values of and are updated as:
We apply the same update rule to and as follows:
The divstep
operation involves division by 2 for the variables . Instead of performing division by 2 on and at every iteration, we can batch divstep
s together and apply them to and all at once. This reduces individual divisions into a single division.
Here, is or depending on and that make sure the expression inside the parentheses becomes divisible by 2.
Note that and are standard integers, while and are integers modulo
The transition matrix used to update is the same one applied to . The key idea is to compute the transition matrix from and over steps, and then apply it to using matrix-vector multiplication as follows:
Here, and are correction terms chosen so that the expression inside the parentheses becomes divisible by . This will be explained in .
The transition matrix is defined as follows:
Let , which can usually be considered the bit length of the modulus. Then, the number of N-step matrix applications is bounded by:
So if we set to 62, it has been proven that a modular inverse can be computed with at most 12 iterations of the loop.
transition_matrix
: Computes the matrix after N
divsteps, this also computes
div2n
& update_x1x2
: Applies the matrix to to compute
We can express an integer as:
where is the remainder mod . To make divisible by , we need to eliminate the part.
Then, we subtract from :
which is now cleanly divisible by .
Additionally, since , the subtraction step preserves the modular equivalence:
Original Purpose:
The div2n
function is used to reduce results into the range .
Optimization Insight:
Instead of restricting outputs to modulo , we can eliminate the mod
operation if we ensure that all intermediate values remain within a wider range, such as .
Each transition matrix acts on the vector in one of the following ways:
After iterations, we get:
Since initially:
In update_x1x2_optimized_ver1
, we subtract a multiple of before division:
With , we obtain:
After division by :
Hence, the result stays within .
To prevent range blow-up, we can add to negative values to bring the range back to :
After the final iteration, the results lie in . We want to map into the range using the following normalization:
✅ Optimized computation: Improves efficiency through batch processing via matrix multiplication and by avoiding modulus operations ↪ For additional optimizations, refer to
Written by from