Bernstein-Yang's Inverse

Presentation: https://youtu.be/t0_iII-nKNw



Introduction

GCD-based algorithms are widely used as an efficient method for computing modular inverses. However, most of them involve input-dependent branching, which makes them vulnerable to side-channel attacks.

For example, an implementation of the binary GCD algorithm in OpenSSL was shown to be susceptible to an RSA key extraction attack. This kind of attack demonstrates that the execution path of conditional branches can leak secret information.

A simple countermeasure is to use Fermat's Little Theorem, which computes the inverse as $a^{-1} \equiv a^{p-2} \pmod p$. This method can be implemented in constant time, but it is generally less efficient.
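As a quick illustration of the formula (fermat_inverse is a local name for this sketch; note that Python's built-in pow is not itself a constant-time primitive, so this only demonstrates the math):

```python
def fermat_inverse(a, p):
    """Inverse of a mod a prime p via Fermat's Little Theorem: a^(p-2) mod p."""
    return pow(a, p - 2, p)

p = 101  # a small prime for illustration
a = 7
assert (a * fermat_inverse(a, p)) % p == 1
```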

This raises the question: Is there a way to compute the modular inverse that is both constant-time secure and faster than the Fermat-based approach?

Background

Limitations of Existing Constant-Time GCD Algorithms

The Binary GCD algorithm (also known as Stein’s Algorithm) involves branching based on the parity of the input integers or comparisons between them, which causes the execution path to vary. For example:

  • If $a$ is even → right shift

  • If $b$ is even → right shift

  • If $a > b$ → $a := a - b$

Branching based on whether $a$ or $b$ is even is relatively safe, since it only inspects the least significant bit. However, branching on comparisons like $a > b$ often requires examining multiple bits (or limbs), which makes it difficult to ensure constant-time execution.
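A textbook sketch of Stein's algorithm makes these input-dependent branches explicit (binary_gcd is a local name for this sketch, not any particular library's implementation):

```python
import math

def binary_gcd(a, b):
    """Textbook Stein's algorithm; every branch below depends on the (secret) inputs."""
    if a == 0:
        return b
    if b == 0:
        return a
    shift = 0
    while (a | b) & 1 == 0:   # both even: factor out a power of 2
        a, b, shift = a >> 1, b >> 1, shift + 1
    while a & 1 == 0:         # branch on the parity of a
        a >>= 1
    while b != 0:
        while b & 1 == 0:     # branch on the parity of b
            b >>= 1
        if a > b:             # multi-limb comparison: hard to make constant time
            a, b = b, a
        b -= a
    return a << shift

assert binary_gcd(48, 18) == math.gcd(48, 18)
```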

Protocol Explanation

The original paper also covers the case of computing inverses of polynomials, but for simplicity, we will restrict ourselves to the case of integers. The explanation closely follows the implementation described in Bitcoin Core's safegcd implementation guide, while adhering to the original paper's formulation of the divstep update as much as possible.

GCD with divstep

Stages

The divstep operation can be viewed as a composition of two functions:

  1. Conditional Swap

    Performed only when $\delta > 0$ and $b$ is odd: $(\delta, a, b) \rightarrow (-\delta, b, -a)$.

  2. Elimination

    • Always performed: if $b$ is odd, replace it with $\frac{b + a}{2}$; if $b$ is even, with $\frac{b}{2}$.

    This division by 2 is always valid because the numerator is always even: if $b$ is odd, $b + a$ is even (odd + odd), and if $b$ is even, it is trivially divisible by 2.

GCD Preservation

This logic is rooted in the classical Euclidean algorithm: swapping the operands, adding one operand to the other, and halving an even $b$ (while $a$ stays odd) all preserve the GCD.

Simplified divstep

Since $a$ is always odd, the behavior of divstep can be more concretely classified into three cases: swap-and-eliminate ($\delta > 0$, $b$ odd), eliminate ($\delta \le 0$, $b$ odd), and halve ($b$ even).

As previously discussed, this operation ultimately returns the GCD.

Code

def gcd(a, b):
    """Compute the GCD of an odd integer a and another integer b."""
    assert a & 1  # require a to be odd
    delta = 1     # additional state variable
    while b != 0:
        assert a & 1  # a will be odd in every iteration
        if delta > 0 and b & 1:
            delta, a, b = 1 - delta, b, (b - a) // 2
        elif b & 1:
            delta, a, b = 1 + delta, a, (b + a) // 2
        else:
            delta, a, b = 1 + delta, a, b // 2
    return abs(a)
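As a self-contained sanity check, the divstep recurrence (with the sign conventions used in Bitcoin Core's safegcd implementation guide; divstep_gcd is a local name) can be compared exhaustively against Python's math.gcd on small inputs:

```python
import math

def divstep_gcd(a, b):
    """GCD of an odd integer a and an arbitrary b >= 0 via the divstep recurrence."""
    assert a & 1
    delta = 1
    while b != 0:
        if delta > 0 and b & 1:
            delta, a, b = 1 - delta, b, (b - a) // 2
        elif b & 1:
            delta, a, b = 1 + delta, a, (b + a) // 2
        else:
            delta, a, b = 1 + delta, a, b // 2
    return abs(a)

# Exhaustive check against math.gcd on a small range of inputs.
for a in range(1, 40, 2):
    for b in range(0, 40):
        assert divstep_gcd(a, b) == math.gcd(a, b)
```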

Modular Inverse with divstep

Initial Setup

Formally, after a sufficient number of iterations, we reach $(u, v) = (\pm 1, 0)$; the invariant $a x_1 \equiv u \pmod p$ then gives $x_1 \cdot u \equiv a^{-1} \pmod p$.

Code

def div2(p, a):
    """Helper routine to compute a/2 mod p (where p is odd)."""
    assert p & 1
    if a & 1:  # If a is odd, make it even by adding p.
        a += p
    # a must be even now, so clean division by 2 is possible.
    return a // 2

def modinv(p, a):
    """Compute the inverse of a mod p (given that it exists, and p is odd)."""
    assert p & 1
    delta, u, v = 1, p, a
    x1, x2 = 0, 1
    while v != 0:
        # Note that while division by two for u and v is only ever done on even inputs,
        # this is not true for x1 and x2, so we need the div2 helper function.
        if delta > 0 and v & 1:
            delta, u, v = 1 - delta, v, (v - u) // 2
            x1, x2 = x2, div2(p, x2 - x1)
        elif v & 1:
            delta, u, v = 1 + delta, u, (v + u) // 2
            x1, x2 = x1, div2(p, x2 + x1)
        else:
            delta, u, v = 1 + delta, u, v // 2
            x1, x2 = x1, div2(p, x2)
        # Verify that the invariants x1=u/a mod p, x2=v/a mod p are maintained.
        assert u % p == (a * x1) % p
        assert v % p == (a * x2) % p
    assert u == 1 or u == -1  # |u| is the GCD, it must be 1
    # Because of invariant x1 = u/a mod p, 1/a = x1/u mod p. Since |u| = 1, x1/u = x1 * u.
    return (x1 * u) % p
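Condensed into a single self-contained function (again using the sign conventions from Bitcoin Core's safegcd guide; divstep_modinv is a local name), this can be checked against Python's built-in modular inverse pow(a, -1, p):

```python
def div2(p, a):
    """a/2 mod p, for odd p: add p first if a is odd, then halve exactly."""
    if a & 1:
        a += p
    return a // 2

def divstep_modinv(p, a):
    """Inverse of a mod p (p odd, gcd(p, a) = 1), via the divstep recurrence."""
    delta, u, v, x1, x2 = 1, p, a, 0, 1
    while v != 0:
        if delta > 0 and v & 1:
            delta, u, v, x1, x2 = 1 - delta, v, (v - u) // 2, x2, div2(p, x2 - x1)
        elif v & 1:
            delta, u, v, x1, x2 = 1 + delta, u, (v + u) // 2, x1, div2(p, x2 + x1)
        else:
            delta, u, v, x1, x2 = 1 + delta, u, v // 2, x1, div2(p, x2)
        # Invariants: a*x1 ≡ u and a*x2 ≡ v (mod p)
        assert u % p == (a * x1) % p and v % p == (a * x2) % p
    assert abs(u) == 1
    return (x1 * u) % p

p = 1009  # a small odd prime for illustration
for a in range(1, p):
    assert divstep_modinv(p, a) == pow(a, -1, p)
```

The exhaustive loop over all of $\mathbb{F}_p^\times$ for a toy prime is only feasible because $p$ is tiny; it is here to exercise every branch.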

Batching Multiple Divsteps

Motivation

Transition Matrix

A single divstep can be represented using a matrix multiplication: the one-step update multiplies the vector $[u, v]^T$ (and, with a small correction term, $[x_1, x_2]^T$) by a transition matrix $\mathcal{T}(\delta, u, v)$ and divides by 2.

Upper Bound on Iteration Count: Theorem 11.2

Example: for the BN254 curve, where the prime has 254 bits, the bound gives $\left\lfloor \frac{49 \cdot 254 + 57}{17} \right\rfloor = 735$ divsteps, so batches of $N = 62$ divsteps ($12 \cdot 62 = 744 > 735$) finish within 12 loop iterations.

Code

def transition_matrix(delta, u, v):
    """Compute delta, u, v, and the transition matrix t after N divsteps (t scaled by 2^N).

    N is a module-level constant: the number of divsteps batched per call (e.g. 62)."""
    m00, m01, m10, m11 = 1, 0, 0, 1  # start with identity matrix
    for _ in range(N):
        if delta > 0 and v & 1:
            delta, u, v, m00, m01, m10, m11 = 1 - delta, v, (v - u) // 2, 2*m10, 2*m11, m10 - m00, m11 - m01
        elif v & 1:
            delta, u, v, m00, m01, m10, m11 = 1 + delta, u, (v + u) // 2, 2*m00, 2*m01, m10 + m00, m11 + m01
        else:
            delta, u, v, m00, m01, m10, m11 = 1 + delta, u, v // 2, 2*m00, 2*m01, m10, m11
    return delta, u, v, (m00, m01, m10, m11)

def div2n(p, p_inv, x):
    """Compute x/2^N mod p, given p_inv = 1/p mod 2^N."""
    assert (p * p_inv) % 2**N == 1
    m = (x * p_inv) % 2**N
    x -= m * p
    assert x % 2**N == 0
    return (x >> N) % p

def update_x1x2(x1, x2, t, p, p_inv):
    """Multiply matrix t/2^N with [x1, x2], modulo p."""
    m00, m01, m10, m11 = t
    x1n, x2n = m00*x1 + m01*x2, m10*x1 + m11*x2
    return div2n(p, p_inv, x1n), div2n(p, p_inv, x2n)

To do this, we first compute the transition matrix from $u$ and $v$, then apply it to $x_1, x_2$. Finally, modinv combines all components to compute the modular inverse:

def modinv(p, p_inv, x):
    """Compute the modular inverse of x mod p, given p_inv=1/p mod 2^N."""
    assert p & 1
    delta, u, v, x1, x2 = 1, p, x, 0, 1
    while v != 0:
        delta, u, v, t = transition_matrix(delta, u, v)
        x1, x2 = update_x1x2(x1, x2, t, p, p_inv)
    assert u == 1 or u == -1  # |u| must be 1
    return (x1 * u) % p
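Putting the pieces together, here is a compact end-to-end sketch of the batched algorithm (sign conventions per Bitcoin Core's safegcd guide; a toy batch size N = 4, the helper names are local to this example, and pow(p, -1, 2**N) assumes Python 3.8+):

```python
N = 4  # batch size; real implementations use a word-sized N such as 62

def transition_matrix(delta, u, v):
    """N divsteps on (u, v), accumulating the transition matrix scaled by 2^N."""
    m00, m01, m10, m11 = 1, 0, 0, 1  # identity matrix
    for _ in range(N):
        if delta > 0 and v & 1:
            delta, u, v, m00, m01, m10, m11 = (1 - delta, v, (v - u) // 2,
                                               2*m10, 2*m11, m10 - m00, m11 - m01)
        elif v & 1:
            delta, u, v, m00, m01, m10, m11 = (1 + delta, u, (v + u) // 2,
                                               2*m00, 2*m01, m10 + m00, m11 + m01)
        else:
            delta, u, v, m00, m01, m10, m11 = (1 + delta, u, v // 2,
                                               2*m00, 2*m01, m10, m11)
    return delta, u, v, (m00, m01, m10, m11)

def div2n(p, p_inv, x):
    """x / 2^N mod p, given p_inv = 1/p mod 2^N."""
    m = (x * p_inv) % 2**N
    x -= m * p            # now divisible by 2^N, and unchanged mod p
    return (x >> N) % p

def batched_modinv(p, x):
    p_inv = pow(p, -1, 2**N)
    delta, u, v, x1, x2 = 1, p, x, 0, 1
    while v != 0:
        delta, u, v, (m00, m01, m10, m11) = transition_matrix(delta, u, v)
        x1, x2 = (div2n(p, p_inv, m00*x1 + m01*x2),
                  div2n(p, p_inv, m10*x1 + m11*x2))
    assert abs(u) == 1
    return (x1 * u) % p

p = 1009
for x in range(1, 30):
    assert batched_modinv(p, x) == pow(x, -1, p)
```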

Avoiding Modulus Operations

The original modinv function already involved a modulus operation, but with the introduction of batched divstep, expensive modular operations now occur at every loop iteration. In this section, we aim to eliminate those costly modulus computations.

Removing Modulus Operation in div2n

✅ Version 1: Allowing Range Expansion

def update_x1x2_optimized_ver1(x1, x2, t, p, p_inv):
    """Multiply matrix t/2^N with [x1, x2], modulo p, given p_inv = 1/p mod 2^N."""
    m00, m01, m10, m11 = t
    x1n, x2n = m00*x1 + m01*x2, m10*x1 + m11*x2
    # Cancel out bottom N bits of x1n and x2n.
    mx1n = ((x1n * p_inv) % 2**N)
    mx2n = ((x2n * p_inv) % 2**N)
    x1n -= mx1n * p
    x2n -= mx2n * p
    return x1n >> N, x2n >> N  

🔍 Range Analysis of Version 1

⚠️ Drawback

After each batched step, the range expands: $(-p, p) \rightarrow (-2p, p) \rightarrow (-3p, 2p) \rightarrow \cdots$

This increasing range leads to:

  • Unnecessary arithmetic overhead

  • Larger intermediate values

  • Potential memory inefficiencies

✅ Version 2: Pre-Clamping Strategy

def update_x1x2_optimized_ver2(x1, x2, t, p, p_inv):
    """Multiply matrix t/2^N with [x1, x2], modulo p, given p_inv = 1/p mod 2^N."""
    m00, m01, m10, m11 = t
    # x1, x2 in (-2*p, p)
    if x1 < 0:
        x1 += p
    if x2 < 0:
        x2 += p
    # x1, x2 in (-p, p)
    x1n, x2n = m00*x1 + m01*x2, m10*x1 + m11*x2
    mx1n = -((p_inv * x1n) % 2**N)
    mx2n = -((p_inv * x2n) % 2**N)
    x1n += mx1n * p
    x2n += mx2n * p
    return x1n >> N, x2n >> N

Avoiding Modulus Operation in modinv

def normalize(sign, v, p):
    """Compute sign*v mod p, where v is in range (-2*p, p); output in [0, p)."""
    assert sign == 1 or sign == -1
    if v < 0:
        v += p
    if sign == -1:
        v = -v
    if v < 0:
        v += p
    return v
The function handles four cases, depending on the sign flag and on whether $v$ is negative: sign == -1 with v < 0, sign == -1 with v >= 0, sign == 1 with v < 0, and sign == 1 with v >= 0. Starting from $v \in (-2p, p)$, the three conditional lines act as follows:

  • if v < 0: v += p — brings a negative $v$ into $(-p, p)$; a nonnegative $v \in [0, p)$ is unchanged.

  • if sign == -1: v = -v — negates $v$ when required; the value stays within $(-p, p)$.

  • if v < 0: v += p — folds any remaining negative value into $[0, p)$.

In all four combinations, the result equals $\mathrm{sign} \cdot v \bmod p$ and lies in $[0, p)$.
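The case analysis can be checked exhaustively for a small odd modulus (a self-contained sketch; the toy value of p is arbitrary):

```python
def normalize(sign, v, p):
    """Compute sign*v mod p, where v is in range (-2*p, p); output in [0, p)."""
    assert sign == 1 or sign == -1
    if v < 0:
        v += p
    if sign == -1:
        v = -v
    if v < 0:
        v += p
    return v

# Exhaustive check: every input in (-2p, p) lands in [0, p) with the right value.
p = 101
for sign in (1, -1):
    for v in range(-2 * p + 1, p):
        out = normalize(sign, v, p)
        assert 0 <= out < p
        assert out == (sign * v) % p
```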

Final Implementation

def modinv(p, p_inv, x):
    """Compute the modular inverse of x mod p, given p_inv=1/p mod 2^N."""
    assert p & 1
    delta, u, v, x1, x2 = 1, p, x, 0, 1
    while v != 0:
        delta, u, v, t = transition_matrix(delta, u, v)
        x1, x2 = update_x1x2_optimized_ver2(x1, x2, t, p, p_inv)
    assert u == 1 or u == -1  
    return normalize(u, x1, p)

Conclusion

Traditional GCD-based modular inverse algorithms are fast but suffer from a major drawback: they are vulnerable to side-channel attacks due to conditional branches and comparison operations. This vulnerability can be critical in environments where sensitive key material is involved, such as RSA or ECC.

This article introduced a robust alternative: the Bernstein–Yang divstep-based modular inverse algorithm. This approach offers several compelling advantages:

  • ✅ Constant-time execution: Designed to ensure execution time does not vary based on input values

  • ✅ High performance: Faster than Fermat’s method and comparable to (or better than) classical GCD-based methods

Ultimately, the divstep-based algorithm strikes an excellent balance between performance and security. It is especially well-suited for environments that demand resistance to timing attacks, such as cryptographic protocol implementations or hardware acceleration contexts.

Appendix: Derivation Details

When $a$ is an odd integer and $b$ is any integer, the divstep function is defined as follows:

$$\mathsf{divstep}(\delta, a, b) = \begin{cases} (1 - \delta,\ b,\ \frac{b - a}{2}) & \text{if } \delta > 0 \land b \bmod 2 = 1 \\ (1 + \delta,\ a,\ \frac{b + (b \bmod 2)\cdot a}{2}) & \text{otherwise} \end{cases}$$

This operation is designed to execute in constant time: each branch decision depends only on the sign of $\delta$ and the least significant bit of $b$, so only a single bit of each needs to be inspected.

Condition: $\delta > 0$ and $b$ is odd

Operation: $(\delta, a, b) \rightarrow (-\delta, b, -a)$

TODO(chokobole): explain the reason why $\delta$ helps to reduce $b$ to 0.

Purpose: to reduce the size of $b$

Operation: $(\delta, a, b) \rightarrow (1 + \delta, a, \frac{b + (b \bmod 2)\cdot a}{2})$

If $b$ is odd, the numerator is $b + a$, which is even (odd + odd = even).

If $b$ is even, the numerator is $b$ itself, which is trivially divisible by 2.

$$\gcd(a, b) = \gcd(b, a) = \gcd(a, b + a) = \gcd(a, b/2) \quad \text{(the last step requiring } b \text{ even and } a \text{ odd)}$$

Thus, divstep preserves the GCD while continually reducing the size of $b$. Eventually, $b$ reaches 0, and the GCD remains in $a$.

$$\mathsf{divstep}(\delta, a, b) = \begin{cases} a & \text{if } b = 0 \\ (1 - \delta, b, \frac{b - a}{2}) & \text{if } \delta > 0 \land b \bmod 2 = 1 \\ (1 + \delta, a, \frac{b + a}{2}) & \text{if } \delta \le 0 \land b \bmod 2 = 1 \\ (1 + \delta, a, \frac{b}{2}) & \text{if } b \bmod 2 = 0 \end{cases}$$

The idea of using divstep to compute the modular inverse is clearest when we assume that the modulus $p$ is an odd prime. In this section, we explain the procedure for computing the inverse of an integer $a \in \mathbb{F}_p$, i.e., $a^{-1} \bmod p$.

We initialize the variables as $u = p$, $v = a$, $x_1 = 0$, and $x_2 = 1$, and maintain the following invariants throughout:

$$a x_1 \equiv u \pmod p, \qquad a x_2 \equiv v \pmod p$$

At each iteration, we apply a divstep transformation to the pair $(u, v)$. Since $p$ is a prime and $a \in \mathbb{F}_p^\times$, we know that $\gcd(p, a) = 1$. This guarantees that repeated applications of divstep will eventually reduce $v$ to $0$, and $u$ to either $1$ or $-1$:

$$(u, v) = (\pm 1, 0)$$

At this point, because the algorithm maintains the invariant $a x_1 \equiv u \pmod p$, we conclude:

$$x_1 \cdot u \equiv a^{-1} \pmod p$$

Since $u = \pm 1$, multiplying $x_1$ by $u$ gives the correct modular inverse of $a$.

The divstep function $\mathsf{divstep}(\delta, u, v)$ branches into three cases depending on $\delta$ and the parity of $v$. We analyze how the invariants are preserved in each case.

Case 1: $\delta > 0$ and $v$ is odd

We use the rule $\mathsf{divstep}(\delta, a, b) = (1 - \delta, b, \frac{b - a}{2})$ for Case 1. Accordingly, the values of $u$ and $v$ are updated as:

$$u \leftarrow v, \qquad v \leftarrow \frac{v - u}{2}$$

Since $u$ and $v$ are both odd here, the difference $v - u$ is always even, making the division by 2 valid.

We apply the same update rule to $x_1$ and $x_2$ as follows:

$$x_1 \leftarrow x_2, \qquad x_2 \leftarrow \begin{cases} \frac{x_2 - x_1}{2} & \text{if } (x_2 - x_1) \bmod 2 = 0 \\ \frac{x_2 - x_1 + p}{2} & \text{otherwise} \end{cases}$$

Case 2: $\delta \le 0$ and $v$ is odd

We use the rule $\mathsf{divstep}(\delta, a, b) = (1 + \delta, a, \frac{b + a}{2})$ for Case 2. Accordingly, the values of $u$ and $v$ are updated as:

$$u \leftarrow u, \qquad v \leftarrow \frac{v + u}{2}$$

Since $u$ and $v$ are both odd here, the sum $v + u$ is always even, making the division by 2 valid.

We apply the same update rule to $x_1$ and $x_2$ as follows:

$$x_1 \leftarrow x_1, \qquad x_2 \leftarrow \begin{cases} \frac{x_2 + x_1}{2} & \text{if } (x_2 + x_1) \bmod 2 = 0 \\ \frac{x_2 + x_1 + p}{2} & \text{otherwise} \end{cases}$$

Case 3: $v$ is even

We use the rule $\mathsf{divstep}(\delta, a, b) = (1 + \delta, a, \frac{b}{2})$ for Case 3. Accordingly, the values of $u$ and $v$ are updated as:

$$u \leftarrow u, \qquad v \leftarrow \frac{v}{2}$$

We apply the same update rule to $x_1$ and $x_2$ as follows:

$$x_1 \leftarrow x_1, \qquad x_2 \leftarrow \begin{cases} \frac{x_2}{2} & \text{if } x_2 \bmod 2 = 0 \\ \frac{x_2 + p}{2} & \text{otherwise} \end{cases}$$

The divstep operation involves division by 2 for the variables $u, v, x_1, x_2$. Instead of performing division by 2 on $x_1$ and $x_2$ at every iteration, we can batch $N$ divsteps together and apply them to $x_1$ and $x_2$ all at once. This reduces $N$ individual divisions into a single division.

$$\begin{bmatrix} x_1^{(1)} \\ x_2^{(1)} \end{bmatrix} \equiv \frac{1}{2} \cdot \left( \mathcal{T}(\delta, u, v) \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 0 \\ \alpha \end{bmatrix} \right) \pmod p$$

Here, $\alpha$ is $0$ or $p$, chosen depending on $x_1$ and $x_2$ so that the expression inside the parentheses becomes divisible by 2.

Note that $u$ and $v$ are standard integers, while $x_1$ and $x_2$ are integers modulo $p$.

The transition matrix $\mathcal{T}$ used to update $u, v$ is the same one applied to $x_1, x_2$. The key idea is to compute the transition matrix from $u$ and $v$ over $N$ steps, and then apply it to $x_1, x_2$ using matrix-vector multiplication as follows:

$$\begin{bmatrix} x_1^{(N)} \\ x_2^{(N)} \end{bmatrix} \equiv \frac{1}{2^N} \cdot \left( \mathcal{T}^N(\delta, u, v) \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} m_1 p \\ m_2 p \end{bmatrix} \right) \pmod p$$

Here, $m_1 p$ and $m_2 p$ are correction terms chosen so that the expression inside the parentheses becomes divisible by $2^N$; how they are computed is explained below.

The transition matrix $\mathcal{T}(\delta, u, v)$ is defined as follows:

$$\mathcal{T}(\delta, u, v) = \begin{cases} \begin{pmatrix} 0 & 2 \\ -1 & 1 \end{pmatrix} & \text{if } \delta > 0 \land v \bmod 2 = 1 \\ \begin{pmatrix} 2 & 0 \\ 1 & 1 \end{pmatrix} & \text{if } \delta \le 0 \land v \bmod 2 = 1 \\ \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix} & \text{otherwise} \end{cases}$$

Let $b = \max(\text{bitlength}(u), \text{bitlength}(v))$, which can usually be taken as the bit length of the modulus. Then the total number of divsteps required is bounded by:

$$\begin{cases} \left\lfloor \frac{49b + 80}{17} \right\rfloor & \text{if } b < 46 \\ \left\lfloor \frac{49b + 57}{17} \right\rfloor & \text{otherwise} \end{cases}$$

For the BN254 curve, where the prime has 254 bits:

$$\left\lfloor \frac{49 \cdot 254 + 57}{17} \right\rfloor = 735, \qquad 12 \cdot 62 = 744 > 735$$

So if we set $N$ to 62, it has been proven that a modular inverse can be computed with at most 12 iterations of the loop.
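The bound is easy to evaluate directly (divstep_bound is a local helper implementing the formula above):

```python
import math

def divstep_bound(b):
    """Theorem 11.2 bound on the total number of divsteps for b-bit inputs."""
    return (49 * b + 80) // 17 if b < 46 else (49 * b + 57) // 17

total_steps = divstep_bound(254)        # bound for BN254's 254-bit prime
batches = math.ceil(total_steps / 62)   # batches of N = 62 divsteps
assert total_steps == 735 and batches == 12
```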

transition_matrix: computes the transition matrix after $N$ divsteps, along with the updated $\delta$, $u^{(N)}$, and $v^{(N)}$

div2n & update_x1x2: apply the matrix to $[x_1, x_2]$ to compute $x_1^{(N)}, x_2^{(N)}$

Why div2n makes $x$ divisible by $2^N$:

We can express an integer $x$ as:

$$x = 2^N \cdot q + r$$

where $r$ is the remainder mod $2^N$. To make $x$ divisible by $2^N$, we need to eliminate the $r$ part.

$$m = (x \cdot p^{-1}) \bmod 2^N = r \cdot p^{-1} \bmod 2^N \;\Rightarrow\; m \cdot p \equiv r \pmod{2^N}$$

Then, we subtract $m \cdot p$ from $x$:

$$x - m \cdot p = 2^N \cdot q + r - m \cdot p \equiv r - r \equiv 0 \pmod{2^N}$$

which is now cleanly divisible by $2^N$.

Additionally, since $m \cdot p$ is a multiple of $p$, the subtraction step preserves the modular equivalence:

$$x - m \cdot p \equiv x \pmod p$$
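A small numeric check of this cancellation trick (toy parameters N = 4 and p = 7, chosen only for illustration):

```python
N = 4
p = 7
p_inv = pow(p, -1, 2**N)   # 1/p mod 2^N; requires Python 3.8+

for x in range(-100, 100):
    m = (x * p_inv) % 2**N
    t = x - m * p
    assert t % 2**N == 0     # bottom N bits cancelled
    assert t % p == x % p    # still congruent to x mod p
```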

Original purpose: the div2n function is used to reduce results into the range $[0, p)$.

Optimization insight: instead of restricting outputs to $[0, p)$, we can eliminate the mod operation if we ensure that all intermediate values remain within a wider range, such as $(-2p, p)$.

Each transition matrix $\mathcal{T}$ acts on the vector $[x_1, x_2]^T$ in one of the following ways:

$$\mathcal{T}(\delta, u, v) \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{cases} \begin{pmatrix} 2x_2 \\ x_2 - x_1 \end{pmatrix} & \text{if } \delta > 0 \land v \bmod 2 = 1 \\ \begin{pmatrix} 2x_1 \\ x_2 + x_1 \end{pmatrix} & \text{if } \delta \le 0 \land v \bmod 2 = 1 \\ \begin{pmatrix} 2x_1 \\ x_2 \end{pmatrix} & \text{otherwise} \end{cases}$$

Each application of $\mathcal{T}$ at most doubles the magnitude of the entries, so after $N$ iterations:

$$|x_1^{(N)}|,\ |x_2^{(N)}| < 2^N \cdot \max(|x_1|, |x_2|)$$

Since $x_1, x_2 \in (-p, p)$ initially:

$$-2^N p < x_1^{(N)} < 2^N p, \qquad -2^N p < x_2^{(N)} < 2^N p$$

In update_x1x2_optimized_ver1, we subtract a multiple of $p$ before division:

$$x_1^{(N)} \leftarrow x_1^{(N)} - m_{x_1} \cdot p, \qquad x_2^{(N)} \leftarrow x_2^{(N)} - m_{x_2} \cdot p$$

With $0 \le m_{x_1}, m_{x_2} < 2^N$, we obtain:

$$-2^{N+1} p < x_1^{(N)} - m_{x_1} p < 2^N p, \qquad -2^{N+1} p < x_2^{(N)} - m_{x_2} p < 2^N p$$

After division by $2^N$:

$$-2p < \frac{x_1^{(N)} - m_{x_1} p}{2^N} < p, \qquad -2p < \frac{x_2^{(N)} - m_{x_2} p}{2^N} < p$$

Hence, the result stays within $(-2p, p)$.

$$(-p, p) \rightarrow (-2p, p) \rightarrow (-3p, 2p) \rightarrow \cdots$$

To prevent this range blow-up, we can add $p$ to negative values to bring the range back to $(-p, p)$ before the matrix is applied; this is exactly what update_x1x2_optimized_ver2 does.

After the final iteration, the results $x_1^{(N)}, x_2^{(N)}$ lie in $(-2p, p)$. We want to map $x_1$ into the range $[0, p)$ using the normalization routine normalize shown earlier.

✅ Optimized computation: improves efficiency through batch processing via matrix multiplication and by avoiding modulus operations. For additional optimizations, refer to Bitcoin Core's safegcd implementation guide.

Written by Ryan Kim from A41

References
  • D. J. Bernstein and B.-Y. Yang, "Fast constant-time gcd computation and modular inversion": https://eprint.iacr.org/2019/266

  • Bitcoin Core's safegcd implementation guide: https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md

  • Stein's binary GCD algorithm

  • Fermat's Little Theorem