Montgomery Reduction


Montgomery reduction provides an efficient way to perform modular arithmetic, particularly modular multiplication $a \cdot b \bmod n$. This is crucial when the modulus $n$ is large and many operations are needed, as in cryptography. Montgomery reduction excels when $n$ is odd and aims to replace slow divisions by $n$.

The Core Idea

The method shifts calculations into a special "Montgomery domain" defined by an auxiliary modulus $R$. Arithmetic happens within this domain, where the reduction step is cleverly transformed into a division by $R$. Since $R$ is chosen as a power of 2, this division is just a fast bit shift and the remainder is simply the last few bits. Results are converted back to the standard representation when the sequence of operations is complete.

The Math Explained

The Montgomery Representation

We choose an auxiliary modulus $R = b^k$ (where $b$ is the base, usually 2) such that $R > n$. Critically, $n$ and $R$ must be coprime: $\gcd(n, R) = 1$. In this case, this just implies $n$ must be odd.

Definition 1. The Montgomery representation $\bar{x}$ of an integer $x$ (where $0 \leq x < n$) is $\bar{x} = x \cdot R \bmod n$.

Since $n$ and $R$ are coprime, we can define the following conversion functions:

  • $\mathsf{fromMont}(\bar{x})$: Converts $\bar{x}$ back to standard form: $x = \bar{x} \cdot R^{-1} \bmod n$. This operation is also called Montgomery Reduction.

  • $\mathsf{toMont}(x)$: Converts $x$ into Montgomery form: $\bar{x} = x \cdot b^k \bmod n$. This calculation itself can be optimized with Montgomery Reduction.

Lemma 1. The remainder of division by $R = 2^k$ can be efficiently calculated as $x \bmod R = \mathsf{lowbits}_k(x)$, where $\mathsf{lowbits}_k(x)$ takes the least significant $k$ bits of $x$.

Lemma 2. The quotient $\lfloor x / R \rfloor$ can be efficiently computed using a bit shift: $x \gg k$.
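To make this concrete, here is a small Python sketch of the representation and the two lemmas. The parameters ($n = 97$, $k = 8$) are toy values chosen only for illustration, and $\mathsf{fromMont}$ is implemented naively with a modular inverse for now; the efficient $\mathsf{REDC}$ version comes later.

```python
# Toy parameters (illustrative only): an odd modulus n and R = 2^k > n.
n = 97
k = 8
R = 1 << k                       # gcd(n, R) = 1 since n is odd

def to_mont(x):
    return (x * R) % n           # x-bar = x * R mod n

def from_mont(x_bar):
    return (x_bar * pow(R, -1, n)) % n   # naive: uses R^{-1} mod n directly

# Lemma 1 and Lemma 2: reduction and division by R are bit operations.
T = 123456
assert T % R == T & (R - 1)      # lowbits_k(T)
assert T // R == T >> k          # right shift by k

x = 42
assert from_mont(to_mont(x)) == x
```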

Montgomery Multiplication

Given two numbers in Montgomery form, $\bar{a} = a \cdot R \bmod n$ and $\bar{b} = b \cdot R \bmod n$, we would like to calculate $\bar{c} \equiv (a \cdot b) \cdot R \bmod n$. This can be calculated using $\bar{a} \cdot \bar{b} \cdot R^{-1}$ since:

$$\bar{a} \cdot \bar{b} \equiv (a \cdot R) \cdot (b \cdot R) \equiv a \cdot b \cdot R^2 \equiv \bar{c} \cdot R \pmod{n}$$

To do the Montgomery Multiplication efficiently, we need to be able to calculate:

$$\bar{c} = \mathsf{doMontMul}(\bar{a}, \bar{b}) = \bar{a} \cdot \bar{b} \cdot R^{-1} \bmod n$$

To get the actual $c = a \cdot b \bmod n$, we will have to additionally convert back from Montgomery form using $\mathsf{fromMont}(\bar{c})$.
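The extra factor of $R^{-1}$ is the whole point: a plain product of two Montgomery-form values carries $R^2$, one factor of $R$ too many. A minimal Python check (toy parameters, naive modular inverse):

```python
# Why doMontMul must divide by R: a-bar * b-bar carries R^2, not R.
n, R = 97, 256
a, b = 55, 73
a_bar, b_bar = (a * R) % n, (b * R) % n

plain = (a_bar * b_bar) % n                        # = a*b*R^2 mod n -- wrong domain
c_bar = (a_bar * b_bar * pow(R, -1, n)) % n        # = a*b*R   mod n -- correct

assert plain == (a * b * R * R) % n
assert c_bar == (a * b * R) % n
assert (c_bar * pow(R, -1, n)) % n == (a * b) % n  # fromMont recovers a*b mod n
```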

Montgomery Reduction

Definition 2. Montgomery Reduction is an efficient reduction method to calculate $T \cdot R^{-1} \bmod n$ for a given $T < n \cdot R$. This operation can also be denoted as:

$$\mathsf{REDC}(T) := T \cdot R^{-1} \bmod n$$

The key idea to perform Montgomery Reduction efficiently is to find an integer $m$ such that $T + m \cdot n$ is exactly divisible by $R$. Then the desired result can be computed as:

$$T \cdot R^{-1} \equiv \frac{T + m \cdot n}{R} \equiv (T + m \cdot n) \gg k \pmod{n}$$

What would happen if we simply precomputed $R^{-1} \bmod n$ and multiplied by it? The product $T \cdot R^{-1}$ would still need to be reduced modulo $n$, which is exactly the expensive division we are trying to avoid.

We need to find $m$ such that $T + m \cdot n \equiv 0 \bmod R$:

$$\begin{align*}
T + m \cdot n &\equiv 0 &\pmod{R}\\
m \cdot n &\equiv -T &\pmod{R}\\
m &\equiv -T \cdot n^{-1} &\pmod{R}\\
m &\equiv T \cdot (-n^{-1}) &\pmod{R}
\end{align*}$$

To calculate $m$ efficiently, we use a precomputed value $n' = -n^{-1} \bmod R$. This $n'$ exists because $\gcd(n, R) = 1$. Now, we have $m := T \cdot n' \bmod R$ and $\mathsf{REDC}$ can be defined as:

$$\mathsf{REDC}(T) = (T + m \cdot n) \gg k \pmod{n}$$

Since $T < n \cdot R$, we have $(T \gg k) = T/R < n$. Since $m < R$, we have $(m \cdot n \gg k) = n \cdot m / R < n$. Hence, $((T + m \cdot n) \gg k) < 2n$, so we just need to subtract $n$ once if $((T + m \cdot n) \gg k) \geq n$ to take the modulus.

Overview of $\mathsf{REDC}(T)$:

Given $T < n \cdot R$ and the precomputed $n' := -n^{-1} \bmod R$:

1. Calculate $m = (T \cdot n') \bmod R = \mathsf{lowbits}_k(\mathsf{lowbits}_k(T) \cdot n')$. (Effectively: multiply the lowest $k$ bits of $T$ by $n'$, then take the lowest $k$ bits of the product.)
2. Compute $x = (T + m \cdot n)/R$. (Involves one multiplication $m \cdot n$, one addition, and one right shift by $k$.)
3. Return $x - n$ if $x \geq n$, otherwise return $x$.

Summary of Montgomery Reduction

With some precomputation and using Montgomery Reduction $\mathsf{REDC}$, we can optimize the $\mathsf{doMontMul}(\bar{a}, \bar{b})$, $\mathsf{toMont}(x)$, and $\mathsf{fromMont}(\bar{x})$ operations as follows.

Precompute $\mathsf{Rsq}' := R^2 \bmod n$ and $n' := -n^{-1} \bmod R$:

$$\begin{aligned}
\mathsf{toMont}(x) &= x \cdot R \bmod n = \mathsf{REDC}(x \cdot R^2) = \mathsf{REDC}(x \cdot \mathsf{Rsq}')\\
\mathsf{fromMont}(\bar{x}) &= \mathsf{REDC}(\bar{x})\\
\mathsf{doMontMul}(\bar{a}, \bar{b}) &= \mathsf{REDC}(\bar{a} \cdot \bar{b})
\end{aligned}$$
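Here is a minimal Python sketch of single-precision $\mathsf{REDC}$ and the three operations built on top of it, following the steps above. The parameters are toy values for illustration; real implementations work on machine-word limbs rather than Python big integers.

```python
# Single-precision Montgomery reduction and the operations built on it.
n = 97                              # odd modulus (toy value)
k = 8
R = 1 << k                          # R = 2^k > n
n_prime = (-pow(n, -1, R)) % R      # n' = -n^{-1} mod R
Rsq = (R * R) % n                   # Rsq' = R^2 mod n

def redc(T):
    """REDC(T) = T * R^{-1} mod n, for any 0 <= T < n*R."""
    m = ((T & (R - 1)) * n_prime) & (R - 1)   # m = lowbits_k(lowbits_k(T) * n')
    x = (T + m * n) >> k                      # exact division by R
    return x - n if x >= n else x             # at most one subtraction

def to_mont(x):
    return redc(x * Rsq)            # x * R mod n

def from_mont(x_bar):
    return redc(x_bar)              # x-bar * R^{-1} mod n = x

def mont_mul(a_bar, b_bar):
    return redc(a_bar * b_bar)      # (a * b) * R mod n

a, b = 55, 73
c_bar = mont_mul(to_mont(a), to_mont(b))
assert from_mont(c_bar) == (a * b) % n
```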

Multi-precision Montgomery Reduction

In a multi-precision setting, suppose the modulus $n$ has $l$ limbs. Let $R = B^l$, where $w$ is the limb bit width and $B = 2^w$. To calculate $T \cdot R^{-1} = T \cdot B^{-l} \bmod n$, the multi-precision Montgomery Reduction operates iteratively on

$$T = T_{2l-1} \cdot B^{2l-1} + \dots + T_l \cdot B^l + T_{l-1} \cdot B^{l-1} + \dots + T_0$$

where the $T_i$ represent the limb values.

Separated Operands Scanning (SOS)

Iterative Partial Reduction

At each iteration, the limbs of $T$ are reduced one by one, which is done by a partial reduction. The first partial reduction looks like so:

$$\begin{align*}
T^{(1)} = T \cdot B^{-1} \bmod n &= T_{2l-1} \cdot B^{2l-2} + \dots + T_l \cdot B^{l-1} + T_{l-1} \cdot B^{l-2} + \dots + T_1 + (T_0 \cdot B^{-1} \bmod n) &\pmod{n}\\
&= T^{(1)}_{2l-1} \cdot B^{2l-2} + \dots + T^{(1)}_l \cdot B^{l-1} + T^{(1)}_{l-1} \cdot B^{l-2} + \dots + T^{(1)}_1 &\pmod{n}
\end{align*}$$

And after the next partial reduction, we will have:

$$\begin{align*}
T^{(2)} = T \cdot B^{-2} \bmod n &= T^{(1)}_{2l-1} \cdot B^{2l-3} + \dots + T^{(1)}_l \cdot B^{l-2} + T^{(1)}_{l-1} \cdot B^{l-3} + \dots + T^{(1)}_2 + (T^{(1)}_1 \cdot B^{-1} \bmod n) &\pmod{n}\\
&= T^{(2)}_{2l-1} \cdot B^{2l-3} + \dots + T^{(2)}_l \cdot B^{l-2} + T^{(2)}_{l-1} \cdot B^{l-3} + \dots + T^{(2)}_2 &\pmod{n}
\end{align*}$$

and so on. After $l$ iterations, we have $T \cdot B^{-l} \bmod n$ with $l$ limbs. The values of the limbs change at each round because $T^{(i)}_i \cdot B^{-1} \bmod n$ is also a multi-precision integer.

How to multiply by the inverse of $B$?

At round $i$, we need to calculate $(T^{(i-1)}_{i-1} \cdot B^{-1} \bmod n)$. This can be done efficiently, similarly to the single-precision case, by adding some multiple of $n$ to make the value divisible by $B$:

1. Precompute $n' := -n^{-1} \bmod B$.
2. Calculate $m_i := T^{(i-1)}_{i-1} \cdot n' \bmod B$. Then we have $T^{(i-1)}_{i-1} + m_i \cdot n \equiv T^{(i-1)}_{i-1} - T^{(i-1)}_{i-1} \cdot n^{-1} \cdot n \equiv 0 \bmod B$.
3. Calculate $T^{(i-1)}_{i-1} \cdot B^{-1} \equiv (T^{(i-1)}_{i-1} + m_i \cdot n) \cdot B^{-1} \equiv (T^{(i-1)}_{i-1} + m_i \cdot n) \gg w \bmod n$.

Since $m_i < B$ and $T^{(i-1)}_{i-1} < B$, we have

$$(T^{(i-1)}_{i-1} + m_i \cdot n)/B \leq (B - 1 + (B - 1) \cdot n)/B = n + (B - 1 - n)/B < n$$

which means after step 3, we don't have to worry about subtracting $n$.
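As a sanity check, the following Python sketch performs a single partial reduction round on toy parameters (8-bit limbs and an assumed 32-bit odd modulus) and verifies that it multiplies by $B^{-1}$ modulo $n$ while zeroing out the lowest limb.

```python
# One partial reduction round: T -> (T + m*n) >> w, with m chosen so the low limb vanishes.
w = 8
B = 1 << w
n = 0xF1234567                            # odd modulus (illustrative), l = 4 limbs
n_prime = (-pow(n, -1, B)) % B            # n' = -n^{-1} mod B

T = 0x0123456789ABCDEF                    # any value < n * B^4 works here
m = ((T & (B - 1)) * n_prime) & (B - 1)   # m_1 computed from the lowest limb of T
assert (T + m * n) % B == 0               # the lowest limb becomes zero
T1 = (T + m * n) >> w                     # the partial reduction: T^{(1)}

assert T1 % n == (T * pow(B, -1, n)) % n  # T^{(1)} = T * B^{-1} mod n
```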

How many limbs are needed?

While reducing, we add $m_i \cdot n$ every round, which can cause a potential overflow in the largest limb. Let's calculate how much is added in total. If we omit dividing by $B$, we can see that at each round $i$ we add $m_i \cdot n \cdot B^{i-1}$. This can be confusing, but if you think of round $i$ as adding some value $m_i \cdot n$ to convert the $(i-1)$-th limb to 0, it becomes easier to see.

$$\text{Total added} = \sum_{i=1}^{l} m_i \cdot n \cdot B^{i-1} \leq (B-1) \cdot n \sum_{i=1}^{l} B^{i-1} = n \cdot (B^l - 1) < n \cdot R$$

(using $m_i \leq B - 1$). This gives that the total sum can be $T + n \cdot R < R^2 + n \cdot R < 2R^2 = 2 \cdot 2^{2lw} = 2^{2lw+1}$, which needs $2l + 1$ limbs. For $T$, we already need $2l$ limbs, and after the first step the least significant limb becomes 0 and can be discarded. In the first round, we have a maximum value of $T + n < n \cdot R + n = n(R + 1) \leq (R - 1)(R + 1) = R^2 - 1$, so it fits within the $2l$ limbs. So, if we discard the least significant limb after the first round, we don't actually need extra limb space to handle the overflow.

Summary of the Iteration step

A complete round $i$ of partial reduction can be summarized by the following steps:

1. Compute $m_i = n' \cdot T^{(i-1)}_{i-1} \bmod B$ (which costs a $1 \times 1$ limb multiplication).
2. Compute $m_i \cdot n$ (which costs a $1 \times l$ limb multiplication).
3. Set $T^{(i)} = (T^{(i-1)} + m_i \cdot n) \gg w$.

Now, let's calculate the upper bound on the final result of the reductions. We saw above that the total added value has an upper bound of $n \cdot R$, and using $T < n \cdot R$, we have:

$$T^{(l)} < \frac{T + n \cdot R}{R} < \frac{2n \cdot R}{R} = 2n$$

Therefore, the reduced value may require one additional subtraction to bring it within the range $[0, \dots, n-1]$. Since we need $l + 1$ limb multiplications for steps 1 and 2, over $l$ rounds we will need a total of $l^2 + l$ limb multiplications.
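Putting the $l$ rounds together, here is a compact Python sketch of the reduction loop described above. The word size, limb count, and modulus are illustrative, and Python's big integers stand in for explicit limb arrays.

```python
# SOS-style multi-precision Montgomery reduction: one partial reduction per round.
w, l = 8, 4                          # 8-bit limbs, l = 4 limbs (toy sizes)
B = 1 << w                           # limb base B = 2^w
R = B ** l                           # R = B^l
n = 0xF1234567                       # odd modulus with l limbs (illustrative)
n_prime = (-pow(n, -1, B)) % B       # n' = -n^{-1} mod B, a single limb

def redc_sos(T):
    """Return T * R^{-1} mod n for 0 <= T < n*R."""
    t = T
    for _ in range(l):
        m = ((t & (B - 1)) * n_prime) & (B - 1)   # step 1: m_i from the lowest limb
        t = (t + m * n) >> w                      # steps 2-3: add m_i*n, shift one limb
    return t - n if t >= n else t                 # final result is below 2n

a_bar, b_bar = 0x89ABCDEF % n, 0x12345678 % n
T = a_bar * b_bar                    # T < n^2 < n*R
assert redc_sos(T) == (T * pow(R, -1, n)) % n
```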

SOS with Ingonyama Optimization

We have seen that multi-precision Montgomery Reduction needs $l^2 + l$ limb multiplications in total. However, Ingonyama showed that we can reduce it to $l^2 + 1$. Let's see what happens if we just precompute $B' = B^{-1} \bmod n$ and multiply the lowest limb by $B'$. Then, partial reduction round $i$ becomes:

1. Compute $m'_i = T^{(i-1)}_{i-1} \cdot B'$, which is a $1 \times l$ limb multiplication, resulting in an $l + 1$ limb integer.
2. Set $T^{(i)} = (T^{(i-1)} \gg w) + m'_i$.

Then, similar to the previous calculation, the upper bound on the total addition will become:

$$\text{Total added} = \sum_{i=1}^{l} m'_i \cdot B^{i-1} < (B-1) \cdot n \sum_{i=1}^{l} B^{i-1} = n \cdot (B^l - 1) < n \cdot R$$

(since $m'_i = T^{(i-1)}_{i-1} \cdot B' \leq (B-1)(n-1)$). So it hasn't changed, which means the upper bound on the value after $l$ reductions will also be the same:

$$T^{(l)} < \frac{T + n \cdot R}{R} < \frac{2n \cdot R}{R} = 2n$$

Applying this shortcut in every round would give a total cost of $l \cdot l = l^2$ limb multiplications. However, Ingonyama's article mentions that the shortcut cannot be applied in the last step, because $T^{(l)} = (T^{(l-1)} \gg w) + m'_l$ results in an $l + 1$ limb integer (the addition happens after the bit shift, unlike the original version), while in the traditional case we are left with $l$ limbs. Performing the final round with the standard reduction instead costs $l + 1$ limb multiplications, which gives the total of $(l-1) \cdot l + (l+1) = l^2 + 1$ quoted above.
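The following Python sketch implements this variant under the assumptions above: the $B'$ shortcut is used for the first $l - 1$ rounds and the standard reduction for the final round, with a small correction loop at the end. Parameters are toy values.

```python
# Ingonyama-style variant: replace m_i*n with (lowest limb)*B' for the first l-1 rounds.
w, l = 8, 4
B = 1 << w
R = B ** l
n = 0xF1234567                        # odd modulus with l limbs (illustrative)
n_prime = (-pow(n, -1, B)) % B        # n' = -n^{-1} mod B (used only in the last round)
B_inv = pow(B, -1, n)                 # precomputed B' = B^{-1} mod n

def redc_ingonyama(T):
    """Return T * R^{-1} mod n for 0 <= T < n*R."""
    t = T
    for _ in range(l - 1):                        # shortcut rounds: one 1 x l multiplication each
        t = (t >> w) + (t & (B - 1)) * B_inv      # t <- floor(t/B) + t_0 * B'  (== t * B^{-1} mod n)
    m = ((t & (B - 1)) * n_prime) & (B - 1)       # final round: standard reduction
    t = (t + m * n) >> w
    while t >= n:                                 # final correction into [0, n)
        t -= n
    return t

a_bar, b_bar = 0x89ABCDEF % n, 0x12345678 % n
T = a_bar * b_bar
assert redc_ingonyama(T) == (T * pow(R, -1, n)) % n
```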

Coarsely Integrated Operands Scanning (CIOS)

In the SOS method above, we iteratively divided $t$ by $B$, $l$ times, where $B$ denotes the limb base ($2^w$) and $l$ denotes the number of limbs. $l$ iterations give the desired result $t \cdot R^{-1} \bmod n$ since $R = B^l$; we can then discard the lower $l$ limbs, which is technically equivalent to dividing by $R$.

Naively speaking, the SOS algorithm follows this sequence:

1. Multiply two large integers in Montgomery form using a textbook method.
2. Perform Montgomery reduction on the product from step 1 to produce $\bar{c} = \mathsf{REDC}(\bar{a} \cdot \bar{b})$ (the product $\bar{a} \cdot \bar{b}$ equals $\bar{c} \cdot R$).

This separation of the two phases is what gives SOS the "Separated" in its name.

CIOS, on the other hand, is different from SOS in that it interleaves the multiplication and reduction steps, limb by limb.

Pseudocode

Now we alternate between iterations of the outer loops for multiplication and reduction.

for i = 0 to l-1
    C := 0
    for j = 0 to l-1                         // a * b[i]
        (C, t[j]) := t[j] + a[j]*b[i] + C
    (t[l+1], t[l]) := t[l] + C
    C := 0
    m := t[0]*n'[0] mod W                    // partial m, computed from t[0] only (W = 2^w)
    (C, _) := t[0] + m*n[0]                  // t[0] + m*n[0] = 0 mod W, keep only the carry
    for j = 1 to l-1                         // (t + m*n) >> w
        (C, t[j-1]) := t[j] + m*n[j] + C
    (C, t[l-1]) := t[l] + C
    t[l] := t[l+1] + C

Interleaving multiplication and reduction is valid since the value of m in the i-th iteration of the outer loop for reduction depends only on the value t[i], which is completely computed by the i-th iteration of the outer loop for the multiplication.
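Below is a runnable Python version of the pseudocode above, operating on little-endian limb arrays. The limb width, limb count, and modulus are toy values chosen for illustration.

```python
# CIOS Montgomery multiplication on little-endian limb arrays (sketch of the pseudocode above).
w, l = 8, 4                       # 8-bit limbs, l = 4 limbs (toy sizes)
W = 1 << w
MASK = W - 1
N_INT = 0xF1234567                # odd modulus (illustrative)
N = [(N_INT >> (w * i)) & MASK for i in range(l)]
N0_PRIME = (-pow(N_INT, -1, W)) % W          # n'[0] = -n^{-1} mod W

def to_limbs(x):   return [(x >> (w * i)) & MASK for i in range(l)]
def from_limbs(a): return sum(v << (w * i) for i, v in enumerate(a))

def cios_mont_mul(a, b):
    """Return limbs of a * b * R^{-1} mod n, where R = W^l and a, b < n (given as limbs)."""
    t = [0] * (l + 2)
    for i in range(l):
        C = 0                                        # multiplication step: t += a * b[i]
        for j in range(l):
            s = t[j] + a[j] * b[i] + C
            t[j], C = s & MASK, s >> w
        s = t[l] + C
        t[l], t[l + 1] = s & MASK, s >> w
        m = (t[0] * N0_PRIME) & MASK                 # reduction step: t := (t + m*n) >> w
        C = (t[0] + m * N[0]) >> w                   # low limb becomes 0, keep the carry
        for j in range(1, l):
            s = t[j] + m * N[j] + C
            t[j - 1], C = s & MASK, s >> w
        s = t[l] + C
        t[l - 1], C = s & MASK, s >> w
        t[l] = t[l + 1] + C
    r = from_limbs(t[:l]) + (t[l] << (w * l))        # result fits in l limbs plus a carry
    return to_limbs(r - N_INT if r >= N_INT else r)  # final conditional subtraction

R = W ** l
a, b = 0x89ABCDEF % N_INT, 0x12345678 % N_INT
c = from_limbs(cios_mont_mul(to_limbs(a), to_limbs(b)))
assert c == (a * b * pow(R, -1, N_INT)) % N_INT
```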

Cost analysis

|                | CIOS        | SOS         | SOS + Ingonyama |
| -------------- | ----------- | ----------- | --------------- |
| Addition       | $4l^2+4l+2$ | $4l^2+4l+2$ | $4l^2+4l+2$     |
| Multiplication | $2l^2+l$    | $2l^2+l$    | $2l^2+1$        |
| Memory space   | $l+3$       | $2l+2$      | $2l+2$          |

SOS with Ingonyama's optimization saves $l-1$ multiplications ($2l^2+1$ vs. $2l^2+l$), whereas CIOS saves $l-1$ words of memory compared to SOS ($l+3$ vs. $2l+2$) owing to not storing the full product $t$. CIOS can be further optimized using EdMSM's trick.

Written by BATZORIG ZORIGOO and Carson Lee from A41

References

https://en.wikipedia.org/wiki/Montgomery_modular_multiplication
https://hackmd.io/@Ingonyama/Barret-Montgomery#Barrett-Montgomery-duality
Analyzing and Comparing Montgomery Multiplication Algorithms
EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication