Montgomery Reduction

Montgomery reduction provides an efficient way to perform modular arithmetic, particularly modular multiplication $a\cdot b \mod n$. This is crucial when the modulus $n$ is large and many operations are needed, as in cryptography. Montgomery reduction applies when $n$ is odd and aims to replace slow divisions by $n$ with cheap bit operations.

The Core Idea

The method shifts calculations into a special "Montgomery domain" defined by an auxiliary modulus $R$. Arithmetic happens within this domain, where the reduction step is cleverly transformed into a division by $R$. Since $R$ is chosen as a power of 2, this division is just a fast bit shift, and the remainder modulo $R$ is just the low bits. Results are converted back to the standard representation once the sequence of operations is complete.

The Math Explained

The Montgomery Representation

We choose an auxiliary modulus $R=b^k$ (where $b$ is the base, usually 2) such that $R>n$. Critically, $n$ and $R$ must be coprime: $\gcd(n,R)=1$. In this case, this just implies $n$ must be odd.

Definition 1. The Montgomery representation $\bar{x}$ of an integer $x$ (where $0\leq x<n$) is $\bar{x} = x \cdot R \mod n$.

Since $n$ and $R$ are coprime, we can define the following conversion functions:

  • $\mathsf{fromMont}(\bar{x})$: Converts $\bar{x}$ back to standard form: $x=\bar{x}\cdot R^{-1} \mod n$. This operation is also called Montgomery Reduction.

  • $\mathsf{toMont}(x)$: Converts $x$ into Montgomery form: $\bar{x}=x\cdot b^k \mod n$. This calculation itself can be optimized with Montgomery Reduction.
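As a quick illustration, here is a minimal Python sketch of these two conversions using toy parameters (a hypothetical modulus $n = 97$ and $R = 2^7$) and Python's built-in modular inverse; a real implementation would instead use $\mathsf{REDC}$ as described below.

```python
# Toy parameters (illustrative only): n must be odd, R = 2^k a power of two with R > n.
n = 97
k = 7
R = 1 << k                 # R = 128 > n, gcd(n, R) = 1

R_inv = pow(R, -1, n)      # R^{-1} mod n, exists because gcd(n, R) = 1

def to_mont(x):
    """toMont(x): x -> x * R mod n."""
    return (x * R) % n

def from_mont(x_bar):
    """fromMont(x_bar): x * R mod n -> x, i.e. multiply by R^{-1} mod n."""
    return (x_bar * R_inv) % n

assert from_mont(to_mont(42)) == 42
```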

Lemma 1. The remainder of division by $R=2^k$ can be computed efficiently as $x \mod R = \mathsf{lowbits}_k(x)$, where $\mathsf{lowbits}_k(x)$ takes the least significant $k$ bits of $x$.

Lemma 2. The quotient $\lfloor x / R\rfloor$ can be computed efficiently with a bit shift: $x \gg k$.
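Both lemmas are simple bit operations; a quick sanity check in Python (continuing with $k = 7$):

```python
k = 7
R = 1 << k
x = 12345

assert x % R == x & (R - 1)   # Lemma 1: x mod 2^k is the lowest k bits
assert x // R == x >> k       # Lemma 2: floor(x / 2^k) is a right shift by k
```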

Montgomery Multiplication

Given two numbers in Montgomery form, $\bar{a}=a\cdot R \mod n$ and $\bar{b}=b\cdot R \mod n$, we would like to calculate $\bar{c}\equiv (a\cdot b)\cdot R \mod n$. This can be computed as $\bar{a}\cdot \bar{b} \cdot R^{-1}$ since:

$$\bar{a}\cdot\bar{b}\equiv (a\cdot R)\cdot(b\cdot R)\equiv a\cdot b \cdot R^2\equiv \bar{c}\cdot R\mod n$$

To perform Montgomery multiplication efficiently, we need to be able to calculate:

$$\bar{c}=\mathsf{doMontMul}(\bar{a},\bar{b})=\bar{a}\cdot\bar{b}\cdot R^{-1} \mod n$$

To get the actual $c=a\cdot b \mod n$, we additionally convert back from Montgomery form using $\mathsf{fromMont}(\bar{c})$.
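A small numeric check of this identity, reusing the toy parameters above (the direct multiplication by $R^{-1}$ is only a slow reference here; the efficient version is $\mathsf{REDC}$, described next):

```python
n, k = 97, 7
R = 1 << k
R_inv = pow(R, -1, n)

a, b = 13, 29
a_bar, b_bar = (a * R) % n, (b * R) % n

c_bar = (a_bar * b_bar * R_inv) % n         # doMontMul, done the slow reference way
assert c_bar == (a * b * R) % n             # c_bar is (a*b) in Montgomery form
assert (c_bar * R_inv) % n == (a * b) % n   # fromMont(c_bar) recovers a*b mod n
```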

Montgomery Reduction

Definition 2. Montgomery Reduction is an efficient reduction method to calculate $T\cdot R^{-1}\mod n$ for a given $T < n\cdot R$. This operation can also be denoted as:

$$\mathsf{REDC}(T):=T\cdot R^{-1} \mod n$$

The key idea to perform Montgomery Reduction efficiently is to find an integer $m$ such that $T + m\cdot n$ is exactly divisible by $R$. Then the desired result can be computed as:

$$T\cdot R^{-1}\equiv\frac{T+m\cdot n}{R}\equiv(T+m\cdot n)\gg k \mod n$$

What will happen if we precompute $R^{-1}$ to compute this? Multiplying by a precomputed $R^{-1}$ still leaves a product that must be reduced modulo $n$, which is exactly the expensive division we are trying to avoid.

We need to find $m$ such that $T+m\cdot n\equiv 0 \mod R$:

$$\begin{align*} T+m\cdot n&\equiv 0 &\mod R\\ m\cdot n&\equiv -T &\mod R\\ m &\equiv -T\cdot n^{-1} &\mod R\\ m &\equiv T\cdot(-n^{-1}) &\mod R \end{align*}$$

To calculate $m$ efficiently, we use a precomputed value $n'=-n^{-1} \mod R$. This $n'$ exists because $\gcd(n,R)=1$. Now we have $m\coloneqq T\cdot n'\mod R$ and $\mathsf{REDC}$ can be defined as:

$$\mathsf{REDC}(T)=(T+m\cdot n)\gg k \mod n$$

And we have:

  • $T<n\cdot R$, so $(T\gg k)=T/R< n$

  • $m < R$, so $(m\cdot n \gg k) = m\cdot n / R< n$

Hence $((T+m\cdot n)\gg k) < 2n$, so we only need to subtract $n$ once, when $((T+m\cdot n)\gg k)\geq n$, to complete the reduction.

Overview of REDC(T):

Given $T < n\cdot R$ and precomputed $n' \coloneqq -n^{-1} \mod R$:

  1. Calculate $m=(T\cdot n') \mod R=\mathsf{lowbits}_k(\mathsf{lowbits}_k(T)\cdot n')$. (Effectively: multiply the lowest $k$ bits of $T$ by $n'$ and take the lowest $k$ bits of the product.)

  2. Compute $x=(T+m\cdot n)/R$. (This involves one multiplication $m\cdot n$, one addition, and one right shift by $k$.)

  3. Return $x-n$ if $x\geq n$, otherwise return $x$.
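A minimal Python sketch of these three steps (toy parameters as before; the helper name redc is illustrative):

```python
def redc(T, n, k, n_prime):
    """Montgomery reduction: return T * R^{-1} mod n for 0 <= T < n * R, where R = 2^k
    and n_prime = -n^{-1} mod R is precomputed once per modulus."""
    R_mask = (1 << k) - 1
    m = ((T & R_mask) * n_prime) & R_mask    # step 1: m = (T mod R) * n' mod R
    x = (T + m * n) >> k                     # step 2: (T + m*n) / R, an exact division
    return x - n if x >= n else x            # step 3: at most one subtraction

# Toy parameters (illustrative)
n, k = 97, 7
R = 1 << k
n_prime = (-pow(n, -1, R)) % R               # n' = -n^{-1} mod R

T = 5000                                     # any T < n * R
assert redc(T, n, k, n_prime) == (T * pow(R, -1, n)) % n
```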

Summary of Montgomery Reduction

With some precomputation and Montgomery Reduction $\mathsf{REDC}$, we can implement the $\mathsf{doMontMul}(\bar{a}, \bar{b})$, $\mathsf{toMont}(x)$, and $\mathsf{fromMont}(\bar{x})$ operations as follows:

Precompute $Rsq' := R^2 \mod n$ and $n':=-(n^{-1})\mod R$:

$$\begin{aligned} \mathsf{toMont}(x)&=x\cdot R \mod n = \mathsf{REDC}(x\cdot R^2) = \mathsf{REDC}(x\cdot Rsq')\\ \mathsf{fromMont}(\bar{x}) &= \mathsf{REDC}(\bar{x}) \\ \mathsf{doMontMul}(\bar{a},\bar{b})&=\mathsf{REDC}(\bar{a}\cdot\bar{b}) \end{aligned}$$
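Putting it together, here is a sketch that wires the three operations to the redc helper above (all parameter names are illustrative):

```python
n, k = 97, 7
R = 1 << k
n_prime = (-pow(n, -1, R)) % R
R_sq = (R * R) % n                            # precomputed Rsq' = R^2 mod n

def to_mont(x):
    return redc(x * R_sq, n, k, n_prime)      # x * R^2 * R^{-1} = x * R mod n

def from_mont(x_bar):
    return redc(x_bar, n, k, n_prime)         # x_bar * R^{-1} = x mod n

def mont_mul(a_bar, b_bar):
    return redc(a_bar * b_bar, n, k, n_prime) # (aR)(bR) * R^{-1} = a*b*R mod n

a, b = 55, 73
c_bar = mont_mul(to_mont(a), to_mont(b))
assert from_mont(c_bar) == (a * b) % n        # full modular multiplication, no division by n
```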

Multi-precision Montgomery Reduction

Separated Operands Scanning (SOS)

Iterative Partial Reduction

In a multi-precision setting, suppose the modulus $n$ has $l$ limbs. Let $R=B^l$, where $w$ is the limb bit width and $B=2^w$. To calculate $T\cdot R^{-1}=T\cdot B^{-l}\mod n$, multi-precision Montgomery Reduction operates iteratively on:

T=T2l1B2l1++TlBl+Tl1Bl1++T0T = T_{2l-1}\cdot B^{2l-1} + \dots + T_l\cdot B^l + T_{l-1}\cdot B^{l-1} + \dots + T_0

where $T_i$ denotes the limb values. At each iteration, the limbs of $T$ are reduced one at a time by a partial reduction. The first partial reduction looks like this:

$$\begin{align*} T^{(1)}=T\cdot B^{-1} \mod n &= T_{2l-1}\cdot B^{2l-2} + \dots + T_l\cdot B^{l-1} + T_{l-1}\cdot B^{l-2} + \dots + T_1 + (T_0\cdot B^{-1} \mod n) &\mod n\\ &= T^{(1)}_{2l-1}\cdot B^{2l-2} + \dots + T^{(1)}_l\cdot B^{l-1} + T^{(1)}_{l-1}\cdot B^{l-2} + \dots + T^{(1)}_1 &\mod n \end{align*}$$

And after the next partial reduction, we will have:

$$\begin{align*} T^{(2)}=T\cdot B^{-2} \mod n &= T^{(1)}_{2l-1}\cdot B^{2l-3} + \dots + T^{(1)}_l\cdot B^{l-2} + T^{(1)}_{l-1}\cdot B^{l-3} + \dots + T^{(1)}_2+(T^{(1)}_1\cdot B^{-1} \mod n) &\mod n\\ &=T^{(2)}_{2l-1}\cdot B^{2l-3} + \dots + T^{(2)}_l\cdot B^{l-2} + T^{(2)}_{l-1}\cdot B^{l-3} + \dots + T^{(2)}_2 &\mod n \end{align*}$$

and so on. After $l$ iterations, we have $T\cdot B^{-l}\mod n$ with $l$ limbs. The limb values change at each round because $T^{(i-1)}_{i-1} \cdot B^{-1} \mod n$ is itself a multi-precision integer.

How to multiply by the inverse of $B$?

At round $i$, we need to calculate $T^{(i-1)}_{i-1}\cdot B^{-1} \mod n$. As in the single-precision case, this can be done efficiently by adding a suitable multiple of $n$ to make the value divisible by $B$:

  1. Precompute $n'\coloneqq -n^{-1}\mod B$.

  2. Calculate $m_i\coloneqq T^{(i-1)}_{i-1}\cdot n' \mod B$. Then we have $T^{(i-1)}_{i-1} + m_i\cdot n \equiv T^{(i-1)}_{i-1} -T^{(i-1)}_{i-1}\cdot n^{-1}\cdot n \equiv 0\mod B$.

  3. Calculate $T^{(i-1)}_{i-1}\cdot B^{-1} \equiv (T^{(i-1)}_{i-1} + m_i\cdot n)\cdot B^{-1}\equiv (T^{(i-1)}_{i-1} + m_i\cdot n)\gg w\mod n$.

Since $m_i < B$ and $T^{(i-1)}_{i-1} < B$, we have

$$(T^{(i-1)}_{i-1} +m_i\cdot n) / B \leq (B-1 + (B-1)\cdot n)/B=n + (B-1-n)/B < n$$

which means that after step 3 we don't have to worry about subtracting $n$.
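A toy check of a single partial-reduction step (assuming $w = 8$-bit limbs and an illustrative two-limb modulus): adding $m_i\cdot n$ really does clear the lowest limb, so the shift by $w$ loses nothing.

```python
w = 8
B = 1 << w
n = 0xF123                                # an odd multi-limb modulus (illustrative)
n_prime = (-pow(n, -1, B)) % B            # n' = -n^{-1} mod B

T = 0xABCDEF                              # some intermediate value
m = ((T & (B - 1)) * n_prime) & (B - 1)   # m_i computed from the lowest limb only
assert (T + m * n) % B == 0               # lowest limb cleared, so >> w is exact
```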

How many limbs are needed?

While reducing, we add $m_i\cdot n$ every round, which can overflow the largest limb. Let's calculate how much is added in total. If we omit the divisions by $B$, then at round $i$ we effectively add $m_i \cdot n\cdot B^{i-1}$. This can be confusing, but it is easier to see if you think of round $i$ as adding some value $m_i\cdot n$ to clear the $(i-1)$-th limb.

$$\text{Total added}=\sum_{i=1}^lm_i\cdot n \cdot B^{i-1}\leq n\sum_{i=1}^l(B-1) \cdot B^{i-1}=n\cdot (B^l-1)<n\cdot R$$

This means the total sum can be as large as $T+n\cdot R <R^2+n\cdot R < 2R^2=2\cdot2^{2lw}=2^{2lw+1}$, which needs $2l+1$ limbs. For $T$, we already need $2l$ limbs, and after the first step the least significant limb will be all zeros and can be discarded. In the first round, we have a maximum value of $T+n<n\cdot R + n=n(R+1)\leq(R-1)(R+1)=R^2-1$, so it fits within the $2l$ limbs. So, if we discard the least significant limb after the first round, we don't actually need extra limb space to handle the overflow.

Summary of the Iteration step

A complete round $i$ of partial reduction can be summarized by the following steps:

  1. Compute $m_i=n' \cdot T^{(i-1)}_{i-1} \mod B$ (which costs one $1\times 1$ limb multiplication).

  2. Compute $m_i\cdot n$ (which costs a $1 \times l$ limb multiplication).

  3. Set $T^{(i)}=(T^{(i-1)} + m_i\cdot n ) \gg w$.

Now let's calculate the upper bound on the final result of the reduction. We saw above that the total added value is bounded by $n\cdot R$, and using $T<n\cdot R$, we have:

$$T^{(l)} < \frac{T+n\cdot R}{R}<\frac{2n\cdot R}{R}=2n$$

Therefore, the reduced value may require one additional subtraction to bring it within the range $[0,\dots,n-1]$. Since steps 1 and 2 cost $l+1$ limb multiplications per round, over $l$ rounds we need a total of $l^2+l$ limb multiplications.
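The whole $l$-round reduction can be sketched compactly in Python; plain integers stand in for the limb array, and the names (mont_reduce_sos, n_prime) are illustrative:

```python
def mont_reduce_sos(T, n, w, l, n_prime):
    """SOS-style Montgomery reduction: return T * B^{-l} mod n for T < n * B^l,
    where B = 2^w, n has l limbs, and n_prime = -n^{-1} mod B."""
    mask = (1 << w) - 1
    for _ in range(l):
        m = ((T & mask) * n_prime) & mask    # step 1: m_i from the lowest remaining limb
        T = (T + m * n) >> w                 # steps 2-3: low limb becomes 0, shift is exact
    return T - n if T >= n else T            # final value is < 2n, one subtraction suffices

# Illustrative parameters: four 8-bit limbs
w, l = 8, 4
B, R = 1 << w, 1 << (w * l)
n = 0xF1234567                               # odd l-limb modulus
n_prime = (-pow(n, -1, B)) % B

T = 123456789                                # any T < n * R
assert mont_reduce_sos(T, n, w, l, n_prime) == (T * pow(R, -1, n)) % n
```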

SOS with Ingonyama Optimization

We have seen that multi-precision Montgomery Reduction needs $l^2+l$ limb multiplications in total. However, Ingonyama showed that we can reduce this to $l^2+1$. Let's see what happens if we just precompute $B'=B^{-1} \mod n$ and multiply the free coefficient (the lowest limb) by $B'$. Then partial reduction round $i$ becomes:

  1. Compute $m'_i=T^{(i-1)}_{i-1}\cdot B'$, which is a $1\times l$ limb multiplication resulting in an $(l+1)$-limb integer.

  2. Set $T^{(i)}=(T^{(i-1)}\gg w) + m'_i$.

Then, similarly to the previous calculation, the upper bound on the total addition becomes:

$$\text{Total added}=\sum_{i=1}^l m'_i \cdot B^{i-1}<\sum_{i=1}^l n\cdot (B-1) \cdot B^{i-1}=n\cdot (B^l-1)<n\cdot R$$

So the bound hasn't changed, which means the upper bound on the value after $l$ reductions is also the same:

$$T^{(l)} < \frac{T+n\cdot R}{R}<\frac{2n\cdot R}{R}=2n$$

This gives a total cost of $l\cdot l =l^2$ limb multiplications. Ingonyama's article mentions that this trick cannot be applied in the last step, because $T^{(l)}=(T^{(l-1)}\gg w) + m'_l$ results in an $(l+1)$-limb integer, whereas the traditional method ends with $l$ limbs. This is indeed true since the addition happens after the bit shift, unlike in the original version, but as long as the result is less than $2n$, it shouldn't matter.
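A congruence-level Python sketch of this variant (not Ingonyama's exact limb-level algorithm; the final reduction modulo $n$ stands in for the bound handling discussed above, and all names are illustrative):

```python
def mont_reduce_precomputed_binv(T, n, w, l):
    """Variant sketch: precompute B' = B^{-1} mod n once, then each round shifts first
    and adds T_i * B', which is congruent to multiplying by B^{-1} mod n."""
    B = 1 << w
    B_inv = pow(B, -1, n)            # precomputed B' = B^{-1} mod n
    mask = B - 1
    for _ in range(l):
        m = (T & mask) * B_inv       # m'_i = T_i * B', a 1 x l limb multiplication
        T = (T >> w) + m             # shift first, then add the folded-in low limb
    return T % n                     # the sketch simply reduces fully at the end

w, l = 8, 4
R = 1 << (w * l)
n = 0xF1234567
T = 987654321
assert mont_reduce_precomputed_binv(T, n, w, l) == (T * pow(R, -1, n)) % n
```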

Coarsely Integrated Operands Scanning (CIOS)

In the SOS method above, we iteratively divided $t$ by $B$ a total of $l$ times, where $B$ denotes the limb base ($2^w$) and $l$ denotes the number of limbs. $l$ iterations give the desired result $t \cdot R^{-1} \mod n$ since $R = B^l$. Now we can discard the lower $l$ limbs, which is technically equivalent to dividing by $R$.

Naively speaking, the SOS algorithm follows this sequence:

  1. Multiply two large integers in Montgomery form using a textbook method.

  2. Perform Montgomery reduction to produce $\bar{c}$ from the product $\bar{a} \cdot \bar{b} \equiv \bar{c}\cdot R \mod n$.

which is where the "separated" in SOS comes from.

CIOS, on the other hand, is different from SOS in that it interleaves the multiplication and reduction steps, limb by limb.

Pseudocode

Now we alternate between iterations of the outer loops for multiplication and reduction.

for i = 0 to l-1
    C := 0
    for j = 0 to l-1                         // multiplication step: t := t + a * b[i]
        (C, t[j]) := t[j] + a[j]*b[i] + C
    (t[l+1], t[l]) := t[l] + C               // propagate the final carry
    C := 0
    m := t[0]*n'[0] mod B                    // partial m, computed from t[0] only
    (C, _) := t[0] + m*n[0]                  // t[0] + m*n[0] ≡ 0 mod B; keep the carry
    for j = 1 to l-1                         // reduction step: t := (t + m*n) >> w
        (C, t[j-1]) := t[j] + m*n[j] + C
    (C, t[l-1]) := t[l] + C
    t[l] := t[l+1] + C

Interleaving multiplication and reduction is valid since the value of $m$ in the $i$-th iteration of the outer loop for reduction depends only on the value $t[i]$ (which, after the shifts in the code above, sits at $t[0]$), and that value is completely computed by the $i$-th iteration of the outer loop for the multiplication.
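A Python sketch mirroring the pseudocode above, with little-endian limb arrays (the function and parameter names are illustrative):

```python
def cios_mont_mul(a, b, n, w, l, n0_prime):
    """CIOS Montgomery multiplication: return a * b * R^{-1} mod n as l limbs,
    where R = 2^(w*l), a and b are little-endian l-limb values < n,
    n is the l-limb odd modulus, and n0_prime = -n[0]^{-1} mod 2^w."""
    mask = (1 << w) - 1
    t = [0] * (l + 2)
    for i in range(l):
        # multiplication step: t := t + a * b[i]
        carry = 0
        for j in range(l):
            s = t[j] + a[j] * b[i] + carry
            t[j], carry = s & mask, s >> w
        s = t[l] + carry
        t[l], t[l + 1] = s & mask, s >> w
        # reduction step: t := (t + m*n) >> w, with m chosen so t[0] + m*n[0] ≡ 0 mod 2^w
        m = (t[0] * n0_prime) & mask
        carry = (t[0] + m * n[0]) >> w
        for j in range(1, l):
            s = t[j] + m * n[j] + carry
            t[j - 1], carry = s & mask, s >> w
        s = t[l] + carry
        t[l - 1], t[l] = s & mask, t[l + 1] + (s >> w)
    # final conditional subtraction: the value held in t[0..l] is < 2n
    result = sum(limb << (w * i) for i, limb in enumerate(t[:l + 1]))
    n_val = sum(limb << (w * i) for i, limb in enumerate(n))
    result = result - n_val if result >= n_val else result
    return [(result >> (w * i)) & mask for i in range(l)]

# Illustrative check with four 8-bit limbs
w, l = 8, 4
n_int = 0xF1234567
to_limbs = lambda x: [(x >> (w * i)) & 0xFF for i in range(l)]
n0_prime = (-pow(n_int & 0xFF, -1, 1 << w)) % (1 << w)
a_int, b_int = 0x12345678 % n_int, 0x0BADF00D % n_int
out = cios_mont_mul(to_limbs(a_int), to_limbs(b_int), to_limbs(n_int), w, l, n0_prime)
assert sum(v << (w * i) for i, v in enumerate(out)) == (a_int * b_int * pow(1 << (w * l), -1, n_int)) % n_int
```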

Cost analysis

|                | CIOS            | SOS             | SOS + Ingonyama |
| -------------- | --------------- | --------------- | --------------- |
| Addition       | $4l^2 + 4l + 2$ | $4l^2 + 4l + 2$ | $4l^2 + 4l + 2$ |
| Multiplication | $2l^2 + l$      | $2l^2 + l$      | $2l^2 + 1$      |
| Memory space   | $l + 3$         | $2l + 2$        | $2l + 2$        |

SOS with Ingonyama's optimization saves $l-1$ multiplications ($2l^2 + 1$ vs. $2l^2 + l$), whereas CIOS saves $l-1$ words of memory compared to SOS ($l+3$ vs. $2l+2$) owing to not storing the full double-width $t$. CIOS can be further optimized using EdMSM's trick.

References

Written by BATZORIG ZORIGOO and Carson Lee from A41
