
EdMSM

ZPrize "Accelerating MSM on Mobile" winner


The two main optimizations contributed by this work are as follows:

  1. An improvement to CIOS (Coarsely Integrated Operand Scanning), one method of multi-precision Montgomery multiplication.

  2. A variant of Pippenger MSM with optimizations tailored for twisted Edwards curves.

This article will first introduce the optimization trick applicable to CIOS when the MSB of the modulus' most significant word is 0 and not all of the remaining bits are set (with 64-bit machine words, the most significant word of the modulus should be at most 0x7FFFFFFFFFFFFFFE).

Multiplication Optimization

For an explanation of the CIOS method of multi-precision Montgomery multiplication, refer to the Montgomery Reduction page written by us.

Normal CIOS

Implementation in Python

def cios(a, b, n, n_prime, w, t, l):
    """
    Montgomery multiplication using CIOS (Coarsely Integrated Operand Scanning).

    Args:
        a (list[int]): limbs of \bar{a} = aR mod n
        b (list[int]): limbs of \bar{b} = bR mod n
        n (list[int]): limbs of the modulus
        n_prime (list[int]): limbs of n' = -inv(n) mod R (only n_prime[0] is used)
        w (int): bit size of a limb, so the limb base is B = 2^w
        t (list[int]): l + 2 limbs of working space, initialized to 0
        l (int): number of limbs

    Returns:
        list[int]: t = REDC(\bar{a} * \bar{b}) = abR mod n, in limbs
                   (a final conditional subtraction of n may still be needed)
    """
    B = 1 << w  # limb base

    for i in range(l):
        carry = 0

        # Step 1: t[j] = t[j] + a[j] * b[i] + carry
        for j in range(l):
            total = t[j] + a[j] * b[i] + carry
            # divmod(p, q) divides p by q and returns a tuple (quotient, remainder)
            carry, t[j] = divmod(total, B)

        # (1)
        t[l + 1], t[l] = divmod(t[l] + carry, B)

        # Step 2: compute m so that t + m * n is divisible by B
        m = (t[0] * n_prime[0]) % B

        # (carry, _) := t[0] + m * n[0]
        total = t[0] + m * n[0]
        carry, _ = divmod(total, B)

        # (t + m * n) >> w
        for j in range(1, l):
            total = t[j] + m * n[j] + carry
            carry, t[j - 1] = divmod(total, B)

        # (carry, t[l - 1]) := t[l] + carry
        total = t[l] + carry
        carry, t[l - 1] = divmod(total, B)

        # (2)
        t[l] = t[l + 1] + carry

    return t
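As a quick sanity check, here is a toy invocation with small, hypothetical parameters (8-bit limbs and a made-up modulus, chosen only for illustration); the expected value follows from $\mathsf{REDC}(\bar{a} \cdot \bar{b}) = abR \bmod n$:

w, l = 8, 2                                      # 8-bit limbs, 2 limbs
B = 1 << w                                       # limb base, B = 256
R = B ** l                                       # Montgomery radix, R = 2^16
n_int = 0x7001                                   # odd; top limb 0x70 <= 0x7E
n = [n_int & (B - 1), n_int >> w]
n_prime = [(-pow(n_int, -1, B)) % B]             # n' = -n^{-1} mod B

def to_limbs(x):
    return [(x >> (w * i)) & (B - 1) for i in range(l)]

a_int, b_int = 0x1234, 0x0ABC
a_bar = to_limbs(a_int * R % n_int)              # \bar{a} = aR mod n
b_bar = to_limbs(b_int * R % n_int)              # \bar{b} = bR mod n

t = cios(a_bar, b_bar, n, n_prime, w, [0] * (l + 2), l)
res = sum(limb << (w * i) for i, limb in enumerate(t[:l + 1])) % n_int
assert res == a_int * b_int * R % n_int          # abR mod n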

Optimization

Now we will show that when the highest word of the modulus $p$ is at most $\frac{B-1}{2} - 1$ (which, for 64-bit machine words, means the most significant word of the modulus is at most 0x7FFFFFFFFFFFFFFE), the two highlighted additions (1) and (2) can be safely omitted.

Proof Sketch

Observe that for the last iterations of the two inner loops, the results have the form $(\mathsf{carry\_out}, t_{new}) := t_{old} + x_1 \cdot x_2 + \mathsf{carry\_in}$. We can derive the lemma below with a simple calculation.

Lemma 1. If one of $x_1$ and $x_2$ is at most $\frac{B-1}{2} - 1$, then $\mathsf{carry\_out} \le \frac{B-1}{2}$.

From this lemma, we can derive that if the highest word of $n$ is at most $\frac{B-1}{2} - 1$, then the variables $t[l]$ and $t[l+1]$ always store the value $0$ at the beginning of each iteration. This is shown by induction.

The base case ($i = 0$) is trivial, as $t$ is initialized to $0$.

At iteration $i$, suppose that $t[l] = t[l+1] = 0$ and trace the first inner loop through the iteration. As $\bar{a} < n$ and the highest word of $n$ is smaller than $\frac{B-1}{2}$, Lemma 1 shows that the carry out of the last iteration of the first inner loop is at most $\frac{B-1}{2}$. Then line (1) sets

$$t[l] = \mathsf{carry\_out}, \qquad t[l+1] = 0$$

Hence we can remove the addition from line (1).

The same reasoning applies to the second inner loop, as the highest word of $n$ is smaller than $\frac{B-1}{2}$. Since $t[l] + \mathsf{carry\_out}$ has the maximum value $\frac{B-1}{2} + \frac{B-1}{2} = B - 1$, it fits into a single limb and the updated $\mathsf{carry\_out}$ is $0$; moreover, $t[l+1]$ is untouched during the loop. Altogether this keeps the invariant true going through line (2):

$$t[l+1] = 0, \qquad t[l] = t[l+1] + 0 = 0$$

Hence we can remove line (2) as well.
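Putting this together, a minimal sketch of the resulting algorithm (mirroring the Python implementation above; only the lines marked (1) and (2) change, and $t$ now needs only $l + 1$ words):

def cios_edmsm(a, b, n, n_prime, w, t, l):
    """CIOS with the additions (1) and (2) removed. Valid only when the
    highest word of n is at most (B - 1)/2 - 1, so that t[l + 1] = 0
    always holds (t[l + 1] is dropped entirely)."""
    B = 1 << w
    for i in range(l):
        carry = 0
        for j in range(l):
            carry, t[j] = divmod(t[j] + a[j] * b[i] + carry, B)

        # (1) simplified: t[l] is provably 0 here and carry <= (B - 1)/2,
        # so neither the addition nor the carry into t[l + 1] is needed.
        t[l] = carry

        m = (t[0] * n_prime[0]) % B
        carry, _ = divmod(t[0] + m * n[0], B)
        for j in range(1, l):
            carry, t[j - 1] = divmod(t[j] + m * n[j] + carry, B)

        # (2) simplified: t[l] + carry <= B - 1 fits in a single limb,
        # so there is no carry out left to propagate.
        t[l - 1] = t[l] + carry

    return t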

Performance improvement

Experiments show that the new algorithm yields a 5-10% improvement over the normal CIOS algorithm.

|                | CIOS            | CIOS + EdMSM | SOS             | SOS + Ingonyama |
| -------------- | --------------- | ------------ | --------------- | --------------- |
| Addition       | $4l^2 + 4l + 2$ | $4l^2 - l$   | $4l^2 + 4l + 2$ | $4l^2 + 4l + 2$ |
| Multiplication | $2l^2 + l$      | $2l^2 + l$   | $2l^2 + l$      | $2l^2 + 1$      |
| Memory space   | $l + 3$         | $l + 3$      | $2l + 2$        | $2l + 2$        |

The Ingonyama trick saves $l - 1$ multiplications for SOS, whereas the EdMSM trick saves $5l + 2$ additions for CIOS; CIOS also uses $l - 1$ fewer words of memory than SOS.

Pippenger optimization

2-chain and 2-cycle of elliptic curves

Refer to the 2-Chain and 2-Cycle of Elliptic Curves page for the explanation of 2-chains and 2-cycles of elliptic curves.

The key property of such 2-chains of elliptic curves that we are going to exploit, in particular for those whose inner curve is a BLS curve, is the lemma below:

Lemma 2. All inner BLS curves admit a Short Weierstrass form $y^2 = x^3 + 1$, and they always admit a twisted Edwards form $ay^2 + x^2 = 1 + dx^2y^2$ (with $a = 2\sqrt{3} - 3$ and $d = -2\sqrt{3} - 3$ over $\mathbb{F}_p$).

Refer to Section 3.1 of the paper for the proof. Note that this implies that both curves in a 2-cycle admit a twisted Edwards form, by the same reasoning.

Pippenger's Algorithm (Bucket method)

Refer to the Pippenger MSM page written by us for the full background on Pippenger's algorithm.

Let $n$ be the size of the MSM, $\lambda$ be the maximum bit length of the scalars, $s$ be the window size, $P_i$ be the base points ($i \in [n]$), $W_i$ be the window sums ($i \in [\lambda / s]$), and $B_i$ be the bucket sums ($i \in [2^{s-1}]$).
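To make the step names used below concrete, here is a toy sketch of the bucket method. It is an illustration only: integers under addition stand in for curve points, and it uses the plain $2^s - 1$ unsigned buckets per window rather than the signed variant discussed below.

def pippenger_msm(scalars, points, lam, s):
    """Compute sum(k_i * P_i), windowed into ceil(lam / s) windows of s bits."""
    num_windows = -(-lam // s)                   # ceil(lam / s)
    total = 0
    for win in reversed(range(num_windows)):     # most significant window first
        # Bucket Accumulation: B_i collects every point whose current
        # s-bit digit equals i (digit 0 contributes nothing).
        buckets = [0] * (1 << s)
        for k, P in zip(scalars, points):
            digit = (k >> (win * s)) & ((1 << s) - 1)
            buckets[digit] += P
        # Bucket Reduction: W_win = sum_i i * B_i, via a running sum.
        running = window_sum = 0
        for b in reversed(buckets[1:]):
            running += b                         # running = B_m + ... + B_i
            window_sum += running
        # Window Reduction: total = sum_win 2^{win * s} * W_win (Horner form;
        # the shift would be s doublings on a real curve).
        total = (total << s) + window_sum
    return total

assert pippenger_msm([5, 3], [7, 11], lam=4, s=2) == 5 * 7 + 3 * 11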

Known optimizations

Using parallelism, we can achieve the best improvement when we have $\lambda/s$ cores and let each core compute one window sum $W_i$ in the Bucket Reduction step. But this increases memory usage, as each core needs to keep $2^s - 1$ buckets.

Precomputation may also improve the performance when the bases $P_1, \ldots, P_n$ are known in advance. For example, for each base $P_i$ we can choose $k$ as big as the storage allows and precompute the $k$ points $[2^s - k]P_i, \ldots, [2^s - 1]P_i$, so that the first $k$ buckets can be skipped. However, large MSM instances already utilize their memory to the extreme, so the precomputation approach yields negligible improvement in our case.

The number of buckets can be reduced by adopting the NAF approach. This method decreases the number of buckets by half, giving a final total cost of $\approx \frac{\lambda}{s}(n + 2^{s-1})$ group operations (see the recoding sketch below).

GLV decomposition is an additional optimization that can be applied only when the curves of interest also have the endomorphism property.
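A toy sketch of the signed-digit recoding behind this bucket halving (an illustration under stated assumptions, not EdMSM's exact code): digits are folded into the range $[-(2^{s-1} - 1), 2^{s-1}]$, and a digit $d < 0$ contributes $-P$ to bucket $|d|$, point negation being nearly free on tEd curves.

def signed_digits(k, s, num_windows):
    """Recode k into base-2^s digits in [-(2^(s-1) - 1), 2^(s-1)], least
    significant window first. Assumes the top window of k has room to
    absorb the final carry (otherwise one extra window is needed)."""
    digits, carry = [], 0
    for _ in range(num_windows):
        d = (k & ((1 << s) - 1)) + carry
        k >>= s
        carry = 1 if d > 1 << (s - 1) else 0    # fold into the signed range
        digits.append(d - (carry << s))
    return digits

# Reconstruction check, e.g. s = 2: 7 = (-1) * 4^0 + 2 * 4^1
assert sum(d << (i * 2) for i, d in enumerate(signed_digits(7, 2, 2))) == 7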

Curve forms and coordinate systems

To minimize the overall cost of storage but also the run time, one can store the bases $P_i$ in affine coordinates (as tuples $(x_i, y_i)$).

As Bucket Accumulation is the only step whose cost includes a factor of $n$, for large MSM instances the dominating cost is in summing up each bucket. Note that the additions here are mixed additions, i.e. additions between an affine point ($P_i$) and a projective point (a partial sum of $B_i$), which are faster than regular projective-projective additions.

Over the short Weierstrass (SW) form, we commonly choose extended Jacobian coordinates $\{X, Y, Z^2, Z^3\}$ where $x = \frac{X}{Z^2}$ and $y = \frac{Y}{Z^3}$, trading off memory for run time compared to the usual Jacobian coordinates.

However, it appears that the twisted Edwards (tEd) form is appealing for the bucket method, since it has the lowest cost for mixed addition in extended coordinates. Furthermore, the arithmetic on this form is complete, i.e. it eliminates the need for branching between addition and doubling, unlike the SW form. We showed in Lemma 2 that all inner BLS curves admit a tEd form.

Dedicated addition formulas can be applied only when it is precisely known in advance whether the two operands of an addition are equal (doubling) or not (addition).

We take the example of BLS12-377, for which $a = -1$:

  • For the Bucket Accumulation step, they used unified mixed additions with some precomputation. Instead of storing $P_i$ in affine coordinates, they store them in a custom coordinate system $(X, Y, T)$ where $y - x = X$, $y + x = Y$ and $2d' \cdot x \cdot y = T$ (see the sketch after this list).

  • For the Window Reduction step, they use unified additions (9m) and dedicated doublings (4m + 4s).

  • For the Bucket Reduction step, they used unified additions (9m) to keep track of the running sum and unified mixed additions (8m) to keep track of the total sum.

The conversion of all the $P_i$ points, given on a SW curve in affine coordinates, to points on a tEd curve with the custom coordinates $(X, Y, T)$ is a one-time computation dominated by a single inversion, thanks to the Montgomery batch trick; both the mixed addition and the batch trick are sketched below.
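As an illustration, here is a sketch of a unified mixed addition in extended coordinates that consumes the custom form. These are the standard $a = -1$ twisted Edwards addition formulas (Hisil et al., as popularized by Ed25519 implementations); treating them as the exact formulas used by EdMSM is an assumption.

# P1 = (X1 : Y1 : Z1 : T1) in extended tEd coordinates (T1 = X1*Y1/Z1), and
# P2 given in the custom precomputed form
# (X2, Y2, T2) = (y2 - x2, y2 + x2, 2d' * x2 * y2), all modulo the prime p.
def unified_madd(P1, P2, p):
    X1, Y1, Z1, T1 = P1
    X2, Y2, T2 = P2
    A = (Y1 - X1) * X2 % p             # (y1 - x1) * (y2 - x2)
    B = (Y1 + X1) * Y2 % p             # (y1 + x1) * (y2 + x2)
    C = T1 * T2 % p                    # 2d' * t1 * t2
    D = (Z1 + Z1) % p                  # 2 * z1 * z2, with z2 = 1
    E, F, G, H = (B - A) % p, (D - C) % p, (D + C) % p, (B + A) % p
    return (E * F % p, G * H % p, F * G % p, E * H % p)   # (X3, Y3, Z3, T3)

And a minimal sketch of the Montgomery batch-inversion trick used for the one-time conversion (generic, not EdMSM-specific): $n$ field inversions are traded for a single inversion plus $3(n-1)$ multiplications.

def batch_inverse(xs, p):
    """Return [1/x mod p for x in xs] using a single field inversion."""
    prefix = [1] * (len(xs) + 1)
    for i, x in enumerate(xs):
        prefix[i + 1] = prefix[i] * x % p   # prefix[i] = x_0 * ... * x_{i-1}
    acc = pow(prefix[-1], -1, p)            # the single inversion
    invs = [0] * len(xs)
    for i in reversed(range(len(xs))):
        invs[i] = prefix[i] * acc % p       # = 1 / x_i
        acc = acc * xs[i] % p               # acc = 1 / (x_0 * ... * x_{i-1})
    return invs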

Performance improvement

The implementation of EdMSM shows that an MSM instance of size $2^{16}$ on the BLS12-377 curve is 30% faster when the $P_i$ points are given on a tEd curve with the custom coordinates, compared to the extended-Jacobian-based version that takes points in affine coordinates on a SW curve.

References

  • Analyzing and Comparing Montgomery Multiplication Algorithms
  • EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication
  • https://www.hyperelliptic.org/EFD/g1p/auto-shortw-xyzz.html

Written by Carson Lee from A41