EdMSM

ZPrize "Accelerating MSM on Mobile" winner

The two main optimizations contributed by this work are listed as follows:

  1. An improvement to CIOS (Coarsely Integrated Operand Scanning) modular multiplication, one method of multi-precision Montgomery multiplication.

  2. A variant of Pippenger's MSM algorithm with optimizations tailored for twisted Edwards curves.

This article will first introduce the optimization trick applicable to CIOS when the MSB of the modulus' most significant word is 0 and not all of the remaining bits are set (with 64-bit machine words, the most significant word of the modulus should be at most 0x7FFFFFFFFFFFFFFE).

Multiplication Optimization

For an explanation of the CIOS method of multi-precision Montgomery multiplication, refer to the page written by us.

Normal CIOS

Implementation in Python

def cios(a, b, n, n_prime, w, l):
    """
    Montgomery multiplication using the CIOS method.

    Args:
        a (list[int]): \bar{a} = aR mod n, little-endian limbs
        b (list[int]): \bar{b} = bR mod n, little-endian limbs
        n (list[int]): modulus, little-endian limbs
        n_prime (list[int]): n' = -inv(n) mod R (only n_prime[0] is used)
        w (int): bit size of a limb
        l (int): number of limbs

    Returns:
        list[int]: t = REDC(\bar{a} * \bar{b}) = toMont(ab),
                   up to a final conditional subtraction of n
    """
    B = 1 << w           # limb base
    t = [0] * (l + 2)    # accumulator, two extra words for carries

    for i in range(l):
        carry = 0

        # Step 1: t[j] = t[j] + a[j] * b[i] + carry
        for j in range(l):
            total = t[j] + a[j] * b[i] + carry
            # divmod(p, q) divides p by q and returns (quotient, remainder)
            carry, t[j] = divmod(total, B)

        # (1)
        t[l + 1], t[l] = divmod(t[l] + carry, B)

        # Step 2: compute m so that t + m * n is divisible by B
        m = (t[0] * n_prime[0]) % B

        # (carry, _) := t[0] + m * n[0]
        total = t[0] + m * n[0]
        carry, _ = divmod(total, B)

        # t := (t + m * n) >> w
        for j in range(1, l):
            total = t[j] + m * n[j] + carry
            carry, t[j - 1] = divmod(total, B)

        # (carry, t[l - 1]) := t[l] + carry
        total = t[l] + carry
        carry, t[l - 1] = divmod(total, B)

        # (2)
        t[l] = t[l + 1] + carry

    return t
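
As a quick sanity check, the snippet below runs cios on a toy instance with 8-bit limbs ($w = 8$, $B = 256$, $R = B^2$); the modulus and operand values are arbitrary illustrations:

# Toy parameters: two 8-bit limbs. The modulus is odd, and its top limb
# 0x3d is below the 0x7e bound discussed in the next section.
w, l = 8, 2
B, R = 1 << w, 1 << (w * l)
n_int = 0x3d01
n = [n_int & 0xFF, n_int >> 8]
n_prime = [(-pow(n_int, -1, B)) % B]                  # only n'[0] is needed
a_int, b_int = 1234, 5678
a_bar, b_bar = a_int * R % n_int, b_int * R % n_int   # Montgomery forms
a = [a_bar & 0xFF, a_bar >> 8]
b = [b_bar & 0xFF, b_bar >> 8]
t = cios(a, b, n, n_prime, w, l)
result = sum(limb << (w * i) for i, limb in enumerate(t))
# The output equals toMont(a * b) up to one conditional subtraction of n.
assert result % n_int == a_int * b_int * R % n_int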

Optimization

Now we will show that when the highest word of the modulus $p$ is at most $\frac{B-1}{2} - 1$, where $B = 2^w$ is the limb base (for 64-bit machine words this means the most significant word of the modulus is at most 0x7FFFFFFFFFFFFFFE), the two highlighted additions (1) and (2) can be safely omitted.
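
In code, the condition is a one-line check on the most significant limb of the modulus; a trivial sketch (the function name is ours):

def edmsm_applicable(n_limbs, w):
    """True if the top limb of the modulus is at most (B - 1) // 2 - 1,
    i.e. its MSB is 0 and not all of the remaining bits are set."""
    return n_limbs[-1] <= ((1 << w) - 1) // 2 - 1

assert edmsm_applicable([0x01, 0x7E], w=8)      # top limb 0x7e qualifies
assert not edmsm_applicable([0x01, 0x7F], w=8)  # 0x7f does not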

Proof Sketch

Observe that the last iterations of the two inner loops both have the form $(\mathsf{carry\_out}, t_{new}) := t_{old} + x_1 \cdot x_2 + \mathsf{carry\_in}$. We can derive the lemma below with a simple calculation.

Lemma 1. If one of $x_1$ and $x_2$ is at most $\frac{B-1}{2} - 1$, then $\mathsf{carry\_out} \le \frac{B-1}{2}$.
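
The calculation behind the lemma is a direct worst-case bound. Assume without loss of generality that $x_2 \le \frac{B-1}{2} - 1$, and let $t_{old}$, $x_1$, and $\mathsf{carry\_in}$ take their maximum value $B - 1$. Then

$t_{old} + x_1 \cdot x_2 + \mathsf{carry\_in} \le (B-1) + (B-1)\left(\frac{B-1}{2} - 1\right) + (B-1) = \frac{(B-1)(B+1)}{2} = \frac{B^2 - 1}{2}$

and therefore $\mathsf{carry\_out} = \left\lfloor \frac{t_{old} + x_1 x_2 + \mathsf{carry\_in}}{B} \right\rfloor \le \frac{B-1}{2}$.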

From this lemma, we can derive that if the highest word of $n$ is at most $\frac{B-1}{2} - 1$, then the variables $t[l]$ and $t[l+1]$ always store the value $0$ at the beginning of each iteration. This is shown by induction.

The base case ($i = 0$) is trivial ($t$ is initialized to all zeros).

At iteration $i$, suppose that $t[l] = t[l+1] = 0$ and trace the execution of the first inner loop through the iteration. As $\bar{a} < n$ and the highest word of $n$ is smaller than $\frac{B-1}{2}$, Lemma 1 shows that the carry out of the last iteration of the first inner loop is at most $\frac{B-1}{2}$. Then line (1) sets

$t[l] = \mathsf{carry\_out}$
$t[l+1] = 0$

Hence we can remove the addition from line (1).

The same reasoning applies to the second inner loop, as the highest word of $n$ is smaller than $\frac{B-1}{2}$. Since $t[l] + \mathsf{carry\_out}$ is at most $\frac{B-1}{2} + \frac{B-1}{2} = B - 1$, the sum fits into one limb and the updated $\mathsf{carry\_out}$ is $0$. Moreover, $t[l+1]$ is untouched during the loop, which altogether keeps the invariant true going through line (2):

$t[l+1] = 0$
$t[l] = t[l+1] + 0 = 0$

Hence we can remove the line (2) as well.
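
Putting it together, the optimized algorithm only needs $l + 1$ words for $t$, and lines (1) and (2) collapse into plain assignments. The sketch below mirrors the Python implementation above (the name cios_edmsm is ours):

def cios_edmsm(a, b, n, n_prime, w, l):
    """CIOS with the additions (1) and (2) removed. Requires the top word
    of n to be at most (B - 1) // 2 - 1 (see the proof sketch above)."""
    B = 1 << w
    t = [0] * (l + 1)  # t[l + 1] is no longer needed

    for i in range(l):
        carry = 0
        for j in range(l):
            carry, t[j] = divmod(t[j] + a[j] * b[i] + carry, B)
        t[l] = carry  # (1) simplified: t[l] was 0 and carry < B

        m = (t[0] * n_prime[0]) % B
        carry, _ = divmod(t[0] + m * n[0], B)
        for j in range(1, l):
            carry, t[j - 1] = divmod(t[j] + m * n[j] + carry, B)
        t[l - 1] = t[l] + carry  # (2) simplified: the sum fits in one word

    return t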

Performance improvement

Experiments show that the new algorithm yields a 5-10% improvement over the normal CIOS algorithm.

|                | CIOS            | CIOS + EdMSM | SOS             | SOS + Ingonyama |
| -------------- | --------------- | ------------ | --------------- | --------------- |
| Addition       | $4l^2 + 4l + 2$ | $4l^2 - l$   | $4l^2 + 4l + 2$ | $4l^2 + 4l + 2$ |
| Multiplication | $2l^2 + l$      | $2l^2 + l$   | $2l^2 + l$      | $2l^2 + 1$      |
| Memory space   | $l + 3$         | $l + 3$      | $2l + 2$        | $2l + 2$        |

SOS + Ingonyama saves $l - 1$ multiplications, whereas CIOS + EdMSM saves $5l + 2$ additions and uses $l - 1$ fewer words of memory.

Pippenger optimization

2-chain and 2-cycle of elliptic curves

Refer to the page for an explanation of 2-chains and 2-cycles of elliptic curves.

The key property of such 2-chains of elliptic curves that we are going to exploit, in particular for those whose inner curve is a BLS curve, is shown as a lemma below:

Lemma 2.

All inner BLS curves admit a Short Weierstrass form: $y^2 = x^3 + 1$,

and they always admit a twisted Edwards form: $ay^2 + x^2 = 1 + dx^2y^2$ (with $a = 2\sqrt{3} - 3$ and $d = -2\sqrt{3} - 3$ over $\mathbb{F}_p$).

Refer to Section 3.1 of the paper for the proof. Note that, by the same reasoning, this implies both curves in a 2-cycle admit a twisted Edwards form.

Pippenger's Algorithm (Bucket method)

Refer to the page written by us for the full background on Pippenger's algorithm.

Let $n$ be the size of the MSM, $\lambda$ the maximum bit length of the scalars, $s$ the window size, $P_i$ the base points ($i \in [n]$), $W_i$ the window sums ($i \in [\lambda / s]$), and $B_i$ the bucket sums ($i \in [2^s - 1]$).
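
To fix ideas, here is a minimal sketch of the bucket method with ordinary integer addition standing in for the group law (so the "points" are plain integers); the names are illustrative, not taken from the EdMSM code:

def pippenger(scalars, points, lam, s):
    """Bucket method for lam-bit scalars with window size s."""
    num_windows = (lam + s - 1) // s
    result = 0
    for w in reversed(range(num_windows)):
        # Bucket Accumulation: add each P_i into the bucket of its digit.
        buckets = [0] * (1 << s)                  # bucket 0 is never used
        for k, P in zip(scalars, points):
            d = (k >> (w * s)) & ((1 << s) - 1)
            buckets[d] += P
        # Bucket Reduction: W_w = sum_j j * B_j via running and total sums.
        running = total = 0
        for b in reversed(buckets[1:]):
            running += b                          # B_m + ... + B_j
            total += running
        # Window Reduction: result = 2^s * result + W_w (s doublings + add).
        result = (result << s) + total
    return result

assert pippenger([3, 5], [7, 11], lam=4, s=2) == 3 * 7 + 5 * 11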

Known optimizations

Using parallelism, we achieve the best improvement when we have $\lambda/s$ cores and let each core compute one window sum $W_i$ in the Bucket Reduction step. But this increases memory usage, as each core needs to keep its own $2^s - 1$ buckets.

Precomputation may also improve performance when the bases $P_1, \dots, P_n$ are known in advance. For example, for each base $P_i$ we can choose $k$ as big as storage allows and precompute the $k$ points $[2^s - k]P_i, \dots, [2^s - 1]P_i$ so that the first $k$ buckets can be skipped. However, large MSM instances already utilize memory to the extreme, so the precomputation approach yields negligible improvement in our case.

The number of buckets can be reduced by adopting the NAF approach. This method halves the number of buckets, giving a final total cost of $\approx \frac{\lambda}{s}(n + 2^{s-1})$ group operations.
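
One common way to realize this is to recode each scalar into signed window digits of absolute value at most $2^{s-1}$, so that a digit $-d$ adds $-P_i$ into bucket $d$; the sketch below is a generic recoding, not the EdMSM implementation:

def signed_window_digits(k, s, num_windows):
    """Recode scalar k into digits d_w with |d_w| <= 2^(s-1) and
    k = sum_w d_w * 2^(w*s), so only 2^(s-1) bucket indices occur.
    num_windows should be ceil(lambda / s) + 1 to absorb the final carry."""
    base, half = 1 << s, 1 << (s - 1)
    digits = []
    for _ in range(num_windows):
        d = k & (base - 1)
        k >>= s
        if d > half:
            d -= base   # use the negative representative ...
            k += 1      # ... and carry 1 into the next window
        digits.append(d)
    return digits

# Example: 7 = -1 * 4^0 + 2 * 4^1 with s = 2.
assert signed_window_digits(7, s=2, num_windows=2) == [-1, 2]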

GLV decomposition is an additional optimization that can be applied only when the curves of interest also have an efficiently computable endomorphism.

Curve forms and coordinate systems

To minimize storage as well as run time, one can store the bases $P_i$ in affine coordinates (as tuples $(x_i, y_i)$).

As Bucket Accumulation is the only step whose cost includes a factor of $n$, for large MSM instances the dominating cost is summing points into buckets. Note that the additions here are mixed additions, i.e. additions between an affine point ($P_i$) and a projective point (the partial sum of $B_i$), which are faster than general projective-projective additions.

Over the short Weierstrass (SW) form, we commonly choose extended Jacobian coordinates $\{X, Y, Z^2, Z^3\}$ where $x = \frac{X}{Z^2}$ and $y = \frac{Y}{Z^3}$, trading off memory for run time compared to the usual Jacobian coordinates.

However, a twisted Edwards (tEd) form is appealing for the bucket method, since it has the lowest cost for mixed addition in extended coordinates. Furthermore, the arithmetic on this form is complete, i.e. it eliminates the branching between the addition and doubling cases that a SW form requires. We showed in Lemma 2 that all inner BLS curves admit a tEd form.

Dedicated addition can be applied only when it is known precisely in advance whether the two operands are equal (doubling) or distinct (addition).

We take the example of BLS12-377, for which $a = -1$:

  • For the Window Reduction step, they use unified additions (9m) and dedicated doublings (4m + 4s).

  • For the Bucket Reduction step, they use unified additions (9m) to keep track of the running sum and unified mixed additions (8m) to keep track of the total sum.

  • For the Bucket Accumulation step, they use unified mixed additions with some precomputation. Instead of storing $P_i$ in affine coordinates, they store them in a custom coordinate system $(X, Y, T)$ where $X = y - x$, $Y = y + x$, and $T = 2d' \cdot x \cdot y$ (see the sketch below).
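
As an illustration of the last item, below is a sketch of unified mixed addition on an $a = -1$ tEd curve in extended coordinates, following the well-known Hisil et al. formulas; the second operand is assumed precomputed in the custom form $(y - x,\ y + x,\ 2d \cdot x \cdot y)$, and the naive modular arithmetic is for exposition only:

def ted_unified_mixed_add(p1, p2_pre, q):
    """p1 = (X1, Y1, Z1, T1) in extended coordinates over F_q;
    p2_pre = (y2 - x2, y2 + x2, 2d * x2 * y2), an affine point with z2 = 1."""
    X1, Y1, Z1, T1 = p1
    X2, Y2, T2 = p2_pre
    A = (Y1 - X1) * X2 % q          # (y1 - x1)(y2 - x2)
    B = (Y1 + X1) * Y2 % q          # (y1 + x1)(y2 + x2)
    C = T1 * T2 % q                 # 2d * t1 * t2
    D = 2 * Z1 % q                  # 2 * z1 * z2, with z2 = 1
    E, F, G, H = (B - A) % q, (D - C) % q, (D + C) % q, (B + A) % q
    return (E * F % q, G * H % q, F * G % q, E * H % q)  # (X3, Y3, Z3, T3)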

The conversion of all the $P_i$ points, given on a SW curve in affine coordinates, to points on a tEd curve in the custom coordinates $(X, Y, T)$ is a one-time computation dominated by a single inversion using the Montgomery batch trick.
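
The batch trick replaces $n$ separate inversions with a single inversion plus $O(n)$ multiplications; a minimal sketch:

def batch_inverse(xs, q):
    """Montgomery batch inversion: invert every element of xs modulo a
    prime q using a single modular inversion."""
    prefix = [1]
    for x in xs:
        prefix.append(prefix[-1] * x % q)  # prefix[i] = x_0 * ... * x_{i-1}
    inv = pow(prefix[-1], -1, q)           # the single inversion
    out = [0] * len(xs)
    for i in range(len(xs) - 1, -1, -1):
        out[i] = prefix[i] * inv % q       # (x_0...x_{i-1}) / (x_0...x_i)
        inv = inv * xs[i] % q              # drop x_i from the tail inverse
    return out

assert batch_inverse([2, 3], 7) == [4, 5]  # 2*4 = 8 = 1, 3*5 = 15 = 1 mod 7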

Performance improvement

The EdMSM implementation shows that an MSM instance of size $2^{16}$ on the BLS12-377 curve is 30% faster when the $P_i$ points are given on a tEd curve with the custom coordinates, compared to the extended-Jacobian-based version that takes points in affine coordinates on a SW curve.


References

Gautam Botrel and Youssef El Housni. "EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery Multiplication." Cryptology ePrint Archive.

Written by Carson Lee from A41
