EdMSM
ZPrize "Accelerating MSM on Mobile" winner
Last updated
The two main optimizations contributed by this work are:
An improvement to the CIOS method of multi-precision Montgomery multiplication.
A variant of Pippenger's bucket method with optimizations tailored for twisted Edwards curves.
This article will first introduce the optimization trick applicable to CIOS when the MSB of the modulus' most significant word is 0 and not all of the remaining bits are set (with 64-bit machine words, the most significant word of the modulus should be at most 0x7FFFFFFFFFFFFFFE).
For an explanation of the CIOS method of multi-precision Montgomery multiplication, refer to the article written by us.
Hence we can remove the addition from line (1).
Hence we can remove line (2) as well.
Experiments show that the new algorithm yields a 5-10% improvement over the normal CIOS algorithm.
The methods are compared in terms of the number of additions, the number of multiplications, and the memory space required.
The key property of such 2-chains of elliptic curves that we are going to exploit, in particular for those whose inner curve is a BLS curve, is stated as Lemma 2 below.
However, a twisted Edwards (tEd) form is appealing for the bucket method since it has the lowest cost for mixed addition in extended coordinates. Furthermore, the arithmetic on this form is complete, i.e. it eliminates the need for branching between addition and doubling, unlike the SW form. We showed in Lemma 2 that all inner BLS curves admit a tEd form.
Dedicated formulas can be applied only when it is precisely known in advance whether the two operands of the addition are equal (a doubling) or distinct (an addition).
For the Window Reduction step, they use unified additions (9m) and dedicated doublings (4m + 4s).
For the Bucket Reduction step, they use unified additions (9m) to keep track of the running sum and unified mixed additions (8m) to keep track of the total sum.
Now we will show that when the highest word of the modulus is at most 2^63 - 2 (that is, for 64-bit machine words, the most significant word of the modulus is at most 0x7FFFFFFFFFFFFFFE), the two highlighted additions (1) and (2) can be safely omitted.
Observe that for the last iterations of the two inner loops, the results have the form (hi, lo) = x*y + c + d, where x, y, c and d are 64-bit machine words. We can derive the lemma below with a simple calculation.
Lemma 1. If one of x and y is at most 2^63 - 2, then hi is at most 2^63 - 1.
From this lemma, we can derive that if the highest word of q is at most 2^63 - 2 then the variables t[N] and t[N+1] always store the value 0 at the beginning of each outer iteration. This is shown by a proof by induction.
The base case (i = 0) is trivial (t is always initialized to 0).
At iteration i, suppose that t[N] = t[N+1] = 0, and trace the first inner loop execution through the iteration. As a[N-1] <= q[N-1] and the highest word of q is smaller than 2^63 - 1, Lemma 1 can be used to see that the carry out from the last iteration of the first inner loop is at most 2^63 - 1. Then line (1) sets t[N] = t[N] + C = C with no carry out, so line (2) sets t[N+1] = 0.
The same reasoning applies for the second inner loop as the highest word of q is smaller than 2^63 - 1. As t[N] + C has the maximum value of 2^64 - 2, we can assume that it falls into one limb and the updated t[N-1] is t[N] + C, and we know that t[N+1] is untouched during the loop, which altogether keeps the invariant true going through the end of the iteration: t[N] = t[N+1] + C = 0.
SOS saves multiplications whereas CIOS saves additions and memory space.
Refer to the article for the explanation of 2-chains and 2-cycles of elliptic curves.
All inner BLS curves admit a Short Weierstrass form of: y^2 = x^3 + b,
and they always admit a twisted Edwards form: a*x^2 + y^2 = 1 + d*x^2*y^2 (with a and d over Fp).
Refer to the paper for the proof. Note that this implies both curves in a 2-cycle admit a twisted Edwards form following the same reasoning.
Refer to the article written by us for the full background on Pippenger's algorithm.
Let n be the size of the MSM, b the maximum bit length of the scalars, c the window size, P_i the base points (1 <= i <= n), W_j the window sums (1 <= j <= ceil(b/c)), and S_l the bucket sums (1 <= l <= 2^c - 1).
Using parallelism, we can achieve the best improvement when we have ceil(b/c) cores and let each core compute one window sum in the Bucket Reduction step. But this increases memory usage as each core needs to keep its own 2^c - 1 buckets.
Precomputation may also improve the performance when the bases are known in advance. For example, for each base we can precompute as many multiples as the storage allows so that part of the bucket accumulation can be skipped. However, large MSM instances already utilize their memory to the extreme. Hence the precomputation approach yields negligible improvement in our case.
The number of buckets can be reduced by adopting a signed-digit decomposition of the scalars. This method decreases the bucket number by half, hence giving a final total cost of roughly ceil(b/c) * (n + 2^(c-1)) group operations.
GLV scalar decomposition is an additional optimization that can be applied only when the curves of interest also have the endomorphism property.
To minimize the overall cost of storage but also the run time, one can store the bases in affine coordinates (as tuples (x, y)).
As Bucket Accumulation is the only step whose cost includes a factor of n, for large MSM instances the dominating cost is in summing up the buckets. Note that the additions here are mixed additions, i.e. additions between an affine point (a base point) and a projective point (the partial sum of a bucket), which are faster than projective-projective additions.
Over the short Weierstrass (SW) form, we commonly choose extended Jacobian coordinates, where x = X/ZZ and y = Y/ZZZ (with ZZ^3 = ZZZ^2), trading off memory for run time compared to the usual Jacobian coordinates.
We take the example of BLS12-377:
For the Bucket Accumulation step, they used unified mixed additions with some precomputation. Instead of storing the points in affine coordinates (x, y), they store them in a custom coordinate system tailored to the unified mixed addition formulas.
The conversion of all the points, given on a SW curve in affine coordinates, to points on a tEd curve in the custom coordinates is a one-time computation dominated by a single inversion using the Montgomery batch trick.
The EdMSM implementation shows that an MSM instance on the BLS12-377 curve is 30% faster when the points are given on a tEd curve in the custom coordinates, compared to the extended-Jacobian-based version which takes points in affine coordinates on a SW curve.
Written by from