Note: this method is best used when negation for a group is cheap such as for an elliptic curve.
Motivation
When following a bucket-based method like , the amount of memory required for the buckets is demanding. How can we reduce the number of buckets per window from 2s−1 to 2s−1, where s is the number of bits per window, and reduce overhead?
Explanation
Naive
Let's take a look at an example of creating buckets with . Let's say the number of windows ⌈sλ⌉=2 and the bit size per window s=3. We'll look at the following example where k denotes scalars and P refers to points on an elliptic curve for the problem.
Note that any computation kP can be reduced to (k−2s)P+2sP where s is the bit width of k.
This means we can define the following NAF bucket method k′=NAF(k,P) for window Wj as such:
If 0<k<2s−1:
k′=k
Accumulate P into the k-th bucket Bk,j
If 2s−1≤k<2s:
k′=k−2s, then kP=−k′(−P)+2sP
Accumulate −P into the −k′-th bucket B−k′,j
Add P into the next window
Step 2c computes the 2sP part of the kP reduction and is also known as the "carry."
Let's change our naive Pippenger's to follow this method and generate our buckets and windows:
Notice that this method requires the presence of an extra carry bucket B1,3. This extra carry bucket should not exist. We'll take a look later at how to prevent this after this Naive Pippenger's to NAF method is fully shown.
The formation of an extra carry bucket depends on the true bit size slast of the final window in relation to λ, the total bits of k, and s, the bits per windows.
Safe Case: slast<s−1
Example: λ=8, s=5 ⇒ k=k1+25k2
Then slast=3, and 0≤k2<23
Since k2+1<24, carry cannot occur ⇒ It’s safe to assume the final carry is 0.
Carry Case: slast≥s−1
Example: λ=8, s=4 ⇒ k=k1+24k2
Here slast=4, and 0≤k2<24
Since it is possible that k2+1≥23, carry can occur
Thus, if the restriction slast<s−1 is held, we can ensure there will be no new carry bucket created.
2. Naive Fix: Use Larger Bucket Only for Final Window
Use a larger bit-width bucket for the final window to absorb potential carry.
Cons: This is more complex and can be inefficient due to extra bucket space.
3. Use a Larger λ Than the Actual Modulus Bit Size
Let’s say we are working with BN254, where λ=256 and s=16. The actual scalar field is 254 bits.
We express k as:
k=k1+216k2+⋯+216×15k16
Let M be the modulus of BN254. Since the maximum value of k is M−1, we have:
1≤k16+1=⌊216×15M−1⌋+1<214
This ensures that no overflow occurs in the final window.
Yrrid, the Zprize winner in 2022 for the category "Prize 1a: Accelerating MSM Operations on GPU" explains one more optimization they used for their solution to guarantee that no carry occurs in the final window. When the MSB of k is set to 1, they apply the following transformation to reset the MSB to 0, thereby preventing any carry:
kP=(−k)(−P)=(M−k)(−P)
Here, M is the modulus of the scalar field (i.e., curve order), and k∈[0,M).
Let mlast denote the s-bit chunk in the final window. Yrrid's trick is said to ensure that mlast+1 doesn't overflow.
Though this is what they state, it appears as if overflow can still occur as shown in the following example:
For example, in F11 with λ=4, the values where the MSB of k is set to 1 are 8, 9, and 10. This means that -8, -9, and -10 will be transformed into 3, 2, and 1, respectively. With this transformation, the new maximum value of k becomes 23−1=7. If s=2, then:
k=k1+22k2
In this case with the maximum value of k as 7, k2 will be at most 1, meaning it does not satisfy its limit constraints:
1≤k2+1<22−1=2
Code Example
void FillDigits(uint32_t kLambda, uint32_t s, const ScalarField& f, std::vector<int64_t>* digits) {
// The size of `digits` must be equal to ceil(lambda / s)
// (lambda: total number of bits, s: window size in bits)
CHECK_EQ(digits->size(), static_cast<size_t>(std::ceil(kLambda / s)));
// `radix` is 2^s — the maximum range per window
uint64_t radix = uint64_t{1} << s;
// `carry` to be propagated to the next window
uint64_t carry = 0;
size_t bit_offset = 0;
for (size_t i = 0; i < digits->size(); ++i) {
// mᵢ: s-bit value extracted from the current window, including previous carry
uint64_t m_i = f.ExtractBits(bit_offset, s) + carry;
// Compute carry: if mᵢ is greater than or equal to radix / 2, carry = 1;
//. otherwise carry = 0
// This is based on the NAF rounding threshold (radix / 2)
carry = (m_i + (radix >> 1)) >> s;
// Convert mᵢ to its NAF form: mᵢ - carry * 2^s
(*digits)[i] = static_cast<int64_t>(m_i) - static_cast<int64_t>(carry << s);
bit_offset += s;
}
// Apply the final carry to the last digit
// This adjustment preserves the total scalar value
digits->back() += (carry << s);
}
void AddBasesToBuckets(uint32_t s, const std::vector<Point> bases,
const std::vector<std::vector<int64_t>>& digits,
size_t window_index, std::vector<Bucket>& buckets) {
for (size_t i = 0; i < scalar_digits.size(); ++i) {
int64_t scalar = scalar_digits[i][window_index];
if (0 < scalar) {
// If the scalar is positive:
// Place the corresponding base into the (scalar - 1)-th bucket
//
// Why `scalar - 1`?
// Because scalars range from 1 to 2^{s-1} - 1 in NAF,
// but bucket indices are 0-based (0 to 2^{s-1} - 2)
// so we subtract 1 to get the correct index.
buckets[static_cast<uint64_t>(scalar - 1)] += bases[i];
} else if (0 > scalar) {
// If the scalar is negative:
// Place the *negated* base into the (-scalar - 1)-th bucket
//
// Why `-scalar - 1`?
// Because negative scalars range from -1 to -2^{s-1},
// and we want to map -1 → index 0, -2 → index 1, ..., just like for positives.
buckets[static_cast<uint64_t>(-scalar - 1)] -= bases[i];
}
}
}
References
: I'm unsure why they don't just add ki,j=2s−1 into the Bk,j bucket. Perhaps this is a source of optimization when implementing MSM.
’s Trick
Here's a code snippet from implementing NAF for bucket MSM: