Montgomery Reduction
Last updated
Last updated
Montgomery reduction provides an efficient way to perform modular arithmetic, particularly modular multiplication . This is crucial when the modulus is large and many operations are needed, as in cryptography. Montgomery reduction excels when is odd and aims to replace slow divisions by n.
The method shifts calculations into a special "Montgomery domain" defined by an auxiliary modulus . Arithmetic happens within this domain, where the reduction step is cleverly transformed into a division by . Since is chosen as a power of 2, this division is just a fast bit shift and the remainder is equivalent to its last few bits. Results are converted back to the standard representation when the sequence of operations is complete.
We choose an auxiliary modulus (where is the base, usually 2) such that . Critically, and must be coprime: . In this case, this just implies must be odd.
Definition 1. The Montgomery representation of an integer (where ) is .
Since and are coprime, we can define the following conversion functions:
: Converts back to standard form: . This operation is also called Montgomery Reduction.
: Converts into Montgomery form: . This calculation itself can be optimized with Montgomery Reduction.
Lemma 1. Remainder of the division by can be efficiently calculated as where takes the least significant bits of .
Lemma 2. Quotient can be efficiently computed using bit shift: .
Given two numbers in Montgomery form, and . We would like to calculate . This can be calculated using since:
To do the Montgomery Multiplication efficiently, we need to be able to calculate:
And we have:
REDC(T)
:And after the next partial reduction, we will have:
Then, similar to the previous calculation, the upper bound on the total addition will become:
Naively speaking, SOS algorithm follows this sequence:
which results in its namesake "separated" in SOS.
CIOS, on the other hand, is different from SOS in that it interleaves the multiplication and reduction steps, limb by limb.
Now we alternate between iterations of the outer loops for multiplication and reduction.
Addition
Multiplication
Memory space
To get the actual , we will have to additionally convert back from montgomery form using .
Definition 2. Montgomery Reduction is an efficient reduction method to calculate for a given . This operation can also be denoted as:
The key idea to perform Montgomery Reduction efficiently is to find an integer such that is exactly divisible by . Then the desired result can be computed as:
What will happen if we precompute to compute this?
We need to find such that
To calculate efficiently, we use a precomputed value . This exists because . Now, we have and can be defined as:
so
so
Hence, so we just need to subtract once if to take the modulus.
Given , and precomputed .
Calculate . (Effectively: multiply the lowest bits of by , take the lowest bits of the product).
Compute . (Involves one multiplication , one addition, and one right shift by ).
Return if , otherwise return .
With some precomputation and using Montgomery Reduction , we can optimize the , , and operations as following:
Precompute and :
In a multi-precision setting, suppose the modulo has limbs. Let , where is limb bit width and . To calculate , the multi-precision Montgomery Reduction operates iteratively on:
where represents limb values. At each iteration, the limbs of are reduced one by one which is done by partial reduction. The first partial reduction looks like so:
and so on. After iterations, we have with limbs. The values of limbs change at each round because is also a multi-precision integer.
At round , we need to calculate . This can be done efficiently, similar to the single-precision case by adding some multiple of to make it divisible by :
Precompute .
Calculate . Then we have
Calculate .
Since and , we have
which means after step 3, we don't have to worry about subtracting .
While trying to reduce, we add every round which can cause potential overflow in the largest limb. Let's calculate how much we added in total. If we omit dividing by , at each round, we can see that at each round , we add . This can be confusing but if you think of round as adding some value to convert the -th limb to 0, it can be more easier to see.
This gives that total sum can be which needs limbs. For , we already need limbs and after the first step, the least significant limb will be 0's which can be discarded. In the first round, we have a maximum value of so it fits within the limbs. So, if we discard the least significant limb after first round, we don't actually need extra limb space to handle the overflowing.
A complete round of partial reduction can be summarized by the following steps:
Compute (which costs limb multiplication).
Compute (which costs limb multiplication).
Set .
Now, let's calculate the upper bound on the final result of the reductions. First of all, we saw above that, the total added value has an upper bound of , and using , we have:
Therefore, the reduced value may require one additional subtraction to bring the value within the range . Since we need limb multiplications for steps 1 and 2, over rounds, we will need a total of limb multiplications.
We have seen that we need to do multiplications per round for multi-precision Montgomery Reduction. However, that we can reduce it to . Let's see what happens if we just precompute and multiply the free coefficients with . Then, the partial reduction round becomes:
Compute which is limb multiplication, resulting in a limb integer.
Set .
So it hasn't changed. Which means the upper bound on the value after reductions will also be the same:
This gives us the total cost of limb multiplications. Ingonyama's article mentions that this reduction cannot be applied in the last step because will result in a limb integer while in the traditional case, we have limbs. Indeed it is true since we add it after the bit shift unlike the original version but if the result is less than , it shouldn't matter.
In the SOS method above, we iteratively divided by times where denotes the limb base () and denotes the number of the limbs. iterations gives the desired result since . Now we can discard the lower limbs, which is technically equivalent to dividing by .
Multiply two large integers in Montgomery form using a .
Perform Montgomery reduction to produce from .
Interleaving multiplication and reduction is valid since the value of in the -th iteration of the outer loop for reduction depends only on the value , which is completely computed by the -th iteration of the outer loop for the multiplication.
SOS with Ingonyama's optimization saves multiplications ( vs. ) whereas CIOS saves memory space compared to SOS ( vs. ) owing to not storing the total . CIOS can be further optimized using .
Written by and from