CycloneMSM
Presentation: https://youtu.be/djA3mzn7BPg
Last updated
Presentation: https://youtu.be/djA3mzn7BPg
Last updated
CycloneMSM is an FPGA-accelerated architecture for Multi-Scalar Multiplication (MSM) over the BLS12-377 elliptic curve, developed by and . Optimized for the case of a fixed base point, this implementation achieves high-performance execution, completing 4 million-point MSMs in under one second and 64 million-point MSMs in approximately six seconds—delivering over 6× speedup compared to other state-of-the-art software. At its core, CycloneMSM leverages the bucket method of the Pippenger algorithm, and applies aggressive pipeline optimization and scheduling techniques to perform curve operations without bottlenecks.
To dramatically accelerate MSM computation, CycloneMSM leverages Field Programmable Gate Arrays (FPGAs). Unlike CPUs or GPUs, FPGAs offer a fundamentally different computation model, making them particularly well-suited for exploiting the parallelism and pipeline structure inherent in MSM.
CPU
General-purpose processor; optimized for sequential tasks
❌ Too slow
GPU
Designed for parallel computation with many execution units
⭕ Reasonable performance
FPGA
Custom logic design and deep pipelining capabilities
✅ Ideal choice
CPUs excel at complex control flow and branching, but are inefficient for highly repetitive operations like point additions and multiplications.
GPUs offer strong parallelism but struggle with fine-grained scheduling and memory access control.
FPGAs, on the other hand, allow direct circuit-level customization, enabling deeply pipelined designs that sustain a continuous flow of operations with minimal control overhead.
MSM involves millions of point additions, so the efficiency of the addition operation plays a crucial role in overall performance. In particular, require an inverse operation per addition, which can significantly increase the total computation cost. (Reference: )
Accordingly, CycloneMSM adopts the following coordinate system strategy:
Software implementation: Uses Extended Jacobian coordinates for inverse-free operations
FPGA hardware implementation: Uses Extended projective coordinates for pipelining optimization
This coordinate system selection is a key architectural decision aimed at maximizing computational efficiency in each environment.
Among these, Bucket Accumulation accounts for the majority of the total computation. Therefore, optimizing this stage is critical for improving overall MSM performance.
CycloneMSM is designed around an FPGA-based architecture, enabling fully pipelined execution of computations. In particular, the Bucket Accumulation phase, which dominates MSM computation, is highly accelerated by this pipelining structure.
The addition at each cycle is defined as:
Definition of Terms
Based on this, the Scheduler must satisfy two key objectives:
Correctness: Prevent pipeline conflicts by adhering to the constraint above.
Efficiency: Avoid excessive reordering of points; most operations should proceed in near-original order.
CycloneMSM explores two strategies to meet these goals:
Delayed Scheduler
Idea: When a conflict is detected, defer processing of that point by placing it in a FIFO queue and retry it later.
In practice: Most points are processed successfully within 2–3 passes.
Advantages: Simple architecture and easy to implement in hardware
Drawbacks:
Requires maintaining a queue of size proportional to the number of conflicts, which can be memory-intensive
In worst-case scenarios (e.g., non-uniform scalars), frequent collisions could degrade performance
CycloneMSM adopts this approach in its actual FPGA implementation.
Greedy Scheduler
Advantages: Queue is drained quickly, leading to smaller queue size
Drawbacks:
Requires tracking last access times for all buckets, which complicates state management
Increased memory updates and lookups each cycle place heavy demands on hardware resources
Due to memory access overhead, this strategy was not adopted in CycloneMSM’s hardware environment.
When implementing MSM in software, non-Affine coordinate systems such as Extended Jacobian are typically used. The main reason is that point addition in Affine coordinates requires field inversion, which is computationally expensive. For instance, consider a typical Bucket Accumulation where multiple points are added sequentially:
If Affine coordinates were used, each addition would involve an inverse operation, leading to significant performance overhead.
However, thanks to the Scheduler introduced in CycloneMSM, points can now be grouped into batches for addition without conflicts between buckets:
This Affine optimization has been adopted in actual open-source libraries:
Thanks to this optimization, a 10–20% performance improvement over Extended Jacobian-based implementations has been reported, significantly enhancing MSM computation efficiency in practice.
By avoiding modular reductions in these cases, resource usage and processing delays in the pipeline are further minimized.
In Montgomery representation, two constant multiplications are required as follows:
Fewer additions in the double-and-add process → improved speed
Shallower and faster adder trees in the FPGA design
This optimization plays a critical role in improving both performance and resource efficiency in the CycloneMSM hardware pipeline.
Pipelined Structure
Clock speed: 250 MHz → 1 cycle = 4 ns
Cycle latency:
MixedAdd Operation (Refer to the left side of Figure 4)
Full Add Operation (Refer to the right side of Figure 4)
Stage 0
Stage 1
Stage 40 (After the stages required for a single Montgomery Multiplier)
Invoke MixedAdd with outputs below:
This design converts a full Add into a MixedAdd call, effectively reusing the existing pipeline while precomputing necessary multiplications to maintain efficiency.
The MSM operation in CycloneMSM is executed through coordinated interaction between the FPGA and the host, following the flow of functions outlined below:
MSM_INIT()
:
The converted points are transferred to the FPGA and stored in DDR memory
Since the conversion involves inversion, batch inversion is used to optimize performance
MSM()
:
For each window, the following functions are executed:
ACCUMULATE(j)
: performs bucket accumulation
AGGREGATE(j)
: performs bucket reduction
The final result of all windows is computed by the host via Window Reduction
ACCUMULATE(j)
:
The host invokes FPGA.ACCUMULATE(j)
Scalars are reduced to 16-bit reduced scalars and streamed to the FPGA
Data is batched and transmitted in 64-bit chunks (8 scalars per batch)
FPGA.ACCUMULATE(j)
:
The FPGA invokes SCHED()
to read points from DDR memory
If a conflict is detected, the scalar and point are stored back into DDR via FIFO
Otherwise, the addition is performed and the result is stored in the corresponding bucket in SRAM
AGGREGATE()
:
The host invokes FPGA.AGGREGATE()
FPGA.AGGREGATE()
:
CycloneMSM leverages the architectural strengths of FPGAs and the inherent parallelism of MSM operations to deliver a high-performance system that significantly outpaces traditional software-based MSM implementations. The system's achievements are underpinned by the following design philosophies and optimization strategies:
Optimized Coordinate System Selection:
On hardware, the Twisted Edwards coordinate system was chosen for its uniform operation count, which maximizes pipelining efficiency. On the software side, performance was improved through affine coordinates combined with batch inversion.
Pipelined Curve Adder:
Conflict-Free Scheduling:
By applying a Delayed Scheduler, most points were processed without collisions. This also enabled batch inversion in the affine coordinate system, contributing to performance gains on the software side.
Modular and Parallel MSM Flow:
The MSM process—initialization, bucket accumulation, reduction, and window reduction—was modularized and optimized, with roles distributed efficiently between the FPGA and host. This maximized pipeline utilization across the system.
Superior Performance and Energy Efficiency:
CycloneMSM completes MSM on 64M points in under 6 seconds, achieving 4–6× speedups compared to software, while consuming only one-tenth the power.
In summary, CycloneMSM is more than just a hardware implementation—it is a deeply integrated system combining algorithmic and architectural design, demonstrating the real potential of hardware-accelerated MSM for more complex ZK and large-scale proof systems in the future.
While for enable inverse-free operations, the formulas used depend on the context (e.g., addition vs doubling). This results in variable computational cost, making it difficult to pipeline efficiently in hardware.
In contrast, for form of elliptic curves support both addition and doubling with a single unified formula, resulting in constant computational cost. This property makes them highly suitable for hardware pipeline design.
The is a classic technique for efficiently performing MSM (Multi-Scalar Multiplication). Given pairs of (point, scalar) and a window size , each window requires the following operations:
Bucket Accumulation:
Bucket Reduction:
Window Reduction:
The Curve Adder, a core computational unit of CycloneMSM, is a pipelined structure designed to efficiently perform point addition on Twisted Edwards curves. This module supports both and operations and is optimized for execution on FPGA hardware.
The Curve Adder takes cycles to complete a single addition. In CycloneMSM, . A new operation is fed into the pipeline every cycle, allowing multiple point additions across different buckets to run in parallel, as illustrated below:
: The pipeline time step (i.e., cycle index). At each , a new operation enters the pipeline.
: The point being added at time .
: The reduced scalar index corresponding to point . It determines the bucket.
: The bucket that accumulates all points whose reduced scalar index is .
However, for this pipeline to operate correctly, no two additions in the pipeline should access the same bucket within a -cycle window. This is enforced by the following constraint:
Estimated conflict count: Approximately .
Example: With , , , about 204,800 conflicts arise (~0.3% of points).
Idea: Attempt to process delayed points at the earliest valid time (e.g., ), even if conflicts occurred earlier.
mentioned that batch inversion can be applied using a tree-structured addition approach, which reduces the number of operations to
This structure allows for batch addition of Affine points, making it possible to apply batch inversion and significantly reduce the overall computation cost.
Affine addition (1 time):
Affine additions (naive):
Affine additions (with batch inversion):
Extended Joaobian addtions:
Thus, the number of inversion operations is reduced from to 1, while the rest is replaced by multiplications, leading to a dramatic boost in efficiency. If , then Affine addition (with batch inversion) is more efficient.
Arithmetic in the finite field is performed using 377-bit operations in , with a Montgomery parameter of . Field multiplication is implemented using three 384-bit integer multiplications based on the
Additionally, since , the system can safely operate on inputs for multiplication. This characteristic allows certain modular reductions to be skipped when implementing the Curve Adder module:
When , the sum → modular reduction is unnecessary
To handle subtraction safely, compute → modular reduction is also unnecessary
Instead of using the standard double-and-add approach for multiplying constants, CycloneMSM employs NAF (Non-Adjacent Form). This reduces the Hamming weight of the constant from to approximately , resulting in several advantages:
Specifically, adder tree depth is reduced from to , where denotes the Hamming weight of the constant .
outputs are produced after 96 cycles (= 384 ns)
outputs are produced after 39 cycles (= 156 ns)
In Twisted Edwards curves, uses the following two inputs:
(extended projective)
(extended affine)
This operation involves 7 field multiplications, and it is optimized using formulas from []. Due to its pipelined nature, the adder can accept new points every cycle.
A general operation requires 9 multiplications. CycloneMSM implements this as a 2-cycle pipelined process. The processing flow is as follows:
Inputs: , → passed to the multiplier
Inputs: , curve constant → passed to the multiplier
where and
The user provides points in Weierstrass affine coordinates
Typically affine → extended affine:
The host splits each scalar into 16-bit windows () using PREPROCESS()
A total of windows are created
The FPGA accumulates buckets in a sequential manner using the following formula:
This operation is performed using a 2-cycle pipeline implemented on the FPGA
A 96-stage pipeline running at 250MHz accepts new inputs every 4ns, enabling the core operation of MSM——to be processed in every clock cycle.
Written by from