CycloneMSM

Presentation: https://youtu.be/djA3mzn7BPg

Introduction

CycloneMSM is an FPGA-accelerated architecture for Multi-Scalar Multiplication (MSM) over the BLS12-377 elliptic curve, developed by Jump Trading and Jump Crypto. Optimized for the case of a fixed base point, this implementation achieves high-performance execution, completing 4 million-point MSMs in under one second and 64 million-point MSMs in approximately six seconds—delivering over 6× speedup compared to other state-of-the-art software. At its core, CycloneMSM leverages the bucket method of the Pippenger algorithm, and applies aggressive pipeline optimization and scheduling techniques to perform curve operations without bottlenecks.

Background

FPGA

To dramatically accelerate MSM computation, CycloneMSM leverages Field Programmable Gate Arrays (FPGAs). Unlike CPUs or GPUs, FPGAs offer a fundamentally different computation model, making them particularly well-suited for exploiting the parallelism and pipeline structure inherent in MSM.

Comparison: CPU vs GPU vs FPGA

| Architecture | Characteristics | Suitability for MSM |
| --- | --- | --- |
| CPU | General-purpose processor; optimized for sequential tasks | ❌ Too slow |
| GPU | Designed for parallel computation with many execution units | ⭕ Reasonable performance |
| FPGA | Custom logic design and deep pipelining capabilities | ✅ Ideal choice |

  • CPUs excel at complex control flow and branching, but are inefficient for highly repetitive operations like point additions and multiplications.

  • GPUs offer strong parallelism but struggle with fine-grained scheduling and memory access control.

  • FPGAs, on the other hand, allow direct circuit-level customization, enabling deeply pipelined designs that sustain a continuous flow of operations with minimal control overhead.

Coordinate Systems

MSM involves millions of point additions, so the efficiency of the addition operation plays a crucial role in overall performance. In particular, affine coordinates require an inverse operation per addition, which can significantly increase the total computation cost. (Reference: Explicit Formulas Database)

While Extended Jacobian coordinates for Weierstrass Curves enable inverse-free operations, the formulas used depend on the context (e.g., addition vs doubling). This results in variable computational cost, making it difficult to pipeline efficiently in hardware.

In contrast, Extended Projective coordinates for Twisted Edwards form of elliptic curves support both addition and doubling with a single unified formula, resulting in constant computational cost. This property makes them highly suitable for hardware pipeline design.

Accordingly, CycloneMSM adopts the following coordinate system strategy:

  • Software implementation: Uses Extended Jacobian coordinates for inverse-free operations

  • FPGA hardware implementation: Uses Extended Projective coordinates for pipelining optimization

This coordinate system selection is a key architectural decision aimed at maximizing computational efficiency in each environment.

Pippenger Algorithm

The Pippenger algorithm is a classic technique for efficiently performing MSM (Multi-Scalar Multiplication). Given $N$ pairs of (point, scalar) and a window size $c$, each of the $W$ windows requires the following operations:

  • Bucket Accumulation: $N \times \mathsf{MixedAdd}$

  • Bucket Reduction: $2^c \times \mathsf{Add}$

  • Window Reduction: $c \times \mathsf{Double} + 1 \times \mathsf{Add}$

Among these, Bucket Accumulation accounts for the majority of the total computation. Therefore, optimizing this stage is critical for improving overall MSM performance.
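To make the structure concrete, here is a minimal Python sketch of the bucket method over a toy additive group: plain integers stand in for curve points, so integer addition plays the role of $\mathsf{Add}$/$\mathsf{MixedAdd}$. The function name and parameters are illustrative, not CycloneMSM's API.

```python
def pippenger_msm(scalars, points, c):
    """Sketch of the Pippenger bucket method over a toy additive group:
    integers stand in for curve points, so the result equals sum(s_i * P_i)."""
    bits = max(s.bit_length() for s in scalars)
    num_windows = -(-bits // c)          # ceil(bits / c)
    result = 0
    for w in reversed(range(num_windows)):
        for _ in range(c):               # window reduction: c doublings
            result += result
        buckets = [0] * (1 << c)         # bucket 0 is never used
        for s, p in zip(scalars, points):
            k = (s >> (w * c)) & ((1 << c) - 1)
            if k:
                buckets[k] += p          # bucket accumulation (MixedAdd)
        running = acc = 0                # bucket reduction via running sums
        for k in range((1 << c) - 1, 0, -1):
            running += buckets[k]        # running = B_K + ... + B_k
            acc += running               # acc converges to sum_k k * B_k
        result += acc
    return result
```

Bucket 0 is skipped because a zero window digit contributes nothing, and the bucket reduction uses running sums so that $\sum_k k \cdot B_k$ costs only additions.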

Protocol Explanation

Scheduler: Optimizing Operation Order to Avoid Pipeline Conflicts

CycloneMSM is designed around an FPGA-based architecture, enabling fully pipelined execution of computations. In particular, the Bucket Accumulation phase, which dominates MSM computation, is highly accelerated by this pipelining structure.

How the Pipeline Works

The Curve Adder, the core computational unit of CycloneMSM, is a pipelined structure designed to efficiently perform point addition on Twisted Edwards curves. This module supports both $\mathsf{MixedAdd}$ and $\mathsf{Add}$ operations and is optimized for execution on FPGA hardware.

The Curve Adder takes $T$ cycles to complete a single addition; in CycloneMSM, $T = 96$. A new operation is fed into the pipeline every cycle, allowing multiple point additions across different buckets to run in parallel, as illustrated below:

| Cycle | Operation entering the pipeline |
| --- | --- |
| $t$ | $B_{k^{(t)}} + P^{(t)}$ |
| $t+1$ | $B_{k^{(t+1)}} + P^{(t+1)}$ |
| $t+2$ | $B_{k^{(t+2)}} + P^{(t+2)}$ |
| $\dots$ | $\dots$ |
| $t+T-1$ | $B_{k^{(t+T-1)}} + P^{(t+T-1)}$ |
| $t+T$ | result of cycle $t$ is written back to $B_{k^{(t)}}$ |

The addition issued at each cycle is defined as:

$B_{k^{(t)}} \leftarrow B_{k^{(t)}} + P^{(t)}$

Definition of Terms

  • $t$: The pipeline time step (i.e., cycle index). At each $t$, a new operation enters the pipeline.

  • $P^{(t)}$: The point being added at time $t$.

  • $k^{(t)}$: The reduced scalar index corresponding to point $P^{(t)}$. It determines the bucket.

  • $B_{k^{(t)}}$: The bucket that accumulates all points whose reduced scalar index is $k^{(t)}$.

However, for this pipeline to operate correctly, no two additions in flight may access the same bucket $B_k$ within a $T$-cycle window. This is enforced by the following constraint:

$k^{(t_1)} \ne k^{(t_2)} \quad \text{if } |t_1 - t_2| < T$

Design Goals of the Scheduler

Based on this, the Scheduler must satisfy two key objectives:

  1. Correctness: Prevent pipeline conflicts by adhering to the constraint above.

  2. Efficiency: Avoid excessive reordering of points; most operations should proceed in near-original order.

Scheduler Implementation Strategies

CycloneMSM explores two strategies to meet these goals:

  1. Delayed Scheduler

  • Idea: When a conflict is detected, defer processing of that point by placing it in a FIFO queue and retry it later.

  • Estimated conflict count: Approximately $\frac{NT}{2^{c-1}}$.

  • Example: With $N = 2^{26}$, $c = 16$, $T = 100$, about 204,800 conflicts arise (~0.3% of points).

  • In practice: Most points are processed successfully within 2–3 passes.

  • Advantages: Simple architecture and easy to implement in hardware

  • Drawbacks:

    • Requires maintaining a queue of size proportional to the number of conflicts, which can be memory-intensive

    • In worst-case scenarios (e.g., non-uniform scalars), frequent collisions could degrade performance

CycloneMSM adopts this approach in its actual FPGA implementation.
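A software model of this strategy, under the simplifying assumption that the scheduler issues one operation per cycle and re-queues conflicting indices in a FIFO (all names are illustrative, not the FPGA implementation):

```python
from collections import deque

def delayed_schedule(indices, T):
    """Toy model of the Delayed Scheduler: issue one bucket index per
    cycle into an adder pipeline of latency T, deferring any index
    whose bucket was touched within the last T cycles to a FIFO."""
    pending = deque(indices)
    last = {}        # bucket index -> cycle at which it last entered the adder
    schedule = []    # (cycle, bucket index) pairs actually issued
    t = 0
    while pending:
        k = pending.popleft()
        if t - last.get(k, -T) < T:
            pending.append(k)   # conflict: defer to the back of the queue
        else:
            last[k] = t
            schedule.append((t, k))
        t += 1                  # one cycle passes either way
    return schedule
```

Deferred indices simply retry on a later pass once their bucket has left the pipeline, mirroring the 2–3 passes observed in practice.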

  2. Greedy Scheduler

  • Idea: Attempt to process delayed points at the earliest valid time (e.g., $t + T + 1$), even if conflicts occurred earlier.

  • Advantages: Queue is drained quickly, leading to smaller queue size

  • Drawbacks:

    • Requires tracking last access times for all buckets, which complicates state management

    • Increased memory updates and lookups each cycle place heavy demands on hardware resources

Due to memory access overhead, this strategy was not adopted in CycloneMSM’s hardware environment.

Software-Side Affine Optimization

When implementing MSM in software, non-Affine coordinate systems such as Extended Jacobian are typically used. The main reason is that point addition in Affine coordinates requires field inversion, which is computationally expensive. For instance, consider a typical Bucket Accumulation where multiple points are added sequentially:

$P \leftarrow P_1 + P_2 + \dots + P_{n-1} + P_n$

If Affine coordinates were used, each addition would involve an inverse operation, leading to significant performance overhead.

Batzorig Zorigoo mentioned that batch inversion can also be applied using a tree-structured addition approach, which reduces the number of inversions to $\log_2 n \times \mathsf{Inv}$.

Scheduler Enables the Use of Affine Coordinates

However, thanks to the Scheduler introduced in CycloneMSM, points can now be grouped into batches for addition without conflicts between buckets:

$$
\begin{aligned}
B_{k^{(t)}} &\leftarrow B_{k^{(t)}} + P^{(t)} \\
B_{k^{(t+1)}} &\leftarrow B_{k^{(t+1)}} + P^{(t+1)} \\
&\ \ \vdots \\
B_{k^{(t+T-1)}} &\leftarrow B_{k^{(t+T-1)}} + P^{(t+T-1)}
\end{aligned}
$$

This structure allows for batch addition of $T$ Affine points, making it possible to apply batch inversion and significantly reduce the overall computation cost.

Comparison of Operation Costs

  • Affine addition (1 time): $3 \times \mathsf{Mul} + 1 \times \mathsf{Inv}$

  • $T$ Affine additions (naive): $3T \times \mathsf{Mul} + T \times \mathsf{Inv}$

  • $T$ Affine additions (with batch inversion): $6T \times \mathsf{Mul} + 1 \times \mathsf{Inv}$

  • $T$ Extended Jacobian additions: $10T \times \mathsf{Mul}$

Thus, the number of inversion operations is reduced from $T$ to 1, with the rest replaced by multiplications, yielding a dramatic boost in efficiency. Whenever $4T \times \mathsf{Mul} > 1 \times \mathsf{Inv}$, the $T$ Affine additions with batch inversion are more efficient than the Extended Jacobian alternative.
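The batch-inversion step itself is Montgomery's trick, sketched below over a toy prime field (the modulus here stands in for the BLS12-377 base field; inputs must be nonzero):

```python
def batch_inverse(xs, p):
    """Montgomery's trick: invert n nonzero field elements with a single
    field inversion plus ~3n multiplications, working mod the prime p."""
    n = len(xs)
    prefix = [1] * (n + 1)
    for i, x in enumerate(xs):
        prefix[i + 1] = prefix[i] * x % p   # running products x_1 ... x_i
    inv = pow(prefix[n], p - 2, p)          # the single field inversion
    out = [0] * n
    for i in range(n - 1, -1, -1):
        out[i] = prefix[i] * inv % p        # 1/x_i = (x_1...x_{i-1}) / (x_1...x_i)
        inv = inv * xs[i] % p               # peel x_i off the running inverse
    return out
```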

Real-World Implementations

This Affine optimization has been adopted in actual open-source libraries.

Thanks to this optimization, a 10–20% performance improvement over Extended Jacobian-based implementations has been reported, significantly enhancing MSM computation efficiency in practice.

FPGA Design

Field Arithmetic

Arithmetic in the finite field $\mathbb{F}_q$ is performed using 377-bit operations in Montgomery representation, with Montgomery parameter $R = 2^{384}$. Field multiplication is implemented using three 384-bit integer multiplications based on the Karatsuba algorithm.

Additionally, since $R \ge 4q$, the system can safely operate on inputs $a, b \in [0, 2q)$ for multiplication. This characteristic allows certain modular reductions to be skipped when implementing the Curve Adder module:

  1. When $a, b \in [0, q)$, the sum $a + b \in [0, 2q)$, so modular reduction is unnecessary

  2. To handle subtraction safely, compute $q + a - b \in [0, 2q)$, so modular reduction is also unnecessary

By avoiding modular reductions in these cases, resource usage and processing delays in the pipeline are further minimized.
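A tiny model of these reduction-free operations, with a toy modulus standing in for the 377-bit prime $q$ (names and the toy value are illustrative):

```python
Q = 11  # toy modulus standing in for the 377-bit prime q

def lazy_add(a, b):
    """For a, b in [0, q), a + b already lies in [0, 2q), so no modular
    reduction is needed before the multiplier (valid because R >= 4q)."""
    return a + b

def lazy_sub(a, b):
    """q + a - b lies in [0, 2q) for a, b in [0, q) and is congruent to
    a - b mod q, again skipping the reduction."""
    return Q + a - b
```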

Constant Multiplier

In Montgomery representation, two constant multiplications are required as follows:

$a \times (R^2 \bmod q) \times (R^{-1} \bmod q)$

Instead of using the standard double-and-add approach for multiplying by constants, CycloneMSM employs NAF (Non-Adjacent Form). This reduces the expected density of nonzero digits in the constant from $\frac{1}{2}$ to approximately $\frac{1}{3}$, resulting in several advantages:

  • Fewer additions in the double-and-add process → improved speed

  • Shallower and faster adder trees in the FPGA design

    • Specifically, adder tree depth is reduced from $\log_2 W(b)$ to $\log_3 W(b)$, where $W(b)$ denotes the Hamming weight of the constant $b$.

This optimization plays a critical role in improving both performance and resource efficiency in the CycloneMSM hardware pipeline.
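The density reduction can be seen with the textbook NAF recoding routine (this is the standard algorithm, not CycloneMSM's hardware):

```python
def naf(n):
    """Non-Adjacent Form recoding: digits in {-1, 0, 1}, least-significant
    first, with no two adjacent nonzero digits. Expected nonzero-digit
    density drops from 1/2 (binary) to about 1/3."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d            # make n divisible by 4 (or zero)
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits
```

For example, `naf(7)` yields digits for $8 - 1$ rather than $4 + 2 + 1$, trading three nonzero digits for two.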

Curve Arithmetic

Pipelined Structure

  • Clock speed: 250 MHz → 1 cycle = 4 ns

  • Cycle latency:

    • $\mathsf{MixedAdd}$ outputs are produced after 96 cycles (= 384 ns)

    • $\mathsf{Mul}$ outputs are produced after 39 cycles (= 156 ns)

MixedAdd Operation (Refer to the left side of Figure 4)

In Twisted Edwards curves, $\mathsf{MixedAdd}$ takes the following two inputs:

  • $P_1 = (X_1 : Y_1 : Z_1 : T_1)$ (extended projective)

  • $P_2 = (x_2, y_2, u_2 = kx_2y_2)$ (extended affine)

This operation involves 7 field multiplications, and it is optimized using formulas from [HWCD08, Sec. 4.2]. Due to its pipelined nature, the adder can accept new points every cycle.

Full Add Operation (Refer to the right side of Figure 4)

A general Add\mathsf{Add} operation requires 9 multiplications. CycloneMSM implements this as a 2-cycle pipelined process. The processing flow is as follows:

  1. Stage 0

    Inputs: $Z_1$, $Z_2$ → passed to the multiplier

  2. Stage 1

    Inputs: $T_2$, curve constant $k$ → passed to the multiplier

  3. Stage 40 (After the stages required for a single Montgomery Multiplier)

    Invoke MixedAdd with outputs below:

$\mathsf{MixedAdd}\left(P_1 = (X_1 : Y_1 : r_0 : T_1),\ P_2 = (X_2, Y_2, r_1)\right)$

where $r_0 = Z_1 \cdot Z_2$ and $r_1 = k \cdot T_2$.

This design converts a full Add into a MixedAdd call, effectively reusing the existing pipeline while precomputing necessary multiplications to maintain efficiency.
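As a sanity check on the unified formula the pipeline relies on, here is the extended-coordinate twisted Edwards addition from [HWCD08] instantiated on a hypothetical toy curve $x^2 + y^2 = 1 + 2x^2y^2$ over $\mathbb{F}_{13}$. CycloneMSM's actual curve is the Twisted Edwards form of BLS12-377; every constant below is illustrative only.

```python
P = 13                   # toy prime; the real field is the BLS12-377 base field
A_COEF, D_COEF = 1, 2    # toy curve x^2 + y^2 = 1 + 2 x^2 y^2 over F_13

def to_ext(x, y):
    """Affine -> extended projective (X : Y : Z : T) with Z = 1, T = XY."""
    return (x % P, y % P, 1, x * y % P)

def unified_add(p1, p2):
    """Unified twisted Edwards addition in extended coordinates ([HWCD08]):
    the same formula covers both Add and Double, giving the constant
    per-operation cost that hardware pipelining depends on."""
    X1, Y1, Z1, T1 = p1
    X2, Y2, Z2, T2 = p2
    A = X1 * X2 % P
    B = Y1 * Y2 % P
    C = D_COEF * T1 % P * T2 % P
    Dv = Z1 * Z2 % P
    E = ((X1 + Y1) * (X2 + Y2) - A - B) % P
    F = (Dv - C) % P
    G = (Dv + C) % P
    H = (B - A_COEF * A) % P
    return (E * F % P, G * H % P, F * G % P, E * H % P)

def to_affine(pt):
    X, Y, Z, _ = pt
    z_inv = pow(Z, P - 2, P)      # Fermat inversion in the toy field
    return (X * z_inv % P, Y * z_inv % P)
```

Because the formula is complete on this curve, doubling is just `unified_add(p, p)`; no branch on whether the inputs are equal is needed.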

MSM Acceleration

Note: the CycloneMSM paper denotes buckets by $S_k$; this article uses $B_k$ throughout.

The MSM operation in CycloneMSM is executed through coordinated interaction between the FPGA and the host, following the flow of functions outlined below:

  • MSM_INIT():

    • The user provides NN points in Weierstrass affine coordinates

    • The host converts these points to the Twisted Edwards coordinate system

      • Typically affine → extended affine: $(x, y) \rightarrow (x, y, u = kxy)$

    • The converted points are transferred to the FPGA and stored in DDR memory

    • Since the conversion involves inversion, batch inversion is used to optimize performance

  • MSM():

    • The host splits each scalar into 16-bit windows ($c = 16$) using PREPROCESS()

      • A total of $W = 16$ windows are created

    • For each window, the following functions are executed:

      1. ACCUMULATE(j): performs bucket accumulation

      2. AGGREGATE(j): performs bucket reduction

    • The final result of all windows is computed by the host via Window Reduction

  • ACCUMULATE(j):

    • The host invokes FPGA.ACCUMULATE(j)

    • Scalars are reduced to 16-bit reduced scalars and streamed to the FPGA

      • Data is batched and transmitted in 64-bit chunks (8 scalars per batch)

  • FPGA.ACCUMULATE(j):

    • The FPGA invokes SCHED() to read points from DDR memory

      • If a conflict is detected, the scalar and point are stored back into DDR via FIFO

      • Otherwise, the addition is performed and the result is stored in the corresponding bucket in SRAM

  • AGGREGATE():

    • The host invokes FPGA.AGGREGATE()

  • FPGA.AGGREGATE():

    • The FPGA accumulates the $2^{c-1}$ buckets in a sequential manner using the following formula:

      $R = B_K + (B_K + B_{K-1}) + \dots + (B_1 + \dots + B_K)$

    • This operation is performed using a 2-cycle Add\mathsf{Add} pipeline implemented on the FPGA
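In the same toy integer model used earlier (integers standing in for curve points), the PREPROCESS split and the running-sum aggregation can be sketched as follows; the function names mirror the protocol steps but are illustrative:

```python
def preprocess(scalar, c=16, windows=16):
    """PREPROCESS: split a scalar into `windows` c-bit reduced scalars,
    least-significant window first."""
    mask = (1 << c) - 1
    return [(scalar >> (c * j)) & mask for j in range(windows)]

def aggregate(buckets):
    """Bucket reduction via running sums: computes sum_k k * B_k using
    only additions, matching R = B_K + (B_K + B_{K-1}) + ... here over
    plain integers standing in for curve points."""
    running = acc = 0
    for b in reversed(buckets[1:]):   # walk from the top bucket down
        running += b                  # running = B_K + ... + B_k
        acc += running                # acc accumulates the running sums
    return acc
```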

Conclusion

CycloneMSM leverages the architectural strengths of FPGAs and the inherent parallelism of MSM operations to deliver a high-performance system that significantly outpaces traditional software-based MSM implementations. The system's achievements are underpinned by the following design philosophies and optimization strategies:

  • Optimized Coordinate System Selection:

    On hardware, the Twisted Edwards coordinate system was chosen for its uniform operation count, which maximizes pipelining efficiency. On the software side, performance was improved through affine coordinates combined with batch inversion.

  • Pipelined Curve Adder:

    A 96-stage pipeline running at 250 MHz accepts a new input every 4 ns, enabling the core operation of MSM, $\mathsf{MixedAdd}$, to be processed in every clock cycle.

  • Conflict-Free Scheduling:

    By applying a Delayed Scheduler, most points were processed without collisions. This also enabled batch inversion in the affine coordinate system, contributing to performance gains on the software side.

  • Modular and Parallel MSM Flow:

    The MSM process—initialization, bucket accumulation, reduction, and window reduction—was modularized and optimized, with roles distributed efficiently between the FPGA and host. This maximized pipeline utilization across the system.

  • Superior Performance and Energy Efficiency:

    CycloneMSM completes MSM on 64M points in under 6 seconds, achieving 4–6× speedups compared to software, while consuming only one-tenth the power.

In summary, CycloneMSM is more than just a hardware implementation—it is a deeply integrated system combining algorithmic and architectural design, demonstrating the real potential of hardware-accelerated MSM for more complex ZK and large-scale proof systems in the future.


Written by Ryan Kim from A41
