CycloneMSM

Presentation: https://youtu.be/djA3mzn7BPg


Last updated 1 month ago

Introduction

CycloneMSM is an FPGA-accelerated architecture for Multi-Scalar Multiplication (MSM) over the BLS12-377 elliptic curve, developed by Jump Trading and Jump Crypto. Optimized for the case of a fixed base point, this implementation achieves high-performance execution, completing a 4 million-point MSM in under one second and a 64 million-point MSM in approximately six seconds, delivering over a 6× speedup compared to other state-of-the-art software implementations. At its core, CycloneMSM leverages the bucket method of the Pippenger algorithm and applies aggressive pipeline optimization and scheduling techniques to perform curve operations without bottlenecks.

Background

FPGA

To dramatically accelerate MSM computation, CycloneMSM leverages Field Programmable Gate Arrays (FPGAs). Unlike CPUs or GPUs, FPGAs offer a fundamentally different computation model, making them particularly well-suited for exploiting the parallelism and pipeline structure inherent in MSM.

Comparison: CPU vs GPU vs FPGA

| Architecture | Characteristics | Suitability for MSM |
| --- | --- | --- |
| CPU | General-purpose processor; optimized for sequential tasks | ❌ Too slow |
| GPU | Designed for parallel computation with many execution units | ⭕ Reasonable performance |
| FPGA | Custom logic design and deep pipelining capabilities | ✅ Ideal choice |

  • CPUs excel at complex control flow and branching, but are inefficient for highly repetitive operations like point additions and multiplications.

  • GPUs offer strong parallelism but struggle with fine-grained scheduling and memory access control.

  • FPGAs, on the other hand, allow direct circuit-level customization, enabling deeply pipelined designs that sustain a continuous flow of operations with minimal control overhead.

Coordinate Systems

MSM involves millions of point additions, so the efficiency of the addition operation plays a crucial role in overall performance. In particular, point additions in affine coordinates require an inverse operation per addition, which can significantly increase the total computation cost. (Reference: Explicit Formulas Database)

Accordingly, CycloneMSM adopts the following coordinate system strategy:

  • Software implementation: Uses Extended Jacobian coordinates for inverse-free operations

  • FPGA hardware implementation: Uses Extended projective coordinates for pipelining optimization

This coordinate system selection is a key architectural decision aimed at maximizing computational efficiency in each environment.

Pippenger Algorithm

Of the Pippenger algorithm's three stages (bucket accumulation, bucket reduction, and window reduction), Bucket Accumulation accounts for the majority of the total computation. Therefore, optimizing this stage is critical for improving overall MSM performance.

Protocol Explanation

Scheduler: Optimizing Operation Order to Avoid Pipeline Conflicts

CycloneMSM is designed around an FPGA-based architecture, enabling fully pipelined execution of computations. In particular, the Bucket Accumulation phase, which dominates MSM computation, is highly accelerated by this pipelining structure.

How the Pipeline Works

The addition at each cycle is defined as $B_{k^{(t)}} \leftarrow B_{k^{(t)}} + P^{(t)}$.

Definition of Terms

Design Goals of the Scheduler

Based on this, the Scheduler must satisfy two key objectives:

  1. Correctness: Prevent pipeline conflicts by ensuring that no two additions within $T$ cycles access the same bucket.

  2. Efficiency: Avoid excessive reordering of points; most operations should proceed in near-original order.

Scheduler Implementation Strategies

CycloneMSM explores two strategies to meet these goals:

  1. Delayed Scheduler

  • Idea: When a conflict is detected, defer processing of that point by placing it in a FIFO queue and retry it later.

  • In practice: Most points are processed successfully within 2–3 passes.

  • Advantages: Simple architecture and easy to implement in hardware

  • Drawbacks:

    • Requires maintaining a queue of size proportional to the number of conflicts, which can be memory-intensive

    • In worst-case scenarios (e.g., non-uniform scalars), frequent collisions could degrade performance

CycloneMSM adopts this approach in its actual FPGA implementation.
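The retry mechanism can be illustrated with a small software simulation. This is only a sketch, not the FPGA logic: the function name `delayed_schedule`, the bucket count, and the value of `T` below are illustrative choices.

```python
from collections import deque
import random

def delayed_schedule(ops, T):
    """Issue (bucket, point) ops so that no bucket is touched twice
    within T pipeline slots; conflicting ops go to a FIFO and are
    retried in a later pass, mirroring the delayed-scheduler idea."""
    pending = deque(ops)
    issued = []        # (slot, bucket, point), in issue order
    last = {}          # bucket -> slot of its most recent issue
    slot, passes = 0, 0
    while pending:
        passes += 1
        retry = deque()
        for bucket, point in pending:
            prev = last.get(bucket)
            if prev is not None and slot - prev < T:
                retry.append((bucket, point))   # hazard: defer via FIFO
            else:
                last[bucket] = slot
                issued.append((slot, bucket, point))
            slot += 1                           # one pipeline slot per attempt
        pending = retry
    return issued, passes

random.seed(1)
ops = [(random.randrange(64), i) for i in range(1000)]
issued, passes = delayed_schedule(ops, T=8)
```

With uniformly random buckets, almost all points issue within the first few passes, consistent with the observation above that 2–3 passes usually suffice.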

  2. Greedy Scheduler

  • Advantages: Queue is drained quickly, leading to smaller queue size

  • Drawbacks:

    • Requires tracking last access times for all buckets, which complicates state management

    • Increased memory updates and lookups each cycle place heavy demands on hardware resources

Due to memory access overhead, this strategy was not adopted in CycloneMSM’s hardware environment.

Software-Side Affine Optimization

When implementing MSM in software, non-affine coordinate systems such as Extended Jacobian are typically used. The main reason is that point addition in affine coordinates requires field inversion, which is computationally expensive. For instance, consider a typical Bucket Accumulation where multiple points are added sequentially: $P \leftarrow P_1 + P_2 + \dots + P_n$.

If affine coordinates were used, each addition would involve an inverse operation, leading to significant performance overhead.

Scheduler Enables the Use of Affine Coordinates

However, thanks to the Scheduler introduced in CycloneMSM, points can now be grouped into batches for addition without conflicts between buckets:

Comparison of Operation Costs

Real-World Implementations

This affine optimization has been adopted in actual open-source libraries such as gnark-crypto (#261) and halo2curves (#130).

Thanks to this optimization, a 10–20% performance improvement over Extended Jacobian-based implementations has been reported, significantly enhancing MSM computation efficiency in practice.

FPGA Design

Field Arithmetic

By avoiding modular reductions in these cases, resource usage and processing delays in the pipeline are further minimized.

Constant Multiplier

In Montgomery representation, two constant multiplications are required as follows: $a \times (R^2 \bmod q) \times (R^{-1} \bmod q)$.

  • Fewer additions in the double-and-add process → improved speed

  • Shallower and faster adder trees in the FPGA design

This optimization plays a critical role in improving both performance and resource efficiency in the CycloneMSM hardware pipeline.

Curve Arithmetic

Pipelined Structure

  • Clock speed: 250 MHz → 1 cycle = 4 ns

  • Cycle latency: $\mathsf{MixedAdd}$ outputs after 96 cycles (384 ns); $\mathsf{mul}$ outputs after 39 cycles (156 ns)

MixedAdd Operation (Refer to the left side of Figure 4)

Full Add Operation (Refer to the right side of Figure 4)

  1. Stage 0

  2. Stage 1

  3. Stage 40 (After the stages required for a single Montgomery Multiplier)

    Invoke MixedAdd with outputs below:

This design converts a full Add into a MixedAdd call, effectively reusing the existing pipeline while precomputing necessary multiplications to maintain efficiency.

MSM Acceleration

The MSM operation in CycloneMSM is executed through coordinated interaction between the FPGA and the host, following the flow of functions outlined below:

  • MSM_INIT():

    • The converted points are transferred to the FPGA and stored in DDR memory

    • Since the conversion involves inversion, batch inversion is used to optimize performance

  • MSM():

    • For each window, the following functions are executed:

      1. ACCUMULATE(j): performs bucket accumulation

      2. AGGREGATE(j): performs bucket reduction

    • The final result of all windows is computed by the host via Window Reduction

  • ACCUMULATE(j):

    • The host invokes FPGA.ACCUMULATE(j)

    • Scalars are reduced to 16-bit reduced scalars and streamed to the FPGA

      • Data is batched and transmitted in 64-bit chunks (8 scalars per batch)

  • FPGA.ACCUMULATE(j):

    • The FPGA invokes SCHED() to read points from DDR memory

      • If a conflict is detected, the scalar and point are stored back into DDR via FIFO

      • Otherwise, the addition is performed and the result is stored in the corresponding bucket in SRAM

  • AGGREGATE():

    • The host invokes FPGA.AGGREGATE()

  • FPGA.AGGREGATE():

Conclusion

CycloneMSM leverages the architectural strengths of FPGAs and the inherent parallelism of MSM operations to deliver a high-performance system that significantly outpaces traditional software-based MSM implementations. The system's achievements are underpinned by the following design philosophies and optimization strategies:

  • Optimized Coordinate System Selection:

    On hardware, the Twisted Edwards coordinate system was chosen for its uniform operation count, which maximizes pipelining efficiency. On the software side, performance was improved through affine coordinates combined with batch inversion.

  • Pipelined Curve Adder:

  • Conflict-Free Scheduling:

    By applying a Delayed Scheduler, most points were processed without collisions. This also enabled batch inversion in the affine coordinate system, contributing to performance gains on the software side.

  • Modular and Parallel MSM Flow:

    The MSM process—initialization, bucket accumulation, reduction, and window reduction—was modularized and optimized, with roles distributed efficiently between the FPGA and host. This maximized pipeline utilization across the system.

  • Superior Performance and Energy Efficiency:

    CycloneMSM completes MSM on 64M points in under 6 seconds, achieving 4–6× speedups compared to software, while consuming only one-tenth the power.

In summary, CycloneMSM is more than just a hardware implementation—it is a deeply integrated system combining algorithmic and architectural design, demonstrating the real potential of hardware-accelerated MSM for more complex ZK and large-scale proof systems in the future.

References

While Extended Jacobian coordinates for Weierstrass curves enable inverse-free operations, the formulas used depend on the context (e.g., addition vs. doubling). This results in variable computational cost, making it difficult to pipeline efficiently in hardware.

In contrast, Extended projective coordinates for the Twisted Edwards form of elliptic curves support both addition and doubling with a single unified formula, resulting in constant computational cost. This property makes them highly suitable for hardware pipeline design.

The Pippenger algorithm is a classic technique for efficiently performing MSM (Multi-Scalar Multiplication). Given $N$ pairs of (point, scalar) and a window size $c$, each window $W$ requires the following operations:

Bucket Accumulation: $N \times \mathsf{MixedAdd}$

Bucket Reduction: $2^c \times \mathsf{Add}$

Window Reduction: $c \times \mathsf{Double} + 1 \times \mathsf{Add}$

The Curve Adder, a core computational unit of CycloneMSM, is a pipelined structure designed to efficiently perform point addition on Twisted Edwards curves. This module supports both $\mathsf{MixedAdd}$ and $\mathsf{Add}$ operations and is optimized for execution on FPGA hardware.

The Curve Adder takes $T$ cycles to complete a single addition. In CycloneMSM, $T = 96$. A new operation is fed into the pipeline every cycle, allowing multiple point additions across different buckets to run in parallel, as illustrated below:

$$B_{k^{(t)}} \leftarrow B_{k^{(t)}} + P^{(t)}$$

  • $t$: The pipeline time step (i.e., cycle index). At each $t$, a new operation enters the pipeline.

  • $P^{(t)}$: The point being added at time $t$.

  • $k^{(t)}$: The reduced scalar index corresponding to point $P^{(t)}$. It determines the bucket.

  • $B_{k^{(t)}}$: The bucket that accumulates all points whose reduced scalar index is $k^{(t)}$.

However, for this pipeline to operate correctly, no two additions in the pipeline should access the same bucket $B_k$ within a $T$-cycle window. This is enforced by the following constraint:

$$k^{(t_1)} \ne k^{(t_2)} \quad \text{if } |t_1 - t_2| < T$$

Estimated conflict count: approximately $\frac{NT}{2^{c-1}}$.

Example: With $N = 2^{26}$, $c = 16$, $T = 100$, about 204,800 conflicts arise (roughly 0.3% of points).
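This estimate is easy to sanity-check in software with smaller parameters. The sketch below counts first-pass conflicts for uniformly random reduced scalars; the parameter values are illustrative, not the paper's.

```python
import random

def count_conflicts(num_points, num_buckets, T, seed=0):
    """Count points whose bucket was already touched within the
    previous T pipeline slots (first-pass conflicts)."""
    rng = random.Random(seed)
    last = {}
    conflicts = 0
    for t in range(num_points):
        k = rng.randrange(num_buckets)
        if k in last and t - last[k] < T:
            conflicts += 1
        last[k] = t
    return conflicts

N, c, T = 1 << 16, 12, 64
observed = count_conflicts(N, 1 << (c - 1), T)
expected = N * T / (1 << (c - 1))        # the NT / 2^(c-1) estimate
```

The observed count lands near the $NT/2^{c-1}$ estimate for these sizes.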

Idea: Attempt to process delayed points at the earliest valid time (e.g., $t + T + 1$), even if conflicts occurred earlier.

$$P \leftarrow P_1 + P_2 + \dots + P_{n-1} + P_n$$

It has been noted that batch inversion can be applied using a tree-structured addition approach, which reduces the inversion cost to $\log_2 n \times \mathsf{Inv}$.

$$\begin{aligned} B_{k^{(t)}} &\leftarrow B_{k^{(t)}} + P^{(t)} \\ B_{k^{(t+1)}} &\leftarrow B_{k^{(t+1)}} + P^{(t+1)} \\ &\;\,\vdots \\ B_{k^{(t+T-1)}} &\leftarrow B_{k^{(t+T-1)}} + P^{(t+T-1)} \end{aligned}$$

This structure allows for batch addition of $T$ affine points, making it possible to apply batch inversion and significantly reduce the overall computation cost.

Affine addition (single): $3 \times \mathsf{Mul} + 1 \times \mathsf{Inv}$

$T$ affine additions (naive): $3T \times \mathsf{Mul} + T \times \mathsf{Inv}$

$T$ affine additions (with batch inversion): $6T \times \mathsf{Mul} + 1 \times \mathsf{Inv}$

$T$ Extended Jacobian additions: $10T \times \mathsf{Mul}$

Thus, the number of inversion operations is reduced from $T$ to 1, while the rest is replaced by multiplications, leading to a dramatic boost in efficiency. Whenever $4T \times \mathsf{Mul} > 1 \times \mathsf{Inv}$, the batch-inversion affine approach is more efficient than $T$ Extended Jacobian additions.
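The single inversion in the batch case comes from Montgomery's batch-inversion trick, sketched below over a small prime field. The prime here is purely illustrative; in practice the BLS12-377 base field would be used.

```python
def batch_inverse(xs, p):
    """Invert n nonzero field elements with one inversion and ~3n
    multiplications (Montgomery's trick)."""
    n = len(xs)
    prefix = [1] * (n + 1)
    for i, x in enumerate(xs):
        prefix[i + 1] = prefix[i] * x % p    # prefix[i+1] = x_0 * ... * x_i
    inv = pow(prefix[n], p - 2, p)           # the single inversion (Fermat)
    out = [0] * n
    for i in range(n - 1, -1, -1):
        out[i] = inv * prefix[i] % p         # 1/x_i = (x_0...x_{i-1}) / (x_0...x_i)
        inv = inv * xs[i] % p                # peel x_i off the running inverse
    return out

p = 10007                                    # small prime for illustration
invs = batch_inverse(list(range(1, 100)), p)
```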

Arithmetic in the finite field $\mathbb{F}_q$ is performed using 377-bit operations in Montgomery representation, with a Montgomery parameter of $R = 2^{384}$. Field multiplication is implemented using three 384-bit integer multiplications based on the Karatsuba algorithm.

Additionally, since $R \ge 4q$, the system can safely operate on inputs $a, b \in [0, 2q)$ for multiplication. This characteristic allows certain modular reductions to be skipped when implementing the Curve Adder module:

When $a, b \in [0, q)$, the sum $a + b \in [0, 2q)$ → modular reduction is unnecessary

To handle subtraction safely, compute $q + a - b \in [0, 2q)$ → modular reduction is also unnecessary
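These bounds can be checked with a toy Montgomery multiplier. The sketch below uses a 61-bit prime and $R = 2^{64}$ in place of the 377-bit modulus and $R = 2^{384}$; since $R \ge 4Q$, inputs in $[0, 2Q)$ are accepted and outputs stay in $[0, 2Q)$ with no final reduction.

```python
Q = (1 << 61) - 1              # toy prime standing in for the 377-bit modulus
R_BITS = 64                    # R = 2^64 here plays the role of R = 2^384
R = 1 << R_BITS
MASK = R - 1
QINV = (-pow(Q, -1, R)) % R    # -Q^{-1} mod R (valid since Q is odd)

def mont_mul(a, b):
    """Montgomery product a * b * R^{-1} mod Q. Because R >= 4Q, inputs
    may be unreduced in [0, 2Q) and the result stays in [0, 2Q)."""
    t = a * b                            # < 4Q^2 <= R*Q
    m = (t & MASK) * QINV & MASK         # m = -t * Q^{-1} mod R
    return (t + m * Q) >> R_BITS         # exact division by R; result < 2Q

def lazy_add(a, b):                      # a, b in [0, Q)
    return a + b                         # in [0, 2Q): no reduction needed

def lazy_sub(a, b):                      # a, b in [0, Q)
    return Q + a - b                     # in (0, 2Q): no reduction needed
```

Feeding the unreduced outputs of `lazy_add` and `lazy_sub` straight into `mont_mul` is exactly the reduction-skipping pattern described above.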

$$a \times (R^2 \bmod q) \times (R^{-1} \bmod q)$$

Instead of using the standard double-and-add approach for multiplying by constants, CycloneMSM employs NAF (Non-Adjacent Form). This reduces the density of nonzero digits in the constant from $\frac{1}{2}$ to approximately $\frac{1}{3}$, resulting in several advantages:

Specifically, the adder tree depth is reduced from $\log_2 W(b)$ to $\log_3 W(b)$, where $W(b)$ denotes the Hamming weight of the constant $b$.
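A minimal NAF recoding looks like the following; the sample constant is arbitrary.

```python
def naf(n):
    """Non-adjacent form: signed digits in {-1, 0, 1} with no two
    adjacent nonzeros, least-significant digit first."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)      # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d               # n - d is divisible by 4, so next digit is 0
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

b = 0b1101110111011101          # arbitrary constant with many 1-bits
digits = naf(b)
```

Runs of 1-bits collapse into a leading 1 and a trailing -1, so the nonzero-digit count, and hence the adder-tree size, drops versus plain binary.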

$\mathsf{MixedAdd}$ outputs are produced after 96 cycles (= 384 ns)

$\mathsf{mul}$ outputs are produced after 39 cycles (= 156 ns)

In Twisted Edwards curves, $\mathsf{MixedAdd}$ uses the following two inputs:

$P_1 = (X_1 : Y_1 : Z_1 : T_1)$ (extended projective)

$P_2 = (x_2, y_2, u_2 = kx_2y_2)$ (extended affine)

This operation involves 7 field multiplications, and it is optimized using formulas from [HWCD08, Sec. 4.2]. Due to its pipelined nature, the adder can accept new points every cycle.

A general $\mathsf{Add}$ operation requires 9 multiplications. CycloneMSM implements this as a 2-cycle pipelined process. The processing flow is as follows:

Inputs: $Z_1$, $Z_2$ → passed to the multiplier

Inputs: $T_2$, curve constant $k$ → passed to the multiplier

$$\mathsf{MixedAdd}\left(P_1 = (X_1 : Y_1 : r_0 : T_1),\ P_2 = (X_2, Y_2, r_1)\right)$$

where $r_0 = Z_1 \cdot Z_2$ and $r_1 = k \cdot T_2$

The user provides $N$ points in Weierstrass affine coordinates

Typically affine → extended affine: $(x, y) \rightarrow (x, y, u = kxy)$

The host splits each scalar into 16-bit windows ($c = 16$) using PREPROCESS()

A total of $W = 16$ windows are created
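The host-side scalar split can be sketched as follows. This uses plain unsigned windows for simplicity and omits the signed-bucket recoding.

```python
import random

def preprocess(scalar, c=16, num_windows=16):
    """Split a scalar into num_windows c-bit reduced scalars,
    least-significant window first (sketch of the host-side step)."""
    mask = (1 << c) - 1
    return [(scalar >> (c * j)) & mask for j in range(num_windows)]

s = random.Random(0).getrandbits(253)    # BLS12-377 scalars fit in 253 bits
windows = preprocess(s)
```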

The FPGA reduces the $2^{c-1}$ buckets in a sequential manner using the following formula:

$$R = B_K + (B_K + B_{K-1}) + \dots + (B_1 + \dots + B_K)$$

This operation is performed using a 2-cycle $\mathsf{Add}$ pipeline implemented on the FPGA
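The running-sum identity behind this formula is easy to verify; integers stand in for curve points in this sketch.

```python
def aggregate(buckets):
    """Compute R = sum(k * B_k) for B_1..B_K using only additions:
    R = B_K + (B_K + B_{K-1}) + ... + (B_1 + ... + B_K)."""
    running, result = 0, 0
    for b in reversed(buckets):    # k runs from K down to 1
        running += b               # running = B_K + ... + B_k
        result += running          # B_k ends up added k times in total
    return result

buckets = [3, 1, 4, 1, 5, 9, 2, 6]     # B_1 .. B_8
```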

A 96-stage pipeline running at 250 MHz accepts new inputs every 4 ns, enabling the core operation of MSM, $\mathsf{MixedAdd}$, to be processed in every clock cycle.

Written by ryan Kim and BATZORIG ZORIGOO from A41

|  | $t$ | $t+1$ | $t+2$ | $\dots$ | $t+T-1$ | $t+T$ |
| --- | --- | --- | --- | --- | --- | --- |
| $B_{k^{(t)}}$ | $P^{(t)}$ |  |  |  |  | $P^{(t+T)}$ |
| $B_{k^{(t+1)}}$ |  | $P^{(t+1)}$ |  |  |  |  |
| $B_{k^{(t+2)}}$ |  |  | $P^{(t+2)}$ |  |  |  |
| $\vdots$ |  |  |  | $\ddots$ |  |  |
| $B_{k^{(t+T-1)}}$ |  |  |  |  | $P^{(t+T-1)}$ |  |
The host converts these points to the Twisted Edwards coordinate system
https://www.youtube.com/watch?v=aZ3CzKZBK38
https://eprint.iacr.org/2022/1396.pdf
They use $S_k$ for the buckets instead of $B_k$.