cuZK

Presentation: https://youtu.be/D8b4yi_URJE


Introduction

Modern GPUs are equipped with thousands of cores, offering immense parallel computing power. This naturally raises the question: can we fully exploit this parallelism to accelerate cryptographic protocols like zkSNARKs? In particular, Multi-Scalar Multiplication (MSM)—a core bottleneck in zkSNARK provers—appears to be a promising target for GPU acceleration.

However, existing MSM algorithms are often optimized for CPU execution and assume uniformly distributed scalars, which leads to load imbalance and inefficient GPU utilization in real-world scenarios where scalar distributions are highly non-uniform or clustered.

cuZK addresses this gap. It is a high-performance computation framework specifically designed for GPU-based zkSNARK systems, with the goal of maximizing the throughput of MSM.

Background

Sparse Matrix Storage Format

While there are other formats such as COO and CSC, this section introduces only the ELL and CSR formats, as they are used in the paper.

ELL (ELLPACK)

ELL stores a fixed number of non-zero elements per row. Its highly regular structure makes it well-suited for GPU computation.

Structure

ELL consists of two 2D arrays and one 1D array:

| Array | Description |
| --- | --- |
| values | Non-zero values in each row; padded with zeros if needed |
| col_indices | Column indices of the values; dummy indices used for padding |
| row_length | Actual number of non-zero elements in each row |

→ All rows are padded to have the same number of elements.

Example

Given matrix:

ELL representation:

Pros

  • Regular memory layout → GPU-optimized (predictable access patterns)

Cons

  • If the number of non-zero elements varies across rows, excessive padding leads to wasted memory

  • Not suitable for dynamic or highly irregular matrices due to padding overhead
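To make the layout concrete, here is a small Python sketch (our own illustration, not code from the paper) that builds the three ELL arrays from a dense matrix, using `-1` as the dummy column index for padding:

```python
def to_ell(dense):
    """Convert a dense matrix (list of lists) to ELL arrays.

    Every row is padded with zero values and -1 column indices so that
    all rows share the same width (the maximum number of non-zeros).
    """
    nnz_per_row = [sum(1 for x in row if x != 0) for row in dense]
    width = max(nnz_per_row)
    values, col_indices = [], []
    for row in dense:
        vals = [x for x in row if x != 0]
        cols = [j for j, x in enumerate(row) if x != 0]
        vals += [0] * (width - len(vals))    # pad values with zeros
        cols += [-1] * (width - len(cols))   # pad columns with dummy index -1
        values.append(vals)
        col_indices.append(cols)
    return values, col_indices, nnz_per_row

M = [[10, 0, 20, 0],
     [ 0, 0, 30, 0],
     [ 0, 0,  0, 40]]
print(to_ell(M))
# ([[10, 20], [30, 0], [40, 0]], [[0, 2], [2, -1], [3, -1]], [2, 1, 1])
```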

CSR (Compressed Sparse Row)

CSR is a widely used format that compresses sparse matrices row-wise. It is memory-efficient and versatile for computation.

Structure

CSR consists of three 1D arrays:

| Array | Description |
| --- | --- |
| values | Non-zero values of the matrix |
| col_indices | Column indices for each value |
| row_ptr | Index in values where each row starts |

Example

Given matrix:

CSR representation:


Pros

  • Memory-efficient representation

  • Optimized for row-wise operations, such as matrix-vector multiplication

Cons

  • Varying row lengths may cause imbalanced memory access patterns
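A matching Python sketch for CSR, again purely illustrative:

```python
def to_csr(dense):
    """Convert a dense matrix (list of lists) to CSR arrays."""
    values, col_indices, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_indices.append(j)
        row_ptr.append(len(values))  # start index of the next row
    return values, col_indices, row_ptr

M = [[10, 0, 20, 0],
     [ 0, 0, 30, 0],
     [ 0, 0,  0, 40]]
print(to_csr(M))
# ([10, 20, 30, 40], [0, 2, 2, 3], [0, 2, 3, 4])
```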

Protocol Explanation

Notations

Naive Approaches

Pippenger's algorithm was originally designed for large-scale MSM, and several parallelization strategies can be explored to adapt it to GPU environments. Below are three naive partitioning approaches, along with a complexity analysis for each.

📌 Note: The computational complexity here slightly differs from what is presented in the paper.

Approach 1: Window-Level Parallelism

Core Idea

  • The full MSM is restructured as:

Advantages

  • Preserves the structure of Pippenger’s algorithm

Disadvantages

Computational Complexity (per thread)

Approach 2: Thread-Level MSM Partitioning

Core Idea

  • The full MSM is restructured as:

Advantages

  • Scales easily with the number of threads → good GPU resource utilization

Disadvantages

  • Each thread handles fewer data points, which weakens the bucket accumulation benefit of Pippenger

Computational Complexity (per thread)

```
Level 0: P1   P2   P3   P4   P5   P6   P7   P8
           \  /     \  /     \  /     \  /
Level 1:    A1       A2       A3       A4
                \     /         \     /
Level 2:         B1               B2
                      \         /
Level 3:               Final Result
```
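The diagram above is the usual pairwise (tree) reduction used for the final summation across threads: $t$ partial results are combined in $\lceil \log_2 t \rceil$ levels, which is where the $\log t \times \mathsf{PointAdd}$ term comes from. A toy Python version (integers standing in for per-thread partial MSM results; illustration only):

```python
def tree_reduce(parts):
    """Pairwise (tree) reduction: ceil(log2(t)) levels of parallel additions."""
    level = list(parts)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])  # one PointAdd per pair
        if len(level) % 2 == 1:                  # odd element is carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

parts = [1, 2, 3, 4, 5, 6, 7, 8]
assert tree_reduce(parts) == sum(parts)
```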

Approach 3: Hybrid Parallelism (Approach 1 + 2)

Core Idea

  • The full MSM is restructured as:

where:

Advantages

  • Combines Pippenger’s structure with GPU-level parallelism

  • Enables use of all available GPU threads → achieves high parallelism

Disadvantages

  • Each thread handles fewer data points, which weakens the bucket accumulation benefit of Pippenger (though less severely than in Approach 2)

Computational Complexity (per thread)

Overview

cuZK designs a pipeline that fully leverages the bucket structure of the Pippenger algorithm to perform high-throughput large-scale MSM computation on GPUs. The overall workflow is outlined below.

  • Base points with zero scalars are omitted.

ELL format:

Optimizations

  • Points are stored in global memory via indices or pointers, so all threads can access them.

Step 2: Convert the Sparse Matrix from ELL to CSR

Why convert?

  1. ELL format introduces zero padding, making it memory-inefficient.

  2. CSR is better suited for matrix-vector multiplication and has well-established algorithms for efficient computation.

CSR format:

Step 3: Transpose the CSR Matrix

Implementation:

Performed using a RadixSort-based scheme:

  1. Extract Row Indices:

  • Use CSR row_ptr to compute row positions

  • Store in RowIndex

| Index | Value | ColIndex | RowIndex |
| --- | --- | --- | --- |
| 0 | $P_1$ | 4 | 0 |
| 1 | $P_6$ | 3 | 1 |
| 2 | $P_3$ | 7 | 2 |
| 3 | $P_7$ | 4 | 2 |
| 4 | $P_s$ | 3 | 3 |

  2. Triplet Formation & Sorting:

  • Create triplets: <ColIndex, Data, RowIndex>

  • Sort using ColIndex as key

  • Sorted outputs:

    • Data → becomes transposed matrix Data

    • RowIndex → becomes transposed ColIndex

| Index | Before Sort | After Sort |
| --- | --- | --- |
| 0 | $(4, P_1, 0)$ | $(3, P_6, 1)$ |
| 1 | $(3, P_6, 1)$ | $(3, P_s, 3)$ |
| 2 | $(7, P_3, 2)$ | $(4, P_1, 0)$ |
| 3 | $(4, P_7, 2)$ | $(4, P_7, 2)$ |
| 4 | $(3, P_s, 3)$ | $(7, P_3, 2)$ |

  3. Generate Transposed row_ptr:

| ColIndex | Count |
| --- | --- |
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 2 |
| 4 | 2 |
| 5 | 0 |
| 6 | 0 |
| 7 | 1 |

⇒ prefix_sum([0, 0, 0, 2, 2, 0, 0, 1]) = [0, 0, 0, 0, 2, 4, 4, 4, 5]

Resulting Transposed CSR format:

The matrix is transposed to group points by bucket index:

Step 4: SPMV (Sparse Matrix-Vector Multiplication)

Perform multiplication:

Optimizations

To handle thread load imbalance (due to varying row lengths), cuZK introduces CSR-Balanced:

| Step | Description |
| --- | --- |
| ① Sort rows | By row length (ascending) |
| ② Group rows | Rows are grouped so that each group has a similar workload, ensuring balanced warp-level execution |
| ③ Warp scheduling | One warp handles one group (ensures uniform workload) |
| ④ Warp allocation | Warp count is proportional to the number of non-zeros in each group |

✅ Ensures balanced thread workloads → maximizes GPU efficiency

For scalar SPMV, the sorting overhead can outweigh the benefits; in our case, however, it is negligible compared to the cost of EC point operations, so CSR-Balanced can be safely adopted.
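A rough Python sketch of the CSR-Balanced idea (our own illustration; the greedy packing heuristic and the `target_nnz_per_group` parameter are assumptions, not cuZK's exact CUDA logic):

```python
def csr_balanced_groups(row_ptr, target_nnz_per_group):
    """Sort rows by length, then pack them into groups with ~equal total nnz."""
    row_lens = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    order = sorted(range(len(row_lens)), key=lambda r: row_lens[r])  # ascending
    groups, current, current_nnz = [], [], 0
    for r in order:
        current.append(r)
        current_nnz += row_lens[r]
        if current_nnz >= target_nnz_per_group:   # close the group once it has enough work
            groups.append(current)
            current, current_nnz = [], 0
    if current:
        groups.append(current)
    return groups  # one warp is then assigned per group

# toy CSR with very uneven row lengths: rows of length 1, 1, 8, 2
print(csr_balanced_groups([0, 1, 2, 10, 12], target_nnz_per_group=4))
# [[0, 1, 3], [2]]
```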

Step 5: Reduce Bucket Points

Step 6: Reduce Windows

Computational Complexity (per thread)

MUL Optimization in the Groth Protocol

While many parallel SPMV (Sparse Matrix-Vector Product) techniques exist, no single scheme performs optimally across all matrix types.

| Scheme | Description | Limitation |
| --- | --- | --- |
| CSR-Scalar [Gar08] | Each thread handles one row | Severe load imbalance when row lengths vary significantly |
| CSR-Vector [BG09] | A row is divided across multiple threads | Only effective in specific scenarios |
| CSR-Balanced | Balances load based on row distribution | Requires row sorting, which increases complexity |

Instead of using a fixed static scheme, cuZK dynamically selects the most suitable SPMV strategy based on the statistical characteristics of the R1CS matrix:

| R1CS Matrix Profile | Selected SPMV Scheme |
| --- | --- |
| Low variance and low mean row length | CSR-Scalar |
| Low variance but high mean row length | CSR-Vector |
| High variance in row length | CSR-Balanced |

This adaptive strategy allows cuZK to avoid performance degradation by choosing the optimal parallel execution path for the structure of each matrix. While this adaptive strategy incurs overhead—due to sorting and computing statistical properties like mean and variance—this is not an issue because the R1CS matrix is fixed per circuit and can be reused. As a result, these computations can be performed offline, introducing no runtime overhead during actual proof generation.
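The selection rule can be summarized in a few lines of Python; note that the thresholds below are arbitrary placeholders, since the paper does not prescribe concrete cutoff values:

```python
from statistics import mean, pvariance

def choose_spmv_scheme(row_lengths, mean_threshold=8.0, var_threshold=16.0):
    """Pick an SPMV strategy from row-length statistics (illustrative thresholds)."""
    mu, var = mean(row_lengths), pvariance(row_lengths)
    if var >= var_threshold:
        return "CSR-Balanced"                     # high variance in row length
    return "CSR-Vector" if mu >= mean_threshold else "CSR-Scalar"

print(choose_spmv_scheme([2, 3, 2, 3]))        # CSR-Scalar
print(choose_spmv_scheme([30, 32, 31, 29]))    # CSR-Vector
print(choose_spmv_scheme([1, 1, 50, 2, 40]))   # CSR-Balanced
```

Since the R1CS matrix is fixed per circuit, these statistics are computed once offline, matching the observation above that the selection adds no runtime overhead.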

Full zkSNARK Parallelization on GPU

Executing All Operations on the GPU

Even lightweight operations (e.g., variable initialization) are executed using small GPU kernels, ensuring that even single-thread tasks run on the GPU. This design minimizes data movement between CPU and GPU.

Transferring Only Essential Data to the GPU

Only the three essential storage modules required for zkSNARK execution are transferred to the GPU:

| Module | Description |
| --- | --- |
| R1CS instance | Three CSR matrices generated from the circuit |
| Function inputs | Input vector including intermediate computation results |
| Prover key | Large structure composed of elliptic curve (EC) point vectors |

Among these, the R1CS instance and function inputs are required for the initial MUL operation, and must be transferred before computation begins. The prover key, being significantly larger, is transferred in parallel with computation using an overlapping approach.

Overlapping Data Transfer with Multi-Streaming

By leveraging the GPU’s multi-streaming capability, data transfers and computations are overlapped, thereby removing potential bottlenecks. This method eliminates almost all latency caused by prover key transfers, and once each MSM computation step is complete, unused EC point memory is immediately freed, saving GPU memory.

Conclusion

Table 2 presents various baseline MSM implementations used for performance comparisons. Each baseline supports different elliptic curves, and cuZK was benchmarked against each using the curves supported by the respective implementation.

Table 4 summarizes multi-GPU MSM performance:

| Size | 2 × V100 / 1 × V100 | 4 × V100 / 1 × V100 | 8 × V100 / 1 × V100 |
| --- | --- | --- | --- |
| $2^{20}$ | 1.68 | 3.49 | 6.6 |
| $2^{22}$ | 1.78 | 3.13 | 5.4 |
| $2^{24}$ | 2.1 | 4.3 | 7.14 |
| $2^{26}$ | 1.92 | 3.77 | 6.55 |

⚠️ However, performance does not scale perfectly linearly as the number of GPUs increases.

This is due to imbalanced subtask partitioning.


  • With 2 GPUs → assigned 6 and 7 windows respectively → total of 7 block times

  • With 4 GPUs → assigned 3, 3, 3, and 4 windows → only 4 block times

  • Thus, if the number of subtasks does not divide evenly among GPUs, it leads to load imbalance, which degrades scalability.


cuZK was also compared against two recent high-performance GPU MSM implementations: Yrrid and MatterLabs.

While both deliver high performance via precomputation, they have limitations:

  • Require large storage space

  • Support only random scalars (not optimized for clustered scalars)

In Groth16 proof generation benchmarks:

  • cuZK achieved up to 2.94× speedup over Bellperson using 1 V100

  • And up to 4.86× speedup using 4 V100s

Its core contributions are as follows:

  1. A novel sparse matrix-based parallel MSM algorithm

    • By leveraging the sparse structure of the matrix, cuZK efficiently restructures the MSM computation and mitigates data bottlenecks on GPUs.

  2. Parallelization of MUL (Matrix-Vector Multiplication)

    • Multiplication operations are parallelized to maximize utilization of GPU cores.

  3. Optimization of CPU-GPU data transfer

    • Redundant data transfers are eliminated, and

    • Data transfers are overlapped with device-side computation to minimize latency and avoid performance degradation.

References

  • cuZK paper: https://eprint.iacr.org/2022/1321

$$\begin{bmatrix} 10 & 0 & 20 & 0 \\ 0 & 0 & 30 & 0 \\ 0 & 0 & 0 & 40 \end{bmatrix}$$

values = $\begin{bmatrix} 10 & 20 \\ 30 & 0 \\ 40 & 0 \end{bmatrix}$

col_indices = $\begin{bmatrix} 0 & 2 \\ 2 & -1 \\ 3 & -1 \end{bmatrix}$

row_length = $(2, 1, 1)$

$$\begin{bmatrix} 10 & 0 & 20 & 0 \\ 0 & 0 & 30 & 0 \\ 0 & 0 & 0 & 40 \end{bmatrix}$$

values = $(10, 20, 30, 40)$

col_indices = $(0, 2, 2, 3)$

row_ptr = $(0, 2, 3, 4)$

$n$: The number of points

$\lambda$: The bit-width of the scalar field

$s$: The bit-width of each window

$t$: The number of threads

Decompose Pippenger into $\left\lceil \frac{\lambda}{s} \right\rceil$ parts.

Each thread computes one window $W_j$.

$$Q = \sum_{i=1}^{n} k_i P_i = \sum_{j=1}^{\left\lceil\frac{\lambda}{s}\right\rceil} 2^{(j-1)s} W_j$$

The number of windows $\left\lceil \frac{\lambda}{s} \right\rceil$ is relatively small (e.g., $\lambda = 256$, $s = 16$ → only 16 windows); however, modern GPUs have thousands of threads → low parallelism.

$$\lambda \times \mathsf{PointDouble} + \left(n + 2^{s+1} + \left\lceil \frac{\lambda}{s}\right\rceil\right)\times \mathsf{PointAdd}$$

$n \times \mathsf{PointAdd}$: Bucket accumulation

$2^{s+1} \times \mathsf{PointAdd}$: Bucket reduction

$\lambda \times \mathsf{PointDouble} + \left\lceil \frac{\lambda}{s}\right\rceil \times \mathsf{PointAdd}$: Window reduction
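For intuition, here is a toy serial Python version of this window decomposition, with integers standing in for EC points (so $\mathsf{PointAdd}$/$\mathsf{PointDouble}$ become ordinary integer operations); it sketches the structure only and is not a GPU implementation:

```python
import random

def msm_pippenger(scalars, points, s, lam):
    """Windowed (Pippenger-style) MSM over toy integer 'points'."""
    num_windows = (lam + s - 1) // s
    total = 0
    for j in range(num_windows):                  # Approach 1: one thread per window
        buckets = [0] * (1 << s)                  # bucket 0 stays unused
        for k, P in zip(scalars, points):
            m = (k >> (j * s)) & ((1 << s) - 1)   # j-th s-bit chunk of the scalar k
            if m:
                buckets[m] += P                   # bucket accumulation
        window, running = 0, 0
        for b in range(len(buckets) - 1, 0, -1):  # bucket reduction via running sums
            running += buckets[b]
            window += running                     # yields sum_b b * buckets[b]
        total += window << (j * s)                # window reduction: 2^{(j-1)s} * W_j
    return total

scalars = [random.randrange(1 << 16) for _ in range(8)]
points  = [random.randrange(1, 1000) for _ in range(8)]
assert msm_pippenger(scalars, points, s=4, lam=16) == sum(k * P for k, P in zip(scalars, points))
```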

Decompose the MSM computation into $t$ parts.

Each thread performs serial Pippenger on $\sum_{\ell = (j-1)\frac{n}{t} +1}^{j \frac{n}{t}} k_{\ell} P_{\ell}$.

$$Q = \sum_{i=1}^{n} k_i P_i = \sum_{j=1}^{t} \left( \sum_{\ell = (j-1)\frac{n}{t} +1}^{j \frac{n}{t}} k_{\ell} P_{\ell} \right)$$
$$\lambda \times \mathsf{PointDouble} + \left(\left\lceil\frac{\lambda}{s}\right\rceil\left(\frac{n}{t} + 2^{s+1}\right) + \log t\right)\times \mathsf{PointAdd}$$

$\left\lceil \frac{\lambda}{s}\right\rceil \frac{n}{t} \times \mathsf{PointAdd}$: Bucket accumulation

$\left\lceil \frac{\lambda}{s}\right\rceil 2^{s+1} \times \mathsf{PointAdd}$: Bucket reduction

$\lambda \times \mathsf{PointDouble}$: Window reduction (Shouldn't it be $\lambda \times \mathsf{PointDouble} + \left\lceil \frac{\lambda}{s}\right\rceil \times \mathsf{PointAdd}$?)

$\log t \times \mathsf{PointAdd}$: Final window reduction

Let $T(t)$ be the time complexity for summing $t$ values in parallel:

$$T(t) = \log_2 t$$

Decompose the MSM computation into $\frac{t}{\left\lceil \frac{\lambda}{s}\right\rceil}$ parts.

Each thread computes one window $W_{\ell, j}$.

$$Q = \sum_{i=1}^{n} k_i P_i = \sum_{\ell=1}^{\frac{t}{\left\lceil \frac{\lambda}{s}\right\rceil}} \left( \sum_{j = 1}^{\left\lceil \frac{\lambda}{s}\right\rceil} 2^{(j-1)s} W_{\ell, j} \right)$$

Let $W_{\ell, j}$ denote the partial sum of the $j$-th window computed from the $\ell$-th MSM partition. Specifically,

$$W_{\ell, j} = \sum_{i \in \mathcal{I}_\ell} m_{ij} P_i$$

$\mathcal{I}_\ell$ is the index set of base points and scalars assigned to the $\ell$-th thread partition.

$m_{ij}$ is the $j$-th $s$-bit chunk of scalar $k_i$.

Cannot achieve perfect linear speedup (i.e., speedup proportional to the number of threads): in particular, the number of buckets $2^{s+1}$ is generally not divisible by the number of threads.

$$\lambda \times \mathsf{PointDouble} + \left(\left\lceil\frac{\lambda}{s}\right\rceil\left(\frac{n}{t} + 1 \right) + 2^{s+1} + \log \left(t / \left\lceil \lambda /s \right\rceil\right)\right)\times \mathsf{PointAdd}$$

$\left\lceil \frac{\lambda}{s}\right\rceil \frac{n}{t} \times \mathsf{PointAdd}$: Bucket accumulation

$2^{s+1} \times \mathsf{PointAdd}$: Bucket reduction

$\lambda \times \mathsf{PointDouble} + \left\lceil \frac{\lambda}{s}\right\rceil \times \mathsf{PointAdd}$: Window reduction

$\log \left(t / \left\lceil \lambda /s \right\rceil\right) \times \mathsf{PointAdd}$: Final window reduction

Step 1: Store All Base Points $P_i$ in an ELL Sparse Matrix

An empty sparse matrix of size $t \times (2^s - 1)$ is created in ELL format, which is chosen for its suitability for parallel processing.

Example ($n = 8$, $s = 3$, $t = 4$):

$$\begin{align*} G&=4P_1 + 0P_2 + 7P_3 + 3P_4 + 0P_5 + 3P_6 + 4P_7 + 3P_8 \\ &= 4P_1 + 7P_3 + 3P_4 + 3P_6 + 4P_7 + 3P_8 \\ &=(4P_1) + (3P_6) + (7P_3 + 4P_7) + (3\underbrace{(P_4 +P_8)}_{P_s}) \end{align*}$$

Points like $P_4 + P_8$ are accumulated and stored as a bucket point ($P_s$).

$$M = \begin{bmatrix} 0 & 0 & 0 & 0 & P_1 & 0 & 0 & 0 \\ 0 & 0 & 0 & P_6 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & P_7 & 0 & 0 & P_3 \\ 0 & 0 & 0 & P_s & 0 & 0 & 0 & 0 \end{bmatrix}$$

values = $\begin{bmatrix} P_1 & 0 \\ P_6 & 0 \\ P_3 & P_7 \\ P_s & 0 \end{bmatrix}$

col_indices = $\begin{bmatrix} 4 & -1 \\ 3 & -1 \\ 7 & 4 \\ 3 & -1 \end{bmatrix}$

row_length = $(1, 1, 2, 1)$

To avoid the latency of naive memory transfers, multi-streaming is used to overlap CPU-GPU data transfer and computation.

Assuming a cost model in terms of $\mathsf{PointAdd}$ and $\mathsf{PointDouble}$, the optimal $s$ is chosen for fixed $\lambda$, $n$, and $t$. → For example, with $\lambda = 255$, $n = 2^{20}$, and $t = 5120$ (V100 GPU), $s = 16$ was found to be optimal.
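The following Python sketch (ours, with small integers standing in for $P_1, \dots, P_8$) mimics Step 1 for a single window: each "thread" scans its share of (scalar, point) pairs and writes at most one entry per distinct window digit into its ELL row, accumulating points that share a digit (as with $P_s = P_4 + P_8$ above). The strided point-to-thread assignment is chosen only so that the toy output matches the example matrix; cuZK's actual assignment may differ.

```python
def build_ell_rows(scalars, points, s, t):
    """Step-1 sketch for a single window: build a t-row ELL matrix of points."""
    n = len(scalars)
    width = n // t                              # fixed ELL row width
    values, col_indices, row_length = [], [], []
    for l in range(t):                          # one "thread" per row
        buckets = {}                            # window digit -> accumulated point
        for i in range(l, n, t):                # assumed strided partition
            m = scalars[i] & ((1 << s) - 1)     # s-bit digit of the first window
            if m:
                buckets[m] = buckets.get(m, 0) + points[i]  # e.g. P_s = P_4 + P_8
        vals, cols = list(buckets.values()), list(buckets.keys())
        row_length.append(len(vals))
        values.append(vals + [0] * (width - len(vals)))       # ELL zero padding
        col_indices.append(cols + [-1] * (width - len(cols)))  # dummy column index
    return values, col_indices, row_length

# n = 8, s = 3, t = 4, small integers standing in for P_1..P_8
scalars = [4, 0, 7, 3, 0, 3, 4, 3]
points  = [11, 13, 17, 19, 23, 29, 31, 37]
print(build_ell_rows(scalars, points, s=3, t=4))
# ([[11, 0], [29, 0], [17, 31], [56, 0]], [[4, -1], [3, -1], [7, 4], [3, -1]], [1, 1, 2, 1])
```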

values = $(P_1, P_6, P_3, P_7, P_s)$

col_indices = $(4, 3, 7, 4, 3)$

row_ptr = $(0, 1, 2, 4, 5)$

Count the occurrences of each ColIndex and compute the prefix sum.

$$M^{\mathsf{T}} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & P_6 & 0 & P_s \\ P_1 & 0 & P_7 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & P_3 & 0 \end{bmatrix}$$

values = $(P_6, P_s, P_1, P_7, P_3)$

col_indices = $(1, 3, 0, 2, 2)$

row_ptr = $(0, 0, 0, 0, 2, 4, 4, 4, 5)$
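A compact Python rendition of the transpose procedure above, where a stable sort on ColIndex stands in for the radix sort; it reproduces the transposed CSR arrays of the running example:

```python
def transpose_csr(values, col_indices, row_ptr, num_cols):
    """Transpose a CSR matrix by sorting <ColIndex, Data, RowIndex> triplets."""
    # 1. expand row_ptr into an explicit RowIndex per non-zero entry
    row_index = []
    for r in range(len(row_ptr) - 1):
        row_index += [r] * (row_ptr[r + 1] - row_ptr[r])
    # 2. sort the triplets by ColIndex (stable sort, like the radix sort in cuZK)
    order = sorted(range(len(values)), key=lambda i: col_indices[i])
    t_values = [values[i] for i in order]        # sorted Data -> transposed values
    t_cols   = [row_index[i] for i in order]     # sorted RowIndex -> transposed col_indices
    # 3. new row_ptr = prefix sum of per-column counts
    counts = [0] * num_cols
    for c in col_indices:
        counts[c] += 1
    t_row_ptr = [0]
    for c in counts:
        t_row_ptr.append(t_row_ptr[-1] + c)
    return t_values, t_cols, t_row_ptr

# running example: M stored in CSR (strings stand in for the EC points)
print(transpose_csr(["P1", "P6", "P3", "P7", "Ps"], [4, 3, 7, 4, 3], [0, 1, 2, 4, 5], 8))
# (['P6', 'Ps', 'P1', 'P7', 'P3'], [1, 3, 0, 2, 2], [0, 0, 0, 0, 2, 4, 4, 4, 5])
```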

$$M^{\mathsf{T}} \cdot \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & P_6 & 0 & P_s \\ P_1 & 0 & P_7 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & P_3 & 0 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ P_6 + P_s \\ P_1 + P_7 \\ 0 \\ 0 \\ P_3 \end{bmatrix} = \begin{bmatrix} B_0 \\ B_1 \\ B_2 \\ B_3 \\ B_4 \\ B_5 \\ B_6 \\ B_7 \end{bmatrix}$$

where each $B_\ell$ is a bucket sum.
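In CSR form, multiplying $M^{\mathsf{T}}$ by the all-ones vector is just a per-row sum of the stored points. A toy Python check, reusing the integer stand-ins for the points from the Step 1 sketch:

```python
def spmv_all_ones(values, row_ptr):
    """B = M^T * [1,...,1]^T: each bucket is the sum of one CSR row."""
    return [sum(values[row_ptr[r]:row_ptr[r + 1]]) for r in range(len(row_ptr) - 1)]

# transposed CSR from the example; integers stand in for P_6, P_s, P_1, P_7, P_3
t_values  = [29, 56, 11, 31, 17]
t_row_ptr = [0, 0, 0, 0, 2, 4, 4, 4, 5]
print(spmv_all_ones(t_values, t_row_ptr))
# [0, 0, 0, 85, 42, 0, 0, 17]  ->  [B_0, ..., B_7] with B_3 = P_6+P_s, B_4 = P_1+P_7, B_7 = P_3
```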


Decompose the bucket reduction into $t$ parts:

$$W_j = \sum_{\ell = 1}^{2^s - 1}\ell B_{\ell, j} = \sum_{i = 1}^{t} \left( \sum_{\ell = (i - 1)\frac{2^s - 1}{t} + 1}^{\min\left( i\frac{2^s - 1}{t},\ 2^s - 1\right)} \ell B_{\ell, j} \right)$$

Use $t$ threads to perform a parallel reduction of the $2^s - 1$ EC points into a single $W_j$.

For example, if $s = 3, t = 4$, this bucket reduction can be computed as follows (with $B_{8, j}$ taken to be the identity, since there are only $2^s - 1 = 7$ buckets):

$$\begin{align*} W_j &= 1B_{1,j} + 2B_{2, j} + 3B_{3, j} + 4B_{4, j} + 5B_{5, j} + 6B_{6, j} + 7B_{7, j} \\ &= 1B_{1,j} + 2B_{2, j} + 0(B_{1, j} + B_{2, j}) + (B_{3, j} + 2B_{4, j}) + 2(B_{3, j} + B_{4, j}) + (B_{5, j} + 2B_{6, j}) + 4(B_{5, j} + B_{6, j}) + B_{7, j} + 6B_{7, j} \\ &= \sum_{i=1}^4 \left( (B_{2i - 1, j} + 2B_{2i, j}) + 2(i-1)(B_{2i - 1, j} + B_{2i, j}) \right) \end{align*}$$
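The identity above is simply the bucket-reduction sum split into per-thread chunks with a running offset. A short Python check (random integers as buckets, with a hypothetical chunking of $\lceil (2^s - 1)/t \rceil$ buckets per thread) confirms it:

```python
import random

def bucket_reduce_parallel(buckets, t):
    """W = sum_{l=1}^{m} l * B_l, computed as t independent partial reductions."""
    m = len(buckets)                     # buckets[l-1] holds B_l, for l = 1..m
    chunk = (m + t - 1) // t
    total = 0
    for i in range(t):                   # each iteration = one thread's work
        lo, hi = i * chunk, min((i + 1) * chunk, m)
        local_weighted = sum((l - lo + 1) * buckets[l] for l in range(lo, hi))
        local_plain    = sum(buckets[l] for l in range(lo, hi))
        total += local_weighted + lo * local_plain   # shift by the chunk's start index
    return total

B = [random.randint(0, 100) for _ in range(7)]       # 2^3 - 1 buckets, B_1..B_7
assert bucket_reduce_parallel(B, t=4) == sum((l + 1) * b for l, b in enumerate(B))
```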
$$Q = \sum_{j = 1}^{\left\lceil \frac{\lambda}{s}\right\rceil} 2^{(j - 1)s} W_j$$
$$(\lambda + s) \times \mathsf{PointDouble} + \left(\left\lceil \frac{\lambda}{s} \right\rceil \left( \frac{n}{t} + \frac{2^{s+1}}{t} \right) + s + \log t \right) \times \mathsf{PointAdd}$$

$\left\lceil \frac{\lambda}{s}\right\rceil \frac{n}{t} \times \mathsf{PointAdd}$: Storing points into the ELL matrix

$s \times \mathsf{PointDouble} + \left(\left\lceil \frac{\lambda}{s}\right\rceil \left(\frac{2^{s+1}}{t} -1\right) + s + \log t \right) \times \mathsf{PointAdd}$: SPMV (bucket accumulation) + bucket reduction

$\lambda \times \mathsf{PointDouble} + \left\lceil \frac{\lambda}{s}\right\rceil \times \mathsf{PointAdd}$: Window reduction

As shown in the figure above, the protocol performs a Matrix-Vector Multiplication (MUL) operation based on the R1CS (Rank-1 Constraint System) matrix generated by compiling the target function (prior to the INTT step). This R1CS matrix is typically very large and sparse. For example, in Filecoin, less than 0.1% of the matrix entries are non-zero. To handle such sparsity efficiently, sparse matrix optimization techniques are essential. Accordingly, cuZK stores the R1CS matrix using the CSR (Compressed Sparse Row) format.

In addition, cuZK integrates with well-known high-speed implementations (GJCC20, KJPA20, GXW21), enabling complete GPU execution of the entire Groth16 proving pipeline.

Table 1 above shows the experimental setup. The environment uses 8 NVIDIA V100 GPUs connected via NVIDIA NVLink, which enables high-speed GPU-to-GPU communication without routing through the CPU, thereby maximizing parallel processing efficiency.

Table 3 shows MSM performance comparisons on a single GPU across different elliptic curves. cuZK outperforms the previous state-of-the-art (Bellperson) by 2.08× to 2.29×. In particular, the performance for Mina is significantly lower, since it uses a Straus-based MSM (refer to Bernstein's survey), which is less efficient for large-scale MSM.

As shown in the figure above, the NVIDIA V100 features 5120 CUDA cores, and cuZK achieves near-perfect linear speedup up to the range $2^{12} < t < 2^{13}$.

For example, if $\lambda = 255$ and $s = 20$, we have $\left\lceil \frac{255}{20} \right\rceil = 13$ windows (subtasks). (See line 17 in the image above.)

🔗 GitHub (Open Source): https://github.com/speakspeak/cuZK

Written by Ryan Kim from A41.
