cuZK
Presentation: https://youtu.be/D8b4yi_URJE
Modern GPUs are equipped with thousands of cores, offering immense parallel computing power. This naturally raises the question: can we fully exploit this parallelism to accelerate cryptographic protocols like zkSNARKs? In particular, Multi-Scalar Multiplication (MSM)—a core bottleneck in zkSNARK provers—appears to be a promising target for GPU acceleration.
However, existing MSM algorithms are often optimized for CPU execution and assume uniformly distributed scalars, which leads to load imbalance and inefficient GPU utilization in real-world scenarios where scalar distributions are highly non-uniform or clustered.
cuZK addresses this gap. It is a high-performance computation framework specifically designed for GPU-based zkSNARK systems, with the goal of maximizing the throughput of MSM.
While other sparse matrix formats exist, this section introduces only the ELL and CSR formats, as they are the ones used in the paper.
ELL stores the same number of entries in every row, padding shorter rows. This highly regular structure makes it well-suited for GPU computation.
Structure
ELL consists of two 2D arrays and one 1D array:
values: the non-zero values in each row, padded with zeros if needed
col_indices: the column indices of those values, with dummy indices used for padding
row_length: the actual number of non-zero elements in each row
→ All rows are padded to the same number of elements.
Example
A small sparse matrix and its ELL representation can be built as sketched below.
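A minimal sketch in Python, using a made-up 4×4 matrix and 0 as the dummy padding index (both are illustrative choices, not values from the paper):

```python
# Illustrative only: a small dense matrix converted to ELL form
# (values, col_indices, row_length), padding every row to the same width.

dense = [
    [0, 5, 0, 0],
    [7, 0, 8, 0],
    [0, 0, 0, 0],
    [0, 1, 2, 3],
]

# The maximum number of non-zeros in any row determines the padded width.
max_nnz = max(sum(1 for x in row if x != 0) for row in dense)

values, col_indices, row_length = [], [], []
for row in dense:
    vals = [x for x in row if x != 0]
    cols = [j for j, x in enumerate(row) if x != 0]
    row_length.append(len(vals))
    # Pad with zero values and a dummy column index (here: 0).
    vals += [0] * (max_nnz - len(vals))
    cols += [0] * (max_nnz - len(cols))
    values.append(vals)
    col_indices.append(cols)

print(values)       # [[5, 0, 0], [7, 8, 0], [0, 0, 0], [1, 2, 3]]
print(col_indices)  # [[1, 0, 0], [0, 2, 0], [0, 0, 0], [1, 2, 3]]
print(row_length)   # [1, 2, 0, 3]
```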
Pros
Regular memory layout → GPU-optimized (predictable access patterns)
Cons
If the number of non-zero elements varies across rows, excessive padding leads to wasted memory
Not suitable for dynamic or highly irregular matrices due to padding overhead
CSR is a widely used format that compresses sparse matrices row-wise. It is memory-efficient and versatile for computation.
Structure
CSR consists of three 1D arrays:
values
Non-zero values of the matrix
col_indices
Column indices for each value
row_ptr
Index in values
where each row starts
Example
The same example matrix can be stored in CSR form as sketched below.
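A matching sketch for CSR, using the same made-up matrix as above:

```python
# Illustrative only: the same made-up matrix, stored in CSR form.
dense = [
    [0, 5, 0, 0],
    [7, 0, 8, 0],
    [0, 0, 0, 0],
    [0, 1, 2, 3],
]

values, col_indices, row_ptr = [], [], [0]
for row in dense:
    for j, x in enumerate(row):
        if x != 0:
            values.append(x)
            col_indices.append(j)
    row_ptr.append(len(values))  # each row starts where the previous one ended

print(values)       # [5, 7, 8, 1, 2, 3]
print(col_indices)  # [1, 0, 2, 1, 2, 3]
print(row_ptr)      # [0, 1, 3, 3, 6]

# Row i's non-zeros live in values[row_ptr[i]:row_ptr[i+1]].
```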
Pros
Memory-efficient representation
Optimized for row-wise operations, such as matrix-vector multiplication
Cons
Varying row lengths may cause imbalanced memory access patterns
Pippenger’s algorithm is well suited to large-scale MSM but was designed as a serial algorithm, so several parallelization strategies have been explored to adapt it to GPU environments. Below are three naive partitioning approaches along with a complexity analysis for each.
📌 Note: The computational complexity shown here differs slightly from what is presented in the paper.
Notation: $n$ is the number of points, $\lambda$ is the bit-width of the scalar field, $s$ is the bit-width of each window, and $t$ is the number of threads.
Approach 1
Core Idea
The full MSM is restructured as:
$Q = \sum_{i=1}^{n} k_i P_i = \sum_{j=0}^{\lceil \lambda/s \rceil - 1} 2^{js} W_j$, where $W_j = \sum_{i=1}^{n} k_{i,j} P_i$ and $k_{i,j}$ is the $j$-th $s$-bit chunk of the scalar $k_i$.
Pippenger is decomposed into $\lceil \lambda/s \rceil$ parts, and each thread computes one window $W_j$.
Advantages
Preserves the structure of Pippenger’s algorithm
Disadvantages
The number of windows is relatively small (e.g., $\lambda = 256$ and $s = 16$ give only 16 windows), whereas modern GPUs have thousands of threads → low parallelism
Computational Complexity (per thread)
$O(n)$: Bucket accumulation
$O(2^s)$: Bucket reduction
$O(\lambda)$: Window reduction (final combination of the $\lceil \lambda/s \rceil$ windows)
Approach 2
Core Idea
The full MSM is restructured as:
$Q = \sum_{m=1}^{t} Q_m$, where $Q_m = \sum_{i \in I_m} k_i P_i$ and $I_m$ is the set of roughly $n/t$ indices assigned to thread $m$.
The MSM computation is decomposed into $t$ parts, and each thread runs serial Pippenger on its part $Q_m$.
Advantages
Scales easily with the number of threads → good GPU resource utilization
Disadvantages
Each thread handles fewer data points, which weakens the bucket accumulation benefit of Pippenger
Computational Complexity (per thread)
$O(\lceil \lambda/s \rceil \cdot n/t)$: Bucket accumulation
$O(\lceil \lambda/s \rceil \cdot 2^s)$: Bucket reduction
$O(\lambda)$: Window reduction
$O(\log t)$: Final reduction, i.e., summing the $t$ partial results $Q_m$ in parallel
Approach 3
Core Idea
The full MSM is restructured as:
$Q = \sum_{j=0}^{\lceil \lambda/s \rceil - 1} 2^{js} \sum_{m} Q_{j,m}$
where:
$Q_{j,m}$ denotes the partial sum of the $j$-th window computed from the $m$-th MSM partition; specifically, $Q_{j,m} = \sum_{i \in I_m} k_{i,j} P_i$
$I_m$ is the index set of base points and scalars assigned to the $m$-th thread partition
$k_{i,j}$ is the $j$-th $s$-bit chunk of scalar $k_i$
The MSM computation is decomposed into $t$ parts, and each thread computes one window sum $Q_{j,m}$.
Advantages
Combines Pippenger’s structure with GPU-level parallelism
Enables use of all available GPU threads → achieves high parallelism
Disadvantages
Each thread handles fewer data points, which weakens the bucket accumulation benefit of Pippenger (though less than in Approach 2)
Cannot achieve perfect linear speedup (i.e., speed improving proportionally with the number of threads); in particular, the number of buckets is not always divisible by the number of threads
Computational Complexity (per thread)
$O(n \lceil \lambda/s \rceil / t)$: Bucket accumulation
$O(2^s)$: Bucket reduction
$O(\log(ts/\lambda))$: Window reduction (parallel summation of the per-window partial sums $Q_{j,m}$)
$O(\lambda)$: Final window reduction
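For reference, the serial window/bucket computation that all three approaches partition can be sketched as follows; plain integers stand in for EC points so the arithmetic is easy to verify, and the parameters follow the notation above:

```python
# Reference (serial) Pippenger-style MSM, with plain integers standing in for
# EC points so the window/bucket decomposition is easy to check.

def msm_pippenger(scalars, points, lam=16, s=4):
    num_windows = -(-lam // s)          # ceil(lam / s)
    result = 0
    for j in reversed(range(num_windows)):
        # Bucket accumulation: group points by the j-th s-bit chunk of each scalar.
        buckets = [0] * (1 << s)
        for k, P in zip(scalars, points):
            chunk = (k >> (j * s)) & ((1 << s) - 1)
            if chunk:
                buckets[chunk] += P
        # Bucket reduction: W_j = sum(l * buckets[l]) via a running suffix sum.
        running, window_sum = 0, 0
        for l in range(len(buckets) - 1, 0, -1):
            running += buckets[l]
            window_sum += running
        # Window reduction (Horner form): result = result * 2^s + W_j.
        result = (result << s) + window_sum
    return result

scalars = [3, 10, 25, 1000]
points = [7, 11, 13, 17]
assert msm_pippenger(scalars, points) == sum(k * p for k, p in zip(scalars, points))
```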
cuZK designs a pipeline that fully leverages the bucket structure of the Pippenger algorithm to perform high-throughput large-scale MSM computation on GPUs. The overall workflow is outlined below.
An empty sparse matrix of size $t \times 2^s$ is created in ELL format, which is chosen for its suitability for parallel processing. Each thread fills one row: every base point is written into the column given by the bucket index of its current scalar chunk. Base points with zero scalars are omitted, and points that fall into the same bucket of a row are accumulated and stored as a single bucket point.
Optimizations
To avoid the latency of naive memory transfers, points are stored in global memory via indices or pointers, so all threads can access them.
Based on the overall cost model, the window size $s$ is chosen to be optimal for fixed $n$, $\lambda$, and $t$; the paper derives a concrete optimum for the V100 GPU used in its experiments.
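To make the point-storing step concrete, here is a simplified model of filling the ELL rows for a single window; the “threads” are simulated with a loop and integers stand in for EC points:

```python
# Simplified model: each "thread" m owns a slice of the (scalar, point) pairs
# and fills one ELL row, one column per bucket. Zero chunks are skipped and
# same-bucket points are accumulated, as described above.

def build_ell_for_window(scalars, points, j, s, t):
    n = len(scalars)
    rows = [dict() for _ in range(t)]           # bucket index -> accumulated point
    chunk_of = lambda k: (k >> (j * s)) & ((1 << s) - 1)
    for m in range(t):                          # one loop iteration per "thread"
        for i in range(m * n // t, (m + 1) * n // t):
            b = chunk_of(scalars[i])
            if b == 0:                          # zero scalars are omitted
                continue
            rows[m][b] = rows[m].get(b, 0) + points[i]   # same bucket -> accumulate
    # Convert to ELL arrays padded to the longest row.
    width = max((len(r) for r in rows), default=0)
    values = [list(r.values()) + [0] * (width - len(r)) for r in rows]
    col_indices = [list(r.keys()) + [0] * (width - len(r)) for r in rows]
    row_length = [len(r) for r in rows]
    return values, col_indices, row_length

vals, cols, lens = build_ell_for_window([3, 10, 25, 9, 6], [7, 11, 13, 17, 19], j=0, s=3, t=2)
```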
Why convert?
ELL format introduces zero padding, making it memory-inefficient.
CSR is better suited for matrix-vector multiplication and has well-established algorithms for efficient computation.
Implementation:
The ELL matrix is first converted to CSR, and the CSR matrix is then transposed so that each row corresponds to one bucket, grouping the points by bucket index. The transposition is performed using a RadixSort-based scheme. In the running example below, $s = 3$, so there are $2^3 = 8$ buckets and the matrix has 5 non-zero entries.
Extract Row Indices:
Use the CSR row_ptr to compute each non-zero entry’s row position and store it in RowIndex.
For the running example: ColIndex = [4, 3, 7, 4, 3], RowIndex = [0, 1, 2, 2, 3]
Triplet Formation & Sorting:
Create triplets: <ColIndex, Data, RowIndex>
Sort using ColIndex
as key
Sorted outputs:
Data
→ becomes transposed matrix Data
RowIndex
→ becomes transposed ColIndex
0
1
2
3
4
Generate Transposed row_ptr:
Count the occurrences of each ColIndex and compute the prefix sum. In the example, the counts per bucket (columns 0–7) are [0, 0, 0, 2, 2, 0, 0, 1]:
⇒ prefix_sum([0, 0, 0, 2, 2, 0, 0, 1]) = [0, 0, 0, 0, 2, 4, 4, 4, 5]
Resulting Transposed CSR format:
The matrix is now transposed, grouping the points by bucket index. In the example, the transposed col_indices = [1, 3, 0, 2, 2] and row_ptr = [0, 0, 0, 0, 2, 4, 4, 4, 5], while values holds the correspondingly reordered bucket points.
Perform multiplication:
The transposed matrix is multiplied by an all-ones vector, so each resulting entry is a bucket sum, i.e., the sum of all points that fell into that bucket.
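The transpose-by-sorting and the bucket-sum SPMV can be sketched with the example arrays above; a plain stable sort stands in for the radix sort, and placeholder labels ("d0"–"d4") stand in for the EC bucket points:

```python
# Sketch of the RadixSort-style transpose and the bucket-sum SPMV, using the
# ColIndex / RowIndex arrays from the example above.

col_index = [4, 3, 7, 4, 3]
row_index = [0, 1, 2, 2, 3]
data      = ["d0", "d1", "d2", "d3", "d4"]
num_buckets = 8                                  # 2^s with s = 3

# Triplet formation and (stable) sorting by ColIndex.
triplets = sorted(zip(col_index, data, row_index), key=lambda tr: tr[0])
t_data        = [d for _, d, _ in triplets]      # transposed values
t_col_indices = [r for _, _, r in triplets]      # transposed col_indices

# Transposed row_ptr: count occurrences of each ColIndex, then prefix-sum.
counts = [0] * num_buckets
for c in col_index:
    counts[c] += 1
t_row_ptr = [0]
for c in counts:
    t_row_ptr.append(t_row_ptr[-1] + c)

print(t_col_indices)  # [1, 3, 0, 2, 2]
print(t_row_ptr)      # [0, 0, 0, 0, 2, 4, 4, 4, 5]

# Bucket accumulation as an SPMV with an all-ones vector: each bucket sum is
# the "sum" of the entries in its row (string concatenation here).
bucket_sums = ["+".join(t_data[t_row_ptr[b]:t_row_ptr[b + 1]]) for b in range(num_buckets)]
print(bucket_sums)    # ['', '', '', 'd1+d4', 'd0+d3', '', '', 'd2']
```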
Optimizations
To handle thread load imbalance (due to varying row lengths), cuZK introduces CSR-Balanced:
① Sort rows
By row length (ascending)
② Group rows
Rows are grouped so that each group has a similar workload, ensuring balanced warp-level execution.
③ One warp handles one group (ensures uniform workload)
④ Warp allocation
Warp count is proportional to the number of non-zeros in each group
✅ Ensures balanced thread workloads → maximizes GPU efficiency
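A simplified sketch of the balancing idea (the greedy grouping rule and the target group size are illustrative, not the paper’s exact policy):

```python
# Simplified sketch of CSR-Balanced: sort rows by length, then greedily pack
# them into groups with a roughly equal number of non-zeros so that each warp
# receives a similar amount of work.

def balance_rows(row_ptr, target_nnz_per_group):
    row_lengths = [(row_ptr[i + 1] - row_ptr[i], i) for i in range(len(row_ptr) - 1)]
    row_lengths.sort()                      # 1. sort rows by length (ascending)
    groups, current, current_nnz = [], [], 0
    for length, row in row_lengths:         # 2. greedily group rows
        current.append(row)
        current_nnz += length
        if current_nnz >= target_nnz_per_group:
            groups.append(current)
            current, current_nnz = [], 0
    if current:
        groups.append(current)
    return groups                           # 3. one warp is assigned per group

# Example: row lengths are [1, 8, 2, 1, 8] for this row_ptr.
print(balance_rows([0, 1, 9, 11, 12, 20], target_nnz_per_group=8))  # [[0, 3, 2, 1], [4]]
```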
For scalar SPMV, the sorting overhead usually outweighs the benefits, but here it is negligible compared to the EC point operations, so CSR-Balanced can be safely adopted.
Bucket reduction and window reduction
After the SPMV produces the bucket sums, the bucket reduction is decomposed into parts, and multiple threads perform a parallel reduction of the EC bucket points into a single sum per window; a final window reduction then combines the per-window results into the MSM output.
Overall, the cost of cuZK’s MSM pipeline consists of three phases: storing points into the ELL matrix, SPMV (bucket accumulation) plus bucket reduction, and window reduction.
As shown in the figure above, the protocol also performs a Matrix-Vector Multiplication (MUL) operation on the R1CS (Rank-1 Constraint System) matrices generated by compiling the target function (prior to the INTT step). The R1CS matrix is typically very large and sparse; in Filecoin, for example, less than 0.1% of the matrix entries are non-zero. To handle such sparsity efficiently, sparse matrix optimization techniques are essential, and cuZK accordingly stores the R1CS matrix in the CSR (Compressed Sparse Row) format.
While many parallel SPMV (Sparse Matrix-Vector Product) techniques exist, no single scheme performs optimally across all matrix types.
CSR-Scalar [Gar08]: each thread handles one row → severe load imbalance when row lengths vary significantly
CSR-Vector [BG09]: a row is divided across multiple threads → only effective in specific scenarios
CSR-Balanced: load is balanced based on the row distribution → requires row sorting, which increases complexity
Instead of using a fixed static scheme, cuZK dynamically selects the most suitable SPMV strategy based on the statistical characteristics of the R1CS matrix:
Low variance and low mean row length → CSR-Scalar
Low variance but high mean row length → CSR-Vector
High variance in row length → CSR-Balanced
This adaptive strategy allows cuZK to avoid performance degradation by choosing the optimal parallel execution path for the structure of each matrix. While this adaptive strategy incurs overhead—due to sorting and computing statistical properties like mean and variance—this is not an issue because the R1CS matrix is fixed per circuit and can be reused. As a result, these computations can be performed offline, introducing no runtime overhead during actual proof generation.
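A sketch of this selection logic; the thresholds are illustrative placeholders, and in practice the statistics come from the fixed R1CS matrix, so the choice can be made offline:

```python
# Sketch of the adaptive SPMV selection described above. The thresholds are
# illustrative placeholders, not the paper's values.

def choose_spmv_scheme(row_ptr, mean_threshold=32.0, var_threshold=16.0):
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    mean = sum(lengths) / len(lengths)
    var = sum((l - mean) ** 2 for l in lengths) / len(lengths)
    if var >= var_threshold:
        return "CSR-Balanced"        # high variance in row length
    if mean >= mean_threshold:
        return "CSR-Vector"          # low variance, high mean row length
    return "CSR-Scalar"              # low variance, low mean row length

print(choose_spmv_scheme([0, 2, 4, 5, 7]))   # short, regular rows -> CSR-Scalar
```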
Even lightweight operations (e.g., variable initialization) are executed using small GPU kernels, ensuring that even single-thread tasks run on the GPU. This design minimizes data movement between CPU and GPU. In addition, cuZK integrates with well-known high-speed implementations, enabling complete GPU execution of the entire Groth16 proving pipeline.
Only the three essential storage modules required for zkSNARK execution are transferred to the GPU:
R1CS instance: three CSR matrices generated from the circuit
Function inputs: the input vector, including intermediate computation results
Prover key: a large structure composed of elliptic curve (EC) point vectors
Among these, the R1CS instance and function inputs are required for the initial MUL operation, and must be transferred before computation begins. The prover key, being significantly larger, is transferred in parallel with computation using an overlapping approach.
By leveraging the GPU’s multi-streaming capability, data transfers and computations are overlapped, thereby removing potential bottlenecks. This method eliminates almost all latency caused by prover key transfers, and once each MSM computation step is complete, unused EC point memory is immediately freed, saving GPU memory.
Table 1 shows the experimental setup. The environment uses 8 NVIDIA V100 GPUs connected via NVIDIA’s GPU-to-GPU interconnect, which enables high-speed communication without routing through the CPU, thereby maximizing parallel processing efficiency.
Table 2 presents the various baseline MSM implementations used for performance comparisons. Each baseline supports different elliptic curves, and cuZK was benchmarked against each using the curves supported by the respective implementation.
Table 3 shows MSM performance comparisons on a single GPU across different elliptic curves. cuZK outperforms the previous state-of-the-art (Bellperson) by 2.08× to 2.29×. The gap against Mina’s implementation is far larger, since Mina uses a Straus-based MSM, which is less efficient for large-scale MSM.
As shown in the figure above, the NVIDIA V100 features 5120 CUDA cores, and cuZK achieves near-perfect linear speedup as long as the number of launched threads does not exceed the number of cores.
Table 4 summarizes multi-GPU MSM performance. Across the tested configurations, the measured speedups over a single GPU are roughly 1.68–2.1× with 2 GPUs, 3.13–4.3× with 4 GPUs, and 5.4–7.14× with 8 GPUs.
⚠️ However, performance does not scale perfectly linearly as the number of GPUs increases.
This is due to imbalanced subtask partitioning. Each window is one subtask, and in this example there are 13 windows in total:
With 2 GPUs → assigned 6 and 7 windows respectively → total of 7 block times
With 4 GPUs → assigned 3, 3, 3, and 4 windows → only 4 block times
Thus, if the number of subtasks does not divide evenly among GPUs, it leads to load imbalance, which degrades scalability.
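A tiny calculation illustrates the effect; the window count of 13 matches the example above, and the GPU counts are illustrative:

```python
# Why subtask (window) partitioning limits multi-GPU scaling: the total time
# is governed by the GPU that receives the most windows.

def makespan(num_windows, num_gpus):
    per_gpu = [num_windows // num_gpus] * num_gpus
    for g in range(num_windows % num_gpus):   # distribute the remainder
        per_gpu[g] += 1
    return max(per_gpu)                        # time in "window blocks"

for gpus in (1, 2, 4, 8):
    print(gpus, makespan(13, gpus))            # 1: 13, 2: 7, 4: 4, 8: 2
```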
cuZK was also compared against two recent high-performance GPU MSM implementations.
While both deliver high performance via precomputation, they have limitations:
Require large storage space
Support only random scalars (not optimized for clustered scalars)
In Groth16 proof generation benchmarks:
cuZK achieved up to 2.94× speedup over Bellperson using 1 V100
And up to 4.86× speedup using 4 V100s
cuZK’s core contributions are as follows:
A novel sparse matrix-based parallel MSM algorithm
By leveraging the sparse structure of the matrix, cuZK efficiently restructures the MSM computation and mitigates data bottlenecks on GPUs.
Parallelization of MUL (Matrix-Vector Multiplication)
Multiplication operations are parallelized to maximize utilization of GPU cores.
Optimization of CPU-GPU data transfer
Redundant data transfers are eliminated, and
Data transfers are overlapped with device-side computation to minimize latency and avoid performance degradation.
🔗 GitHub (Open Source):
Written by from