
Ryan's Trick for Distributed Groth16


Please read Groth16 beforehand!

Bottlenecks

These are the three most computation-intensive parts that must be distributed:

1. Witness Reduction

This is required to compute $a(X), b(X), c(X)$:

$$
\sum_{i=0}^m (A_{j,i} \cdot z_i), \quad \sum_{i=0}^m (B_{j,i} \cdot z_i), \quad \sum_{i=0}^m (C_{j,i} \cdot z_i)
$$
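For concreteness, a minimal single-device sketch of this step, with a toy 64-bit prime standing in for the BN254 scalar field and a hypothetical CSR layout for the sparse matrix $A$ (the same loop applies to $B$ and $C$):

// Witness reduction as a sparse matrix-vector product: for each row j,
// accumulate sum_i A[j][i] * z[i]. Toy field, not BN254.
#include <cstdint>
#include <iostream>
#include <vector>

constexpr uint64_t P = 0xffffffff00000001ULL; // toy 64-bit prime (Goldilocks)

uint64_t add_mod(uint64_t a, uint64_t b) {
    return static_cast<uint64_t>((static_cast<__uint128_t>(a) + b) % P);
}
uint64_t mul_mod(uint64_t a, uint64_t b) {
    return static_cast<uint64_t>(static_cast<__uint128_t>(a) * b % P);
}

// Compressed-sparse-row storage for the n x (m+1) matrix A.
struct Csr {
    std::vector<size_t> row_ptr;   // row j's nonzeros live in [row_ptr[j], row_ptr[j+1])
    std::vector<size_t> col_idx;   // column index i of each nonzero
    std::vector<uint64_t> val;     // value A_{j,i} of each nonzero
};

// Computes the vector (sum_i A_{j,i} * z_i) for every row j.
std::vector<uint64_t> witness_reduction(const Csr& A, const std::vector<uint64_t>& z) {
    std::vector<uint64_t> out(A.row_ptr.size() - 1, 0);
    for (size_t j = 0; j + 1 < A.row_ptr.size(); ++j)
        for (size_t k = A.row_ptr[j]; k < A.row_ptr[j + 1]; ++k)
            out[j] = add_mod(out[j], mul_mod(A.val[k], z[A.col_idx[k]]));
    return out;
}

int main() {
    // Toy 2x2 instance: A = [[1, 2], [0, 3]], z = (5, 7) -> output (19, 21).
    Csr A{{0, 2, 3}, {0, 1, 1}, {1, 2, 3}};
    for (uint64_t v : witness_reduction(A, {5, 7})) std::cout << v << '\n';
}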

2. FFT / Inverse FFT

This is needed to compute $h_i$:

$$
\begin{aligned}
a(X) &= \sum_{j=0}^{n-1} L_j(X) \left(\sum_{i=0}^{m} A_{j,i} \cdot z_i\right), \quad \bm{a} = \left(a'(\omega^0), \dots, a'(\omega^{n-1})\right) \\
b(X) &= \sum_{j=0}^{n-1} L_j(X) \left(\sum_{i=0}^{m} B_{j,i} \cdot z_i\right), \quad \bm{b} = \left(b'(\omega^0), \dots, b'(\omega^{n-1})\right) \\
c(X) &= \sum_{j=0}^{n-1} L_j(X) \left(\sum_{i=0}^{m} C_{j,i} \cdot z_i\right), \quad \bm{c} = \left(c'(\omega^0), \dots, c'(\omega^{n-1})\right)
\end{aligned}
$$
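For context (standard Groth16, stated here for completeness): the $h_i$ are the coefficients of the quotient polynomial

$$
h(X) = \frac{a(X)\,b(X) - c(X)}{Z_H(X)}, \qquad Z_H(X) = X^n - 1,
$$

which is why the evaluation vectors $\bm{a}, \bm{b}, \bm{c}$ are taken on a domain disjoint from the roots of $Z_H$ (e.g., odd powers of a $2n$-th root of unity, consistent with the $L'_{2i+1}$ basis appearing below).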

3. MSM

This is used to compute $([A]_1, [B]_2, [C]_1)$, involving five MSMs, one of which depends on $h_i$:

$$
\begin{aligned}
&\sum_{i=0}^m z_i [a_i(x)]_1, \quad \sum_{i=0}^m z_i [b_i(x)]_1, \quad \sum_{i=0}^m z_i [b_i(x)]_2, \\
&\sum_{i=\ell+1}^{m} z_i \left[ \frac{\beta a_i(x) + \alpha b_i(x) + c_i(x)}{\delta} \right]_1, \quad
\sum_{i=0}^{n-2} h_i \left[ \frac{L'_{2i+1}(x)}{\delta} \right]_1
\end{aligned}
$$

Partial Witness Reduction + Partial Inverse FFT

Motivation

The primary motivation for performing Partial Witness Reduction and Partial Inverse FFT is to partition the data structures $A, B, C, \bm{z}$ across $d$ devices. By doing so, each device only stores a fraction of the full data, significantly reducing memory overhead and enabling large-scale proving even when the entire witness cannot fit into a single GPU's memory.

In Partial Witness Reduction and Partial Inverse FFT, the matrices $A, B, C$ are typically sparse, making it difficult to precisely estimate the memory savings from their partitioning. However, the vector $\bm{z}$ is dense and evenly split across $d$ devices, so its memory overhead is reduced by a factor of $d$.

Protocol

Assuming $m + 1$ is divisible by $d$ (the number of devices), the polynomials can be decomposed as follows:

$$
\begin{aligned}
a(X) &= \sum_{k=0}^{d-1} \underbrace{\left(\sum_{j=0}^{n-1} L_j(X) \left(\sum_{i=k(m+1)/d}^{(k+1)(m+1)/d - 1} A_{j,i} \cdot z_i\right)\right)}_{a^{(k)}(X)} \\
b(X) &= \sum_{k=0}^{d-1} \underbrace{\left(\sum_{j=0}^{n-1} L_j(X) \left(\sum_{i=k(m+1)/d}^{(k+1)(m+1)/d - 1} B_{j,i} \cdot z_i\right)\right)}_{b^{(k)}(X)} \\
c(X) &= \sum_{k=0}^{d-1} \underbrace{\left(\sum_{j=0}^{n-1} L_j(X) \left(\sum_{i=k(m+1)/d}^{(k+1)(m+1)/d - 1} C_{j,i} \cdot z_i\right)\right)}_{c^{(k)}(X)}
\end{aligned}
$$

Thus, the matrices and the witness vector can be partitioned into $d$ parts:

$$
A \to (A_k)_{k=0}^{d-1} \in \left(\mathbb{F}^{n \times \frac{m+1}{d}}\right)^d, \quad
B \to (B_k)_{k=0}^{d-1} \in \left(\mathbb{F}^{n \times \frac{m+1}{d}}\right)^d, \quad
C \to (C_k)_{k=0}^{d-1} \in \left(\mathbb{F}^{n \times \frac{m+1}{d}}\right)^d, \quad
\bm{z} \to (\bm{z}_k)_{k=0}^{d-1} \in \left(\mathbb{F}^{\frac{m+1}{d}}\right)^d
$$

Examples

Suppose we are given the following matrices $A, B, C$ and vector $\bm{z}$:

$$
A = \begin{bmatrix} A_{0,0} & A_{0,1} \\ A_{1,0} & A_{1,1} \\ A_{2,0} & A_{2,1} \\ A_{3,0} & A_{3,1} \end{bmatrix}, \quad
B = \begin{bmatrix} B_{0,0} & B_{0,1} \\ B_{1,0} & B_{1,1} \\ B_{2,0} & B_{2,1} \\ B_{3,0} & B_{3,1} \end{bmatrix}, \quad
C = \begin{bmatrix} C_{0,0} & C_{0,1} \\ C_{1,0} & C_{1,1} \\ C_{2,0} & C_{2,1} \\ C_{3,0} & C_{3,1} \end{bmatrix}, \quad
\bm{z} = (z_0, z_1)
$$

Then the polynomial $a(X)$ is constructed as follows:

$$
\begin{aligned}
a(X) &= L_0(X)(A_{0,0}z_0 + A_{0,1}z_1) + L_1(X)(A_{1,0}z_0 + A_{1,1}z_1) + L_2(X)(A_{2,0}z_0 + A_{2,1}z_1) + L_3(X)(A_{3,0}z_0 + A_{3,1}z_1) \\
&= \underbrace{L_0(X)(A_{0,0}z_0) + L_1(X)(A_{1,0}z_0) + L_2(X)(A_{2,0}z_0) + L_3(X)(A_{3,0}z_0)}_{a^{(0)}(X)} \\
&\quad + \underbrace{L_0(X)(A_{0,1}z_1) + L_1(X)(A_{1,1}z_1) + L_2(X)(A_{2,1}z_1) + L_3(X)(A_{3,1}z_1)}_{a^{(1)}(X)}
\end{aligned}
$$

and $b(X)$ and $c(X)$ decompose identically into $b^{(0)}(X) + b^{(1)}(X)$ and $c^{(0)}(X) + c^{(1)}(X)$ using the entries of $B$ and $C$.
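A runnable toy check of this decomposition (toy prime and data of my choosing; $d = 2$ devices and the $4 \times 2$ shape from the example): summing the per-device partial reductions reproduces the full reduction of each $L_j(X)$ coefficient.

// Toy check of the column-block decomposition a(X) = sum_k a^(k)(X), in
// coefficient-of-L_j form: reducing the full witness equals summing the
// per-device partial reductions. Toy field, not BN254.
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t P = 101; // toy prime

int main() {
    const size_t n = 4, m_plus_1 = 2, d = 2;
    std::vector<std::vector<uint64_t>> A = {{3, 7}, {0, 2}, {5, 5}, {1, 9}};
    std::vector<uint64_t> z = {11, 13};

    // Full reduction: a_j = sum_i A[j][i] * z_i (the coefficient of L_j(X)).
    std::vector<uint64_t> full(n, 0);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m_plus_1; ++i)
            full[j] = (full[j] + A[j][i] * z[i]) % P;

    // Partial reduction on device k: only columns [k(m+1)/d, (k+1)(m+1)/d).
    std::vector<uint64_t> summed(n, 0);
    for (size_t k = 0; k < d; ++k) {
        std::vector<uint64_t> partial(n, 0);
        for (size_t j = 0; j < n; ++j)
            for (size_t i = k * m_plus_1 / d; i < (k + 1) * m_plus_1 / d; ++i)
                partial[j] = (partial[j] + A[j][i] * z[i]) % P;
        for (size_t j = 0; j < n; ++j)   // the Reduce step
            summed[j] = (summed[j] + partial[j]) % P;
    }
    for (size_t j = 0; j < n; ++j) assert(summed[j] == full[j]);
}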

Full FFT

Motivation

In protocols like DIZK, distributing the FFT requires 3 communications per FFT, which leads to substantial overhead. However, according to data from ICICLE-Snark, the dominant cost in Groth16 proving is MSM, not FFT. Therefore, we choose not to split the FFT, and instead perform a Full FFT after reconstructing the fully reduced polynomials. This design reduces communication complexity while focusing optimization efforts on the actual bottleneck.

Protocol

After the partial inverse FFT, each device performs a Reduce operation to reconstruct the full polynomials $a(X), b(X), c(X)$.

We use Reduce instead of AllReduce for the Full FFT step in order to minimize data communication cost. As a result, a single device is responsible for performing the element-wise multiplication and the forward FFT on the fully reduced polynomials.

Example

Suppose we have 4 devices, and each device holds polynomials $a^{(k)}(X), b^{(k)}(X), c^{(k)}(X)$ for $k = 0, 1, 2, 3$. To proceed with proof generation, the leader device must eventually obtain the full sums:

$$
\begin{aligned}
a(X) &= a^{(0)}(X) + a^{(1)}(X) + a^{(2)}(X) + a^{(3)}(X) \\
b(X) &= b^{(0)}(X) + b^{(1)}(X) + b^{(2)}(X) + b^{(3)}(X) \\
c(X) &= c^{(0)}(X) + c^{(1)}(X) + c^{(2)}(X) + c^{(3)}(X)
\end{aligned}
$$
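A minimal sketch of this Reduce in the same NCCL style as the Send & Recv snippet later on this page. `field_add_inplace` is a hypothetical CUDA kernel for element-wise addition modulo the BN254 prime (NCCL's built-in `ncclSum` adds machine integers, so it cannot reduce field elements directly), and `comm`/`stream` are assumed initialized:

// Leader (rank 0) accumulates the partial polynomial a^(k) from every
// other GPU and adds it modularly into its own partial.
const size_t num_coeffs = ...;            // e.g., 2^23 coefficients per polynomial
const size_t num_bytes = num_coeffs * 32; // 32 bytes per BN254 field element

if (rank == 0) {
    for (int src = 1; src < num_gpus; ++src) {
        ncclRecv(recv_buf, num_bytes, ncclUint8, src, comm, stream);
        // my_partial[i] = (my_partial[i] + recv_buf[i]) mod p  (hypothetical kernel)
        field_add_inplace(my_partial, recv_buf, num_coeffs, stream);
    }
    // `my_partial` now holds the fully reduced a(X); repeat for b(X), c(X).
} else {
    ncclSend(my_partial, num_bytes, ncclUint8, 0, comm, stream);
}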

Partial MSM

Motivation

Each MSM can be intuitively split across devices. When performing the 5 MSMs involved in Groth16 proving, both the scalar inputs $z_i$ and $h_i$ are partitioned across $d$ devices, reducing their memory footprint by a factor of $d$. As a result, the corresponding base points used in each MSM are also reduced proportionally, leading to a significant decrease in per-device memory usage and enabling more efficient multi-GPU computation.

Protocol

Assuming $n - 1$ is divisible by $d$:

$$
\begin{aligned}
\sum_{i=0}^m z_i [a_i(x)]_1 &= \sum_{k=0}^{d-1} \left( \sum_{i=k(m+1)/d}^{(k+1)(m+1)/d - 1} z_i [a_i(x)]_1 \right) \\
\sum_{i=0}^m z_i [b_i(x)]_1 &= \sum_{k=0}^{d-1} \left( \sum_{i=k(m+1)/d}^{(k+1)(m+1)/d - 1} z_i [b_i(x)]_1 \right) \\
\sum_{i=0}^m z_i [b_i(x)]_2 &= \sum_{k=0}^{d-1} \left( \sum_{i=k(m+1)/d}^{(k+1)(m+1)/d - 1} z_i [b_i(x)]_2 \right) \\
\sum_{i=\ell+1}^{m} z_i \left[ \frac{\beta a_i(x) + \alpha b_i(x) + c_i(x)}{\delta} \right]_1 &= \sum_{k=0}^{d-1} \left( \sum_{i=\max(k(m+1)/d,\, \ell+1)}^{(k+1)(m+1)/d - 1} \cdots \right) \\
\sum_{i=0}^{n-2} h_i \left[ \frac{L'_{2i+1}(x)}{\delta} \right]_1 &= \sum_{k=0}^{d-1} \left( \sum_{i=k(n-1)/d}^{(k+1)(n-1)/d - 1} \cdots \right)
\end{aligned}
$$

The first four MSMs can be computed independently. The last one requires the precomputed $h_i$, which each device holds after the Send & Recv operations.
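A toy illustration of this splitting identity, with modular addition standing in for the elliptic-curve group law (so `scalar_mul` plays the role of $z_i \cdot G_i$; the chunked partial sums combine to the full MSM exactly as above):

// Split one MSM across d "devices": chunk the scalar/base arrays, compute
// each partial sum, then combine the partial group elements (the AllReduce
// step). The group here is (Z_P, +) as a stand-in for G1.
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t P = 2305843009213693951ULL; // toy Mersenne prime 2^61 - 1

uint64_t scalar_mul(uint64_t s, uint64_t g) { // stands in for [s]G
    return static_cast<uint64_t>(static_cast<__uint128_t>(s) * g % P);
}

uint64_t msm(const std::vector<uint64_t>& s, const std::vector<uint64_t>& g,
             size_t lo, size_t hi) {
    uint64_t acc = 0;
    for (size_t i = lo; i < hi; ++i)
        acc = static_cast<uint64_t>((static_cast<__uint128_t>(acc) + scalar_mul(s[i], g[i])) % P);
    return acc;
}

int main() {
    const size_t d = 4, total = 1 << 8;
    std::vector<uint64_t> s(total), g(total);
    for (size_t i = 0; i < total; ++i) { s[i] = 3 * i + 1; g[i] = 7 * i + 2; }

    uint64_t full = msm(s, g, 0, total);
    uint64_t combined = 0;
    for (size_t k = 0; k < d; ++k) { // each "device" handles one chunk
        uint64_t partial = msm(s, g, k * total / d, (k + 1) * total / d);
        combined = static_cast<uint64_t>((static_cast<__uint128_t>(combined) + partial) % P);
    }
    assert(combined == full);
}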

Example

The snippet below shows the Send & Recv step: the leader GPU (rank 0) distributes chunks of the full polynomial coefficients to the other GPUs using NCCL point-to-point calls.

const int field_element_size = 32;  // 32 bytes per BN254 field element
const int num_coeffs_per_gpu = ...; // e.g., 2^23 / num_gpus
const size_t chunk_size = (size_t)num_coeffs_per_gpu * field_element_size;

if (rank == 0) {
    // `full_poly_data` points to the full polynomial coefficients (uint8_t*)
    for (int i = 0; i < num_gpus; ++i) {
        if (i != rank) { // the leader keeps its own chunk locally
            uint8_t* chunk_ptr = full_poly_data + i * chunk_size;
            ncclSend(chunk_ptr, chunk_size, ncclUint8, i, comm, stream);
        }
    }
} else {
    // `my_poly_chunk` will hold this GPU's assigned coefficients
    ncclRecv(my_poly_chunk, chunk_size, ncclUint8, 0, comm, stream);
}

Put Together

In the following expressions, red highlights indicate the parts that must be reduced across devices. The remaining parts can be computed on the host to finalize the Groth16 proof:

$$
\begin{aligned}
[A]_1 &= [\alpha]_1 + \textcolor{red}{\underbrace{\sum_{i=0}^m z_i [a_i(x)]_1}_{A}} + r[\delta]_1 \\
[B]_2 &= [\beta]_2 + \textcolor{red}{\underbrace{\sum_{i=0}^m z_i [b_i(x)]_2}_{B}} + s[\delta]_2 \\
[C]_1 &= \textcolor{red}{\underbrace{\sum_{i=\ell+1}^{m} z_i \left[ \frac{\beta a_i(x) + \alpha b_i(x) + c_i(x)}{\delta} \right]_1}_{C_1} + \underbrace{\sum_{i=0}^{n-2} h_i \left[ \frac{L'_{2i+1}(x)}{\delta} \right]_1}_{C_2}} + s[A]_1 + r\textcolor{red}{\underbrace{[B]_1}_{C_3}} - rs[\delta]_1
\end{aligned}
$$

| Communication Step | Communication Cost |
| --- | --- |
| Reduce | $3 \times (d-1) \times n$ |
| Send & Recv | $(d-1) \times \frac{n}{d}$ |
| AllReduce | $O(d)$ |
| DIZK's AllToAll | $3 \times 3 \times 2 \times (d-1) \times n$ |

In protocols like DIZK, distributing the FFT requires many communications per FFT. In contrast, our protocol incurs communication only during the Reduce, Send, and Recv steps, while the subsequent AllReduce step involves only a constant number of group elements. As a result, the total communication cost is significantly lower than that of DIZK. Moreover, this distribution strategy can be efficiently implemented using SPMD.
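Plugging in the setting analyzed below ($d = 4$, $n = 2^{23}$, 32-byte BN254 elements), the Reduce row amounts to:

$$
3 \times (d-1) \times n = 9 \times 2^{23}\ \text{elements} \;\Rightarrow\; 9 \times 2^{23} \times 32\ \text{B} = 2304\ \text{MB},
$$

which matches the figures in the Reduce Analysis section.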

Analysis: RISC0 Groth16 proof

Currently, the stark_verify.circom circuit used in RISC0 has the following characteristics:

> circom --r1cs stark_verify.circom 
template instances: 349
non-linear constraints: 5676573
linear constraints: 0
public inputs: 0
private inputs: 25749 (22333 belong to witness)
public outputs: 5
wires: 5635930
labels: 10298490
Written successfully: ./stark_verify.r1cs
Everything went okay

The number of rows in the $A, B, C$ matrices is 5,676,573, which means an SRS of size $2^{23}$ is required. The number of columns is 5,635,930.
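The SRS size follows from rounding the constraint count up to the next power of two:

$$
2^{22} = 4{,}194{,}304 < 5{,}676{,}573 \le 2^{23} = 8{,}388{,}608.
$$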

Reduce Analysis

We now analyze the cost of the first and most expensive communication step in our distributed system: the Reduce operation.

Assume 4 GPUs each compute and store partial results of the polynomials $a^{(k)}(X), b^{(k)}(X), c^{(k)}(X)$, and together they must aggregate them into full polynomials of size $2^{23}$ over the BN254 field. The total volume of data involved is:

  • 3 polynomials per GPU × 256 MB each to be sent to the leader GPU = 768 MB per GPU

  • Total communication volume on the receiver side of the leader GPU = 3 GPUs × 768 MB = 2304 MB

The goal of Reduce is to sum these partial polynomials on the leader device, so that 1 leader GPU retains the complete 768 MB result (i.e., the full $a(X), b(X), c(X)$).

Computation Cost

Field additions in BN254 are extremely lightweight on GPUs. Each operation for $a_i(X), b_i(X), c_i(X)$ takes less than 1 ms and is negligible in the overall runtime.

Communication Cost

PCIe Transfer Speed

| Item | Value |
| --- | --- |
| Interface | PCIe Gen4 x16 |
| Theoretical Bandwidth | 32 GB/s |
| Measured Bandwidth (via NCCL) | 24 ~ 28 GB/s |
| Receive time by the leader GPU | 2304 MB / (24 ~ 28 GB/s) ≈ 82 ~ 96 ms |
| Estimated with NCCL overhead | 92 ~ 106 ms |

We chose PCIe Gen4 x16 because it offers a well-balanced trade-off between performance, cost, and ecosystem stability. Gen4 provides up to 32 GB/s bidirectional bandwidth, which is sufficient for most real-world proving workloads, especially when combined with smart overlapping strategies between computation and communication.

While bandwidth can become a performance bottleneck in some high-throughput scenarios, upgrading to newer generations like PCIe Gen5 or Gen6 introduces significant trade-offs in cost, complexity, and platform requirements. For now, Gen4 remains the most practical and widely supported option, but we remain open to adopting higher PCIe generations if communication overhead proves to be a critical limiting factor.

Source: https://www.marvell.com/content/dam/marvell/en/blogs/2024/01/PCIe-Gen6-IO-Bandwidth.png

Performance Summary

| Task | Time |
| --- | --- |
| Field addition | < 3 ms |
| PCIe communication (Reduce) | 92 ~ 106 ms |
| Total Execution Time | 95 ~ 109 ms |

According to ICICLE-Snark, a polynomial of degree $2^{23}$ takes approximately 774 ms for MSM. With four devices, each handling a degree $2^{21}$ polynomial, the per-device MSM time is around 193 ms. If we overlap the Reduce step with two of these MSMs, the communication overhead can be effectively hidden.
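One way to realize this overlap is to run the Reduce on a dedicated stream while the MSMs that do not depend on $h_i$ proceed on a compute stream. A hedged sketch of the pattern (not RISC0's or ICICLE's actual API; `run_msm_chunk` is a hypothetical launcher, and a real field-element Reduce needs the custom modular add noted earlier rather than `ncclSum`):

// Overlap communication with computation using two CUDA streams.
cudaStream_t compute_stream, comm_stream;
cudaStreamCreate(&compute_stream);
cudaStreamCreate(&comm_stream);

// Start reducing the a(X), b(X), c(X) partials on the communication stream.
ncclGroupStart();
for (int p = 0; p < 3; ++p)
    ncclReduce(partial[p], full[p], num_bytes, ncclUint8, ncclSum,
               /*root=*/0, comm, comm_stream); // placeholder op; see note above
ncclGroupEnd();

// Meanwhile, run two h-independent MSM slices (~193 ms each) on the compute
// stream, hiding the 92 ~ 106 ms Reduce behind them.
run_msm_chunk(z_chunk, bases_a1, out_a1, compute_stream);
run_msm_chunk(z_chunk, bases_b1, out_b1, compute_stream);

cudaStreamSynchronize(comm_stream);
cudaStreamSynchronize(compute_stream);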

Estimation

Assumptions

Let's perform a simple estimation based on the data from ICICLE-Snark:

  1. The total proving time for Groth16 is the sum of the time taken for 5 MSMs (387 ms each), 1 FFT (10 ms), and 1 IFFT (10 ms).

  2. The time for Reduce, Recv and Send is negligible.

  3. $n$ and $m$ are $2^{22}$.

  4. The runtime for MSM over $\mathbb{G}_2$ is the same as that for $\mathbb{G}_1$.

  5. The runtime for $\mathbb{G}_1$ MSM scales linearly with the degree.

Computation

If everything is computed serially, the total time is:

$$
5 \times 387 + 10 + 10 = 1955\ \text{ms}
$$

If we instead use the proposed scheme across 4 GPUs, the time becomes:

$$
5 \times (387 / 4) + 10 + 10 \approx 504\ \text{ms}
$$

This shows that the proving time is reduced by approximately a factor of 4.

Input Size

We do not include $A, B, C$ in our input size estimation, as their sparsity makes it difficult to quantify precisely. However, they will also contribute to reducing the overall memory requirement.

If everything is computed on a single device, the input size is around 1664 MB. Each component will consume memory as follows:

  • Witness vector $\bm{z}$: $2^{22} \times 32$ B $= 128$ MB

  • MSM base points $\left([a_i(x)]_1\right)_{i=0}^{m}$: $2^{22} \times 64$ B $= 256$ MB

  • MSM base points $\left([b_i(x)]_1\right)_{i=0}^{m}$: $2^{22} \times 64$ B $= 256$ MB

  • MSM base points $\left([b_i(x)]_2\right)_{i=0}^{m}$: $2^{22} \times 128$ B $= 512$ MB

  • MSM base points $\left(\left[\frac{\beta a_i(x) + \alpha b_i(x) + c_i(x)}{\delta}\right]_1\right)_{i=\ell+1}^{m}$: $2^{22} \times 64$ B $\approx 256$ MB

  • MSM base points $\left(\left[\frac{L'_{2i+1}(x)}{\delta}\right]_1\right)_{i=0}^{n-2}$: $2^{22} \times 64$ B $\approx 256$ MB

If we instead use the proposed scheme across 4 GPUs, the input size becomes around 416 MB.
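Summing the six components above:

$$
128 + 256 + 256 + 512 + 256 + 256 = 1664\ \text{MB}, \qquad \frac{1664}{4} = 416\ \text{MB per GPU}.
$$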

Runtime Memory Size

Here, we estimate only the memory required for the MSM itself, excluding the additional memory needed for buckets in the Pippenger algorithm.

If everything is computed on a single device, and intermediate memory is released immediately after use, the main bottleneck becomes the MSM in $\mathbb{G}_2$, which requires $128 + 512 = 640$ MB of memory.

However, with the proposed scheme using 4 GPUs, this memory requirement is reduced to $32 + 128 = 160$ MB. The Full FFT, including twiddle factors, requires an additional 256 MB. Therefore, under this setup, the total memory required per device is approximately 256 MB.
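The per-device figure comes from splitting both the scalar and base-point arrays of the $\mathbb{G}_2$ MSM across 4 GPUs:

$$
\underbrace{\frac{2^{22}}{4} \times 32\ \text{B}}_{\bm{z}\ \text{chunk}\ =\ 32\ \text{MB}} + \underbrace{\frac{2^{22}}{4} \times 128\ \text{B}}_{\mathbb{G}_2\ \text{bases}\ =\ 128\ \text{MB}} = 160\ \text{MB}.
$$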


Written by Ryan Kim from A41
