DIZK

Presentation: https://youtu.be/jbWOxNntmPQ

Introduction

Generating a zero-knowledge proof (ZKProof) is computationally expensive and involves significant memory overhead. DIZK addresses this by leveraging Apache Spark to generate Groth16 proofs across multiple machines in a distributed cluster environment.

Background

Please read Distributed FFT!

Protocol Explanation

Distributed Setup

In the Groth16 proving system, the setup phase is critical for generating the proving and verification keys. The most computationally intensive part of this process is evaluating the QAP polynomials $(a_i(X), b_i(X), c_i(X))$ at a specific point $x$. DIZK is designed to perform this operation efficiently in a distributed environment using Spark.

Step 1: Distributed Lagrange Evaluation

As part of the QAP instance reduction, we must evaluate the Lagrange basis polynomials at the point $x$:

$$L_i(X) = \frac{1}{n} \cdot \frac{\omega^i}{X - \omega^i} \cdot (X^n - 1)$$

Each executor computes $L_i(x)$ for its assigned indices in parallel and stores the results as a Resilient Distributed Dataset (RDD).
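To make this concrete, here is a minimal Scala/Spark sketch of the step. It is a sketch only: Scala stands in for DIZK's actual Java codebase, field arithmetic is stubbed with `Long` values modulo a small prime, and `n`, `omega`, and the evaluation point `x` are assumed given.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object DistributedLagrange {
  // Stand-in field: integers mod a small prime (a real system uses a SNARK-friendly field).
  val P = 2147483647L // 2^31 - 1, assumed modulus for this sketch
  def mod(a: Long): Long = ((a % P) + P) % P
  def mul(a: Long, b: Long): Long = (BigInt(a) * BigInt(b) % BigInt(P)).toLong
  def pow(a: Long, e: Long): Long = BigInt(a).modPow(BigInt(e), BigInt(P)).toLong
  def inv(a: Long): Long = pow(a, P - 2) // Fermat inversion

  // Evaluate every Lagrange basis polynomial at x in parallel:
  // L_i(x) = (1/n) * omega^i / (x - omega^i) * (x^n - 1).
  def lagrangeAtPoint(sc: SparkContext, n: Int, omega: Long, x: Long): RDD[(Int, Long)] = {
    val zH   = mod(pow(x, n) - 1) // (x^n - 1), shared by every L_i
    val nInv = inv(n.toLong)      // 1/n
    sc.parallelize(0 until n).map { i =>
      val wi = pow(omega, i)      // omega^i
      (i, mul(mul(nInv, mul(wi, inv(mod(x - wi)))), zH)) // keyed by i for the later join
    }
  }
}
```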

Step 2: QAP Instance Reduction

  • Input: $A \in \mathbb{F}^{n \times (m + 1)}$, $x \in \mathbb{F}$

  • Output: $\left(a_i(x)\right)_{i=0}^{m}$, where $a_i(x) := \sum_{j=0}^{n-1} A_{j,i} \cdot L_j(x)$


Naive (Strawman) Approach:

  1. Join $A_{j,i}$ with $L_j(x)$ by index $j$

  2. Map each pair $(A_{j,i}, L_j(x))$ to the product $A_{j,i} \cdot L_j(x)$

  3. Reduce by $i$ to compute $\left( a_i(x) = \sum_{j=0}^{n-1} A_{j,i} \cdot L_j(x) \right)_{i=0}^m$

This join operation would produce a structure like:

$$\begin{pmatrix}
(A_{0,0}, L_0(x)) & (A_{0,1}, L_0(x)) & \cdots & (A_{0,m}, L_0(x)) \\
(A_{1,0}, L_1(x)) & (A_{1,1}, L_1(x)) & \cdots & (A_{1,m}, L_1(x)) \\
\vdots & \vdots & \ddots & \vdots \\
(A_{n-1,0}, L_{n-1}(x)) & (A_{n-1,1}, L_{n-1}(x)) & \cdots & (A_{n-1,m}, L_{n-1}(x))
\end{pmatrix}$$

However, the actual matrix $A$ is only almost sparse: most rows have few non-zero entries, but some rows are dense. For example:

$$\begin{pmatrix}
(991, L_0(x)) & (681, L_0(x)) & \cdots & (517, L_0(x)) \\
(0, L_1(x)) & (2476, L_1(x)) & \cdots & (0, L_1(x)) \\
\vdots & \vdots & \ddots & \vdots \\
(0, L_{n-1}(x)) & (8629, L_{n-1}(x)) & \cdots & (0, L_{n-1}(x))
\end{pmatrix}$$

As a result, certain executors—such as those handling the first row in the example—become overloaded, causing out-of-memory errors or straggler tasks.


To address this imbalance, Spark offers techniques such as blockjoin and skewjoin.

  • Blockjoin replicates the smaller RDD (typically $L_j(x)$) across all executors so that tasks can be executed evenly. This results in a join structure like:

$$\begin{pmatrix}
(A_{0,0}, L_0(x)) & (A_{0,1}, L_0(x)) & \cdots & (A_{0,m}, L_0(x)) \\
\vdots & \vdots & \ddots & \vdots \\
(A_{0,0}, L_{n-1}(x)) & (A_{0,1}, L_{n-1}(x)) & \cdots & (A_{0,m}, L_{n-1}(x)) \\
\vdots & \vdots & \ddots & \vdots \\
(A_{n-1,0}, L_0(x)) & (A_{n-1,1}, L_0(x)) & \cdots & (A_{n-1,m}, L_0(x)) \\
\vdots & \vdots & \ddots & \vdots \\
(A_{n-1,0}, L_{n-1}(x)) & (A_{n-1,1}, L_{n-1}(x)) & \cdots & (A_{n-1,m}, L_{n-1}(x))
\end{pmatrix}$$

Then each machine is assigned an even share of pairs, such as $\left(A_{i,0}, L_i(x)\right)_{i=0}^{n-1}$. While this approach ensures even data distribution, it significantly increases memory usage, since the replicated $L_j(x)$ values must be materialized on every executor performing the join.

  • Skewjoin identifies skewed keys and splits them across multiple partitions, replicating the corresponding RDD entries. However, since each $A_{j,i}$ is used only once, this ends up being equivalent to a naive join, offering no real benefit.


Solution: Hybrid Join Strategy

Phase 1: Identify Dense Rows

  • Tag any row with more than $\sqrt{n}$ non-zero entries as a dense row

  • Let $J_A$ be the set of all such dense row indices

Phase 2: Execute Hybrid Join

  • Dense rows ($j \in J_A$):

    • Join $A_{j,i}$ with $L_j(x)$ in blockjoin style, broadcasting the few values $L_j(x)$ for $j \in J_A$ to all executors

    • Map each pair to $A_{j,i} \cdot L_j(x)$

  • Sparse rows ($j \notin J_A$):

    • Same join and map as the strawman

  • Final aggregation:

    • Union both results: $(A_{j,i} \cdot L_j(x)) = (A_{j,i} \cdot L_j(x))_{j \in J_A} \cup (A_{j,i} \cdot L_j(x))_{j \notin J_A}$

    • Reduce by $i$ to compute $\left( a_i(x) = \sum_{j=0}^{n-1} A_{j,i} \cdot L_j(x) \right)_{i=0}^m$ (see the sketch below)
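Putting both phases together, a condensed Scala/Spark sketch (same stand-in field as the Lagrange sketch above; `aEntries`, holding only the non-zero entries of $A$ as `((j, i), value)` records, is a hypothetical layout rather than DIZK's actual data structure):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SetupHybridJoin {
  import DistributedLagrange.{P, mul}

  def qapReduction(sc: SparkContext,
                   aEntries: RDD[((Int, Int), Long)], // ((row j, col i), A_{j,i}), non-zeros only
                   lag: RDD[(Int, Long)],             // (j, L_j(x)) from Step 1
                   n: Int): RDD[(Int, Long)] = {
    // Phase 1: rows with more than sqrt(n) non-zero entries are dense.
    val threshold = math.sqrt(n.toDouble).toLong
    val denseRows = aEntries.map { case ((j, _), _) => (j, 1L) }
      .reduceByKey(_ + _)
      .filter { case (_, cnt) => cnt > threshold }
      .keys.collect().toSet
    val denseB = sc.broadcast(denseRows)

    // Phase 2, dense rows: blockjoin style -- broadcast the few L_j(x), j in J_A,
    // so a dense row's products are computed where its entries already live.
    val denseLag = sc.broadcast(lag.filter { case (j, _) => denseRows(j) }.collect().toMap)
    val dense = aEntries.filter { case ((j, _), _) => denseB.value(j) }
      .map { case ((j, i), a) => (i, mul(a, denseLag.value(j))) }

    // Phase 2, sparse rows: ordinary shuffle join on the row index j, as in the strawman.
    val sparse = aEntries.filter { case ((j, _), _) => !denseB.value(j) }
      .map { case ((j, i), a) => (j, (i, a)) }
      .join(lag)
      .map { case (_, ((i, a), lj)) => (i, mul(a, lj)) }

    // Final aggregation: union both parts, then reduce by column i to get a_i(x).
    (dense union sparse).reduceByKey((u, v) => (u + v) % P)
  }
}
```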

Step 3: Distributed fixMSM

In the final part of the setup, the evaluated $a_i(x)$ values must be converted into group elements on an elliptic curve. This is done using Fixed-base Multi-Scalar Multiplication (fixMSM).

In fixMSM, given a vector of scalars $\bm{s} \in \mathbb{F}^n$ and a group element $P \in \mathbb{G}$, the goal is to compute the vector $\bm{s} \cdot P \in \mathbb{G}^n$, where each entry is the scalar multiple $s_i \cdot P$.

  1. Lookup Table Generation: Precompute scalar multiples of the fixed base point $P$:

$$( P, 2P, 4P, \dots, 2^kP )$$

  2. Broadcast: Distribute this lookup table to all executors

  3. Parallel Evaluation: Each executor uses the lookup table to compute scalar multiplications for its assigned chunk

This method aligns well with Spark's execution model, ensuring balanced workload across executors. By precomputing and broadcasting the table, DIZK avoids redundant computation and achieves both memory and runtime efficiency.
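The sketch below mirrors these three steps. It is illustrative rather than DIZK's implementation: the group is again stubbed as integers mod $P$ under addition (standing in for elliptic-curve points), and the table covers single bits rather than the wider windows a production fixMSM would use.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object FixedBaseMSM {
  import DistributedLagrange.mod

  // Stand-in group: integers mod P under addition. With real elliptic-curve
  // points, only `add` and `dbl` would change.
  type G = Long
  def add(a: G, b: G): G = mod(a + b)
  def dbl(a: G): G = mod(2 * a)

  def fixMSM(sc: SparkContext, scalars: RDD[(Int, BigInt)], base: G, k: Int): RDD[(Int, G)] = {
    // 1. Lookup table (P, 2P, 4P, ..., 2^k P), computed once on the driver.
    val table = Array.iterate(base, k + 1)(dbl)
    // 2. Broadcast it to all executors.
    val tableB = sc.broadcast(table)
    // 3. Each executor resolves s * P by adding the table entries at s's set bits.
    scalars.mapValues { s =>
      (0 to k).foldLeft(0L: G) { (acc, bit) =>
        if (s.testBit(bit)) add(acc, tableB.value(bit)) else acc
      }
    }
  }
}
```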

Distributed Prover

In Groth16, the prover takes as input the proving key $\mathsf{pk}$, public input $x \in \mathbb{F}^k$, and witness $w \in \mathbb{F}^{m-k}$, and generates the proof $\pi$. This process consists of two main stages:

  1. Generate the QAP witness vector $\bm{h}$ based on $x$ and $w$

  2. Construct the proof $\pi$ by combining $x, w, \bm{h}$ with $\mathsf{pk}$ through linear combinations

Step 1: QAP Witness $\bm{h}$ Generation

In Groth16’s QAP-based construction, the witness $\bm{h}$ corresponds to the coefficient vector of the polynomial $h(X)$, defined as:

$$h(X) = \frac{\left( \sum_{i=0}^m a_i(X) z_i \right) \cdot \left( \sum_{i=0}^m b_i(X) z_i \right) - \left( \sum_{i=0}^m c_i(X) z_i \right)}{t(X)}$$

  • $t(X)$ is the vanishing polynomial over the domain $D$; the auxiliary domain $D'$ is carefully chosen so that $t(X)$ does not vanish on it.

  • The numerator consists of three composite polynomial terms, which are computationally intensive to evaluate.

To compute this efficiently, DIZK performs the following steps:

  1. Evaluate the numerator polynomials over the domain $D$

  2. Convert the evaluations to coefficients via inverse FFT

  3. Re-evaluate over a disjoint domain $D'$ via forward FFT

  4. Perform component-wise division by $t(X)$ over $D'$, followed by interpolation to recover $h(X)$

These FFT operations are implemented using Distributed FFT.
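The text leaves the choice of $D'$ open; a standard choice (assumed here) is a multiplicative coset $D' = g \cdot D$ of the FFT domain $D = \{1, \omega, \dots, \omega^{n-1}\}$, with shift $g \notin D$. Since $t(X) = X^n - 1$ on this domain, the division in step 4 reduces to scaling by a single constant:

$$t(g\omega^i) = g^n \omega^{in} - 1 = g^n - 1 \neq 0 \quad \text{for all } i,$$

so every evaluation over $D'$ is divided by the same non-zero value:

$$h(g\omega^i) = \frac{a_z(g\omega^i) \, b_z(g\omega^i) - c_z(g\omega^i)}{g^n - 1},$$

where $a_z(X), b_z(X), c_z(X)$ denote the three numerator sums.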


The most computationally demanding part is the evaluation of the first polynomial term:

$$a_z(X) = \sum_{i=0}^m a_i(X) \cdot z_i$$

  • Input: $A \in \mathbb{F}^{n \times (m + 1)}$, $\bm{z} \in \mathbb{F}^{m+1}$

  • Output: $\left( \sum_{i=0}^m A_{j,i} \cdot z_i \right)_{j=0}^{n-1}$


Naive (Strawman) Approach:

  1. Join $A_{j,i}$ with $z_i$ on key $i$

  2. Map each pair $(A_{j,i}, z_i)$ to the product $A_{j,i} \cdot z_i$

  3. Reduce by $j$ to compute $\left( \sum_{i=0}^m A_{j,i} \cdot z_i \right)_{j=0}^{n-1}$

This results in the same type of computational skew encountered during QAP instance reduction in the setup phase. However, the key difference here is that the bottleneck occurs across columns rather than rows.

Fortunately, the matrix $A$'s sparsity structure is fixed for a given circuit, even though the witness vector $\bm{z}$ changes with each input. Thus, DIZK can reuse the sparsity analysis from the setup phase to optimize this join.


Solution: Hybrid Join Strategy

Phase 1: Identify Dense Columns

  • Mark any column with $\ge \sqrt{n}$ non-zero entries as a dense column

  • Let $I_A$ denote the set of all such column indices

Phase 2: Hybrid Join Execution

  • Dense columns ($i \in I_A$):

    • Join $A_{j,i}$ with $z_i$ in blockjoin style, broadcasting the few values $z_i$ for $i \in I_A$

    • Map to $A_{j,i} \cdot z_i$

  • Sparse columns ($i \notin I_A$):

    • Same join and map as the strawman

  • Merge and Reduce:

    • Combine both outputs: $(A_{j,i} \cdot z_i) = (A_{j,i} \cdot z_i)_{i \in I_A} \cup (A_{j,i} \cdot z_i)_{i \notin I_A}$

    • Reduce by $j$ to obtain $\left( \sum_{i=0}^m A_{j,i} \cdot z_i \right)_{j=0}^{n-1}$ (sketched below)
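A matching sketch for the prover side, mirroring the setup-phase sketch with rows and columns swapped (same stand-in field and hypothetical `aEntries` layout; note that $I_A$ comes from the setup analysis and only $\bm{z}$ changes per proof):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object ProverHybridJoin {
  import DistributedLagrange.{P, mul}

  def evalAz(sc: SparkContext,
             aEntries: RDD[((Int, Int), Long)], // ((row j, col i), A_{j,i}); cached across proofs
             z: RDD[(Int, Long)],               // (i, z_i), fresh for every proof
             denseCols: Set[Int]                // I_A, reused from the setup-phase analysis
            ): RDD[(Int, Long)] = {
    val denseB = sc.broadcast(denseCols)
    // Dense columns: broadcast their (few) z_i values instead of shuffling A.
    val denseZ = sc.broadcast(z.filter { case (i, _) => denseCols(i) }.collect().toMap)
    val dense = aEntries.filter { case ((_, i), _) => denseB.value(i) }
      .map { case ((j, i), a) => (j, mul(a, denseZ.value(i))) }
    // Sparse columns: ordinary join on the column index i, as in the strawman.
    val sparse = aEntries.filter { case ((_, i), _) => !denseB.value(i) }
      .map { case ((j, i), a) => (i, (j, a)) }
      .join(z)
      .map { case (_, ((j, a), zi)) => (j, mul(a, zi)) }
    // Reduce by row index j to obtain the evaluations of a_z over the domain.
    (dense union sparse).reduceByKey((u, v) => (u + v) % P)
  }
}
```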

Step 2: Distributed varMSM

In the final step of proof generation, the prover performs an inner product between a scalar vector and a vector of elliptic curve base points. DIZK distributes this variable-base multi-scalar multiplication (varMSM) using Pippenger’s algorithm, which is well-suited for this scenario.

In varMSM, given a vector of scalars $\bm{s} \in \mathbb{F}^n$ and a vector of group elements $\bm{P} \in \mathbb{G}^n$, the goal is to compute the sum $\sum_{i=1}^n s_i \cdot P_i \in \mathbb{G}$.

  1. Chunk Distribution: The set of scalar-base point pairs $(s_i, P_i)$ is partitioned evenly across a set of executors $S_j$

  2. Local Computation: Each executor runs Pippenger's algorithm locally on its chunk to compute a partial sum:

$$Q_j = \sum_{i \in S_j} s_i \cdot P_i$$

  3. Global Reduction: The driver aggregates the partial results from each executor:

$$Q = \sum_j Q_j$$

This design allows DIZK to scale to large witness sizes and high-throughput proof generation with minimal overhead and excellent parallelism.
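A minimal sketch of this pipeline under the same stand-in group as the fixMSM sketch; the window width `c` and the scalar bound `bits` are illustrative choices, not DIZK's parameters.

```scala
import org.apache.spark.rdd.RDD

object VarBaseMSM {
  import FixedBaseMSM.{G, add, dbl}

  // Pippenger's bucket method for one chunk, run locally on an executor.
  def pippenger(pairs: Seq[(BigInt, G)], c: Int = 8, bits: Int = 31): G = {
    val windows = (bits + c - 1) / c
    var result: G = 0L
    for (w <- windows - 1 to 0 by -1) {
      for (_ <- 0 until c) result = dbl(result) // result <- 2^c * result
      // Bucket each point by its scalar's w-th c-bit window.
      val buckets = Array.fill[G](1 << c)(0L)
      for ((s, p) <- pairs) {
        val idx = ((s >> (w * c)) & ((1 << c) - 1)).toInt
        if (idx != 0) buckets(idx) = add(buckets(idx), p)
      }
      // Running-sum trick: adds sum_b b * buckets(b) using ~2 * 2^c group additions.
      var acc: G = 0L
      var sum: G = 0L
      for (b <- (1 << c) - 1 to 1 by -1) {
        acc = add(acc, buckets(b))
        sum = add(sum, acc)
      }
      result = add(result, sum)
    }
    result
  }

  // Distributed varMSM: local Pippenger per partition, then a global reduce.
  def varMSM(pairs: RDD[(BigInt, G)]): G =
    pairs.mapPartitions(it => Iterator.single(pippenger(it.toSeq))).reduce(add)
}
```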

Conclusion

As the DIZK evaluation shows, increasing the number of executors allows the system to support larger instance sizes.

From the reported benchmarks, we can observe the following:

  • When the number of executors is fixed, the setup time and prover time increase linearly with the instance size.

  • When the instance size is fixed, the setup time and prover time decrease roughly in inverse proportion to the number of executors, i.e., near-linear speedup.

DIZK shows that zkSNARKs, which were traditionally constrained by memory and computational limits, can be effectively scaled using distributed computing frameworks like Apache Spark. By parallelizing key components such as FFT, Lagrange evaluation, and MSM, DIZK enables proof generation for circuits with billions of constraints, significantly expanding the practical applicability of zero-knowledge proofs. Its modular design also provides reusable distributed primitives that can benefit other proof systems like STARKs.

References
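Howard Wu, Wenting Zheng, Alessandro Chiesa, Raluca Ada Popa, and Ion Stoica. "DIZK: A Distributed Zero Knowledge Proof System." USENIX Security Symposium, 2018. https://eprint.iacr.org/2018/691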

Written by Ryan Kim from A41
