BitSage's GPU proof pipeline generates Stwo Circle STARK proofs entirely on GPU, bypassing the CPU-bound PolyOps trait that causes FFT cross-incompatibility between SIMD and GPU backends at sizes above 2^16. All proof operations — IFFT, FFT, Merkle commitment, composition evaluation, and FRI — run GPU-resident on a single device, with only Fiat-Shamir channel interactions on CPU.
Proofs are submitted on-chain to Starknet Sepolia (and mainnet) via INVOKE_V3 multicalls.
| Workload | GPU (ms) | CPU (ms) | Speedup | On-chain Status |
|---|---|---|---|---|
| 1K steps | 19 | 21 | 1.1x | SUCCEEDED |
| 64K steps | 47 | 157 | 3.3x | SUCCEEDED |
| 256K steps | 165 | 351 | 2.1x | SUCCEEDED |
| 1M steps | 666 | 1,058 | 1.6x | SUCCEEDED |
GPU proofs also produce smaller calldata (173 felts vs 269-333 for CPU), reducing on-chain gas costs by 25-35%.
A common objection: "GPUs cost $2/hour, CPUs are cheaper." Here's why GPU still wins at scale.
At $2/hr for an H100 (spot pricing on Lambda, Shadeform, or Brev):
| Workload | GPU prove time | CPU prove time | GPU cost/proof | CPU cost/proof* | GPU gas | CPU gas |
|---|---|---|---|---|---|---|
| 64K | 47ms | 157ms | $0.000026 | $0.000044 | 0.27 STRK | 0.36 STRK |
| 256K | 165ms | 351ms | $0.000092 | $0.000098 | 0.27 STRK | 0.38 STRK |
| 1M | 666ms | 1,058ms | $0.000370 | $0.000294 | 0.27 STRK | 0.41 STRK |
*CPU cost assumes $0.50/hr for a 28-core instance.
At 64K steps (typical ML inference proof), GPU compute cost is 40% cheaper than CPU. The real savings come from gas: GPU proofs save 0.09 STRK per transaction. At 1000 proofs/day, that's 90 STRK/day in gas savings alone — far exceeding the $48/day GPU rental.
The GPU doesn't just prove faster per-proof. It frees the CPU for trace generation, packing, and submission while the GPU handles the cryptographic heavy-lifting. A single H100 can sustain:
- ~76,000 proofs/hour at 64K steps (47ms each)
- ~21,800 proofs/hour at 256K steps (165ms each)
- ~5,400 proofs/hour at 1M steps (666ms each)
A 28-core CPU at the same workloads:
- ~22,900 proofs/hour at 64K (157ms each)
- ~10,200 proofs/hour at 256K (351ms each)
- ~3,400 proofs/hour at 1M (1,058ms each)
For a network that needs to verify thousands of ML inference results per hour, GPU is the only viable path.
GPU becomes cost-effective when you submit more than ~15 proofs/hour at 64K steps, purely from gas savings. Most production workloads far exceed this.
Starknet's sequencer batches transactions, but individual proof verification cost still matters for applications that submit proofs directly (verifiable ML, privacy proofs, cross-chain bridges). GPU-generated proofs are 25-35% smaller in calldata, directly reducing L1 data gas consumption — the most expensive resource on Starknet.
At 47ms for a 64K-step proof, GPU proving makes real-time verifiable ML inference practical. A user submits an inference request, the worker runs the neural network, generates a STARK proof of correct execution, and submits it on-chain — all under 1 second total latency. This is impossible with CPU-only proving at 157ms+ per proof when you factor in trace generation, packing, and network latency.
GPU workers can earn SAGE tokens by proving jobs faster than CPU workers. The economic incentive creates a natural market where:
- Workers with GPUs earn more per hour (higher throughput)
- Job submitters pay less per proof (lower gas from smaller calldata)
- The network as a whole settles more proofs per block
The pipeline architecture supports TEE (Trusted Execution Environment) attestation via NVIDIA Confidential Computing on H100/H200. This means proofs can attest not just to correct computation, but to computation performed on encrypted data within a hardware-isolated enclave — a building block for private DeFi, confidential ML, and HIPAA-compliant on-chain analytics.
- NVIDIA GPU with CUDA 12.0+ (RTX 3090 minimum, H100 recommended)
- Rust 1.75+ with
cargo - A funded Starknet Sepolia account (for on-chain submission)
- PostgreSQL 15+ (for coordinator mode)
git clone https://github.com/Bitsage-Network/bitsage-network.git
cd bitsage-network/rust-node
cp .env.example .envEdit .env with your Starknet account:
STARKNET_NETWORK=sepolia
STARKNET_RPC_URL=https://rpc.starknet-testnet.lava.build
DEPLOYER_ADDRESS=0x<your-account-address>
DEPLOYER_PRIVATE_KEY=0x<your-private-key>
ENABLE_GPU=truenvidia-smi # Should show your GPU
nvcc --version # Should show CUDA 12.xcargo build --release --features cuda --bin benchmark_proof_pipelineThis compiles the stwo CUDA kernels (FFT, FRI folding, Merkle Blake2s, quotient accumulation) from PTX source. First build takes ~2 minutes; subsequent builds are cached.
RUST_LOG=info ./target/release/benchmark_proof_pipelineThis will:
- Pre-warm GPU (compile PTX kernels, ~2.3s one-time cost)
- Run ML inference (4-8-3 neural network, 132 instructions)
- Generate proofs at 1K, 64K, 256K, and 1M steps — GPU and CPU
- Submit all 10 proofs to Starknet Sepolia
- Print comparison tables with speedups and gas costs
- Save results to
benchmark_results.json
cargo build --release --features cuda --bin sage-worker
./target/release/sage-worker setup --network sepolia
./target/release/sage-worker startThe worker polls the coordinator for jobs, generates GPU proofs, and submits them on-chain. Earnings are paid in SAGE tokens proportional to proof throughput.
# Build GPU worker image
docker build -t bitsage/worker:gpu -f Dockerfile.worker.gpu .
# Run with GPU access
docker run --gpus all \
--env-file .env \
bitsage/worker:gpuFor the full stack (coordinator + postgres + redis + nginx):
docker-compose -f deploy/aws/docker-compose.coordinator.yml up -dprove_with_stwo_gpu()
|
+-- prove_with_gpu_pipeline() # Full GPU pipeline (log_size >= 12)
| |-- build_trace_column_data() # CPU: build 26 trace columns
| |-- GpuProofPipeline::new() # GPU: init executor, upload twiddles
| |-- upload_polynomial() x26 # GPU: H2D transfer
| |-- ifft_with_denormalize() x26 # GPU: IFFT all columns
| |-- fft() x26 # GPU: FFT with blowup
| |-- merkle_tree_full() # GPU: Blake2s Merkle commit
| |-- [composition evaluation] # CPU: 21 AIR constraints (see Known Limitations)
| |-- [FRI commitment] # GPU: Merkle on composition
| +-- [proof assembly] # CPU: pack into StarkProof
|
+-- prove_with_stwo_gpu_backend() # Fallback: PolyOps path (SIMD FFT + GPU Merkle)
|
+-- prove_with_stwo_simd_backend() # CPU-only fallback (AVX2/NEON SIMD)
| GPU (H100 @ $2/hr) | CPU (28-core @ $0.50/hr) | |
|---|---|---|
| Compute cost | $2.00/hr | $0.50/hr |
| Prove time per proof | 47ms | 157ms |
| Gas per proof | 0.27 STRK | 0.36 STRK |
| Gas for 100 proofs | 27.0 STRK | 36.0 STRK |
| Gas savings | 9.0 STRK/hr | — |
| Gas savings @ $0.40/STRK | $3.60/hr | — |
| Net savings (gas - GPU premium) | $2.10/hr | — |
GPU pays for itself at just ~42 proofs/hour from gas savings alone. At 100/hr you're profiting $2.10/hr. At 1,000/hr you're saving $34.50/hr.
| GPU (H100) | CPU (28-core) | |
|---|---|---|
| Gas for 100 proofs | 27.0 STRK | 38.5 STRK |
| Gas savings @ $0.40 | $4.60/hr | — |
| Net savings | $3.10/hr | — |
| Max throughput | 21,800/hr | 10,200/hr |
| GPU (H100) | CPU (28-core) | |
|---|---|---|
| Gas for 100 proofs | 27.0 STRK | 40.9 STRK |
| Gas savings @ $0.40 | $5.56/hr | — |
| Net savings | $4.06/hr | — |
| Max throughput | 5,400/hr | 3,400/hr |
The GPU pipeline is memory-bandwidth bound for FFT/IFFT/Merkle operations and compute-bound for composition evaluation. Here's how next-gen hardware changes the picture:
| H100 PCIe | H100 SXM | H200 SXM | B200 | |
|---|---|---|---|---|
| HBM bandwidth | 2.0 TB/s | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s |
| Memory | 80 GB | 80 GB | 141 GB | 192 GB |
| Cloud cost (spot) | ~$2/hr | ~$3/hr | ~$3.50/hr | ~$5-8/hr |
| SM architecture | 9.0 | 9.0 | 9.0 | 10.0 |
FFT/IFFT/Merkle scale linearly with bandwidth. Composition is compute-bound (M31 arithmetic), so it scales with SM count and clock speed. B200 has 2x the SMs of H100.
| Workload | H100 PCIe | H200 SXM (projected) | B200 (projected) |
|---|---|---|---|
| 64K | 47ms | ~28ms | ~18ms |
| 256K | 165ms | ~95ms | ~55ms |
| 1M | 666ms | ~380ms | ~200ms |
H200 — Same SM 9.0 architecture as H100, so no compute speedup. But 2.4x bandwidth means FFT/IFFT/Merkle operations (currently 5ms at 1M) drop to ~2ms. The real win is composition: the 141GB HBM fits much larger traces without spilling. For typical workloads, expect 1.5-1.8x overall speedup. At $3.50/hr, cost per proof is roughly the same as H100 — you're paying more per hour but proving faster.
B200 — Blackwell architecture (SM 10.0) with 2x the SMs and 4x the bandwidth. Composition evaluation (the bottleneck) should run ~2x faster from double the shader count. FFT/Merkle run ~4x faster from bandwidth. Overall 2.5-3.5x over H100. At $5-8/hr, cost per proof is lower than H100 for large workloads (256K+) because the speedup outpaces the price increase. Break-even is around 50 proofs/hour at 64K.
| Workload | H100 ($2/hr) | H200 ($3.50/hr) | B200 ($6/hr) |
|---|---|---|---|
| 64K | $0.000026 | $0.000027 | $0.000030 |
| 256K | $0.000092 | $0.000092 | $0.000092 |
| 1M | $0.000370 | $0.000369 | $0.000333 |
At 64K, all three are nearly identical in compute cost — the gas savings dominate. At 1M, B200 is 10% cheaper per proof despite costing 3x more per hour.
The benchmark data shows where time is spent at 1M steps:
| Stage | Time | % of Total | Runs On |
|---|---|---|---|
| Upload (H2D) | 82ms | 12% | PCIe bus |
| IFFT + FFT | 5ms | 1% | GPU |
| Merkle commit | 1ms | 0.2% | GPU |
| Composition | 364ms | 55% | CPU |
| FRI | 4ms | 0.6% | GPU |
| Proof assembly | ~210ms | 31% | CPU |
The ConstraintKernel in libs/stwo/.../gpu/constraints.rs already has eval_transitions() that evaluates constraints on GPU-resident data. The 21 Obelysk AIR constraints (C1-C21) are degree-2 polynomial operations on M31 — exactly what this kernel handles.
Currently the pipeline downloads all 26 evaluated columns to CPU (~80ms at 1M), then runs a sequential loop over every point in the extended domain. Moving this to GPU would:
- Eliminate the 80ms download
- Replace the 364ms sequential CPU loop with a parallel GPU kernel (~5-15ms)
- Cut total prove time from 666ms to ~300ms at 1M
This is the single highest-impact optimization available.
The pipeline currently uploads 26 columns in 26 separate upload_polynomial() calls. The GpuProofPipeline has upload_polynomials_bulk() which batches all columns into a single H2D transfer, reducing PCIe round-trip overhead.
At 1M: upload drops from 82ms to ~55ms.
The pipeline supports fft_graph and ifft_graph for CUDA graph-captured execution. Graphs eliminate per-kernel CPU launch overhead by recording a sequence of GPU operations and replaying them in a single driver call. For 26 columns × 2 passes (IFFT + FFT), this reduces CPU-side dispatch from ~52 kernel launches to 2 graph replays.
At 1M: IFFT+FFT drops from 5ms to ~3ms. Small absolute gain, but meaningful at 64K where 1ms matters.
Currently each proof creates a new GpuProofPipeline, allocating twiddle buffers and polynomial storage. For batch proving (100+ proofs), reusing a pipeline instance across proofs of the same size amortizes the allocation cost.
| Workload | Current | With Path 1+2+3 | Speedup |
|---|---|---|---|
| 64K | 47ms | ~20ms | 2.4x |
| 256K | 165ms | ~65ms | 2.5x |
| 1M | 666ms | ~280ms | 2.4x |
This would bring GPU vs CPU speedup to 5-8x at 64K and 3.8x at 1M — with no hardware changes.
-
Composition evaluation runs on CPU — The 21 AIR constraint equations are evaluated on CPU after downloading columns from GPU. This accounts for ~57% of pipeline time at 1M steps. Wiring the
ConstraintKernelGPU kernels would yield an additional 2-3x speedup. -
FRI folding not on GPU — The FRI fold kernel expects QM31 (4 u32 per element, AoS layout) but the composition polynomial is M31 (1 u32). Converting to QM31 AoS would enable full GPU FRI.
-
Small traces (< 4096 steps) fall back to the PolyOps path since GPU kernel launch overhead exceeds the computation time.
-
Single GPU only — Multi-GPU support exists in the executor pool but the pipeline doesn't yet partition work across devices.
| File | Purpose |
|---|---|
src/obelysk/stwo_adapter.rs |
Proof generation: GPU pipeline, CPU fallback, prewarm |
src/obelysk/proof_packer.rs |
Serialize proofs to Starknet calldata |
src/obelysk/multicall_builder.rs |
Build and submit INVOKE_V3 multicalls |
src/bin/benchmark_proof_pipeline.rs |
End-to-end benchmark binary |
libs/stwo/.../backend/gpu/pipeline.rs |
GpuProofPipeline: upload, IFFT, FFT, Merkle, FRI |
libs/stwo/.../backend/gpu/cuda_executor.rs |
CUDA device management, kernel compilation |
libs/stwo/.../backend/gpu/constraints.rs |
GPU constraint evaluation kernels |
config/coordinator.toml |
Coordinator runtime configuration |
.env.example |
Environment variable template |