
Commit 8c62b60

docs: update benchmarks to 235 tok/s (+25% vs Ollama) and add new features
- All benchmark numbers updated: 235 tok/s (was 245), Ollama 188 (was 204)
- Speedup updated from 20% to 25% across all pages
- Homepage: new feature cards (EAGLE, Q4_K GEMV, Multi-LoRA/MoE)
- Homepage: 24 architectures (13 families), updated performance journey
- Blog posts: update notes added with corrected numbers
- Reference: new features (EAGLE training, TransMLA, BitNet, NSA, etc.)
- Migration guide: 10 new features listed
1 parent d14867e commit 8c62b60

10 files changed (+61, -33 lines)

content/_index.html

Lines changed: 25 additions & 10 deletions
@@ -8,7 +8,7 @@
 <meta charset="utf-8">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <title>Zerfoo — Machine Learning Framework for Go</title>
-<meta name="description" content="Train, run, and serve ML models in your Go application. 245 tok/s on Gemma 3 1B — 20% faster than Ollama. Pure Go, zero CGo.">
+<meta name="description" content="Train, run, and serve ML models in your Go application. 235 tok/s on Gemma 3 1B — 25% faster than Ollama. Pure Go, zero CGo.">
 <meta name="theme-color" content="#8B5CF6">
 <link rel="icon" href="zerfoo.svg" type="image/svg+xml">
 <script>(function(){var t=localStorage.getItem('theme');if(t)document.documentElement.classList.add(t)})()</script>
@@ -266,8 +266,8 @@
 <h1>Machine learning for Go.<br><span class="grad">Pure Go. Zero CGo.</span></h1>
 <p class="sub">Train, run, and serve ML models in your Go application. One import, GPU-accelerated at runtime, no C compiler needed.</p>
 <div class="stats">
-<div class="stat"><div class="num">245 tok/s</div><div class="label">Gemma 3 1B Q4_K_M</div></div>
-<div class="stat"><div class="num">+20%</div><div class="label">faster than Ollama</div></div>
+<div class="stat"><div class="num">235 tok/s</div><div class="label">Gemma 3 1B Q4_K_M</div></div>
+<div class="stat"><div class="num">+25%</div><div class="label">faster than Ollama</div></div>
 <div class="stat"><div class="num">99.5%</div><div class="label">CUDA graph coverage</div></div>
 <div class="stat"><div class="num">0</div><div class="label">CGo calls</div></div>
 </div>
@@ -403,6 +403,21 @@ <h3>Production Ready</h3>
 <h3>ARM NEON SIMD</h3>
 <p>Hand-written ARM NEON and x86 AVX2 assembly for CPU-bound operations — GEMM, RMSNorm, RoPE, SiLU, softmax. Competitive CPU performance without a GPU.</p>
 </div>
+<div class="feat">
+<div class="icon">&#9889;</div>
+<h3>EAGLE Speculative Decoding</h3>
+<p>Built-in EAGLE draft head with integrated training. Speculative decoding accelerates generation by drafting multiple tokens per step and verifying in parallel.</p>
+</div>
+<div class="feat">
+<div class="icon">&#128640;</div>
+<h3>Q4_K Fused GEMV</h3>
+<p>Fused dequantize-and-multiply kernel for Q4_K quantized weights — 14x faster than the unfused path. Keeps decode throughput high on quantized models.</p>
+</div>
+<div class="feat">
+<div class="icon">&#128270;</div>
+<h3>Advanced Serving</h3>
+<p>Multi-LoRA per-request serving, quantized KV cache (Q4/Q3), and hybrid CPU/GPU MoE routing. Production-grade features for multi-tenant deployments.</p>
+</div>
 </div>
 </div>
 </section>
@@ -424,8 +439,8 @@ <h2>Faster than Ollama</h2>
 <td class="highlight">Zerfoo</td>
 <td>
 <div class="bench-bar">
-<div class="bar" style="width:min(245px,60vw)"></div>
-<div class="val">245 tok/s</div>
+<div class="bar" style="width:min(235px,60vw)"></div>
+<div class="val">235 tok/s</div>
 </div>
 </td>
 <td>Pure Go, zero CGo, CUDA graph capture, fused kernels</td>
@@ -434,8 +449,8 @@ <h2>Faster than Ollama</h2>
 <td>Ollama</td>
 <td>
 <div class="bench-bar">
-<div class="bar" style="width:min(204px,50vw);background:var(--bg3)"></div>
-<div class="val">204 tok/s</div>
+<div class="bar" style="width:min(188px,50vw);background:var(--bg3)"></div>
+<div class="val">188 tok/s</div>
 </div>
 </td>
 <td>llama.cpp C++ backend, same model, same hardware</td>
@@ -451,7 +466,7 @@ <h3 style="font-size:1rem;font-weight:600;margin-bottom:16px">Performance journe
 <tr><th>Date</th><th>Milestone</th><th>Tok/s</th><th>Improvement</th></tr>
 </thead>
 <tbody>
-<tr><td>Mar 17</td><td class="highlight">dp4a INT8 + arena reuse</td><td class="highlight">245</td><td>+20% vs Ollama</td></tr>
+<tr><td>Mar 27</td><td class="highlight">Multi-model benchmark (3-run median)</td><td class="highlight">235</td><td>+25% vs Ollama</td></tr>
 <tr><td>Mar 17</td><td>Q4_0 re-quant restored</td><td>245</td><td>+32% vs regression</td></tr>
 <tr><td>Mar 14</td><td>CUDA graph capture</td><td>234</td><td>+26% vs non-graph</td></tr>
 <tr><td>Mar 13</td><td>GPU-first pipeline</td><td>103</td><td>D2H elimination</td></tr>
@@ -472,7 +487,7 @@ <h3 style="font-size:1rem;font-weight:600;margin-bottom:16px">Performance journe
 <div class="wrap">
 <div class="section-head">
 <h2>Supported models</h2>
-<p>Production-ready transformer architectures. Load any GGUF model from HuggingFace.</p>
+<p>24 architectures across 13 model families. Load any GGUF model from HuggingFace.</p>
 </div>
 <div class="model-grid">
 <div class="model-card"><div class="name">Gemma 3</div><div class="status prod">Production</div></div>
@@ -575,7 +590,7 @@ <h2>From the blog</h2>
 <a href="/docs/blog/how-we-beat-ollama-cuda-graph-capture/" class="blog-card">
 <div class="tag">Performance</div>
 <h3>How We Beat Ollama: CUDA Graph Capture in Pure Go</h3>
-<p>CUDA graph capture and fused kernels took Zerfoo from 186 tok/s to 245 tok/s. A deep dive into making the decode path GPU-only.</p>
+<p>CUDA graph capture and fused kernels took Zerfoo from 186 tok/s to 235 tok/s. A deep dive into making the decode path GPU-only.</p>
 </a>
 <a href="/docs/blog/zero-cgo-pure-go-ml-inference/" class="blog-card">
 <div class="tag">Architecture</div>

content/docs/blog/01-introducing-zerfoo.md

Lines changed: 3 additions & 1 deletion
@@ -98,7 +98,9 @@ This means every tool, library, and application built for the OpenAI API works w
 
 ## Performance
 
-On an NVIDIA DGX Spark with Gemma 3 1B Q4_K_M, Zerfoo achieves **245 tokens/second** decode throughput — 20% faster than Ollama (204 tok/s) on the same hardware. This comes from three key optimizations:
+> **Update 2026-03-27:** Benchmarks updated to reflect multi-model 3-run median methodology. Gemma 3 1B: 235 tok/s (was 245), Ollama: 188 tok/s (was 204). The speedup is now 25%.
+
+On an NVIDIA DGX Spark with Gemma 3 1B Q4_K_M, Zerfoo achieves **235 tokens/second** decode throughput — 25% faster than Ollama (188 tok/s) on the same hardware. This comes from three key optimizations:
 
 - **CUDA graph capture** with 99.5% instruction coverage eliminates per-kernel launch overhead
 - **Fused kernels** (FusedAddRMSNorm, FusedSiluGate, FusedQKNormRoPE) reduce memory round-trips

content/docs/blog/02-benchmark-comparison.md

Lines changed: 11 additions & 10 deletions
@@ -6,23 +6,24 @@ bookToc: true
 
 # Zerfoo vs Ollama vs llama.cpp: A Performance Comparison
 
-When we set out to build an ML inference framework in Go, the first question everyone asked was: "Can Go actually compete with C++ on inference throughput?" The answer is yes. On Gemma 3 1B Q4_K_M, Zerfoo decodes at **245 tokens/second** — 20% faster than Ollama and 10-15% faster than llama.cpp on the same NVIDIA DGX Spark hardware.
+> **Update 2026-03-27:** Benchmarks updated to multi-model 3-run median methodology. Gemma 3 1B: 235 tok/s (Ollama 188 tok/s) = 25% faster. Additional models: DeepSeek R1 1.5B (186 vs 167, +11%), Llama 3.2 3B (92 vs 93, parity), Mistral 7B (44 vs 44, parity).
+
+When we set out to build an ML inference framework in Go, the first question everyone asked was: "Can Go actually compete with C++ on inference throughput?" The answer is yes. On Gemma 3 1B Q4_K_M, Zerfoo decodes at **235 tokens/second** — 25% faster than Ollama on the same NVIDIA DGX Spark hardware.
 
 This post breaks down how we measured these numbers, what architectural decisions make them possible, and how you can reproduce the results on your own hardware.
 
 ## The Numbers
 
 All measurements use the same GGUF model file, the same prompt ("The meaning of life is"), and measure steady-state decode throughput after warm-up on an NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified LPDDR5x, CUDA 13.0).
 
-| Framework | Tok/s (decode) | CUDA Graphs | Notes |
-|-----------|----------------|-------------|-------|
-| **Zerfoo** | **245.15** | Yes | Q4_K_M loaded, re-quantized to Q4_0 at load time |
-| **Zerfoo** | **248.47** | Yes | 512 tokens — throughput stable at longer sequences |
-| **Zerfoo** | 174.44 | No | Without CUDA graph capture |
-| **Ollama** | 203.60 | N/A | Default settings, `ollama run gemma3:1b` |
-| **llama.cpp** | ~210-230 | No | Estimated from community reports on GB10-class hardware |
+| Model | Zerfoo (tok/s) | Ollama (tok/s) | Speedup |
+|-------|----------------|----------------|---------|
+| **Gemma 3 1B Q4_K_M** | **235** | 188 | **+25%** |
+| DeepSeek R1 1.5B | 186 | 167 | +11% |
+| Llama 3.2 3B | 92 | 93 | parity |
+| Mistral 7B | 44 | 44 | parity |
 
-The gap between Zerfoo with and without CUDA graphs (245 vs 174 tok/s) tells the story: CUDA graph capture alone accounts for a 40% throughput increase. The remaining advantage over Ollama comes from fused kernels and zero CGo overhead.
+All numbers are 3-run medians from the 2026-03-27 multi-model benchmark. The gap narrows at larger model sizes where memory bandwidth dominates over kernel launch overhead. On smaller models where kernel fusion matters most, Zerfoo's CUDA graph capture and fused kernels provide a clear advantage.
 
 ## Why Zerfoo Is Faster
 
@@ -122,7 +123,7 @@ We've measured on the DGX Spark so far. We expect similar relative performance o
 
 | GPU | Zerfoo (est.) | Status |
 |-----|---------------|--------|
-| DGX Spark GB10 | 245 tok/s | Measured |
+| DGX Spark GB10 | 235 tok/s | Measured (3-run median, 2026-03-27) |
 | RTX 4090 | TBD | Community contributions welcome |
 | RTX 3090 | TBD | Community contributions welcome |
 | A100 80GB | TBD | Community contributions welcome |

content/docs/blog/03-architecture-deep-dive.md

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@ bookToc: true
 
 # Inside Zerfoo: An Architecture Deep Dive
 
-Zerfoo runs LLM inference in Go at 245 tokens/second — 20% faster than Ollama. This post walks through the internal architecture that makes that possible, from loading a GGUF file to streaming tokens over an OpenAI-compatible API.
+Zerfoo runs LLM inference in Go at 235 tokens/second — 25% faster than Ollama. This post walks through the internal architecture that makes that possible, from loading a GGUF file to streaming tokens over an OpenAI-compatible API.
 
 ## The Pipeline
 
@@ -122,7 +122,7 @@ CUDA graph capture is the single biggest performance optimization in Zerfoo. It
 
 Without CUDA graphs, each decode step dispatches hundreds of individual kernel launches — each one costing 5-10 microseconds of CPU-GPU synchronization. With CUDA graphs, the entire decode step is a single graph launch.
 
-The numbers tell the story: 245 tok/s with CUDA graphs vs 174 tok/s without — a 40% throughput increase from this optimization alone.
+The numbers tell the story: 235 tok/s with CUDA graphs vs 174 tok/s without — a 35% throughput increase from this optimization alone.
 
 Zerfoo achieves 99.5% instruction coverage in CUDA graph capture. The remaining 0.5% consists of operations that must run on the host: token sampling and tokenizer lookup.
 
content/docs/blog/04-why-go-for-ml.md

Lines changed: 1 addition & 1 deletion
@@ -163,6 +163,6 @@ If you're running Go in production and using LLMs, give Zerfoo a try:
 go get github.com/zerfoo/zerfoo@latest
 ```
 
-Seven lines of code to run inference. One binary to deploy. 245 tokens per second on a DGX Spark.
+Seven lines of code to run inference. One binary to deploy. 235 tokens per second on a DGX Spark.
 
 The question isn't whether Go can do ML. The question is why your production inference is still running in a different language than the rest of your stack.

content/docs/blog/05-migrating-from-ollama.md

Lines changed: 3 additions & 3 deletions
@@ -6,15 +6,15 @@ bookToc: true
 
 # Migrating from Ollama to Zerfoo
 
-Ollama is a popular tool for running LLMs locally. If you're using Ollama today and want to switch to Zerfoo — whether for the 20% throughput improvement, Go-native embedding, or OpenAI API compatibility — this guide walks you through the migration step by step.
+Ollama is a popular tool for running LLMs locally. If you're using Ollama today and want to switch to Zerfoo — whether for the 25% throughput improvement, Go-native embedding, or OpenAI API compatibility — this guide walks you through the migration step by step.
 
 ## Why Migrate?
 
 Before diving into the how, here's what Zerfoo offers over Ollama:
 
 | Feature | Ollama | Zerfoo |
 |---------|--------|--------|
-| Decode throughput (Gemma 3 1B Q4_K_M) | 204 tok/s | **245 tok/s** (+20%) |
+| Decode throughput (Gemma 3 1B Q4_K_M) | 188 tok/s | **235 tok/s** (+25%) |
 | Language | Go + CGo (wraps llama.cpp) | Pure Go (zero CGo) |
 | Embeddable as library | No (separate process) | **Yes** (`go get` and import) |
 | OpenAI-compatible API | Yes | Yes |
@@ -306,7 +306,7 @@ Migrating from Ollama to Zerfoo is straightforward because both use GGUF models
 3. Change the base URL in your API clients
 4. Optionally, embed inference directly in your Go application
 
-The reward: 20% faster decode throughput, zero-CGo builds, in-process inference, and a single-binary deployment model.
+The reward: 25% faster decode throughput, zero-CGo builds, in-process inference, and a single-binary deployment model.
 
 ```bash
 go install github.com/zerfoo/zerfoo/cmd/zerfoo@latest

content/docs/blog/how-we-beat-ollama-cuda-graph-capture.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ bookToc: true
 
 *Performance deep-dive: how CUDA graph capture and fused kernels took Zerfoo from 186 tok/s to 234.30 tok/s on Gemma 3 1B.*
 
-> **Update 2026-03-17:** Current throughput is **245 tok/s** (20% faster than Ollama 204 tok/s). The Phase 6 journey below documents reaching 234.30 tok/s — Phase 27 pushed further via Q4_0 re-quantization in the GGUF loader.
+> **Update 2026-03-27:** Current throughput is **235 tok/s** (25% faster than Ollama 188 tok/s, 3-run median from multi-model benchmark). The Phase 6 journey below documents reaching 234.30 tok/s.
 
 ## The Benchmark
 
content/docs/blog/zero-cgo-pure-go-ml-inference.md

Lines changed: 3 additions & 3 deletions
@@ -208,10 +208,10 @@ Here are the numbers. On a DGX Spark (GB10 Grace Blackwell), running Gemma 3 1B
 
 | Runtime | Decode throughput | Notes |
 |---------|------------------|-------|
-| **Zerfoo** | **245 tok/s** | Pure Go, zero CGo, custom CUDA kernels via dlopen |
-| Ollama | 204 tok/s | Go wrapper around llama.cpp (C++) |
+| **Zerfoo** | **235 tok/s** | Pure Go, zero CGo, custom CUDA kernels via dlopen |
+| Ollama | 188 tok/s | Go wrapper around llama.cpp (C++) |
 
-Zerfoo is 20% faster than Ollama on the same hardware, despite Ollama being a thin wrapper around C++. The performance comes from the kernels, not the binding mechanism:
+Zerfoo is 25% faster than Ollama on the same hardware, despite Ollama being a thin wrapper around C++. The performance comes from the kernels, not the binding mechanism:
 
 - **25+ custom CUDA kernels** including fused RoPE, fused SwiGLU, fused Add+RMSNorm, fused QK-Norm+RoPE, flash attention (prefill and decode), quantized GEMM/GEMV (Q4_0, Q4_K_M, Q8_0)
 - **CUDA graph capture** replays the entire decode step as a single graph launch, eliminating per-kernel launch overhead. 99.5% of decode instructions are captured.

content/docs/reference/benchmarks.md

Lines changed: 2 additions & 1 deletion
@@ -183,8 +183,9 @@ QWENVL_GGUF_PATH=/path/to/qwenvl.gguf go test -run TestQwenVL_VisionPipeline -co
 
 | Date | Milestone | Tok/s | Notes |
 |------|-----------|-------|-------|
+| 2026-03-27 | Multi-model benchmark (3-run median) | 235 | +25% vs Ollama (188 tok/s) |
 | 2026-03-17 | dp4a + arena reuse | 245.15 | Parity at batch=1 (memory-bound); dp4a benefits at larger batches |
-| 2026-03-17 | Q4_0 re-quant restored | 244.99 | +32% vs regression, +20% vs Ollama |
+| 2026-03-17 | Q4_0 re-quant restored | 244.99 | +32% vs regression |
 | 2026-03-14 | CUDA graph capture | 234.30 | +26% vs non-graph baseline |
 | 2026-03-13 | GPU-first pipeline | 6.84 | +33.6% from D2H elimination |
 | 2026-03-13 | Graph compilation | 6.86 | +5% from worker pool |

content/docs/reference/migration-v1.md

Lines changed: 10 additions & 1 deletion
@@ -236,7 +236,7 @@ for usage of deprecated symbols.
 These are additive and do not require migration, but are worth knowing about:
 
 - **Architecture registry** -- `inference.RegisterArchitecture` / `inference.ListArchitectures` for pluggable model support.
-- **12 model architectures** -- Llama 3, Gemma 3, Mistral, Qwen 2, Phi 3/4, DeepSeek V3, Falcon, Command R, Mixtral, RWKV, Jamba, Mamba 3.
+- **24 architectures (13 model families)** -- Llama 3, Gemma 3, Mistral, Qwen 2, Phi 3/4, DeepSeek V3, Falcon, Command R, Mixtral, RWKV, Jamba, Mamba 3, and more.
 - **Speculative decoding** -- `inference.Model.SpeculativeGenerate` and `generate.WithSpeculativeDraft`.
 - **Paged KV cache** -- `generate.WithPagedKV` for memory-efficient serving.
 - **Prefix caching** -- `generate.WithPrefixCache` for shared system prompt reuse.
@@ -245,6 +245,15 @@ These are additive and do not require migration, but are worth knowing about:
 - **Tool calling** -- `serve.Tool` / `serve.ToolChoice` in the OpenAI-compatible API.
 - **Vision and audio** -- multimodal inference with LLaVA, SigLIP, and Whisper.
 - **Batch generation** -- `inference.Model.GenerateBatch` and `serve.BatchScheduler`.
+- **EAGLE speculative decoding** -- `generate.WithEAGLE` for speculative draft-and-verify with built-in head training.
+- **Q4_K fused GEMV** -- 14x faster dequantize-and-multiply kernel for Q4_K quantized weights.
+- **TransMLA** -- MHA-to-MLA conversion for DeepSeek-style multi-head latent attention.
+- **Multi-LoRA per-request serving** -- `serve.WithLoRA` routes each request to a different LoRA adapter.
+- **BitNet ternary inference** -- native 1.58-bit ternary weight support for BitNet models.
+- **Native Sparse Attention (NSA)** -- sparse attention patterns for long-context efficiency.
+- **Hybrid CPU/GPU MoE** -- expert routing across CPU and GPU for memory-constrained deployments.
+- **Quantized KV cache** -- `generate.WithKVQuant("q4")` and `generate.WithKVQuant("q3")` for reduced KV memory.
+- **Time-series inference** -- Granite TTM/FlowState models, 21x faster than Python granite-tsfm.
 - **Continuous batching** -- `serve.NewBatchScheduler` for high-throughput serving.
 - **LoRA/QLoRA fine-tuning** -- `training/lora/` and `cmd/finetune`.
 - **FSDP distributed training** -- `distributed/fsdp/` with NCCL AllGather/ReduceScatter.
