<title>Zerfoo — Machine Learning Framework for Go</title>
-<meta name="description" content="Train, run, and serve ML models in your Go application. 245 tok/s on Gemma 3 1B — 20% faster than Ollama. Pure Go, zero CGo.">
+<meta name="description" content="Train, run, and serve ML models in your Go application. 235 tok/s on Gemma 3 1B — 25% faster than Ollama. Pure Go, zero CGo.">
<p>Hand-written ARM NEON and x86 AVX2 assembly for CPU-bound operations — GEMM, RMSNorm, RoPE, SiLU, softmax. Competitive CPU performance without a GPU.</p>
</div>
+<div class="feat">
+<div class="icon">⚡</div>
+<h3>EAGLE Speculative Decoding</h3>
+<p>Built-in EAGLE draft head with integrated training. Speculative decoding accelerates generation by drafting multiple tokens per step and verifying in parallel.</p>
+</div>
+<div class="feat">
+<div class="icon">🚀</div>
+<h3>Q4_K Fused GEMV</h3>
+<p>Fused dequantize-and-multiply kernel for Q4_K quantized weights — 14x faster than the unfused path. Keeps decode throughput high on quantized models.</p>
+</div>
+<div class="feat">
+<div class="icon">🔎</div>
+<h3>Advanced Serving</h3>
+<p>Multi-LoRA per-request serving, quantized KV cache (Q4/Q3), and hybrid CPU/GPU MoE routing. Production-grade features for multi-tenant deployments.</p>
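The fused-GEMV feature above folds dequantization into the dot product so f32 weights are never materialized in memory. The sketch below illustrates the idea in plain Go with a deliberately simplified 4-bit block format — one scale per 32 weights; real Q4_K uses 256-wide super-blocks with per-sub-block scales, and the type and function names here are illustrative:

```go
package main

import "fmt"

// block4 is a simplified 4-bit quantized block: 32 weights stored as
// nibbles plus one f32 scale. (Not the actual Q4_K layout.)
type block4 struct {
	scale float32
	q     [16]byte // 32 4-bit values, two per byte
}

// fusedDotBlock dequantizes and accumulates in one pass: unlike a separate
// dequantize-then-GEMV path, the f32 weights never touch memory.
func fusedDotBlock(b block4, x []float32) float32 {
	var acc float32
	for i := 0; i < 16; i++ {
		lo := int(b.q[i]&0x0F) - 8 // centered low nibble
		hi := int(b.q[i]>>4) - 8   // centered high nibble
		acc += float32(lo)*x[2*i] + float32(hi)*x[2*i+1]
	}
	return acc * b.scale
}

func main() {
	b := block4{scale: 0.5}
	for i := range b.q {
		b.q[i] = 0x99 // both nibbles decode to 9-8 = 1
	}
	x := make([]float32, 32)
	for i := range x {
		x[i] = 2
	}
	fmt.Println(fusedDotBlock(b, x)) // 32 weights of 1, activations of 2, scale 0.5
}
```

A full GEMV is this per-block dot product accumulated across every block in a weight row, for every row of the matrix.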
content/docs/blog/01-introducing-zerfoo.md (+3 −1)
@@ -98,7 +98,9 @@ This means every tool, library, and application built for the OpenAI API works w
## Performance

-On an NVIDIA DGX Spark with Gemma 3 1B Q4_K_M, Zerfoo achieves **245 tokens/second** decode throughput — 20% faster than Ollama (204 tok/s) on the same hardware. This comes from three key optimizations:
+> **Update 2026-03-27:** Benchmarks updated to reflect multi-model 3-run median methodology. Gemma 3 1B: 235 tok/s (was 245), Ollama: 188 tok/s (was 204). The speedup is now 25%.
+
+On an NVIDIA DGX Spark with Gemma 3 1B Q4_K_M, Zerfoo achieves **235 tokens/second** decode throughput — 25% faster than Ollama (188 tok/s) on the same hardware. This comes from three key optimizations:
content/docs/blog/02-benchmark-comparison.md (+11 −10)
@@ -6,23 +6,24 @@ bookToc: true
# Zerfoo vs Ollama vs llama.cpp: A Performance Comparison

-When we set out to build an ML inference framework in Go, the first question everyone asked was: "Can Go actually compete with C++ on inference throughput?" The answer is yes. On Gemma 3 1B Q4_K_M, Zerfoo decodes at **245 tokens/second** — 20% faster than Ollama and 10-15% faster than llama.cpp on the same NVIDIA DGX Spark hardware.
+> **Update 2026-03-27:** Benchmarks updated to multi-model 3-run median methodology. Gemma 3 1B: 235 tok/s (Ollama 188 tok/s) = 25% faster. Additional models: DeepSeek R1 1.5B (186 vs 167, +11%), Llama 3.2 3B (92 vs 93, parity), Mistral 7B (44 vs 44, parity).
+
+When we set out to build an ML inference framework in Go, the first question everyone asked was: "Can Go actually compete with C++ on inference throughput?" The answer is yes. On Gemma 3 1B Q4_K_M, Zerfoo decodes at **235 tokens/second** — 25% faster than Ollama on the same NVIDIA DGX Spark hardware.

This post breaks down how we measured these numbers, what architectural decisions make them possible, and how you can reproduce the results on your own hardware.

## The Numbers

All measurements use the same GGUF model file, the same prompt ("The meaning of life is"), and measure steady-state decode throughput after warm-up on an NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified LPDDR5x, CUDA 13.0).
-The gap between Zerfoo with and without CUDA graphs (245 vs 174 tok/s) tells the story: CUDA graph capture alone accounts for a 40% throughput increase. The remaining advantage over Ollama comes from fused kernels and zero CGo overhead.
+All numbers are 3-run medians from the 2026-03-27 multi-model benchmark. The gap narrows at larger model sizes where memory bandwidth dominates over kernel launch overhead. On smaller models where kernel fusion matters most, Zerfoo's CUDA graph capture and fused kernels provide a clear advantage.
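The 3-run median methodology from the update note is simple to reproduce when re-running the benchmark yourself; a minimal sketch (function name and throughput values here are illustrative, not benchmark output):

```go
package main

import (
	"fmt"
	"sort"
)

// medianTokS returns the median decode throughput across benchmark runs.
// The median is more robust to a single interfered run than the mean or max.
func medianTokS(runs []float64) float64 {
	s := append([]float64(nil), runs...) // copy so the caller's slice is untouched
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

func main() {
	// Three warm-start runs of the same prompt (illustrative values).
	fmt.Println(medianTokS([]float64{233.1, 235.0, 236.4})) // prints 235
}
```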
## Why Zerfoo Is Faster
@@ -122,7 +123,7 @@ We've measured on the DGX Spark so far. We expect similar relative performance o
content/docs/blog/03-architecture-deep-dive.md (+2 −2)
@@ -6,7 +6,7 @@ bookToc: true
# Inside Zerfoo: An Architecture Deep Dive

-Zerfoo runs LLM inference in Go at 245 tokens/second — 20% faster than Ollama. This post walks through the internal architecture that makes that possible, from loading a GGUF file to streaming tokens over an OpenAI-compatible API.
+Zerfoo runs LLM inference in Go at 235 tokens/second — 25% faster than Ollama. This post walks through the internal architecture that makes that possible, from loading a GGUF file to streaming tokens over an OpenAI-compatible API.

## The Pipeline
@@ -122,7 +122,7 @@ CUDA graph capture is the single biggest performance optimization in Zerfoo. It
Without CUDA graphs, each decode step dispatches hundreds of individual kernel launches — each one costing 5-10 microseconds of CPU-GPU synchronization. With CUDA graphs, the entire decode step is a single graph launch.

-The numbers tell the story: 245 tok/s with CUDA graphs vs 174 tok/s without — a 40% throughput increase from this optimization alone.
+The numbers tell the story: 235 tok/s with CUDA graphs vs 174 tok/s without — a 35% throughput increase from this optimization alone.

Zerfoo achieves 99.5% instruction coverage in CUDA graph capture. The remaining 0.5% consists of operations that must run on the host: token sampling and tokenizer lookup.
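The launch-overhead argument above can be made concrete with a back-of-envelope model. Every constant below is a hypothetical stand-in: the per-launch cost comes from the 5-10 µs figure cited above, but the compute time and launch count are assumptions for illustration, not measured Zerfoo numbers:

```go
package main

import "fmt"

// stepTimeUs models one decode step: fixed GPU compute time plus per-launch
// CPU-GPU synchronization overhead for however many launches the step issues.
func stepTimeUs(computeUs, launchOverheadUs float64, launches int) float64 {
	return computeUs + launchOverheadUs*float64(launches)
}

func main() {
	const (
		computeUs  = 3000.0 // hypothetical GPU work per decode step
		overheadUs = 7.0    // mid-range of the cited 5-10 µs launch cost
		kernels    = 300    // hypothetical launches per step without graphs
	)
	without := stepTimeUs(computeUs, overheadUs, kernels) // hundreds of launches
	with := stepTimeUs(computeUs, overheadUs, 1)          // one graph launch
	fmt.Printf("modeled speedup from graph capture: %.2fx\n", without/with)
}
```

The model makes the scaling visible: the shorter the per-step compute (i.e., the smaller the model), the larger the share of wall time spent on launch overhead, which matches the observation that the gap over Ollama narrows on larger models.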
content/docs/blog/04-why-go-for-ml.md (+1 −1)
@@ -163,6 +163,6 @@ If you're running Go in production and using LLMs, give Zerfoo a try:
go get github.com/zerfoo/zerfoo@latest
```

-Seven lines of code to run inference. One binary to deploy. 245 tokens per second on a DGX Spark.
+Seven lines of code to run inference. One binary to deploy. 235 tokens per second on a DGX Spark.

The question isn't whether Go can do ML. The question is why your production inference is still running in a different language than the rest of your stack.
content/docs/blog/05-migrating-from-ollama.md (+3 −3)
@@ -6,15 +6,15 @@ bookToc: true
# Migrating from Ollama to Zerfoo

-Ollama is a popular tool for running LLMs locally. If you're using Ollama today and want to switch to Zerfoo — whether for the 20% throughput improvement, Go-native embedding, or OpenAI API compatibility — this guide walks you through the migration step by step.
+Ollama is a popular tool for running LLMs locally. If you're using Ollama today and want to switch to Zerfoo — whether for the 25% throughput improvement, Go-native embedding, or OpenAI API compatibility — this guide walks you through the migration step by step.

## Why Migrate?

Before diving into the how, here's what Zerfoo offers over Ollama:
content/docs/blog/how-we-beat-ollama-cuda-graph-capture.md (+1 −1)
@@ -8,7 +8,7 @@ bookToc: true
*Performance deep-dive: how CUDA graph capture and fused kernels took Zerfoo from 186 tok/s to 234.30 tok/s on Gemma 3 1B.*

-> **Update 2026-03-17:** Current throughput is **245 tok/s** (20% faster than Ollama 204 tok/s). The Phase 6 journey below documents reaching 234.30 tok/s — Phase 27 pushed further via Q4_0 re-quantization in the GGUF loader.
+> **Update 2026-03-27:** Current throughput is **235 tok/s** (25% faster than Ollama 188 tok/s, 3-run median from multi-model benchmark). The Phase 6 journey below documents reaching 234.30 tok/s.
content/docs/blog/zero-cgo-pure-go-ml-inference.md (+3 −3)
@@ -208,10 +208,10 @@ Here are the numbers. On a DGX Spark (GB10 Grace Blackwell), running Gemma 3 1B
| Runtime | Decode throughput | Notes |
|---------|------------------|-------|
-| **Zerfoo** | **245 tok/s** | Pure Go, zero CGo, custom CUDA kernels via dlopen |
-| Ollama | 204 tok/s | Go wrapper around llama.cpp (C++) |
+| **Zerfoo** | **235 tok/s** | Pure Go, zero CGo, custom CUDA kernels via dlopen |
+| Ollama | 188 tok/s | Go wrapper around llama.cpp (C++) |

-Zerfoo is 20% faster than Ollama on the same hardware, despite Ollama being a thin wrapper around C++. The performance comes from the kernels, not the binding mechanism:
+Zerfoo is 25% faster than Ollama on the same hardware, despite Ollama being a thin wrapper around C++. The performance comes from the kernels, not the binding mechanism:

- **25+ custom CUDA kernels** including fused RoPE, fused SwiGLU, fused Add+RMSNorm, fused QK-Norm+RoPE, flash attention (prefill and decode), quantized GEMM/GEMV (Q4_0, Q4_K_M, Q8_0)
- **CUDA graph capture** replays the entire decode step as a single graph launch, eliminating per-kernel launch overhead. 99.5% of decode instructions are captured.