Agent-native inference engine with O(1) fork latency for tree-structured reasoning.
Dendrite is a specialized LLM inference engine designed for agentic workloads that require exploring multiple reasoning paths simultaneously. Unlike traditional inference engines that optimize for single-sequence throughput, Dendrite provides constant-time forking of inference state, enabling efficient tree-of-thought, MCTS, and beam search algorithms.
TL;DR: If your agent explores multiple reasoning paths, Dendrite is 1000-10000x faster at branching.
| Scenario | vLLM | SGLang | Dendrite |
|---|---|---|---|
| Fork 4K context | 50-100 ms | 5-10 ms | 3 μs |
| 6-branch exploration | 300-600 ms | 30-60 ms | 18 μs |
| MCTS (500 forks) | 25-50 s | 2.5-5 s | 1.5 ms |
| Memory (6 branches, 4K prefix) | 6 GB | ~2 GB | 1.1 GB |
Use Dendrite for: Tree-of-Thought, MCTS, beam search, speculative decoding, multi-agent exploration.
Use vLLM for: single-sequence generation and simple chat (higher raw throughput).
See BENCHMARKS.md and PRACTICAL_IMPACT.md for details.
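To see where the branching speedup comes from, compare what each fork has to copy. A back-of-envelope sketch in plain Rust; the model shape here (32 layers, 8 KV heads, head dimension 128, F16 weights) is an illustrative assumption, not a measurement:

```rust
// What does a fork actually copy?
// Assumed shape (illustrative): 32 layers, 8 KV heads, head_dim 128, F16.
fn main() {
    let (layers, kv_heads, head_dim, bytes_f16) = (32u64, 8u64, 128u64, 2u64);
    let context = 4096u64;

    // Traditional fork: duplicate K and V for every token in the context.
    let kv_bytes = context * layers * kv_heads * head_dim * bytes_f16 * 2;

    // Copy-on-write fork: duplicate one 8-byte pointer per 16-token block.
    let table_bytes = (context / 16) * 8;

    println!("full KV copy: {} MiB", kv_bytes >> 20);     // 512 MiB
    println!("block table:  {} bytes", table_bytes);      // 2048 bytes
    println!("ratio:        {}x", kv_bytes / table_bytes); // 262144x
}
```

The four-to-five-orders-of-magnitude gap in bytes moved is what turns a 50-100 ms fork into a microsecond-scale one.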
- O(1) Fork Latency: Create reasoning branches without copying the KV cache using copy-on-write semantics
- Tree-Structured KV Cache: Share memory across branches with reference counting
- PagedAttention: Memory-efficient KV cache management with 16-token blocks
- MCTS & Beam Search: Built-in tree search algorithms with UCT scoring
- FlashInfer Integration: High-performance attention kernels with cascade support
- Grammar Constraints: Structured output via llguidance integration
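The UCT scoring used by the built-in MCTS balances exploiting high-value branches against exploring rarely visited ones. A minimal sketch of the standard UCT formula; the function name and signature are illustrative, not Dendrite's API:

```rust
// Standard UCT (Upper Confidence bound applied to Trees):
//   UCT = Q/N + c * sqrt(ln(N_parent) / N)
// Unvisited children score infinity so every branch gets tried once.
fn uct_score(total_value: f64, visits: u32, parent_visits: u32, c: f64) -> f64 {
    if visits == 0 {
        return f64::INFINITY;
    }
    let exploit = total_value / visits as f64;
    let explore = c * ((parent_visits as f64).ln() / visits as f64).sqrt();
    exploit + explore
}

fn main() {
    // A rarely visited child can outrank a well-explored, higher-mean one.
    let fresh = uct_score(1.0, 1, 20, 1.4);  // mean 1.0, big exploration bonus
    let stale = uct_score(5.0, 10, 20, 1.4); // mean 0.5, small bonus
    assert!(fresh > stale);
    println!("fresh={fresh:.2} stale={stale:.2}");
}
```

With `c = 0.0` the score reduces to the plain mean value; larger `c` shifts the search toward under-explored branches.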
```
Traditional (vLLM/SGLang):            Dendrite (Copy-on-Write):
Fork = copy entire KV cache           Fork = copy block-table pointers
O(context_length)                     O(num_blocks) ≈ O(1)

┌─────────────────────┐               ┌──────────────────────┐
│ Parent: 4K tokens   │               │ Block Table (Parent) │
│ [================]  │               │ [0][1][2][3]         │
└─────────────────────┘               └──────────────────────┘
         │ fork                           │   │   │   │
         ▼                                ▼   ▼   ▼   ▼
┌─────────────────────┐               ┌──────────────────────┐
│ Child: copy 4K      │               │ Physical Blocks      │
│ [================]  │ ← 50-100 ms   │ [B0][B1][B2][B3]     │
└─────────────────────┘               │ ref=2 (shared!)      │
                                      └──────────────────────┘
                                                 ▲
                                      ┌──────────────────────┐
                                      │ [0][1][2][3]         │
                                      │ Block Table (Child)  │ ← ~500 ns
                                      └──────────────────────┘
```
Memory is duplicated only when branches diverge and only at block granularity (16 tokens).
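The block-granularity copy-on-write described above can be sketched in a few lines of plain Rust, using `Rc`'s strong count as the per-block reference count. This is a toy model of the idea, not Dendrite's actual `PagedKvCache`:

```rust
use std::rc::Rc;

const BLOCK_TOKENS: usize = 16; // same granularity as Dendrite's 16-token blocks

// Toy block table: each block holds up to 16 tokens of "KV state".
#[derive(Clone)]
struct BlockTable {
    blocks: Vec<Rc<Vec<u32>>>, // Rc strong count acts as the per-block refcount
}

impl BlockTable {
    fn fork(&self) -> BlockTable {
        self.clone() // O(num_blocks): copies pointers, never block contents
    }

    fn append(&mut self, token: u32) {
        if self.blocks.last().map_or(true, |b| b.len() == BLOCK_TOKENS) {
            self.blocks.push(Rc::new(Vec::with_capacity(BLOCK_TOKENS)));
        }
        // Copy-on-write: make_mut clones the block only if it is shared.
        Rc::make_mut(self.blocks.last_mut().unwrap()).push(token);
    }
}

fn main() {
    let mut parent = BlockTable { blocks: Vec::new() };
    for t in 0..20 {
        parent.append(t); // 20 tokens -> one full block + one 4-token block
    }

    let mut child = parent.fork(); // copies two pointers, nothing else
    child.append(99); // diverge: only the shared 4-token tail block is duplicated

    assert_eq!(Rc::strong_count(&parent.blocks[0]), 2); // full block still shared
    assert!(!Rc::ptr_eq(&parent.blocks[1], &child.blocks[1])); // tail copied
    assert_eq!(parent.blocks[1].len(), 4);
    assert_eq!(child.blocks[1].len(), 5);
}
```

Note that after divergence the first block is still shared: only the block the child actually wrote to was duplicated, which is the whole point of block-granularity CoW.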
| Component | Status | Notes |
|---|---|---|
| KV Cache (CoW) | ✅ Complete | O(1) fork with reference counting |
| Paged KV Cache | ✅ Complete | Memory-efficient page pool |
| Tree State | ✅ Complete | Radix tree for prefix sharing |
| Scheduler | ✅ Complete | Fair batch scheduling |
| Attention Backend | ✅ Complete | Reference + Flash Attention |
| Transformer | ✅ Complete | Full LLaMA architecture |
| Tokenizer | ✅ Complete | HuggingFace tokenizers |
| MCTS Search | ✅ Complete | UCT scoring |
| Beam Search | ✅ Complete | Top-k beams |
| Grammar Constraints | ✅ Complete | llguidance integration |
| GPU Inference | ✅ Complete | candle-flash-attn |
| FP8 Quantization | 🚧 In Progress | E4M3/E5M2/MXFP8 |
| FlashInfer FFI | 🚧 In Progress | Paged attention kernels |
272 tests passing | 40.8 tok/s on NVIDIA GB10 (DGX Spark) with TinyLlama-1.1B
```
┌──────────────────────────────────────────────────────────────┐
│                           dendrite                           │
│                       (High-level API)                       │
├──────────────────────────────────────────────────────────────┤
│                        dendrite-core                         │
│  ┌─────────┐ ┌──────────┐ ┌───────────┐ ┌─────────────────┐  │
│  │  Cache  │ │   Tree   │ │ Scheduler │ │      Model      │  │
│  │  (CoW)  │ │  State   │ │  (Batch)  │ │  (Transformer)  │  │
│  └─────────┘ └──────────┘ └───────────┘ └─────────────────┘  │
│  ┌─────────────────────────┐ ┌─────────────────────────────┐ │
│  │         Search          │ │           Grammar           │ │
│  │    (MCTS, Beam, UCT)    │ │  (llguidance constraints)   │ │
│  └─────────────────────────┘ └─────────────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│                         dendrite-ffi                         │
│                 (FlashInfer + CUDA bindings)                 │
└──────────────────────────────────────────────────────────────┘
```
```rust
use dendrite_core::model::{ModelConfig, Tokenizer, Transformer};
use dendrite_core::attention::ReferenceBackend;
use candle_core::Device;
use std::sync::Arc;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load model
    let model_path = std::path::Path::new("/path/to/tinyllama");
    let config = ModelConfig::from_file(&model_path.join("config.json"))?;
    let tokenizer = Tokenizer::from_dir(model_path)?;
    let backend = Arc::new(ReferenceBackend::new());
    let mut transformer = Transformer::new(config, backend, Device::Cpu)?;
    transformer.load_weights(model_path)?;

    // Generate text
    let text = transformer
        .generate_text(&tokenizer, "Hello, my name is", 50, 0.0)
        .await?;
    println!("{}", text);
    Ok(())
}
```

```rust
use dendrite_core::model::{ModelConfig, Transformer, KvCache};
use dendrite_core::attention::FlashAttnBackend; // requires "cuda" feature
use candle_core::{Device, Tensor};
use std::sync::Arc;

// Load model on GPU
let device = Device::new_cuda(0)?;
let backend = Arc::new(FlashAttnBackend::new(0)?);
let mut transformer = Transformer::new(config, backend, device.clone())?;
transformer.load_weights(model_path)?;

// Create KV cache for efficient autoregressive generation
let mut cache = transformer.create_cache();

// Prefill prompt
let prompt_tokens = vec![1u32, 15043, 29892]; // "<s>Hello,"
let input = Tensor::from_slice(&prompt_tokens, (1, 3), &device)?;
let logits = transformer.forward_with_cache(&input, &mut cache).await?;

// Decode with cached KV (O(1) per token)
let next_token = transformer.sample(&logits, 0.0)?;
println!("Generated token: {}", next_token);
```

```rust
use dendrite_core::cache::{PagedKvCache, DEFAULT_PAGE_SIZE};
use candle_core::DType;

// Create paged cache with O(1) fork support
let cache = PagedKvCache::new(
    32,                // num_layers
    1000,              // initial pages
    DEFAULT_PAGE_SIZE, // 16 tokens/page
    8,                 // num_kv_heads
    128,               // head_dim
    DType::F16,
    device,
)?;

// Allocate parent sequence
let parent = cache.allocate_sequence();

// O(1) fork - shares pages via copy-on-write
let child1 = cache.fork_sequence(parent)?;
let child2 = cache.fork_sequence(parent)?;

// Each child can diverge independently;
// pages are copied only when modified.
```

| Crate | Description |
|---|---|
| dendrite | High-level API and engine |
| dendrite-core | Core engine: KV cache, scheduler, tree management |
| dendrite-ffi | FFI bindings for FlashInfer and CUDA |
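For intuition about the paged-cache parameters used in the `PagedKvCache` example above (16-token pages, 8 KV heads, head dimension 128, F16, 32 layers), here is what one page costs and what a single divergent write triggers. The arithmetic follows from those example parameters, not from measurements:

```rust
// Memory footprint of one 16-token page under the example's parameters.
fn main() {
    let (page_tokens, kv_heads, head_dim, bytes_f16, layers) =
        (16u64, 8u64, 128u64, 2u64, 32u64);

    // K + V for one page in one layer.
    let per_layer = page_tokens * kv_heads * head_dim * bytes_f16 * 2;
    // The same page across all layers.
    let per_page = per_layer * layers;

    println!("page per layer: {} KiB", per_layer >> 10); // 64 KiB
    println!("page all layers: {} MiB", per_page >> 20); // 2 MiB
}
```

So when a forked branch diverges, the cache duplicates roughly 2 MiB (one page) rather than the hundreds of MiB a full-context copy would cost, and fully shared prefix pages are never duplicated at all.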
- Rust 1.83+ (edition 2024)
- CUDA 12.x (optional, for GPU acceleration)
- FlashInfer 0.2.x (optional, for optimized kernels)
```bash
# CPU-only build
cargo build --release

# With CUDA support
cargo build --release --features cuda

# Full build with all features
cargo build --release --features full
```

```bash
# GPU inference with TinyLlama (requires model weights)
cargo run -p dendrite-core --features cuda --example gpu_inference -- /path/to/tinyllama

# Text generation with tokenizer integration
cargo run -p dendrite-core --example text_inference -- /path/to/tinyllama

# Golden token validation (verify vs HuggingFace)
python scripts/generate_golden.py --model /path/to/tinyllama --output golden_cases.json
cargo run -p dendrite-core --example golden_validation -- /path/to/tinyllama golden_cases.json

# MCTS search with simulated environment
cargo run -p dendrite-core --example mcts_search

# Beam search with mock language model
cargo run -p dendrite-core --example beam_search
```

```bash
# Fork latency benchmark
cargo bench --bench fork

# Tree search benchmark
cargo bench --bench tree_search

# Memory usage benchmark
cargo bench --bench memory
```

Dendrite is optimized for:
- NVIDIA GB10/DGX Spark: 128GB unified memory, NVLink-C2C
- Grace Hopper: Unified memory architecture
- Any CUDA-capable GPU: supported, though with less memory headroom
| Metric | Target | Actual |
|---|---|---|
| Fork latency | < 50 μs | ~500 ns (100x better) |
| Grammar mask | < 50 μs | ~1.6 μs (30x better) |
| Decode latency | < 10 ms | 10 ms on GB10 |
| Memory overhead | < 5% per fork | ~0.1% (CoW) |
See docs/architecture.md for detailed design.
MIT License - see LICENSE for details.
Contributions welcome! Please read CONTRIBUTING.md for guidelines.
- FlashInfer for attention kernels
- vLLM for PagedAttention inspiration
- llguidance for grammar constraints
- Candle for Rust ML framework