Skip to content

Agent-native inference engine with O(1) fork latency for tree-structured reasoning

License

Notifications You must be signed in to change notification settings

BioInfo/dendrite

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

29 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Dendrite

Agent-native inference engine with O(1) fork latency for tree-structured reasoning.

Dendrite is a specialized LLM inference engine designed for agentic workloads that require exploring multiple reasoning paths simultaneously. Unlike traditional inference engines that optimize for single-sequence throughput, Dendrite provides constant-time forking of inference state, enabling efficient tree-of-thought, MCTS, and beam search algorithms.

Why Dendrite? (vs vLLM/SGLang)

TL;DR: If your agent explores multiple reasoning paths, Dendrite is 1000-10000x faster at branching.

Scenario vLLM SGLang Dendrite
Fork 4K context 50-100ms 5-10ms 3ΞΌs
6-branch exploration 300-600ms 30-60ms 18ΞΌs
MCTS (500 forks) 25-50s 2.5-5s 1.5ms
Memory (6 branches, 4K prefix) 6GB ~2GB 1.1GB

Use Dendrite for: Tree-of-Thought, MCTS, Beam Search, Speculative Decoding, Multi-Agent Use vLLM for: Single-sequence generation, simple chat (higher throughput)

See BENCHMARKS.md and PRACTICAL_IMPACT.md for details.

Key Features

  • O(1) Fork Latency: Create reasoning branches without copying the KV cache using copy-on-write semantics
  • Tree-Structured KV Cache: Share memory across branches with reference counting
  • PagedAttention: Memory-efficient KV cache management with 16-token blocks
  • MCTS & Beam Search: Built-in tree search algorithms with UCT scoring
  • FlashInfer Integration: High-performance attention kernels with cascade support
  • Grammar Constraints: Structured output via llguidance integration

How It Works

Traditional (vLLM/SGLang):               Dendrite (Copy-on-Write):

Fork = Copy entire KV cache              Fork = Copy block table pointers
       O(context_length)                        O(num_blocks) β‰ˆ O(1)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Parent: 4K tokens   β”‚                  β”‚ Block Table (Parent)β”‚
β”‚ [================]  β”‚                  β”‚ [0][1][2][3]        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β””β”€β”€β”‚β”€β”€β”‚β”€β”€β”‚β”€β”€β”‚β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚ fork                             β”‚  β”‚  β”‚  β”‚
         β–Ό                                  β–Ό  β–Ό  β–Ό  β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Child: Copy 4K      β”‚                  β”‚ Physical Blocks      β”‚
β”‚ [================]  β”‚  ← 50-100ms      β”‚ [B0][B1][B2][B3]     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚ ref=2 (shared!)      β”‚
                                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                    β–²
                                         β”Œβ”€β”€β”‚β”€β”€β”‚β”€β”€β”‚β”€β”€β”‚β”€β”€β”€β”€β”€β”€β”€β”€β”
                                         β”‚ [0][1][2][3]        β”‚
                                         β”‚ Block Table (Child) β”‚ ← 500ns
                                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Memory is duplicated only when branches diverge and only at block granularity (16 tokens).

Project Status

Component Status Notes
KV Cache (CoW) βœ… Complete O(1) fork with reference counting
Paged KV Cache βœ… Complete Memory-efficient page pool
Tree State βœ… Complete Radix tree for prefix sharing
Scheduler βœ… Complete Fair batch scheduling
Attention Backend βœ… Complete Reference + Flash Attention
Transformer βœ… Complete Full LLaMA architecture
Tokenizer βœ… Complete HuggingFace tokenizers
MCTS Search βœ… Complete UCT scoring
Beam Search βœ… Complete Top-k beams
Grammar Constraints βœ… Complete llguidance integration
GPU Inference βœ… Complete candle-flash-attn
FP8 Quantization πŸ”„ In Progress E4M3/E5M2/MXFP8
FlashInfer FFI πŸ”„ In Progress Paged attention kernels

272 tests passing | 40.8 tok/s on NVIDIA GB10 (DGX Spark) with TinyLlama-1.1B

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         dendrite                                β”‚
β”‚                    (High-level API)                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                      dendrite-core                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚  Cache  β”‚  β”‚   Tree   β”‚  β”‚ Scheduler β”‚  β”‚     Model       β”‚ β”‚
β”‚  β”‚  (CoW)  β”‚  β”‚  State   β”‚  β”‚ (Batch)   β”‚  β”‚  (Transformer)  β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚        Search           β”‚  β”‚        Grammar              β”‚  β”‚
β”‚  β”‚  (MCTS, Beam, UCT)      β”‚  β”‚  (llguidance constraints)   β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                      dendrite-ffi                               β”‚
β”‚              (FlashInfer + CUDA bindings)                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Quick Start

Basic Text Generation

use dendrite_core::model::{ModelConfig, Tokenizer, Transformer};
use dendrite_core::attention::ReferenceBackend;
use candle_core::Device;
use std::sync::Arc;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load model
    let model_path = std::path::Path::new("/path/to/tinyllama");
    let config = ModelConfig::from_file(&model_path.join("config.json"))?;
    let tokenizer = Tokenizer::from_dir(model_path)?;

    let backend = Arc::new(ReferenceBackend::new());
    let mut transformer = Transformer::new(config, backend, Device::Cpu)?;
    transformer.load_weights(model_path)?;

    // Generate text
    let text = transformer
        .generate_text(&tokenizer, "Hello, my name is", 50, 0.0)
        .await?;

    println!("{}", text);
    Ok(())
}

GPU Inference with KV Cache

use dendrite_core::model::{ModelConfig, Transformer, KvCache};
use dendrite_core::attention::FlashAttnBackend;  // requires "cuda" feature
use candle_core::{Device, Tensor};

// Load model on GPU
let device = Device::new_cuda(0)?;
let backend = Arc::new(FlashAttnBackend::new(0)?);
let mut transformer = Transformer::new(config, backend, device.clone())?;
transformer.load_weights(model_path)?;

// Create KV cache for efficient autoregressive generation
let mut cache = transformer.create_cache();

// Prefill prompt
let prompt_tokens = vec![1u32, 15043, 29892];  // "<s>Hello,"
let input = Tensor::from_slice(&prompt_tokens, (1, 3), &device)?;
let logits = transformer.forward_with_cache(&input, &mut cache).await?;

// Decode with cached KV (O(1) per token)
let next_token = transformer.sample(&logits, 0.0)?;
println!("Generated token: {}", next_token);

O(1) Fork for Tree Search

use dendrite_core::cache::{PagedKvCache, DEFAULT_PAGE_SIZE};
use candle_core::DType;

// Create paged cache with O(1) fork support
let cache = PagedKvCache::new(
    32,                    // num_layers
    1000,                  // initial pages
    DEFAULT_PAGE_SIZE,     // 16 tokens/page
    8,                     // num_kv_heads
    128,                   // head_dim
    DType::F16,
    device,
)?;

// Allocate parent sequence
let parent = cache.allocate_sequence();

// O(1) fork - shares pages via copy-on-write
let child1 = cache.fork_sequence(parent)?;
let child2 = cache.fork_sequence(parent)?;

// Each child can diverge independently
// Pages are copied only when modified

Crate Structure

Crate Description
dendrite High-level API and engine
dendrite-core Core engine: KV cache, scheduler, tree management
dendrite-ffi FFI bindings for FlashInfer and CUDA

Requirements

  • Rust 1.83+ (edition 2024)
  • CUDA 12.x (optional, for GPU acceleration)
  • FlashInfer 0.2.x (optional, for optimized kernels)

Building

# CPU-only build
cargo build --release

# With CUDA support
cargo build --release --features cuda

# Full build with all features
cargo build --release --features full

Running Examples

# GPU inference with TinyLlama (requires model weights)
cargo run -p dendrite-core --features cuda --example gpu_inference -- /path/to/tinyllama

# Text generation with tokenizer integration
cargo run -p dendrite-core --example text_inference -- /path/to/tinyllama

# Golden token validation (verify vs HuggingFace)
python scripts/generate_golden.py --model /path/to/tinyllama --output golden_cases.json
cargo run -p dendrite-core --example golden_validation -- /path/to/tinyllama golden_cases.json

# MCTS search with simulated environment
cargo run -p dendrite-core --example mcts_search

# Beam search with mock language model
cargo run -p dendrite-core --example beam_search

Benchmarks

# Fork latency benchmark
cargo bench --bench fork

# Tree search benchmark
cargo bench --bench tree_search

# Memory usage benchmark
cargo bench --bench memory

Target Hardware

Dendrite is optimized for:

  • NVIDIA GB10/DGX Spark: 128GB unified memory, NVLink-C2C
  • Grace Hopper: Unified memory architecture
  • Works on any CUDA-capable GPU with reduced memory

Performance (Measured)

Metric Target Actual
Fork latency < 50 ΞΌs ~500ns (100x better)
Grammar mask < 50 ΞΌs ~1.6ΞΌs (30x better)
Decode latency < 10 ms 10ms on GB10
Memory overhead < 5% per fork ~0.1% (CoW)

See docs/architecture.md for detailed design.

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please read CONTRIBUTING.md for guidelines.

Acknowledgments

About

Agent-native inference engine with O(1) fork latency for tree-structured reasoning

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors