1 change: 1 addition & 0 deletions .gitignore
@@ -2,3 +2,4 @@
Container-Root/version.txt
Container-Root/ray/anyscale/efs_env.sh

__pycache__/
163 changes: 163 additions & 0 deletions Container-Root/ray/rayservice/disaggregated_prefill_decode/README.md
@@ -0,0 +1,163 @@
# Disaggregated Prefill/Decode Serving

This example demonstrates **disaggregated prefill/decode** serving for LLMs using [Ray Serve LLM](https://docs.ray.io/en/latest/serve/llm/user-guides/prefill-decode.html) with vLLM's [NIXLConnector](https://docs.vllm.ai/en/stable/features/nixl_connector_usage.html).

## What is Disaggregated Serving?

Traditional LLM inference colocates two phases on the same GPU:

1. **Prefill** — processes the full input prompt in parallel (compute-bound, high FLOPS)
2. **Decode** — generates tokens one at a time autoregressively (memory-bandwidth-bound)

When colocated, these phases interfere with each other: a long prefill blocks decode and increases inter-token latency (ITL), while ongoing decode delays new prefill requests and increases time-to-first-token (TTFT).

**Disaggregated serving** separates them onto dedicated instances connected via high-speed KV cache transfer (NIXL), enabling:

| Benefit | Description |
|---|---|
| **Independent scaling** | Scale prefill and decode replicas separately based on demand |
| **Reduced interference** | Prefill doesn't block decode and vice versa |
| **Cost optimization** | Use different instance types for different workloads |
| **Better latency** | Optimize TTFT and ITL independently |
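
The TTFT/ITL tradeoff above can be made concrete with a small metric helper. This is an illustrative sketch, not part of this repo: the `latency_metrics` function and the token arrival times are hypothetical. TTFT is the arrival time of the first token; ITL is the mean gap between subsequent tokens.

```python
# Sketch: deriving TTFT and mean ITL from per-token arrival times
# (seconds since the request was sent). Hypothetical helper for
# illustration only.
def latency_metrics(timestamps):
    if not timestamps:
        raise ValueError("no tokens received")
    ttft = timestamps[0]                              # dominated by prefill
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0      # dominated by decode
    return ttft, itl

ttft, itl = latency_metrics([0.42, 0.47, 0.52, 0.57])
```

Disaggregation lets you tune the two numbers independently: add prefill replicas to lower TTFT, decode replicas to lower ITL.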

## Architecture

```
               ┌─────────────────┐
               │    Ray Serve    │
               │     Router      │
               └────────┬────────┘
         ┌──────────────┴──────────────┐
         │                             │
┌────────▼────────┐           ┌────────▼────────┐
│  Prefill Pool   │   NIXL    │  Decode Pool    │
│ (compute-bound) │ ────────► │ (memory-bound)  │
│  1-2 replicas   │ KV cache  │  1-4 replicas   │
└─────────────────┘ transfer  └─────────────────┘
```

## Files

| File | Description |
|---|---|
| `disaggregated_prefill_decode.py` | Python deployment script (can run standalone) |
| `rayservice.disaggregated_prefill_decode.yaml` | KubeRay RayService manifest for Kubernetes |
| `disaggregated_prefill_decode_req.py` | Test client with chat completion and streaming |

## Prerequisites

- **Ray** >= 2.44 with `ray[serve]`
- **vLLM** v1 (default engine in Ray Serve LLM)
- **NIXL**: `pip install nixl` (pre-installed in `rayproject/ray-llm` images)
- **GPU nodes** with sufficient VRAM (e.g., NVIDIA A10G for Llama-3.1-8B)
- **HuggingFace token** with access to `meta-llama/Llama-3.1-8B-Instruct`

## Quick Start

### Option 1: Kubernetes (RayService)

```bash
# From the rayservice directory
./rayservice-create.sh disaggregated_prefill_decode

# Check status
./rayservice-status.sh

# Wait for pods to be ready, then test
./rayservice-test.sh disaggregated_prefill_decode
```

### Option 2: Direct Python

```bash
# Install dependencies
pip install "ray[serve]" vllm nixl

# Set HuggingFace token
export HF_TOKEN=<your-token>

# Deploy
python disaggregated_prefill_decode.py

# In another terminal, test
python disaggregated_prefill_decode_req.py
```

### Option 3: Ray Serve CLI

```bash
# Extract the serveConfigV2 section of the RayService manifest into a
# standalone Serve config file first; `serve deploy` expects a Serve
# config, not a full RayService manifest (file name is illustrative)
serve deploy serve_config.yaml

# Test
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 64
}'
```
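
The same request can be sent from Python with only the standard library. A minimal sketch assuming the endpoint above; `chat` is a hypothetical helper, and the network call is shown but left commented out:

```python
import json
import urllib.request

# Same chat-completion payload as the curl example above.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

def chat(url="http://localhost:8000/v1/chat/completions"):
    # POST the payload and return the assistant's reply text.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat()  # uncomment once the service is up
```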

## Configuration

### Changing the Model

Update `model_id` in both the prefill and decode configs:

```yaml
prefill_config:
model_loading_config:
model_id: your-org/your-model
decode_config:
model_loading_config:
model_id: your-org/your-model
```
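
In the standalone `disaggregated_prefill_decode.py` script, the equivalent edit is a single constant, since both configs read `MODEL_ID`. The model id below is a placeholder:

```python
# Placeholder model id -- substitute a model you have access to.
MODEL_ID = "your-org/your-model"

# Both phases must serve the same model so the KV cache produced by
# prefill is valid for decode.
prefill_model_loading = {"model_id": MODEL_ID}
decode_model_loading = {"model_id": MODEL_ID}
```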

### Scaling

Adjust `min_replicas` / `max_replicas` independently for each phase. A typical pattern is more decode replicas than prefill, since decode is the longer-running phase:

```yaml
prefill_config:
deployment_config:
autoscaling_config:
min_replicas: 2
max_replicas: 4

decode_config:
deployment_config:
autoscaling_config:
min_replicas: 6
max_replicas: 10
```
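
In the Python script, the same knobs live in each `LLMConfig`'s `deployment_config`. A sketch with plain dicts; the replica counts mirror the YAML above and are illustrative:

```python
# Autoscaling bounds mirroring the YAML above; in the script these
# dicts are passed as deployment_config to each LLMConfig.
prefill_deployment = {
    "autoscaling_config": {"min_replicas": 2, "max_replicas": 4},
}
decode_deployment = {
    "autoscaling_config": {"min_replicas": 6, "max_replicas": 10},
}

# Decode scales wider: it holds each request for the entire generation,
# while a prefill replica is free again after one forward pass.
decode_to_prefill = (
    decode_deployment["autoscaling_config"]["min_replicas"]
    / prefill_deployment["autoscaling_config"]["min_replicas"]
)
```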

### GPU Instance Types (AWS)

| Instance | GPU | VRAM | Best For |
|---|---|---|---|
| g5.xlarge | 1x A10G | 24 GB | Small models (7-8B) |
| g5.2xlarge | 1x A10G | 24 GB | Small models, more CPU/RAM |
| p4d.24xlarge | 8x A100 | 320 GB | Large models (70B+) |
| p5.48xlarge | 8x H100 | 640 GB | Largest models, highest throughput |

### Alternative KV Transfer Backends

This example uses NIXLConnector. For advanced caching, you can switch to LMCacheConnectorV1:

```yaml
engine_kwargs:
kv_transfer_config:
kv_connector: LMCacheConnectorV1
kv_role: kv_producer # or kv_consumer
```
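
In the Python script, the same switch means editing the `engine_kwargs` dict of each `LLMConfig`. A sketch of the two variants; the LMCache role shown assumes the prefill side:

```python
# KV transfer config as used by this example (NIXL, both roles).
nixl_engine_kwargs = {
    "kv_transfer_config": {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
    },
}

# Alternative backend: LMCache, with explicit producer/consumer roles
# (prefill produces KV cache, decode consumes it).
lmcache_engine_kwargs = {
    "kv_transfer_config": {
        "kv_connector": "LMCacheConnectorV1",
        "kv_role": "kv_producer",  # decode side would use "kv_consumer"
    },
}
```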

See the [Ray Serve docs](https://docs.ray.io/en/latest/serve/llm/user-guides/prefill-decode.html) for LMCache and Mooncake backend configurations.

## References

- [Ray Serve Prefill/Decode Disaggregation Guide](https://docs.ray.io/en/latest/serve/llm/user-guides/prefill-decode.html)
- [vLLM NIXLConnector Usage](https://docs.vllm.ai/en/stable/features/nixl_connector_usage.html)
- [DistServe Paper — Disaggregated Inference](https://arxiv.org/abs/2401.09670)
- [Anyscale Blog — Wide-EP and Disaggregated Serving](https://www.anyscale.com/blog/ray-serve-llm-anyscale-apis-wide-ep-disaggregated-serving-vllm)
106 changes: 106 additions & 0 deletions Container-Root/ray/rayservice/disaggregated_prefill_decode/disaggregated_prefill_decode.py
@@ -0,0 +1,106 @@
"""
Disaggregated Prefill/Decode Serving with Ray Serve LLM

This example demonstrates how to deploy an LLM with prefill/decode
disaggregation using Ray Serve's built-in LLM APIs and vLLM's
NIXLConnector for KV cache transfer.

Disaggregated serving separates the prefill phase (processing input
prompts) from the decode phase (generating tokens), enabling:

- Independent scaling of prefill and decode replicas
- Reduced interference between compute-bound prefill and
memory-bound decode
- Cost optimization via heterogeneous hardware

Prerequisites:
- Ray >= 2.44 with ray[serve] installed
- vLLM v1 (default engine)
- NIXL: pip install nixl (pre-installed in ray-llm images)
- GPU workers with enough VRAM for the model

Usage:
# Deploy via Ray Serve config (recommended for Kubernetes)
serve deploy rayservice.disaggregated_prefill_decode.yaml

# Or run directly with Python
python disaggregated_prefill_decode.py

# Test the endpoint
python disaggregated_prefill_decode_req.py
"""

from ray.serve.llm import LLMConfig, build_pd_openai_app
from ray import serve

# Model to serve — change to any HuggingFace model you have access to.
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

# ── Prefill instance configuration ──────────────────────────────────
# The prefill instance processes input prompts and produces KV cache
# entries that are transferred to decode instances via NIXL.
prefill_config = LLMConfig(
model_loading_config={
"model_id": MODEL_ID,
},
deployment_config={
"autoscaling_config": {
"min_replicas": 1,
"max_replicas": 2,
}
},
accelerator_type="A10G",
engine_kwargs={
"kv_transfer_config": {
"kv_connector": "NixlConnector",
"kv_role": "kv_both",
},
},
)

# ── Decode instance configuration ───────────────────────────────────
# The decode instance generates tokens autoregressively, consuming
# KV cache entries produced by the prefill instance.
decode_config = LLMConfig(
model_loading_config={
"model_id": MODEL_ID,
},
deployment_config={
"autoscaling_config": {
"min_replicas": 1,
"max_replicas": 4,
}
},
accelerator_type="A10G",
engine_kwargs={
"kv_transfer_config": {
"kv_connector": "NixlConnector",
"kv_role": "kv_both",
},
},
)

# ── Build and deploy ────────────────────────────────────────────────
# build_pd_openai_app creates an OpenAI-compatible API with a router
# that directs requests to prefill instances first, then hands off
# KV cache to decode instances for token generation.
pd_config = dict(
prefill_config=prefill_config,
decode_config=decode_config,
)

app = build_pd_openai_app(pd_config)

if __name__ == "__main__":
serve.run(app)
print(f"\nDisaggregated serving is running for {MODEL_ID}")
print("Send requests to http://localhost:8000/v1/chat/completions")
print("Press Ctrl+C to stop.\n")

# Keep the process alive
import time
try:
while True:
time.sleep(10)
except KeyboardInterrupt:
print("Shutting down...")