
Fix disaggregated prefill/decode RayService for H200 GPU clusters#28

Open
iankouls-aws wants to merge 3 commits into main from disaggregated-inference

Conversation

@iankouls-aws
Contributor

Summary

Fixes the disaggregated prefill/decode RayService manifest so it works correctly on H200 GPU clusters (p5en instances).

Changes

Model

  • Switched from meta-llama/Llama-3.1-8B-Instruct (gated, requires an HF token) to Qwen/Qwen2.5-7B-Instruct (open, no auth required)
  • Updated test script (disaggregated_prefill_decode_req.py) to match the new model ID
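In serve-config terms, the model swap is a one-field change that the test client must mirror. A minimal sketch (the `model_loading_config` field names follow Ray Serve LLM's `LLMConfig` schema; treat the exact nesting as an assumption, not a literal excerpt of the manifest):

```python
# Hedged sketch of the model section of the Serve LLM config.
# The gated model required an HF token at download time; the open
# Qwen model does not.
GATED_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # previous, gated
OPEN_MODEL = "Qwen/Qwen2.5-7B-Instruct"           # new, no auth required

llm_config = {
    "model_loading_config": {
        "model_id": OPEN_MODEL,      # ID the test client must also request
        "model_source": OPEN_MODEL,  # pulled directly from Hugging Face
    },
}

# The gated model should no longer appear anywhere in the config.
assert GATED_MODEL not in str(llm_config)
```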

Accelerator

  • Changed accelerator_type from A10G to H200 for both prefill and decode configs to match actual cluster hardware
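As a sketch, this touches one field in each of the two LLM configs inside `serveConfigV2` (the nesting shown is an assumption about the manifest layout, not a verbatim excerpt):

```yaml
# Hedged sketch; field placement assumed from the PR description.
prefill_config:
  accelerator_type: H200   # was A10G
decode_config:
  accelerator_type: H200   # was A10G
```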

Head node GPU isolation

  • Added num-gpus: '0' to headGroupSpec.rayStartParams to prevent Ray from scheduling GPU workloads on the head node, which caused OOM failures when Serve replicas landed there
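The fix can be sketched as the following fragment of the RayService spec (standard KubeRay fields; the surrounding spec is elided):

```yaml
# Hedged sketch of the head-group change. With num-gpus forced to '0',
# the head advertises zero GPUs to the Ray scheduler, so Serve replicas
# can only land on the worker groups.
headGroupSpec:
  rayStartParams:
    num-gpus: '0'
```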

Container image

  • Changed image tag from :latest (does not exist) to :latest-py311-cu128 for all three containers (head, prefill worker, decode worker)
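Sketched against one of the three container specs (the same tag change applies to the head, prefill worker, and decode worker; the container name here is illustrative):

```yaml
containers:
  - name: ray-head   # illustrative name
    image: rayproject/ray-llm:latest-py311-cu128   # was :latest, which does not exist
```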

Testing

Deployed and verified on EKS cluster (shared-eks-cluster-cgk, ap-southeast-3) with p5en.48xlarge nodes. All pods reach Ready state and Serve deployments (Prefill, Decode, PDProxyServer, OpenAiIngress) initialize successfully.

Adds a new RayService example demonstrating LLM serving with
prefill/decode disaggregation using Ray Serve LLM APIs and
vLLM's NIXLConnector for KV cache transfer.

Files:
- disaggregated_prefill_decode.py: Python deployment script
- rayservice.disaggregated_prefill_decode.yaml: KubeRay manifest
- disaggregated_prefill_decode_req.py: Test client (chat + streaming)
- README.md: Documentation with architecture, config, and usage

Also updates rayservice-create.sh to list the new example.
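The KV cache transfer wiring described above can be sketched as the engine arguments both LLM configs hand to vLLM. The `kv_connector`/`kv_role` key names follow vLLM's `KVTransferConfig`; treat the exact values as assumptions about this example, not a verified excerpt:

```python
# Hedged sketch: engine kwargs enabling NIXL-based KV cache transfer.
# Key names follow vLLM's KVTransferConfig; values are assumptions.
kv_transfer_config = {
    "kv_connector": "NixlConnector",
    "kv_role": "kv_both",  # NIXL sorts out producer/consumer roles at runtime
}

# Both the prefill and decode configs pass the same block through their
# vLLM engine arguments; the proxy routes prefill -> decode per request.
prefill_engine_kwargs = {"kv_transfer_config": kv_transfer_config}
decode_engine_kwargs = {"kv_transfer_config": kv_transfer_config}
```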
…ead GPUs

- Switch model from gated meta-llama/Llama-3.1-8B-Instruct to open
  Qwen/Qwen2.5-7B-Instruct (no HF token required)
- Update accelerator_type from A10G to H200 to match actual cluster hardware
- Set num-gpus: 0 on head node to prevent Serve replicas from being
  scheduled on the head and hitting GPU memory contention

The :latest tag does not exist on rayproject/ray-llm. Use the explicit
:latest-py311-cu128 tag instead.
@iankouls-aws requested a review from mvinci12 on April 1, 2026 at 06:22
