
Fix disaggregated prefill/decode RayService for H200 GPU clusters#28

Open
iankouls-aws wants to merge 3 commits into main from disaggregated-inference

Conversation

@iankouls-aws
Contributor

Summary

Fixes the disaggregated prefill/decode RayService manifest so it works correctly on H200 GPU clusters (p5en instances).

Changes

Model

  • Switched from meta-llama/Llama-3.1-8B-Instruct (gated, requires an HF token) to Qwen/Qwen2.5-7B-Instruct (open, no auth required)
  • Updated test script (disaggregated_prefill_decode_req.py) to match the new model ID
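In serve-config terms, the model swap is a one-field change that the test client must mirror. A minimal sketch (the `model_loading_config` field names follow Ray Serve LLM's `LLMConfig` schema; treat the exact nesting as an assumption, not a literal excerpt of the manifest):

```python
# Hedged sketch of the model section of the Serve LLM config.
# The gated model required an HF token at download time; the open
# Qwen model does not.
GATED_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # previous, gated
OPEN_MODEL = "Qwen/Qwen2.5-7B-Instruct"           # new, no auth required

llm_config = {
    "model_loading_config": {
        "model_id": OPEN_MODEL,      # ID the test client must also request
        "model_source": OPEN_MODEL,  # pulled directly from Hugging Face
    },
}

# The gated model should no longer appear anywhere in the config.
assert GATED_MODEL not in str(llm_config)
```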

Accelerator

  • Changed accelerator_type from A10G to H200 for both prefill and decode configs to match actual cluster hardware
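As a sketch, this touches one field in each of the two LLM configs inside `serveConfigV2` (the nesting shown is an assumption about the manifest layout, not a verbatim excerpt):

```yaml
# Hedged sketch; field placement assumed from the PR description.
prefill_config:
  accelerator_type: H200   # was A10G
decode_config:
  accelerator_type: H200   # was A10G
```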

Head node GPU isolation

  • Added num-gpus: '0' to headGroupSpec.rayStartParams to prevent Ray from scheduling GPU workloads on the head node, which caused OOM failures when Serve replicas landed there
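The fix can be sketched as the following fragment of the RayService spec (standard KubeRay fields; the surrounding spec is elided):

```yaml
# Hedged sketch of the head-group change. With num-gpus forced to '0',
# the head advertises zero GPUs to the Ray scheduler, so Serve replicas
# can only land on the worker groups.
headGroupSpec:
  rayStartParams:
    num-gpus: '0'
```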

Container image

  • Changed image tag from :latest (does not exist) to :latest-py311-cu128 for all three containers (head, prefill worker, decode worker)
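Sketched against one of the three container specs (the same tag change applies to the head, prefill worker, and decode worker; the container name here is illustrative):

```yaml
containers:
  - name: ray-head   # illustrative name
    image: rayproject/ray-llm:latest-py311-cu128   # was :latest, which does not exist
```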

Testing

Deployed and verified on EKS cluster (shared-eks-cluster-cgk, ap-southeast-3) with p5en.48xlarge nodes. All pods reach Ready state and Serve deployments (Prefill, Decode, PDProxyServer, OpenAiIngress) initialize successfully.

Adds a new RayService example demonstrating LLM serving with
prefill/decode disaggregation using Ray Serve LLM APIs and
vLLM's NIXLConnector for KV cache transfer.

Files:
- disaggregated_prefill_decode.py: Python deployment script
- rayservice.disaggregated_prefill_decode.yaml: KubeRay manifest
- disaggregated_prefill_decode_req.py: Test client (chat + streaming)
- README.md: Documentation with architecture, config, and usage

Also updates rayservice-create.sh to list the new example.
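The KV cache transfer wiring described above can be sketched as the engine arguments both LLM configs hand to vLLM. The `kv_connector`/`kv_role` key names follow vLLM's `KVTransferConfig`; treat the exact values as assumptions about this example, not a verified excerpt:

```python
# Hedged sketch: engine kwargs enabling NIXL-based KV cache transfer.
# Key names follow vLLM's KVTransferConfig; values are assumptions.
kv_transfer_config = {
    "kv_connector": "NixlConnector",
    "kv_role": "kv_both",  # NIXL sorts out producer/consumer roles at runtime
}

# Both the prefill and decode configs pass the same block through their
# vLLM engine arguments; the proxy routes prefill -> decode per request.
prefill_engine_kwargs = {"kv_transfer_config": kv_transfer_config}
decode_engine_kwargs = {"kv_transfer_config": kv_transfer_config}
```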
…ead GPUs

- Switch model from gated meta-llama/Llama-3.1-8B-Instruct to open
  Qwen/Qwen2.5-7B-Instruct (no HF token required)
- Update accelerator_type from A10G to H200 to match actual cluster hardware
- Set num-gpus: 0 on head node to prevent Serve replicas from being
  scheduled on the head and hitting GPU memory contention

The :latest tag does not exist on rayproject/ray-llm. Use the explicit
:latest-py311-cu128 tag instead.
@iankouls-aws requested a review from mvinci12 on April 1, 2026 at 06:22
