[atom-vllm][atom-sglang][CI] build CI image on GPU machine instead of a build-only machine by zejunchen-zejun · Pull Request #561 · ROCm/ATOM

zejunchen-zejun · 2026-04-14T12:46:36Z

Remove the logic that used a dedicated build-only machine to build the atom-vllm and atom-sglang images. This PR uses the GPU machine to pull the nightly image directly and begin testing.
Add accuracy upload for atom-sglang nightly accuracy validation.
Add the Docker image name to the dashboard accuracy chart.
Change the nightly model to the DeepSeek MXFP4 MTP model.

Copilot

Pull request overview

This PR updates the ATOM vLLM and SGLang CI workflows to stop building/pushing reusable “pre-build” images on the build-only-atom runner and instead build the required CI images locally within the GPU test jobs.

Changes:

Removed the dedicated image build-and-push jobs (build_oot_image / build_sglang_image) that ran on build-only-atom.
Updated the accuracy jobs to always generate the overlay Dockerfile and build the needed image locally before running tests.
Simplified image tag handling to always use locally built tags (atom_oot:ci, atom_sglang:ci) and removed remote pull/push/cleanup logic for pre-build tags.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
.github/workflows/atom-vllm-test.yaml	Removes remote prebuild job and shifts OOT image creation to local builds inside GPU accuracy jobs.
.github/workflows/atom-sglang-test.yaml	Removes remote prebuild job and shifts SGLang image creation to local builds inside GPU accuracy jobs.

Comments suppressed due to low confidence (2)

.github/workflows/atom-vllm-test.yaml:232

The local image tags (atom_oot:ci / atom_oot_base:ci) are hard-coded. If multiple self-hosted runners share the same Docker daemon (or if jobs overlap on the same host), parallel jobs can overwrite or remove each other’s tags via docker rmi and docker build -t ..., making runs flaky and harder to debug. Consider tagging images with a unique suffix (e.g., ${GITHUB_RUN_ID}, ${GITHUB_SHA}, and/or ${{ strategy.job-index }}) and using that tag consistently for the container run and cleanup.

      - name: Build OOT image locally
        run: |
          set -euo pipefail
          docker rmi "atom_oot_base:ci" || true
          docker rmi "atom_oot:ci" || true
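
One way to derive the unique suffix suggested above is sketched here. It is a minimal illustration, not the workflow's actual code: `unique_tag` and `JOB_INDEX` are hypothetical names (the workflow would have to export `JOB_INDEX` itself, for example from `${{ strategy.job-index }}`), while `GITHUB_RUN_ID`, `GITHUB_RUN_ATTEMPT`, and `GITHUB_SHA` are default GitHub Actions environment variables.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print a per-run image reference so overlapping jobs on one Docker daemon
# do not overwrite or remove each other's images.
# unique_tag and JOB_INDEX are hypothetical; the GITHUB_* variables are
# provided by GitHub Actions and fall back to placeholders when run locally.
unique_tag() {
  local repo="$1"
  local run_id="${GITHUB_RUN_ID:-local}"
  local attempt="${GITHUB_RUN_ATTEMPT:-1}"
  local sha="${GITHUB_SHA:-nosha}"
  local job="${JOB_INDEX:-0}"
  # A short SHA keeps the tag readable and well under Docker's 128-char tag limit.
  printf '%s:ci-%s-%s-%s-%s\n' "$repo" "$run_id" "$attempt" "${sha:0:7}" "$job"
}

OOT_IMAGE="$(unique_tag atom_oot)"
OOT_BASE_IMAGE="$(unique_tag atom_oot_base)"
echo "Would build ${OOT_BASE_IMAGE} and ${OOT_IMAGE}"
```

The same two variables would then replace the hard-coded tags in the `docker rmi`, `docker build -t`, and `docker run` steps, so build and cleanup always refer to the image created by the current job.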

.github/workflows/atom-sglang-test.yaml:208

The local image tags (atom_sglang:ci / atom_sglang_base:ci) are hard-coded. If multiple self-hosted runners share the same Docker daemon (or if jobs overlap on the same host), parallel jobs can overwrite or remove each other’s tags via docker rmi and docker build -t ..., leading to flakiness. Consider tagging images with a unique suffix (e.g., ${GITHUB_RUN_ID}, ${GITHUB_SHA}, and/or ${{ strategy.job-index }}) and using that tag consistently for the container run and cleanup.

      - name: Build SGLANG image locally
        run: |
          set -euo pipefail
          docker rmi "atom_sglang_base:ci" || true
          docker rmi "atom_sglang:ci" || true

zhuyuhua-v

LGTM

wuhuikx · 2026-04-15T09:16:11Z

How about the ATOM model? Can we also add the docker tag for ATOM model with accuracy checking?

zejunchen-zejun · 2026-04-15T09:44:22Z

How about the ATOM model? Can we also add the docker tag for ATOM model with accuracy checking?

Sure, I will add the Docker image name to the ATOM accuracy chart.

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

.github/workflows/atom-vllm-test.yaml:231

The OOT CI image is now built inside each matrix job run. With multiple models in the matrix this duplicates the Docker build work (and pulls) N times per PR, increasing runtime and load on GPU runners. Consider building the overlay image once in a dedicated job (on a GPU runner), then reusing it across matrix jobs (e.g., push/pull a sha-tagged image for non-fork PRs, or docker save/load as an artifact).

      - name: Build OOT image locally
        run: |
          set -euo pipefail
          docker rmi "atom_oot_base:ci" || true
          docker rmi "atom_oot:ci" || true
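
The `docker save`/`docker load` option mentioned above could look roughly like the following. This is a sketch under assumptions: `export_image` and `import_image` are hypothetical helper names, the archive path is illustrative, and the upload and download of the archive between jobs (for example as a workflow artifact) is not shown. The `DOCKER` variable is only there so the flow can be dry-run with `DOCKER=echo`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# DOCKER is overridable so the commands can be printed instead of executed.
DOCKER="${DOCKER:-docker}"

# Build job: write the image to a tar archive that a later step could upload.
export_image() {
  local image="$1" archive="$2"
  "$DOCKER" save --output "$archive" "$image"
}

# Matrix job: restore the image from the archive after it has been downloaded.
import_image() {
  local archive="$1"
  "$DOCKER" load --input "$archive"
}

# Example (not executed here):
#   export_image "atom_oot:ci" "atom_oot_ci.tar"   # in the single build job
#   import_image "atom_oot_ci.tar"                 # in each matrix job
```

Image archives for GPU software stacks can be very large, so pushing a SHA-tagged image to a registry may turn out to be more practical than an artifact; which option fits depends on the runner setup.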

Copilot · 2026-04-15T13:34:16Z

@@ -682,25 +486,4 @@ jobs:
          rm -f Dockerfile.mod || true
          docker rmi "atom_oot_base:ci" || true
          docker rmi "atom_oot:ci" || true


OOT_IMAGE_CACHE_KEEP is still defined at the workflow level, but the cleanup logic that used it (pruning old oot-pre-build-* tags) has been removed. Either remove this env var or reintroduce a cache retention mechanism so the setting has an effect.

Suggested change

-          docker rmi "atom_oot:ci" || true
+          cache_keep="${OOT_IMAGE_CACHE_KEEP:-0}"
+          if [ "$cache_keep" -gt 0 ] 2>/dev/null; then
+            echo "Retaining OOT image cache atom_oot:ci because OOT_IMAGE_CACHE_KEEP=$cache_keep"
+          else
+            docker rmi "atom_oot:ci" || true
+          fi

Copilot · 2026-04-15T13:34:17Z

          rm -f Dockerfile.mod || true
          docker rmi "atom_sglang_base:ci" || true
          docker rmi "atom_sglang:ci" || true


SGLANG_IMAGE_CACHE_KEEP is still defined at the workflow level, but the image pruning logic that referenced it was removed. Please either drop the unused env var or restore a retention/pruning step so the setting is meaningful.
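
If the retention route were chosen for SGLang, a cleanup step could take a shape like the one below. This is only a sketch: `cleanup_image` is a hypothetical helper, and treating any positive `SGLANG_IMAGE_CACHE_KEEP` as "keep the image" is an assumption that mirrors the suggestion made for the OOT workflow rather than the removed pruning logic.

```shell
#!/usr/bin/env bash
set -euo pipefail

# DOCKER is overridable so the step can be dry-run with DOCKER=echo.
DOCKER="${DOCKER:-docker}"

# Remove an image unless the keep value is a positive integer.
cleanup_image() {
  local image="$1" keep="${2:-0}"
  # Treat empty or non-numeric values as 0 so a typo never retains images silently.
  case "$keep" in
    ''|*[!0-9]*) keep=0 ;;
  esac
  if [ "$keep" -gt 0 ]; then
    echo "Retaining ${image} because keep=${keep}"
  else
    "$DOCKER" rmi "$image" || true
  fi
}

# Example (not executed here):
#   cleanup_image "atom_sglang:ci" "${SGLANG_IMAGE_CACHE_KEEP:-0}"
```

Validating the value up front avoids relying on `2>/dev/null` to hide a failed integer comparison.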

PerryZhang01 · 2026-04-16T08:47:50Z

 EXPLICIT_MODEL_NAME=${OOT_MODEL_NAME:-}
 EXPLICIT_MODEL_PATH=${OOT_MODEL_PATH:-}
 EXPLICIT_EXTRA_ARGS=${OOT_EXTRA_ARGS:-}
+OOT_DOCKER_IMAGE=${OOT_DOCKER_IMAGE:-}


remove OOT tag

ok, will update in PR: #541

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

.github/workflows/atom-vllm-test.yaml:233

Building the OOT Docker image inside each matrix job means the image build runs once per model (and potentially does a full vLLM rebuild on label mismatch), which can significantly increase CI duration and GPU runner usage. Consider restoring a single image-build job (but run it on a GPU runner) and then reusing that image across the matrix (e.g., push an ephemeral tag or otherwise share the built image) so each accuracy job only pulls/uses it.

      - name: Build OOT image locally
        run: |
          set -euo pipefail
          docker rmi "atom_oot_base:ci" || true
          docker rmi "atom_oot:ci" || true

          BUILD_MODE="full"

.github/workflows/atom-sglang-test.yaml:212

This workflow now builds the SGLang Docker image within each matrix job, so the image build repeats per model and can become a major CI time/cost driver (especially when the nightly tag can't be reused and a full rebuild happens). Consider reintroducing a single build job (on a GPU runner) and reusing the resulting image across the matrix (e.g., via an ephemeral pushed tag) to avoid duplicated builds.

      - name: Build SGLANG image locally
        run: |
          set -euo pipefail
          docker rmi "atom_sglang_base:ci" || true
          docker rmi "atom_sglang:ci" || true

          BUILD_MODE="full"
          if ! docker pull "${NIGHTLY_SGLANG_IMAGE_TAG}"; then
            echo "Unable to pull ${NIGHTLY_SGLANG_IMAGE_TAG}; rebuilding SGLang locally."
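
The quoted snippet is cut off after the failed-pull message, so the rest of the step is not shown here. Purely as an illustration of the pull-with-fallback pattern it begins, and not as the workflow's actual code, the decision could be isolated as follows. `select_build_mode` and the mode name `overlay` are hypothetical; only `full` and the pull of `NIGHTLY_SGLANG_IMAGE_TAG` appear in the snippet.

```shell
#!/usr/bin/env bash
set -euo pipefail

# DOCKER is overridable so the decision can be exercised without a daemon.
DOCKER="${DOCKER:-docker}"

# Print "overlay" when the nightly image can be pulled, "full" otherwise.
# The name "overlay" is a placeholder for whatever the real workflow uses.
select_build_mode() {
  local nightly_tag="$1"
  if "$DOCKER" pull "$nightly_tag" >/dev/null 2>&1; then
    echo "overlay"
  else
    # Diagnostics go to stderr so stdout carries only the mode.
    echo "Unable to pull ${nightly_tag}; falling back to a full rebuild." >&2
    echo "full"
  fi
}

# Example (not executed here):
#   BUILD_MODE="$(select_build_mode "${NIGHTLY_SGLANG_IMAGE_TAG}")"
```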

zejunchen-zejun requested a review from zhuyuhua-v April 14, 2026 12:51

zejunchen-zejun marked this pull request as ready for review April 14, 2026 12:51

Copilot AI review requested due to automatic review settings April 14, 2026 12:51

Copilot started reviewing on behalf of zejunchen-zejun April 14, 2026 12:53

Copilot AI reviewed Apr 14, 2026

Comment thread .github/workflows/atom-vllm-test.yaml

Comment thread .github/workflows/atom-sglang-test.yaml

zhuyuhua-v previously approved these changes Apr 15, 2026

zejunchen-zejun dismissed zhuyuhua-v’s stale review via 54896c5 April 15, 2026 06:49

Copilot AI review requested due to automatic review settings April 15, 2026 13:26

Copilot started reviewing on behalf of zejunchen-zejun April 15, 2026 13:27

Copilot AI reviewed Apr 15, 2026

zejunchen-zejun added 3 commits April 16, 2026 13:57

[atom-vllm][atom-sglang] ci image build

40c23af

finish on gpu machine instead of a build-only machine Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

upload docker image name to dashboard

bc02078

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

add docker image name shown in atom native

dc7f905

nightly accuracy validation workflow Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

zejunchen-zejun force-pushed the zejun/change_ci_image_build_0414 branch from 5351f8d to dc7f905 Compare April 16, 2026 05:59

PerryZhang01 reviewed Apr 16, 2026

valarLip previously approved these changes Apr 16, 2026

change ds mxfp4 model to amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4

c4fddb2

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

Copilot AI review requested due to automatic review settings April 16, 2026 09:50

zejunchen-zejun dismissed valarLip’s stale review via c4fddb2 April 16, 2026 09:50

Copilot started reviewing on behalf of zejunchen-zejun April 16, 2026 09:51

Copilot AI reviewed Apr 16, 2026

Comment thread .github/scripts/atom_oot_test.sh

Comment thread .github/benchmark/sglang_models_accuracy.json

valarLip approved these changes Apr 16, 2026

valarLip merged commit 522518f into main Apr 16, 2026
25 of 31 checks passed

valarLip deleted the zejun/change_ci_image_build_0414 branch April 16, 2026 11:55

zhuyuhua-v mentioned this pull request Apr 17, 2026

Update the naming of vLLM-ATOM path #541

Open

valarLip merged 4 commits into main from zejun/change_ci_image_build_0414

6 participants
