Skip to content

[atom-vllm][atom-sglang][CI] build CI image on GPU machine instead of a build-only machine#561

Merged
valarLip merged 4 commits intomainfrom
zejun/change_ci_image_build_0414
Apr 16, 2026
Merged

[atom-vllm][atom-sglang][CI] build CI image on GPU machine instead of a build-only machine#561
valarLip merged 4 commits intomainfrom
zejun/change_ci_image_build_0414

Conversation

@zejunchen-zejun
Copy link
Copy Markdown
Collaborator

@zejunchen-zejun zejunchen-zejun commented Apr 14, 2026

  1. remove the logic of using the dedicated machine building atom-vllm atom-sglang image. This PR uses the gpu machine to pull nightly image directly and begin test
  2. add accuracy upload for atom-sglang nightly acc validation
  3. add docker image name to dashboard accuracy chart
  4. change model to ds mxfp4 mtp model for nightly

@zejunchen-zejun zejunchen-zejun marked this pull request as ready for review April 14, 2026 12:51
Copilot AI review requested due to automatic review settings April 14, 2026 12:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the ATOM vLLM and SGLang CI workflows to stop building/pushing reusable “pre-build” images on the build-only-atom runner and instead build the required CI images locally within the GPU test jobs.

Changes:

  • Removed the dedicated image build-and-push jobs (build_oot_image / build_sglang_image) that ran on build-only-atom.
  • Updated the accuracy jobs to always generate the overlay Dockerfile and build the needed image locally before running tests.
  • Simplified image tag handling to always use locally built tags (atom_oot:ci, atom_sglang:ci) and removed remote pull/push/cleanup logic for pre-build tags.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
.github/workflows/atom-vllm-test.yaml Removes remote prebuild job and shifts OOT image creation to local builds inside GPU accuracy jobs.
.github/workflows/atom-sglang-test.yaml Removes remote prebuild job and shifts SGLang image creation to local builds inside GPU accuracy jobs.
Comments suppressed due to low confidence (2)

.github/workflows/atom-vllm-test.yaml:232

  • The local image tags (atom_oot:ci / atom_oot_base:ci) are hard-coded. If multiple self-hosted runners share the same Docker daemon (or if jobs overlap on the same host), parallel jobs can overwrite/remove each other’s tags via docker rmi and docker build -t ..., making runs flaky and harder to debug. Consider tagging images with a unique suffix (e.g., ${GITHUB_RUN_ID}, ${GITHUB_COMMIT_SHA}, and/or ${{ strategy.job-index }}) and using that tag consistently for the container run/cleanup.
      - name: Build OOT image locally
        run: |
          set -euo pipefail
          docker rmi "atom_oot_base:ci" || true
          docker rmi "atom_oot:ci" || true

.github/workflows/atom-sglang-test.yaml:208

  • The local image tags (atom_sglang:ci / atom_sglang_base:ci) are hard-coded. If multiple self-hosted runners share the same Docker daemon (or if jobs overlap on the same host), parallel jobs can overwrite/remove each other’s tags via docker rmi and docker build -t ..., leading to flakiness. Consider tagging images with a unique suffix (e.g., ${GITHUB_RUN_ID}, ${GITHUB_COMMIT_SHA}, and/or ${{ strategy.job-index }}) and using that tag consistently for the container run/cleanup.
      - name: Build SGLANG image locally
        run: |
          set -euo pipefail
          docker rmi "atom_sglang_base:ci" || true
          docker rmi "atom_sglang:ci" || true

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/workflows/atom-vllm-test.yaml
Comment thread .github/workflows/atom-sglang-test.yaml
zhuyuhua-v
zhuyuhua-v previously approved these changes Apr 15, 2026
Copy link
Copy Markdown
Collaborator

@zhuyuhua-v zhuyuhua-v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wuhuikx
Copy link
Copy Markdown
Collaborator

wuhuikx commented Apr 15, 2026

How about the ATOM model? Can we also add the docker tag for ATOM model with accuracy checking?

@zejunchen-zejun
Copy link
Copy Markdown
Collaborator Author

How about the ATOM model? Can we also add the docker tag for ATOM model with accuracy checking?

Sure, I will add docker name for ATOM accuracy chart.

Copilot AI review requested due to automatic review settings April 15, 2026 13:26
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

.github/workflows/atom-vllm-test.yaml:231

  • The OOT CI image is now built inside each matrix job run. With multiple models in the matrix this duplicates the Docker build work (and pulls) N times per PR, increasing runtime and load on GPU runners. Consider building the overlay image once in a dedicated job (on a GPU runner), then reusing it across matrix jobs (e.g., push/pull a sha-tagged image for non-fork PRs, or docker save/load as an artifact).
      - name: Build OOT image locally
        run: |
          set -euo pipefail
          docker rmi "atom_oot_base:ci" || true
          docker rmi "atom_oot:ci" || true

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -682,25 +486,4 @@ jobs:
rm -f Dockerfile.mod || true
docker rmi "atom_oot_base:ci" || true
docker rmi "atom_oot:ci" || true
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOT_IMAGE_CACHE_KEEP is still defined at the workflow level, but the cleanup logic that used it (pruning old oot-pre-build-* tags) has been removed. Either remove this env var or reintroduce a cache retention mechanism so the setting has an effect.

Suggested change
docker rmi "atom_oot:ci" || true
cache_keep="${OOT_IMAGE_CACHE_KEEP:-0}"
if [ "$cache_keep" -gt 0 ] 2>/dev/null; then
echo "Retaining OOT image cache atom_oot:ci because OOT_IMAGE_CACHE_KEEP=$cache_keep"
else
docker rmi "atom_oot:ci" || true
fi

Copilot uses AI. Check for mistakes.
Comment on lines 465 to 467
rm -f Dockerfile.mod || true
docker rmi "atom_sglang_base:ci" || true
docker rmi "atom_sglang:ci" || true
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGLANG_IMAGE_CACHE_KEEP is still defined at the workflow level, but the image pruning logic that referenced it was removed. Please either drop the unused env var or restore a retention/pruning step so the setting is meaningful.

Copilot uses AI. Check for mistakes.
finish on gpu machine instead of a build-only
machine

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
nightly accuracy validation workflow

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/change_ci_image_build_0414 branch from 5351f8d to dc7f905 Compare April 16, 2026 05:59
EXPLICIT_MODEL_NAME=${OOT_MODEL_NAME:-}
EXPLICIT_MODEL_PATH=${OOT_MODEL_PATH:-}
EXPLICIT_EXTRA_ARGS=${OOT_EXTRA_ARGS:-}
OOT_DOCKER_IMAGE=${OOT_DOCKER_IMAGE:-}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove OOT tag

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will update in PR: #541

valarLip
valarLip previously approved these changes Apr 16, 2026
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

.github/workflows/atom-vllm-test.yaml:233

  • Building the OOT Docker image inside each matrix job means the image build runs once per model (and potentially does a full vLLM rebuild on label mismatch), which can significantly increase CI duration and GPU runner usage. Consider restoring a single image-build job (but run it on a GPU runner) and then reusing that image across the matrix (e.g., push an ephemeral tag or otherwise share the built image) so each accuracy job only pulls/uses it.
      - name: Build OOT image locally
        run: |
          set -euo pipefail
          docker rmi "atom_oot_base:ci" || true
          docker rmi "atom_oot:ci" || true

          BUILD_MODE="full"

.github/workflows/atom-sglang-test.yaml:212

  • This workflow now builds the SGLang Docker image within each matrix job, so the image build repeats per model and can become a major CI time/cost driver (especially when the nightly tag can't be reused and a full rebuild happens). Consider reintroducing a single build job (on a GPU runner) and reusing the resulting image across the matrix (e.g., via an ephemeral pushed tag) to avoid duplicated builds.
      - name: Build SGLANG image locally
        run: |
          set -euo pipefail
          docker rmi "atom_sglang_base:ci" || true
          docker rmi "atom_sglang:ci" || true

          BUILD_MODE="full"
          if ! docker pull "${NIGHTLY_SGLANG_IMAGE_TAG}"; then
            echo "Unable to pull ${NIGHTLY_SGLANG_IMAGE_TAG}; rebuilding SGLang locally."

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/scripts/atom_oot_test.sh
Comment thread .github/benchmark/sglang_models_accuracy.json
@valarLip valarLip merged commit 522518f into main Apr 16, 2026
25 of 31 checks passed
@valarLip valarLip deleted the zejun/change_ci_image_build_0414 branch April 16, 2026 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants