Conversation
Pull request overview
This PR updates the ATOM vLLM and SGLang CI workflows to stop building/pushing reusable “pre-build” images on the build-only-atom runner and instead build the required CI images locally within the GPU test jobs.
Changes:
- Removed the dedicated image build-and-push jobs (`build_oot_image`/`build_sglang_image`) that ran on `build-only-atom`.
- Updated the accuracy jobs to always generate the overlay Dockerfile and build the needed image locally before running tests.
- Simplified image tag handling to always use locally built tags (`atom_oot:ci`, `atom_sglang:ci`) and removed remote pull/push/cleanup logic for pre-build tags.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| .github/workflows/atom-vllm-test.yaml | Removes remote prebuild job and shifts OOT image creation to local builds inside GPU accuracy jobs. |
| .github/workflows/atom-sglang-test.yaml | Removes remote prebuild job and shifts SGLang image creation to local builds inside GPU accuracy jobs. |
Comments suppressed due to low confidence (2)
.github/workflows/atom-vllm-test.yaml:232
- The local image tags (`atom_oot:ci`/`atom_oot_base:ci`) are hard-coded. If multiple self-hosted runners share the same Docker daemon (or if jobs overlap on the same host), parallel jobs can overwrite or remove each other's tags via `docker rmi` and `docker build -t ...`, making runs flaky and harder to debug. Consider tagging images with a unique suffix (e.g., `${GITHUB_RUN_ID}`, `${GITHUB_SHA}`, and/or `${{ strategy.job-index }}`) and using that tag consistently for the container run/cleanup.
- name: Build OOT image locally
run: |
set -euo pipefail
docker rmi "atom_oot_base:ci" || true
docker rmi "atom_oot:ci" || true
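The suggestion above can be sketched as a tag-derivation step. This is a hypothetical illustration, not the PR's code; the variable names `RUN_TAG`, `OOT_IMAGE`, and `OOT_BASE_IMAGE` are assumptions for the sketch:

```shell
# Hypothetical sketch: derive a per-run image tag so parallel jobs that share
# one Docker daemon cannot clobber each other's images.
RUN_TAG="ci-${GITHUB_RUN_ID:-local}-${GITHUB_RUN_ATTEMPT:-1}"
OOT_IMAGE="atom_oot:${RUN_TAG}"
OOT_BASE_IMAGE="atom_oot_base:${RUN_TAG}"
echo "using image tags ${OOT_BASE_IMAGE} and ${OOT_IMAGE}"
# The build and cleanup steps would then use the derived tags consistently:
#   docker rmi "${OOT_BASE_IMAGE}" || true
#   docker rmi "${OOT_IMAGE}" || true
#   docker build -t "${OOT_IMAGE}" -f Dockerfile.mod .
```

The same tag would also need to flow into the container run and the post-job cleanup, or stale per-run images will accumulate on the host.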
.github/workflows/atom-sglang-test.yaml:208
- The local image tags (`atom_sglang:ci`/`atom_sglang_base:ci`) are hard-coded. If multiple self-hosted runners share the same Docker daemon (or if jobs overlap on the same host), parallel jobs can overwrite or remove each other's tags via `docker rmi` and `docker build -t ...`, leading to flakiness. Consider tagging images with a unique suffix (e.g., `${GITHUB_RUN_ID}`, `${GITHUB_SHA}`, and/or `${{ strategy.job-index }}`) and using that tag consistently for the container run/cleanup.
- name: Build SGLANG image locally
run: |
set -euo pipefail
docker rmi "atom_sglang_base:ci" || true
docker rmi "atom_sglang:ci" || true
How about the ATOM model? Can we also add the docker tag for the ATOM model with accuracy checking?
Sure, I will add the docker name for the ATOM accuracy chart.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
.github/workflows/atom-vllm-test.yaml:231
- The OOT CI image is now built inside each matrix job run. With multiple models in the matrix this duplicates the Docker build work (and pulls) N times per PR, increasing runtime and load on GPU runners. Consider building the overlay image once in a dedicated job (on a GPU runner), then reusing it across matrix jobs (e.g., push/pull a sha-tagged image for non-fork PRs, or docker save/load as an artifact).
- name: Build OOT image locally
run: |
set -euo pipefail
docker rmi "atom_oot_base:ci" || true
docker rmi "atom_oot:ci" || true
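The build-once-and-reuse idea from the comment above could look like the following. This is a minimal sketch under assumptions: the `IMAGE` and `TARBALL` names are invented for illustration, and the artifact upload/download steps (e.g., `actions/upload-artifact`) are only indicated in comments:

```shell
# Hypothetical sketch: build the overlay image once in a dedicated GPU job,
# then ship it to the matrix jobs as a tarball artifact instead of rebuilding
# it N times per PR.
IMAGE="atom_oot:ci-${GITHUB_SHA:-dev}"
TARBALL="oot-image-${GITHUB_SHA:-dev}.tar"
echo "would share ${IMAGE} via ${TARBALL}"
# Build job (runs once):
#   docker build -t "${IMAGE}" -f Dockerfile.mod .
#   docker save -o "${TARBALL}" "${IMAGE}"   # then upload as a workflow artifact
# Each matrix job:
#   docker load -i "${TARBALL}"              # after downloading the artifact
```

Whether `docker save`/`docker load` beats a registry push/pull depends on image size and runner bandwidth; a sha-tagged registry push (for non-fork PRs) avoids moving multi-GB tarballs through artifact storage.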
@@ -682,25 +486,4 @@ jobs:
    rm -f Dockerfile.mod || true
    docker rmi "atom_oot_base:ci" || true
    docker rmi "atom_oot:ci" || true
OOT_IMAGE_CACHE_KEEP is still defined at the workflow level, but the cleanup logic that used it (pruning old oot-pre-build-* tags) has been removed. Either remove this env var or reintroduce a cache retention mechanism so the setting has an effect.
Suggested change:
-docker rmi "atom_oot:ci" || true
+cache_keep="${OOT_IMAGE_CACHE_KEEP:-0}"
+if [ "$cache_keep" -gt 0 ] 2>/dev/null; then
+  echo "Retaining OOT image cache atom_oot:ci because OOT_IMAGE_CACHE_KEEP=$cache_keep"
+else
+  docker rmi "atom_oot:ci" || true
+fi
    rm -f Dockerfile.mod || true
    docker rmi "atom_sglang_base:ci" || true
    docker rmi "atom_sglang:ci" || true
SGLANG_IMAGE_CACHE_KEEP is still defined at the workflow level, but the image pruning logic that referenced it was removed. Please either drop the unused env var or restore a retention/pruning step so the setting is meaningful.
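A retention step that would make the env var meaningful could be sketched as follows. The semantics are assumed (any value greater than 0 means keep the locally built image for reuse; 0 or unset means remove it), since the original pruning logic is no longer in the workflow:

```shell
# Hypothetical cleanup honoring SGLANG_IMAGE_CACHE_KEEP (assumed semantics:
# > 0 keeps the locally built image for reuse; 0/unset removes it).
keep="${SGLANG_IMAGE_CACHE_KEEP:-0}"
if [ "$keep" -gt 0 ] 2>/dev/null; then
  action="retain"
  echo "Retaining atom_sglang:ci (SGLANG_IMAGE_CACHE_KEEP=$keep)"
else
  action="remove"
  echo "Removing atom_sglang:ci"
  # docker rmi "atom_sglang:ci" || true
fi
```

If the counting semantics of the removed pruning logic (keep the N newest cached tags) are wanted instead, the step would also need per-run tags to count, which the current fixed `:ci` tag does not provide.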
finish on gpu machine instead of a build-only machine
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
nightly accuracy validation workflow
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Force-pushed: 5351f8d to dc7f905
    EXPLICIT_MODEL_NAME=${OOT_MODEL_NAME:-}
    EXPLICIT_MODEL_PATH=${OOT_MODEL_PATH:-}
    EXPLICIT_EXTRA_ARGS=${OOT_EXTRA_ARGS:-}
    OOT_DOCKER_IMAGE=${OOT_DOCKER_IMAGE:-}
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
.github/workflows/atom-vllm-test.yaml:233
- Building the OOT Docker image inside each matrix job means the image build runs once per model (and potentially does a full vLLM rebuild on label mismatch), which can significantly increase CI duration and GPU runner usage. Consider restoring a single image-build job (but run it on a GPU runner) and then reusing that image across the matrix (e.g., push an ephemeral tag or otherwise share the built image) so each accuracy job only pulls/uses it.
- name: Build OOT image locally
run: |
set -euo pipefail
docker rmi "atom_oot_base:ci" || true
docker rmi "atom_oot:ci" || true
BUILD_MODE="full"
.github/workflows/atom-sglang-test.yaml:212
- This workflow now builds the SGLang Docker image within each matrix job, so the image build repeats per model and can become a major CI time/cost driver (especially when the nightly tag can't be reused and a full rebuild happens). Consider reintroducing a single build job (on a GPU runner) and reusing the resulting image across the matrix (e.g., via an ephemeral pushed tag) to avoid duplicated builds.
- name: Build SGLANG image locally
run: |
set -euo pipefail
docker rmi "atom_sglang_base:ci" || true
docker rmi "atom_sglang:ci" || true
BUILD_MODE="full"
if ! docker pull "${NIGHTLY_SGLANG_IMAGE_TAG}"; then
echo "Unable to pull ${NIGHTLY_SGLANG_IMAGE_TAG}; rebuilding SGLang locally."