26 changes: 26 additions & 0 deletions .github/configs/amd-master.yaml
@@ -415,6 +415,32 @@
        - { tp: 4, conc-start: 4, conc-end: 128 }
        - { tp: 8, ep: 8, conc-start: 32, conc-end: 256 }

minimaxm2.5-fp4-mi355x-atom:
  # TODO:
  image: TBD

Check failure on line 420 in .github/configs/amd-master.yaml

Claude / Claude Code Review

image: TBD placeholder will cause CI pipeline failures

The `minimaxm2.5-fp4-mi355x-atom` config entry has `image: TBD` instead of a valid Docker image tag, which will cause any CI pipeline targeting this config to fail immediately when attempting to pull the image. Before merging, replace `TBD` with a real atom image (the fp8 equivalent uses `rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2` as a reference point).

Comment on lines +418 to +420

Extended reasoning
What the bug is and how it manifests

The minimaxm2.5-fp4-mi355x-atom config entry in .github/configs/amd-master.yaml (line 420) sets image: TBD. This is not a valid Docker image reference. When any CI job targets this config entry, the container runtime will attempt docker pull TBD (or equivalent), which will fail with an image-not-found error and abort the pipeline.

The specific code path that triggers it

The CI benchmark workflow reads the image field from the YAML config and passes it directly as the Docker image to pull and run (e.g., IMAGE: ${{ inputs.image }}). There is no guard or filtering logic that would skip entries with placeholder values — the config is consumed as-is. Any automated benchmark job that selects the minimaxm2.5-fp4-mi355x-atom key will immediately fail at the Docker pull step.
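
That pass-through can be pictured as a workflow fragment (illustrative job and input names, not the repository's actual workflow):

```yaml
# The config entry's image value is forwarded verbatim to the container
# runtime; nothing filters out placeholder values along the way.
jobs:
  benchmark:
    runs-on: [self-hosted, mi355x]
    container:
      image: ${{ inputs.image }}  # resolves to "TBD" for this entry
    steps:
      - run: bash benchmarks/single_node/minimaxm2.5_fp4_mi355x_atom.sh
```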

Why existing code doesn't prevent it

The YAML config file has no schema validation or image-existence checks at parse time. The # TODO: comments in the diff confirm the author is aware the image value is incomplete, but these are source comments only — they have no runtime effect and do not prevent the broken entry from being picked up by automation.
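
One lightweight way to close that gap would be a lint step that rejects placeholder values before any job consumes the config — a sketch (hypothetical script, not part of this PR):

```shell
# Fail if any `image:` field in the given config file holds a
# TBD-style placeholder value.
check_placeholder_images() {
    if grep -nE '^[[:space:]]*image:[[:space:]]*(TBD|TODO|FIXME)[[:space:]]*$' "$1"; then
        echo "placeholder image found in $1" >&2
        return 1
    fi
}
```

Run against `.github/configs/amd-master.yaml` in a PR check, this would have flagged the broken entry before merge.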

What the impact would be

Any CI benchmark run that triggers the minimaxm2.5-fp4-mi355x-atom config entry will fail with a Docker image pull error, producing a broken pipeline and potentially blocking other benchmark results from the same run.

How to fix it

Replace image: TBD with the correct atom Docker image tag for the MiniMax M2.5 FP4 workload on MI355X. Based on the fp8 equivalent (minimaxm2.5-fp8-mi355x-atom), a likely candidate is rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2 or newer, but the exact image should be verified with the atom team before merging.
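
Concretely, the fix is a one-line change — shown here with the fp8 reference image as a stand-in, pending confirmation of the correct FP4 tag:

```yaml
minimaxm2.5-fp4-mi355x-atom:
  # TODO: confirm the correct atom image for FP4 on MI355X with the atom team
  image: rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2
```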

Step-by-step proof

  1. CI automation reads .github/configs/amd-master.yaml and finds the minimaxm2.5-fp4-mi355x-atom key.
  2. It extracts image: TBD and sets IMAGE=TBD in the job environment.
  3. The benchmark template executes docker pull TBD (or equivalent container invocation).
  4. Docker fails with an image-not-found / pull error for `TBD` (the registry has no such repository).
  5. The CI job exits with a non-zero status code and the pipeline is marked as failed.
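
A complementary mitigation (not in this PR) would be a preflight guard in the benchmark tooling so a placeholder fails fast with a clear message instead of an opaque pull error — a sketch with a hypothetical helper name:

```shell
# Hypothetical guard: reject empty or placeholder image references
# before any container pull is attempted.
require_real_image() {
    case "$1" in
        ""|TBD|TODO|FIXME)
            echo "ERROR: '$1' is not a real image tag; fix the config entry" >&2
            return 1
            ;;
    esac
}
```

Called as `require_real_image "$IMAGE" || exit 1` at the top of the job, this turns step 3 above into an immediate, self-explanatory failure.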

  model: MiniMaxAI/MiniMax-M2.5
  model-prefix: minimaxm2.5
  runner: mi355x
  precision: fp4
  framework: atom
  multinode: false
  seq-len-configs:
    - isl: 1024
      osl: 1024
      # TODO:
      search-space:
        - { tp: 1, conc-start: 4, conc-end: 256 }
        - { tp: 2, conc-start: 4, conc-end: 256 }
        - { tp: 4, conc-start: 4, conc-end: 256 }
        - { tp: 8, conc-start: 4, conc-end: 256 }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 1, conc-start: 4, conc-end: 256 }
        - { tp: 2, conc-start: 4, conc-end: 256 }
        - { tp: 4, conc-start: 4, conc-end: 256 }
        - { tp: 8, conc-start: 4, conc-end: 256 }

minimaxm2.5-fp8-mi300x-vllm:
  image: vllm/vllm-openai-rocm:v0.16.0
  model: MiniMaxAI/MiniMax-M2.5
80 changes: 80 additions & 0 deletions benchmarks/single_node/minimaxm2.5_fp4_mi355x_atom.sh
@@ -0,0 +1,80 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE \
    DP_ATTENTION

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

export OMP_NUM_THREADS=1

# Calculate max-model-len based on ISL and OSL
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
    CALCULATED_MAX_MODEL_LEN=""
else
    CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi

if [ "$EP_SIZE" -gt 1 ]; then
    EP=" --enable-expert-parallel"
else
    EP=" "
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x

# CALCULATED_MAX_MODEL_LEN and EP are intentionally unquoted so they
# word-split into separate flags (or nothing) when empty.
python3 -m atom.entrypoints.openai_server \
    --model "$MODEL" \
    --server-port "$PORT" \
    -tp "$TP" \
    --kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
    --trust-remote-code \
    > "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

export PYTHONDONTWRITEBYTECODE=1
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x