22 changes: 22 additions & 0 deletions .github/configs/amd-master.yaml
@@ -277,6 +277,28 @@ glm5-fp8-mi355x-atom:
      search-space:
        - { tp: 8, conc-start: 4, conc-end: 256 }

glm5-fp4-mi355x-atom:
  #TODO
  image: TBD
  model: amd/GLM-5.1-MXFP4
  model-prefix: glm5
  runner: mi355x
  precision: fp4
  framework: atom
  multinode: false
  seq-len-configs:
    #TODO:
    - isl: 1024
      osl: 1024
      search-space:
        - { tp: 4, conc-start: 4, conc-end: 256 }
        - { tp: 8, conc-start: 4, conc-end: 256 }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 4, conc-start: 4, conc-end: 256 }
        - { tp: 8, conc-start: 4, conc-end: 256 }

kimik2.5-int4-mi355x-vllm:
image: vllm/vllm-openai-rocm:v0.18.0
model: moonshotai/Kimi-K2.5
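Each search-space stanza in the config above pairs a tensor-parallel size with a concurrency range. As an illustration only (the actual expansion logic lives in the benchmark harness, and the doubling concurrency sweep is an assumption), a runner might expand one seq-len-configs entry into concrete jobs like this:

```python
# Hypothetical expansion of a seq-len-configs entry into benchmark jobs.
# The doubling sweep from conc-start to conc-end is an assumption; the
# real harness may step through concurrencies differently.
def expand_search_space(seq_len_configs):
    jobs = []
    for cfg in seq_len_configs:
        for space in cfg["search-space"]:
            conc = space["conc-start"]
            while conc <= space["conc-end"]:
                jobs.append({"tp": space["tp"], "isl": cfg["isl"],
                             "osl": cfg["osl"], "conc": conc})
                conc *= 2
    return jobs

configs = [
    {"isl": 1024, "osl": 1024,
     "search-space": [{"tp": 4, "conc-start": 4, "conc-end": 256},
                      {"tp": 8, "conc-start": 4, "conc-end": 256}]},
]
jobs = expand_search_space(configs)
print(len(jobs))  # 14: 2 tp values x 7 concurrency points (4..256)
```

Under this reading, each of the two new seq-len configs fans out into the same tp/concurrency grid, differing only in input sequence length.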
81 changes: 81 additions & 0 deletions benchmarks/single_node/glm5_fp4_mi355x_atom.sh
@@ -0,0 +1,81 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE \
    DP_ATTENTION

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

export OMP_NUM_THREADS=1

# Calculate max-model-len based on ISL and OSL
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
    CALCULATED_MAX_MODEL_LEN=""
else
    CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi
Comment on lines +28 to +32
🔴 The new glm5_fp4_mi355x_atom.sh is missing the EVAL_ONLY handling block present in other atom scripts (e.g., dsr1_fp4_mi355x_atom.sh, gptoss_fp4_mi355x_atom.sh lines 34-37). When EVAL_ONLY=true is set, the server starts with --max-model-len 10240 instead of EVAL_MAX_MODEL_LEN (typically 16384+), causing a context-length mismatch between the server and lm-eval that leads to OOM errors or silent evaluation failures. Add the missing block after line 32: if [ "${EVAL_ONLY}" = "true" ]; then setup_eval_context; CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "; fi.

Extended reasoning...

What the bug is and how it manifests

The glm5_fp4_mi355x_atom.sh script is missing the EVAL_ONLY context-setup block that peer atom scripts use to reconfigure the server's context length when running in evaluation-only mode. Without it, when a user sets EVAL_ONLY=true, the script skips the throughput benchmark (handled by run_benchmark_serving's internal check) but the server was already started with an undersized context window (--max-model-len 10240 for non-1024/1024 workloads). The lm-eval framework then requests completions up to EVAL_MAX_MODEL_LEN (default 16384 or computed from ISL+OSL+200), which exceeds the server's configured limit.

The specific code path that triggers it

In other atom scripts (e.g., dsr1_fp4_mi355x_atom.sh lines 34-37, gptoss_fp4_mi355x_atom.sh lines 34-37), after computing CALCULATED_MAX_MODEL_LEN from ISL/OSL, there is:

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "
fi

setup_eval_context (defined in benchmark_lib.sh) sets EVAL_MAX_MODEL_LEN based on the model's native context length (or ISL+OSL+200), then CALCULATED_MAX_MODEL_LEN is overridden to that larger value before the server starts. The new script omits this block entirely — the server always starts with whichever context limit was computed from the ISL/OSL check (10240 for most workloads).

Why existing code doesn't prevent it

EVAL_ONLY is not listed in check_env_vars, so there is no runtime guard. The script's RUN_EVAL check on line 69 only gates the lm-eval call; it does not fix the server's context size. Once the server is started with --max-model-len 10240, it cannot be reconfigured without restarting.

Impact

Any CI or ad-hoc run with EVAL_ONLY=true will either: (a) receive context-exceeded errors from the server when lm-eval submits prompts longer than 10240 tokens, producing incorrect/truncated scores; (b) trigger OOM on the GPU as the server tries to accommodate oversized KV-cache requests; or (c) silently truncate outputs, yielding misleadingly low evaluation numbers. The benchmark mode is unaffected, so this bug is invisible during normal benchmarking.

Addressing the refutation

The refutation notes that glm5_fp8_mi355x_atom.sh and kimik2.5_fp4_mi355x_atom.sh also lack this block — true, but that means those scripts share the same latent bug. The fact that the fp4 script was modeled after fp8 means it inherited the deficiency rather than being deliberately designed without eval support. The dsr1 and gptoss scripts demonstrate the correct pattern. Two scripts being broken does not make a third script correct.

How to fix it and step-by-step proof

Add after line 32 (after the ISL/OSL CALCULATED_MAX_MODEL_LEN block):

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "
fi

Concrete example: ISL=4096, OSL=1024 (a common eval workload).

  1. ISL≠1024 or OSL≠1024, so CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 ".
  2. User sets EVAL_ONLY=true.
  3. Without the fix: the server starts with --max-model-len 10240. setup_eval_context is never called, so EVAL_MAX_MODEL_LEN falls back to 0 or the model default (e.g., 131072 for GLM-4). lm-eval submits a prompt of 4096+1024+200=5320 tokens, just within 10240, so there is no immediate crash; but the eval context is wrong, and GLM-5's larger native context means lm-eval may use sequences far exceeding 10240, causing rejections or OOM.
  4. With the fix: setup_eval_context computes EVAL_MAX_MODEL_LEN=5320, server starts with --max-model-len 5320, lm-eval uses the same limit — consistent and correct.
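The arithmetic in the worked example can be sketched as follows. This is a hedged approximation of the setup_eval_context helper described above; the real function lives in benchmark_lib.sh and may also consult the model's native context length (e.g. the 16384 default mentioned earlier).

```shell
# Sketch of the fallback path described above: EVAL_MAX_MODEL_LEN is
# derived from ISL + OSL plus ~200 tokens of chat-template headroom.
# The exact logic in benchmark_lib.sh is assumed, not quoted.
setup_eval_context_sketch() {
    EVAL_MAX_MODEL_LEN=${EVAL_MAX_MODEL_LEN:-$(( ISL + OSL + 200 ))}
}

ISL=4096
OSL=1024
setup_eval_context_sketch
echo "EVAL_MAX_MODEL_LEN=$EVAL_MAX_MODEL_LEN"   # 5320, matching the worked example
```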


if [ "$EP_SIZE" -gt 1 ]; then
    EP=" --enable-expert-parallel"
else
    EP=" "
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
pip install -U transformers
🟣 Line 44 runs pip install -U transformers with no version pin, making benchmark results non-reproducible across runs and vulnerable to silent API breakage if a new transformers release is incompatible with atom or GLM5 FP4. This same unpinned install exists in the fp8 atom counterpart script (pre-existing), but other GLM5 scripts in the repo (glm5_fp8_b200.sh, glm5_fp8_mi355x.sh) use pinned versions — consider pinning to a specific version or git SHA before publishing as an official benchmark.

Extended reasoning...

What the bug is and how it manifests

Line 44 of glm5_fp4_mi355x_atom.sh runs pip install -U transformers immediately before launching the atom server, with no version constraint. This means each benchmark run installs whatever the latest transformers release is at that moment. Two runs separated by days or weeks could use different transformers versions, producing incomparable results.

The specific code path

The install runs inside set -x after GPU monitoring has started but before the server process is launched. If the install fetches an incompatible transformers version, the atom server may fail to start or behave incorrectly — and since the script uses background launch + wait_for_server_ready, a subtle startup failure might not be immediately obvious.

Why existing code doesn't prevent it

The -U (upgrade) flag explicitly requests the latest available version. There is no ==, >=, or git+...@<sha> constraint anywhere. The check_env_vars block at the top checks for required environment variables but does not validate the software environment.

Why the refutation deserves a response

One verifier argues this is intentional, since the fp8 counterpart (glm5_fp8_mi355x_atom.sh) uses the same unversioned install. Consistency with an existing pattern does not eliminate the risk — it only means the issue is pre-existing. The fp8 atom script has the same reproducibility problem, and the fp4 script inherits it. The comparison scripts for other backends (glm5_fp8_b200.sh: transformers==5.2.0; glm5_fp8_mi355x.sh: pinned git SHA) demonstrate the repo does value version pinning elsewhere.

Step-by-step proof of the problem

  1. Run A (today): pip install -U transformers installs transformers 4.51.0.
  2. Benchmark completes successfully; results recorded.
  3. transformers 4.52.0 is released with a breaking change to a tokenizer API used by GLM5.
  4. Run B (next week): pip install -U transformers installs 4.52.0.
  5. The atom server starts but produces different output, or fails silently — Run B results are incomparable with Run A.

How to fix it

Pin to the same version or commit SHA used by sister scripts. For example:
pip install --no-deps "transformers==5.2.0"
or
pip install -U --no-cache-dir "git+https://github.com/huggingface/transformers.git@<specific-sha>"
Ideally, also fix glm5_fp8_mi355x_atom.sh at the same time since it has the same issue.
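Beyond pinning, a fail-fast guard before the server launch would catch a drifted environment early. A minimal sketch of that pattern; the package and version names here are illustrative, not taken from the repo:

```python
# Hypothetical fail-fast version guard: abort the benchmark if the
# installed package does not match the pinned version. Names passed to
# check_pin are examples; the repo's actual pins are in its scripts.
from importlib.metadata import version, PackageNotFoundError

def check_pin(package, pinned):
    try:
        installed = version(package)
    except PackageNotFoundError:
        raise SystemExit(f"{package} not installed; expected {pinned}")
    if installed != pinned:
        raise SystemExit(f"{package}=={installed} installed, benchmark pinned {pinned}")
    return installed

# Trivial self-check against whatever pip version is present locally:
print(check_pin("pip", version("pip")))
```

The same check could be one line of shell in the benchmark script itself; the point is to fail before results are produced rather than after.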

python3 -m atom.entrypoints.openai_server \
    --model $MODEL \
    --server-port $PORT \
    -tp $TP \
    --kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
    --default-chat-template-kwargs '{"enable_thinking": false}' \
    --trust-remote-code \
    > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

export PYTHONDONTWRITEBYTECODE=1
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x