Conversation
Signed-off-by: seungrokj <seungrok.jung@amd.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
```shell
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
    CALCULATED_MAX_MODEL_LEN=""
else
    CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi
```
🔴 The new glm5_fp4_mi355x_atom.sh is missing the EVAL_ONLY handling block present in other atom scripts (e.g., dsr1_fp4_mi355x_atom.sh, gptoss_fp4_mi355x_atom.sh lines 34-37). When EVAL_ONLY=true is set, the server starts with --max-model-len 10240 instead of EVAL_MAX_MODEL_LEN (typically 16384+), causing a context-length mismatch between the server and lm-eval that leads to OOM errors or silent evaluation failures. Add the missing block after line 32: `if [ "${EVAL_ONLY}" = "true" ]; then setup_eval_context; CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "; fi`.
Extended reasoning...
What the bug is and how it manifests
The glm5_fp4_mi355x_atom.sh script is missing the EVAL_ONLY context-setup block that peer atom scripts use to reconfigure the server's context length when running in evaluation-only mode. Without it, when a user sets EVAL_ONLY=true, the script skips the throughput benchmark (handled by run_benchmark_serving's internal check) but the server was already started with an undersized context window (--max-model-len 10240 for non-1024/1024 workloads). The lm-eval framework then requests completions up to EVAL_MAX_MODEL_LEN (default 16384 or computed from ISL+OSL+200), which exceeds the server's configured limit.
The specific code path that triggers it
In other atom scripts (e.g., dsr1_fp4_mi355x_atom.sh lines 34-37, gptoss_fp4_mi355x_atom.sh lines 34-37), after computing CALCULATED_MAX_MODEL_LEN from ISL/OSL, there is:
```shell
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "
fi
```

`setup_eval_context` (defined in benchmark_lib.sh) sets EVAL_MAX_MODEL_LEN based on the model's native context length (or ISL+OSL+200), then CALCULATED_MAX_MODEL_LEN is overridden to that larger value before the server starts. The new script omits this block entirely; the server always starts with whichever context limit was computed from the ISL/OSL check (10240 for most workloads).
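The exact logic of `setup_eval_context` lives in benchmark_lib.sh; the following is only a hypothetical sketch of the behavior described above (the function name is real, but the body and the `MODEL_NATIVE_CONTEXT` variable are assumptions made here for illustration):

```shell
# Hypothetical sketch of setup_eval_context's described behavior; the real
# implementation is in benchmark_lib.sh and may differ. MODEL_NATIVE_CONTEXT
# is an assumed variable standing in for the model's native context limit.
setup_eval_context() {
    # Default eval context: input + output lengths plus a 200-token margin.
    local default_len=$(( ${ISL:-0} + ${OSL:-0} + 200 ))
    if [ -n "${MODEL_NATIVE_CONTEXT:-}" ]; then
        EVAL_MAX_MODEL_LEN="${MODEL_NATIVE_CONTEXT}"
    else
        EVAL_MAX_MODEL_LEN="${default_len}"
    fi
}
```

Under this reading, ISL=4096 and OSL=1024 with no native-context override yields EVAL_MAX_MODEL_LEN=5320, which matches the worked example later in this comment.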
Why existing code doesn't prevent it
EVAL_ONLY is not listed in check_env_vars, so there is no runtime guard. The script's RUN_EVAL check on line 69 only gates the lm-eval call; it does not fix the server's context size. Once the server has started with --max-model-len 10240, it cannot be reconfigured without a restart.
Impact
Any CI or ad-hoc run with EVAL_ONLY=true will either: (a) receive context-exceeded errors from the server when lm-eval submits prompts longer than 10240 tokens, producing incorrect/truncated scores; (b) trigger OOM on the GPU as the server tries to accommodate oversized KV-cache requests; or (c) silently truncate outputs, yielding misleadingly low evaluation numbers. The benchmark mode is unaffected, so this bug is invisible during normal benchmarking.
Addressing the refutation
The refutation notes that glm5_fp8_mi355x_atom.sh and kimik2.5_fp4_mi355x_atom.sh also lack this block — true, but that means those scripts share the same latent bug. The fact that the fp4 script was modeled after fp8 means it inherited the deficiency rather than being deliberately designed without eval support. The dsr1 and gptoss scripts demonstrate the correct pattern. Two scripts being broken does not make a third script correct.
How to fix it and step-by-step proof
Add after line 32 (after the ISL/OSL CALCULATED_MAX_MODEL_LEN block):
```shell
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "
fi
```

Concrete example: ISL=4096, OSL=1024 (a common eval workload).

- ISL≠1024 or OSL≠1024, so `CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "`.
- User sets `EVAL_ONLY=true`.
- Without the fix: the server starts with `--max-model-len 10240`. `setup_eval_context` is never called, so `EVAL_MAX_MODEL_LEN` falls back to 0 or the model default (e.g., 131072 for GLM-4). lm-eval submits a prompt of 4096+1024+200=5320 tokens, just within 10240, so there is no immediate crash, but the eval context is wrong, and GLM-5's larger native context means lm-eval may use sequences far exceeding 10240, causing rejections or OOM.
- With the fix: `setup_eval_context` computes `EVAL_MAX_MODEL_LEN=5320`, the server starts with `--max-model-len 5320`, and lm-eval uses the same limit: consistent and correct.
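The corrected computation can be exercised standalone. The sketch below mirrors the script's variable names but stubs `setup_eval_context` with the ISL+OSL+200 rule described above, so it is an illustration of the intended flow rather than the script itself:

```shell
# Standalone sketch of the corrected CALCULATED_MAX_MODEL_LEN flow.
# setup_eval_context is stubbed inline; the real one is in benchmark_lib.sh.
compute_max_model_len() {
    # Base context limit from the ISL/OSL check.
    if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
        CALCULATED_MAX_MODEL_LEN=""
    else
        CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
    fi
    # Eval-only override: match the server's limit to what lm-eval will use.
    if [ "${EVAL_ONLY:-false}" = "true" ]; then
        EVAL_MAX_MODEL_LEN=$(( ISL + OSL + 200 ))  # stand-in for setup_eval_context
        CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "
    fi
}
```

With ISL=4096, OSL=1024, and EVAL_ONLY=true this yields `--max-model-len 5320`, matching the worked example above; with EVAL_ONLY unset it reproduces the current 10240 behavior.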
```shell
start_gpu_monitor

set -x
pip install -U transformers
```
🟣 Line 44 runs pip install -U transformers with no version pin, making benchmark results non-reproducible across runs and vulnerable to silent API breakage if a new transformers release is incompatible with atom or GLM5 FP4. This same unpinned install exists in the fp8 atom counterpart script (pre-existing), but other GLM5 scripts in the repo (glm5_fp8_b200.sh, glm5_fp8_mi355x.sh) use pinned versions — consider pinning to a specific version or git SHA before publishing as an official benchmark.
Extended reasoning...
What the bug is and how it manifests
Line 44 of glm5_fp4_mi355x_atom.sh runs pip install -U transformers immediately before launching the atom server, with no version constraint. This means each benchmark run installs whatever the latest transformers release is at that moment. Two runs separated by days or weeks could use different transformers versions, producing incomparable results.
The specific code path
The install runs inside set -x after GPU monitoring has started but before the server process is launched. If the install fetches an incompatible transformers version, the atom server may fail to start or behave incorrectly — and since the script uses background launch + wait_for_server_ready, a subtle startup failure might not be immediately obvious.
Why existing code doesn't prevent it
The -U (upgrade) flag explicitly requests the latest available version. There is no ==, >=, or git+...@<sha> constraint anywhere. The check_env_vars block at the top checks for required environment variables but does not validate the software environment.
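A cheap runtime guard could close that gap. The sketch below is hypothetical: `check_transformers_pin` and `EXPECTED_TRANSFORMERS_VERSION` are invented names, and the caller would pass in the output of `python -c "import transformers; print(transformers.__version__)"`:

```shell
# Hypothetical guard: compare the installed transformers version against the
# version the benchmark was validated with, and fail fast on any drift.
EXPECTED_TRANSFORMERS_VERSION="5.2.0"  # assumed pin, following glm5_fp8_b200.sh

check_transformers_pin() {
    local installed="$1"
    if [ "$installed" != "$EXPECTED_TRANSFORMERS_VERSION" ]; then
        echo "transformers $installed != expected $EXPECTED_TRANSFORMERS_VERSION" >&2
        return 1
    fi
}
```

Such a check would not make the install reproducible by itself, but it would turn a silent version drift into an immediate, visible failure.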
Why the refutation deserves a response
One verifier argues this is intentional, since the fp8 counterpart (glm5_fp8_mi355x_atom.sh) uses the same unversioned install. Consistency with an existing pattern does not eliminate the risk — it only means the issue is pre-existing. The fp8 atom script has the same reproducibility problem, and the fp4 script inherits it. The comparison scripts for other backends (glm5_fp8_b200.sh: transformers==5.2.0; glm5_fp8_mi355x.sh: pinned git SHA) demonstrate the repo does value version pinning elsewhere.
Step-by-step proof of the problem
- Run A (today): `pip install -U transformers` installs transformers 4.51.0. Benchmark completes successfully; results are recorded.
- transformers 4.52.0 is released with a breaking change to a tokenizer API used by GLM5.
- Run B (next week): `pip install -U transformers` installs 4.52.0. The atom server starts but produces different output, or fails silently; Run B results are incomparable with Run A.
How to fix it
Pin to the same version or commit SHA used by sister scripts. For example:
```shell
pip install --no-deps "transformers==5.2.0"
```

or

```shell
pip install -U --no-cache-dir "git+https://github.com/huggingface/transformers.git@<specific-sha>"
```
Ideally, also fix glm5_fp8_mi355x_atom.sh at the same time since it has the same issue.
hi,
WIP.
internally tested. shipping soon.
used `amd/GLM-5.1-MXFP4` instead of `amd/GLM-5-MXFP4`
cc. @ChangLiu0709 @andyluo7 @chunfangamd @ajith-sirra-amd
Regards,
Seungrok