Conversation
Signed-off-by: seungrokj <seungrok.jung@amd.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
```shell
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
    CALCULATED_MAX_MODEL_LEN=""
else
    CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi
```
🔴 The new glm5_fp4_mi355x_atom.sh is missing the EVAL_ONLY handling block present in other atom scripts (e.g., dsr1_fp4_mi355x_atom.sh, gptoss_fp4_mi355x_atom.sh lines 34-37). When EVAL_ONLY=true is set, the server starts with --max-model-len 10240 instead of EVAL_MAX_MODEL_LEN (typically 16384+), causing a context-length mismatch between the server and lm-eval that leads to OOM errors or silent evaluation failures. Add the missing block after line 32: `if [ "${EVAL_ONLY}" = "true" ]; then setup_eval_context; CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "; fi`.
Extended reasoning...
What the bug is and how it manifests
The glm5_fp4_mi355x_atom.sh script is missing the EVAL_ONLY context-setup block that peer atom scripts use to reconfigure the server's context length when running in evaluation-only mode. Without it, when a user sets EVAL_ONLY=true, the script skips the throughput benchmark (handled by run_benchmark_serving's internal check) but the server was already started with an undersized context window (--max-model-len 10240 for non-1024/1024 workloads). The lm-eval framework then requests completions up to EVAL_MAX_MODEL_LEN (default 16384 or computed from ISL+OSL+200), which exceeds the server's configured limit.
The specific code path that triggers it
In other atom scripts (e.g., dsr1_fp4_mi355x_atom.sh lines 34-37, gptoss_fp4_mi355x_atom.sh lines 34-37), after computing CALCULATED_MAX_MODEL_LEN from ISL/OSL, there is:
```shell
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "
fi
```

`setup_eval_context` (defined in benchmark_lib.sh) sets EVAL_MAX_MODEL_LEN based on the model's native context length (or ISL+OSL+200), then CALCULATED_MAX_MODEL_LEN is overridden to that larger value before the server starts. The new script omits this block entirely; the server always starts with whichever context limit was computed from the ISL/OSL check (10240 for most workloads).
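The exact logic of `setup_eval_context` lives in benchmark_lib.sh; the following is only a hypothetical sketch of the behavior described above (the function name is real, but the body and the `MODEL_NATIVE_CONTEXT` variable are assumptions made here for illustration):

```shell
# Hypothetical sketch of setup_eval_context's described behavior; the real
# implementation is in benchmark_lib.sh and may differ. MODEL_NATIVE_CONTEXT
# is an assumed variable standing in for the model's native context limit.
setup_eval_context() {
    # Default eval context: input + output lengths plus a 200-token margin.
    local default_len=$(( ${ISL:-0} + ${OSL:-0} + 200 ))
    if [ -n "${MODEL_NATIVE_CONTEXT:-}" ]; then
        EVAL_MAX_MODEL_LEN="${MODEL_NATIVE_CONTEXT}"
    else
        EVAL_MAX_MODEL_LEN="${default_len}"
    fi
}
```

Under this reading, ISL=4096 and OSL=1024 with no native-context override yields EVAL_MAX_MODEL_LEN=5320, which matches the worked example later in this comment.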
Why existing code doesn't prevent it
EVAL_ONLY is not listed in check_env_vars, so there is no runtime guard. The script's RUN_EVAL check on line 69 only gates the lm-eval call; it does not fix the server's context size. Once the server has started with --max-model-len 10240, it cannot be reconfigured without a restart.
Impact
Any CI or ad-hoc run with EVAL_ONLY=true will either: (a) receive context-exceeded errors from the server when lm-eval submits prompts longer than 10240 tokens, producing incorrect/truncated scores; (b) trigger OOM on the GPU as the server tries to accommodate oversized KV-cache requests; or (c) silently truncate outputs, yielding misleadingly low evaluation numbers. The benchmark mode is unaffected, so this bug is invisible during normal benchmarking.
Addressing the refutation
The refutation notes that glm5_fp8_mi355x_atom.sh and kimik2.5_fp4_mi355x_atom.sh also lack this block — true, but that means those scripts share the same latent bug. The fact that the fp4 script was modeled after fp8 means it inherited the deficiency rather than being deliberately designed without eval support. The dsr1 and gptoss scripts demonstrate the correct pattern. Two scripts being broken does not make a third script correct.
How to fix it and step-by-step proof
Add after line 32 (after the ISL/OSL CALCULATED_MAX_MODEL_LEN block):
```shell
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "
fi
```

Concrete example: ISL=4096, OSL=1024 (a common eval workload).

- ISL≠1024 or OSL≠1024, so `CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "`.
- User sets `EVAL_ONLY=true`.
- Without the fix: the server starts with `--max-model-len 10240`. `setup_eval_context` is never called, so `EVAL_MAX_MODEL_LEN` falls back to 0 or the model default (e.g., 131072 for GLM-4). lm-eval submits a prompt of 4096+1024+200=5320 tokens, just within 10240, so there is no immediate crash, but the eval context is wrong, and GLM-5's larger native context means lm-eval may use sequences far exceeding 10240, causing rejections or OOM.
- With the fix: `setup_eval_context` computes `EVAL_MAX_MODEL_LEN=5320`, the server starts with `--max-model-len 5320`, and lm-eval uses the same limit: consistent and correct.
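The corrected computation can be exercised standalone. The sketch below mirrors the script's variable names but stubs `setup_eval_context` with the ISL+OSL+200 rule described above, so it is an illustration of the intended flow rather than the script itself:

```shell
# Standalone sketch of the corrected CALCULATED_MAX_MODEL_LEN flow.
# setup_eval_context is stubbed inline; the real one is in benchmark_lib.sh.
compute_max_model_len() {
    # Base context limit from the ISL/OSL check.
    if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
        CALCULATED_MAX_MODEL_LEN=""
    else
        CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
    fi
    # Eval-only override: match the server's limit to what lm-eval will use.
    if [ "${EVAL_ONLY:-false}" = "true" ]; then
        EVAL_MAX_MODEL_LEN=$(( ISL + OSL + 200 ))  # stand-in for setup_eval_context
        CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "
    fi
}
```

With ISL=4096, OSL=1024, and EVAL_ONLY=true this yields `--max-model-len 5320`, matching the worked example above; with EVAL_ONLY unset it reproduces the current 10240 behavior.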
```shell
start_gpu_monitor

set -x
pip install -U transformers
```
🟣 Line 44 runs pip install -U transformers with no version pin, making benchmark results non-reproducible across runs and vulnerable to silent API breakage if a new transformers release is incompatible with atom or GLM5 FP4. This same unpinned install exists in the fp8 atom counterpart script (pre-existing), but other GLM5 scripts in the repo (glm5_fp8_b200.sh, glm5_fp8_mi355x.sh) use pinned versions — consider pinning to a specific version or git SHA before publishing as an official benchmark.
Extended reasoning...
What the bug is and how it manifests
Line 44 of glm5_fp4_mi355x_atom.sh runs pip install -U transformers immediately before launching the atom server, with no version constraint. This means each benchmark run installs whatever the latest transformers release is at that moment. Two runs separated by days or weeks could use different transformers versions, producing incomparable results.
The specific code path
The install runs inside set -x after GPU monitoring has started but before the server process is launched. If the install fetches an incompatible transformers version, the atom server may fail to start or behave incorrectly — and since the script uses background launch + wait_for_server_ready, a subtle startup failure might not be immediately obvious.
Why existing code doesn't prevent it
The -U (upgrade) flag explicitly requests the latest available version. There is no ==, >=, or git+...@<sha> constraint anywhere. The check_env_vars block at the top checks for required environment variables but does not validate the software environment.
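A cheap runtime guard could close that gap. The sketch below is hypothetical: `check_transformers_pin` and `EXPECTED_TRANSFORMERS_VERSION` are invented names, and the caller would pass in the output of `python -c "import transformers; print(transformers.__version__)"`:

```shell
# Hypothetical guard: compare the installed transformers version against the
# version the benchmark was validated with, and fail fast on any drift.
EXPECTED_TRANSFORMERS_VERSION="5.2.0"  # assumed pin, following glm5_fp8_b200.sh

check_transformers_pin() {
    local installed="$1"
    if [ "$installed" != "$EXPECTED_TRANSFORMERS_VERSION" ]; then
        echo "transformers $installed != expected $EXPECTED_TRANSFORMERS_VERSION" >&2
        return 1
    fi
}
```

Such a check would not make the install reproducible by itself, but it would turn a silent version drift into an immediate, visible failure.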
Why the refutation deserves a response
One verifier argues this is intentional, since the fp8 counterpart (glm5_fp8_mi355x_atom.sh) uses the same unversioned install. Consistency with an existing pattern does not eliminate the risk — it only means the issue is pre-existing. The fp8 atom script has the same reproducibility problem, and the fp4 script inherits it. The comparison scripts for other backends (glm5_fp8_b200.sh: transformers==5.2.0; glm5_fp8_mi355x.sh: pinned git SHA) demonstrate the repo does value version pinning elsewhere.
Step-by-step proof of the problem
- Run A (today): `pip install -U transformers` installs transformers 4.51.0. Benchmark completes successfully; results are recorded.
- transformers 4.52.0 is released with a breaking change to a tokenizer API used by GLM5.
- Run B (next week): `pip install -U transformers` installs 4.52.0. The atom server starts but produces different output, or fails silently; Run B results are incomparable with Run A.
How to fix it
Pin to the same version or commit SHA used by sister scripts. For example:
```shell
pip install --no-deps "transformers==5.2.0"
```

or

```shell
pip install -U --no-cache-dir "git+https://github.com/huggingface/transformers.git@<specific-sha>"
```
Ideally, also fix glm5_fp8_mi355x_atom.sh at the same time since it has the same issue.
hi,
WIP.
internally tested. shipping soon.
used `amd/GLM-5.1-MXFP4` instead of `amd/GLM-5-MXFP4`
cc. @ChangLiu0709 @andyluo7 @chunfangamd @ajith-sirra-amd
Regards,
Seungrok