[WIP] [AMD/ROCM] atom glm5 fp4 on mi355x #1043
base: main
`glm5_fp4_mi355x_atom.sh` (new file, `@@ -0,0 +1,81 @@`):

```bash
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE \
    DP_ATTENTION

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

export OMP_NUM_THREADS=1

# Calculate max-model-len based on ISL and OSL
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
    CALCULATED_MAX_MODEL_LEN=""
else
    CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi

if [ "$EP_SIZE" -gt 1 ]; then
    EP=" --enable-expert-parallel"
else
    EP=" "
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
pip install -U transformers
```
**Contributor**

🟣 Line 44 runs `pip install -U transformers` with no version pin, so the installed `transformers` build can change from run to run. One verifier argues this is intentional, since the fp8 counterpart installs the same way, but inheriting an unpinned install from a sister script does not make the benchmark reproducible.

**How to fix it:** Pin to the same version or commit SHA used by sister scripts.
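A hedged sketch of such a pin; the version string below is a placeholder, since the exact version the sister scripts use is not visible in this excerpt:

```shell
#!/usr/bin/env bash
# Placeholder pin -- substitute the version actually used by the sister
# fp8/fp4 scripts; 4.46.0 is illustrative, not taken from this PR.
TRANSFORMERS_PIN="${TRANSFORMERS_PIN:-transformers==4.46.0}"

# Pinned install instead of `pip install -U transformers`. Echoed rather
# than executed so the sketch has no side effects.
echo "pip install ${TRANSFORMERS_PIN}"

# Alternative: pin to an exact commit when an unreleased fix is required.
echo 'pip install "git+https://github.com/huggingface/transformers@<commit-sha>"'
```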
```bash
python3 -m atom.entrypoints.openai_server \
    --model $MODEL \
    --server-port $PORT \
    -tp $TP \
    --kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
    --default-chat-template-kwargs '{"enable_thinking": false}' \
    --trust-remote-code \
    > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

export PYTHONDONTWRITEBYTECODE=1
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```
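The helpers used above (`check_env_vars`, `wait_for_server_ready`, `run_benchmark_serving`, and friends) come from `benchmark_lib.sh`, which is outside this diff. A minimal illustrative stand-in for the env-var guard, assuming typical fail-fast behavior (the real implementation may differ):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a check_env_vars-style guard; the real helper
# lives in benchmark_lib.sh and is not shown in this PR.
check_env_vars() {
    local missing=0 var
    for var in "$@"; do
        # ${!var} is bash indirect expansion: the value of the variable
        # whose name is stored in $var.
        if [ -z "${!var}" ]; then
            echo "ERROR: required environment variable $var is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: succeeds only when every named variable is set and non-empty.
MODEL="GLM-5-FP4" TP=8
check_env_vars MODEL TP && echo "all required vars present"
```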
🔴 The new `glm5_fp4_mi355x_atom.sh` is missing the `EVAL_ONLY` handling block present in other atom scripts (e.g., `dsr1_fp4_mi355x_atom.sh` and `gptoss_fp4_mi355x_atom.sh`, lines 34-37). When `EVAL_ONLY=true` is set, the server starts with `--max-model-len 10240` instead of `EVAL_MAX_MODEL_LEN` (typically 16384+), causing a context-length mismatch between the server and lm-eval that leads to OOM errors or silent evaluation failures. Add the missing block after line 32: `if [ "${EVAL_ONLY}" = "true" ]; then setup_eval_context; CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "; fi`.
**What the bug is and how it manifests**

The `glm5_fp4_mi355x_atom.sh` script is missing the `EVAL_ONLY` context-setup block that peer atom scripts use to reconfigure the server's context length when running in evaluation-only mode. Without it, when a user sets `EVAL_ONLY=true`, the script skips the throughput benchmark (handled by `run_benchmark_serving`'s internal check), but the server was already started with an undersized context window (`--max-model-len 10240` for non-1024/1024 workloads). The lm-eval framework then requests completions up to `EVAL_MAX_MODEL_LEN` (default 16384, or computed from ISL+OSL+200), which exceeds the server's configured limit.

**The specific code path that triggers it**

In other atom scripts (e.g., `dsr1_fp4_mi355x_atom.sh` lines 34-37, `gptoss_fp4_mi355x_atom.sh` lines 34-37), after computing `CALCULATED_MAX_MODEL_LEN` from ISL/OSL, `setup_eval_context` (defined in `benchmark_lib.sh`) sets `EVAL_MAX_MODEL_LEN` based on the model's native context length (or ISL+OSL+200), and `CALCULATED_MAX_MODEL_LEN` is then overridden to that larger value before the server starts. The new script omits this block entirely; the server always starts with whichever context limit was computed from the ISL/OSL check (10240 for most workloads).

**Why existing code doesn't prevent it**

`EVAL_ONLY` is not listed in `check_env_vars`, so there is no runtime guard. The script's `RUN_EVAL` check on line 69 only gates the lm-eval call; it does not fix the server's context size. Once the server is started with `--max-model-len 10240`, it cannot be reconfigured without restarting.

**Impact**

Any CI or ad-hoc run with `EVAL_ONLY=true` will either: (a) receive context-exceeded errors from the server when lm-eval submits prompts longer than 10240 tokens, producing incorrect or truncated scores; (b) trigger OOM on the GPU as the server tries to accommodate oversized KV-cache requests; or (c) silently truncate outputs, yielding misleadingly low evaluation numbers. The benchmark mode is unaffected, so this bug is invisible during normal benchmarking.

**Addressing the refutation**

The refutation notes that `glm5_fp8_mi355x_atom.sh` and `kimik2.5_fp4_mi355x_atom.sh` also lack this block. True, but that means those scripts share the same latent bug. The fp4 script being modeled after the fp8 one means it inherited the deficiency rather than being deliberately designed without eval support. The `dsr1` and `gptoss` scripts demonstrate the correct pattern; two scripts being broken does not make a third script correct.

**How to fix it and step-by-step proof**

Add the `EVAL_ONLY` block after line 32 (after the ISL/OSL `CALCULATED_MAX_MODEL_LEN` block). Concrete example: ISL=4096, OSL=1024 (a common eval workload).

1. The ISL/OSL check sets `CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "`.
2. The user sets `EVAL_ONLY=true`.
3. The server starts with `--max-model-len 10240`.
4. `setup_eval_context` is never called, so `EVAL_MAX_MODEL_LEN` falls back to 0 or the model default (e.g., 131072 for GLM-4). lm-eval submits a prompt of 4096+1024+200 = 5320 tokens, just within 10240, so there is no immediate crash, but the eval context is wrong, and the larger GLM-5 native context means lm-eval may use sequences far exceeding 10240, causing rejections or OOM.
5. With the fix, `setup_eval_context` computes `EVAL_MAX_MODEL_LEN=5320`, the server starts with `--max-model-len 5320`, and lm-eval uses the same limit: consistent and correct.
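The steps above can be sketched as runnable bash. `setup_eval_context` is stubbed here to mimic the reviewer's description (`EVAL_MAX_MODEL_LEN = ISL + OSL + 200`); in the real scripts it comes from `benchmark_lib.sh` and may size the context differently:

```shell
#!/usr/bin/env bash
# Sketch of the missing EVAL_ONLY handling, with a stubbed helper.
ISL=${ISL:-4096}
OSL=${OSL:-1024}
EVAL_ONLY=${EVAL_ONLY:-true}

# Stub mimicking the reviewer's description of benchmark_lib.sh's helper:
# size the eval context from the workload rather than a fixed 10240.
setup_eval_context() {
    EVAL_MAX_MODEL_LEN=$((ISL + OSL + 200))
}

# Default sizing, as in the script under review.
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
    CALCULATED_MAX_MODEL_LEN=""
else
    CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi

# The proposed fix: override the limit so server and lm-eval agree.
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CALCULATED_MAX_MODEL_LEN=" --max-model-len $EVAL_MAX_MODEL_LEN "
fi

echo "server flags:${CALCULATED_MAX_MODEL_LEN}"
```

With the defaults above (ISL=4096, OSL=1024), the override replaces the fixed 10240 with 5320, matching step 5 of the proof.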