Add Mlperf llama3.1 8b in primus #656
base: main

**README** (new file, +55 lines)

# LLama3.1 8B MLPerf Pretraining

MLPerf-compliant LLama3.1 8B pretraining using Primus.

## Setup

### Start Docker Image

```bash
export MLPERF_PAT=<your_github_pat>
docker run -it --device /dev/dri --device /dev/kfd --device /dev/infiniband \
  --network host --ipc host --group-add video --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined --privileged -v $HOME:$HOME \
  --shm-size 128G --name primus_training_env rocm/primus:v26.2

git clone --recurse-submodules https://github.com/AMD-AIG-AIMA/Primus.git
cd Primus
```

### Configuration

- **Model**: LLama3.1 8B (4096 hidden size, 32 layers, 32 attention heads)
- **Training**: 1.2M iterations, GBS=32, MBS=2, LR=8e-4
- **Precision**: FP8 hybrid
- **Data**: C4 dataset (tokenized)

## Key Files

- `configs/MI355X/llama3.1_8B-pretrain.yaml` - Model and training config
  - Update `train_data_path` and `train_data_path` to your local downloaded location
- `config_MI355X_1x8x1.sh` - System config and env vars
  - Update `PRIMUS_PATH` to the cloned Primus repo
  - Update `EXP` to `<PRIMUS_PATH>/examples/mlperf/configs/MI355X/llama3.1_8B-pretrain-FP8.yaml`
- `src/train.py` - Training entry point
- `run_and_time.sh` - Run script

### Data

Download the preprocessed C4 dataset:

```bash
cd /data/mlperf_llama31_8b
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
  -d data https://training.mlcommons-storage.org/metadata/llama-3-1-8b-preprocessed-c4-dataset.uri
```

### How to run

```bash
export HF_TOKEN=<your_huggingface_token>
source config_MI355X_1x8x1.sh
bash run_and_time.sh
```

## Notes

- `log_interval: 99999999` suppresses regular Primus logs

Suggested change (review):

```diff
-- `log_interval: 99999999` suppresses regular Primus logs
+- `log_interval: 999999` suppresses regular Primus logs
```
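
The batch settings stated in the Configuration section (GBS=32, MBS=2, 8 GPUs on one node) imply a gradient-accumulation factor of 2, matching the comment in the training YAML. A quick sanity check of that arithmetic, using illustrative variable names rather than Primus APIs:

```python
# Data-parallel batch arithmetic:
#   global_batch_size = micro_batch_size * grad_accum_steps * data_parallel_ranks
global_batch_size = 32
micro_batch_size = 2
gpus_per_node = 8
nnodes = 1

# Assumes pure data parallelism (no tensor/pipeline splitting of the 8 ranks).
data_parallel_ranks = gpus_per_node * nnodes
grad_accum_steps = global_batch_size // (micro_batch_size * data_parallel_ranks)

assert micro_batch_size * grad_accum_steps * data_parallel_ranks == global_batch_size
print(f"gradient accumulation steps: {grad_accum_steps}")  # 2
```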

**config_MI355X_1x8x1.sh** (new file, +75 lines)

```bash
#!/bin/bash
# MLPerf LLama3.1 8B Configuration for MI355X (1x8x1)

export DGXSYSTEM=MI355X_1x8x1
export GPUS_PER_NODE=8
export NNODES=1
export NODE_RANK=0
export MASTER_ADDR=localhost
export MASTER_PORT=29502

export PRIMUS_PATH=/home/vidgoyal/Primus-dev/Primus/
export PRIMUS_MLPERF=1
export PYTHONPATH="${PRIMUS_PATH}:${PRIMUS_PATH}/third_party/Megatron-LM:${PYTHONPATH}"
export EXP=/home/vidgoyal/Primus-dev/Primus/examples/mlperf/configs/MI355X/llama3.1_8B-pretrain-FP8.yaml
```

Review comment on lines +11 to +14, suggested change:

```diff
-export PRIMUS_PATH=/home/vidgoyal/Primus-dev/Primus/
-export PRIMUS_MLPERF=1
-export PYTHONPATH="${PRIMUS_PATH}:${PRIMUS_PATH}/third_party/Megatron-LM:${PYTHONPATH}"
-export EXP=/home/vidgoyal/Primus-dev/Primus/examples/mlperf/configs/MI355X/llama3.1_8B-pretrain-FP8.yaml
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
+export PRIMUS_PATH="${PRIMUS_PATH:-${REPO_ROOT}}"
+export PRIMUS_MLPERF=1
+export PYTHONPATH="${PRIMUS_PATH}:${PRIMUS_PATH}/third_party/Megatron-LM:${PYTHONPATH}"
+export EXP="${EXP:-${PRIMUS_PATH}/examples/mlperf/configs/MI355X/llama3.1_8B-pretrain-FP8.yaml}"
```

Copilot AI, Apr 17, 2026:

> This config script hardcodes developer-specific absolute paths for `PRIMUS_PATH` and `EXP`, which prevents reuse on other systems. Consider making these derived from the script location (e.g., repo root) or requiring them as inputs, and keep only portable defaults.

Copilot AI, Apr 10, 2026:

> This script exports `MLLOG_TARGET_EVAL_LOSS`, but the early-stop logic added in `primus/modules/trainer/megatron/trainer.py` reads `TARGET_EVAL_LOSS`. Either export `TARGET_EVAL_LOSS` here as well, or update the trainer to consume `MLLOG_TARGET_EVAL_LOSS` to keep the MLPerf workflow consistent.

Suggested change:

```diff
 export MLLOG_TARGET_EVAL_LOSS=3.3
+export TARGET_EVAL_LOSS="${MLLOG_TARGET_EVAL_LOSS}"
```

Copilot AI, Apr 17, 2026:

> `TORCHPROF_OUTPUT_DIR` is hardcoded to a developer home directory. This will break in containerized/CI runs and on other machines. Prefer a relative path, a `/results/...` default, or require callers to set the env var.

**llama3.1_8B-pretrain.yaml** (new file, +119 lines)

```yaml
work_group: ${TEAM:amd}
user_name: ${USER:root}
exp_name: ${EXP_NAME:llama3.1_8B-pretrain-v26.2}
workspace: ./output

modules:
  pre_trainer:
    framework: megatron
    config: pre_trainer.yaml

    # model to run
    model: llama3.1_8B.yaml
    overrides:
      # --- Logging Config ---
      wandb_project: "Primus-llama3.1-8B-pretrain"
      disable_wandb: true
      disable_tensorboard: true
      stderr_sink_level: DEBUG
      log_interval: 999999
      log_avg_skip_iterations: 2
      log_avg_reset_interval: 50

      eval_iters: 32  # 32 * GBS = 1024 eval samples
      eval_interval: ${PRIMUS_EVAL_INTERVAL:10}  # evaluate every 10 iterations (10 * GBS = 320 training samples between evals)

      # --- Training Config ---
      train_iters: ${PRIMUS_TRAIN_ITERS:200}
      micro_batch_size: 2  # grad_acc = global_batch_size / (micro_batch_size * num_gpus) = 32 / (2 * 8) = 2
      global_batch_size: 32

      seq_length: 8192
```

Copilot AI, Apr 17, 2026, on the second `seq_length: 8192` in the data section:

> `seq_length` is defined twice under `overrides` (once near the top and again in the data section). YAML will keep only the latter, which is easy to miss and can cause confusing config drift. Remove the duplicate key (or add a comment explaining intentional override).

Suggested change:

```diff
 test_data_path: null
+# Intentionally overrides an earlier `seq_length` in `overrides`; 8192 is the effective value.
```

Copilot AI, Apr 10, 2026:

> `seq_length` is defined again here, duplicating the earlier `seq_length` setting. Please remove one of the duplicate keys to avoid ambiguity and YAML-parser incompatibilities.

Suggested change:

```diff
-seq_length: 8192
```
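
The last-key-wins behavior both comments describe is easy to reproduce. A minimal sketch with PyYAML (a common parser, assumed installed here; the keys mimic the config under review):

```python
import yaml  # PyYAML

# A mapping with `seq_length` defined twice. PyYAML's safe_load raises no
# error for the duplicate: the later value silently replaces the earlier
# one, which is exactly why the duplication is easy to miss in review.
doc = """
overrides:
  seq_length: 4096
  global_batch_size: 32
  seq_length: 8192
"""
cfg = yaml.safe_load(doc)
print(cfg["overrides"]["seq_length"])  # 8192
```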

**run_and_time.sh** (new file, +61 lines)

```bash
#!/bin/bash

set -e

mkdir -p /results

export GPUS_PER_NODE=${GPUS_PER_NODE:-8}
export NNODES=${NNODES:-1}
export NODE_RANK=${NODE_RANK:-0}
export MASTER_ADDR=${MASTER_ADDR:-localhost}
export MASTER_PORT=${MASTER_PORT:-29502}
export EXP=${EXP:-/workspace/code/conf/llama3.1_8B-pretrain.yaml}
export DATA_PATH=${DATA_PATH:-/data}

echo "============================================"
echo "MLPerf LLama3.1 8B Training"
echo "============================================"
echo "Config: ${EXP}"
echo "Data: ${DATA_PATH}"
echo "GPUs: ${GPUS_PER_NODE}"
echo "Nodes: ${NNODES}"
echo "Train iters: ${PRIMUS_TRAIN_ITERS}"
echo "Eval interval: ${PRIMUS_EVAL_INTERVAL}"
echo "Enable MLPerf logging: ${ENABLE_MLPERF}"
echo "MLLOG_TRAIN_LOSS_LOG_FREQ: ${MLLOG_TRAIN_LOSS_LOG_FREQ}"
echo "MLLOG_TARGET_EVAL_LOSS: ${MLLOG_TARGET_EVAL_LOSS}"
echo "MLLOG_SUBMISSION_BENCHMARK: ${MLLOG_SUBMISSION_BENCHMARK}"
echo "MLLOG_SUBMISSION_DIVISION: ${MLLOG_SUBMISSION_DIVISION}"
echo "MLLOG_SUBMISSION_ORG: ${MLLOG_SUBMISSION_ORG}"
echo "MLLOG_SUBMISSION_PLATFORM: ${MLLOG_SUBMISSION_PLATFORM}"
echo "============================================"

start=$(date +%s)
start_fmt=$(date +%Y-%m-%d\ %r)
echo "STARTING TIMING RUN AT $start_fmt"

torchrun \
    --nproc_per_node=${GPUS_PER_NODE} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    src/train.py

ret_code=$?
```

Review comment on lines +37 to +46, suggested change:

```diff
+ret_code=0
 torchrun \
     --nproc_per_node=${GPUS_PER_NODE} \
     --nnodes=${NNODES} \
     --node_rank=${NODE_RANK} \
     --master_addr=${MASTER_ADDR} \
     --master_port=${MASTER_PORT} \
-    src/train.py
-ret_code=$?
+    src/train.py || ret_code=$?
```

Copilot AI, Apr 17, 2026:

> This script uses `set -e` but then tries to capture `ret_code=$?` after torchrun. With `-e`, a non-zero torchrun exit will abort the script immediately, so ret_code/timing output won't be recorded. If you want timing even on failure, temporarily disable `-e` around torchrun (or use an `if ...; then ...; fi` pattern) and handle the exit code explicitly.
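
The interaction between `set -e` and `$?` that the comment points out, and the `|| ret_code=$?` fix from the suggested change, can be demonstrated with a minimal script (the `fail` function is a hypothetical stand-in for a failing torchrun run):

```shell
#!/bin/bash
set -e

# Stand-in for a failing torchrun invocation (exits with status 3).
fail() { return 3; }

ret_code=0
# Under `set -e`, a bare `fail` here would abort the script before any
# `ret_code=$?` line could run. Putting it in a `||` list disables errexit
# for that command and captures its exit status instead.
fail || ret_code=$?

echo "captured exit code: ${ret_code}"
```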

Review comment:

> The README says to update `train_data_path` twice. The second one should likely be `valid_data_path` (or whichever validation key is used in the config), otherwise readers may miss updating the validation dataset path.