55 changes: 55 additions & 0 deletions examples/mlperf/README.md
@@ -0,0 +1,55 @@
# LLama3.1 8B MLPerf Pretraining

MLPerf-compliant LLama3.1 8B pretraining using Primus

## Setup

### Start Docker Image

```bash
export MLPERF_PAT=<your_github_pat>
docker run -it --device /dev/dri --device /dev/kfd --device /dev/infiniband --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME --shm-size 128G --name primus_training_env rocm/primus:v26.2


git clone --recurse-submodules https://github.com/AMD-AIG-AIMA/Primus.git
cd Primus
```


### Configuration

- **Model**: LLama3.1 8B (4096 hidden, 32 layers, 32 attention heads)
- **Training**: 1.2M iterations, GBS=32, MBS=2, LR=8e-4
- **Precision**: FP8 hybrid
- **Data**: C4 dataset (tokenized)

## Key Files

- `configs/MI355X/llama3.1_8B-pretrain-FP8.yaml` - Model and training config
- Update `train_data_path` and `train_data_path` to your local downloaded location
Copilot AI (Apr 10, 2026):

The README says to update `train_data_path` twice. The second one should likely be `valid_data_path` (or whichever validation key is used in the config), otherwise readers may miss updating the validation dataset path.

Suggested change:
- - Update `train_data_path` and `train_data_path` to your local downloaded location
+ - Update `train_data_path` and `valid_data_path` to your local downloaded location
- `config_MI355X_1x8x1.sh` - System config and env vars
- Update `PRIMUS_PATH` to the root of your cloned Primus repo
- Update `EXP` to `<PRIMUS_PATH>/examples/mlperf/configs/MI355X/llama3.1_8B-pretrain-FP8.yaml`
- `src/train.py` - Training entry point
- `run_and_time.sh` - Run script

### Data

Download preprocessed C4 dataset:

```bash
cd /data/mlperf_llama31_8b
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
-d data https://training.mlcommons-storage.org/metadata/llama-3-1-8b-preprocessed-c4-dataset.uri
```

### How to run

```bash
export HF_TOKEN=<your_huggingface_token>
source config_MI355X_1x8x1.sh
bash run_and_time.sh
```
## Notes

- `log_interval: 99999999` suppresses regular Primus logs
Copilot AI (Apr 17, 2026):

The note claims `log_interval: 99999999` suppresses logs, but the provided config sets `log_interval: 999999`. Please align the README with the actual value/config behavior to avoid confusion.

Suggested change:
- - `log_interval: 99999999` suppresses regular Primus logs
+ - `log_interval: 999999` suppresses regular Primus logs
75 changes: 75 additions & 0 deletions examples/mlperf/config_MI355X_1x8x1.sh
@@ -0,0 +1,75 @@
#!/bin/bash
# MLPerf LLama3.1 8B Configuration for MI355X (1x8x1)

export DGXSYSTEM=MI355X_1x8x1
export GPUS_PER_NODE=8
export NNODES=1
export NODE_RANK=0
export MASTER_ADDR=localhost
export MASTER_PORT=29502

export PRIMUS_PATH=/home/vidgoyal/Primus-dev/Primus/
export PRIMUS_MLPERF=1
export PYTHONPATH="${PRIMUS_PATH}:${PRIMUS_PATH}/third_party/Megatron-LM:${PYTHONPATH}"
export EXP=/home/vidgoyal/Primus-dev/Primus/examples/mlperf/configs/MI355X/llama3.1_8B-pretrain-FP8.yaml
Comment on lines +11 to +14

Copilot AI (Apr 10, 2026):

This config hardcodes developer-specific absolute paths for `PRIMUS_PATH` and `EXP`. For portability/reproducibility, consider making these relative (e.g., based on the script location) or using placeholders that users are instructed to set (similar to other env-var defaults).

Suggested change:
- export PRIMUS_PATH=/home/vidgoyal/Primus-dev/Primus/
- export PRIMUS_MLPERF=1
- export PYTHONPATH="${PRIMUS_PATH}:${PRIMUS_PATH}/third_party/Megatron-LM:${PYTHONPATH}"
- export EXP=/home/vidgoyal/Primus-dev/Primus/examples/mlperf/configs/MI355X/llama3.1_8B-pretrain-FP8.yaml
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
+ export PRIMUS_PATH="${PRIMUS_PATH:-${REPO_ROOT}}"
+ export PRIMUS_MLPERF=1
+ export PYTHONPATH="${PRIMUS_PATH}:${PRIMUS_PATH}/third_party/Megatron-LM:${PYTHONPATH}"
+ export EXP="${EXP:-${PRIMUS_PATH}/examples/mlperf/configs/MI355X/llama3.1_8B-pretrain-FP8.yaml}"
export DATA_PATH=/data
Comment on lines +11 to +15

Copilot AI (Apr 17, 2026):

This config script hardcodes developer-specific absolute paths for `PRIMUS_PATH` and `EXP`, which prevents reuse on other systems. Consider making these derived from the script location (e.g., repo root) or requiring them as inputs, and keep only portable defaults.

export PRIMUS_MICRO_BATCH_SIZE=2
export PRIMUS_GLOBAL_BATCH_SIZE=32
export PRIMUS_LR=8e-4
export PRIMUS_TRAIN_ITERS=1200000
export EVAL_SAMPLES_INTERVAL=12288 # Evaluate every 12,288 samples
export PRIMUS_EVAL_INTERVAL=$((EVAL_SAMPLES_INTERVAL / PRIMUS_GLOBAL_BATCH_SIZE)) # Auto-computed
export SEED=31952
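As a sanity check on the auto-computed interval above, 12,288 samples at a global batch size of 32 works out to an evaluation every 384 iterations. A standalone sketch of the same arithmetic (not part of the config):

```python
# Mirrors the shell arithmetic: EVAL_SAMPLES_INTERVAL / PRIMUS_GLOBAL_BATCH_SIZE.
eval_samples_interval = 12288
global_batch_size = 32

eval_interval = eval_samples_interval // global_batch_size
print(eval_interval)  # 384 iterations between evaluations
```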

export ROCTRACER_LOG=1
export ROCTRACER_LOG_LEVEL=5

export HSA_ENABLE_INTERRUPT=0
export HSA_TOOLS_LIB=/opt/rocm/lib/libroctracer64.so


export PRIMUS_APPLY_ROPE_FUSION=True
export PRIMUS_FP8_RECIPE=hybrid

export HSA_NO_SCRATCH_RECLAIM=1
export HSA_ENABLE_SDMA=1
export GPU_MAX_HW_QUEUES=2
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NVTE_FUSED_ATTN=1
export NVTE_FUSED_ATTN_CK=1
export NVTE_FUSED_ATTN_AOTRITON=1
export NVTE_CK_USES_FWD_V3=1
export NVTE_CK_USES_BWD_V3=1
export NVTE_CK_IS_V3_ATOMIC_FP32=0
export USE_TE_SWIGLU=1
export TOKENIZERS_PARALLELISM=false
export NCCL_CHECKS_DISABLE=1
export TORCH_NCCL_HIGH_PRIORITY=1

export NVTE_ASYNC_AMAX_REDUCTION=1
export NVTE_DP_AMAX_REDUCE_INTERVAL=0

export ENABLE_MLLOG=1
export MLLOG_OUTPUT_FILE=/results/mlperf_output.log
export MLLOG_TRAIN_LOSS_LOG_FREQ=100
export MLLOG_TARGET_EVAL_LOSS=3.3
Copilot AI (Apr 10, 2026):

This script exports `MLLOG_TARGET_EVAL_LOSS`, but the early-stop logic added in primus/modules/trainer/megatron/trainer.py reads `TARGET_EVAL_LOSS`. Either export `TARGET_EVAL_LOSS` here as well, or update the trainer to consume `MLLOG_TARGET_EVAL_LOSS` to keep the MLPerf workflow consistent.

Suggested change:
  export MLLOG_TARGET_EVAL_LOSS=3.3
+ export TARGET_EVAL_LOSS="${MLLOG_TARGET_EVAL_LOSS}"
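If the trainer is to keep reading `TARGET_EVAL_LOSS`, a fallback on the Python side is another option. A sketch under the assumption that only the MLLOG-prefixed variable is exported (variable names per the comment above):

```python
import os

# Assume only the MLLOG-prefixed variable is set, as the comment describes.
os.environ["MLLOG_TARGET_EVAL_LOSS"] = "3.3"

# Prefer TARGET_EVAL_LOSS, fall back to MLLOG_TARGET_EVAL_LOSS, else None.
raw = os.environ.get("TARGET_EVAL_LOSS") or os.environ.get("MLLOG_TARGET_EVAL_LOSS")
target_eval_loss = float(raw) if raw is not None else None
print(target_eval_loss)  # 3.3
```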
export TARGET_EVAL_LOSS=3.3
export MLLOG_SUBMISSION_BENCHMARK=llama31_8b
export MLLOG_SUBMISSION_DIVISION=closed
export MLLOG_SUBMISSION_ORG=AMD
export MLLOG_SUBMISSION_PLATFORM=MI355X

export TORCHPROF_OUTPUT_DIR=/home/vidgoyal/small_llm_pretraining/primus/outputs/
export TORCHPROF_VERBOSE=0
export TORCHPROF_MAXROWS=100
export TORCHPROF_PROFILE_MEMORY=0
export TORCHPROF_WITH_STACK=0
export TORCHPROF_RECORD_SHAPES=0
export TORCHPROF_WITH_FLOPS=0
Comment on lines +63 to +69

Copilot AI (Apr 17, 2026):

`TORCHPROF_OUTPUT_DIR` is hardcoded to a developer home directory. This will break in containerized/CI runs and on other machines. Prefer a relative path, a /results/... default, or require callers to set the env var.
export PROF_WARMUP_STEPS=10 #128
export PROF_ACTIVE_STEPS=6372
# export HIPBLASLT_TUNING_OVERRIDE_FILE=tuning.txt

export NVTE_FLASH_ATTN=0
export NVTE_FUSED_ATTN=1
119 changes: 119 additions & 0 deletions examples/mlperf/configs/MI355X/llama3.1_8B-pretrain-FP8.yaml
@@ -0,0 +1,119 @@
work_group: ${TEAM:amd}
user_name: ${USER:root}
exp_name: ${EXP_NAME:llama3.1_8B-pretrain-v26.2}
workspace: ./output

modules:
pre_trainer:
framework: megatron
config: pre_trainer.yaml

# model to run
model: llama3.1_8B.yaml
overrides:
# --- Logging Config ---
wandb_project: "Primus-llama3.1-8B-pretrain"
disable_wandb: true
disable_tensorboard: true
stderr_sink_level: DEBUG
log_interval: 999999
log_avg_skip_iterations: 2
log_avg_reset_interval: 50

eval_iters: 32 # 32 * GBS = 1024 eval samples
eval_interval: ${PRIMUS_EVAL_INTERVAL:10} # default 10: evaluate every 10 * GBS = 320 samples

# --- Training Config ---
train_iters: ${PRIMUS_TRAIN_ITERS:200}
micro_batch_size: 2 # grad_acc = global_batch_size / (micro_batch_size * num_gpus) = 32 / (2 * 8) = 2
global_batch_size: 32
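The gradient-accumulation comment above can be verified with a quick calculation. A sketch under the stated GBS/MBS values and the 8-GPU system config (not Primus code):

```python
global_batch_size = 32
micro_batch_size = 2
num_gpus = 8  # GPUS_PER_NODE from config_MI355X_1x8x1.sh

# Micro-batch steps accumulated before each optimizer step.
grad_acc_steps = global_batch_size // (micro_batch_size * num_gpus)
print(grad_acc_steps)  # 2, matching the inline comment
```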

seq_length: 8192
Copilot AI (Apr 10, 2026):

This YAML defines `seq_length` twice (line 31 and line 67). Duplicate keys can be confusing and some YAML parsers reject them; consider keeping a single `seq_length` entry (or rename one if they are meant to differ).

Suggested change:
- seq_length: 8192
max_position_embeddings: 8192

seed: ${SEED:1234}

lr: 0.0008 # 8e-4
min_lr: 0.00008 # 10% of lr
lr_warmup_iters: 128 # TODO: lr warmup steps should be 128.
lr_decay_iters: 1199872 # 1200000 - 128
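The decay horizon above is just the total iteration budget minus the warmup steps, as its inline comment states. A quick check of the arithmetic:

```python
train_iters = 1_200_000   # PRIMUS_TRAIN_ITERS
lr_warmup_iters = 128

lr_decay_iters = train_iters - lr_warmup_iters
print(lr_decay_iters)  # 1199872, matching the config value
```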
lr_decay_style: cosine
weight_decay: 0.1
adam_beta1: 0.9
adam_beta2: 0.95
eod_mask_loss: false
init_method_std: 0.02
norm_epsilon: 1.0e-5

adam_eps: 1.0e-5 # TODO, find out what is the correct key to this parameter.

check_for_nan_in_loss_and_grad: false # default true
check_for_spiky_loss: false # default false, but setting it here explicitly
check_for_large_grads: false # default false, but setting it here explicitly

# --- Model Parallel Config ---
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
expert_model_parallel_size: 1
overlap_grad_reduce: true
overlap_param_gather: true
gradient_accumulation_fusion: true

# --- Data Config ---
mock_data: false
train_data_path: "/data/mlperf/data/c4-train.en_6_text_document"
valid_data_path: "/data/mlperf/data/c4-validation-91205-samples.en_text_document"
test_data_path: null
Copilot AI (Apr 17, 2026):

`seq_length` is defined twice under overrides (once near the top and again in the data section). YAML will keep only the latter, which is easy to miss and can cause confusing config drift. Remove the duplicate key (or add a comment explaining intentional override).

Suggested change:
  test_data_path: null
+ # Intentionally overrides an earlier `seq_length` in `overrides`; 8192 is the effective value.
seq_length: 8192
Copilot AI (Apr 10, 2026):

`seq_length` is defined again here, duplicating the earlier `seq_length` setting. Please remove one of the duplicate keys to avoid ambiguity and YAML-parser incompatibilities.

Suggested change:
- seq_length: 8192
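The duplicate-key hazard flagged above is easy to reproduce: most YAML 1.1 loaders (PyYAML included) silently keep the last occurrence. A minimal last-wins illustration using a hand-rolled parse of flat `key: value` lines, so it needs no YAML library (the 4096 value is hypothetical, chosen to make the override visible):

```python
# Simulates how a last-wins loader resolves duplicate scalar keys.
doc = """\
seq_length: 8192
max_position_embeddings: 8192
seq_length: 4096
"""

config = {}
for line in doc.splitlines():
    key, _, value = line.partition(":")
    config[key.strip()] = int(value)  # a later duplicate overwrites silently

print(config["seq_length"])  # 4096 -- the earlier 8192 is gone without warning
```

In the config above both occurrences happen to be 8192, so nothing breaks today; the risk is that a future edit to one copy silently has no effect.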
data_cache_path: /npy_indices
mmap_bin_files: true

# --- Profiling Config ---
profile: false
use_pytorch_profiler: true
profile_ranks: [0] # Only profile rank 0 to save disk space
profile_step_start: 8 # Start after warmup (step 8)
profile_step_end: 13 # Profile 5 iterations (8-12)
disable_profiler_activity_cpu: false # GPU kernels only (smaller files)
torch_profiler_record_shapes: false # Disable for smaller traces
torch_profiler_with_stack: false # Disable for smaller traces
torch_profiler_use_gzip: true # Compress output

# --- Checkpointing Config ---
finetune: false
auto_continue_train: false
load: null
no_load_optim: null
no_load_rng: null
save: null
save_interval: 2000000
no_save_optim: null
no_save_rng: null
disable_last_saving: true
ckpt_format: torch

# --- FSDP Config ---
use_torch_fsdp2: false
use_distributed_optimizer: true # this is needed for fsdp2

# Cross entropy flags
cross_entropy_fusion_impl: "te"
cross_entropy_loss_fusion: true

# --- Mixed Precision Config ---
fp8: hybrid # e4m3, hybrid
fp8_amax_history_len: 4
fp8_amax_compute_algo: "most_recent"
accumulate_allreduce_grads_in_fp32: false
grad_reduce_in_bf16: true
attention_softmax_in_fp32: false
# fp8_param_gather: true


# --- Primus Turbo Config ---
enable_primus_turbo: false
use_turbo_attention: false
use_turbo_parallel_linear: false # can't use together with delayed recipe
use_turbo_grouped_mlp: false
moe_use_fused_router_with_aux_score: false
enable_turbo_attention_float8: false
61 changes: 61 additions & 0 deletions examples/mlperf/run_and_time.sh
@@ -0,0 +1,61 @@
#!/bin/bash

set -e

mkdir -p /results
Comment on lines +3 to +5

Copilot AI (Apr 10, 2026):

With `set -e`, a non-zero torchrun exit will terminate the script immediately, so `ret_code=$?` and the explicit failure handling below never run. Consider disabling -e around torchrun (or using `torchrun ...; ret_code=$?` with `set +e`/`set -e` guards) so timing/result logging works on failures too.

export GPUS_PER_NODE=${GPUS_PER_NODE:-8}
export NNODES=${NNODES:-1}
export NODE_RANK=${NODE_RANK:-0}
export MASTER_ADDR=${MASTER_ADDR:-localhost}
export MASTER_PORT=${MASTER_PORT:-29502}
export EXP=${EXP:-/workspace/code/conf/llama3.1_8B-pretrain.yaml}
export DATA_PATH=${DATA_PATH:-/data}

echo "============================================"
echo "MLPerf LLama3.1 8B Training"
echo "============================================"
echo "Config: ${EXP}"
echo "Data: ${DATA_PATH}"
echo "GPUs: ${GPUS_PER_NODE}"
echo "Nodes: ${NNODES}"
echo "Train iters: ${PRIMUS_TRAIN_ITERS}"
echo "Eval interval: ${PRIMUS_EVAL_INTERVAL}"
echo "Enable MLPerf logging: ${ENABLE_MLLOG}"
echo "MLLOG_TRAIN_LOSS_LOG_FREQ: ${MLLOG_TRAIN_LOSS_LOG_FREQ}"
echo "MLLOG_TARGET_EVAL_LOSS: ${MLLOG_TARGET_EVAL_LOSS}"
echo "MLLOG_SUBMISSION_BENCHMARK: ${MLLOG_SUBMISSION_BENCHMARK}"
echo "MLLOG_SUBMISSION_DIVISION: ${MLLOG_SUBMISSION_DIVISION}"
echo "MLLOG_SUBMISSION_ORG: ${MLLOG_SUBMISSION_ORG}"
echo "MLLOG_SUBMISSION_PLATFORM: ${MLLOG_SUBMISSION_PLATFORM}"
echo "============================================"

start=$(date +%s)
start_fmt=$(date +%Y-%m-%d\ %r)
echo "STARTING TIMING RUN AT $start_fmt"

torchrun \
--nproc_per_node=${GPUS_PER_NODE} \
--nnodes=${NNODES} \
--node_rank=${NODE_RANK} \
--master_addr=${MASTER_ADDR} \
--master_port=${MASTER_PORT} \
src/train.py

ret_code=$?

Comment on lines +37 to +46

Copilot AI (Apr 10, 2026):

This `ret_code=$?` check is ineffective when the script runs with `set -e` (the script exits immediately if torchrun fails). If you want to handle failures explicitly, capture the exit code by temporarily disabling -e or by using `torchrun ... || ret_code=$?` and then re-enable -e.

Suggested change:
- torchrun \
-     --nproc_per_node=${GPUS_PER_NODE} \
-     --nnodes=${NNODES} \
-     --node_rank=${NODE_RANK} \
-     --master_addr=${MASTER_ADDR} \
-     --master_port=${MASTER_PORT} \
-     src/train.py
- ret_code=$?
+ ret_code=0
+ torchrun \
+     --nproc_per_node=${GPUS_PER_NODE} \
+     --nnodes=${NNODES} \
+     --node_rank=${NODE_RANK} \
+     --master_addr=${MASTER_ADDR} \
+     --master_port=${MASTER_PORT} \
+     src/train.py || ret_code=$?
Comment on lines +3 to +46

Copilot AI (Apr 17, 2026):

This script uses `set -e` but then tries to capture `ret_code=$?` after torchrun. With -e, a non-zero torchrun exit will abort the script immediately, so ret_code/timing output won't be recorded. If you want timing even on failure, temporarily disable -e around torchrun (or use an `if ...; then ...; fi` pattern) and handle the exit code explicitly.
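The pattern the review comments above describe, capturing a failing command's status without letting `set -e` abort the script, can be demonstrated in isolation (a generic sketch using `false` as a stand-in for torchrun, not the exact run_and_time.sh fix):

```shell
#!/bin/bash
set -e

ret_code=0
# Because the command is the left side of `||`, a non-zero exit does not
# trigger errexit; the assignment captures the status instead.
false || ret_code=$?

echo "captured exit code: $ret_code"   # still reached; prints 1
```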
end=$(date +%s)
end_fmt=$(date +%Y-%m-%d\ %r)
echo "ENDING TIMING RUN AT $end_fmt"

result=$(( end - start ))
result_name="LLAMA3.1_8B"
echo "RESULT,$result_name,,$result,AMD,$start_fmt"

if [[ $ret_code != 0 ]]; then
echo "Training failed with exit code: $ret_code"
exit $ret_code
fi

exit 0
