feat: log raw importance ratios and fraction of truncation/masking in vLLM importance sampling correction by muupan · Pull Request #5243 · huggingface/trl

muupan · 2026-03-08T16:19:46Z

What does this PR do?

Resolves #5231

This PR adds new logged metrics:

sampling/raw_importance_sampling_ratio/{min,max,mean}
- These match the existing sampling/importance_sampling_ratio/{min,max,mean} metrics, except that they are computed from the values before truncation or masking.
sampling/frac_modified_importance_sampling_ratio
- This is the fraction of importance sampling ratio values that are either truncated or masked.
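
As a rough illustration of what the new metrics measure, here is a minimal pure-Python sketch for the truncation case. It is not the code from this PR: the helper name `sampling_metrics` and the list-based inputs are assumptions made for illustration, and the trainer itself operates on tensors.

```python
def sampling_metrics(raw_ratios, cap):
    """Summarise importance ratios before and after truncation at `cap`.

    Hypothetical helper for illustration only; not part of TRL.
    """
    if not raw_ratios:
        raise ValueError("need at least one ratio")
    # Truncation clamps every ratio from above at the cap.
    modified = [min(r, cap) for r in raw_ratios]
    # A ratio counts as "modified" if truncation changed its value.
    n_changed = sum(1 for r, m in zip(raw_ratios, modified) if r != m)
    n = len(raw_ratios)
    return {
        "sampling/raw_importance_sampling_ratio/min": min(raw_ratios),
        "sampling/raw_importance_sampling_ratio/max": max(raw_ratios),
        "sampling/raw_importance_sampling_ratio/mean": sum(raw_ratios) / n,
        "sampling/importance_sampling_ratio/min": min(modified),
        "sampling/importance_sampling_ratio/max": max(modified),
        "sampling/importance_sampling_ratio/mean": sum(modified) / n,
        "sampling/frac_modified_importance_sampling_ratio": n_changed / n,
    }


metrics = sampling_metrics([0.9, 1.0, 1.5, 1.1], cap=1.2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With one of the four toy ratios above the cap, the raw maximum stays at 1.5 while the post-truncation maximum equals the cap, and a quarter of the ratios are reported as modified.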

I ran the following code to verify the output:

import argparse

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import accuracy_reward


# Parse the importance sampling mode to test from the command line.
parser = argparse.ArgumentParser()
parser.add_argument("--vllm_importance_sampling_mode", type=str, default="sequence_mask")
cli_args = parser.parse_args()

dataset = load_dataset("trl-lib/DeepMath-103K", split="train[:5]")

# Kept as `args` because it is passed to GRPOTrainer below; the CLI namespace is named separately to avoid shadowing.
args = GRPOConfig(
    output_dir=f"outputs/{cli_args.vllm_importance_sampling_mode}",
    vllm_importance_sampling_correction=True,
    vllm_importance_sampling_mode=cli_args.vllm_importance_sampling_mode,
    vllm_importance_sampling_cap=1.2,
    use_vllm=True,
    vllm_mode="colocate",
    max_steps=5,
    num_train_epochs=1,
    logging_steps=1,
    save_strategy="no",
    report_to="tensorboard",
)

trainer = GRPOTrainer(
    args=args,
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

uv run --no-sync accelerate launch --num_processes 1 train_grpo_example.py --vllm_importance_sampling_mode token_truncate
uv run --no-sync accelerate launch --num_processes 1 train_grpo_example.py --vllm_importance_sampling_mode token_mask
uv run --no-sync accelerate launch --num_processes 1 train_grpo_example.py --vllm_importance_sampling_mode sequence_truncate
uv run --no-sync accelerate launch --num_processes 1 train_grpo_example.py --vllm_importance_sampling_mode sequence_mask

From the tensorboard records, you can see:

The new metrics are all recorded.
sampling/raw_importance_sampling_ratio/max is higher than the cap value of 1.2, while sampling/importance_sampling_ratio/max is upper-bounded by it. This is expected, because the latter is computed after truncation or masking. Masking leads to lower values than truncation, which is also expected.
sampling/frac_modified_importance_sampling_ratio is higher with sequence-level IS than with token-level IS.
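
One plausible reading of the last two observations: a sequence-level ratio is the product of the per-token ratios, so small per-token mismatches compound and exceed the cap more often, and masking replaces an over-cap ratio with 0 whereas truncation replaces it with the cap. The toy sketch below illustrates this under the assumption that the four modes behave as their names suggest. `apply_is_mode`, `frac_modified`, and the example log-probability differences are hypothetical and not taken from TRL.

```python
import math

CAP = 1.2  # same cap as in the verification script


def apply_is_mode(logp_diffs, mode, cap=CAP):
    """Return (raw, modified) ratios for per-token log-prob differences.

    `logp_diffs` is a list of sequences, each a list of per-token
    differences (trainer log-prob minus sampler log-prob). Assumed
    semantics: "sequence" modes use one ratio per sequence.
    """
    level, action = mode.split("_")
    if level not in ("token", "sequence") or action not in ("truncate", "mask"):
        raise ValueError(f"unknown mode: {mode}")
    if level == "sequence":
        # exp(sum of log diffs) equals the product of per-token ratios.
        raw = [math.exp(sum(seq)) for seq in logp_diffs]
    else:
        raw = [math.exp(d) for seq in logp_diffs for d in seq]
    if action == "truncate":
        modified = [min(r, cap) for r in raw]
    else:
        modified = [0.0 if r > cap else r for r in raw]
    return raw, modified


def frac_modified(raw, modified):
    """Fraction of ratios whose value was changed by truncation or masking."""
    return sum(r != m for r, m in zip(raw, modified)) / max(len(raw), 1)


toy = [[0.1, 0.1, 0.1, 0.1], [0.0, 0.3, -0.1]]
for mode in ("token_truncate", "token_mask", "sequence_truncate", "sequence_mask"):
    raw, modified = apply_is_mode(toy, mode)
    print(mode, round(max(raw), 3), round(max(modified), 3),
          round(frac_modified(raw, modified), 3))
```

In this toy input only one of seven token-level ratios exceeds the cap, while both sequence-level ratios do, which mirrors the direction of the tensorboard observation without reproducing its numbers.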

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Note

Low Risk
Low risk: changes are limited to additional metric logging around vLLM importance-sampling correction, without altering the loss or training behavior beyond minor extra tensor ops.

Overview
Adds additional GRPO vLLM importance-sampling diagnostics by tracking the raw (pre-cap) importance sampling ratios alongside the existing capped/masked ratios.

GRPOTrainer now logs sampling/raw_importance_sampling_ratio/{min,mean,max} and sampling/frac_modified_importance_sampling_ratio (fraction of ratios changed by truncation/masking), and the GRPO docs are updated to list these new metrics.

Written by Cursor Bugbot for commit 32a1746.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

qgallouedec · 2026-03-13T23:48:57Z

thanks! can you add a mention of this in grpo_trainer.md as well

muupan · 2026-03-18T06:51:14Z

It seems that grpo_trainer.md does not mention the sampling/* metrics at all. I added mentions of them, including the existing ones.

muupan · 2026-04-07T07:27:36Z

@qgallouedec Gentle ping on this PR when you have a chance.

Log raw importance ratios and fraction of truncation/masking

fc9eebe

muupan mentioned this pull request Mar 8, 2026

Logging how vllm importance ratios are truncated/masked in GRPOTrainer #5231

Open

cursor Bot reviewed Mar 8, 2026

Comment thread trl/trainer/grpo_trainer.py

Fix computation of frac_modified_importance_sampling_ratio for multi-process settings

6d0c0ee

cursor Bot reviewed Mar 8, 2026

Comment thread trl/trainer/grpo_trainer.py

muupan and others added 2 commits March 9, 2026 02:44

Avoid division by zero

5881fda

Merge branch 'main' into feature/log-raw-importance-sampling-ratio

7f8fee3

Add sampling metrics to docs

32a1746
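
The commit messages above mention a multi-process fix and a division-by-zero guard for the fraction metric, but the page does not show how either was implemented. As a sketch of one common approach (an assumption, not the PR's code), the fraction can be computed from counts pooled across processes rather than by averaging per-process fractions, with the denominator clamped so that an empty batch yields 0. The helper name `pooled_fraction` and the example counts are hypothetical.

```python
def pooled_fraction(per_process_counts):
    """Fraction of modified ratios across all processes.

    `per_process_counts` is a list of (n_modified, n_total) pairs, one per
    process. In a real distributed run these counts would be gathered or
    all-reduced first; here they are passed in directly.
    """
    total_modified = sum(m for m, _ in per_process_counts)
    total = sum(n for _, n in per_process_counts)
    # Clamp the denominator so a batch with no ratios gives 0.0, not an error.
    return total_modified / max(total, 1)


counts = [(3, 10), (0, 90)]  # two processes with very different batch sizes

# Averaging per-process fractions over-weights the small batch.
naive = sum(m / n for m, n in counts) / len(counts)
pooled = pooled_fraction(counts)

print(f"naive average of fractions: {naive:.3f}")
print(f"pooled fraction:            {pooled:.3f}")
print(f"empty batch:                {pooled_fraction([(0, 0)]):.3f}")
```

The two estimates differ whenever processes hold different numbers of ratios, which is why pooling counts is the safer default for a logged fraction.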

muupan wants to merge 5 commits into huggingface:main from muupan:feature/log-raw-importance-sampling-ratio