(4/5) async grpo break out of generation loop (is_done) by AmineDiro · Pull Request #5321 · huggingface/trl

AmineDiro · 2026-03-20T09:44:25Z

🔴 🔴 IMPORTANT: Depends on: feature/async-grpo-data-classes

What does this PR do?

Dynamically inspects the environment factory for an is_done method upon initialization.
Passes is_done into the _generate_one method and breaks the multi-turn generation loop early if the environment signals it has completed its objective.
Includes a small fix using get_json_schema to ensure environment tools pass transformers schema validation.

Note

Medium Risk
Moderate risk: changes the rollout data model and multi-turn generation control flow, which can affect reward inputs, token accounting, and tool-calling behavior in async sampling.

Overview
Adds an environment-driven early-exit path for multi-turn tool-calling: AsyncRolloutWorker now discovers an optional is_done method per environment instance and stops _generate_one once it returns true.

Refactors rollout outputs from flat lists into structured records (RolloutCompletion/TurnRecord/ToolCallRecord) that track per-turn messages, token ids/logprobs, tool-response suffix ids, and timing; scoring/reward inputs, tool metrics, and completion masks/logprobs are updated to derive from this structure. Also pre-converts environment bound methods to JSON schema via get_json_schema so they pass Transformers tool validation.

^{Written by Cursor Bugbot for commit ae9ea3e. This will update automatically on new commits. Configure here.}

…ls schema

HuggingFaceDocBuilderDev · 2026-03-20T09:47:07Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

cursor · 2026-03-20T09:51:16Z

+                        tool_execution_duration_s=0.0,
+                    )
+                )
+                return _build_completion(turns, truncated=False, total_duration=time.monotonic() - t_start)


truncated always False even when max turns exceeded

Low Severity

The truncated field on RolloutCompletion is always set to False across all return paths. On line 600, two semantically different conditions are combined with or: (1) tool_calls_raw is None (natural completion) and (2) iteration_num >= max_num_turns (forced stop). When the model produces tool calls but max_num_turns forces an early exit, truncated is incorrectly set to False — the generation was cut short, which is truncation. The truncated value needs to depend on which sub-condition triggered the exit (e.g., truncated=(tool_calls_raw is not None)).

AmineDiro · 2026-03-20T10:09:07Z

Discussion continued from PR #5299 regarding the is_done flag:

@qgallouedec:
I strongly think that we shouldn't add any done signal, see #5235 (comment)

Requiring extra methods unnecessarily narrows the set of valid implementations and reduces substitutability. We should depend on the minimal protocol the class actually needs. There is nothing we can't do without is_done, so I'd recommend simply not having it. Plus at inference, the model generation won't have access to this flag anyway, so it's useless

@AmineDiro:
The done was added for this case (cf earlier comment):

The issue we currently have is that user can't define a stop function to exit the generation loop. For example, when in code environment, we want to exist if the tests succeed but the current setup keeps rolling until we reach max_turns or token exhaustion.

@qgallouedec:
yes I know, but ideally, the model should learn no to continue calling the tool if it gets an error each time. IMO the right way to handle done would be:

class MyEnv:
    def reset(self, **kwargs):
        self._done = False

    def my_func(self):
        """Some nice documentation"""
        if self._done:
            raise Exception("Session expired, you can't use this tool anymore!")
        if some_condition:
            self._done = True
        return

If the generation stops when the environment is done, then the model never learns to know that the [environment is done] implies that [it can't use it anymore].

@AmineDiro:
Sorry, maybe I didn't explain well the usage of the is_done in this portion codebase. The current impl does the following:

async def _generate_one(...):
    while True:
        # .... 
        if tool_calls is None or (max_iterations is not None and iteration_num >= max_iterations):
            return ...

Now we can have done() raise an Exception but I am not sure it provides a clean signal to know that this rollout had exception because of some env specific errors vs it genuinely reached end of turn. If we go with Exception route, I would image that we need to define a library side Exception like StopGeneration(Exception) to cleanly catch it and know when env has stopped. But then this is equivalent to defining a done function IMO.

Also having a done env defined tool extends tools context for the LLM which can be a good choice in some specific cases but overall maybe unnecesary ??

@qgallouedec:
So if we don't use the done flag to stop the generation, what's the point of having it? I can see it's used here, but the reason is not clear:
https://github.com/huggingface/trl/pull/5299/changes#diff-175873f8c2363ec8322c4a0af91437e83da991bb5d388f2d01342b2d31eaf8a3R822

it seems like a proxy the a new is_truncated flag, which I don't quite understand either. The truncation is equivalent to "the last token isn't an EOS", why adding this flag here?

But more generally, don't you think it's out of the scope of this PR?

@AmineDiro:

So if we don't use the done flag to stop the generation, what's the point of having it? I can see it's used here, but the reason is not clear:

To be precise, I put the previous generation loop. The changed one does use it to break : line

 if is_done is not None and is_done():
     return _build_completion(turns, truncated=False, total_duration=time.monotonic() - t_start)

it seems like a proxy the a new is_truncated flag, which I don't quite understand either. The truncation is equivalent to "the last token isn't an EOS", why adding this flag here?

Truncation is separate from done and is mainly for metrics and debugging purposes. It can be computed from "isn't last EOS" but its just a bool and recomputing something that small seems unnecessary ?

qgallouedec · 2026-03-23T01:23:25Z

I took another look at this PR, and there’s a pattern here, I’m not sure if it’s intentional. Let me know if I’m understanding this correctly:

The is_done check fires after tool execution but before the next generation turn. This means that when the environment terminates, the rollout ends with a tool response as the last message, ie, the model never generates a closing assistant turn.
[prompt | tool_call | tool_response] instead of [prompt | tool_call | tool_response | assistant_final_message]

AmineDiro · 2026-03-23T21:57:54Z

The is_done check fires after tool execution but before the next generation turn. This means that when the environment terminates, the rollout ends with a tool response as the last message, ie, the model never generates a closing assistant turn.

yes you are right

AmineDiro added 2 commits March 20, 2026 09:32

Refactor RolloutCompletion to store turns and compute fields lazily

b39ac5e

feat: Add is_done early break from generation and fix environment too…

983206b

…ls schema

AmineDiro requested review from albertvillanova and qgallouedec March 20, 2026 09:44

cursor Bot reviewed Mar 20, 2026

View reviewed changes

AmineDiro changed the title ~~async grpo break out of generation loop (is_done)~~ (4/5) async grpo break out of generation loop (is_done) Mar 20, 2026

precommit

ae9ea3e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(4/5) async grpo break out of generation loop (is_done)#5321

(4/5) async grpo break out of generation loop (is_done)#5321
AmineDiro wants to merge 3 commits intomainfrom
feature/async-grpo-is-done

AmineDiro commented Mar 20, 2026 •

edited by cursor Bot

Loading

Uh oh!

HuggingFaceDocBuilderDev commented Mar 20, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Mar 20, 2026

Uh oh!

AmineDiro commented Mar 20, 2026 •

edited

Loading

Uh oh!

qgallouedec commented Mar 23, 2026

Uh oh!

AmineDiro commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

AmineDiro commented Mar 20, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Uh oh!

HuggingFaceDocBuilderDev commented Mar 20, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Mar 20, 2026

Choose a reason for hiding this comment

truncated always False even when max turns exceeded

Uh oh!

AmineDiro commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

qgallouedec commented Mar 23, 2026

Uh oh!

AmineDiro commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

AmineDiro commented Mar 20, 2026 •

edited by cursor Bot

Loading

`truncated` always `False` even when max turns exceeded

AmineDiro commented Mar 20, 2026 •

edited

Loading