Skip to content

(4/5) async grpo break out of generation loop (is_done)#5321

Open
AmineDiro wants to merge 3 commits intomainfrom
feature/async-grpo-is-done
Open

(4/5) async grpo break out of generation loop (is_done)#5321
AmineDiro wants to merge 3 commits intomainfrom
feature/async-grpo-is-done

Conversation

@AmineDiro
Copy link
Copy Markdown
Member

@AmineDiro AmineDiro commented Mar 20, 2026

🔴 🔴 IMPORTANT: Depends on: feature/async-grpo-data-classes

What does this PR do?

  • Dynamically inspects the environment factory for an is_done method upon initialization.
  • Passes is_done into the _generate_one method and breaks the multi-turn generation loop early if the environment signals it has completed its objective.
  • Includes a small fix using get_json_schema to ensure environment tools pass transformers schema validation.

Note

Medium Risk
Moderate risk: changes the rollout data model and multi-turn generation control flow, which can affect reward inputs, token accounting, and tool-calling behavior in async sampling.

Overview
Adds an environment-driven early-exit path for multi-turn tool-calling: AsyncRolloutWorker now discovers an optional is_done method per environment instance and stops _generate_one once it returns true.

Refactors rollout outputs from flat lists into structured records (RolloutCompletion/TurnRecord/ToolCallRecord) that track per-turn messages, token ids/logprobs, tool-response suffix ids, and timing; scoring/reward inputs, tool metrics, and completion masks/logprobs are updated to derive from this structure. Also pre-converts environment bound methods to JSON schema via get_json_schema so they pass Transformers tool validation.

Written by Cursor Bugbot for commit ae9ea3e. This will update automatically on new commits. Configure here.

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

tool_execution_duration_s=0.0,
)
)
return _build_completion(turns, truncated=False, total_duration=time.monotonic() - t_start)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

truncated always False even when max turns exceeded

Low Severity

The truncated field on RolloutCompletion is always set to False across all return paths. On line 600, two semantically different conditions are combined with or: (1) tool_calls_raw is None (natural completion) and (2) iteration_num >= max_num_turns (forced stop). When the model produces tool calls but max_num_turns forces an early exit, truncated is incorrectly set to False — the generation was cut short, which is truncation. The truncated value needs to depend on which sub-condition triggered the exit (e.g., truncated=(tool_calls_raw is not None)).

Fix in Cursor Fix in Web

@AmineDiro AmineDiro changed the title async grpo break out of generation loop (is_done) (4/5) async grpo break out of generation loop (is_done) Mar 20, 2026
@AmineDiro
Copy link
Copy Markdown
Member Author

AmineDiro commented Mar 20, 2026

Discussion continued from PR #5299 regarding the is_done flag:


@qgallouedec:
I strongly think that we shouldn't add any done signal, see #5235 (comment)

Requiring extra methods unnecessarily narrows the set of valid implementations and reduces substitutability. We should depend on the minimal protocol the class actually needs. There is nothing we can't do without is_done, so I'd recommend simply not having it. Plus at inference, the model generation won't have access to this flag anyway, so it's useless

@AmineDiro:
The done was added for this case (cf earlier comment):

The issue we currently have is that user can't define a stop function to exit the generation loop. For example, when in code environment, we want to exist if the tests succeed but the current setup keeps rolling until we reach max_turns or token exhaustion.

@qgallouedec:
yes I know, but ideally, the model should learn no to continue calling the tool if it gets an error each time. IMO the right way to handle done would be:

class MyEnv:
    def reset(self, **kwargs):
        self._done = False

    def my_func(self):
        """Some nice documentation"""
        if self._done:
            raise Exception("Session expired, you can't use this tool anymore!")
        if some_condition:
            self._done = True
        return

If the generation stops when the environment is done, then the model never learns to know that the [environment is done] implies that [it can't use it anymore].

@AmineDiro:
Sorry, maybe I didn't explain well the usage of the is_done in this portion codebase. The current impl does the following:

async def _generate_one(...):
    while True:
        # .... 
        if tool_calls is None or (max_iterations is not None and iteration_num >= max_iterations):
            return ...

Now we can have done() raise an Exception but I am not sure it provides a clean signal to know that this rollout had exception because of some env specific errors vs it genuinely reached end of turn. If we go with Exception route, I would image that we need to define a library side Exception like StopGeneration(Exception) to cleanly catch it and know when env has stopped. But then this is equivalent to defining a done function IMO.

Also having a done env defined tool extends tools context for the LLM which can be a good choice in some specific cases but overall maybe unnecesary ??

@qgallouedec:
So if we don't use the done flag to stop the generation, what's the point of having it? I can see it's used here, but the reason is not clear:
https://github.com/huggingface/trl/pull/5299/changes#diff-175873f8c2363ec8322c4a0af91437e83da991bb5d388f2d01342b2d31eaf8a3R822

it seems like a proxy the a new is_truncated flag, which I don't quite understand either. The truncation is equivalent to "the last token isn't an EOS", why adding this flag here?

But more generally, don't you think it's out of the scope of this PR?

@AmineDiro:

So if we don't use the done flag to stop the generation, what's the point of having it? I can see it's used here, but the reason is not clear:

To be precise, I put the previous generation loop. The changed one does use it to break : line

 if is_done is not None and is_done():
     return _build_completion(turns, truncated=False, total_duration=time.monotonic() - t_start)

it seems like a proxy the a new is_truncated flag, which I don't quite understand either. The truncation is equivalent to "the last token isn't an EOS", why adding this flag here?

Truncation is separate from done and is mainly for metrics and debugging purposes. It can be computed from "isn't last EOS" but its just a bool and recomputing something that small seems unnecessary ?

@qgallouedec
Copy link
Copy Markdown
Member

I took another look at this PR, and there’s a pattern here, I’m not sure if it’s intentional. Let me know if I’m understanding this correctly:

The is_done check fires after tool execution but before the next generation turn. This means that when the environment terminates, the rollout ends with a tool response as the last message, ie, the model never generates a closing assistant turn.
[prompt | tool_call | tool_response] instead of [prompt | tool_call | tool_response | assistant_final_message]

@AmineDiro
Copy link
Copy Markdown
Member Author

The is_done check fires after tool execution but before the next generation turn. This means that when the environment terminates, the rollout ends with a tool response as the last message, ie, the model never generates a closing assistant turn.

yes you are right

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants