Support multimodal tool responses in environment_factory for VLM training #5323

sergiopaniego merged 73 commits into main from multimodal-tool-responses

Conversation
```python
# When the processor is a VLM processor, normalize string content to content blocks.
# Some VLM processors (e.g., Qwen3.5, Qwen3-VL) iterate over message["content"]
# assuming it's always a list of content blocks, which fails when content is a plain string.
if hasattr(self.processing_class, "image_processor"):
    for prompt in prompts:
        for message in prompt:
            if isinstance(message["content"], str):
                message["content"] = [{"type": "text", "text": message["content"]}]
```
You may want to use `trl.data_utils.prepare_multimodal_messages` here.
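For illustration, a minimal standalone sketch of the normalization the reviewer is pointing at; the real `trl.data_utils.prepare_multimodal_messages` helper may take additional arguments (its exact signature isn't shown in this thread):

```python
def normalize_content_blocks(messages: list[dict]) -> list[dict]:
    """Wrap plain-string content into [{"type": "text", ...}] blocks in place.

    Minimal sketch only: this mirrors the manual loop in the diff above, which
    is what the real prepare_multimodal_messages helper is expected to cover.
    """
    for message in messages:
        if isinstance(message.get("content"), str):
            message["content"] = [{"type": "text", "text": message["content"]}]
    return messages
```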
- Fix `_get_tool_suffix_ids`: add `tokenize=True` and unbatch for VLM processors
- Fix `parse_response`: delegate to inner tokenizer for processors (2 places)
- Propagate `response_schema` from processor to tokenizer for tool call parsing
- Fix `max_position_embeddings` fallback for models with composite config (e.g., Qwen3.5)
- Add `carla_vlm.py`: CARLA VLM training with multimodal tool responses (camera images)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pe_ids from input_ids, handle truncation
- Return images from `_generate` to `_generate_and_score_completions`
- Build `mm_token_type_ids` directly from `prompt_completion_ids` (detect image_pad tokens)
- Place `mm_token_type_ids` construction after the extension block to avoid double-extension
- Use `image_processor` directly for `pixel_values` (avoids re-processing mismatch)
- Trim `pixel_values`/`image_grid_thw` when truncation cuts partial images
- Normalize string content in tool messages for VLM processors in `_get_tool_suffix_ids`
- Use content blocks format for dummy messages with VLM processors
…detection
- Add `_get_vision_token_ids()` to detect image tokens dynamically across VLM families
- Add `_normalize_message_content()` to deduplicate string→content blocks normalization
- Remove all `[VLM DEBUG]` print statements
- Replace `_build_mm_token_type_ids` instance attribute with a local variable
- Remove unused `tool_multimodal_fields` return value from `_get_tool_suffix_ids`
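A rough sketch of what building `mm_token_type_ids` from `prompt_completion_ids` could look like; the actual implementation isn't shown in this thread, and the `vision_token_ids` set is assumed to come from the `_get_vision_token_ids()` helper named in this commit:

```python
import torch

def build_mm_token_type_ids(prompt_completion_ids: torch.Tensor, vision_token_ids: set[int]) -> torch.Tensor:
    """Mark positions holding vision tokens with type id 1, text positions with 0.

    Hedged sketch only: the commit describes detecting image_pad tokens in the
    combined sequence, which this approximates by matching known vision token ids.
    """
    type_ids = torch.zeros_like(prompt_completion_ids)
    for token_id in vision_token_ids:
        type_ids[prompt_completion_ids == token_id] = 1
    return type_ids
```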
…on, remove temp save_strategy
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I don't think the current failing tests are related @albertvillanova 🤔
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
```python
@@ -1750,7 +1963,11 @@ def _generate_and_score_completions(
            num_items_in_batch,
            sampling_per_token_logps_list,
            extra_fields,
            images,
            tool_images,
        ) = self._generate(prompts)
        if images is None:
            images = dataset_images  # restore dataset images (rollout_func path returns None)
```
I think only `tool_images` needs to cross the boundary between `_generate` and the caller. The caller already has the dataset images. The merge of `tool_images` into `images` could happen in `_generate_and_score_completions` after the unpack. Therefore, you could eliminate the save/restore pattern and the name confusion around what `images` means at each point.

Something like (partially implemented):
```python
(
    prompt_ids_list,
    completion_ids_list,
    tool_mask_list,
    completions,
    num_items_in_batch,
    sampling_per_token_logps_list,
    extra_fields,
    tool_images,
) = self._generate(prompts)
```
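A hedged sketch of the caller-side merge this suggestion implies, mirroring the merge logic quoted later in this thread; `dataset_images` is the caller's own copy of the dataset-provided images:

```python
# Merge tool-returned images into the dataset images at the call site,
# so _generate never needs to see (or return) the dataset images.
if any(imgs for imgs in tool_images):
    if dataset_images is None:
        images = [imgs if imgs else None for imgs in tool_images]
    else:
        images = [
            (existing or []) + new
            for existing, new in zip(dataset_images, tool_images, strict=True)
        ]
else:
    images = dataset_images
```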
Great suggestion! It'd be cleaner. It'd require moving the `tool_images` merge logic from `_generate` to the caller, which touches a few places 😅 I'll address this in a follow-up (I'll create an issue later).
```python
else:
    merged_images = [
        (existing or []) + new for existing, new in zip(merged_images, tool_images, strict=True)
    ]
```
The image merging logic is duplicated. Maybe better to use a helper function `_merge_images`?
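A minimal sketch of the suggested helper, factoring out the logic duplicated in the two snippets quoted in this thread (the `_merge_images` name is the reviewer's suggestion, not existing code):

```python
def _merge_images(images: list | None, tool_images: list[list]) -> list | None:
    """Append per-sample tool-returned images onto the existing images list."""
    if not any(imgs for imgs in tool_images):
        return images
    if images is None:
        return [imgs if imgs else None for imgs in tool_images]
    return [(existing or []) + new for existing, new in zip(images, tool_images, strict=True)]
```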
Would 2 cases be enough to justify a helper? 🤔 I think we can leave it as is and create the helper afterwards if needed.
```python
# Merge tool response images into the images list for the forward pass
if any(imgs for imgs in tool_images):
    if images is None:
        images = [imgs if imgs else None for imgs in tool_images]
    else:
        images = [(existing or []) + new for existing, new in zip(images, tool_images, strict=True)]
```
Duplicated image merging logic.
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
```python
# and drop ALL images for any sample with a mismatch (safe fallback).
image_grid_thw = forward_kwargs.get("image_grid_thw")
if image_grid_thw is not None and num_images is not None:
    merge_length = getattr(self.processing_class.image_processor, "merge_size", 2) ** 2
```
**New `getattr` calls violate project simplicity rules** (Low Severity)

Several new `getattr` calls are introduced: `getattr(self.processing_class.image_processor, "merge_size", 2)`, `getattr(parsing_class, "tokenizer", parsing_class)`, and `getattr(tokenizer_or_processor, "tokenizer", tokenizer_or_processor)`. The project rules explicitly state to avoid `getattr` and `hasattr` as symptoms of overly defensive programming. Since `self._is_vlm` is already available, the tokenizer extraction could use `self.processing_class.tokenizer if self._is_vlm else self.processing_class`. The `merge_size` default of 2 is a silent assumption that could cause incorrect truncation safety checks for non-Qwen VLMs.

Additional Locations (2)

Triggered by project rule: `../.ai/AGENTS.md`
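A hedged sketch of the rewrite Bugbot suggests, assuming `self._is_vlm` and `self.processing_class.image_processor` are available where these calls occur:

```python
# Rely on the existing self._is_vlm flag instead of getattr-based duck typing,
# and read merge_size directly so a missing attribute fails loudly rather than
# silently assuming 2 for non-Qwen VLMs.
tokenizer = self.processing_class.tokenizer if self._is_vlm else self.processing_class
merge_length = self.processing_class.image_processor.merge_size ** 2
```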
```python
# VLM processors don't have parse_response directly; use the inner tokenizer
tokenizer = getattr(tokenizer_or_processor, "tokenizer", tokenizer_or_processor)
try:
    parsed = tokenizer.parse_response(ids)
```
@Rocketknight1 should we expect VLM processors to have `parse_response` in the future?
```python
        self._vision_token_ids_cache = cache
        return self._vision_token_ids_cache

    def _truncate_at_image_boundary(self, ids, max_length):
```
I understand the motivation, but based on our past experience (when we experimented with prompt truncation, a feature we ultimately removed), working with image tokens tends to be extremely brittle and difficult to maintain. I would strongly recommend prioritizing truncation of the full tool response instead. I'll open a PR for that; we can discuss it there.


What does this PR do?

Enable `environment_factory` tool methods to return multimodal content blocks (images + text) for VLM training.

Previously, tool responses were always converted to `str(result)`, discarding any visual information. Now tools can return content block lists that include images, and the trainer handles them end-to-end: tokenization, generation, and the training forward pass with proper `pixel_values`.

Also migrates `browsergym.py` (VLM) from `rollout_func` to `environment_factory` with multimodal tool responses.

Depends on: #5235 (OpenEnv examples update) and related to huggingface/transformers#44873
Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Note

Medium Risk

Touches core `GRPOTrainer` generation/tool-calling and multimodal forward-pass plumbing, which could affect VLM training stability and sequence-length edge cases across multiple trainers.

Overview

Adds end-to-end support for multimodal tool responses (e.g., image + text content blocks) when using `GRPOTrainer` with `environment_factory`, so tool outputs are no longer stringified and can feed VLMs with `pixel_values` during tool-call loops.

Updates `GRPOTrainer` internals to track/merge tool-returned images, tokenize tool messages correctly for VLM processors, avoid truncating mid-image token spans, and add safer `mm_token_type_ids`/max-length handling for VLM configs; similar image-handling fixes are applied in `GFPOTrainer`, `GRPOWithReplayBufferTrainer`, and `RLOOTrainer`. Also migrates the OpenEnv `browsergym.py` example to `environment_factory` with screenshot-returning tool methods, adds a new `carla_vlm.py` OpenEnv VLM example, and documents the new example in `example_overview.md`.

Written by Cursor Bugbot for commit cad17f3. This will update automatically on new commits. Configure here.