
Support multimodal tool responses in environment_factory for VLM training#5323

Merged
sergiopaniego merged 73 commits into main from multimodal-tool-responses
Apr 2, 2026
Conversation

@sergiopaniego
Member

@sergiopaniego sergiopaniego commented Mar 20, 2026

What does this PR do?

Enable environment_factory tool methods to return multimodal content blocks (images + text) for VLM training.
Previously, tool responses were always converted to str(result), discarding any visual information. Now tools can
return content block lists that include images, and the trainer handles them end-to-end: tokenization, generation,
and training forward pass with proper pixel_values.

Also migrates browsergym.py (VLM) from rollout_func to environment_factory with multimodal tool responses.

Depends on: #5235 (OpenEnv examples update) and related to huggingface/transformers#44873
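As an illustration of the new contract (hypothetical tool names, not code from this PR), a tool method can now return a list of content blocks instead of a plain string:

```python
def capture_frame():
    # Stand-in for a real screenshot/camera capture; in practice any
    # PIL-style image object the processor accepts would go here.
    return {"width": 64, "height": 64, "pixels": b"\x00" * (64 * 64 * 3)}

def take_screenshot():
    # Return content blocks rather than str(result), so the image survives
    # tokenization and reaches the VLM forward pass as pixel_values.
    return [
        {"type": "text", "text": "Current browser view:"},
        {"type": "image", "image": capture_frame()},
    ]
```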

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Note

Medium Risk
Touches core GRPOTrainer generation/tool-calling and multimodal forward-pass plumbing, which could affect VLM training stability and sequence-length edge cases across multiple trainers.

Overview
Adds end-to-end support for multimodal tool responses (e.g., image + text content blocks) when using GRPOTrainer with environment_factory, so tool outputs are no longer stringified and can feed VLMs with pixel_values during tool-call loops.

Updates GRPOTrainer internals to track/merge tool-returned images, tokenize tool messages correctly for VLM processors, avoid truncating mid-image token spans, and add safer mm_token_type_ids/max-length handling for VLM configs; similar image-handling fixes are applied in GFPOTrainer, GRPOWithReplayBufferTrainer, and RLOOTrainer. Also migrates the OpenEnv browsergym.py example to environment_factory with screenshot-returning tool methods, adds a new carla_vlm.py OpenEnv VLM example, and documents the new example in example_overview.md.

Written by Cursor Bugbot for commit cad17f3. This will update automatically on new commits.

Comment thread trl/trainer/grpo_trainer.py Outdated
Comment on lines +1244 to +1251
        # When the processor is a VLM processor, normalize string content to content blocks.
        # Some VLM processors (e.g., Qwen3.5, Qwen3-VL) iterate over message["content"]
        # assuming it's always a list of content blocks, which fails when content is a plain string.
        if hasattr(self.processing_class, "image_processor"):
            for prompt in prompts:
                for message in prompt:
                    if isinstance(message["content"], str):
                        message["content"] = [{"type": "text", "text": message["content"]}]
Member

you may want to use trl.data_utils.prepare_multimodal_messages here
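The normalization in the quoted snippet can be sketched standalone (processor-free; assumes messages follow the usual chat-template dict shape):

```python
def normalize_message_content(messages):
    # Convert plain-string content into a single text content block, matching
    # what list-iterating VLM processors expect. Mutates in place and returns
    # the same list for convenience.
    for message in messages:
        if isinstance(message["content"], str):
            message["content"] = [{"type": "text", "text": message["content"]}]
    return messages
```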

sergiopaniego and others added 21 commits March 20, 2026 17:05
- Fix _get_tool_suffix_ids: add tokenize=True and unbatch for VLM processors
- Fix parse_response: delegate to inner tokenizer for processors (2 places)
- Propagate response_schema from processor to tokenizer for tool call parsing
- Fix max_position_embeddings fallback for models with composite config (e.g., Qwen3.5)
- Add carla_vlm.py: CARLA VLM training with multimodal tool responses (camera images)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…l-responses

# Conflicts:
#	trl/trainer/grpo_trainer.py
…nses

# Conflicts:
#	docs/source/example_overview.md
#	trl/trainer/grpo_trainer.py
…pe_ids from input_ids, handle truncation

- Return images from _generate to _generate_and_score_completions
- Build mm_token_type_ids directly from prompt_completion_ids (detect image_pad tokens)
- Place mm_token_type_ids construction after the extension block to avoid double-extension
- Use image_processor directly for pixel_values (avoids re-processing mismatch)
- Trim pixel_values/image_grid_thw when truncation cuts partial images
- Normalize string content in tool messages for VLM processors in _get_tool_suffix_ids
- Use content blocks format for dummy messages with VLM processors
…detection

- Add _get_vision_token_ids() to detect image tokens dynamically across VLM families
- Add _normalize_message_content() to deduplicate string→content blocks normalization
- Remove all [VLM DEBUG] print statements
- Replace _build_mm_token_type_ids instance attribute with local variable
- Remove unused tool_multimodal_fields return value from _get_tool_suffix_ids
@sergiopaniego sergiopaniego marked this pull request as ready for review March 24, 2026 11:53
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread trl/trainer/grpo_trainer.py
Comment thread trl/trainer/grpo_trainer.py Outdated
@sergiopaniego
Member Author

I don't think the current failing tests are related @albertvillanova 🤔

Comment thread trl/trainer/grpo_trainer.py Outdated
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Comment thread trl/trainer/grpo_trainer.py Outdated
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Comment thread trl/trainer/grpo_trainer.py Outdated
Comment on lines 1957 to +1970
@@ -1750,7 +1963,11 @@ def _generate_and_score_completions(
            num_items_in_batch,
            sampling_per_token_logps_list,
            extra_fields,
            images,
            tool_images,
        ) = self._generate(prompts)
        if images is None:
            images = dataset_images  # restore dataset images (rollout_func path returns None)
Member

I think only tool_images needs to cross the boundary between _generate and the caller. The caller already has the dataset images. The merge of tool_images into images could happen in _generate_and_score_completions after the unpack. Therefore, you could eliminate the save/restore pattern and the name confusion around what images means at each point.

Something like (partially implemented):

Suggested change
        (
            prompt_ids_list,
            completion_ids_list,
            tool_mask_list,
            completions,
            num_items_in_batch,
            sampling_per_token_logps_list,
            extra_fields,
            tool_images,
        ) = self._generate(prompts)

Member Author

Great suggestion! It'd be clearer. It'd require moving the tool_images merge logic from _generate to the caller, which touches a few places 😅 I'll address this in a follow-up (I'll create an issue later)

        else:
            merged_images = [
                (existing or []) + new for existing, new in zip(merged_images, tool_images, strict=True)
            ]
Member

The image merging logic is duplicated. Maybe better to use a helper function _merge_images?

Member Author

Would 2 cases be enough to justify a helper? 🤔 I think we can leave it as it is and create the helper afterwards if needed

Comment on lines +1839 to +1844
        # Merge tool response images into the images list for the forward pass
        if any(imgs for imgs in tool_images):
            if images is None:
                images = [imgs if imgs else None for imgs in tool_images]
            else:
                images = [(existing or []) + new for existing, new in zip(images, tool_images, strict=True)]
Member

Duplicated image merging logic.

Comment thread trl/trainer/grpo_trainer.py Outdated
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Comment thread trl/trainer/grpo_trainer.py
Member

@albertvillanova albertvillanova left a comment

Thanks for your incredible work: awesome as always! 🤗

Let's keep the pending minor issues for a subsequent PR.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).


        # and drop ALL images for any sample with a mismatch (safe fallback).
        image_grid_thw = forward_kwargs.get("image_grid_thw")
        if image_grid_thw is not None and num_images is not None:
            merge_length = getattr(self.processing_class.image_processor, "merge_size", 2) ** 2

New getattr calls violate project simplicity rules

Low Severity

Several new getattr calls are introduced: getattr(self.processing_class.image_processor, "merge_size", 2), getattr(parsing_class, "tokenizer", parsing_class), and getattr(tokenizer_or_processor, "tokenizer", tokenizer_or_processor). The project rules explicitly state to avoid getattr and hasattr as symptoms of overly defensive programming. Since self._is_vlm is already available, the tokenizer extraction could use self.processing_class.tokenizer if self._is_vlm else self.processing_class. The merge_size default of 2 is a silent assumption that could cause incorrect truncation safety checks for non-Qwen VLMs.

Additional Locations (2)

Triggered by project rule: ../.ai/AGENTS.md

@sergiopaniego sergiopaniego merged commit 8501179 into main Apr 2, 2026
14 of 16 checks passed
@sergiopaniego sergiopaniego deleted the multimodal-tool-responses branch April 2, 2026 10:50
Comment on lines +909 to 912
        # VLM processors don't have parse_response directly; use the inner tokenizer
        tokenizer = getattr(tokenizer_or_processor, "tokenizer", tokenizer_or_processor)
        try:
            parsed = tokenizer.parse_response(ids)
Member

@Rocketknight1 should we expect VLM processors to have parse_response in the future?

        self._vision_token_ids_cache = cache
        return self._vision_token_ids_cache

    def _truncate_at_image_boundary(self, ids, max_length):
Member

I understand the motivation, but based on our past experience (when we experimented with prompt truncation, a feature we ultimately removed), working with image tokens tends to be extremely brittle and difficult to maintain. I would strongly recommend prioritizing truncation of the full tool response instead. I'll open a PR for that; we can discuss it there.
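For intuition, the boundary-aware idea can be sketched as follows (illustration only, with an assumed vision_token_ids set; it conservatively drops any image-token run touching the cut point, which is one source of the brittleness discussed above):

```python
def truncate_at_image_boundary(ids, max_length, vision_token_ids):
    # Truncate a token id list to max_length, then back off further while the
    # cut would land inside a contiguous run of vision (image) tokens, so no
    # partial image span reaches the model.
    if len(ids) <= max_length:
        return ids
    cut = max_length
    while cut > 0 and ids[cut - 1] in vision_token_ids:
        cut -= 1
    return ids[:cut]
```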


5 participants