Support multimodal tool responses in environment_factory for VLM training #5323

sergiopaniego merged 73 commits into main from multimodal-tool-responses

Conversation
```python
# When the processor is a VLM processor, normalize string content to content blocks.
# Some VLM processors (e.g., Qwen3.5, Qwen3-VL) iterate over message["content"]
# assuming it's always a list of content blocks, which fails when content is a plain string.
if hasattr(self.processing_class, "image_processor"):
    for prompt in prompts:
        for message in prompt:
            if isinstance(message["content"], str):
                message["content"] = [{"type": "text", "text": message["content"]}]
```
You may want to use `trl.data_utils.prepare_multimodal_messages` here.
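For illustration, a minimal standalone sketch of the normalization the reviewer is pointing at; the real `trl.data_utils.prepare_multimodal_messages` helper may take additional arguments (its exact signature isn't shown in this thread):

```python
def normalize_content_blocks(messages: list[dict]) -> list[dict]:
    """Wrap plain-string content into [{"type": "text", ...}] blocks in place.

    Minimal sketch only: this mirrors the manual loop in the diff above, which
    is what the real prepare_multimodal_messages helper is expected to cover.
    """
    for message in messages:
        if isinstance(message.get("content"), str):
            message["content"] = [{"type": "text", "text": message["content"]}]
    return messages
```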
- Fix `_get_tool_suffix_ids`: add `tokenize=True` and unbatch for VLM processors
- Fix `parse_response`: delegate to inner tokenizer for processors (2 places)
- Propagate `response_schema` from processor to tokenizer for tool call parsing
- Fix `max_position_embeddings` fallback for models with composite config (e.g., Qwen3.5)
- Add `carla_vlm.py`: CARLA VLM training with multimodal tool responses (camera images)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pe_ids from input_ids, handle truncation
- Return images from `_generate` to `_generate_and_score_completions`
- Build `mm_token_type_ids` directly from `prompt_completion_ids` (detect image_pad tokens)
- Place `mm_token_type_ids` construction after the extension block to avoid double-extension
- Use `image_processor` directly for `pixel_values` (avoids re-processing mismatch)
- Trim `pixel_values`/`image_grid_thw` when truncation cuts partial images
- Normalize string content in tool messages for VLM processors in `_get_tool_suffix_ids`
- Use content blocks format for dummy messages with VLM processors
…detection
- Add `_get_vision_token_ids()` to detect image tokens dynamically across VLM families
- Add `_normalize_message_content()` to deduplicate string→content blocks normalization
- Remove all `[VLM DEBUG]` print statements
- Replace `_build_mm_token_type_ids` instance attribute with a local variable
- Remove unused `tool_multimodal_fields` return value from `_get_tool_suffix_ids`
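A rough sketch of what building `mm_token_type_ids` from `prompt_completion_ids` could look like; the actual implementation isn't shown in this thread, and the `vision_token_ids` set is assumed to come from the `_get_vision_token_ids()` helper named in this commit:

```python
import torch

def build_mm_token_type_ids(prompt_completion_ids: torch.Tensor, vision_token_ids: set[int]) -> torch.Tensor:
    """Mark positions holding vision tokens with type id 1, text positions with 0.

    Hedged sketch only: the commit describes detecting image_pad tokens in the
    combined sequence, which this approximates by matching known vision token ids.
    """
    type_ids = torch.zeros_like(prompt_completion_ids)
    for token_id in vision_token_ids:
        type_ids[prompt_completion_ids == token_id] = 1
    return type_ids
```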
…on, remove temp save_strategy
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I don't think the current failing tests are related @albertvillanova 🤔
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
```python
@@ -1750,7 +1963,11 @@ def _generate_and_score_completions(
            num_items_in_batch,
            sampling_per_token_logps_list,
            extra_fields,
            images,
            tool_images,
        ) = self._generate(prompts)
        if images is None:
            images = dataset_images  # restore dataset images (rollout_func path returns None)
```
I think only `tool_images` needs to cross the boundary between `_generate` and the caller. The caller already has the dataset images. The merge of `tool_images` into `images` could happen in `_generate_and_score_completions` after the unpack. Therefore, you could eliminate the save/restore pattern and the name confusion around what `images` means at each point.

Something like (partially implemented):
```python
(
    prompt_ids_list,
    completion_ids_list,
    tool_mask_list,
    completions,
    num_items_in_batch,
    sampling_per_token_logps_list,
    extra_fields,
    tool_images,
) = self._generate(prompts)
```
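A hedged sketch of the caller-side merge this suggestion implies, mirroring the merge logic quoted later in this thread; `dataset_images` is the caller's own copy of the dataset-provided images:

```python
# Merge tool-returned images into the dataset images at the call site,
# so _generate never needs to see (or return) the dataset images.
if any(imgs for imgs in tool_images):
    if dataset_images is None:
        images = [imgs if imgs else None for imgs in tool_images]
    else:
        images = [
            (existing or []) + new
            for existing, new in zip(dataset_images, tool_images, strict=True)
        ]
else:
    images = dataset_images
```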
Great suggestion! It'd be cleaner. It'd require moving the `tool_images` merge logic from `_generate` to the caller, which touches a few places 😅 I'll address this in a follow-up (I'll create an issue later).
```python
else:
    merged_images = [
        (existing or []) + new for existing, new in zip(merged_images, tool_images, strict=True)
    ]
```
The image merging logic is duplicated. Maybe better to use a helper function `_merge_images`?
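A minimal sketch of the suggested helper, factoring out the logic duplicated in the two snippets quoted in this thread (the `_merge_images` name is the reviewer's suggestion, not existing code):

```python
def _merge_images(images: list | None, tool_images: list[list]) -> list | None:
    """Append per-sample tool-returned images onto the existing images list."""
    if not any(imgs for imgs in tool_images):
        return images
    if images is None:
        return [imgs if imgs else None for imgs in tool_images]
    return [(existing or []) + new for existing, new in zip(images, tool_images, strict=True)]
```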
Would 2 cases be enough to justify a helper? 🤔 I think we can leave it as is and create the helper afterwards if needed.
```python
# Merge tool response images into the images list for the forward pass
if any(imgs for imgs in tool_images):
    if images is None:
        images = [imgs if imgs else None for imgs in tool_images]
    else:
        images = [(existing or []) + new for existing, new in zip(images, tool_images, strict=True)]
```
Duplicated image merging logic.
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
```python
# and drop ALL images for any sample with a mismatch (safe fallback).
image_grid_thw = forward_kwargs.get("image_grid_thw")
if image_grid_thw is not None and num_images is not None:
    merge_length = getattr(self.processing_class.image_processor, "merge_size", 2) ** 2
```
**New `getattr` calls violate project simplicity rules** (Low Severity)

Several new `getattr` calls are introduced: `getattr(self.processing_class.image_processor, "merge_size", 2)`, `getattr(parsing_class, "tokenizer", parsing_class)`, and `getattr(tokenizer_or_processor, "tokenizer", tokenizer_or_processor)`. The project rules explicitly state to avoid `getattr` and `hasattr` as symptoms of overly defensive programming. Since `self._is_vlm` is already available, the tokenizer extraction could use `self.processing_class.tokenizer if self._is_vlm else self.processing_class`. The `merge_size` default of 2 is a silent assumption that could cause incorrect truncation safety checks for non-Qwen VLMs.

Additional Locations (2)

Triggered by project rule: `../.ai/AGENTS.md`
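A hedged sketch of the rewrite Bugbot suggests, assuming `self._is_vlm` and `self.processing_class.image_processor` are available where these calls occur:

```python
# Rely on the existing self._is_vlm flag instead of getattr-based duck typing,
# and read merge_size directly so a missing attribute fails loudly rather than
# silently assuming 2 for non-Qwen VLMs.
tokenizer = self.processing_class.tokenizer if self._is_vlm else self.processing_class
merge_length = self.processing_class.image_processor.merge_size ** 2
```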
```python
# VLM processors don't have parse_response directly; use the inner tokenizer
tokenizer = getattr(tokenizer_or_processor, "tokenizer", tokenizer_or_processor)
try:
    parsed = tokenizer.parse_response(ids)
```
@Rocketknight1 should we expect VLM processors to have `parse_response` in the future?
```python
        self._vision_token_ids_cache = cache
        return self._vision_token_ids_cache

    def _truncate_at_image_boundary(self, ids, max_length):
```
I understand the motivation, but based on our past experience (when we experimented with prompt truncation, a feature we ultimately removed), working with image tokens tends to be extremely brittle and difficult to maintain. I would strongly recommend prioritizing truncation of the full tool response instead. I'll open a PR for that; we can discuss it there.


What does this PR do?

Enable `environment_factory` tool methods to return multimodal content blocks (images + text) for VLM training.

Previously, tool responses were always converted to `str(result)`, discarding any visual information. Now tools can return content block lists that include images, and the trainer handles them end-to-end: tokenization, generation, and the training forward pass with proper `pixel_values`.

Also migrates `browsergym.py` (VLM) from `rollout_func` to `environment_factory` with multimodal tool responses.

Depends on: #5235 (OpenEnv examples update) and related to huggingface/transformers#44873
Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Note

Medium Risk

Touches core `GRPOTrainer` generation/tool-calling and multimodal forward-pass plumbing, which could affect VLM training stability and sequence-length edge cases across multiple trainers.

Overview

Adds end-to-end support for multimodal tool responses (e.g., image + text content blocks) when using `GRPOTrainer` with `environment_factory`, so tool outputs are no longer stringified and can feed VLMs with `pixel_values` during tool-call loops.

Updates `GRPOTrainer` internals to track/merge tool-returned images, tokenize tool messages correctly for VLM processors, avoid truncating mid-image token spans, and add safer `mm_token_type_ids`/max-length handling for VLM configs; similar image-handling fixes are applied in `GFPOTrainer`, `GRPOWithReplayBufferTrainer`, and `RLOOTrainer`. Also migrates the OpenEnv `browsergym.py` example to `environment_factory` with screenshot-returning tool methods, adds a new `carla_vlm.py` OpenEnv VLM example, and documents the new example in `example_overview.md`.

Written by Cursor Bugbot for commit cad17f3. This will update automatically on new commits. Configure here.