
Add truncation to SFT DataCollatorForLanguageModeling #5315

Merged
albertvillanova merged 11 commits into huggingface:main from albertvillanova:fu-5306 on Mar 23, 2026

Conversation

@albertvillanova
Member

@albertvillanova albertvillanova commented Mar 19, 2026

Add truncation to SFT DataCollatorForLanguageModeling.

This PR adds support for sequence truncation to the DataCollatorForLanguageModeling class, allowing sequences to be truncated to a specified maximum length from either the start or end. It also introduces comprehensive tests to ensure correct behavior for different truncation modes and edge cases.

This PR aligns SFT DataCollatorForLanguageModeling with the existing SFT DataCollatorForVisionLanguageModeling, which already truncates inputs.

Follow-up to:

Changes

Enhancements to sequence truncation:

  • Added max_length and truncation_mode parameters to DataCollatorForLanguageModeling, enabling truncation of sequences longer than max_length with options to keep either the start or end tokens.
  • Implemented per-sequence truncation logic in the torch_call method, applying the specified truncation mode and ensuring associated masks (such as completion_mask and assistant_masks) are truncated consistently. A ValueError is raised for unsupported truncation modes.
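The per-sequence truncation described above can be sketched as follows. This is a minimal standalone sketch, not the actual TRL implementation: the `truncate_sequences` helper, its signature, and the set of keys it slices are illustrative.

```python
def truncate_sequences(features, max_length, truncation_mode="keep_end"):
    """Truncate each example's token lists to at most max_length tokens.

    Sketch of the collator behavior: "keep_start" keeps the first
    max_length tokens, "keep_end" keeps the last max_length tokens, and
    any other mode raises ValueError. Associated masks are sliced with
    the same window so they stay aligned with input_ids.
    """
    if truncation_mode == "keep_start":
        sl = slice(None, max_length)
    elif truncation_mode == "keep_end":
        sl = slice(-max_length, None)
    else:
        raise ValueError(f"Unsupported truncation_mode: {truncation_mode!r}")

    truncated = []
    for example in features:
        out = dict(example)  # shallow copy; do not mutate the input
        for key in ("input_ids", "labels", "completion_mask", "assistant_masks"):
            if key in out:
                # Python slicing is a no-op for sequences already shorter
                # than max_length, so no length check is needed.
                out[key] = out[key][sl]
        truncated.append(out)
    return truncated
```

Note that `seq[-max_length:]` gracefully returns the whole sequence when it is shorter than `max_length`, which is why a single slice object can serve both the truncating and non-truncating cases.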

Testing improvements:

  • Added multiple test cases covering:
    • Truncation from the start and end.
    • Behavior when no truncation is needed.
    • Truncation with completion masks.
    • Handling of invalid truncation modes.
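Tests along the lines listed above might look roughly like this. The class and method names are hypothetical, and the collator is stubbed by a plain slicing helper rather than the real `DataCollatorForLanguageModeling`:

```python
import unittest


class TruncationTests(unittest.TestCase):
    # Hypothetical tests mirroring the cases listed above; the real suite
    # exercises the collator itself rather than this stub.
    def _truncate(self, ids, max_length, mode):
        if mode == "keep_start":
            return ids[:max_length]
        if mode == "keep_end":
            return ids[-max_length:]
        raise ValueError(f"Unsupported truncation_mode: {mode!r}")

    def test_keep_start(self):
        self.assertEqual(self._truncate([1, 2, 3, 4], 2, "keep_start"), [1, 2])

    def test_keep_end(self):
        self.assertEqual(self._truncate([1, 2, 3, 4], 2, "keep_end"), [3, 4])

    def test_no_truncation_needed(self):
        # Sequences already within max_length pass through unchanged.
        self.assertEqual(self._truncate([1, 2], 5, "keep_end"), [1, 2])

    def test_invalid_mode(self):
        with self.assertRaises(ValueError):
            self._truncate([1, 2], 1, "keep_middle")
```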

Note

Medium Risk
Changes how SFT batches are built by introducing truncation in DataCollatorForLanguageModeling, which can alter training inputs/labels and raise new configuration errors (especially around padding_free and dataset prep). Test coverage mitigates the risk, but the behavior change may impact existing training setups that rely on untruncated batches.

Overview
Adds optional per-example truncation to DataCollatorForLanguageModeling via new max_length and truncation_mode (keep_start/keep_end), ensuring labels, completion_mask, and assistant_masks are truncated consistently and rejecting unsupported modes.

Wires SFTTrainer to pass max_length/truncation_mode into the text collator (disabled when padding_free=True) and introduces a guard that errors when skip_prepare_dataset=True is used with padding_free=True and a non-None max_length.
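The guard described above might be sketched as follows. This is a hedged illustration: the function name, argument names, and placement are assumptions, though the error message quoted later in this thread is taken from the diff.

```python
def check_collator_config(padding_free, skip_prepare_dataset, max_length):
    """Reject the one combination the collator cannot handle.

    With padding_free=True the collator flattens sequences instead of
    padding them, so max_length must already have been enforced during
    dataset preparation or packing. Skipping that preparation while still
    requesting a max_length is therefore a configuration error.
    """
    if padding_free and skip_prepare_dataset and max_length is not None:
        raise ValueError(
            "When `padding_free=True`, `max_length` must be enforced during "
            "dataset preparation or packing, not in the collator. Disable "
            "`skip_prepare_dataset`, provide already packed/truncated inputs, "
            "or set `max_length=None`."
        )
```

Only the three-way combination errors out; any single flag on its own remains valid.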

Extends tests to cover truncation modes, mask interactions, invalid modes, and trainer configuration propagation/validation.

Written by Cursor Bugbot for commit 281c815. This will update automatically on new commits.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec
Member

@albertvillanova
Member Author

albertvillanova commented Mar 19, 2026

Is the next step to remove...

Yes @qgallouedec, analogously to what I already proposed in the previous PR (see motivation 3 in #5305 (comment)), my refactoring plan is to remove the truncation-after-padding currently in the trainer body and require it to be done by the collator.


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread trl/trainer/sft_trainer.py
Comment on lines +925 to +927
"When `padding_free=True`, `max_length` must be enforced during dataset preparation or packing, not in "
"the collator. Disable `skip_prepare_dataset`, provide already packed/truncated inputs, or set "
"`max_length=None`."
Member


Ok so you're saying: if you decide to skip the dataset preparation, then padding-free isn't supported with max_length, because we would need to truncate the seq_lengths column, which is quite annoying to do

Member


If my understanding is correct, then I agree. Fortunately, args.dataset_kwargs.get("skip_prepare_dataset") is True is very rare as far as I know

Member

@qgallouedec qgallouedec left a comment


Ok, it looks good!

Comment on lines +187 to +190
if self.truncation_mode == "keep_start":
sl = slice(None, self.max_length)
elif self.truncation_mode == "keep_end":
sl = slice(-self.max_length, None)
Member


nice!

@albertvillanova albertvillanova merged commit d8a2dd5 into huggingface:main Mar 23, 2026
12 checks passed
