Add truncation to SFT DataCollatorForLanguageModeling #5315

albertvillanova merged 11 commits into huggingface:main from
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Yes @qgallouedec, analogously to what I already proposed in the previous PR (see motivation 3 in #5305 (comment)), my refactoring plan is to remove the truncation-after-padding currently in the body of the trainer, and require it to be done by the collator.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
| "When `padding_free=True`, `max_length` must be enforced during dataset preparation or packing, not in " | ||
| "the collator. Disable `skip_prepare_dataset`, provide already packed/truncated inputs, or set " | ||
| "`max_length=None`." |
OK, so you're saying: if you decide to skip the dataset preparation, then padding-free isn't supported with `max_length`, because we would need to truncate the `seq_lengths` column, which is quite annoying to do.
If my understanding is correct, then I agree. Fortunately, `args.dataset_kwargs.get("skip_prepare_dataset") is True` is very rare, as far as I know.
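To make the awkwardness concrete, here is a minimal sketch in plain Python of what truncating a padding-free batch would involve. The flat `input_ids` plus `seq_lengths` layout is assumed from the comment above; the point is that truncation cannot be a single slice, because every boundary in `seq_lengths` must be recomputed as well.

```python
# Hypothetical padding-free batch: three packed examples in one flat list,
# with per-example lengths stored separately.
input_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9]
seq_lengths = [4, 3, 2]  # boundaries: [1-4], [5-7], [8-9]

# Truncating each example to max_length=2 means walking the boundaries,
# keeping a prefix of each example, and rebuilding seq_lengths.
max_length = 2
truncated, new_lengths, start = [], [], 0
for n in seq_lengths:
    kept = input_ids[start:start + min(n, max_length)]  # keep_start per example
    truncated.extend(kept)
    new_lengths.append(len(kept))
    start += n

print(truncated, new_lengths)  # [1, 2, 5, 6, 8, 9] [2, 2, 2]
```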
| if self.truncation_mode == "keep_start": | ||
| sl = slice(None, self.max_length) | ||
| elif self.truncation_mode == "keep_end": | ||
| sl = slice(-self.max_length, None) |

Add truncation to SFT `DataCollatorForLanguageModeling`.

This PR adds support for sequence truncation to the `DataCollatorForLanguageModeling` class, allowing sequences to be truncated to a specified maximum length from either the start or the end. It also introduces comprehensive tests to ensure correct behavior for different truncation modes and edge cases.

This PR aligns the SFT `DataCollatorForLanguageModeling` with the existing SFT `DataCollatorForVisionLanguageModeling`, which already truncates inputs.

Follow-up to:
Changes

Enhancements to sequence truncation:

- Added `max_length` and `truncation_mode` parameters to `DataCollatorForLanguageModeling`, enabling truncation of sequences longer than `max_length` with options to keep either the start or end tokens.
- Implemented truncation in the `torch_call` method, applying the specified truncation mode and ensuring associated masks (such as `completion_mask` and `assistant_masks`) are truncated consistently. A `ValueError` is raised for unsupported truncation modes.

Testing improvements:
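As a rough sketch of the consistency requirement described above (not the actual `torch_call` implementation; the key names follow the PR description), the same slice has to be applied to every per-token field of an example so tokens and masks stay aligned:

```python
def truncate_example(example, max_length, truncation_mode="keep_start"):
    # Build one slice and apply it uniformly to all per-token fields.
    if truncation_mode == "keep_start":
        sl = slice(None, max_length)
    elif truncation_mode == "keep_end":
        sl = slice(-max_length, None)
    else:
        raise ValueError(f"Unsupported truncation_mode: {truncation_mode!r}")
    per_token_keys = ("input_ids", "labels", "completion_mask", "assistant_masks")
    return {
        key: value[sl] if key in per_token_keys else value
        for key, value in example.items()
    }

example = {
    "input_ids": [10, 11, 12, 13],
    "labels": [-100, 11, 12, 13],
    "completion_mask": [0, 1, 1, 1],
}
print(truncate_example(example, 2, "keep_end"))
# {'input_ids': [12, 13], 'labels': [12, 13], 'completion_mask': [1, 1]}
```

Truncating `input_ids` without also truncating the masks would silently mask the wrong positions, which is why the PR stresses consistent truncation.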
Note

Medium Risk

Changes how SFT batches are built by introducing truncation in `DataCollatorForLanguageModeling`, which can alter training inputs/labels and raise new configuration errors (especially around `padding_free` and dataset prep). Test coverage mitigates this, but the behavior changes may impact existing training setups relying on untruncated batches.

Overview

Adds optional per-example truncation to `DataCollatorForLanguageModeling` via new `max_length` and `truncation_mode` (`keep_start`/`keep_end`) parameters, ensuring `labels`, `completion_mask`, and `assistant_masks` are truncated consistently and rejecting unsupported modes.

Wires `SFTTrainer` to pass `max_length`/`truncation_mode` into the text collator (disabled when `padding_free=True`) and introduces a guard that errors when `skip_prepare_dataset=True` is used with `padding_free=True` and a non-None `max_length`.

Extends tests to cover truncation modes, mask interactions, invalid modes, and trainer configuration propagation/validation.
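The guard might look roughly like this. This is a sketch with plain arguments, not the actual `SFTTrainer` code; the error text is the one quoted from the diff earlier in the conversation:

```python
def check_collator_config(padding_free, skip_prepare_dataset, max_length):
    # Sketch of the validation: padding-free batches cannot be safely truncated
    # in the collator, so reject that combination up front.
    if padding_free and skip_prepare_dataset and max_length is not None:
        raise ValueError(
            "When `padding_free=True`, `max_length` must be enforced during dataset preparation or packing, "
            "not in the collator. Disable `skip_prepare_dataset`, provide already packed/truncated inputs, "
            "or set `max_length=None`."
        )

check_collator_config(padding_free=True, skip_prepare_dataset=False, max_length=1024)  # ok
check_collator_config(padding_free=True, skip_prepare_dataset=True, max_length=None)   # ok
```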
Written by Cursor Bugbot for commit 281c815.