Add truncation to SFT DataCollatorForLanguageModeling #5315

albertvillanova merged 11 commits into huggingface:main from
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Yes @qgallouedec, analogously to what I already proposed in the previous PR (see motivation 3 in #5305 (comment)), my refactoring plan is to remove the truncation-after-padding currently in the body of the trainer, and require it to be done by the collator.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
| "When `padding_free=True`, `max_length` must be enforced during dataset preparation or packing, not in " | ||
| "the collator. Disable `skip_prepare_dataset`, provide already packed/truncated inputs, or set " | ||
| "`max_length=None`." |
OK, so you're saying: if you decide to skip the dataset preparation, then padding-free isn't supported with `max_length`, because we would need to truncate the `seq_lengths` column, which is quite annoying to do.
If my understanding is correct, then I agree. Fortunately, `args.dataset_kwargs.get("skip_prepare_dataset") is True` is very rare, as far as I know.
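To make the awkwardness concrete, here is a minimal sketch in plain Python of what truncating a padding-free batch would involve. The flat `input_ids` plus `seq_lengths` layout is assumed from the comment above; the point is that truncation cannot be a single slice, because every boundary in `seq_lengths` must be recomputed as well.

```python
# Hypothetical padding-free batch: three packed examples in one flat list,
# with per-example lengths stored separately.
input_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9]
seq_lengths = [4, 3, 2]  # boundaries: [1-4], [5-7], [8-9]

# Truncating each example to max_length=2 means walking the boundaries,
# keeping a prefix of each example, and rebuilding seq_lengths.
max_length = 2
truncated, new_lengths, start = [], [], 0
for n in seq_lengths:
    kept = input_ids[start:start + min(n, max_length)]  # keep_start per example
    truncated.extend(kept)
    new_lengths.append(len(kept))
    start += n

print(truncated, new_lengths)  # [1, 2, 5, 6, 8, 9] [2, 2, 2]
```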
| if self.truncation_mode == "keep_start": | ||
| sl = slice(None, self.max_length) | ||
| elif self.truncation_mode == "keep_end": | ||
| sl = slice(-self.max_length, None) |

Add truncation to SFT `DataCollatorForLanguageModeling`.

This PR adds support for sequence truncation to the `DataCollatorForLanguageModeling` class, allowing sequences to be truncated to a specified maximum length from either the start or the end. It also introduces comprehensive tests to ensure correct behavior for different truncation modes and edge cases.

This PR aligns the SFT `DataCollatorForLanguageModeling` with the existing SFT `DataCollatorForVisionLanguageModeling`, which already truncates inputs.

Follow-up to:
Changes

Enhancements to sequence truncation:

- Added `max_length` and `truncation_mode` parameters to `DataCollatorForLanguageModeling`, enabling truncation of sequences longer than `max_length` with options to keep either the start or end tokens.
- Implemented truncation in the `torch_call` method, applying the specified truncation mode and ensuring associated masks (such as `completion_mask` and `assistant_masks`) are truncated consistently. A `ValueError` is raised for unsupported truncation modes.

Testing improvements:
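As a rough sketch of the consistency requirement described above (not the actual `torch_call` implementation; the key names follow the PR description), the same slice has to be applied to every per-token field of an example so tokens and masks stay aligned:

```python
def truncate_example(example, max_length, truncation_mode="keep_start"):
    # Build one slice and apply it uniformly to all per-token fields.
    if truncation_mode == "keep_start":
        sl = slice(None, max_length)
    elif truncation_mode == "keep_end":
        sl = slice(-max_length, None)
    else:
        raise ValueError(f"Unsupported truncation_mode: {truncation_mode!r}")
    per_token_keys = ("input_ids", "labels", "completion_mask", "assistant_masks")
    return {
        key: value[sl] if key in per_token_keys else value
        for key, value in example.items()
    }

example = {
    "input_ids": [10, 11, 12, 13],
    "labels": [-100, 11, 12, 13],
    "completion_mask": [0, 1, 1, 1],
}
print(truncate_example(example, 2, "keep_end"))
# {'input_ids': [12, 13], 'labels': [12, 13], 'completion_mask': [1, 1]}
```

Truncating `input_ids` without also truncating the masks would silently mask the wrong positions, which is why the PR stresses consistent truncation.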
Note

Medium Risk

Changes how SFT batches are built by introducing truncation in `DataCollatorForLanguageModeling`, which can alter training inputs/labels and raise new configuration errors (especially around `padding_free` and dataset prep). Test coverage mitigates this, but the behavior changes may impact existing training setups relying on untruncated batches.

Overview

Adds optional per-example truncation to `DataCollatorForLanguageModeling` via new `max_length` and `truncation_mode` (`keep_start`/`keep_end`) parameters, ensuring `labels`, `completion_mask`, and `assistant_masks` are truncated consistently and rejecting unsupported modes.

Wires `SFTTrainer` to pass `max_length`/`truncation_mode` into the text collator (disabled when `padding_free=True`) and introduces a guard that errors when `skip_prepare_dataset=True` is used with `padding_free=True` and a non-None `max_length`.

Extends tests to cover truncation modes, mask interactions, invalid modes, and trainer configuration propagation/validation.
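The guard might look roughly like this. This is a sketch with plain arguments, not the actual `SFTTrainer` code; the error text is the one quoted from the diff earlier in the conversation:

```python
def check_collator_config(padding_free, skip_prepare_dataset, max_length):
    # Sketch of the validation: padding-free batches cannot be safely truncated
    # in the collator, so reject that combination up front.
    if padding_free and skip_prepare_dataset and max_length is not None:
        raise ValueError(
            "When `padding_free=True`, `max_length` must be enforced during dataset preparation or packing, "
            "not in the collator. Disable `skip_prepare_dataset`, provide already packed/truncated inputs, "
            "or set `max_length=None`."
        )

check_collator_config(padding_free=True, skip_prepare_dataset=False, max_length=1024)  # ok
check_collator_config(padding_free=True, skip_prepare_dataset=True, max_length=None)   # ok
```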
Written by Cursor Bugbot for commit 281c815.