Description
In TRL, sequence packing currently relies on basic concatenation or simple best-fit bin packing. These methods often lead to sub-optimal GPU utilization and complex attention-masking requirements.
OBFD (Overlapped Block-wise Fixed-size Dictionary) packing, as implemented in OLMo-core, offers a more deterministic and efficient way to pack variable-length sequences into fixed-size blocks (a minimal sketch of the core heuristic follows the list below). It would offer us:
- Minimized padding tokens across distributed ranks, ensuring every training block is "full."
- Improved TFLOPS and training throughput by maximizing compute density.
- The same strategy OLMo-3 uses to achieve significantly faster SFT speeds.
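To make the core idea concrete, here is a minimal Python sketch of a best-fit-decreasing packer that places variable-length tokenized sequences into fixed-size blocks. It illustrates the general heuristic only; the function name, `block_size`, and example data are placeholders, not OLMo-core's actual API.

```python
# Minimal sketch: pack variable-length sequences into fixed-size blocks
# using a best-fit-decreasing heuristic. Illustrative only; OLMo-core's
# actual OBFD implementation may differ.

def pack_sequences(sequences: list[list[int]], block_size: int) -> list[list[list[int]]]:
    """Return blocks, each a list of sequences whose total length <= block_size."""
    blocks: list[list[list[int]]] = []
    free: list[int] = []  # remaining capacity per block

    # Decreasing order: placing long sequences first leaves less stranded space.
    for seq in sorted(sequences, key=len, reverse=True):
        # Best fit: the block with the smallest remaining capacity
        # that can still hold this sequence.
        best = min(
            (i for i in range(len(blocks)) if free[i] >= len(seq)),
            key=lambda i: free[i],
            default=None,
        )
        if best is None:  # no existing block fits; open a new one
            blocks.append([seq])
            free.append(block_size - len(seq))
        else:
            blocks[best].append(seq)
            free[best] -= len(seq)
    return blocks

# Example: blocks of 8 tokens; padding is only whatever capacity remains.
packed = pack_sequences([[1] * 5, [2] * 3, [3] * 4, [4] * 2, [5] * 6], block_size=8)
```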
Implementation caveats:
a) TRL: The target is to pre-pack the data offline and save it in the pre-tokenized format that SFTTrainer accepts (see the first sketch below). Implementing the same packing for DPOTrainer, however, would be difficult without forking TRL.
b) TorchTitan: We can take a very similar pre-packing approach, or optionally pack dynamically within the SFT data-loading script (see the second sketch below). Either way, this depends on first implementing SFT support in TorchTitan.
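For (a), a rough sketch of the pre-packing route, reusing the `pack_sequences` helper from the sketch above. It leans on SFTTrainer accepting pre-tokenized data (an `input_ids` column); the model name, corpus, and output path are placeholders.

```python
from datasets import Dataset
from transformers import AutoTokenizer

from pack import pack_sequences  # hypothetical module holding the sketch above

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")  # example model
raw_texts = ["first example ...", "second example ..."]  # placeholder corpus

# Tokenize once, pack offline, and flatten each block into a single row.
tokenized = [tokenizer(text)["input_ids"] for text in raw_texts]
blocks = pack_sequences(tokenized, block_size=4096)
packed_ds = Dataset.from_dict(
    {"input_ids": [[tok for seq in block for tok in seq] for block in blocks]}
)

# With pre-tokenized rows, SFTTrainer skips its own tokenization/packing,
# so this dataset can later be loaded and passed as train_dataset directly.
packed_ds.save_to_disk("packed_sft_data")  # placeholder output path
```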
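For (b), a hypothetical sketch of the dynamic option: a streaming dataset that packs on the fly inside the data-loading script. Since a stream cannot be sorted globally, this degrades to a greedy fill rather than best-fit decreasing, and none of the names here correspond to actual TorchTitan APIs.

```python
from typing import Iterable, Iterator

import torch
from torch.utils.data import IterableDataset

class PackedIterableDataset(IterableDataset):
    """Hypothetical: streams tokenized sequences and yields fixed-size token blocks."""

    def __init__(self, sequences: Iterable[list[int]], block_size: int, pad_id: int = 0):
        self.sequences = sequences
        self.block_size = block_size
        self.pad_id = pad_id

    def __iter__(self) -> Iterator[torch.Tensor]:
        buf: list[int] = []
        for seq in self.sequences:
            if buf and len(buf) + len(seq) > self.block_size:
                # Current block is as full as it can get; pad the remainder and emit.
                buf += [self.pad_id] * (self.block_size - len(buf))
                yield torch.tensor(buf)
                buf = []
            buf.extend(seq[: self.block_size])  # truncate oversize sequences
        if buf:  # flush the final, partially filled block
            buf += [self.pad_id] * (self.block_size - len(buf))
            yield torch.tensor(buf)
```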