Description
In TRL, sequence packing currently relies on basic concatenation or simple best-fit bin packing. These methods often lead to sub-optimal GPU utilization and complex attention-masking requirements.
OBFD (Overlapped Block-wise Fixed-size Dictionary) packing, as implemented in OLMo-core, offers a more deterministic and efficient way to pack variable-length sequences into fixed-size blocks (a minimal sketch of the core heuristic follows the list below). It would offer us:
- Minimized padding tokens across distributed ranks, ensuring every training block is "full."
- Improved TFLOPS and training throughput by maximizing compute density.
- The same strategy OLMo-3 uses to achieve significantly faster SFT speeds.
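To make the core idea concrete, here is a minimal Python sketch of a best-fit-decreasing packer that places variable-length tokenized sequences into fixed-size blocks. It illustrates the general heuristic only; the function name, `block_size`, and example data are placeholders, not OLMo-core's actual API.

```python
# Minimal sketch: pack variable-length sequences into fixed-size blocks
# using a best-fit-decreasing heuristic. Illustrative only; OLMo-core's
# actual OBFD implementation may differ.

def pack_sequences(sequences: list[list[int]], block_size: int) -> list[list[list[int]]]:
    """Return blocks, each a list of sequences whose total length <= block_size."""
    blocks: list[list[list[int]]] = []
    free: list[int] = []  # remaining capacity per block

    # Decreasing order: placing long sequences first leaves less stranded space.
    for seq in sorted(sequences, key=len, reverse=True):
        # Best fit: the block with the smallest remaining capacity
        # that can still hold this sequence.
        best = min(
            (i for i in range(len(blocks)) if free[i] >= len(seq)),
            key=lambda i: free[i],
            default=None,
        )
        if best is None:  # no existing block fits; open a new one
            blocks.append([seq])
            free.append(block_size - len(seq))
        else:
            blocks[best].append(seq)
            free[best] -= len(seq)
    return blocks

# Example: blocks of 8 tokens; padding is only whatever capacity remains.
packed = pack_sequences([[1] * 5, [2] * 3, [3] * 4, [4] * 2, [5] * 6], block_size=8)
```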
Implementation caveats:
a) TRL: The target is to pre-pack the data offline and save it in the pre-tokenized format that SFTTrainer accepts (see the first sketch below). Implementing the same packing for DPOTrainer, however, would be difficult without forking TRL.
b) TorchTitan: We can take a very similar pre-packing approach, or optionally pack dynamically within the SFT data-loading script (see the second sketch below). Either way, this depends on first implementing SFT support in TorchTitan.
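For (a), a rough sketch of the pre-packing route, reusing the `pack_sequences` helper from the sketch above. It leans on SFTTrainer accepting pre-tokenized data (an `input_ids` column); the model name, corpus, and output path are placeholders.

```python
from datasets import Dataset
from transformers import AutoTokenizer

from pack import pack_sequences  # hypothetical module holding the sketch above

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")  # example model
raw_texts = ["first example ...", "second example ..."]  # placeholder corpus

# Tokenize once, pack offline, and flatten each block into a single row.
tokenized = [tokenizer(text)["input_ids"] for text in raw_texts]
blocks = pack_sequences(tokenized, block_size=4096)
packed_ds = Dataset.from_dict(
    {"input_ids": [[tok for seq in block for tok in seq] for block in blocks]}
)

# With pre-tokenized rows, SFTTrainer skips its own tokenization/packing,
# so this dataset can later be loaded and passed as train_dataset directly.
packed_ds.save_to_disk("packed_sft_data")  # placeholder output path
```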
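For (b), a hypothetical sketch of the dynamic option: a streaming dataset that packs on the fly inside the data-loading script. Since a stream cannot be sorted globally, this degrades to a greedy fill rather than best-fit decreasing, and none of the names here correspond to actual TorchTitan APIs.

```python
from typing import Iterable, Iterator

import torch
from torch.utils.data import IterableDataset

class PackedIterableDataset(IterableDataset):
    """Hypothetical: streams tokenized sequences and yields fixed-size token blocks."""

    def __init__(self, sequences: Iterable[list[int]], block_size: int, pad_id: int = 0):
        self.sequences = sequences
        self.block_size = block_size
        self.pad_id = pad_id

    def __iter__(self) -> Iterator[torch.Tensor]:
        buf: list[int] = []
        for seq in self.sequences:
            if buf and len(buf) + len(seq) > self.block_size:
                # Current block is as full as it can get; pad the remainder and emit.
                buf += [self.pad_id] * (self.block_size - len(buf))
                yield torch.tensor(buf)
                buf = []
            buf.extend(seq[: self.block_size])  # truncate oversize sequences
        if buf:  # flush the final, partially filled block
            buf += [self.pad_id] * (self.block_size - len(buf))
            yield torch.tensor(buf)
```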