Add arrow reader and generator by wolfgang-desalvador · Pull Request #14 · mlcommons/DLIO_local_changes

wolfgang-desalvador · 2026-04-14T11:31:22Z

This pull request adds support for the Arrow IPC data format to the DLIO benchmark, enabling both data generation and reading using efficient, zero-copy memory-mapped Arrow files. The main changes include the implementation of an Arrow data generator and reader, updates to the configuration system to support Arrow-specific options, and integration with the generator and reader factories.

Arrow IPC format support:

Implemented ArrowGenerator in arrow_generator.py, supporting both legacy and schema-driven multi-column Arrow file generation, with efficient, batched, reproducible random data creation and true zero-copy output.
Added ArrowReader in arrow_reader.py, which uses memory-mapped Arrow IPC files for zero-copy, page-cached data access, including efficient sample lookup and page faulting to ensure data is loaded.

Configuration and factory integration:

Extended ConfigArguments and LoadConfig in config.py to support Arrow-specific options (arrow_columns, arrow_generation_batch_size) and to parse them from the config file. [1] [2]
Updated generator_factory.py and reader_factory.py to instantiate the Arrow generator and reader when the Arrow format is requested. [1] [2]
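As a rough sketch of how such options and factory dispatch could fit together: only the option names `arrow_columns` and `arrow_generation_batch_size` come from the PR description. The `FormatType` enum, the `dataset.arrow` config layout, the default values, and the placeholder reader classes below are assumptions for illustration and may not match DLIO's actual code.

```python
from dataclasses import dataclass, field
from enum import Enum


class FormatType(Enum):
    # Hypothetical enum; values are illustrative.
    PARQUET = "parquet"
    ARROW = "arrow"


@dataclass
class ConfigArguments:
    # Stand-in for the real class; defaults here are assumptions.
    format: FormatType = FormatType.PARQUET
    arrow_columns: list = field(default_factory=list)
    arrow_generation_batch_size: int = 1024


def load_config(args, config):
    """Copy Arrow-specific options from a parsed config dict (assumed layout)."""
    dataset = config.get("dataset", {})
    if "format" in dataset:
        args.format = FormatType(dataset["format"])
    arrow = dataset.get("arrow", {})
    if "columns" in arrow:
        args.arrow_columns = list(arrow["columns"])
    if "generation_batch_size" in arrow:
        args.arrow_generation_batch_size = int(arrow["generation_batch_size"])
    return args


class ParquetReader:
    """Placeholder reader used only to demonstrate dispatch."""

    def __init__(self, args):
        self.args = args


class ArrowReader:
    """Placeholder reader used only to demonstrate dispatch."""

    def __init__(self, args):
        self.args = args


_READERS = {FormatType.PARQUET: ParquetReader, FormatType.ARROW: ArrowReader}


def get_reader(args):
    """Return a reader instance for the configured format."""
    try:
        reader_cls = _READERS[args.format]
    except KeyError:
        raise ValueError(f"unsupported format: {args.format}") from None
    return reader_cls(args)
```

A table-driven factory like this keeps the format check in one place; a generator factory could follow the same pattern with its own mapping.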

wolfgang-desalvador · 2026-04-14T11:32:37Z

@russfellows if this is accepted after a wider discussion, we will also need object-storage readers.

@FileSystemGuy FYI

wolfgang-desalvador · 2026-04-14T11:48:27Z

The Parquet reader and the Arrow reader could potentially share many functions through an abstract base class. This is an enhancement to implement if the PR is accepted.
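One possible shape for that shared base class, sketched with the standard library only: the class and method names (`ColumnarReaderBase`, `open_batches`, `locate`) are hypothetical, and plain Python lists stand in for Parquet row groups or Arrow record batches. The idea is that index bookkeeping lives in the base class while each format supplies only its own file access.

```python
from abc import ABC, abstractmethod
from bisect import bisect_right


class ColumnarReaderBase(ABC):
    """Hypothetical base class: shared sample lookup, format-specific file access."""

    def __init__(self):
        self._batches = {}  # filename -> list of batches
        self._offsets = {}  # filename -> cumulative row count after each batch

    @abstractmethod
    def open_batches(self, filename):
        """Return the file's batches (row groups or record batches) as a list."""

    def open(self, filename):
        batches = self.open_batches(filename)
        total, offsets = 0, []
        for batch in batches:
            total += len(batch)
            offsets.append(total)
        self._batches[filename] = batches
        self._offsets[filename] = offsets

    def locate(self, filename, sample_index):
        """Map a global sample index to (batch index, row within that batch)."""
        offsets = self._offsets[filename]
        if sample_index < 0:
            raise IndexError("sample index must be non-negative")
        batch_idx = bisect_right(offsets, sample_index)
        if batch_idx >= len(offsets):
            raise IndexError("sample index out of range")
        start = offsets[batch_idx - 1] if batch_idx else 0
        return batch_idx, sample_index - start

    def get_sample(self, filename, sample_index):
        batch_idx, row = self.locate(filename, sample_index)
        return self._batches[filename][batch_idx][row]


class InMemoryReader(ColumnarReaderBase):
    """Toy subclass for demonstration; a real one would open Parquet or Arrow files."""

    def __init__(self, files):
        super().__init__()
        self._files = files

    def open_batches(self, filename):
        return self._files[filename]
```

With this split, a Parquet subclass and an Arrow subclass would differ only in `open_batches`, and the binary-search lookup would be written once.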

russfellows · 2026-04-14T17:48:40Z

Wolfgang,

Excellent. A lot of improvements certainly need to be made to the Parquet readers, and to readers in general. I agree that PyArrow could be included with Parquet, since Arrow is the in-memory representation.

I am hoping you could take a look at the reader I updated, to see whether your enhancements can be added on top of my modifications, or perhaps replace them. My only goal is to make the readers as efficient as possible.

I have a PR in progress, but it is stuck behind hundreds of pointless CI checks that are built into DLIO's code. Here is the link: #12

Can you take a look at the code changes in my PR and see whether they are complementary, or can work together? Again, if your changes are different and better, that is fine too.

Regards,
Russ

wolfgang-desalvador · 2026-04-14T21:37:16Z

Thank you, Russ. I looked at your PR, and I think these changes are complementary if we want to move to the Apache Arrow IPC format, which removes deserialization CPU time from the storage I/O path. Let me know if you would like me to integrate this into your branch.

If we decide to include this, I think we just need an S3 Arrow reader.
There is, however, the broader question of whether to stick with Parquet as the on-disk format or move to Arrow IPC. In both cases the in-memory representation is Arrow objects; what changes is the deserialization burden.

Add arrow reader and generator

47d0df2

wolfgang-desalvador requested a review from a team April 14, 2026 11:31

wolfgang-desalvador added the do-not-merge label Apr 14, 2026

This was referenced Apr 14, 2026:

mlpstorage training run --model=flux has considerably lower I/O Throughput causing train_au_meet_expectation to fail (mlcommons/storage#330, open)
[Proposal] Evaluate Arrow IPC format for reader (mlcommons/storage#333, open)

wolfgang-desalvador wants to merge 1 commit into main from wdesalvador/integrate-arrow-reader
