Add arrow reader and generator #14

Open
wolfgang-desalvador wants to merge 1 commit into main from wdesalvador/integrate-arrow-reader

Conversation

@wolfgang-desalvador

This pull request adds support for the Arrow IPC data format to the DLIO benchmark, enabling both data generation and reading using efficient, zero-copy memory-mapped Arrow files. The main changes include the implementation of an Arrow data generator and reader, updates to the configuration system to support Arrow-specific options, and integration with the generator and reader factories.

Arrow IPC format support:

  • Implemented ArrowGenerator in arrow_generator.py, supporting both legacy and schema-driven multi-column Arrow file generation, with efficient, batched, reproducible random data creation and true zero-copy output.
  • Added ArrowReader in arrow_reader.py, which uses memory-mapped Arrow IPC files for zero-copy, page-cached data access, including efficient sample lookup and page faulting to ensure data is loaded.

Configuration and factory integration:

  • Extended ConfigArguments and LoadConfig in config.py to support Arrow-specific options (arrow_columns, arrow_generation_batch_size) and to parse them from the config file.
  • Updated generator_factory.py and reader_factory.py to instantiate the Arrow generator and reader when the Arrow format is requested.
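The configuration and factory wiring described above could look roughly like the following sketch. Only the option names `arrow_columns` and `arrow_generation_batch_size` come from this PR description. Everything else is hypothetical: the `dataset` section they are read from, the default batch size of 1024, the shape of each column entry, and the `FormatType`, `ArrowOptions`, and stub class names.

```python
from dataclasses import dataclass, field
from enum import Enum


class FormatType(Enum):
    """Hypothetical subset of supported on-disk formats."""
    PARQUET = "parquet"
    ARROW = "arrow"


@dataclass
class ArrowOptions:
    # Option names are from the PR description; types and defaults are assumed.
    arrow_columns: list = field(default_factory=list)
    arrow_generation_batch_size: int = 1024


def load_arrow_options(config: dict) -> ArrowOptions:
    """Parse Arrow options from an already-loaded config mapping."""
    dataset = config.get("dataset", {})  # assumed location of the options
    return ArrowOptions(
        arrow_columns=list(dataset.get("arrow_columns", [])),
        arrow_generation_batch_size=int(
            dataset.get("arrow_generation_batch_size", 1024)
        ),
    )


class ArrowGeneratorStub:
    """Stand-in for the real Arrow generator class."""
    def __init__(self, options: ArrowOptions):
        self.options = options


class ParquetGeneratorStub:
    """Stand-in for an existing generator class."""
    def __init__(self, options: ArrowOptions):
        self.options = options


def get_generator(fmt: FormatType, options: ArrowOptions):
    """Factory dispatch: pick a generator class from the requested format."""
    if fmt is FormatType.ARROW:
        return ArrowGeneratorStub(options)
    if fmt is FormatType.PARQUET:
        return ParquetGeneratorStub(options)
    raise ValueError(f"Unsupported format: {fmt}")


config = {
    "dataset": {
        "arrow_columns": [{"name": "data", "dtype": "uint8", "size": 1024}],
        "arrow_generation_batch_size": 256,
    }
}
options = load_arrow_options(config)
generator = get_generator(FormatType.ARROW, options)
print(type(generator).__name__, options.arrow_generation_batch_size)
```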

@wolfgang-desalvador
Author

@russfellows if this is accepted after a wider discussion, we will also need object-storage readers

@FileSystemGuy FYI

@wolfgang-desalvador
Author

Potentially, the Parquet reader and the Arrow reader could share much of their logic through a common abstract base class. This would be a follow-up enhancement if this PR is accepted.
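A rough sketch of what such a shared base class might look like is shown below. This is an illustration only, not code from this PR: `ColumnarReaderBase`, its method names, and the toy `InMemoryReader` subclass are hypothetical. The idea is that the sample-to-chunk lookup (row groups for Parquet, record batches for Arrow IPC) lives in the base class, while each format implements only how a file is opened and how one chunk is read.

```python
from abc import ABC, abstractmethod
from bisect import bisect_right


class ColumnarReaderBase(ABC):
    """Shared sample lookup for formats that store rows in chunks."""

    def __init__(self):
        self._files = {}  # filename -> (handle, cumulative row counts)

    @abstractmethod
    def _open(self, filename):
        """Return (handle, list of row counts per chunk) for one file."""

    @abstractmethod
    def _read_chunk(self, handle, chunk_index):
        """Return an indexable sequence holding the rows of one chunk."""

    def open(self, filename):
        handle, rows_per_chunk = self._open(filename)
        cumulative, total = [], 0
        for rows in rows_per_chunk:
            total += rows
            cumulative.append(total)
        self._files[filename] = (handle, cumulative)

    def get_sample(self, filename, sample_index):
        handle, cumulative = self._files[filename]
        total = cumulative[-1] if cumulative else 0
        if not 0 <= sample_index < total:
            raise IndexError(f"sample {sample_index} out of range")
        # Binary search for the chunk that contains this sample.
        chunk = bisect_right(cumulative, sample_index)
        start = cumulative[chunk - 1] if chunk else 0
        return self._read_chunk(handle, chunk)[sample_index - start]


class InMemoryReader(ColumnarReaderBase):
    """Toy subclass standing in for a Parquet or Arrow implementation."""

    def __init__(self, files):
        super().__init__()
        self._data = files  # filename -> list of chunks (lists of rows)

    def _open(self, filename):
        chunks = self._data[filename]
        return chunks, [len(chunk) for chunk in chunks]

    def _read_chunk(self, handle, chunk_index):
        return handle[chunk_index]


reader = InMemoryReader({"f": [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]})
reader.open("f")
print(reader.get_sample("f", 4), reader.get_sample("f", 9))
```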

@russfellows

russfellows commented Apr 14, 2026 via email

Wolfgang,

Excellent. A lot of improvements certainly need to be made to the Parquet readers, and to readers in general. I agree that PyArrow could be grouped with Parquet, since Arrow is the in-memory representation.

I am hoping you could take a look at the reader I updated, to see if your enhancements can be added on top of my modifications, or perhaps replace or supplant them. My only goal is to make the readers as efficient as possible. I have a PR in progress, but it is stuck behind hundreds of pointless CI checks that are built into DLIO's code. Here is the link: #12

Can you take a look at the code changes in my PR and see if they are complementary, or can work together? Again, if your changes are different and better, that is fine too.

Regards,
Russ

On Apr 14, 2026, at 5:33 AM, Wolfgang De Salvador wrote (mlcommons/DLIO_local_changes#14):

> @russfellows if this will be accepted in a wider discussion, we will need object readers
> @FileSystemGuy FYI

@wolfgang-desalvador
Author

wolfgang-desalvador commented Apr 14, 2026

Thank you Russ. I looked at your PR, and I think these changes are complementary if we want to adopt the Apache Arrow IPC format, which removes deserialization CPU time from the storage I/O path. Let me know if you would like me to integrate this into your branch anyway.

If we decide to include this, I think the only missing piece is an S3 Arrow reader.
There is also the broader question of whether to keep Parquet as the on-disk format or move to Arrow IPC. In both cases the in-memory representation is Arrow objects. What changes is the deserialization burden.
