@russfellows if this will be accepted in a wider discussion, we will need object readers @FileSystemGuy FYI |
|
Potentially, the Parquet reader and the Arrow reader could share many functions through an abstract base class. This is an enhancement to be implemented if accepted. |
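As a rough illustration of the abstract-class idea, here is a minimal sketch. All names in it (`ColumnarReader`, `FakeParquetReader`, `FakeArrowReader`, `_load`) are hypothetical and are not taken from DLIO; the format-specific step is stubbed out so that only the shared logic is shown.

```python
from abc import ABC, abstractmethod


class ColumnarReader(ABC):
    """Hypothetical shared base class; not DLIO's actual reader interface."""

    def __init__(self, path):
        self.path = path
        self._samples = None  # loaded lazily on first access

    @abstractmethod
    def _load(self, path):
        """Format-specific step: return a list of samples for `path`."""

    def open(self):
        # Shared logic: load once, then reuse the cached result.
        if self._samples is None:
            self._samples = self._load(self.path)
        return self._samples

    def num_samples(self):
        return len(self.open())

    def get_sample(self, index):
        # Shared logic: bounds-checked sample lookup.
        samples = self.open()
        if not 0 <= index < len(samples):
            raise IndexError(f"sample {index} out of range")
        return samples[index]


class FakeParquetReader(ColumnarReader):
    def _load(self, path):
        # Stand-in for a Parquet decode step (assumption, no real I/O).
        return [f"parquet:{path}:{i}" for i in range(3)]


class FakeArrowReader(ColumnarReader):
    def _load(self, path):
        # Stand-in for mapping an Arrow IPC file (assumption, no real I/O).
        return [f"arrow:{path}:{i}" for i in range(3)]
```

With this shape, only `_load` differs per format, while caching, counting, and lookup live in one place.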
|
Wolfgang,
Excellent. A lot of improvements certainly need to be made to the parquet readers, and readers in general.
I agree that PyArrow could be included with Parquet, since that is the in-memory representation.
I am hoping you could take a look at the reader I updated, to see if your enhancements can be added on top of my modifications, or perhaps replace or supplant them. My only goal is to make the readers as efficient as possible.
I have a PR in progress, but it is stuck behind hundreds of pointless CI checks that are built into DLIO's code. Here is the link: #12
Can you take a look at the code changes in my PR and see if they are complementary, or can work together? Again, if your changes are different and better, that is fine too.
Regards,
—Russ
… On Apr 14, 2026, at 5:33 AM, Wolfgang De Salvador wrote:
wolfgang-desalvador left a comment (mlcommons/DLIO_local_changes#14)
@russfellows if this will be accepted in a wider discussion, we will need object readers
@FileSystemGuy FYI
|
Thank you, Russ. I looked at your PR, and I think these changes are complementary if we want to go with the Apache Arrow IPC format, which removes deserialization CPU time from the storage I/O path. Let me know if you want me to integrate this into your branch. If we decide to include this, I think we just need an S3 Arrow reader. |
This pull request adds support for the Arrow IPC data format to the DLIO benchmark, enabling both data generation and reading using efficient, zero-copy memory-mapped Arrow files. The main changes include the implementation of an Arrow data generator and reader, updates to the configuration system to support Arrow-specific options, and integration with the generator and reader factories.
Arrow IPC format support:
- Added ArrowGenerator in arrow_generator.py, supporting both legacy and schema-driven multi-column Arrow file generation, with efficient, batched, reproducible random data creation and true zero-copy output.
- Added ArrowReader in arrow_reader.py, which uses memory-mapped Arrow IPC files for zero-copy, page-cached data access, including efficient sample lookup and page faulting to ensure data is loaded.
Configuration and factory integration:
- Updated ConfigArguments and LoadConfig in config.py to support Arrow-specific options (arrow_columns, arrow_generation_batch_size) and to parse them from the config file.
- Updated generator_factory.py and reader_factory.py to instantiate the Arrow generator and reader when the Arrow format is requested.
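The configuration and factory wiring described above could look roughly like the following sketch. Only the option names `arrow_columns` and `arrow_generation_batch_size` come from the summary; the default values, the layout of `arrow_columns`, and every class and function name here are assumptions, not DLIO's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class ArrowOptions:
    # Option names are from the PR summary; defaults are assumptions.
    arrow_columns: list = field(default_factory=list)
    arrow_generation_batch_size: int = 1024


def load_arrow_options(config):
    """Read Arrow options from a parsed config mapping, keeping defaults."""
    opts = ArrowOptions()
    if "arrow_columns" in config:
        opts.arrow_columns = list(config["arrow_columns"])
    if "arrow_generation_batch_size" in config:
        size = int(config["arrow_generation_batch_size"])
        if size <= 0:
            raise ValueError("arrow_generation_batch_size must be positive")
        opts.arrow_generation_batch_size = size
    return opts


class ParquetReaderStub:
    # Placeholder for an existing reader (hypothetical).
    def __init__(self, options):
        self.options = options


class ArrowReaderStub:
    # Placeholder for the new Arrow reader (hypothetical).
    def __init__(self, options):
        self.options = options


# Registry keyed by format name; the factory dispatches on it.
READERS = {"parquet": ParquetReaderStub, "arrow": ArrowReaderStub}


def get_reader(fmt, options):
    try:
        reader_cls = READERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    return reader_cls(options)
```

For example, `get_reader("arrow", load_arrow_options({"arrow_generation_batch_size": 4096}))` returns an `ArrowReaderStub` carrying the parsed options, and an unknown format name raises `ValueError`.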