When pyiceberg loads Iceberg tables containing large JSON data into memory, memory usage explodes in pyarrow #3168

@thomas-pfeiffer

Description

Apache Iceberg version

0.11.0 (latest release)

Please describe the bug 🐞

We have a table with a column containing large JSON strings (think of large pydantic validation results). The resulting data file in Iceberg is roughly a 6 MB (gzip-compressed) Parquet file, but querying it drives memory consumption up to ~4 GB. (Only for a few seconds, but long enough to cause out-of-memory issues on some systems.)

The issue is that by default pyarrow materializes the strings row by row in memory, which blows up memory usage. The behaviour can be reproduced by downloading the data file and opening it directly with pyarrow.

There are two workarounds in pyarrow: 1. don't load the problematic column (if your use case allows it), and 2. switch to dictionary encoding for that column (example snippet below).

from pyarrow.parquet import read_table
table = read_table("datafile.parquet", read_dictionary=["problematic_column"])

Issue in pyiceberg:
Regardless of whether you use table.scan(...).to_arrow() or table.scan(...).to_arrow_batch_reader(), pyiceberg currently has, as far as I know, no option to specify dictionary encoding for certain columns, so pyarrow uses the default encoding and memory usage explodes.

The to_arrow_batch_reader does not help here either because, as far as I understand, each batch in pyiceberg's batch reader represents an individual data file. So if there is one problematic 6 MB data file, using the batch reader makes no difference. I also have the impression that by the time you iterate over the reader, pyarrow has already loaded the Parquet file in a separate thread, and this is where the memory explosion actually happens.

So the only current workaround in pyiceberg is option 1: don't load the problematic column, by specifying selected_fields:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("your_table")
reader = table.scan(selected_fields=("all", "other", "columns")).to_arrow_batch_reader()
for batch in reader:
    ...

Expected behaviour:
There should be an option somewhere, e.g. in the data scan, to specify which columns should be read with dictionary encoding. This option should be forwarded internally to pyarrow so that pyarrow uses less memory.

Remark:
I would not change the default behaviour; it would just be good to be able to configure the pyarrow encoding when needed.

This issue is a follow-up to #1205

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
