When pyiceberg loads Iceberg tables containing large JSON data into memory, memory usage explodes in pyarrow #3168

@thomas-pfeiffer

Description

Apache Iceberg version

0.11.0 (latest release)

Please describe the bug 🐞

We have a table with a column containing large JSON strings (think of large pydantic validation results). The resulting data file in Iceberg is roughly a 6 MB (gzip-compressed) Parquet file, but querying it drives memory consumption up to ~4 GB. (Only for a few seconds, but long enough to cause out-of-memory issues on some systems.)

The issue is that by default pyarrow materializes the strings row by row in memory, which blows up memory usage. The behaviour can be reproduced by downloading the data file and opening it directly with pyarrow.

There are two workarounds in pyarrow: 1. don't load the problematic column (if your use case allows it), and 2. switch to dictionary encoding for that column (example snippet below).

from pyarrow.parquet import read_table
table = read_table("datafile.parquet", read_dictionary=["problematic_column"])

Issue in pyiceberg:
Regardless of whether you use table.scan(...).to_arrow() or table.scan(...).to_arrow_batch_reader(), pyiceberg currently has, as far as I know, no option to specify dictionary encoding for certain columns, so pyarrow uses the default encoding and memory usage explodes.

The to_arrow_batch_reader does not help here either because, as far as I understand, each batch in pyiceberg's batch reader represents an individual data file. So if there is one problematic 6 MB data file, using the batch reader makes no difference. I also have the impression that by the time you iterate over the reader, pyarrow has already loaded the Parquet file in a separate thread, and this is where the memory explosion actually happens.

So the only current workaround in pyiceberg is option 1: don't load the problematic column, by specifying selected_fields:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("your_table")
reader = table.scan(selected_fields=("all", "other", "columns")).to_arrow_batch_reader()
for batch in reader:
    ...

Expected behaviour:
There should be an option somewhere, e.g. in the data scan, to specify which columns should be read with dictionary encoding. This option should be forwarded internally to pyarrow so that pyarrow uses less memory.

Remark:
I would not change the default behaviour; it would just be good to be able to configure the pyarrow encoding when needed.

This issue is a follow-up to #1205

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
