Chunking and rechunking functionality for large datasets #475

Closed
hombit wants to merge 5 commits into main from rechunking

@hombit
Collaborator

@hombit hombit commented Mar 24, 2026

This is an alternative approach to supporting large nested arrays with more than 2**31 nested values; see #462 for another approach. It has two goals: 1) prevent offset overflow, and 2) use re-chunking to prevent the appearance of small chunks and memory borrowing. In contrast to #462, this is not a breaking change, but it involves many trade-offs and tunable hyperparameters that have not been validated for best performance.
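
To illustrate the overflow constraint, here is a minimal sketch rather than the code from this PR. It assumes list offsets are stored as int32, so a single chunk can address at most 2**31 - 1 flat values, and it splits rows greedily so that no chunk exceeds that budget. `chunk_row_boundaries` and `max_values` are hypothetical names introduced only for this example.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2**31 - 1, the largest int32 offset


def chunk_row_boundaries(list_lengths, max_values=INT32_MAX):
    """Return row indices that split the lists into chunks.

    Hypothetical helper for illustration only. Each chunk holds at most
    ``max_values`` flat values, so its offsets fit the assumed int32 range.
    """
    lengths = np.asarray(list_lengths, dtype=np.int64)
    boundaries = [0]
    running = 0  # flat values accumulated in the current chunk
    for row, n in enumerate(lengths):
        n = int(n)
        if n > max_values:
            raise ValueError(f"row {row} alone exceeds max_values={max_values}")
        if running + n > max_values:
            # Start a new chunk at this row.
            boundaries.append(row)
            running = 0
        running += n
    boundaries.append(len(lengths))
    return boundaries


# Three rows of 2**30 values each: the total does not fit int32 offsets.
lengths = np.array([2**30, 2**30, 2**30], dtype=np.int64)
offsets64 = np.concatenate(([0], np.cumsum(lengths)))
fits_int32 = offsets64[-1] <= INT32_MAX

boundaries = chunk_row_boundaries(lengths)
```

With these inputs each row lands in its own chunk, because any two rows together would already reach 2**31 values. A greedy split like this can still leave a small trailing chunk, which is one of the trade-offs mentioned above.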

Closes #95

hombit added 5 commits March 6, 2026 16:39
- accessor.py: use list_lengths directly instead of np.diff(list_offsets)
- ext_array.py: remove __getstate__ (default pickle now preserves chunks)
- packer.py: view_sorted_series_as_list_array now produces properly chunked
  output using compute_chunk_boundaries instead of one giant ListArray;
  calculate_sorted_index_offsets returns int64 to avoid overflow; boundaries
  computed once in view_sorted_df_as_list_arrays and shared across columns;
  view_sorted_series_as_list_array uses keyword-only args with * separator
- packer.py: pack_lists passes explicit struct_type to pa.chunked_array so
  empty DataFrames (0-row) no longer raise ArrowInvalid
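
For readers unfamiliar with the relation between a sorted index, list offsets, and list lengths that these commits touch, here is a small sketch. It is an assumption-laden illustration, not the implementation in packer.py: `sorted_index_offsets` is a hypothetical stand-in that finds the run boundaries of an already sorted index and returns them as int64, so the offsets themselves cannot overflow a 32-bit integer.

```python
import numpy as np


def sorted_index_offsets(index):
    """Offsets of runs of equal values in an already sorted index (int64).

    Hypothetical stand-in for illustration; not the function from this PR.
    """
    index = np.asarray(index)
    if index.size == 0:
        # No rows at all: a single zero offset and no lists.
        return np.zeros(1, dtype=np.int64)
    # Positions where the value differs from its predecessor start a new list.
    starts = np.flatnonzero(index[1:] != index[:-1]) + 1
    return np.concatenate(([0], starts, [index.size])).astype(np.int64)


index = np.array([0, 0, 1, 1, 1, 3])
offsets = sorted_index_offsets(index)

# Lengths are the differences of consecutive offsets. If lengths are already
# available, using them directly avoids recomputing this difference.
list_lengths = np.diff(offsets)
```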
@github-actions

Pandas Nightly Test Results (Python 3.11)

486 tests  +27   469 ✅ +27   22s ⏱️ +5s
  1 suites ± 0     0 💤 ± 0 
  1 files   ± 0    17 ❌ ± 0 

For more details on these failures, see this check.

Results for commit c8a284a. ± Comparison against base commit 509562f.

@codecov

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 96.96970% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.99%. Comparing base (32a90df) to head (c8a284a).
⚠️ Report is 23 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/nested_pandas/series/packer.py | 89.28% | 3 Missing ⚠️ |
| src/nested_pandas/series/ext_array.py | 98.59% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #475      +/-   ##
==========================================
- Coverage   97.30%   95.99%   -1.32%     
==========================================
  Files          19       20       +1     
  Lines        2156     2347     +191     
==========================================
+ Hits         2098     2253     +155     
- Misses         58       94      +36     

@github-actions

| Before [509562f] | After [db44e38] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 10.6±0.1ms | 10.9±0.05ms | 1.03 | benchmarks.NestedFrameQuery.time_run |
| 103M | 105M | 1.02 | benchmarks.NestedFrameAddNested.peakmem_run |
| 108M | 111M | 1.02 | benchmarks.NestedFrameQuery.peakmem_run |
| 107M | 109M | 1.02 | benchmarks.NestedFrameReduce.peakmem_run |
| 256M | 258M | 1.01 | benchmarks.AssignSingleDfToNestedSeries.peakmem_run |
| 136M | 138M | 1.01 | benchmarks.CountNestedBy.peakmem_run |
| 10.9±0.3ms | 11.1±0.2ms | 1.01 | benchmarks.NestedFrameAddNested.time_run |
| 1.21G | 1.22G | 1.01 | benchmarks.ReadFewColumnsS3.peakmem_run |
| 1.25±0.01ms | 1.23±0ms | 0.99 | benchmarks.NestedFrameReduce.time_run |
| 66.6±0.8ms | 65.5±0.5ms | 0.98 | benchmarks.CountNestedBy.time_run |

@hombit hombit requested a review from dougbrn March 24, 2026 20:48
@hombit
Collaborator Author

hombit commented Apr 7, 2026

We chose the approach implemented in #462.

@hombit hombit closed this Apr 7, 2026

Development

Successfully merging this pull request may close these issues.

Handle series with more than 2^31 "flat" elements