
Add NestedFrame.split() for splitting nested columns by categorical value #473

Merged
dougbrn merged 2 commits into lincc-frameworks:main from Ebraam-Ashraf:feat/add-nested-frame-split on Apr 9, 2026

@Ebraam-Ashraf
Contributor

Change Description

Closes #470

Solution Description

Adds split() as a NestedFrame method rather than a NestedSeries method, since NestedFrame already handles nested column management, which leads to a cleaner workflow.

nf = nf.split("nested", by="band") 

Parameters

  • nested_col: the nested column to split
  • by: the sub-column to split on
  • values: controls which values to split on:
    • None (default): uses all unique values found in the column
    • list e.g. values=['r', 'g'] splits on a specific subset only
    • str e.g. values='rg': iterated as characters, same as ['r', 'g']
    • empty list [] or empty string '': no new columns created
    • values not present in the data: column is created but all-NaN (query matches nothing, no error raised)
  • drop_by_col: if True, removes the splitting sub-column from each new nested column
  • drop_nested: if True, removes the original nested column from the result
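The handling of `values` described above can be sketched in plain Python. This is only an illustration, not the code merged in this PR: `resolve_split_values` and `split_column_names` are hypothetical helpers, the first-seen ordering of unique values is an assumption, and the `<nested_col>_<value>` naming is inferred from the `nested_r` and `nested_g` columns mentioned in the usage example.

```python
def resolve_split_values(observed, values=None):
    """Return the list of values to split on (hypothetical helper)."""
    if values is None:
        # Default: every unique value present (first-seen order is assumed here).
        return list(dict.fromkeys(observed))
    # A str is iterated character by character, so 'rg' behaves like ['r', 'g'].
    # An empty list or empty string yields no values, hence no new columns.
    return list(values)


def split_column_names(nested_col, values):
    """Build one output column name per value (naming scheme is inferred)."""
    return [f"{nested_col}_{v}" for v in values]


bands = ["r", "g", "g", "r", "r"]
print(split_column_names("nested", resolve_split_values(bands)))         # ['nested_r', 'nested_g']
print(split_column_names("nested", resolve_split_values(bands, "rg")))   # ['nested_r', 'nested_g']
print(split_column_names("nested", resolve_split_values(bands, [])))     # []
print(split_column_names("nested", resolve_split_values(bands, ["z"])))  # ['nested_z']
```

The last call mirrors the "values not present in the data" case: a name is still produced for `'z'`, which in the real method corresponds to a column that is created but contains no matches.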

Usage (to be documented)

from nested_pandas.datasets import generate_data

nf = generate_data(5, 5, seed=1)

# basic split
nf = nf.split("nested", by="band")  # creates nested_r, nested_g

# split on a subset of values only
nf.split("nested", by="band", values=["r"])

# drop the splitting sub-column from each result
nf.split("nested", by="band", drop_by_col=True)

# drop the original nested column after splitting
nf.split("nested", by="band", drop_nested=True)
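To make the effect of these calls concrete, here is a toy model in plain Python, with each top-level row holding a list of nested records as dicts. It is a sketch under stated assumptions, not the nested-pandas implementation: `split_nested` is a hypothetical function, and empty groups appear as empty lists here, whereas the PR describes all-NaN (or None) entries for values that match nothing.

```python
def split_nested(rows, by, values=None, drop_by_col=False):
    """Split each row's nested records into one group per value of `by`.

    `rows` is a list in which each element is the list of nested records
    (dicts) belonging to one top-level row. Hypothetical toy model.
    """
    if values is None:
        # Collect every unique value of `by` across all rows.
        values = list(dict.fromkeys(rec[by] for recs in rows for rec in recs))
    result = []
    for recs in rows:
        groups = {}
        for v in values:
            matched = [rec for rec in recs if rec[by] == v]
            if drop_by_col:
                # Build new dicts without the splitting key; the input is not modified.
                matched = [{k: x for k, x in rec.items() if k != by} for rec in matched]
            groups[v] = matched
        result.append(groups)
    return result


rows = [
    [{"band": "r", "flux": 1.0}, {"band": "g", "flux": 2.0}],
    [{"band": "g", "flux": 3.0}],
]
out = split_nested(rows, by="band", drop_by_col=True)
print(out[0])  # {'r': [{'flux': 1.0}], 'g': [{'flux': 2.0}]}
print(out[1])  # {'r': [], 'g': [{'flux': 3.0}]}
```

Every row gets the same set of groups even when a value has no records in that row, which is the toy analogue of the output columns staying consistent across rows. In this model, `drop_nested=True` would simply correspond to discarding `rows` and keeping only `out`.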

Tests

Three test functions added to test_nestedframe.py:

  • test_split: covers all behavior: basic split, filtering correctness, all values= cases (None, list subset, string as chars, empty list, empty string, missing/mixed-type values), drop_by_col=True, drop_nested=True, both drop options together, and immutability of the original frame
  • test_split_errors: covers ValueError for invalid nested_col and invalid by sub-column
  • test_split_empty_frame: ensures an empty NestedFrame is handled correctly and returns an empty NestedFrame

Code Quality

  • I have read the Contribution Guide and agree to the Code of Conduct
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

@Ebraam-Ashraf
Contributor Author

@delucchi-cmu @dougbrn

@delucchi-cmu
Contributor

Is this PR intended as a submission for Google Summer of Code? If so, see relevant notes from our guidelines:

Title your pull request "DONT-MERGE GSOC26: <title>", make it "draft".

I also want to make you aware that we will begin to review GSOC pull requests the week of March 23, 2026.

@hombit
Collaborator

hombit commented Mar 19, 2026

@delucchi-cmu, @Ebraam-Ashraf has already submitted the GSoC26-related PR (astronomy-commons/hats#648), which was approved. Since we have a single-PR limit for GSoC26, we should consider this PR as a normal open-source contribution.

Collaborator

@dougbrn dougbrn left a comment

Thanks @Ebraam-Ashraf, this is a really nice first implementation, and I appreciate the work spent on the docstring and the unit test suite. I have one primary suggestion, which I have split into two comments.

Comment thread src/nested_pandas/nestedframe/core.py Outdated
Comment thread src/nested_pandas/nestedframe/core.py Outdated
@github-actions

github-actions bot commented Mar 19, 2026

| Before [509562f] | After [a9e2209] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 669±40ms | 1.03±0.3s | 1.53 | benchmarks.ReadFewColumnsS3.time_run |
| 1.08±0.2s | 1.11±0.2s | 1.03 | benchmarks.ReadFewColumnsHTTPS.time_run |
| 1.2G | 1.23G | 1.02 | benchmarks.ReadFewColumnsS3.peakmem_run |
| 28.6±1ms | 28.9±2ms | 1.01 | benchmarks.AssignSingleDfToNestedSeries.time_run |
| 65.7±0.4ms | 66.7±0.3ms | 1.01 | benchmarks.CountNestedBy.time_run |
| 263M | 266M | 1.01 | benchmarks.ReassignHalfOfNestedSeries.peakmem_run |
| 256M | 255M | 1 | benchmarks.AssignSingleDfToNestedSeries.peakmem_run |
| 137M | 136M | 0.99 | benchmarks.CountNestedBy.peakmem_run |
| 109M | 107M | 0.99 | benchmarks.NestedFrameReduce.peakmem_run |
| 105M | 103M | 0.98 | benchmarks.NestedFrameAddNested.peakmem_run |

@codecov

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 90.32258% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.96%. Comparing base (509562f) to head (beacd79).
⚠️ Report is 10 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/nested_pandas/nestedframe/core.py | 90.32% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #473      +/-   ##
==========================================
- Coverage   96.03%   95.96%   -0.08%     
==========================================
  Files          20       20              
  Lines        2247     2278      +31     
==========================================
+ Hits         2158     2186      +28     
- Misses         89       92       +3     
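The percentages in this report follow from the line counts in the diff above. As a quick check, assuming coverage is simply hits divided by lines (the `pct` helper is only for illustration):

```python
def pct(hits, lines):
    """Coverage as a percentage of covered lines."""
    return 100 * hits / lines


# Patch: 31 lines added, 3 of them missed, so 28 covered.
print(f"patch:   {pct(31 - 3, 31):.2f}%")   # patch:   90.32%
# Head of the PR: 2186 hits out of 2278 lines.
print(f"project: {pct(2186, 2278):.2f}%")   # project: 95.96%
```

The base figure computes the same way from 2158 hits out of 2247 lines, which comes to roughly 96.04%; the report shows 96.03%, presumably because it truncates rather than rounds.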

@Ebraam-Ashraf Ebraam-Ashraf requested a review from dougbrn March 25, 2026 14:12
@Ebraam-Ashraf
Contributor Author

Updated split() to handle empty frames and to produce a consistent set of columns when values is provided, including outputs that are empty (None).
Added tests to cover these cases.

@dougbrn

@Ebraam-Ashraf
Contributor Author

Hi @delucchi-cmu,
sorry for the delay.
This PR is not for GSoC; I was interested in the feature and wanted to contribute.

thanks @hombit for clarification <3

Collaborator

@dougbrn dougbrn left a comment

Nice, this looks good!

@dougbrn dougbrn merged commit e699e7b into lincc-frameworks:main Apr 9, 2026
11 of 13 checks passed

Development

Successfully merging this pull request may close these issues.

Add Helper Function to split a Nested Column by a categorical column value

4 participants