Skip to content

Add NestedPandasDataset#868

Open
mtauraso wants to merge 10 commits intomainfrom
codex/create-nested-pandas-dataset-class-for-hyrax
Open

Add NestedPandasDataset#868
mtauraso wants to merge 10 commits intomainfrom
codex/create-nested-pandas-dataset-class-for-hyrax

Conversation

@mtauraso
Copy link
Copy Markdown
Collaborator

@mtauraso mtauraso commented Apr 9, 2026

Motivation

  • Provide a lightweight Hyrax dataset that can read nested Parquet files produced by nested-pandas and expose nested columns via dot-notation for downstream inference and examples.

Description

  • Add NestedPandasDataset which wraps nested_pandas.read_parquet, supports data_location, read_kwargs, max_samples, and dynamically registers get_<field> accessors for nested and top-level columns.
  • Export NestedPandasDataset from hyrax.datasets and add a default config block in hyrax_default_config.toml for read_kwargs and max_samples.
  • Add nested-pandas to pyproject.toml dependencies and include an example notebook docs/notebooks/nested_pandas_time_series_with_hyrax.ipynb demonstrating usage with nested light-curve data.
  • Add unit tests tests/hyrax/test_nested_pandas_dataset.py covering reading, nested field getters, read_kwargs, max_samples, and required data_location behavior.

Testing

  • The PR adds pytest unit tests in tests/hyrax/test_nested_pandas_dataset.py which create a NestedFrame, write a parquet file, instantiate NestedPandasDataset, and assert expected getters and length.
  • No automated test run was executed as part of this rollout.

Codex Task

Copilot AI review requested due to automatic review settings April 9, 2026 19:00
@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 94.73684% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.71%. Comparing base (a6f1a43) to head (38a6a34).

Files with missing lines Patch % Lines
src/hyrax/datasets/nested_pandas_dataset.py 94.59% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #868      +/-   ##
==========================================
+ Coverage   66.55%   66.71%   +0.16%     
==========================================
  Files          62       63       +1     
  Lines        6515     6553      +38     
==========================================
+ Hits         4336     4372      +36     
- Misses       2179     2181       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds first-class support for reading nested-pandas Parquet datasets in Hyrax, including configuration, tests, and an example notebook.

Changes:

  • Introduces NestedPandasDataset, a HyraxDataset wrapper around nested_pandas.read_parquet with dynamic get_<field> accessors (including nested dot-notation fields).
  • Adds default config entries for NestedPandasDataset (read_kwargs, max_samples) and exports the dataset from hyrax.datasets.
  • Adds nested-pandas as a dependency plus unit tests and a documentation notebook demonstrating nested time-series usage.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/hyrax/datasets/nested_pandas_dataset.py Implements NestedPandasDataset and dynamic getters for top-level and nested fields.
src/hyrax/datasets/__init__.py Exposes NestedPandasDataset via package exports (and triggers import-time registry population).
src/hyrax/hyrax_default_config.toml Adds default config block for data_set.NestedPandasDataset (read_kwargs, max_samples).
pyproject.toml Adds nested-pandas runtime dependency.
tests/hyrax/test_nested_pandas_dataset.py Adds unit tests covering read, nested getters, read_kwargs, max_samples, and required data_location.
docs/notebooks/nested_pandas_time_series_with_hyrax.ipynb Adds an example notebook showing end-to-end usage with nested light-curve data.

Comment thread src/hyrax/datasets/nested_pandas_dataset.py Outdated
Comment thread src/hyrax/hyrax_default_config.toml Outdated
@mtauraso mtauraso marked this pull request as draft April 9, 2026 19:15
@mtauraso
Copy link
Copy Markdown
Collaborator Author

mtauraso commented Apr 9, 2026

@copilot Can you also fix all the linter errors. Ruff is having some issues with this change.

@mtauraso mtauraso self-assigned this Apr 9, 2026
@mtauraso mtauraso changed the title Add NestedPandasDataset wrapper, config, tests, and example notebook Add NestedPandasDataset Apr 9, 2026
Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 9, 2026

@copilot Can you also fix all the linter errors. Ruff is having some issues with this change.

Fixed in bd5869e. Added missing docstrings to all three test functions (D103), removed the unused typing.Any import, and the ruff-format issue is resolved.

@mtauraso mtauraso marked this pull request as ready for review April 9, 2026 19:43
@mtauraso mtauraso requested review from dougbrn and drewoldag April 9, 2026 19:43
@mtauraso
Copy link
Copy Markdown
Collaborator Author

mtauraso commented Apr 9, 2026

@copilot Take a look at docs/conf.py There is an exclusion list for classes that we use to avoid certain classes of doc errors.

If you change directory into docs and run make html you should get the same error that is happening in CI. See if you can alter conf.py to fix the error.

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 9, 2026

@copilot Take a look at docs/conf.py There is an exclusion list for classes that we use to avoid certain classes of doc errors.

If you change directory into docs and run make html you should ...

Fixed in b0384a6. The notebook (notebooks/nested_pandas_time_series_with_hyrax.ipynb) was added to exclude_patterns in docs/conf.py. The CI failure was RuntimeError: Unsupported type for cache: <class 'pandas.core.series.Series'> — nbsphinx was executing the notebook during the doc build, which triggered the DataCache size calculation on pandas Series objects. Excluding it follows the same pattern as archived_notebooks/* already in that list.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 9, 2026

Before [605fde5] After [5038489] Ratio Benchmark (Parameter)
55.2±0.07s failed n/a vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'qdrant')
1.67G 1.72G 1.03 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'chromadb')
10.2±0.01ms 10.5±0.2ms 1.03 vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(128, 'chromadb')
7.13±0.02s 7.29±0.05s 1.02 benchmarks.time_help
7.00±0.04s 7.14±0.03s 1.02 benchmarks.time_lookup_help
39.1±0.06ms 39.9±0.08ms 1.02 benchmarks.time_nb_obj_construct
7.06±0.04s 7.20±0.07s 1.02 benchmarks.time_train_help
275±1ms 276±3ms 1.01 benchmarks.time_import
7.15±0.04s 7.22±0.03s 1.01 benchmarks.time_infer_help
40.4±0.9ms 40.7±0.5ms 1.01 benchmarks.time_nb_obj_dir

Click here to view all benchmarks.

Copy link
Copy Markdown
Collaborator

@dougbrn dougbrn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very cool, approving, but my one comment is that whether by looking at the PR, or trying the branch RTD site, it's not clear to me that the added notebook is navigable on the docs, so I would add links to it where appropriate.

Copy link
Copy Markdown
Collaborator

@drewoldag drewoldag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly minor comments, but one real question about the sub-columns and using dots in the getter method names.

Comment thread docs/conf.py
"archived_notebooks/*",
# This notebook requires a running Hyrax pipeline and cannot be executed
# in the CI/ReadTheDocs build environment.
"notebooks/nested_pandas_time_series_with_hyrax.ipynb",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we can just put the notebook in pre-executed, right? At least one cell has to have been run in order for pandoc not to try to rerun it when building the documentation though.

Comment thread src/hyrax/hyrax_default_config.toml Outdated

[data_set.NestedPandasDataset]
# Extra keyword arguments passed directly to the selected nested_pandas reader.
read_kwargs = {}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might recommend making this another subtable in toml since that will map nicely and users won't have to write dictionaries in toml. This new code might be the first instance of this approach. I think that the way we would usually do it is like this:

[data_set.NestedPandasDataset]
[data_set.NestedPandasDataset.read_kwargs]
thing_1 = 'foo'
thing_2 = 'bar'

I'm open to counter arguments.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right.I'm not sure what to actually put there though.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if you leave it as just a subtable with no key/values that you'll get an empyt dict like what you're after.

[data_set.NestedPandasDataset]
[data_set.NestedPandasDataset.read_kwargs]
# message here saying to add nested kwargs here if needed.
# or a commented out, common kwarg someone might use? 

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually looks really tidy.

Comment on lines +15 to +16
settings = config.get("data_set", {}).get("NestedPandasDataset", {})
self.read_kwargs = settings.get("read_kwargs") or {}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally worried about the .get(...) usage around the config. I would like to avoid it if possible, just so that general users don't get the idea that they can provide other default values.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree.

Comment on lines +55 to +69
"h = hyrax.Hyrax()\n",
"h.config[\"data_request\"] = {\n",
" \"infer\": {\n",
" \"lc\": {\n",
" \"dataset_class\": \"NestedPandasDataset\",\n",
" \"data_location\": str(output_path),\n",
" \"primary_id_field\": \"object_id\",\n",
" # Request only nested columns using nested-pandas dot-notation.\n",
" \"fields\": [\"lightcurve.time\", \"lightcurve.flux\", \"lightcurve.flux_err\", \"lightcurve.band\"],\n",
" }\n",
" }\n",
"}\n",
"\n",
"prepared = h.prepare()\n",
"print(f\"Loaded rows: {len(prepared['infer'])}\")"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor point - perhaps move the Hyrax instantiation stuff to a new cell, just to separate the synthetic data production from the Hyrax stuff.

Comment on lines +110 to +113
"time = np.asarray(prepared[\"infer\"][0][\"lc\"][\"lightcurve.time\"])\n",
"flux = np.asarray(prepared[\"infer\"][0][\"lc\"][\"lightcurve.flux\"])\n",
"flux_err = np.asarray(prepared[\"infer\"][0][\"lc\"][\"lightcurve.flux_err\"])\n",
"band = np.asarray(prepared[\"infer\"][0][\"lc\"][\"lightcurve.band\"])\n",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These feel kind of wordy, perhaps:

data = prepared["infer"][0]["lc"]
time = data["lightcurve.time"]
...

def _all_available_fields(self) -> list[str]:
fields = list(self.nested_frame.columns)
if hasattr(self.nested_frame, "get_subcolumns"):
fields.extend(self.nested_frame.get_subcolumns())
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given what I see in the example notebook, am I correct in my understanding the the getter method names would look like this:

def get_lightcurve.time():
    ...

Are dots valid in python method names?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So apparently... they are... but you can't call them with normal syntax.

We are metaprogramming so it "just works"

Nonetheless it is awkward

dataset = prepared["infer"].prepped_datasets['lc']
dir(dataset)

yields:

[...
 'get_lightcurve',
 'get_lightcurve.band',
 'get_lightcurve.flux',
 'get_lightcurve.flux_err',
 'get_lightcurve.time',
 'get_object_id',
 ...]

And this returns a proper numpy array as one would expect:

getattr(dataset, "get_lightcurve.flux")(0)

The rough part, is that if we do any translation of "." I think it has to be up in DataProvider if we want people to be able to put "." in their data defnitions, calls to getitem etc...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently the idiomatic python thing is to have "get_lightcurve" return an object where fundamentally ".flux()" works as well as "()" e.g. a callable class instance.

This is another thing that would probably require surgery inside DataProvider to implement....

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so all that being said, are we basically good to go? Given that internal hyrax machinery will be used to call the getters and we don't need to do anything special to to the metaprogramming to make e.g. ".flux()" act like "()"?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm gonna merge it now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants