Add NestedPandasDataset by mtauraso · Pull Request #868 · lincc-frameworks/hyrax

mtauraso · 2026-04-09T19:00:51Z

Motivation

Provide a lightweight Hyrax dataset that can read nested Parquet files produced by nested-pandas and expose nested columns via dot-notation for downstream inference and examples.

Description

Add NestedPandasDataset which wraps nested_pandas.read_parquet, supports data_location, read_kwargs, max_samples, and dynamically registers get_<field> accessors for nested and top-level columns.
Export NestedPandasDataset from hyrax.datasets and add a default config block in hyrax_default_config.toml for read_kwargs and max_samples.
Add nested-pandas to pyproject.toml dependencies and include an example notebook docs/notebooks/nested_pandas_time_series_with_hyrax.ipynb demonstrating usage with nested light-curve data.
Add unit tests tests/hyrax/test_nested_pandas_dataset.py covering reading, nested field getters, read_kwargs, max_samples, and required data_location behavior.

Testing

The PR adds pytest unit tests in tests/hyrax/test_nested_pandas_dataset.py which create a NestedFrame, write a parquet file, instantiate NestedPandasDataset, and assert expected getters and length.
No automated test run was executed as part of this rollout.

Codex Task

review-notebook-app · 2026-04-09T19:00:58Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

codecov · 2026-04-09T19:06:42Z

Codecov Report

❌ Patch coverage is 94.73684% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.71%. Comparing base (a6f1a43) to head (38a6a34).

Files with missing lines	Patch %	Lines
src/hyrax/datasets/nested_pandas_dataset.py	94.59%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #868      +/-   ##
==========================================
+ Coverage   66.55%   66.71%   +0.16%     
==========================================
  Files          62       63       +1     
  Lines        6515     6553      +38     
==========================================
+ Hits         4336     4372      +36     
- Misses       2179     2181       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Pull request overview

Adds first-class support for reading nested-pandas Parquet datasets in Hyrax, including configuration, tests, and an example notebook.

Changes:

Introduces NestedPandasDataset, a HyraxDataset wrapper around nested_pandas.read_parquet with dynamic get_<field> accessors (including nested dot-notation fields).
Adds default config entries for NestedPandasDataset (read_kwargs, max_samples) and exports the dataset from hyrax.datasets.
Adds nested-pandas as a dependency plus unit tests and a documentation notebook demonstrating nested time-series usage.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`src/hyrax/datasets/nested_pandas_dataset.py`	Implements `NestedPandasDataset` and dynamic getters for top-level and nested fields.
`src/hyrax/datasets/__init__.py`	Exposes `NestedPandasDataset` via package exports (and triggers import-time registry population).
`src/hyrax/hyrax_default_config.toml`	Adds default config block for `data_set.NestedPandasDataset` (`read_kwargs`, `max_samples`).
`pyproject.toml`	Adds `nested-pandas` runtime dependency.
`tests/hyrax/test_nested_pandas_dataset.py`	Adds unit tests covering read, nested getters, `read_kwargs`, `max_samples`, and required `data_location`.
`docs/notebooks/nested_pandas_time_series_with_hyrax.ipynb`	Adds an example notebook showing end-to-end usage with nested light-curve data.

mtauraso · 2026-04-09T19:17:33Z

@copilot Can you also fix all the linter errors. Ruff is having some issues with this change.

…t arg) Agent-Logs-Url: https://github.com/lincc-frameworks/hyrax/sessions/8c48d58f-f711-4bcf-be92-a30da2357e01 Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>

… docstrings Agent-Logs-Url: https://github.com/lincc-frameworks/hyrax/sessions/637f35ea-20ee-4019-ba3e-8d6e2c3627c8 Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>

Copilot · 2026-04-09T19:37:57Z

@copilot Can you also fix all the linter errors. Ruff is having some issues with this change.

Fixed in bd5869e. Added missing docstrings to all three test functions (D103), removed the unused typing.Any import, and the ruff-format issue is resolved.

mtauraso · 2026-04-09T19:56:43Z

@copilot Take a look at docs/conf.py There is an exclusion list for classes that we use to avoid certain classes of doc errors.

If you change directory into docs and run make html you should get the same error that is happening in CI. See if you can alter conf.py to fix the error.

…tion Agent-Logs-Url: https://github.com/lincc-frameworks/hyrax/sessions/3025a4e5-13b0-4a7e-bbc7-26462909c563 Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>

Copilot · 2026-04-09T20:06:05Z

@copilot Take a look at docs/conf.py There is an exclusion list for classes that we use to avoid certain classes of doc errors.

If you change directory into docs and run make html you should ...

Fixed in b0384a6. The notebook (notebooks/nested_pandas_time_series_with_hyrax.ipynb) was added to exclude_patterns in docs/conf.py. The CI failure was RuntimeError: Unsupported type for cache: <class 'pandas.core.series.Series'> — nbsphinx was executing the notebook during the doc build, which triggered the DataCache size calculation on pandas Series objects. Excluding it follows the same pattern as archived_notebooks/* already in that list.

github-actions · 2026-04-09T20:44:05Z

Before [`605fde5`]	After [`5038489`]	Ratio	Benchmark (Parameter)
55.2±0.07s	failed	n/a	vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'qdrant')
1.67G	1.72G	1.03	vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'chromadb')
10.2±0.01ms	10.5±0.2ms	1.03	vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(128, 'chromadb')
7.13±0.02s	7.29±0.05s	1.02	benchmarks.time_help
7.00±0.04s	7.14±0.03s	1.02	benchmarks.time_lookup_help
39.1±0.06ms	39.9±0.08ms	1.02	benchmarks.time_nb_obj_construct
7.06±0.04s	7.20±0.07s	1.02	benchmarks.time_train_help
275±1ms	276±3ms	1.01	benchmarks.time_import
7.15±0.04s	7.22±0.03s	1.01	benchmarks.time_infer_help
40.4±0.9ms	40.7±0.5ms	1.01	benchmarks.time_nb_obj_dir

Click here to view all benchmarks.

dougbrn

Looks very cool, approving, but my one comment is that whether by looking at the PR, or trying the branch RTD site, it's not clear to me that the added notebook is navigable on the docs, so I would add links to it where appropriate.

drewoldag

Mostly minor comments, but one real question about the sub-columns and using dots in the getter method names.

drewoldag · 2026-04-13T18:52:22Z

+    "archived_notebooks/*",
+    # This notebook requires a running Hyrax pipeline and cannot be executed
+    # in the CI/ReadTheDocs build environment.
+    "notebooks/nested_pandas_time_series_with_hyrax.ipynb",


I think that we can just put the notebook in pre-executed, right? At least one cell has to have been run in order for pandoc not to try to rerun it when building the documentation though.

drewoldag · 2026-04-13T18:57:14Z

+
+[data_set.NestedPandasDataset]
+# Extra keyword arguments passed directly to the selected nested_pandas reader.
+read_kwargs = {}


I might recommend making this another subtable in toml since that will map nicely and users won't have to write dictionaries in toml. This new code might be the first instance of this approach. I think that the way we would usually do it is like this:

[data_set.NestedPandasDataset] [data_set.NestedPandasDataset.read_kwargs] thing_1 = 'foo' thing_2 = 'bar'

I'm open to counter arguments.

I think you're right.I'm not sure what to actually put there though.

I think that if you leave it as just a subtable with no key/values that you'll get an empyt dict like what you're after.

[data_set.NestedPandasDataset] [data_set.NestedPandasDataset.read_kwargs] # message here saying to add nested kwargs here if needed. # or a commented out, common kwarg someone might use?

drewoldag · 2026-04-13T18:57:49Z

This actually looks really tidy.

drewoldag · 2026-04-13T18:59:08Z

+        settings = config.get("data_set", {}).get("NestedPandasDataset", {})
+        self.read_kwargs = settings.get("read_kwargs") or {}


I'm generally worried about the .get(...) usage around the config. I would like to avoid it if possible, just so that general users don't get the idea that they can provide other default values.

drewoldag · 2026-04-13T19:00:23Z

+    "h = hyrax.Hyrax()\n",
+    "h.config[\"data_request\"] = {\n",
+    "    \"infer\": {\n",
+    "        \"lc\": {\n",
+    "            \"dataset_class\": \"NestedPandasDataset\",\n",
+    "            \"data_location\": str(output_path),\n",
+    "            \"primary_id_field\": \"object_id\",\n",
+    "            # Request only nested columns using nested-pandas dot-notation.\n",
+    "            \"fields\": [\"lightcurve.time\", \"lightcurve.flux\", \"lightcurve.flux_err\", \"lightcurve.band\"],\n",
+    "        }\n",
+    "    }\n",
+    "}\n",
+    "\n",
+    "prepared = h.prepare()\n",
+    "print(f\"Loaded rows: {len(prepared['infer'])}\")"


Minor point - perhaps move the Hyrax instantiation stuff to a new cell, just to separate the synthetic data production from the Hyrax stuff.

drewoldag · 2026-04-13T19:01:26Z

+    "time = np.asarray(prepared[\"infer\"][0][\"lc\"][\"lightcurve.time\"])\n",
+    "flux = np.asarray(prepared[\"infer\"][0][\"lc\"][\"lightcurve.flux\"])\n",
+    "flux_err = np.asarray(prepared[\"infer\"][0][\"lc\"][\"lightcurve.flux_err\"])\n",
+    "band = np.asarray(prepared[\"infer\"][0][\"lc\"][\"lightcurve.band\"])\n",


These feel kind of wordy, perhaps:

data = prepared["infer"][0]["lc"] time = data["lightcurve.time"] ...

drewoldag · 2026-04-13T19:03:45Z

+    def _all_available_fields(self) -> list[str]:
+        fields = list(self.nested_frame.columns)
+        if hasattr(self.nested_frame, "get_subcolumns"):
+            fields.extend(self.nested_frame.get_subcolumns())


Given what I see in the example notebook, am I correct in my understanding the the getter method names would look like this:

def get_lightcurve.time(): ...

Are dots valid in python method names?

So apparently... they are... but you can't call them with normal syntax.

We are metaprogramming so it "just works"

Nonetheless it is awkward

dataset = prepared["infer"].prepped_datasets['lc'] dir(dataset)

yields:

[... 'get_lightcurve', 'get_lightcurve.band', 'get_lightcurve.flux', 'get_lightcurve.flux_err', 'get_lightcurve.time', 'get_object_id', ...]

And this returns a proper numpy array as one would expect:

getattr(dataset, "get_lightcurve.flux")(0)

The rough part, is that if we do any translation of "." I think it has to be up in DataProvider if we want people to be able to put "." in their data defnitions, calls to getitem etc...

Apparently the idiomatic python thing is to have "get_lightcurve" return an object where fundamentally ".flux()" works as well as "()" e.g. a callable class instance.

This is another thing that would probably require surgery inside DataProvider to implement....

Ok, so all that being said, are we basically good to go? Given that internal hyrax machinery will be used to call the getters and we don't need to do anything special to to the metaprogramming to make e.g. ".flux()" act like "()"?

Yeah, I'm gonna merge it now.

…-hyrax

Simplify nested pandas dataset to parquet-only and real tests

4a598bb

mtauraso added the codex label Apr 9, 2026 — with ChatGPT Codex Connector

Copilot AI review requested due to automatic review settings April 9, 2026 19:00

Copilot started reviewing on behalf of mtauraso April 9, 2026 19:01 View session

Copilot AI reviewed Apr 9, 2026

View reviewed changes

Comment thread src/hyrax/datasets/nested_pandas_dataset.py Outdated

Comment thread src/hyrax/hyrax_default_config.toml Outdated

Copilot started work on behalf of mtauraso April 9, 2026 19:15 View session

mtauraso marked this pull request as draft April 9, 2026 19:15

mtauraso self-assigned this Apr 9, 2026

mtauraso changed the title ~~Add NestedPandasDataset wrapper, config, tests, and example notebook~~ Add NestedPandasDataset Apr 9, 2026

Remove max_samples from NestedPandasDataset (no read_parquet row-limi…

f0f8a05

…t arg) Agent-Logs-Url: https://github.com/lincc-frameworks/hyrax/sessions/8c48d58f-f711-4bcf-be92-a30da2357e01 Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>

Copilot finished work on behalf of mtauraso April 9, 2026 19:19

Copilot started work on behalf of mtauraso April 9, 2026 19:29 View session

Simplify NestedPandasDataset: use nested-pandas bracket operator, add…

bd5869e

… docstrings Agent-Logs-Url: https://github.com/lincc-frameworks/hyrax/sessions/637f35ea-20ee-4019-ba3e-8d6e2c3627c8 Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>

Copilot finished work on behalf of mtauraso April 9, 2026 19:39

mtauraso marked this pull request as ready for review April 9, 2026 19:43

mtauraso requested review from dougbrn and drewoldag April 9, 2026 19:43

Fixup linter

927aa85

Copilot started work on behalf of mtauraso April 9, 2026 19:56 View session

Fix doc build: exclude NestedPandasDataset notebook from Sphinx execu…

b0384a6

…tion Agent-Logs-Url: https://github.com/lincc-frameworks/hyrax/sessions/3025a4e5-13b0-4a7e-bbc7-26462909c563 Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>

Copilot finished work on behalf of mtauraso April 9, 2026 20:07

dougbrn approved these changes Apr 9, 2026

View reviewed changes

drewoldag reviewed Apr 13, 2026

View reviewed changes

mtauraso and others added 5 commits April 13, 2026 15:13

Merge branch 'main' into codex/create-nested-pandas-dataset-class-for…

d3ac5af

…-hyrax

Merge branch 'main' into codex/create-nested-pandas-dataset-class-for…

95fcc73

…-hyrax

Fixup based on code review feedback

3df3050

Fix test

7a54d6a

Merge branch 'main' into codex/create-nested-pandas-dataset-class-for…

38a6a34

…-hyrax

		settings = config.get("data_set", {}).get("NestedPandasDataset", {})
		self.read_kwargs = settings.get("read_kwargs") or {}

Conversation

mtauraso commented Apr 9, 2026

Motivation

Description

Testing

Uh oh!

review-notebook-app bot commented Apr 9, 2026

Uh oh!

codecov bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

mtauraso commented Apr 9, 2026

Uh oh!

Copilot AI commented Apr 9, 2026

Uh oh!

mtauraso commented Apr 9, 2026

Uh oh!

Copilot AI commented Apr 9, 2026

Uh oh!

github-actions bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dougbrn left a comment

Choose a reason for hiding this comment

Uh oh!

drewoldag left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

codecov bot commented Apr 9, 2026 •

edited

Loading

github-actions bot commented Apr 9, 2026 •

edited

Loading