Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #868 +/- ##
==========================================
+ Coverage 66.55% 66.71% +0.16%
==========================================
Files 62 63 +1
Lines 6515 6553 +38
==========================================
+ Hits 4336 4372 +36
- Misses 2179 2181 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds first-class support for reading nested-pandas Parquet datasets in Hyrax, including configuration, tests, and an example notebook.
Changes:
- Introduces
NestedPandasDataset, aHyraxDatasetwrapper aroundnested_pandas.read_parquetwith dynamicget_<field>accessors (including nested dot-notation fields). - Adds default config entries for
NestedPandasDataset(read_kwargs,max_samples) and exports the dataset fromhyrax.datasets. - Adds
nested-pandasas a dependency plus unit tests and a documentation notebook demonstrating nested time-series usage.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/hyrax/datasets/nested_pandas_dataset.py |
Implements NestedPandasDataset and dynamic getters for top-level and nested fields. |
src/hyrax/datasets/__init__.py |
Exposes NestedPandasDataset via package exports (and triggers import-time registry population). |
src/hyrax/hyrax_default_config.toml |
Adds default config block for data_set.NestedPandasDataset (read_kwargs, max_samples). |
pyproject.toml |
Adds nested-pandas runtime dependency. |
tests/hyrax/test_nested_pandas_dataset.py |
Adds unit tests covering read, nested getters, read_kwargs, max_samples, and required data_location. |
docs/notebooks/nested_pandas_time_series_with_hyrax.ipynb |
Adds an example notebook showing end-to-end usage with nested light-curve data. |
|
@copilot Can you also fix all the linter errors. Ruff is having some issues with this change. |
…t arg) Agent-Logs-Url: https://github.com/lincc-frameworks/hyrax/sessions/8c48d58f-f711-4bcf-be92-a30da2357e01 Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
… docstrings Agent-Logs-Url: https://github.com/lincc-frameworks/hyrax/sessions/637f35ea-20ee-4019-ba3e-8d6e2c3627c8 Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
|
@copilot Take a look at If you change directory into |
…tion Agent-Logs-Url: https://github.com/lincc-frameworks/hyrax/sessions/3025a4e5-13b0-4a7e-bbc7-26462909c563 Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
Fixed in b0384a6. The notebook ( |
Click here to view all benchmarks. |
dougbrn
left a comment
There was a problem hiding this comment.
Looks very cool, approving, but my one comment is that whether by looking at the PR, or trying the branch RTD site, it's not clear to me that the added notebook is navigable on the docs, so I would add links to it where appropriate.
drewoldag
left a comment
There was a problem hiding this comment.
Mostly minor comments, but one real question about the sub-columns and using dots in the getter method names.
| "archived_notebooks/*", | ||
| # This notebook requires a running Hyrax pipeline and cannot be executed | ||
| # in the CI/ReadTheDocs build environment. | ||
| "notebooks/nested_pandas_time_series_with_hyrax.ipynb", |
There was a problem hiding this comment.
I think that we can just put the notebook in pre-executed, right? At least one cell has to have been run in order for pandoc not to try to rerun it when building the documentation though.
|
|
||
| [data_set.NestedPandasDataset] | ||
| # Extra keyword arguments passed directly to the selected nested_pandas reader. | ||
| read_kwargs = {} |
There was a problem hiding this comment.
I might recommend making this another subtable in toml since that will map nicely and users won't have to write dictionaries in toml. This new code might be the first instance of this approach. I think that the way we would usually do it is like this:
[data_set.NestedPandasDataset]
[data_set.NestedPandasDataset.read_kwargs]
thing_1 = 'foo'
thing_2 = 'bar'I'm open to counter arguments.
There was a problem hiding this comment.
I think you're right.I'm not sure what to actually put there though.
There was a problem hiding this comment.
I think that if you leave it as just a subtable with no key/values that you'll get an empyt dict like what you're after.
[data_set.NestedPandasDataset]
[data_set.NestedPandasDataset.read_kwargs]
# message here saying to add nested kwargs here if needed.
# or a commented out, common kwarg someone might use? There was a problem hiding this comment.
This actually looks really tidy.
| settings = config.get("data_set", {}).get("NestedPandasDataset", {}) | ||
| self.read_kwargs = settings.get("read_kwargs") or {} |
There was a problem hiding this comment.
I'm generally worried about the .get(...) usage around the config. I would like to avoid it if possible, just so that general users don't get the idea that they can provide other default values.
| "h = hyrax.Hyrax()\n", | ||
| "h.config[\"data_request\"] = {\n", | ||
| " \"infer\": {\n", | ||
| " \"lc\": {\n", | ||
| " \"dataset_class\": \"NestedPandasDataset\",\n", | ||
| " \"data_location\": str(output_path),\n", | ||
| " \"primary_id_field\": \"object_id\",\n", | ||
| " # Request only nested columns using nested-pandas dot-notation.\n", | ||
| " \"fields\": [\"lightcurve.time\", \"lightcurve.flux\", \"lightcurve.flux_err\", \"lightcurve.band\"],\n", | ||
| " }\n", | ||
| " }\n", | ||
| "}\n", | ||
| "\n", | ||
| "prepared = h.prepare()\n", | ||
| "print(f\"Loaded rows: {len(prepared['infer'])}\")" |
There was a problem hiding this comment.
Minor point - perhaps move the Hyrax instantiation stuff to a new cell, just to separate the synthetic data production from the Hyrax stuff.
| "time = np.asarray(prepared[\"infer\"][0][\"lc\"][\"lightcurve.time\"])\n", | ||
| "flux = np.asarray(prepared[\"infer\"][0][\"lc\"][\"lightcurve.flux\"])\n", | ||
| "flux_err = np.asarray(prepared[\"infer\"][0][\"lc\"][\"lightcurve.flux_err\"])\n", | ||
| "band = np.asarray(prepared[\"infer\"][0][\"lc\"][\"lightcurve.band\"])\n", |
There was a problem hiding this comment.
These feel kind of wordy, perhaps:
data = prepared["infer"][0]["lc"]
time = data["lightcurve.time"]
...| def _all_available_fields(self) -> list[str]: | ||
| fields = list(self.nested_frame.columns) | ||
| if hasattr(self.nested_frame, "get_subcolumns"): | ||
| fields.extend(self.nested_frame.get_subcolumns()) |
There was a problem hiding this comment.
Given what I see in the example notebook, am I correct in my understanding the the getter method names would look like this:
def get_lightcurve.time():
...Are dots valid in python method names?
There was a problem hiding this comment.
So apparently... they are... but you can't call them with normal syntax.
We are metaprogramming so it "just works"
Nonetheless it is awkward
dataset = prepared["infer"].prepped_datasets['lc']
dir(dataset)
yields:
[...
'get_lightcurve',
'get_lightcurve.band',
'get_lightcurve.flux',
'get_lightcurve.flux_err',
'get_lightcurve.time',
'get_object_id',
...]
And this returns a proper numpy array as one would expect:
getattr(dataset, "get_lightcurve.flux")(0)
The rough part, is that if we do any translation of "." I think it has to be up in DataProvider if we want people to be able to put "." in their data defnitions, calls to getitem etc...
There was a problem hiding this comment.
Apparently the idiomatic python thing is to have "get_lightcurve" return an object where fundamentally ".flux()" works as well as "()" e.g. a callable class instance.
This is another thing that would probably require surgery inside DataProvider to implement....
There was a problem hiding this comment.
Ok, so all that being said, are we basically good to go? Given that internal hyrax machinery will be used to call the getters and we don't need to do anything special to to the metaprogramming to make e.g. ".flux()" act like "()"?
There was a problem hiding this comment.
Yeah, I'm gonna merge it now.
Motivation
nested-pandasand expose nested columns via dot-notation for downstream inference and examples.Description
NestedPandasDatasetwhich wrapsnested_pandas.read_parquet, supportsdata_location,read_kwargs,max_samples, and dynamically registersget_<field>accessors for nested and top-level columns.NestedPandasDatasetfromhyrax.datasetsand add a default config block inhyrax_default_config.tomlforread_kwargsandmax_samples.nested-pandastopyproject.tomldependencies and include an example notebookdocs/notebooks/nested_pandas_time_series_with_hyrax.ipynbdemonstrating usage with nested light-curve data.tests/hyrax/test_nested_pandas_dataset.pycovering reading, nested field getters,read_kwargs,max_samples, and requireddata_locationbehavior.Testing
tests/hyrax/test_nested_pandas_dataset.pywhich create aNestedFrame, write a parquet file, instantiateNestedPandasDataset, and assert expected getters and length.Codex Task