Chore/199/add training data adapter sims by jeipollack · Pull Request #210 · CosmoStat/wf-psf

jeipollack · 2026-03-11T17:23:30Z

Summary

Major refactor of the data handling to unify dataset access, loading and conversion.
Introduces DataAdapter, TrainingDataAdapter, DataAdapterFactory, and NpyDatasetLoader for consistent dataset handling.
Updates TensorFlow conversion routines and provides canonical key validation.
Breaking changes: DataConfigHandler no longer loads datasets directly, and data_handler.py is slated for removal (TBD).

Closes #205, closes #201, closes #124, closes #178, closes #68, closes #143

What’s changed

Added NpyDatasetLoader to load .npy datasets with required canonical keys (positions, seds, target_field) and optional fields (masks, zernike_prior).
Introduced DataAdapter and DataAdapterFactory for unified dataset access across training and evaluation.
Added SupportsParams and SupportsMetadata protocols for generic dataset parameter and metadata handling.
Refactored TensorFlowDatasetConverter to use canonical and optional keys and handle target_field mapping for SEDs or source images.
Updated PSFInference to use data_adapter property instead of manual preprocessing, simplifying _prepare_positions_and_seds. (TBD)
Breaking: data_config.yaml format updated; previous configs incompatible.
Updated unit and integration tests to reflect the new data handling and TensorFlow conversion pipeline.
Deprecated _prepare_positions_and_seds where data_adapter provides normalized tensors. (TBD)
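
To illustrate the canonical key validation listed above, here is a minimal sketch. The key names come from this description; the constant names and the function `check_canonical_keys` are hypothetical and are not the NpyDatasetLoader API. Plain lists stand in for the arrays a real .npy dataset would hold.

```python
# Key names are taken from the PR description; the constant and function
# names are hypothetical and only illustrate the idea.
REQUIRED_KEYS = ("positions", "seds", "target_field")
OPTIONAL_KEYS = ("masks", "zernike_prior")


def check_canonical_keys(dataset):
    """Raise if a required key is missing; return the optional keys present."""
    missing = [key for key in REQUIRED_KEYS if key not in dataset]
    if missing:
        raise KeyError(f"Dataset is missing required keys: {missing}")
    # Optional fields are reported only when the dataset provides them.
    return [key for key in OPTIONAL_KEYS if key in dataset]


# Plain lists stand in for the arrays stored in a real dataset.
raw = {
    "positions": [[0.0, 1.0]],
    "seds": [[0.5, 0.5]],
    "target_field": [[1.0]],
    "masks": [[True]],
}
optional_present = check_canonical_keys(raw)
```

In this sketch a dataset without `seds` or `target_field` fails early with a `KeyError` rather than later during conversion.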

How to test / verify

Ran repeatability runs on Jean-Zay.
Results are reproduced or, where the train-test split yields different star samples, consistent with previous stable versions.

Scope

Indicate the type of PR:

This PR is part of a larger milestone to modernise dataset handling.

Changelog

Did this PR introduce user-visible changes?
If yes, a Scriv changelog fragment must be added and committed.

Changelog fragment added (if applicable)

Reviewer Checklist

Reviewers should confirm the following before approving and merging:

Next Steps / Notes (if applicable)

I am still updating PSFInference tests and need to delete data_handler.py after this is complete.

- Add DataAdapter abstraction to unify dataset handling
- Add factory for constructing adapters from loaders
- Introduce TrainingDataAdapter for training-specific inputs/targets
- Move dataset normalization and canonicalization logic into adapter

- Update DataConfigHandler to expose params and metadata
- Simplify TrainingConfigHandler data config handling
- Normalize data parameters upstream
- Add new data configuration example files with new and updated parameters entries
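
The idea of a handler that exposes params and metadata through the SupportsParams and SupportsMetadata protocols can be sketched with `typing.Protocol`. This is an assumption-laden illustration: the attribute names `params` and `metadata` are inferred from the commit message, and `ToyConfigHandler` and `describe` are hypothetical names that do not exist in wf-psf.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsParams(Protocol):
    """Anything exposing a `params` attribute (attribute name assumed)."""

    params: Any


@runtime_checkable
class SupportsMetadata(Protocol):
    """Anything exposing a `metadata` attribute (attribute name assumed)."""

    metadata: Any


@dataclass
class ToyConfigHandler:
    """Hypothetical stand-in for a config handler."""

    params: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)


def describe(source):
    """List which of the two protocols an object satisfies."""
    supported = []
    if isinstance(source, SupportsParams):
        supported.append("params")
    if isinstance(source, SupportsMetadata):
        supported.append("metadata")
    return supported
```

Structural typing of this kind lets consumers accept any object with the right attributes, without requiring a shared base class.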

- Apply data adapter set up in metrics_config_handler
- Update datasets and keys for data access in metrics_interface
- Add simPSF arg to evaluate_model in metrics_interface
- Modify parameter and arg names in metrics_interface
- Change `n_bins_lda` to `n_bins_lambda` in metrics_config.yaml

…red_keys arg

- Correct type hints syntax to comply with 3.9+
- Add missing required_keys arg to convert_dataset required for TF conversion
- Update ValueError message with correct hints in factory.py
- Fix doc string and SEDs key (lowercase) in tensorflow_converter
- Set `result_dict=dict(dataset)` to keep all data (not overwrite) in tensorflow_converter
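
The `result_dict=dict(dataset)` change can be illustrated with a small sketch: starting from a shallow copy keeps entries that are not converted and leaves the caller's dict untouched. The signature below is illustrative only and is not the actual `convert_dataset` interface; `convert` defaults to `tuple` as a stand-in for a tensor conversion such as `tf.convert_to_tensor`.

```python
def convert_dataset_sketch(dataset, required_keys, convert=tuple):
    """Convert the required entries while keeping every other entry.

    Hypothetical signature; `convert` stands in for a tensor conversion.
    """
    missing = [key for key in required_keys if key not in dataset]
    if missing:
        raise KeyError(f"Missing required keys for conversion: {missing}")
    # A shallow copy preserves non-required entries and avoids mutating
    # the dict passed in by the caller.
    result_dict = dict(dataset)
    for key in required_keys:
        result_dict[key] = convert(dataset[key])
    return result_dict


original = {"positions": [1, 2], "seds": [3], "masks": [0]}
converted = convert_dataset_sketch(original, ("positions", "seds"))
```

Here `masks` passes through unchanged, whereas building the result from an empty dict would have dropped it.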

- Set DataConfigHandler.params to read_conf(file).params
- Update TrainingConfigHandler.data_params
- Add loggers on train_fraction and seed to data_adapter
- Use DataConfigHandler to load data_conf in psf inference
- Update path to data_config file in psf_inference

…tion in DataAdapterFactory

Replace the recursive numpy array check in `_resolve_dataset` with a params-driven approach that inspects the structure of configuration parameters to determine whether data is in memory or needs to be loaded from disk.

Changes:
- `_resolve_dataset`: updated resolution logic and docstring to reflect the new two-step approach via `normalize_data_envelope` and `_is_in_memory`
- `_is_in_memory`: new module-level helper that detects in-memory data by checking for `file`/`data_dir` keys across all three config shapes (shallow, complete, split) using an internal `has_file_pointer` helper
- `_build_loaded_dataset`: new module-level helper that constructs a `LoadedDataset` from a dict, dataclass, or opaque object, with a logged warning for unrecognised structures
- `DatasetUtils.to_container`: promoted to a standalone module-level function in `data_utils.py` now that the class has no remaining methods
- `data_adapter.py` updated to import `to_container`
- `factory_test.py` new tests added for data scenarios to trigger loading
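
The params-driven check described above might look roughly like the sketch below. It is not the code from this PR: the exact layout of the shallow, complete, and split config shapes is not given here, so the sketch assumes that file pointers sit either at the top level or one level down, and the function name `is_in_memory_sketch` is hypothetical.

```python
from collections.abc import Mapping

# Keys whose presence signals that data must be loaded from disk.
FILE_POINTER_KEYS = ("file", "data_dir")


def is_in_memory_sketch(params):
    """Return True when no file pointer is found in the params structure."""

    def has_file_pointer(block):
        # Only mappings can carry `file` / `data_dir` entries.
        return isinstance(block, Mapping) and any(
            key in block for key in FILE_POINTER_KEYS
        )

    if has_file_pointer(params):
        return False
    if isinstance(params, Mapping):
        # Assumption: nested blocks (e.g. per-split sections) are one level deep.
        return not any(has_file_pointer(value) for value in params.values())
    return True
```

Inspecting the config structure avoids walking the data itself, which is what the recursive array check had to do.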

…puts

- training_config_handler: add prepare_training_inputs helper to prepare training_inputs and psf_model
- training_config_handler: set self.training_conf = read_conf(...).training
- train.train: update method args and remove data adaptation sequence
- tests: update config handler tests with these changes

Jennifer Pollack added 8 commits March 11, 2026 15:25

refactor(data): integrate loaders and converters with DataAdapter pipeline

f250dec

- Simplify SimulationDataLoader to only load raw datasets
- Update TensorFlowDatasetConverter interface
- Remove duplicated processing logic in data utilities

refactor(data): update data_zernike_utils with DataAdapter architecture

3732379

- Remove ZernikeInputs and ZernikeFactory classes
- Add ZernikeDataset dataclass to expose Zernike-related inputs
- Apply minor formatting and doc string corrections

delete config/data_config.yaml example

541b9b2

refactor(training): integrate DataAdapter with training pipeline and missing doc strings

4000c8f

test: update existing tests for new data pipeline

7506875

test(integration): add end-to-end training data pipeline tests

42e7f43

jeipollack marked this pull request as draft March 11, 2026 17:23

jeipollack self-assigned this Mar 11, 2026

Jennifer Pollack and others added 20 commits March 11, 2026 19:50

refactor: update TrainingConfigHandler with DataAdapter architecture

34f4edb

fix(train): add missing .params to extract data configuration params

bb96a1a

Add loggers for different data adapter steps

026e03b

fix(models): Update data access to use data adapter

6a35941

fix(psf_models): correct positions type error

833f567

- Add import `ensure_tensor`
- Add positions attribute and replace previous position extraction method
- Convert positions attribute to TensorFlow type

fix(data): change SEDs in REQUIRED_KEYS to lowercase

b3fb535

fix(tests): update all tests with bug fixes in corresponding modules

d93fdc5

fix(data): correct bug and clean up

6d20e88

- Correct train/split index error due to evaluating wrong dataset
- Use canonical keys inside `_split` method to iterate over correct dataset size
- Remove unnecessary code left in by accident
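
A seeded split that takes its sample count from a canonical key, as described in this fix, could be sketched as follows. The function name, the default values, and the use of plain lists are assumptions for illustration; this is not the `_split` method from the PR.

```python
import random

# Canonical keys as named in the PR description.
CANONICAL_KEYS = ("positions", "seds", "target_field")


def split_dataset(dataset, train_fraction=0.8, seed=0):
    """Split every canonical entry with one shared, seeded permutation."""
    # Derive the sample count from a canonical key, not from an arbitrary entry.
    n_samples = len(dataset[CANONICAL_KEYS[0]])
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_fraction * n_samples))

    def take(selected):
        return {
            key: [dataset[key][i] for i in selected] for key in CANONICAL_KEYS
        }

    return take(indices[:n_train]), take(indices[n_train:])


data = {
    "positions": list(range(10)),
    "seds": list(range(10, 20)),
    "target_field": list(range(20, 30)),
}
train, test = split_dataset(data, train_fraction=0.8, seed=42)
```

Because all keys share one permutation, each star's position, SED, and target stay aligned, and repeating the call with the same seed returns the same split.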

fix(data): Correct complete and shallow data loading bugs

8a7434d

fix(train): correct TrainingAdapter constructor error due to passing full hyperparameters instead of loss

ccee180

Add logger message for when sources and masks are stacked

d611c91

Fix docstring and type hint error in calculate_sample_weights function

3d324bd

fix(metrics): add conditional to convert masks to numpy arrays

2cf42aa

Update factory module doc string

36a8084

Fix example data configuration files

6204443

- Use clearer doc strings in all files
- Add missing training fraction parameters for un-split datasets

Add data constants.py to store default data keys and parameters

0b8d87d

- Update `data_adapter.py` and `data_config_handler.py` to use constants
- Update _split method to use canonical keys class attribute

Shorten unit test doc string

45319cd

Replace SimulationDataLoader with NpyDatasetLoader

7abe906

jeipollack added enhancement New feature or request labels Mar 16, 2026

jeipollack added this to wavefront-based PSF model estimation Mar 16, 2026

Jennifer Pollack added 5 commits March 16, 2026 13:59

Replace with and update Changelog fragment

ed46291

refactor(inference): update PSFInference and unit tests to integrate DataAdapter

64903c5

refactor(tests): update tests and fixtures with removal of data_handler

5acc441

refactor(inference): replace with in inference module, unit test module, and example configuration files

02a0238

add updated psf_inference_test module (got unstaged)

4093143

jeipollack marked this pull request as ready for review March 16, 2026 16:15

jeipollack changed the title ~~Draft: Chore/199/add training data adapter sims~~ Chore/199/add training data adapter sims Mar 16, 2026

Jennifer Pollack and others added 18 commits March 16, 2026 17:18

add updated psf_inference with

066084b

fix(data): update factory methods to properly handle params

42e344c

- Update normalise_data_envelope to allow optional params
- Add conditional in Case B (use loader) to raise error if params is None

fix(test_data): remove unneeded unit test

243fa73

fix(inference): update inference and tests to use two data adapters for inference and model datasets

c43b522

fix(test): update training_config_handler_init test

3b918eb

fix(metrics): set data_params equal to DataConfigHandler object (drop .params)

c234365

fix(docs): correct typo in docstring

b0b3072

fix(data): correct error using defaults split parameters when param is None

25bb842

Update key features list in factory module doc string

33b19a8

Correct indentation errors in train.train doc string

849c6f6

refactor(data_adapter): add safer checks to split_data and log idempotent behaviour

23dcbab

refactor(training_data_adapter): eagerly prepare inputs and targets

7dac32f

- add helper methods to prepare inputs and targets in the constructor
- fixes lazy stacking per train cycle
- update unit test to match new behaviour
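
The difference between eager preparation and per-cycle lazy stacking can be shown with a toy adapter. Everything here is hypothetical (class name, helper names, the pairing used as "stacking"); it only illustrates the pattern of doing the work once in the constructor and serving cached results afterwards.

```python
class EagerTrainingAdapterSketch:
    """Toy adapter that prepares inputs and targets once, at construction."""

    def __init__(self, dataset):
        self.prepare_calls = 0
        # Eager preparation: the work happens here, not on each access.
        self._inputs = self._prepare_inputs(dataset)
        self._targets = self._prepare_targets(dataset)

    def _prepare_inputs(self, dataset):
        self.prepare_calls += 1
        # Pair each position with its SED; stands in for the real stacking.
        return list(zip(dataset["positions"], dataset["seds"]))

    def _prepare_targets(self, dataset):
        return list(dataset["target_field"])

    @property
    def inputs(self):
        return self._inputs

    @property
    def targets(self):
        return self._targets


adapter = EagerTrainingAdapterSketch(
    {"positions": [0, 1], "seds": [10, 11], "target_field": [20, 21]}
)
for _ in range(3):  # simulate three training cycles
    cycle_inputs = adapter.inputs
```

After three simulated cycles the preparation helper has still run only once.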

refactor(training_config_handler): add safety check on data prep

8e26f01

- add helper _assert_data_prepared to confirm correct training states
- add comments to help developers understand state flows

fix: update training_conf (drop .training)

e90fed9

Replace ccd_misalignments_input_path with ccd_misalignments_aux_path

859a837

ChaitanyaChawak merged commit 072c187 into develop Apr 8, 2026
2 checks passed

ChaitanyaChawak deleted the chore/199/add-training-data-adapter-sims branch April 8, 2026 09:08

ChaitanyaChawak merged 54 commits into develop from chore/199/add-training-data-adapter-sims

jeipollack commented Mar 11, 2026 • edited by ChaitanyaChawak
