Chore/199/add training data adapter sims#210

Merged
ChaitanyaChawak merged 54 commits into develop from chore/199/add-training-data-adapter-sims
Apr 8, 2026
@jeipollack jeipollack commented Mar 11, 2026

Summary

Major refactor of the data handling to unify dataset access, loading and conversion.
Introduces DataAdapter, TrainingDataAdapter, DataAdapterFactory, and NpyDatasetLoader for consistent dataset handling.
Updates TensorFlow conversion routines and provides canonical key validation.
Breaking changes: DataConfigHandler no longer loads datasets directly, and data_handler.py is to be removed (TBD).

Closes #205, closes #201, closes #124, closes #178, closes #68, closes #143

What’s changed

  • Added NpyDatasetLoader to load .npy datasets with required canonical keys (positions, seds, target_field) and optional fields (masks, zernike_prior).
  • Introduced DataAdapter and DataAdapterFactory for unified dataset access across training and evaluation.
  • SupportsParams and SupportsMetadata protocols added for generic dataset parameter and metadata handling.
  • Refactored TensorFlowDatasetConverter to use canonical and optional keys and handle target_field mapping for SEDs or source images.
  • Updated PSFInference to use data_adapter property instead of manual preprocessing, simplifying _prepare_positions_and_seds. (TBD)
  • Breaking: data_config.yaml format updated; previous configs incompatible.
  • Updated unit and integration tests to reflect the new data handling and TensorFlow conversion pipeline.
  • Deprecated _prepare_positions_and_seds where data_adapter provides normalized tensors. (TBD)
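
As an illustration of the canonical key validation described above, here is a minimal sketch, not the PR's actual `NpyDatasetLoader`. It assumes the `.npy` file holds a pickled dict and that `target_field` is used as a literal key; the key names come from this description, while `load_npy_dataset` and the file name are hypothetical.

```python
import os
import tempfile

import numpy as np

# Key names are taken from the PR description; everything else is hypothetical.
REQUIRED_KEYS = ("positions", "seds", "target_field")
OPTIONAL_KEYS = ("masks", "zernike_prior")


def load_npy_dataset(path):
    """Load a pickled dict from a .npy file and validate its canonical keys."""
    # A dict saved with np.save comes back as a 0-d object array; [()] unwraps it.
    raw = np.load(path, allow_pickle=True)[()]
    if not isinstance(raw, dict):
        raise TypeError(f"Expected a dict in {path}, got {type(raw).__name__}")

    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise KeyError(f"Dataset is missing required keys: {missing}")

    # Keep the required keys, then add whichever optional keys are present.
    dataset = {key: raw[key] for key in REQUIRED_KEYS}
    for key in OPTIONAL_KEYS:
        if key in raw:
            dataset[key] = raw[key]
    return dataset


# Usage: write a tiny synthetic dataset to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.npy")
    np.save(
        path,
        {
            "positions": np.zeros((4, 2)),
            "seds": np.ones((4, 3)),
            "target_field": np.zeros((4, 8, 8)),
            "masks": np.ones((4, 8, 8), dtype=bool),
        },
    )
    dataset = load_npy_dataset(path)

print(sorted(dataset))
```

Failing early on missing required keys keeps downstream code (adapters, TensorFlow conversion) free of per-key existence checks.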

How to test / verify

Ran repeatability runs at Jean-Zay.
Results were reproduced or are consistent with previous stable versions (the train-test split yields different star samples).

Scope

Indicate the type of PR:

  • Feature
  • Bug fix
  • Hotfix
  • Documentation / process change
  • Internal / refactor
  • Release

This PR is part of a larger milestone to modernise dataset handling.

Changelog

Did this PR introduce user-visible changes?
If yes, a Scriv changelog fragment must be added and committed.

  • Changelog fragment added (if applicable)

Reviewer Checklist

Reviewers should confirm the following before approving and merging:

  • The PR targets the correct base branch (develop, or main for release PRs)
  • The PR is assigned to the developer
  • Appropriate labels are applied
  • The PR is included in relevant projects and/or milestones
  • Description clearly explains what has changed
  • Issue references included, if applicable
  • Code and documentation adhere to current standards (ruff)
  • Documentation updates included, if relevant
  • CI tests are passing
  • All reviewer comments have been addressed

Next Steps / Notes (if applicable)

I am still updating PSFInference tests and need to delete data_handler.py after this is complete.

Jennifer Pollack added 8 commits March 11, 2026 15:25
- Add DataAdapter abstraction to unify dataset handling
- Add factory for constructing adapters from loaders
- Introduce TrainingDataAdapter for training-specific inputs/targets
- Move dataset normalization and canonicalization logic into adapter
…eline

- Simplify SimulationDataLoader to only load raw datasets
- Update TensorFlowDatasetConverter interface
- Remove duplicated processing logic in data utilities
- Remove ZernikeInputs and ZernikeFactory classes
- Add ZernikeDataset dataclass to expose Zernike-related inputs
- Apply minor formatting and docstring corrections
- Update DataConfigHandler to expose params and metadata
- Simplify TrainingConfigHandler data config handling
- Normalize data parameters upstream
- Add new data configuration example files with new and updated parameters entries
@jeipollack jeipollack marked this pull request as draft March 11, 2026 17:23
@jeipollack jeipollack self-assigned this Mar 11, 2026
Jennifer Pollack and others added 20 commits March 11, 2026 19:50
- Apply data adapter set up in metrics_config_handler
- Update datasets and keys for data access in metrics_interface
- Add simPSF arg to evaluate_model in metrics_interface
- Modify parameter and arg names in metrics_interface
- Change `n_bins_lda` to `n_bins_lambda` in metrics_config.yaml
…red_keys arg

- Correct type hints syntax to comply with 3.9+
- Add missing required_keys arg to convert_dataset required for TF
  conversion
- Update ValueError message with correct hints in factory.py
- Fix doc string and SEDs key (lowercase) in tensorflow_converter
- Set `result_dict=dict(dataset)` to keep all data (not overwrite) in
  tensorflow_converter
- Add import `ensure_tensor`
- Add positions attribute and replace previous position extraction
  method
- Convert positions attribute to TensorFlow type
- Correct train/split index error due to evaluating wrong dataset
  - Use canonical keys inside `_split` method to iterate over correct
    dataset size
- Remove unnecessary code left in by accident
- Use clearer docstrings in all files
- Add missing training fraction parameters for un-split datasets
- Update `data_adapter.py` and `data_config_handler.py` to use constants
- Update _split method to use canonical keys class attribute
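
The split-related commits above can be pictured with a rough sketch, not the code in `data_adapter.py`. It assumes a seeded shuffle and that every canonical key holds one entry per star; `split_dataset` and its defaults are hypothetical, and optional per-sample fields are omitted for brevity.

```python
import numpy as np

# Canonical key names follow the PR description.
CANONICAL_KEYS = ("positions", "seds", "target_field")


def split_dataset(dataset, train_fraction=0.8, seed=0):
    """Split the canonical arrays into train and test subsets."""
    # Take the sample count from a canonical key, not from the dict itself,
    # so that unrelated entries cannot change the number of samples.
    n_samples = len(dataset[CANONICAL_KEYS[0]])
    for key in CANONICAL_KEYS:
        if len(dataset[key]) != n_samples:
            raise ValueError(f"'{key}' does not have {n_samples} samples")

    # A seeded generator makes the shuffle reproducible across runs.
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    train_idx, test_idx = order[:n_train], order[n_train:]

    def take(indices):
        # Apply the same indices to every canonical array to keep rows aligned.
        return {key: np.asarray(dataset[key])[indices] for key in CANONICAL_KEYS}

    return take(train_idx), take(test_idx)


data = {
    "positions": np.arange(20).reshape(10, 2),
    "seds": np.arange(10),
    "target_field": np.arange(10) * 2,
}
train, test = split_dataset(data, train_fraction=0.8, seed=3)
print(len(train["positions"]), len(test["positions"]))
```

Because the shuffle depends on the seed, two runs with different seeds select different stars for training, which matches the note under "How to test / verify".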
@jeipollack jeipollack marked this pull request as ready for review March 16, 2026 16:15
@jeipollack jeipollack changed the title Draft: Chore/199/add training data adapter sims Chore/199/add training data adapter sims Mar 16, 2026
Jennifer Pollack and others added 18 commits March 16, 2026 17:18
- Update normalise_data_envelope to allow optional params
- Add conditional in Case B (use loader) to raise an error if params is None
- Set DataConfigHandler.params to read_conf(file).params
- Update TrainingConfigHandler.data_params
- Add loggers on train_fraction and seed to data_adapter
- Use DataConfigHandler to load data_conf in psf inference
- Update path to data_config file in psf_inference
…tion in DataAdapterFactory

Replace the recursive numpy array check in `_resolve_dataset` with a
params-driven approach that inspects the structure of configuration
parameters to determine whether data is in memory or needs to be loaded
from disk.

Changes:
- `_resolve_dataset`: updated resolution logic and docstring to reflect
  the new two-step approach via `normalize_data_envelope` and `_is_in_memory`
- `_is_in_memory`: new module-level helper that detects in-memory data by
  checking for `file`/`data_dir` keys across all three config shapes
  (shallow, complete, split) using an internal `has_file_pointer` helper
- `_build_loaded_dataset`: new module-level helper that constructs a
  `LoadedDataset` from a dict, dataclass, or opaque object, with a
  logged warning for unrecognised structures
- `DatasetUtils.to_container`: promoted to a standalone module-level
  function in `data_utils.py` now that the class has no remaining methods
- `data_adapter.py` updated to import `to_container`
- `factory_test.py` new tests added for data scenarios to trigger loading
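
A minimal sketch of the params-driven detection described in this commit follows. The exact layouts of the shallow, complete, and split config shapes are not shown here, so the sketch assumes a file pointer sits either at the top level or one mapping level down; `is_in_memory` below is a hypothetical stand-in for the real `_is_in_memory`, and only the `file`/`data_dir` key names come from the commit message.

```python
from collections.abc import Mapping

# Keys that indicate data must be read from disk (named in the commit message).
FILE_POINTER_KEYS = ("file", "data_dir")


def has_file_pointer(section):
    """Return True if a config section names a file or a data directory."""
    if not isinstance(section, Mapping):
        return False
    return any(section.get(key) is not None for key in FILE_POINTER_KEYS)


def is_in_memory(params):
    """Guess whether params carry data directly rather than pointing to disk."""
    # Assumed shallow shape: the pointer keys sit at the top level.
    if has_file_pointer(params):
        return False
    # Assumed nested shapes: the pointer keys sit one level down, per section.
    if isinstance(params, Mapping):
        for section in params.values():
            if has_file_pointer(section):
                return False
    return True


print(is_in_memory({"file": "train.npy", "data_dir": "data/"}))
print(is_in_memory({"training": {"file": "train.npy"}}))
print(is_in_memory({"positions": [[0.0, 1.0]], "seds": [[1.0]]}))
```

Inspecting the config structure avoids walking the data itself, which is the motivation the commit gives for dropping the recursive numpy array check.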
…puts

- training_config_handler: add prepare_training_inputs helper to prepare training_inputs and psf_model
- training_config_handler: set self.training_conf = read_conf(...).training
- train.train: update method args and remove data adaptation sequence
- tests: update config handler tests with these changes
- add helper methods to prepare inputs and targets in the constructor
- fixes lazy stacking per train cycle
- update unit test to match new behaviour
- add helper _assert_data_prepared to confirm correct training states
- add comments to help developers understand state flows
@ChaitanyaChawak ChaitanyaChawak merged commit 072c187 into develop Apr 8, 2026
2 checks passed
@ChaitanyaChawak ChaitanyaChawak deleted the chore/199/add-training-data-adapter-sims branch April 8, 2026 09:08
Labels

enhancement New feature or request
