Chore/199/add training data adapter sims #210
Merged
ChaitanyaChawak merged 54 commits into develop on Apr 8, 2026
added 8 commits
March 11, 2026 15:25
- Add DataAdapter abstraction to unify dataset handling
- Add factory for constructing adapters from loaders
- Introduce TrainingDataAdapter for training-specific inputs/targets
- Move dataset normalization and canonicalization logic into adapter

…eline
- Simplify SimulationDataLoader to only load raw datasets
- Update TensorFlowDatasetConverter interface
- Remove duplicated processing logic in data utilities

- Remove ZernikeInputs and ZernikeFactory classes
- Add ZernikeDataset dataclass to expose Zernike-related inputs
- Apply minor formatting and doc string corrections

- Update DataConfigHandler to expose params and metadata
- Simplify TrainingConfigHandler data config handling
- Normalize data parameters upstream
- Add new data configuration example files with new and updated parameters entries

…missing doc strings

- Apply data adapter set up in metrics_config_handler
- Update datasets and keys for data access in metrics_interface
- Add simPSF arg to evaluate_model in metrics_interface
- Modify parameter and arg names in metrics_interface
- Change `n_bins_lda` to `n_bins_lambda` in metrics_config.yaml

…red_keys arg
- Correct type hints syntax to comply with 3.9+
- Add missing required_keys arg to convert_dataset required for TF conversion
- Update ValueError message with correct hints in factory.py
- Fix doc string and SEDs key (lowercase) in tensorflow_converter
- Set `result_dict=dict(dataset)` to keep all data (not overwrite) in tensorflow_converter

- Add import `ensure_tensor`
- Add positions attribute and replace previous position extraction method
- Convert positions attribute to TensorFlow type

- Correct train/split index error due to evaluating wrong dataset
- Use canonical keys inside `_split` method to iterate over correct
dataset size
- Remove unnecessary code left in by accident
…full hyperparameters instead of loss

- Use clearer doc strings in all files
- Add missing training fraction parameters for un-split datasets

- Update `data_adapter.py` and `data_config_handler.py` to use constants
- Update _split method to use canonical keys class attribute
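
The `_split` fixes in the commits above (sizing the split from the canonical keys, plus the training fraction and seed parameters) can be illustrated with a minimal sketch. This is not the code from the PR: `split_dataset`, `CANONICAL_KEYS`, and the plain-list dataset are hypothetical stand-ins, and the real adapter presumably operates on arrays or tensors.

```python
import random

# Assumption: the canonical keys named in the PR description.
CANONICAL_KEYS = ("positions", "seds", "target_field")


def split_dataset(dataset, train_fraction, seed):
    """Split all canonical fields with one shared, seeded index permutation."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")

    # Size the split from the canonical fields only, so unrelated entries
    # (metadata, parameters) cannot change the number of samples.
    sizes = {key: len(dataset[key]) for key in CANONICAL_KEYS}
    if len(set(sizes.values())) != 1:
        raise ValueError(f"canonical fields differ in length: {sizes}")
    n_samples = sizes[CANONICAL_KEYS[0]]

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible for a fixed seed
    n_train = int(round(train_fraction * n_samples))
    train_idx, test_idx = indices[:n_train], indices[n_train:]

    def take(idx):
        return {key: [dataset[key][i] for i in idx] for key in CANONICAL_KEYS}

    return take(train_idx), take(test_idx)
```

Using the same index permutation for every field keeps positions, SEDs, and targets aligned, and a fixed seed makes the split repeatable across runs.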
added 5 commits
March 16, 2026 13:59
…le, and example configuration files
- Update normalise_data_envelope to allow optional params
- Add conditional in Case B (use loader) to raise an error if params is None
…or inference and model datasets

- Set DataConfigHandler.params to read_conf(file).params
- Update TrainingConfigHandler.data_params
- Add loggers on train_fraction and seed to data_adapter
- Use DataConfigHandler to load data_conf in psf inference
- Update path to data_config file in psf_inference

…tion in DataAdapterFactory
Replace the recursive numpy array check in `_resolve_dataset` with a params-driven approach that inspects the structure of configuration parameters to determine whether data is in memory or needs to be loaded from disk.
Changes:
- `_resolve_dataset`: updated resolution logic and docstring to reflect the new two-step approach via `normalize_data_envelope` and `_is_in_memory`
- `_is_in_memory`: new module-level helper that detects in-memory data by checking for `file`/`data_dir` keys across all three config shapes (shallow, complete, split) using an internal `has_file_pointer` helper
- `_build_loaded_dataset`: new module-level helper that constructs a `LoadedDataset` from a dict, dataclass, or opaque object, with a logged warning for unrecognised structures
- `DatasetUtils.to_container`: promoted to a standalone module-level function in `data_utils.py` now that the class has no remaining methods
- `data_adapter.py` updated to import `to_container`
- `factory_test.py` new tests added for data scenarios to trigger loading

…puts
- training_config_handler: add prepare_training_inputs helper to prepare training_inputs and psf_model
- training_config_handler: set self.training_conf = read_conf(...).training
- train.train: update method args and remove data adaptation sequence
- tests: update config handler tests with these changes

- add helper methods to prepare inputs and targets in the constructor
- fixes lazy stacking per train cycle
- update unit test to match new behaviour

- add helper _assert_data_prepared to confirm correct training states
- add comments to help developers understand state flows
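
The params-driven check described in the `_is_in_memory` commit above might look roughly like the sketch below. It is an illustration under assumptions, not the merged implementation: the function bodies are guesses, and the nested layout assumed for the "complete" and "split" config shapes (sub-sections that each may carry `file`/`data_dir`) is hypothetical.

```python
from collections.abc import Mapping

# Keys that, per the commit message, indicate data must be loaded from disk.
FILE_POINTER_KEYS = ("file", "data_dir")


def has_file_pointer(section):
    """Return True if a config section names a non-empty file or data_dir."""
    return isinstance(section, Mapping) and any(
        section.get(key) for key in FILE_POINTER_KEYS
    )


def is_in_memory(params):
    """Return True if no file pointer is found at the top or one level down."""
    if not isinstance(params, Mapping):
        raise TypeError("params must be a mapping of configuration entries")
    if has_file_pointer(params):  # assumed "shallow" shape
        return False
    # Assumed "complete"/"split" shapes: file pointers sit in sub-sections.
    nested = [value for value in params.values() if isinstance(value, Mapping)]
    return not any(has_file_pointer(section) for section in nested)
```

Inspecting the configuration structure avoids walking the data itself, which is the motivation the commit gives for dropping the recursive numpy array check.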
Summary
Major refactor of the data handling to unify dataset access, loading and conversion.
Introduces `DataAdapter`, `TrainingDataAdapter`, `DataAdapterFactory`, and `NpyDatasetLoader` for consistent dataset handling.
Updates TensorFlow conversion routines and provides canonical key validation.
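
As a rough illustration of the canonical key validation mentioned above, the sketch below checks a loaded dataset dict for the required and optional keys this description lists under "What's changed". The function name and return value are hypothetical, and `target_field` is taken literally as a key name, which may not match the actual loader.

```python
# Assumption: key names as listed in this PR description.
REQUIRED_KEYS = ("positions", "seds", "target_field")
OPTIONAL_KEYS = ("masks", "zernike_prior")


def validate_canonical_keys(dataset):
    """Raise if a required key is missing; return the optional keys present."""
    missing = [key for key in REQUIRED_KEYS if key not in dataset]
    if missing:
        raise KeyError(f"dataset is missing required keys: {missing}")
    return [key for key in OPTIONAL_KEYS if key in dataset]
```

Failing early at load time means downstream consumers such as the TensorFlow converter can rely on the required keys being present.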
Breaking changes:
`DataConfigHandler` no longer loads datasets directly, and `data_handler.py` has been (will be) removed (TBD).
Closes #205, closes #201, closes #124, closes #178, closes #68, closes #143
What’s changed
- `NpyDatasetLoader` to load `.npy` datasets with required canonical keys (positions, seds, target_field) and optional fields (masks, zernike_prior).
- `DataAdapter` and `DataAdapterFactory` for unified dataset access across training and evaluation.
- `SupportsParams` and `SupportsMetadata` protocols added for generic dataset parameter and metadata handling.
- `TensorFlowDatasetConverter` to use canonical and optional keys and handle target_field mapping for SEDs or source images.
- `PSFInference` to use data_adapter property instead of manual preprocessing, simplifying `_prepare_positions_and_seds`. (TBD)
- `data_config.yaml` format updated; previous configs incompatible.
- `_prepare_positions_and_seds` where data_adapter provides normalized tensors. (TBD)
How to test / verify
Ran repeatability runs at Jean-Zay.
Results are reproduced or consistent with previous stable versions (the train-test split results in different star samples).
Scope
This PR is part of a larger milestone to modernise dataset handling.
Changelog
Reviewer Checklist
`develop`, or `main` for release PRs)
`ruff`)
Next Steps / Notes (if applicable)
I am still updating the `PSFInference` tests and need to delete `data_handler.py` after this is complete.