Flatten sample dict by dfulu · Pull Request #403 · openclimatefix/ocf-data-sampler

dfulu · 2026-02-27T14:28:58Z

Pull Request

Description

This PR is motivated by wanting to use the torch default_collate function and the built-in Lightning functionality, which can move batches to the required device and also manage their dtype (i.e. cast to float32, float16, etc.). Our samples are currently nested dictionaries, so we have to maintain our own code to deal with this. I think this maintenance cost is too high considering we don't get much out of the nested sample structure.
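As a minimal sketch of the motivation above, this is roughly what collating flat samples with torch's default_collate looks like. The key names and array shapes are hypothetical placeholders, not the keys this PR defines.

```python
import numpy as np
import torch
from torch.utils.data import default_collate

# Two flat samples. The keys "generation" and "t0_idx" are hypothetical
# and only stand in for whatever the real sample dict contains.
samples = [
    {"generation": np.full(4, float(i), dtype=np.float32), "t0_idx": 2}
    for i in range(2)
]

# default_collate converts the numpy arrays to tensors and stacks them
# along a new leading batch dimension, key by key.
batch = default_collate(samples)

# A manual dtype cast, shown only to illustrate the kind of per-key
# handling that a flat dict of tensors makes simple.
half_batch = {
    key: value.to(torch.float16) if value.is_floating_point() else value
    for key, value in batch.items()
}
```

With a flat dict of tensors, casting or moving the batch is a single pass over the keys.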

Changes:

Changed the structure of our samples so that the sample dictionary is no longer nested
Removed the stack_np_samples_into_batch(), batch_to_tensor(), and copy_batch_to_device() functions, which are no longer needed since we get this functionality for free from PyTorch and Lightning
Replaced the NaN-filling function fill_nans_in_arrays() with fill_nans_in_dataset_dicts(), so that NaNs are filled in the datasets dict rather than in the sample. This seemed a cleaner solution after the batch had been flattened, and we also don't waste time trying to fill NaNs in the coordinate arrays
Minimised the batch. Each array we need to copy to the GPU costs us time, and we have lots of small coordinate arrays in the sample dict. These are now only optionally included, via the new flag include_extra_metadata. This flag is set through the PVNetDataset classes rather than the configuration, since it doesn't matter for training our models or running them at inference.
Renamed some of the keys in the sample dict (please give me your opinions); you can see them in numpy_sample/convert.py
Some light refactoring to remove duplication
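To illustrate what going from a nested to a flat sample dict means, here is a small sketch. The helper flatten_sample, the underscore separator, and the example keys are all assumptions made for illustration. The PR's actual key names live in numpy_sample/convert.py and are not reproduced here.

```python
def flatten_sample(nested, sep="_", prefix=""):
    """Flatten a nested dict by joining the key path with `sep`.

    Hypothetical helper for illustration only, not the PR's code.
    """
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-dicts, carrying the joined key path along
            flat.update(flatten_sample(value, sep=sep, prefix=name))
        else:
            flat[name] = value
    return flat


# Hypothetical nested sample with one level per data source and provider
nested = {
    "nwp": {"ukv": {"values": [1.0, 2.0]}},
    "generation": [0.5],
}
flat = flatten_sample(nested)
```

Every value in the flat dict sits at the top level, so generic tooling can treat the sample as a plain mapping from key to array.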

Checklist:

My code follows OCF's coding style guidelines
I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
I have checked my code and corrected any misspellings

dfulu · 2026-03-02T10:38:15Z

ocf_data_sampler/numpy_sample/convert.py


 def convert_to_numpy_sample(
-    sample: dict[str, xr.DataArray | dict[str, xr.DataArray]],
+    datasets_dict: dict[str, xr.DataArray | dict[str, xr.DataArray]],


Renamed this to match the naming convention we've applied elsewhere

dfulu · 2026-03-06T16:26:22Z

tests/torch_datasets/test_pvnet_dataset.py

    _ = pickle.loads(pickle_bytes)  # noqa: S301
-
-
-def test_pvnet_dataset_batch_size_2(pvnet_config_filename):


Since we removed the batch_to_tensor and copy_batch_to_device functions, I've removed this test as well

Though I'm not really sure what the aim of this test was

dfulu · 2026-03-06T16:28:46Z

tests/torch_datasets/test_pvnet_dataset.py


-def test_pvnet_dataset(pvnet_config_filename):
-    dataset = PVNetDataset(pvnet_config_filename)
+def _pvnet_dataset_sample_check(sample, config, batch_dim = None):


I refactored the tests here to remove code duplication, using this new helper function

dfulu · 2026-03-06T16:30:07Z

ocf_data_sampler/torch_datasets/utils/validation_utils.py

@@ -3,8 +3,7 @@
 import logging


I updated and refactored this file, but after discussion I think we'll delete it in the next PR

dfulu force-pushed the simplify_numpybatch branch 2 times, most recently from bda4959 to 0cefb2c on February 27, 2026 16:43

dfulu changed the title ~~Flatten the samples~~ Flatten sample dict Feb 27, 2026

Flatten the samples

d6dbaf4

dfulu force-pushed the simplify_numpybatch branch from 0cefb2c to d6dbaf4 on February 27, 2026 16:50

dfulu marked this pull request as ready for review March 2, 2026 10:37


dfulu wants to merge 1 commit into dev_feb2026_speedups from simplify_numpybatch
