Skip to content

Flatten sample dict#403

Open
dfulu wants to merge 1 commit intodev_feb2026_speedupsfrom
simplify_numpybatch
Open

Flatten sample dict#403
dfulu wants to merge 1 commit intodev_feb2026_speedupsfrom
simplify_numpybatch

Conversation

@dfulu
Copy link
Member

@dfulu dfulu commented Feb 27, 2026

Pull Request

Description

This PR is motivated by wanting to use the torch default_collate function and the in-built lightning functionality which can move batches to the required device and also manage their dtype (i.e. cast to float32 / float16 etc). Our samples are currently nested dictionaries and so we have to maintain code to be able to deal with this. I think this maintenance cost is too high considering we don't get much out of the nested sample structure.

Changes:

  • PR changes the structure of our samples so that the sample dictionary is no longer nested
  • Removed the stack_np_samples_into_batch(), batch_to_tensor(), and copy_batch_to_device() functions which are no longer needed since we can get these for free from pytorch and lightning
  • Modified the NaN-filling fill_nans_in_arrays() -> fill_nans_in_dataset_dicts() so that NaNs are filled in the datasets dict rather than the sample. This seemed a cleaner solution after the batch had been flattened and we also don't waste time trying to fill NaNs in the coordinate arrays
  • Minimise the batch. It costs us time for each array we need to copy to the GPU and we have lots of small coordinate arrays in the sample dict. These are now only optionally included with the new flag include_extra_metadata. This flag is accessed from the PVNetDataset classes rather than the configuration since it doesn't matter for training and running our models at inference.
  • Some renaming of the keys in the sample dict (please give me your opinions) - you can see them in numpy_sample/convert.py
  • Some light refactoring to remove duplication

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@dfulu dfulu force-pushed the simplify_numpybatch branch 2 times, most recently from bda4959 to 0cefb2c Compare February 27, 2026 16:43
@dfulu dfulu changed the title Flatten the samples Flatten sample dict Feb 27, 2026
@dfulu dfulu force-pushed the simplify_numpybatch branch from 0cefb2c to d6dbaf4 Compare February 27, 2026 16:50
@dfulu dfulu marked this pull request as ready for review March 2, 2026 10:37

def convert_to_numpy_sample(
sample: dict[str, xr.DataArray | dict[str, xr.DataArray]],
datasets_dict: dict[str, xr.DataArray | dict[str, xr.DataArray]],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed this to match the naming convention we've applied elsewhere

_ = pickle.loads(pickle_bytes) # noqa: S301


def test_pvnet_dataset_batch_size_2(pvnet_config_filename):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we removed the batch_to_tensor and copy_batch_to_device functions I've got rid of this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I'm not really sure what the aim of this test was


def test_pvnet_dataset(pvnet_config_filename):
dataset = PVNetDataset(pvnet_config_filename)
def _pvnet_dataset_sample_check(sample, config, batch_dim = None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some refactoring of the tests here to remove code duplication using this new helper function

@@ -3,8 +3,7 @@
import logging
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some updates and refactoring of this file but after discussion I think we'll delete this file in the next PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant