Conversation
| # filter to variants where at least one sample has been selected | ||
| # note we have to call compute here since xarray indexing with a Dask or Cubed | ||
| # boolean array is not supported | ||
| return ds.isel(variants=ds.call_mask.any(dim="samples").compute()) |
There was a problem hiding this comment.
What are the implications here for the memory requirements when working on a large dataset?
There was a problem hiding this comment.
Good question. It will materialise a 1D boolean array of length #variants into memory, which unless a restrictive region filter has been applied first could be very large (potentially the whole genome).
This is an area where there is scope to improve - perhaps by making Xarray and the underlying distributed processing engine able to handle this case efficiently, or by using masked arrays in some way?
There was a problem hiding this comment.
I think a 1D boolean array length #variants is fine - it could be improved but it's not terrible. I was worried that it was a O(num_samples) thing.
| max(ds.sizes["variants"], int(chunk_indexes[-1] * variant_chunksize + 1)), | ||
| ) | ||
|
|
||
| ds_sliced = ds.isel(variants=variant_slice) |
There was a problem hiding this comment.
This works but could be very inefficient. Imagine we just want one small region at the start of the genome and one at the end - this would read all of the intervening chunks even though they are not needed.
What we really want is a way to index by a set of slices (or even chunks) in one go. NumPy and Xarray don't provide such a primitive, but it's something we could perhaps build.
cc-ing @keewis @TomNicholas and @alxmrs as we were discussing this exact issue in yesterday's Pangeo Distributed Computing meeting (in the context of a geospatial application Justus is working on).
There was a problem hiding this comment.
I've opened pydata/xarray#10479 to discuss this for xarray
|
Looks like the build is failing on Python 3.12 due to pyranges/sorted_nearest#10 from pyranges. Not sure why vcztools CI is working on Python 3.12 though. |
bcftools-style filtering #1329changelog.rstapi.rst