Dataset–File Lifetime Design Discussion
Background
pyfive allows Datasets to outlive the File that created them:
```python
f = pyfive.File("some_file")
v = f["var"]
f.close()
y = max(v[:])  # works — the Dataset reopens the file internally
```
This works on POSIX because reopening a local file is cheap (microseconds).
The parsed B-tree is already cached in the Dataset; reopening only happens to access data bytes.
The original motivation was to simplify Dask usage and minimise the number of concurrently open file handles.
The problem
This "reopen is cheap" assumption breaks in two new contexts:
| Context | Reopen cost | State to preserve |
|---|---|---|
| POSIX | ~microseconds | Just the path |
| fsspec | Varies (ms–seconds) | URL + cache/session config |
| p5rem/SSH | Seconds | SSH connection + remote parsed state |
- fsspec: If the file was opened with caching, the Dataset would need to preserve storage options and cache state to reuse the cache on reopen.
- p5rem: Reopening means a full SSH handshake plus a remote file parse, making the pattern impractically slow.
Optimal Solution? Pluggable opener in pyfive?
Instead of the Dataset storing a filename string and calling `open()` itself, have it store a callable or small protocol object:

```python
class FileResource(Protocol):
    def open(self) -> BinaryIO: ...
    def close(self) -> None: ...
```
- POSIX: `LocalFileResource(path)` — trivially cheap, current behaviour.
- fsspec: `FsspecFileResource(url, storage_options)` — preserves cache config.
- p5rem: `RemoteFileResource(connection_pool, remote_path)` — reuses SSH sessions.
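A minimal sketch of what the POSIX case might look like under this protocol. The class names match the list above but are assumptions: none of this exists in pyfive yet, and the signatures are illustrative only.

```python
from typing import BinaryIO, Optional, Protocol


class FileResource(Protocol):
    def open(self) -> BinaryIO: ...
    def close(self) -> None: ...


class LocalFileResource:
    """POSIX case: reopening is just a cheap builtin open() on a stored path."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._fh: Optional[BinaryIO] = None

    def open(self) -> BinaryIO:
        # Reopen lazily; a handle closed elsewhere is replaced transparently.
        if self._fh is None or self._fh.closed:
            self._fh = open(self.path, "rb")
        return self._fh

    def close(self) -> None:
        if self._fh is not None and not self._fh.closed:
            self._fh.close()
```

A Dataset holding a `LocalFileResource` would behave exactly as today, while an fsspec or SSH variant could carry storage options or a connection pool in its constructor instead of relying on a bare path.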
This pushes the "how to get bytes" question to where it belongs without changing pyfive's "datasets are independent" contract.
This may interact with the `ChunkRead` class (hence the foreshadowing comment in #218).
Additional statement of the underlying goal: don't solve the Dask problem at the file-handle level
Modern Dask typically expects workers to independently open files through a serialisable "opener" token. Holding a parsed B-tree in the Dataset and reopening just for data reads is an optimisation that only helps local files.
For remote/networked contexts, a cleaner Dask pattern is pure-function chunk readers that each worker calls independently, sidestepping the "dataset outlives file" question entirely. This was the intent of separating the pyfive Dataset from the File.
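The pure-function pattern can be sketched with a hypothetical serialisable token; `ChunkRef` and `read_chunk` are illustrative names, not pyfive API, and a remote backend would carry a URL plus storage options instead of a local path.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Serialisable token for one chunk; cheap to ship to any Dask worker."""

    path: str
    offset: int
    size: int


def read_chunk(ref: ChunkRef) -> bytes:
    """Pure function: each worker opens, reads, and closes by itself.

    No handle is shared between tasks, so the "dataset outlives file"
    question never arises on the worker side.
    """
    with open(ref.path, "rb") as fh:
        fh.seek(ref.offset)
        return fh.read(ref.size)
```

Because `ChunkRef` is a frozen dataclass of plain values, it pickles cleanly, which is what makes it usable as a Dask task argument.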
Identification of current behaviour
`DatasetID` reopening is implemented primarily as `open(self._filename, "rb")`, which encodes a filename-based reopen policy instead of an explicit transport-aware opener/resource policy.
Specifics:
1. Reopen policy in `DatasetID._fh`
- File: /Users/bnl28/Repositories/pyfive/pyfive/h5d.py
- Lines: ~914-938

What it does:
- POSIX: returns `open(self._filename, "rb")` on each access.
- Non-POSIX: reuses the cached `self.__fh`; if closed, reopens with `open(self._filename, "rb")`.

Why this is problematic:
- Assumes a path plus builtin `open` is sufficient to recreate the original IO semantics.
- Does not explicitly preserve opener context (fsspec filesystem, storage options, auth/session, cache wrappers).
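Paraphrasing that policy as code may make the fragility easier to see. This is a sketch for discussion based on the description above, not the actual pyfive source; the class and attribute names are illustrative.

```python
class ReopenPolicySketch:
    """Paraphrase of the DatasetID._fh behaviour described above."""

    def __init__(self, filename: str, posix: bool) -> None:
        self._filename = filename
        self._posix = posix
        self._cached_fh = None

    @property
    def _fh(self):
        if self._posix:
            # POSIX: a fresh handle on every access -- cheap locally.
            return open(self._filename, "rb")
        if self._cached_fh is None or self._cached_fh.closed:
            # Non-POSIX: reuse the cached handle, but if it was closed,
            # reopen with builtin open() -- silently dropping any fsspec
            # filesystem, session, or cache context the original carried.
            self._cached_fh = open(self._filename, "rb")
        return self._cached_fh
```

Both branches funnel through builtin `open()`, which is exactly where a pluggable resource object would slot in instead.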
2. Filename inference in `DatasetID.__init__`
- File: /Users/bnl28/Repositories/pyfive/pyfive/h5d.py
- Lines: ~223 onward; relevant block ~251-278

What it does:
- Determines POSIX by probing `fh.fileno()`.
- For non-POSIX, tries to infer `self._filename` from `fh.path`, then `fh.full_name`, then `fh.fh.path`.

Why this is problematic:
- Reopen identity is reconstructed heuristically from handle attributes.
- There is no explicit opener contract that guarantees equivalent reopen behaviour.
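The inference chain described above, condensed into an illustrative helper. The attribute names come from the description; the function itself is hypothetical.

```python
from typing import Optional


def infer_filename(fh) -> Optional[str]:
    """Probe the attribute chain described above: fh.path, then
    fh.full_name, then fh.fh.path. Returns None when nothing matches,
    which is exactly the fragility being flagged -- identity depends on
    which attributes a given handle type happens to expose."""
    for attr in ("path", "full_name"):
        name = getattr(fh, attr, None)
        if isinstance(name, str):
            return name
    inner = getattr(fh, "fh", None)
    name = getattr(inner, "path", None)
    return name if isinstance(name, str) else None
```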
3. Threaded chunk reads reopen via filename
- File: /Users/bnl28/Repositories/pyfive/pyfive/h5d.py
- Lines: ~153-176 (`_read_parallel_threads`)

What it does:
- Opens `open(self._filename, "rb")`, uses `os.pread`, then closes the local handle.

Why this is problematic:
- Again assumes local-path semantics for the read backend.
- Any non-local transport state carried by the original handle is bypassed.
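The positional-read approach itself is sound for local files. A minimal sketch of why `os.pread` is attractive for threads (illustrative, not the pyfive implementation):

```python
import os


def pread_chunk(path: str, offset: int, nbytes: int) -> bytes:
    """os.pread reads at an absolute offset without moving any shared
    file position, so concurrent threads need no locking around seek().
    But it requires a real POSIX file descriptor -- precisely the
    local-path assumption flagged above, which no fsspec or SSH-backed
    handle can satisfy."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, nbytes, offset)
    finally:
        os.close(fd)
```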
4. B-tree fetch strategy depends on current handle metadata
- File: /Users/bnl28/Repositories/pyfive/pyfive/h5d.py
- Lines: ~614-650 (`_make_btree_fetch_fn`)

What it does:
- For non-POSIX, tries to discover `fs` and `path` from the current handle in order to use `cat_ranges`.

Why this is problematic:
- Fast-path behaviour depends on whichever handle instance happens to be present.
- If the reopen path produced a plain file handle, transport-specific features can change or disappear.
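A condensed sketch of that handle-introspection pattern, using fsspec's `cat_ranges(paths, starts, ends)` signature; the helper name and control flow are illustrative, not the actual `_make_btree_fetch_fn`.

```python
def make_fetch_fn(fh):
    """Prefer a vectorised fsspec cat_ranges fast path when the *current*
    handle exposes a filesystem and path; otherwise fall back to
    seek/read. Which branch you get depends on which handle instance
    happens to be live -- the fragility flagged above."""
    fs = getattr(fh, "fs", None)
    path = getattr(fh, "path", None)
    if fs is not None and path is not None and hasattr(fs, "cat_ranges"):
        def fetch(offsets, sizes):
            ends = [o + s for o, s in zip(offsets, sizes)]
            return fs.cat_ranges([path] * len(offsets), list(offsets), ends)
        return fetch

    def fetch(offsets, sizes):
        # Plain-handle fallback: serial seek/read, no vectorised ranges.
        out = []
        for o, s in zip(offsets, sizes):
            fh.seek(o)
            out.append(fh.read(s))
        return out
    return fetch
```

If a filename-based reopen replaces an fsspec handle with a plain builtin one, the same Dataset silently drops from the vectorised branch to the serial fallback.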
5. File ownership split in the `File` API
- File: /Users/bnl28/Repositories/pyfive/pyfive/high_level.py
- `File.__init__`: ~257-308
- `File.close`: ~361-364

What it does:
- If the input is file-like: `self._close = False` (pyfive does not own close).
- If the input is a path: `self._close = True` (pyfive owns close).

Why this matters for lifecycle:
- Ownership is tracked at File construction, but `DatasetID` reopen later bypasses it with direct filename-based reopen behaviour.
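The ownership rule can be paraphrased as follows; this is a sketch of the described behaviour, not the actual `File.__init__`, and the class name is invented.

```python
class OwnershipSketch:
    """Paraphrase of the File ownership split described above."""

    def __init__(self, source) -> None:
        if hasattr(source, "read"):
            # File-like input: the caller owns the handle.
            self._fh = source
            self._close = False
        else:
            # Path input: pyfive owns the handle it opened.
            self._fh = open(source, "rb")
            self._close = True

    def close(self) -> None:
        # Only close handles we opened ourselves. A filename-based
        # reopen elsewhere bypasses this bookkeeping entirely, which is
        # the lifecycle gap being flagged.
        if self._close:
            self._fh.close()
```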