
Datasets not truly independent from files on remote file systems. #219

@bnlawrence

Description


Dataset–File Lifetime Design Discussion

Background

pyfive allows Datasets to outlive the File that created them:

```python
f = pyfive.File("some_file")
v = f["var"]
f.close()
y = max(v[:])   # works: the Dataset reopens the file internally
```

This works on POSIX because reopening a local file is cheap (microseconds).
The parsed B-tree is already cached in the Dataset; reopening only happens to access data bytes.

The original motivation was to simplify Dask usage and minimise the number of concurrently open file handles.

The problem

This "reopen is cheap" assumption breaks in two new contexts:

| Context | Reopen cost | State to preserve |
|---|---|---|
| POSIX | ~microseconds | Just the path |
| fsspec | Varies (ms–seconds) | URL + cache/session config |
| p5rem/SSH | Seconds | SSH connection + remote parsed state |
  • fsspec: If the file was opened with caching, the Dataset would need to preserve storage options and cache state to reuse the cache on reopen.
  • p5rem: Reopening means a full SSH handshake plus a remote file parse, making the pattern impractically slow.

Optimal Solution? Pluggable opener in pyfive?

Instead of Dataset storing a filename string and calling open() itself, have it store a callable or small protocol object:

```python
from typing import BinaryIO, Protocol

class FileResource(Protocol):
    def open(self) -> BinaryIO: ...
    def close(self) -> None: ...
```
  • POSIX: LocalFileResource(path) — trivially cheap, current behaviour.
  • fsspec: FsspecFileResource(url, storage_options) — preserves cache config.
  • p5rem: RemoteFileResource(connection_pool, remote_path) — reuses SSH sessions.
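As a concrete illustration, the POSIX case could look like the following minimal sketch (`LocalFileResource` is a hypothetical name, not existing pyfive code):

```python
from typing import BinaryIO


class LocalFileResource:
    """Hypothetical POSIX resource: reopen identity is just a path."""

    def __init__(self, path: str):
        self.path = path

    def open(self) -> BinaryIO:
        # Cheap on local filesystems; preserves current pyfive behaviour.
        return open(self.path, "rb")

    def close(self) -> None:
        # A plain path has no persistent state to tear down.
        pass
```

A Dataset would then hold the resource object and call `resource.open()` whenever it needs data bytes, instead of calling `open()` on a stored filename.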

This pushes the "how to get bytes" question to where it belongs without changing pyfive's "datasets are independent" contract.

This may interact with the ChunkRead class (hence the foreshadowing comment in #218).

Statement of the underlying goal: don't solve the Dask problem at the file-handle level

Modern Dask typically expects workers to independently open files through a serialisable "opener" token. Holding a parsed B-tree in the Dataset and reopening just for data reads is an optimisation that only helps local files.

For remote/networked contexts a cleaner Dask pattern is pure-function chunk readers that each worker calls independently, sidestepping the
"dataset outlives file" question entirely. This was the intent of the pyfive Dataset separation from the file.

Identification of current behaviour

DatasetID reopening is implemented primarily as open(self._filename, "rb"), which encodes a filename-based reopen policy instead of an explicit transport-aware opener/resource policy.

Specifics:

1) Reopen policy in DatasetID._fh

  • File: /Users/bnl28/Repositories/pyfive/pyfive/h5d.py
  • Lines: ~914-938

What it does:

  • POSIX: returns open(self._filename, "rb") each access.
  • Non-POSIX: reuses cached self.__fh; if closed, reopens with open(self._filename, "rb").

Why this is problematic:

  • Assumes path + builtin open is sufficient to recreate the original IO semantics.
  • Does not explicitly preserve opener context (fsspec filesystem, storage options, auth/session, cache wrappers).
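To make the encoded policy explicit, here is a condensed model of that reopen behaviour (a hypothetical class, not the actual h5d.py code):

```python
class DatasetIDSketch:
    """Condensed model of the reopen policy: identity is a path string."""

    def __init__(self, filename: str, posix: bool):
        self._filename = filename
        self._posix = posix
        self.__fh = None

    @property
    def _fh(self):
        if self._posix:
            # POSIX: a fresh handle on every access; cheap locally.
            return open(self._filename, "rb")
        # Non-POSIX: reuse the cached handle, but if it was closed, fall
        # back to builtin open() -- losing any transport-specific wrapper.
        if self.__fh is None or self.__fh.closed:
            self.__fh = open(self._filename, "rb")
        return self.__fh
```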

2) Filename inference in DatasetID.__init__

  • File: /Users/bnl28/Repositories/pyfive/pyfive/h5d.py
  • Lines: ~223 onward; relevant block ~251-278

What it does:

  • Determines POSIX by probing fh.fileno().
  • For non-POSIX, tries to infer self._filename from fh.path, then fh.full_name, then fh.fh.path.

Why this is problematic:

  • Reopen identity is reconstructed heuristically from handle attributes.
  • There is no explicit opener contract that guarantees equivalent reopen behavior.
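The heuristic can be illustrated by a stand-in like this (`infer_filename` is a hypothetical helper that mirrors the attribute-probing order described above):

```python
from typing import Optional


def infer_filename(fh) -> Optional[str]:
    """Recover a reopen identity by probing likely handle attributes."""
    for attr in ("path", "full_name"):
        name = getattr(fh, attr, None)
        if isinstance(name, str):
            return name
    # Last resort: a wrapped inner handle carrying a path.
    inner = getattr(fh, "fh", None)
    if inner is not None:
        name = getattr(inner, "path", None)
        if isinstance(name, str):
            return name
    return None
```

The sketch shows why this is fragile: success depends entirely on which attributes a given handle type happens to expose, not on any contract.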

3) Threaded chunk reads reopen via filename

  • File: /Users/bnl28/Repositories/pyfive/pyfive/h5d.py
  • Lines: ~153-176 (_read_parallel_threads)

What it does:

  • Calls open(self._filename, "rb"), reads byte ranges with os.pread, then closes the local handle.

Why this is problematic:

  • Again assumes local path semantics for the read backend.
  • Any non-local transport state carried by original handle is bypassed.
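A minimal model of that pattern (hypothetical signature, assuming POSIX semantics throughout):

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_ranges_threaded(filename, ranges, max_workers=4):
    """Read (offset, size) byte ranges concurrently from a local file."""
    fd = os.open(filename, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # os.pread is safe across threads on one descriptor: it takes
            # an explicit offset and never moves the file position.
            futures = [pool.submit(os.pread, fd, size, offset)
                       for offset, size in ranges]
            return [f.result() for f in futures]
    finally:
        os.close(fd)
```

The sketch makes the limitation visible: the whole fast path is built on a local descriptor, so any transport state held by the original handle never enters the picture.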

4) B-tree fetch strategy depends on current handle metadata

  • File: /Users/bnl28/Repositories/pyfive/pyfive/h5d.py
  • Lines: ~614-650 (_make_btree_fetch_fn)

What it does:

  • For non-POSIX handles, tries to discover fs and path from the current handle in order to use cat_ranges.

Why this is problematic:

  • Fast-path behavior depends on whichever handle instance is present now.
  • If the reopen path produced a plain file handle, transport-specific features can change or disappear.
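A duck-typed stand-in for the strategy choice (hypothetical names; `cat_ranges` here mirrors fsspec's batch byte-range API, which takes parallel lists of paths, starts, and ends):

```python
def make_btree_fetch_fn(fh):
    """Pick a byte-range fetch strategy based on the handle we have now."""
    fs = getattr(fh, "fs", None)
    path = getattr(fh, "path", None)
    if fs is not None and path is not None:
        # Fast path: batch byte-range reads via the filesystem object.
        def fetch(starts, ends):
            return fs.cat_ranges([path] * len(starts), starts, ends)
    else:
        # Fallback: plain seek/read on whatever handle is present.
        def fetch(starts, ends):
            out = []
            for start, end in zip(starts, ends):
                fh.seek(start)
                out.append(fh.read(end - start))
            return out
    return fetch
```

Because the choice is made from the current handle's attributes, a reopen that yields a plain handle silently demotes every later fetch to the fallback path.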

5) File ownership split in File API

  • File: /Users/bnl28/Repositories/pyfive/pyfive/high_level.py
  • File.__init__: ~257-308
  • File.close: ~361-364

What it does:

  • If input is file-like: self._close = False (pyfive does not own close).
  • If input is path: self._close = True (pyfive owns close).

Why this matters for lifecycle:

  • Ownership is tracked at file construction, but DatasetID's later reopen bypasses it by going straight back to the filename.
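A condensed model of the ownership rule (a hypothetical class, not the actual high_level.py code):

```python
class FileSketch:
    """Condensed model of the ownership split in File.__init__/close."""

    def __init__(self, source):
        if hasattr(source, "read"):
            # File-like input: the caller owns the handle.
            self._fh = source
            self._close = False
        else:
            # Path input: pyfive opened it, so pyfive closes it.
            self._fh = open(source, "rb")
            self._close = True

    def close(self):
        if self._close:
            self._fh.close()
```

The lifecycle gap is then easy to state: this ownership flag governs File.close, but a DatasetID that reopens from a filename never consults it.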
