
Datasets not truly independent from files on remote file systems. #219

@bnlawrence

Description


Dataset–File Lifetime Design Discussion

Background

pyfive allows Datasets to outlive the File that created them:

```python
f = pyfive.File("some_file")
v = f["var"]
f.close()
y = max(v[:])   # works: the Dataset reopens the file internally
```

This works on POSIX because reopening a local file is cheap (microseconds).
The parsed B-tree is already cached in the Dataset; reopening only happens to access data bytes.

The original motivation was to simplify Dask usage and minimise the number of concurrently open file handles.

The problem

This "reopen is cheap" assumption breaks in two new contexts:

| Context | Reopen cost | State to preserve |
|---|---|---|
| POSIX | ~microseconds | Just the path |
| fsspec | Varies (ms–seconds) | URL + cache/session config |
| p5rem/SSH | Seconds | SSH connection + remote parsed state |
  • fsspec: If the file was opened with caching, the Dataset would need to preserve storage options and cache state to reuse the cache on reopen.
  • p5rem: Reopening means a full SSH handshake plus a remote file parse, making the pattern impractically slow.

Optimal Solution? Pluggable opener in pyfive?

Instead of Dataset storing a filename string and calling open() itself, have it store a callable or small protocol object:

```python
from typing import BinaryIO, Protocol

class FileResource(Protocol):
    def open(self) -> BinaryIO: ...
    def close(self) -> None: ...
```
  • POSIX: LocalFileResource(path) — trivially cheap, current behaviour.
  • fsspec: FsspecFileResource(url, storage_options) — preserves cache config.
  • p5rem: RemoteFileResource(connection_pool, remote_path) — reuses SSH sessions.
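As a concrete illustration, the POSIX case could look like the following minimal sketch (`LocalFileResource` is a hypothetical name, not existing pyfive code):

```python
from typing import BinaryIO


class LocalFileResource:
    """Hypothetical POSIX resource: reopen identity is just a path."""

    def __init__(self, path: str):
        self.path = path

    def open(self) -> BinaryIO:
        # Cheap on local filesystems; preserves current pyfive behaviour.
        return open(self.path, "rb")

    def close(self) -> None:
        # A plain path has no persistent state to tear down.
        pass
```

A Dataset would then hold the resource object and call `resource.open()` whenever it needs data bytes, instead of calling `open()` on a stored filename.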

This pushes the "how to get bytes" question to where it belongs without changing pyfive's "datasets are independent" contract.

This may interact with the ChunkRead class (hence the foreshadowing comment in #218).

Statement of the underlying goal: don't solve the Dask problem at the file-handle level

Modern Dask typically expects workers to independently open files through a serialisable "opener" token. Holding a parsed B-tree in the Dataset and reopening just for data reads is an optimisation that only helps local files.

For remote/networked contexts a cleaner Dask pattern is pure-function chunk readers that each worker calls independently, sidestepping the
"dataset outlives file" question entirely. This was the intent of the pyfive Dataset separation from the file.

Identification of current behaviour

DatasetID reopening is implemented primarily as open(self._filename, "rb"), which encodes a filename-based reopen policy instead of an explicit transport-aware opener/resource policy.

Specifics:

1) Reopen policy in DatasetID._fh

  • File: /Users/bnl28/Repositories/pyfive/pyfive/h5d.py
  • Lines: ~914-938

What it does:

  • POSIX: returns open(self._filename, "rb") each access.
  • Non-POSIX: reuses cached self.__fh; if closed, reopens with open(self._filename, "rb").

Why this is problematic:

  • Assumes path + builtin open is sufficient to recreate the original IO semantics.
  • Does not explicitly preserve opener context (fsspec filesystem, storage options, auth/session, cache wrappers).
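To make the encoded policy explicit, here is a condensed model of that reopen behaviour (a hypothetical class, not the actual h5d.py code):

```python
class DatasetIDSketch:
    """Condensed model of the reopen policy: identity is a path string."""

    def __init__(self, filename: str, posix: bool):
        self._filename = filename
        self._posix = posix
        self.__fh = None

    @property
    def _fh(self):
        if self._posix:
            # POSIX: a fresh handle on every access; cheap locally.
            return open(self._filename, "rb")
        # Non-POSIX: reuse the cached handle, but if it was closed, fall
        # back to builtin open() -- losing any transport-specific wrapper.
        if self.__fh is None or self.__fh.closed:
            self.__fh = open(self._filename, "rb")
        return self.__fh
```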

2) Filename inference in DatasetID.__init__

  • File: /Users/bnl28/Repositories/pyfive/pyfive/h5d.py
  • Lines: ~223 onward; relevant block ~251-278

What it does:

  • Determines POSIX by probing fh.fileno().
  • For non-POSIX, tries to infer self._filename from fh.path, then fh.full_name, then fh.fh.path.

Why this is problematic:

  • Reopen identity is reconstructed heuristically from handle attributes.
  • There is no explicit opener contract that guarantees equivalent reopen behavior.
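The heuristic can be illustrated by a stand-in like this (`infer_filename` is a hypothetical helper that mirrors the attribute-probing order described above):

```python
from typing import Optional


def infer_filename(fh) -> Optional[str]:
    """Recover a reopen identity by probing likely handle attributes."""
    for attr in ("path", "full_name"):
        name = getattr(fh, attr, None)
        if isinstance(name, str):
            return name
    # Last resort: a wrapped inner handle carrying a path.
    inner = getattr(fh, "fh", None)
    if inner is not None:
        name = getattr(inner, "path", None)
        if isinstance(name, str):
            return name
    return None
```

The sketch shows why this is fragile: success depends entirely on which attributes a given handle type happens to expose, not on any contract.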

3) Threaded chunk reads reopen via filename

  • File: /Users/bnl28/Repositories/pyfive/pyfive/h5d.py
  • Lines: ~153-176 (_read_parallel_threads)

What it does:

  • Calls open(self._filename, "rb"), reads byte ranges with os.pread, then closes the local handle.

Why this is problematic:

  • Again assumes local path semantics for the read backend.
  • Any non-local transport state carried by original handle is bypassed.
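A minimal model of that pattern (hypothetical signature, assuming POSIX semantics throughout):

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_ranges_threaded(filename, ranges, max_workers=4):
    """Read (offset, size) byte ranges concurrently from a local file."""
    fd = os.open(filename, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # os.pread is safe across threads on one descriptor: it takes
            # an explicit offset and never moves the file position.
            futures = [pool.submit(os.pread, fd, size, offset)
                       for offset, size in ranges]
            return [f.result() for f in futures]
    finally:
        os.close(fd)
```

The sketch makes the limitation visible: the whole fast path is built on a local descriptor, so any transport state held by the original handle never enters the picture.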

4) B-tree fetch strategy depends on current handle metadata

  • File: /Users/bnl28/Repositories/pyfive/pyfive/h5d.py
  • Lines: ~614-650 (_make_btree_fetch_fn)

What it does:

  • For non-POSIX handles, tries to discover fs and path from the current handle in order to use cat_ranges.

Why this is problematic:

  • Fast-path behavior depends on whichever handle instance is present now.
  • If the reopen path produced a plain file handle, transport-specific features can change or disappear.
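A duck-typed stand-in for the strategy choice (hypothetical names; `cat_ranges` here mirrors fsspec's batch byte-range API, which takes parallel lists of paths, starts, and ends):

```python
def make_btree_fetch_fn(fh):
    """Pick a byte-range fetch strategy based on the handle we have now."""
    fs = getattr(fh, "fs", None)
    path = getattr(fh, "path", None)
    if fs is not None and path is not None:
        # Fast path: batch byte-range reads via the filesystem object.
        def fetch(starts, ends):
            return fs.cat_ranges([path] * len(starts), starts, ends)
    else:
        # Fallback: plain seek/read on whatever handle is present.
        def fetch(starts, ends):
            out = []
            for start, end in zip(starts, ends):
                fh.seek(start)
                out.append(fh.read(end - start))
            return out
    return fetch
```

Because the choice is made from the current handle's attributes, a reopen that yields a plain handle silently demotes every later fetch to the fallback path.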

5) File ownership split in File API

  • File: /Users/bnl28/Repositories/pyfive/pyfive/high_level.py
  • File.__init__: ~257-308
  • File.close: ~361-364

What it does:

  • If input is file-like: self._close = False (pyfive does not own close).
  • If input is path: self._close = True (pyfive owns close).

Why this matters for lifecycle:

  • Ownership is tracked at file construction, but DatasetID's later reopen bypasses it by going straight back to the filename.
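A condensed model of the ownership rule (a hypothetical class, not the actual high_level.py code):

```python
class FileSketch:
    """Condensed model of the ownership split in File.__init__/close."""

    def __init__(self, source):
        if hasattr(source, "read"):
            # File-like input: the caller owns the handle.
            self._fh = source
            self._close = False
        else:
            # Path input: pyfive opened it, so pyfive closes it.
            self._fh = open(source, "rb")
            self._close = True

    def close(self):
        if self._close:
            self._fh.close()
```

The lifecycle gap is then easy to state: this ownership flag governs File.close, but a DatasetID that reopens from a filename never consults it.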
