Context
There are situations where remote file systems have high latency, and we don't want cf-python to be re-opening such file systems on every read. There are also some types of remote file system that cf-python doesn't support, and adding support for each one individually seems an unnecessary burden now that we have a pure-Python backend available that can take fsspec objects.
Clearly I didn't write all this myself, but it is the result of a conversation with AI about what we need/want to do ...
Summary
Add a filesystem keyword argument to cf.read() (and cfdm.read()) that accepts a
pre-authenticated fsspec AbstractFileSystem
object. When present, cfdm uses filesystem.open(path, "rb") to obtain a file-like
object and passes it directly to h5netcdf.File. This requires no changes to h5netcdf
or pyfive, unlocks SSH/SFTP natively, and allows warm connection reuse for any protocol.
Background
What works today
cf.read("s3://bucket/path.nc", storage_options={...}) works because cfdm's
netcdfread.py has an explicit branch:
if u.scheme == "s3":
    fs = s3fs.S3FileSystem(**storage_options)
    path = fs.open(uri)  # → file-like
    ...open h5netcdf with path...
cf.read("https://server/path.nc") works because the URL string falls through to the netCDF4-C backend, whose OPeNDAP client handles http URLs.
What does not work
cf.read("ssh://host/path.nc") raises DatasetTypeError.
Verified from source: cfdm has zero ssh/sftp handling in both cf and cfdm
packages (confirmed by exhaustive grep and runtime test).
The actual blockage
The barrier is not in h5netcdf or pyfive. It is entirely in cfdm's _datasets()
generator (cfdm read.py, line ~351):
for datasets1 in datasets:
    datasets1 = expanduser(expandvars(datasets1))  # ← fails on non-strings
    u = urisplit(datasets1)
    if u.scheme not in (None, "file"):
        yield datasets1  # remote URI passed as string — no fs object ever created
        continue
    ...iglob, walk, etc...
Every item in datasets is required to be a str. A file-like object, a
pathlib.Path, an (fs, path) tuple, or an fsspec.core.OpenFile all fail here.
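The first failure is easy to reproduce with a stdlib-only snippet (io.BytesIO stands in for any file-like object here; this is an illustration, not cfdm code):

```python
import io
import os.path

# A file-like object dies at the very first step of _datasets():
# os.path.expandvars() accepts only str, bytes, or os.PathLike, so the
# object is rejected before any scheme dispatch can happen.
try:
    os.path.expandvars(io.BytesIO(b"not a path"))
except TypeError as exc:
    print(f"rejected before any dispatch: {exc}")
```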
The subsequent NetCDF read path (netcdfread.py lines 520–585) only constructs an
s3fs.S3FileSystem from storage_options; for all other remote schemes the string is
passed verbatim to the netCDF4-C / h5netcdf constructors, which either reject it
(ssh://) or interpret it as an OPeNDAP URL (http://).
h5netcdf and pyfive already support file-like objects
h5netcdf.File(path, ...) explicitly handles three input types (from its source):
if isinstance(path, str):
    h5file = h5py.File(path, mode, **kwargs)  # string path or http URL
elif isinstance(path, h5py.File):
    return path, (mode in {"r", "r+", "a"}), False  # already-open h5py handle
else:
    h5file = h5py.File(path, mode, **kwargs)  # ← file-like object
h5py.File itself states in its docstring:
name: Name of the file on disk, or file-like object.
_open_pyfive(path, mode) simply calls pyfive.File(path, mode) — so pyfive receives
whatever h5netcdf passes, including file-like objects.
Conclusion: the entire h5netcdf / pyfive stack already handles file-like objects
today. Only cfdm's string-only input pipeline prevents their use.
Proposed Change
New keyword argument
Add filesystem to both cf.read() and cfdm.read():
cf.read(
    datasets,
    ...,
    storage_options=None,  # existing
    filesystem=None,       # NEW: a pre-authenticated fsspec AbstractFileSystem
)
Semantics
When filesystem is not None:
- datasets must be a single path string (or a list of path strings) that the given filesystem understands.
- cfdm bypasses the URI-dispatch and s3fs-construction logic entirely.
- For each path, cfdm calls filesystem.open(path, "rb") to obtain a seekable file-like object.
- The file-like object is passed as the path argument to h5netcdf.File.
This makes the call sites look like:
# SSH (currently impossible)
import fsspec
fs = fsspec.filesystem("ssh", host="hpc.example.ac.uk", username="user",
                       key_filename="~/.ssh/id_rsa")
cf.read("/data/model/run1.nc", filesystem=fs)

# S3 with pre-authenticated, reused connection (warmed up earlier)
import s3fs
fs = s3fs.S3FileSystem(key=KEY, secret=SECRET, endpoint_url=ENDPOINT)
cf.read("s3://bucket/path/run1.nc", filesystem=fs)

# SFTP via ProxyJump (handled entirely by fsspec/asyncssh)
fs = fsspec.filesystem("sftp", host="internal.hpc", username="user",
                       key_filename="...", proxy_jump="gateway.example.ac.uk")
cf.read("/scratch/run1.nc", filesystem=fs)
Scope of changes in cfdm
The change is narrow and self-contained. The only files that need modification are cfdm/read_write/read.py (the _datasets() generator) and cfdm/read_write/netcdf/netcdfread.py (the open logic).
_datasets() — skip string processing when filesystem is given
if kwargs.get("filesystem") is not None:
    # filesystem provided — datasets items are paths on that fs, not local strings
    for path in self._flat(kwargs["datasets"]):
        n_datasets += 1
        yield path
    return
This short-circuits before expanduser, urisplit, and iglob.
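As a standalone sketch (hypothetical helper name, stdlib only; cfdm's real generator is a method that uses self._flat), the short-circuit behaves like this:

```python
def iter_dataset_paths(datasets, filesystem=None):
    # Sketch of the proposed _datasets() short-circuit. With a
    # filesystem supplied, every item is yielded verbatim as a path on
    # that filesystem; expanduser/urisplit/iglob never run.
    def flatten(obj):
        if isinstance(obj, (list, tuple, set)):
            for item in obj:
                yield from flatten(item)
        else:
            yield obj

    if filesystem is not None:
        yield from flatten(datasets)
        return
    # ...the existing string-based local/URI dispatch would follow here...
    raise NotImplementedError("local/URI string dispatch elided in this sketch")

print(list(iter_dataset_paths(["a.nc", ["b.nc", "c.nc"]], filesystem=object())))
# → ['a.nc', 'b.nc', 'c.nc']
```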
netcdfread.py — open via filesystem when provided
In the existing open_netcdf / local-open block (currently if u.scheme == "s3": ...),
add a parallel branch:
filesystem = kwargs.get("filesystem")
if filesystem is not None:
    file_object = filesystem.open(dataset, "rb")
    nc = h5netcdf.File(file_object, mode="r", ...)
else:
    # existing s3 / local / opendap dispatch
    ...
The dataset_type() class method that probes format also needs a guard: when
filesystem is provided, skip the string-based urisplit check and probe by attempting
to open with h5netcdf directly (or assume netCDF4/HDF5 and let the caller specify
dataset_type= explicitly if needed).
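The guard could look something like the following sketch (written as a plain function for illustration; in cfdm it is a class method, and the "netCDF-4" label is an assumption):

```python
def dataset_type(dataset, filesystem=None):
    # Sketch of the proposed guard. With a filesystem object there is no
    # string to urisplit, so assume an HDF5/netCDF-4 dataset and let
    # h5netcdf raise at open time if that assumption is wrong; callers
    # can always force the format with dataset_type= explicitly.
    if filesystem is not None:
        return "netCDF-4"
    # ...existing urisplit/string-based probing would follow here...
    raise NotImplementedError("string probing elided in this sketch")
```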
Total line count of change: estimated 20–40 lines across two files.
Why the h5netcdf / pyfive Backend Is the Right Target
The netCDF4 (C-library) backend does not natively accept file-like objects (it
has a memory= parameter for in-memory bytes buffers, but that requires a full copy
in memory before reading begins).
The h5netcdf backend (with either h5py or pyfive) accepts file-like objects natively
as shown above.
Since pyfive is a pure-Python HDF5 reader and the intended future preferred backend for
cf-python, and since pyfive's File(path, mode) already accepts anything that h5netcdf
passes, this change leverages the pure-Python stack cleanly with no C-library
constraints.
The proposal can therefore be described as:
File-like input support for the h5netcdf/pyfive backend path.
The netCDF4 backend would continue to require string paths (its existing behaviour is
unchanged).
Motivation: Connection Warm-Up for Remote Files
The immediate motivation comes from latency hiding in applications that browse remote
filesystems before opening a file.
An application (e.g. xconv2) uses fsspec to browse files on S3 or SSH while the user
navigates. When the user finally selects a file, the fsspec filesystem object is already
authenticated and connected. Without filesystem=:
- S3: must reconstruct s3fs.S3FileSystem from credentials — nearly instant, but wasteful if credentials need re-validation or a new connection is opened.
- SSH: impossible without staging or a FUSE mount; cf.read("ssh://...") raises DatasetTypeError.
With filesystem=:
- The warm, authenticated AbstractFileSystem is passed directly.
- No redundant authentication round-trip.
- SSH, SFTP, and any other fsspec-supported protocol work identically.
The HTTP Case: OPeNDAP vs Plain Range-Get
What happens today with http:// URIs
cf.read("http://server/...") reaches open_netcdf() as a bare string (the
_datasets() generator yields all non-file/None scheme URIs unchanged).
Inside open_netcdf(), the s3 branch is not taken, so the string goes directly
to the backend loop:
- h5netcdf/h5py — h5netcdf.File("http://...", "r") calls h5py.File("http://...", "r") with no driver= argument. h5py does have a ros3 driver in its driver list (h5fd.ROS3D is present) but in the conda work26 build ros3 is not compiled in: attempting driver='ros3' raises ValueError: h5py was built without ROS3 support. Without ros3, h5py treats the URL as a local filesystem path and raises FileNotFoundError. The h5netcdf backend therefore fails for any http URL.
- netCDF4-C — netCDF4.Dataset("http://...", "r") uses the libnetCDF4-C OPeNDAP stack (libdap- or libcurl-based DAP client). This succeeds only if the server speaks the DAP2/DAP4 protocol (OPeNDAP, THREDDS, Hyrax, etc.). A plain nginx-served HDF5 file will fail with an OPeNDAP parse error because the server returns raw HDF5 bytes, not a DAP response.
Summary: cf.read("http://...") today means OPeNDAP only. It does not support
plain HTTP servers that serve HDF5/netCDF4 files with byte-range requests (nginx,
Apache, any static file host).
Why filesystem= solves plain HTTP
fsspec's HTTPFileSystem implements the HDF5 access pattern correctly.
HTTPFile (a subclass of AbstractBufferedFile) issues Range: bytes=X-Y requests
and presents a seekable file-like interface to the caller. This can be verified from
the fsspec source: async_fetch_range() sets headers["Range"] = f"bytes={start}-{end - 1}" and validates Content-Range in the response.
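The off-by-one convention matters here: Python byte windows are half-open while HTTP Range is inclusive at both ends. A minimal sketch of the translation (hypothetical helper name, mirroring what fsspec does internally):

```python
def http_range_header(start: int, end: int) -> str:
    # The caller asks for the half-open byte window [start, end), but
    # HTTP's Range header is inclusive at both ends — hence end - 1.
    return f"bytes={start}-{end - 1}"

print(http_range_header(0, 4096))    # → bytes=0-4095 (first 4 KiB)
print(http_range_header(512, 1024))  # → bytes=512-1023
```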
h5py's registered_drivers() includes 'fileobj' — it accepts seekable file-like
objects via that driver when h5py.File(file_like_obj, "r") is called. h5netcdf
passes non-string paths straight to h5py.File unchanged (the else branch in
_open_h5py), and pyfive similarly receives whatever h5netcdf passes.
Therefore, with the proposed filesystem= parameter:
import fsspec
fs = fsspec.filesystem("http")
# Plain nginx server, no OPeNDAP - works because fsspec issues Range requests
cf.read("http://server/path/to/data.nc", filesystem=fs)
cfdm calls fs.open("http://server/path/to/data.nc", "rb") → returns an HTTPFile
with range-get → passed to h5netcdf.File → h5py fileobj driver → random-access
HDF5 reads over HTTP.
This path does not require ros3 in the h5py build, does not require a DAP server,
and works with any HTTP/HTTPS server that honours Range headers (nginx, Apache,
object storage HTTP endpoints, etc.).
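A quick way to check whether a given server qualifies (an illustrative stdlib helper, not part of any cfdm or fsspec API; the opener parameter exists only so the probe can be exercised offline):

```python
import urllib.request

def supports_range(url, opener=urllib.request.urlopen):
    # Ask for the first byte only; a server that honours Range replies
    # 206 Partial Content rather than 200 OK.
    req = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    with opener(req) as resp:
        return getattr(resp, "status", None) == 206
```

For example, supports_range("https://example.org/data.nc") could be run once before pointing cf.read at that URL.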
OPeNDAP is unaffected
The existing OPeNDAP path (bare http:// string → netCDF4-C DAP client) continues
to work exactly as before for users who do not supply filesystem=. Supplying
filesystem=fsspec.filesystem("http") explicitly opts in to the range-get path instead.
Relationship to Existing storage_options
storage_options will continue to work as-is for programmatic S3 credential injection
without a pre-built filesystem. The new filesystem parameter is not a replacement;
it is an escape hatch for callers that already hold a live filesystem object.
If both are provided, filesystem takes precedence (the pre-built object is presumably
more up-to-date) and a warning can optionally be emitted.
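The precedence rule could be sketched as follows (hypothetical helper name; cfdm would presumably inline this logic rather than expose such a function):

```python
import warnings

def resolve_transport(filesystem=None, storage_options=None):
    # Sketch of the proposed precedence: a live filesystem object wins
    # over raw credentials, with a warning when both are supplied.
    if filesystem is not None:
        if storage_options:
            warnings.warn(
                "Both filesystem= and storage_options= given; "
                "using filesystem= and ignoring storage_options="
            )
        return ("filesystem", filesystem)
    return ("storage_options", storage_options)
```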
Out of Scope
- Zarr datasets via fsspec FSStore / zarr.storage — Zarr already has its own fsspec integration and is a separate dispatch path in cfdm.
- CDL string datasets — unchanged.
- Write (cf.write) — write paths are not considered here.
- Validation that the provided filesystem is seekable — left to h5netcdf/h5py/pyfive to raise naturally.
Summary of Benefits
| Scenario | Before | After |
|---|---|---|
| S3 with pre-authenticated s3fs | Must reconstruct fs from storage_options | Pass existing fs directly |
| SSH / SFTP | Impossible (DatasetTypeError) | Works via fsspec/asyncssh |
| SFTP via ProxyJump | Impossible | Works |
| HTTP OPeNDAP server | Works (netCDF4-C DAP client) | Unchanged — still works without filesystem= |
| HTTP plain file server (nginx, range-get) | Fails (not OPeNDAP, no ros3) | Works via filesystem=fsspec.filesystem("http") |
| Any future fsspec backend | Impossible | Works |
| Change size in cfdm | — | ~20–40 lines across 2 files |
| Changes to h5netcdf | — | None required |
| Changes to pyfive | — | None required |