Merged
Conversation
Adds a read_files(pattern, file_format=None, **kwargs) method to both DataPortalDataset and DataPortalProject. The method accepts a standard glob pattern string (e.g. '*.csv', 'data/**/*.tsv.gz'), filters dataset files using PurePath.match, and yields (DataPortalFile, content) tuples. File format is auto-detected from the extension (.csv/.tsv → DataFrame, .h5ad → AnnData, anything else → str) or can be specified explicitly. Parsing kwargs are forwarded to the underlying read method (e.g. sep='\t' for read_csv). Project-level read_files delegates to each dataset in turn. https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU
- Add read_json, read_parquet, read_feather, read_pickle, read_excel methods to DataPortalFile - Update _infer_file_format to detect .json, .parquet, .feather, .pkl/.pickle, .xlsx/.xls extensions - Update _read_file_with_format to dispatch to the new read methods - Update read_files docstring to document all supported formats - Add tests for new format inference and reading (parquet/feather tests skip without pyarrow) https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU
- Add _pattern_to_captures_regex() that converts {name} placeholders in
glob patterns to named regex groups (suffix-anchored like PurePath.match)
- read_files() now always yields (file, content, captures) 3-tuples;
captures is {} when the pattern has no {name} placeholders
- Patterns with {name} use regex matching; plain glob patterns continue
to use filter_files_by_pattern / PurePath.match unchanged
- Add TestPatternToRegex suite and TestDatasetReadFiles capture tests;
update all existing tests to unpack 3-tuples
https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU
read_files() now takes two mutually exclusive keyword arguments:
- glob='*.csv' → yields content per matching file
- pattern='{sample}.csv' → yields (content, captures) per matching file
Passing both or neither raises DataPortalInputError. This makes the
return type unambiguous: glob always gives a flat iterator of content,
pattern always gives (content, captures) 2-tuples.
https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU
Instead of iterating across all datasets, read_files() on DataPortalProject now requires a dataset argument (name, ID, or DataPortalDataset object) and delegates to that dataset's read_files(). The glob/pattern/file_format interface is otherwise unchanged. https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU
nathanthorpe
approved these changes
Mar 19, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bioinformaticians frequently embed metadata in file paths. They will likely find it useful to be able to quickly identify files which match a particular naming pattern and return both the file contents and also the file-path-embedded metadata.
Summary
Adds
read_filesgenerator methods to DataPortalDataset and DataPortalProject for reading andparsing dataset files directly into Python objects.
Iterates over files in the dataset and yields parsed content for each match. Exactly one of
globor
patternmust be provided.globmode — standard wildcard matching:patternmode — likeglobbut with {name} capture placeholders:{name}captures one path segment (no /)(content, captures)tuples where captures is a dict of values extracted from the pathThe
file_formatargument controls parsing ('csv','json','h5ad','parquet','feather','pickle','excel','text'); omitting it infers the format from the file extension.**kwargsareforwarded to the underlying parsing function (e.g.,
sep='\t'for TSV files).A thin convenience wrapper around
DataPortalDataset.read_files. Resolves thedatasetargument —which may be a name string, ID string, or
DataPortalDatasetobject — then delegates to thedataset method. The glob/pattern behavior and yield types are identical.
Usage examples