Add read_files function to SDK #195

Merged: sminot merged 26 commits into CirroBio:main from sminot:claude/add-read-files-function-5LNXG on Mar 20, 2026

sminot commented Mar 19, 2026

Bioinformaticians frequently embed metadata in file paths. It would be useful to quickly identify the files that match a particular naming pattern and to return both the file contents and the metadata embedded in each path.

Summary

Adds read_files generator methods to DataPortalDataset and DataPortalProject for reading and
parsing dataset files directly into Python objects.

  DataPortalDataset.read_files(glob, pattern, file_format, **kwargs)                              

Iterates over files in the dataset and yields parsed content for each match. Exactly one of glob
or pattern must be provided.
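
The "exactly one of glob or pattern" rule could be checked along these lines. This is a minimal sketch, not the SDK's code: check_match_args is a hypothetical helper name, and DataPortalInputError is redefined here as a stand-in so the snippet runs on its own.

```python
class DataPortalInputError(Exception):
    """Stand-in for the SDK's input error, defined here so the sketch is self-contained."""


def check_match_args(glob=None, pattern=None):
    """Return (mode, value) after checking that exactly one argument was given.

    Hypothetical helper: the name and return shape are assumptions for illustration.
    """
    if glob is not None and pattern is not None:
        raise DataPortalInputError("Provide either glob or pattern, not both")
    if glob is None and pattern is None:
        raise DataPortalInputError("Provide one of glob or pattern")
    # Exactly one argument is set at this point
    return ("glob", glob) if glob is not None else ("pattern", pattern)
```

Splitting the two modes by keyword keeps the yield type predictable: a caller who passes glob gets content only, and a caller who passes pattern gets (content, captures) tuples.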

glob mode — standard wildcard matching:

  • * matches within a single path segment; ** matches across segments
  • Suffix-anchored, so *.csv matches at any depth
  • Yields parsed file content per match (e.g., a DataFrame, str, etc.)
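
One way to picture the glob rules above is as a translation to a regular expression. The sketch below is an illustration under stated assumptions, not the matching code used by the SDK: glob_to_regex is a hypothetical helper, and it only handles *, ** and ? wildcards.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a glob into a suffix-anchored regex (hypothetical helper)."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif glob.startswith("**", i):
            parts.append(r".*")  # anything, including "/"
            i += 2
        elif glob[i] == "*":
            parts.append(r"[^/]*")  # stays inside one path segment
            i += 1
        elif glob[i] == "?":
            parts.append(r"[^/]")  # one character, never "/"
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    # Suffix anchoring: a match may begin at the start of the path or
    # right after any "/", but it must run to the end of the path.
    return re.compile(r"(?:^|/)" + "".join(parts) + r"$")


def glob_matches(glob: str, path: str) -> bool:
    """True when the path matches the glob under the rules sketched above."""
    return glob_to_regex(glob).search(path) is not None
```

With this sketch, glob_matches('*.csv', 'a/b/c.csv') is True because of the suffix anchoring, while glob_matches('data/*.csv', 'data/x/y.csv') is False because * does not cross a "/".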

pattern mode — like glob but with {name} capture placeholders:

  • {name} captures one path segment (no /)
  • Yields (content, captures) tuples where captures is a dict of values extracted from the path
  • Useful for pulling metadata (sample names, conditions, etc.) directly out of file paths
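
The capture behaviour can be sketched by turning each {name} placeholder into a named regex group that matches a single path segment. The helper names below (pattern_to_regex, extract_captures) are hypothetical, and for brevity this sketch treats everything outside the placeholders as literal text rather than also handling wildcards.

```python
import re

# A placeholder is "{" + a valid group name + "}"
_PLACEHOLDER = re.compile(r"\{([A-Za-z_]\w*)\}")


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Convert '{name}' placeholders into named groups (hypothetical helper)."""
    parts = []
    # re.split with one capturing group returns literal text at even
    # indices and placeholder names at odd indices.
    for idx, token in enumerate(_PLACEHOLDER.split(pattern)):
        if idx % 2 == 1:
            parts.append(f"(?P<{token}>[^/]+)")  # one segment, no "/"
        else:
            parts.append(re.escape(token))
    # Suffix-anchored, in the same spirit as glob mode
    return re.compile(r"(?:^|/)" + "".join(parts) + r"$")


def extract_captures(pattern: str, path: str):
    """Return the dict of captured values, or None when the path does not match."""
    match = pattern_to_regex(pattern).search(path)
    return match.groupdict() if match else None
```

For example, extract_captures('{condition}/{sample}.csv', 'results/treated/s1.csv') returns {'condition': 'treated', 'sample': 's1'} under this sketch.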

The file_format argument controls parsing ('csv', 'json', 'h5ad', 'parquet', 'feather',
'pickle', 'excel', 'text'); omitting it infers the format from the file extension. **kwargs are
forwarded to the underlying parsing function (e.g., sep='\t' for TSV files).
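
Format inference from the extension might look roughly like the following. The function name infer_format and the exact mapping are assumptions for illustration (in particular, treating .tsv as 'csv' and falling back to 'text' for unknown extensions); the SDK's own table may differ.

```python
from pathlib import PurePosixPath

# Extension-to-format table assembled from the formats named in this PR.
# Assumption: the real mapping inside the SDK may differ.
_EXTENSION_FORMATS = {
    ".csv": "csv",
    ".tsv": "csv",
    ".json": "json",
    ".h5ad": "h5ad",
    ".parquet": "parquet",
    ".feather": "feather",
    ".pkl": "pickle",
    ".pickle": "pickle",
    ".xlsx": "excel",
    ".xls": "excel",
}


def infer_format(path: str) -> str:
    """Guess a file_format value from the final extension (hypothetical helper)."""
    ext = PurePosixPath(path).suffix.lower()
    # Anything unrecognised is read as plain text in this sketch
    return _EXTENSION_FORMATS.get(ext, "text")
```

Because only the final extension is inspected here, a name such as results.tsv.gz falls through to 'text' in this sketch, so it would need an explicit file_format='csv' and sep='\t'.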

  DataPortalProject.read_files(dataset, glob, pattern, file_format, **kwargs)                     

A thin convenience wrapper around DataPortalDataset.read_files. Resolves the dataset argument —
which may be a name string, ID string, or DataPortalDataset object — then delegates to the
dataset method. The glob/pattern behavior and yield types are identical.
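
The resolve-then-delegate shape of the project method can be sketched with stand-in classes. Dataset and Project below are simplified placeholders, not the real DataPortalDataset and DataPortalProject, and the lookup order (ID or name, whichever matches first) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Stand-in for DataPortalDataset with just enough surface for the sketch."""
    id: str
    name: str

    def read_files(self, glob=None, pattern=None, file_format=None, **kwargs):
        # Placeholder body: the real method would yield parsed file content
        yield f"{self.name}:{glob or pattern}"


class Project:
    """Stand-in for DataPortalProject."""

    def __init__(self, datasets):
        self.datasets = list(datasets)

    def _resolve(self, dataset):
        """Accept a Dataset object, an ID string, or a name string."""
        if isinstance(dataset, Dataset):
            return dataset
        for candidate in self.datasets:
            if dataset in (candidate.id, candidate.name):
                return candidate
        raise ValueError(f"No dataset matches {dataset!r}")

    def read_files(self, dataset, glob=None, pattern=None, file_format=None, **kwargs):
        # Thin wrapper: resolve once, then hand everything to the dataset
        resolved = self._resolve(dataset)
        yield from resolved.read_files(
            glob=glob, pattern=pattern, file_format=file_format, **kwargs
        )
```

Keeping the wrapper this thin means the glob/pattern validation and the yield types live in one place, on the dataset method.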

Usage examples

  # Read all CSV files from a dataset — yields DataFrames
  for df in dataset.read_files(glob='*.csv'):
      print(df.shape)

  # Extract sample names from filenames — yields (DataFrame, captures) tuples
  for df, captures in dataset.read_files(pattern='{sample}.csv'):
      print(captures['sample'], df.shape)

  # Multi-level capture: condition directory + sample filename
  for df, captures in dataset.read_files(pattern='{condition}/{sample}.csv'):
      print(captures['condition'], captures['sample'], df.shape)

  # Read gzip-compressed TSV files at any depth
  for df in dataset.read_files(glob='**/*.tsv.gz', file_format='csv', sep='\t'):
      print(df.shape)

  # Same, but resolve the dataset by name from a project
  for df in project.read_files('My Dataset', glob='*.csv'):
      print(df.shape)

claude and others added 10 commits March 19, 2026 06:29
Adds a read_files(pattern, file_format=None, **kwargs) method to both
DataPortalDataset and DataPortalProject. The method accepts a standard
glob pattern string (e.g. '*.csv', 'data/**/*.tsv.gz'), filters dataset
files using PurePath.match, and yields (DataPortalFile, content) tuples.

File format is auto-detected from the extension (.csv/.tsv → DataFrame,
.h5ad → AnnData, anything else → str) or can be specified explicitly.
Parsing kwargs are forwarded to the underlying read method (e.g. sep='\t'
for read_csv). Project-level read_files delegates to each dataset in turn.

https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU
- Add read_json, read_parquet, read_feather, read_pickle, read_excel methods to DataPortalFile
- Update _infer_file_format to detect .json, .parquet, .feather, .pkl/.pickle, .xlsx/.xls extensions
- Update _read_file_with_format to dispatch to the new read methods
- Update read_files docstring to document all supported formats
- Add tests for new format inference and reading (parquet/feather tests skip without pyarrow)

https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU
- Add _pattern_to_captures_regex() that converts {name} placeholders in
  glob patterns to named regex groups (suffix-anchored like PurePath.match)
- read_files() now always yields (file, content, captures) 3-tuples;
  captures is {} when the pattern has no {name} placeholders
- Patterns with {name} use regex matching; plain glob patterns continue
  to use filter_files_by_pattern / PurePath.match unchanged
- Add TestPatternToRegex suite and TestDatasetReadFiles capture tests;
  update all existing tests to unpack 3-tuples

https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU
read_files() now takes two mutually exclusive keyword arguments:
- glob='*.csv'            → yields content per matching file
- pattern='{sample}.csv' → yields (content, captures) per matching file

Passing both or neither raises DataPortalInputError. This makes the
return type unambiguous: glob always gives a flat iterator of content,
pattern always gives (content, captures) 2-tuples.

https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU
Instead of iterating across all datasets, read_files() on
DataPortalProject now requires a dataset argument (name, ID, or
DataPortalDataset object) and delegates to that dataset's read_files().
The glob/pattern/file_format interface is otherwise unchanged.

https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU
sminot marked this pull request as ready for review March 19, 2026 20:39
sminot merged commit 2fba771 into CirroBio:main Mar 20, 2026
5 of 6 checks passed
sminot deleted the claude/add-read-files-function-5LNXG branch March 20, 2026 16:56
3 participants