Add read_files function to SDK by sminot · Pull Request #195 · CirroBio/Cirro-SDK-Python

sminot · 2026-03-19T16:33:33Z

Bioinformaticians frequently embed metadata in file paths. It would likely be useful to quickly find the files that match a particular naming pattern and return both the file contents and the metadata embedded in each path.

Summary

Adds read_files generator methods to DataPortalDataset and DataPortalProject for reading and
parsing dataset files directly into Python objects.

  DataPortalDataset.read_files(glob, pattern, file_format, **kwargs)

Iterates over files in the dataset and yields parsed content for each match. Exactly one of glob
or pattern must be provided.

glob mode — standard wildcard matching:

- * matches within a single path segment; ** matches across segments
- Suffix-anchored, so *.csv matches at any depth
- Yields parsed file content per match (e.g., a DataFrame or str)
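To illustrate what suffix-anchored matching means, here is a minimal sketch built on PurePath.match, which the commit messages name as the matching mechanism. The helper filter_by_glob and the file paths are made up for illustration and are not part of the SDK. The sketch leaves ** out, because PurePath.match treats it like a single * rather than a recursive wildcard.

```python
from pathlib import PurePath


def filter_by_glob(paths, glob):
    """Hypothetical helper: keep paths whose trailing segments match `glob`."""
    # A relative pattern is matched from the right-hand end of the path,
    # so '*.csv' matches a CSV file at any directory depth.
    return [p for p in paths if PurePath(p).match(glob)]


paths = [
    "summary.csv",
    "results/counts.csv",
    "results/batch1/counts.csv",
    "results/notes.txt",
]

csv_files = filter_by_glob(paths, "*.csv")
batch_files = filter_by_glob(paths, "batch1/*.csv")
print(csv_files)
print(batch_files)
```

A two-segment glob such as batch1/*.csv only needs to match the last two segments of the path, which is why it still selects results/batch1/counts.csv.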

pattern mode — like glob but with {name} capture placeholders:

- {name} captures one path segment (no /)
- Yields (content, captures) tuples, where captures is a dict of values extracted from the path
- Useful for pulling metadata (sample names, conditions, etc.) directly out of file paths
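One way to picture the capture behavior is as a translation from the pattern to a regular expression with named groups. The sketch below is an assumption about how that could work, not the SDK's code; pattern_to_regex and match_captures are hypothetical names.

```python
import re

# Tokens that get special treatment: {name} placeholders and * / ** wildcards
_TOKEN = re.compile(r"(\{[A-Za-z_]\w*\}|\*\*|\*)")


def pattern_to_regex(pattern):
    """Hypothetical helper: build a suffix-anchored regex from a {name} pattern."""
    parts = []
    for token in _TOKEN.split(pattern):
        if not token:
            continue
        placeholder = re.fullmatch(r"\{([A-Za-z_]\w*)\}", token)
        if placeholder:
            # A placeholder captures exactly one path segment (no '/')
            parts.append(f"(?P<{placeholder.group(1)}>[^/]+)")
        elif token == "**":
            parts.append(".*")      # may span several segments
        elif token == "*":
            parts.append("[^/]*")   # stays within one segment
        else:
            parts.append(re.escape(token))
    # Anchor at the end of the path; the match may begin at any segment boundary
    return re.compile(r"(?:^|/)" + "".join(parts) + r"$")


def match_captures(path, pattern):
    """Return the captured values as a dict, or None if the path does not match."""
    m = pattern_to_regex(pattern).search(path)
    return m.groupdict() if m else None


print(match_captures("run1/treated/s1.csv", "{condition}/{sample}.csv"))
print(match_captures("s1.tsv", "{sample}.csv"))
```

Because the expression is anchored only at the end, the leading run1/ directory is ignored and the two placeholders bind to the last two segments.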

The file_format argument controls parsing ('csv', 'json', 'h5ad', 'parquet', 'feather',
'pickle', 'excel', 'text'); omitting it infers the format from the file extension. **kwargs are
forwarded to the underlying parsing function (e.g., sep='\t' for TSV files).
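As a rough sketch of extension-based inference, the lookup below maps the extensions mentioned in this PR to the format names listed above. The table, the helper name infer_format, and the fallback to 'text' for unknown extensions are assumptions for illustration, not the SDK's implementation.

```python
from pathlib import PurePath

# Hypothetical extension -> format table, based on the formats named in this PR
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".tsv": "csv",
    ".json": "json",
    ".h5ad": "h5ad",
    ".parquet": "parquet",
    ".feather": "feather",
    ".pkl": "pickle",
    ".pickle": "pickle",
    ".xlsx": "excel",
    ".xls": "excel",
}


def infer_format(path, file_format=None):
    """Return the explicit format if given, else guess from the final extension."""
    if file_format is not None:
        return file_format
    suffix = PurePath(path).suffix.lower()
    # Assumption: anything unrecognized is read as plain text
    return EXTENSION_FORMATS.get(suffix, "text")


print(infer_format("results/counts.csv"))
print(infer_format("results/counts.tsv.gz", file_format="csv"))
```

In this sketch only the final extension is inspected, so a compressed name such as counts.tsv.gz falls through to 'text' unless file_format is passed explicitly. Whether the SDK treats compressed files the same way is not stated here.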

  DataPortalProject.read_files(dataset, glob, pattern, file_format, **kwargs)

A thin convenience wrapper around DataPortalDataset.read_files. Resolves the dataset argument —
which may be a name string, ID string, or DataPortalDataset object — then delegates to the
dataset method. The glob/pattern behavior and yield types are identical.
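The resolution step can be sketched as follows. FakeDataset stands in for DataPortalDataset, and resolve_dataset is a hypothetical helper; the lookup order and the error type are assumptions rather than the SDK's behavior.

```python
from dataclasses import dataclass


@dataclass
class FakeDataset:
    """Stand-in for DataPortalDataset (hypothetical, for illustration only)."""
    id: str
    name: str


def resolve_dataset(datasets, dataset):
    """Accept a dataset object, an ID string, or a name string."""
    if isinstance(dataset, FakeDataset):
        return dataset  # already an object, nothing to look up
    for candidate in datasets:
        if dataset in (candidate.id, candidate.name):
            return candidate
    raise ValueError(f"No dataset found matching {dataset!r}")


datasets = [FakeDataset("ds-001", "My Dataset"), FakeDataset("ds-002", "Controls")]
by_name = resolve_dataset(datasets, "My Dataset")
by_id = resolve_dataset(datasets, "ds-002")
print(by_name, by_id)
```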

Usage examples

  # Read all CSV files from a dataset (yields DataFrames)
  for df in dataset.read_files(glob='*.csv'):
      print(df.shape)

  # Extract sample names from filenames (yields (DataFrame, captures) tuples)
  for df, captures in dataset.read_files(pattern='{sample}.csv'):
      print(captures['sample'], df.shape)

  # Multi-level capture: condition directory + sample filename
  for df, captures in dataset.read_files(pattern='{condition}/{sample}.csv'):
      print(captures['condition'], captures['sample'], df.shape)

  # Read gzip-compressed TSV files at any depth
  for df in dataset.read_files(glob='**/*.tsv.gz', file_format='csv', sep='\t'):
      print(df.shape)

  # Same as the first example, but resolve the dataset by name from a project
  for df in project.read_files('My Dataset', glob='*.csv'):
      print(df.shape)

Adds a read_files(pattern, file_format=None, **kwargs) method to both DataPortalDataset and DataPortalProject. The method accepts a standard glob pattern string (e.g. '*.csv', 'data/**/*.tsv.gz'), filters dataset files using PurePath.match, and yields (DataPortalFile, content) tuples. File format is auto-detected from the extension (.csv/.tsv → DataFrame, .h5ad → AnnData, anything else → str) or can be specified explicitly. Parsing kwargs are forwarded to the underlying read method (e.g. sep='\t' for read_csv). Project-level read_files delegates to each dataset in turn. https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU

- Add read_json, read_parquet, read_feather, read_pickle, read_excel methods to DataPortalFile
- Update _infer_file_format to detect .json, .parquet, .feather, .pkl/.pickle, .xlsx/.xls extensions
- Update _read_file_with_format to dispatch to the new read methods
- Update read_files docstring to document all supported formats
- Add tests for new format inference and reading (parquet/feather tests skip without pyarrow)

https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU

- Add _pattern_to_captures_regex() that converts {name} placeholders in glob patterns to named regex groups (suffix-anchored like PurePath.match)
- read_files() now always yields (file, content, captures) 3-tuples; captures is {} when the pattern has no {name} placeholders
- Patterns with {name} use regex matching; plain glob patterns continue to use filter_files_by_pattern / PurePath.match unchanged
- Add TestPatternToRegex suite and TestDatasetReadFiles capture tests; update all existing tests to unpack 3-tuples

https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU

read_files() now takes two mutually exclusive keyword arguments:

- glob='*.csv' → yields content per matching file
- pattern='{sample}.csv' → yields (content, captures) per matching file

Passing both or neither raises DataPortalInputError. This makes the return type unambiguous: glob always gives a flat iterator of content, pattern always gives (content, captures) 2-tuples.

https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU
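The "exactly one of glob or pattern" rule comes down to a small argument check. A minimal sketch follows, with a stand-in exception class so it runs without the SDK installed; the function name check_glob_or_pattern and the error message are hypothetical.

```python
class DataPortalInputError(Exception):
    """Stand-in for the SDK exception of the same name (illustration only)."""


def check_glob_or_pattern(glob=None, pattern=None):
    """Return which mode is active, or raise if both or neither were given."""
    # The two comparisons are equal exactly when both are set or both are missing
    if (glob is None) == (pattern is None):
        raise DataPortalInputError("Provide exactly one of 'glob' or 'pattern'")
    return "glob" if glob is not None else "pattern"


print(check_glob_or_pattern(glob="*.csv"))
print(check_glob_or_pattern(pattern="{sample}.csv"))
```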

Instead of iterating across all datasets, read_files() on DataPortalProject now requires a dataset argument (name, ID, or DataPortalDataset object) and delegates to that dataset's read_files(). The glob/pattern/file_format interface is otherwise unchanged. https://claude.ai/code/session_01TANa5jJ1qzDMzoV8qCjpuU

claude and others added 10 commits March 19, 2026 06:29

Merge branch 'CirroBio:main' into claude/add-read-files-function-5LNXG

96de2ad

Fix flake8

4cf45aa

Merge branch 'main' into pr/195

8bf8338

Get dataset by name or id

29e0c42

Add singular read_file function

7b59277

sminot marked this pull request as ready for review March 19, 2026 20:39

sminot added 4 commits March 19, 2026 13:40

Increment version

52ee650

Bugfixes

75e4e6a

Move from project to portal

30abda9

Change file_format to format

05c78b4

nathanthorpe approved these changes Mar 19, 2026


sminot added 12 commits March 19, 2026 14:45

Clean up

84c36ba

Move the primary read_files docs to the DataPortal object

96764c2

format -> filetype

595b0a2

captures -> meta

5be8998

Update README.md

adf8814

Add tests

e51ba84

Read file(s) as bytes

21550d4

Update example for running analysis

9847e4d

Optionally filter the files downloaded from a dataset

ed9916e

Add tests for reading files

220c9ea

Add get_trace and get_logs

3e271bf

Update samples

96c81a9

sminot merged commit 2fba771 into CirroBio:main Mar 20, 2026
5 of 6 checks passed

sminot deleted the claude/add-read-files-function-5LNXG branch March 20, 2026 16:56


sminot merged 26 commits into CirroBio:main from sminot:claude/add-read-files-function-5LNXG


3 participants
