Big renaming and cleanups by jpc · Pull Request #45 · HumeAI/wsds

jpc · 2026-02-13T13:49:44Z

Renaming things to be more consistent, better tests and docstrings, better documentation.

Move index SQL queries from WSDataset into WSIndex

- Add old/new index format detection (partition vs dataset_path columns)
- Add _partition_col property for unified SQL partition expression
- Add lookup_by_index() and lookup_by_key() methods to WSIndex
- Add shard_n_samples() and shard_global_offset() via _query_shard() helper
- Simplify WSDataset.__getitem__ to delegate to WSIndex lookups
- Replace all raw index.query() calls in WSDataset with WSIndex methods
- Update ws_tools.py to use new format detection
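The format detection described in this commit can be illustrated with a small sketch. It is not the code from the PR: the function names below are hypothetical, and it assumes (based on the dataset_path → partition rename elsewhere in this PR) that the new format uses a `partition` column and the old one uses `dataset_path`.

```python
def detect_index_format(columns):
    """Classify an index by its column names (hypothetical helper).

    Assumption: "partition" marks the new format, "dataset_path" the old one.
    """
    cols = set(columns)
    if "partition" in cols:
        return "new"
    if "dataset_path" in cols:
        return "old"
    raise ValueError("index has neither a partition nor a dataset_path column")


def partition_col(columns):
    """Return the column name to use in SQL for either index format."""
    return "partition" if detect_index_format(columns) == "new" else "dataset_path"


# Building a query that works against both formats (table name is illustrative).
col = partition_col(["shard", "dataset_path", "n_samples"])
query = f"SELECT {col}, shard FROM shards"
```

Keeping the choice of column in one place means callers never branch on the index format themselves.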

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move is_notebook() to utils, guard _ipython_display_ for terminal use

- Extract is_notebook() from convplayer.py into utils.py (simplified)
- Remove redundant _ipython_display_ from AudioReader and WSAudio (IPython already calls _repr_html_ automatically)
- Add is_notebook() guard to WSDataset and WSSample _ipython_display_ so they fall back to print() in terminal IPython sessions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
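A minimal sketch of the guard pattern described above, not the implementation in utils.py. It uses the common heuristic of checking the IPython shell class name, and `Preview` is a hypothetical stand-in for WSDataset or WSSample.

```python
import html


def is_notebook() -> bool:
    """Best-effort check for a Jupyter kernel (heuristic, not the wsds code)."""
    try:
        from IPython import get_ipython
    except ImportError:
        return False  # IPython is not installed, so this is not a notebook
    shell = get_ipython()
    if shell is None:
        return False  # plain Python interpreter
    # Jupyter kernels use ZMQInteractiveShell; terminal IPython uses
    # TerminalInteractiveShell.
    return shell.__class__.__name__ == "ZMQInteractiveShell"


class Preview:
    """Hypothetical object with a rich notebook view and a plain-text fallback."""

    def __init__(self, text):
        self.text = text

    def _repr_html_(self):
        return f"<pre>{html.escape(self.text)}</pre>"

    def _ipython_display_(self):
        if is_notebook():
            from IPython.display import HTML, display

            display(HTML(self._repr_html_()))
        else:
            # Terminal IPython cannot render HTML, so print plain text instead.
            print(self.text)
```

Objects that only define `_repr_html_` need no guard, since IPython picks the richest representation the frontend supports.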

Improve naming consistency across codebase

- subdir → column_dir (utils, ws_sample, ws_modal_shard)
- shard_name → shard_ref on shard interfaces and WSSample
- dataset_path → partition in index and shard code
- dataset_dir → dataset_root in ws_indexer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add get_audio() helper in ws_decode, use in WSSample

Centralizes audio column lookup logic so it can be reused outside of WSSample (e.g. from plain dicts or other sample types).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add rng parameter to WSDataset for reproducible sampling

Allows passing rng=42 (or a Random instance) to get deterministic sample ordering in random_sample() and sql_select(). Also removes unused needs_key variable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
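One way to accept either a seed or a Random instance is sketched below. This shows the general pattern only; `make_rng` and `shuffled_indices` are hypothetical names, not functions from wsds.

```python
import random


def make_rng(rng=None):
    """Normalize a seed, a random.Random instance, or None into a Random."""
    if isinstance(rng, random.Random):
        return rng  # caller-owned generator, used as-is
    # An int seed gives a reproducible stream; None seeds from system entropy.
    return random.Random(rng)


def shuffled_indices(n, rng=None):
    """Return the indices 0..n-1 in an order determined by rng."""
    r = make_rng(rng)
    indices = list(range(n))
    r.shuffle(indices)
    return indices


# The same seed yields the same ordering on every call.
first = shuffled_indices(10, rng=42)
second = shuffled_indices(10, rng=42)
```

Passing a shared Random instance instead of a seed lets several calls draw from one stream, so repeated calls give different but still reproducible orderings.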

Fix hume_wsds module path remapping and sql_filter pl.first() usage

- Remap hume_wsds.* loader paths to wsds.* for backward compatibility with old index files that reference the former package name
- Use pl.first() instead of exprs[0] in sql_filter for correctness

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
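The loader path remapping could look roughly like the sketch below. It assumes loader paths are stored as dotted strings; `remap_loader_path` and the example paths are hypothetical and only illustrate the prefix rewrite.

```python
OLD_PACKAGE = "hume_wsds"
NEW_PACKAGE = "wsds"


def remap_loader_path(path: str) -> str:
    """Rewrite a dotted path under the old package name to the new one."""
    # Match the package exactly or as a dotted prefix, so an unrelated
    # package such as "hume_wsds_extra" is left untouched.
    if path == OLD_PACKAGE or path.startswith(OLD_PACKAGE + "."):
        return NEW_PACKAGE + path[len(OLD_PACKAGE):]
    return path


# Hypothetical loader path as it might appear in an old index file.
remapped = remap_loader_path("hume_wsds.ws_shard.WSShard")
```

Matching on the dotted prefix rather than a plain substring keeps the rewrite from touching paths that merely contain the old name.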

Update module docstring with rich examples and fix doctests

- Add comprehensive docstring to __init__.py with working doctests showcasing SQL queries, random access, lazy loading, and audio
- Fix ws_dataset.py doctests (AudioReader src type, add shard_subsample=1)
- Fix ws_sink.py doctest (remove invalid batch_size param)
- Update tests.py to run wsds module doctests and fix imports

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

shahbaz-humeai

Approved, with two comments

* Extract audio codec layer from ws_audio.py into audio_codec.py

  Separates codec concerns (decoder backends, encoder, format utils) from the data model layer (AudioReader, WSAudio) for better reusability and testability.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add ModalFileReader for Modal Volume range requests

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Centralize binary column decoding into ws_decode module

  Extract duplicated npy/pyd/txt/audio decode logic from WSShard and WSS3Shard into a shared decode_sample() function. Dispatch is now based on column type (binary) rather than column-name heuristics.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Added WSModalShard

* Big renaming and cleanups (#45)

  * Move index SQL queries from WSDataset into WSIndex

    - Add old/new index format detection (partition vs dataset_path columns)
    - Add _partition_col property for unified SQL partition expression
    - Add lookup_by_index() and lookup_by_key() methods to WSIndex
    - Add shard_n_samples() and shard_global_offset() via _query_shard() helper
    - Simplify WSDataset.__getitem__ to delegate to WSIndex lookups
    - Replace all raw index.query() calls in WSDataset with WSIndex methods
    - Update ws_tools.py to use new format detection

  * Fix env var name in README: WSDS_DATASET_PATH → WSDS_DATASET_SEARCH_PATH

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

  * Move is_notebook() to utils, guard _ipython_display_ for terminal use

    - Extract is_notebook() from convplayer.py into utils.py (simplified)
    - Remove redundant _ipython_display_ from AudioReader and WSAudio (IPython already calls _repr_html_ automatically)
    - Add is_notebook() guard to WSDataset and WSSample _ipython_display_ so they fall back to print() in terminal IPython sessions

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

  * Improve naming consistency across codebase

    - subdir → column_dir (utils, ws_sample, ws_modal_shard)
    - shard_name → shard_ref on shard interfaces and WSSample
    - dataset_path → partition in index and shard code
    - dataset_dir → dataset_root in ws_indexer

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

  * Add get_audio() helper in ws_decode, use in WSSample

    Centralizes audio column lookup logic so it can be reused outside of WSSample (e.g. from plain dicts or other sample types).

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

  * Add rng parameter to WSDataset for reproducible sampling

    Allows passing rng=42 (or a Random instance) to get deterministic sample ordering in random_sample() and sql_select(). Also removes unused needs_key variable.

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

  * Fix hume_wsds module path remapping and sql_filter pl.first() usage

    - Remap hume_wsds.* loader paths to wsds.* for backward compatibility with old index files that reference the former package name
    - Use pl.first() instead of exprs[0] in sql_filter for correctness

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

  * Update module docstring with rich examples and fix doctests

    - Add comprehensive docstring to __init__.py with working doctests showcasing SQL queries, random access, lazy loading, and audio
    - Fix ws_dataset.py doctests (AudioReader src type, add shard_subsample=1)
    - Fix ws_sink.py doctest (remove invalid batch_size param)
    - Update tests.py to run wsds module doctests and fix imports

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

  * Apply ruff formatting

    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

  * Moved the library showcase to README.md

  * Use shard ref

  ---------

  Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
  Co-authored-by: Shahbaz Mogal <shahbaz@hume.ai>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Shahbaz Mogal <shahbaz@hume.ai>

* WSAudio: disable slots because it breaks code auto-reload
* Added WSModalShard (#41)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Shahbaz Mogal <shahbaz@hume.ai>

* WSSample: print fields with missing shards last
* WSAudio: disable slots because it breaks code auto-reload (#38)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Shahbaz Mogal <shahbaz@hume.ai>

* WSSample: always validate that keys match across subdirs
* WSSample: print fields with missing shards last (#37)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Shahbaz Mogal <shahbaz@hume.ai>

…andling; ws_indexer: ensure we use relative shard paths when possible (#35)

* WSDataset: added an ignore_index option
* ws_indexer: improved error handling
* ws_indexer: ensure we use relative shard paths when possible
* WSSample: always validate that keys match across subdirs (#36)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Shahbaz Mogal <shahbaz@hume.ai>

* WSDataset: scan the dataset folder even if the index contains a field list
* WSShardInterface: remove source_dataset from the from_link interface
* WSS3Shard: remote audio shards on S3
* pupyarrow: a pure-Python PyArrow implementation with good lazy-loading support
* WSDataset: added an ignore_index option; ws_indexer: improved error handling; ws_indexer: ensure we use relative shard paths when possible (#35)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Shahbaz Mogal <shahbaz@hume.ai>

* Always keep fields as a list of tuples.
* WSS3Shard: remote audio shards on S3 (#34)
* WSDataset: scan the dataset folder even if the index contains a field list
* WSShardInterface: remove source_dataset from the from_link interface
* WSS3Shard: remote audio shards on S3
* pupyarrow: a pure-Python PyArrow implementation with good lazy-loading support
* WSDataset: added an ignore_index option; ws_indexer: improved error handling; ws_indexer: ensure we use relative shard paths when possible (#35)
* WSDataset: added an ignore_index option
* ws_indexer: improved error handling
* ws_indexer: ensure we use relative shard paths when possible
* WSSample: always validate that keys match across subdirs (#36)
* WSSample: always validate that keys match across subdirs
* WSSample: print fields with missing shards last (#37)
* WSSample: print fields with missing shards last
* WSAudio: disable slots because it breaks code auto-reload (#38)
* WSAudio: disable slots because it breaks code auto-reload
* Added WSModalShard (#41)
* Extract audio codec layer from ws_audio.py into audio_codec.py

  Separates codec concerns (decoder backends, encoder, format utils) from the data model layer (AudioReader, WSAudio) for better reusability and testability.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add ModalFileReader for Modal Volume range requests

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Centralize binary column decoding into ws_decode module

  Extract duplicated npy/pyd/txt/audio decode logic from WSShard and WSS3Shard into a shared decode_sample() function. Dispatch is now based on column type (binary) rather than column-name heuristics.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Added WSModalShard
* Big renaming and cleanups (#45)
* Move index SQL queries from WSDataset into WSIndex
  - Add old/new index format detection (partition vs dataset_path columns)
  - Add _partition_col property for unified SQL partition expression
  - Add lookup_by_index() and lookup_by_key() methods to WSIndex
  - Add shard_n_samples() and shard_global_offset() via _query_shard() helper
  - Simplify WSDataset.__getitem__ to delegate to WSIndex lookups
  - Replace all raw index.query() calls in WSDataset with WSIndex methods
  - Update ws_tools.py to use new format detection
* Fix env var name in README: WSDS_DATASET_PATH → WSDS_DATASET_SEARCH_PATH

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Move is_notebook() to utils, guard _ipython_display_ for terminal use
  - Extract is_notebook() from convplayer.py into utils.py (simplified)
  - Remove redundant _ipython_display_ from AudioReader and WSAudio (IPython already calls _repr_html_ automatically)
  - Add is_notebook() guard to WSDataset and WSSample _ipython_display_ so they fall back to print() in terminal IPython sessions

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Improve naming consistency across codebase
  - subdir → column_dir (utils, ws_sample, ws_modal_shard)
  - shard_name → shard_ref on shard interfaces and WSSample
  - dataset_path → partition in index and shard code
  - dataset_dir → dataset_root in ws_indexer

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add get_audio() helper in ws_decode, use in WSSample

  Centralizes audio column lookup logic so it can be reused outside of WSSample (e.g. from plain dicts or other sample types).

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add rng parameter to WSDataset for reproducible sampling

  Allows passing rng=42 (or a Random instance) to get deterministic sample ordering in random_sample() and sql_select(). Also removes unused needs_key variable.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix hume_wsds module path remapping and sql_filter pl.first() usage
  - Remap hume_wsds.* loader paths to wsds.* for backward compatibility with old index files that reference the former package name
  - Use pl.first() instead of exprs[0] in sql_filter for correctness

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Update module docstring with rich examples and fix doctests
  - Add comprehensive docstring to __init__.py with working doctests showcasing SQL queries, random access, lazy loading, and audio
  - Fix ws_dataset.py doctests (AudioReader src type, add shard_subsample=1)
  - Fix ws_sink.py doctest (remove invalid batch_size param)
  - Update tests.py to run wsds module doctests and fix imports

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Apply ruff formatting

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Moved the library showcase to README.md
* Use shard ref

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Shahbaz Mogal <shahbaz@hume.ai>
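The hume_wsds.* → wsds.* remapping mentioned in the commit message could look roughly like the sketch below. It is not the code from this PR: the helper names (`remap_loader_path`, `resolve_loader`) are hypothetical, and it assumes that old index files store loaders as dotted `module.attribute` strings.

```python
import importlib

# Assumption: loader references are dotted paths such as
# "hume_wsds.some_module.SomeLoader" (hypothetical example value).
OLD_PREFIX = "hume_wsds."
NEW_PREFIX = "wsds."


def remap_loader_path(path: str) -> str:
    """Rewrite a legacy hume_wsds.* dotted path to its wsds.* equivalent."""
    if path.startswith(OLD_PREFIX):
        return NEW_PREFIX + path[len(OLD_PREFIX):]
    # Paths that do not start with the old package name are left untouched.
    return path


def resolve_loader(path: str):
    """Import the object named by a (possibly legacy) dotted path."""
    module_name, _, attr = remap_loader_path(path).rpartition(".")
    return getattr(importlib.import_module(module_name), attr)
```

Matching on the prefix including the trailing dot avoids rewriting unrelated packages whose names merely begin with `hume_wsds`.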

jpc and others added 10 commits February 11, 2026 20:25

Fix env var name in README: WSDS_DATASET_PATH → WSDS_DATASET_SEARCH_PATH

fc94710

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add get_audio() helper in ws_decode, use in WSSample

7303ca4

Centralizes audio column lookup logic so it can be reused outside of WSSample (e.g. from plain dicts or other sample types).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
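To illustrate what a centralized lookup that works on plain dicts might involve, here is a minimal sketch. It does not reproduce the actual get_audio() in ws_decode: the function name `find_audio_column` and the candidate column names are assumptions made only for this example.

```python
from collections.abc import Mapping

# Hypothetical candidate column names; the real list used by wsds is not
# stated in this PR.
AUDIO_COLUMNS = ("audio", "mp3", "wav", "flac")


def find_audio_column(sample: Mapping, columns=AUDIO_COLUMNS):
    """Return the value of the first audio-like column present in `sample`.

    Works on any mapping, so it is not tied to a particular sample class.
    """
    for name in columns:
        if name in sample:
            return sample[name]
    raise KeyError(f"no audio column found; tried {list(columns)}")
```

Because the helper only requires the mapping protocol, the same call works for a plain dict and for any sample type that supports `in` and item access.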

Add rng parameter to WSDataset for reproducible sampling

d02e352

Allows passing rng=42 (or a Random instance) to get deterministic sample ordering in random_sample() and sql_select(). Also removes unused needs_key variable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
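A common way to accept either an integer seed or a Random instance is to normalize the argument once, as in the sketch below. This is a generic pattern and not the WSDataset implementation; `resolve_rng` and `sample_order` are hypothetical names.

```python
import random


def resolve_rng(rng=None) -> random.Random:
    """Turn an int seed, an existing Random, or None into a random.Random."""
    if isinstance(rng, random.Random):
        # Reuse the caller's generator so its state advances across calls.
        return rng
    # An int gives a reproducible stream; None seeds from OS entropy.
    return random.Random(rng)


def sample_order(n_items: int, k: int, rng=None) -> list:
    """Pick k distinct indices out of n_items using the resolved generator."""
    return resolve_rng(rng).sample(range(n_items), k)
```

With this shape, `sample_order(100, 5, rng=42)` returns the same indices on every call, while passing a shared `random.Random` instance yields a reproducible sequence of different draws.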

Apply ruff formatting

f01a58a

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Moved the library showcase to README.md

7e2ac16

shahbaz-humeai reviewed Mar 11, 2026


Comment thread wsds/ws_modal_shard.py Outdated

shahbaz-humeai reviewed Mar 11, 2026


Comment thread wsds/ws_s3_shard.py Outdated

shahbaz-humeai approved these changes Mar 11, 2026


Use shard ref

475fed6

shahbaz-humeai merged commit 2749465 into jpc/modal-shards Mar 13, 2026


Big renaming and cleanups #45
shahbaz-humeai merged 11 commits into jpc/modal-shards from jpc/big-renaming

jpc commented Feb 13, 2026


2 participants
