diff --git a/docs/virtual_db.md b/docs/virtual_db.md
index e3b40ac..3ec4b45 100644
--- a/docs/virtual_db.md
+++ b/docs/virtual_db.md
@@ -1,27 +1,31 @@
 # VirtualDB
 
-VirtualDB provides a unified query interface across heterogeneous datasets with
-different experimental condition structures and terminologies. Each dataset
-defines experimental conditions in its own way, with properties stored at
-different hierarchy levels (repository, dataset, or field) and using different
-naming conventions. VirtualDB uses an external YAML configuration to map these
-varying structures to a common schema, normalize factor level names (e.g.,
-"D-glucose", "dextrose", "glu" all become "glucose"), and enable cross-dataset
-queries with standardized field names and values.
+VirtualDB provides a SQL query interface across heterogeneous HuggingFace
+datasets using an in-memory DuckDB database. Each dataset defines experimental
+conditions in its own way, with properties stored at different hierarchy levels
+(repository, dataset, or field) and using different naming conventions.
+VirtualDB uses an external YAML configuration to map these varying structures
+to a common schema, normalize factor level names (e.g., "D-glucose",
+"dextrose", "glu" all become "glucose"), and enable cross-dataset queries with
+standardized field names and values.
 
-## API Reference
+For primary datasets, VirtualDB creates:
 
-::: tfbpapi.virtual_db.VirtualDB
-    options:
-      show_root_heading: true
-      show_source: true
+- **`_meta`** -- one row per sample with derived metadata columns
+- **``** -- full measurement-level data joined to the metadata view
 
-### Helper Functions
+For comparative analysis datasets, VirtualDB creates:
 
-::: tfbpapi.virtual_db.get_nested_value
-    options:
-      show_root_heading: true
+- **`_expanded`** -- the raw data with composite ID fields parsed
+  into `_source` (aliased to configured `db_name`) and
+  `_id` (sample_id) columns
+
+See the [configuration guide](virtual_db_configuration.md) for setup details
+and the [tutorial](tutorials/virtual_db_tutorial.ipynb) for usage examples.
+
+## API Reference
 
-::: tfbpapi.virtual_db.normalize_value
+::: tfbpapi.virtual_db.VirtualDB
     options:
       show_root_heading: true
+      show_source: true
diff --git a/tfbpapi/virtual_db.py b/tfbpapi/virtual_db.py
index 4fd6f3c..96097a3 100644
--- a/tfbpapi/virtual_db.py
+++ b/tfbpapi/virtual_db.py
@@ -6,13 +6,14 @@ https://brentlab.github.io/tfbpapi/huggingface_datacard/.
 Next, a developer can create a virtualDB configuration file that describes which
 huggingface repos and datasets to use, a set of common fields, datasets that contain
 comparative analytics, and more.
-VirtualDB, this code, then uses DuckDB to construct tables and views are
-which are lazily created over Parquet files which are cached locally. VirtualDB uses
-the information in the datacard to create metadata views which describe sample level
-features. Derived columns are attached to both the metadata and full data views. Any
-comparative analysis datasets are also parsed and joined to the primary datasets'
-metadata views. The expectation is that a developer will use this interface to write
-SQL queries against the views to provide an API to downstream users and applications.
+VirtualDB, this code, then uses DuckDB to construct views that are lazily created
+over Parquet files cached locally. For primary datasets, VirtualDB creates metadata
+views (one row per sample with derived columns) and full data views (measurement-level
+data joined to metadata). For comparative analysis datasets, VirtualDB creates expanded
+views that parse composite ID fields into ``_source`` (aliased to the configured
+db_name) and ``_id`` (sample_id) columns. The expectation is that a developer will
+use this interface to write SQL queries against the views to provide an API to
+downstream users and applications.
 
 Example Usage::