40 changes: 22 additions & 18 deletions docs/virtual_db.md
@@ -1,27 +1,31 @@
# VirtualDB

VirtualDB provides a unified query interface across heterogeneous datasets with
different experimental condition structures and terminologies. Each dataset
defines experimental conditions in its own way, with properties stored at
different hierarchy levels (repository, dataset, or field) and using different
naming conventions. VirtualDB uses an external YAML configuration to map these
varying structures to a common schema, normalize factor level names (e.g.,
"D-glucose", "dextrose", "glu" all become "glucose"), and enable cross-dataset
queries with standardized field names and values.
VirtualDB provides a SQL query interface across heterogeneous HuggingFace
datasets using an in-memory DuckDB database. Each dataset defines experimental
conditions in its own way, with properties stored at different hierarchy levels
(repository, dataset, or field) and using different naming conventions.
VirtualDB uses an external YAML configuration to map these varying structures
to a common schema, normalize factor level names (e.g., "D-glucose",
"dextrose", "glu" all become "glucose"), and enable cross-dataset queries with
standardized field names and values.
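The factor-level normalization described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the `FACTOR_ALIASES` table, `build_lookup`, and `normalize_level` are hypothetical names, and in VirtualDB the alias mapping would come from the YAML configuration rather than a hard-coded dict.

```python
# Hypothetical alias table (assumption): canonical name -> accepted spellings.
# In VirtualDB this information would be read from the YAML configuration.
FACTOR_ALIASES = {
    "glucose": ["D-glucose", "dextrose", "glu"],
}


def build_lookup(aliases):
    """Invert {canonical: [aliases]} into a lowercase alias -> canonical map."""
    lookup = {}
    for canonical, names in aliases.items():
        lookup[canonical.lower()] = canonical
        for name in names:
            lookup[name.lower()] = canonical
    return lookup


def normalize_level(value, lookup):
    """Return the canonical level, or the stripped input if no alias matches."""
    cleaned = value.strip()
    return lookup.get(cleaned.lower(), cleaned)


LOOKUP = build_lookup(FACTOR_ALIASES)
normalized = [
    normalize_level(v, LOOKUP)
    for v in ["D-glucose", "dextrose", "glu", "galactose"]
]
print(normalized)
```

Unmapped values such as "galactose" pass through unchanged in this sketch; whether the real configuration treats unmapped levels the same way is not stated here.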

## API Reference
For primary datasets, VirtualDB creates:

::: tfbpapi.virtual_db.VirtualDB
options:
show_root_heading: true
show_source: true
- **`<db_name>_meta`** -- one row per sample with derived metadata columns
- **`<db_name>`** -- full measurement-level data joined to the metadata view
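
The relationship between the two views can be sketched as follows. This is an illustration only: it uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies, and the table `raw`, the view names `demo_meta` and `demo`, and all column names are hypothetical.

```python
import sqlite3

# Stand-in for the in-memory database; VirtualDB itself uses DuckDB.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- Hypothetical measurement-level data: several rows per sample.
    CREATE TABLE raw (sample_id INTEGER, condition TEXT, target TEXT, score REAL);
    INSERT INTO raw VALUES
        (1, 'D-glucose', 'YAL001C', 0.5),
        (1, 'D-glucose', 'YAL002W', 1.5),
        (2, 'galactose', 'YAL001C', 2.0);

    -- Metadata view: one row per sample, with a derived (normalized) column.
    CREATE VIEW demo_meta AS
    SELECT DISTINCT
        sample_id,
        CASE WHEN condition IN ('D-glucose', 'dextrose', 'glu')
             THEN 'glucose' ELSE condition END AS carbon_source
    FROM raw;

    -- Full data view: measurement rows joined to the metadata view.
    CREATE VIEW demo AS
    SELECT r.sample_id, m.carbon_source, r.target, r.score
    FROM raw AS r
    JOIN demo_meta AS m ON r.sample_id = m.sample_id;
    """
)

meta_rows = con.execute("SELECT * FROM demo_meta ORDER BY sample_id").fetchall()
per_source = con.execute(
    "SELECT carbon_source, COUNT(*) FROM demo "
    "GROUP BY carbon_source ORDER BY carbon_source"
).fetchall()
print(meta_rows)
print(per_source)
```

Querying the `_meta` view is enough to select samples; the full view is only needed when measurement-level rows are required.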

### Helper Functions
For comparative analysis datasets, VirtualDB creates:

::: tfbpapi.virtual_db.get_nested_value
options:
show_root_heading: true
- **`<db_name>_expanded`** -- the raw data with composite ID fields parsed
into `<link_field>_source` (aliased to the configured `db_name`) and
`<link_field>_id` (the `sample_id`) columns
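
As a rough sketch of what the expansion does to a single row, assume a composite ID of the form `<source>;<sample_id>`. That format, the `;` separator, the `SOURCE_TO_DB_NAME` mapping, the field name `binding_id`, and the function `expand_link_field` are all assumptions made for illustration; the actual composite ID layout is defined by the datasets and the configuration.

```python
# Hypothetical mapping (assumption) from a dataset source to its configured db_name.
SOURCE_TO_DB_NAME = {"example_org/example_repo": "demo"}


def expand_link_field(row, link_field, sep=";"):
    """Split row[link_field] into <link_field>_source and <link_field>_id.

    Assumes the composite ID looks like "<source><sep><sample_id>".
    """
    source, sample_id = row[link_field].rsplit(sep, 1)
    expanded = dict(row)  # leave the input row untouched
    # Alias the source to the configured db_name when one is known.
    expanded[f"{link_field}_source"] = SOURCE_TO_DB_NAME.get(source, source)
    expanded[f"{link_field}_id"] = sample_id
    return expanded


row = {"binding_id": "example_org/example_repo;42", "score": 0.9}
expanded = expand_link_field(row, "binding_id")
print(expanded)
```

With the source aliased to a `db_name`, the `_id` column can be joined against the `sample_id` of the matching `<db_name>_meta` view.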

See the [configuration guide](virtual_db_configuration.md) for setup details
and the [tutorial](tutorials/virtual_db_tutorial.ipynb) for usage examples.

## API Reference

::: tfbpapi.virtual_db.normalize_value
::: tfbpapi.virtual_db.VirtualDB
options:
show_root_heading: true
show_source: true
15 changes: 8 additions & 7 deletions tfbpapi/virtual_db.py
@@ -6,13 +6,14 @@
https://brentlab.github.io/tfbpapi/huggingface_datacard/. Next, a developer can create
a virtualDB configuration file that describes which huggingface repos and datasets to
use, a set of common fields, datasets that contain comparative analytics, and more.
VirtualDB, this code, then uses DuckDB to construct tables and views are
which are lazily created over Parquet files which are cached locally. VirtualDB uses
the information in the datacard to create metadata views which describe sample level
features. Derived columns are attached to both the metadata and full data views. Any
comparative analysis datasets are also parsed and joined to the primary datasets'
metadata views. The expectation is that a developer will use this interface to write
SQL queries against the views to provide an API to downstream users and applications.
VirtualDB, this code, then uses DuckDB to construct views that are lazily created
over Parquet files cached locally. For primary datasets, VirtualDB creates metadata
views (one row per sample with derived columns) and full data views (measurement-level
data joined to metadata). For comparative analysis datasets, VirtualDB creates expanded
views that parse composite ID fields into ``_source`` (aliased to the configured
db_name) and ``_id`` (sample_id) columns. The expectation is that a developer will
use this interface to write SQL queries against the views to provide an API to
downstream users and applications.

Example Usage::
