BioETL is a robust, scalable data engineering framework designed to acquire, normalize, and process bioactivity data from major public repositories (ChEMBL, PubChem, UniProt, etc.) into a unified, analysis-ready Delta Lake warehouse.
- Medallion Architecture: Structured data flow (Bronze -> Silver -> Gold) ensuring data quality and traceability.
- Delta Lake Core: ACID transactions, schema enforcement, and time travel capabilities.
- Resilience: Built-in circuit breakers, exponential backoff retries, and dead-letter queues (Quarantine).
- Local-First Design: In-memory locking, local file storage -- no external services required (ADR-010).
- Deterministic Writes: Reproducible outputs and deterministic retries (ADR-014).
- Observability by Design: Metrics, tracing, and logging ports (ADR-017).
- Unified HTTP Client: Standardized rate limiting, retry, and telemetry (ADR-032).
- Strict Governance: Comprehensive rules for schema evolution, data contracts, and operational procedures.
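As an illustration of the resilience features above, a deterministic exponential backoff schedule (no jitter, so retry timing is reproducible across runs, cf. ADR-014) can be sketched as follows. Function names and defaults are illustrative, not BioETL's actual API:

```python
import time


def backoff_delays(base: float = 0.5, factor: float = 2.0, retries: int = 4) -> list[float]:
    """Deterministic exponential backoff schedule (no jitter)."""
    return [base * factor**attempt for attempt in range(retries)]


def call_with_retry(func, delays, sleep=time.sleep):
    """Call `func`, retrying on any exception with the given backoff schedule.

    One attempt per entry in `delays`; the last failure is re-raised.
    A real client would narrow the exception types it retries on.
    """
    for i, delay in enumerate(delays):
        try:
            return func()
        except Exception:
            if i == len(delays) - 1:
                raise
            sleep(delay)
```

In the real framework the retry policy sits behind the unified HTTP client (ADR-032), combined with circuit breakers and dead-letter quarantine.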
BioETL follows Hexagonal Architecture (Ports & Adapters) with Domain-Driven Design patterns:
┌─────────────────────────────────────────────────────────────┐
│ INTERFACES (CLI) │
├─────────────────────────────────────────────────────────────┤
│ COMPOSITION (DI) │
│ bootstrap_pipeline() → Factories │
├─────────────────────────────────────────────────────────────┤
│ APPLICATION │
│ PipelineRunner → Executor → BaseTransformer │
├─────────────────────────────────────────────────────────────┤
│ DOMAIN (DDD) │
│ Ports │ Aggregates │ Value Objects │ Entities │ Schemas │
├─────────────────────────────────────────────────────────────┤
│ INFRASTRUCTURE │
│ ChEMBL │ PubChem │ UniProt │ Delta Lake │ Observability │
└─────────────────────────────────────────────────────────────┘
Data Flow: External API -> Bronze (JSONL+zstd) -> Silver (Delta Lake) -> Gold (Analytics)
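The Bronze step above lands raw records as compressed JSON Lines. A minimal sketch of that idea, using the stdlib `gzip` module as a stand-in for the zstd codec the pipeline actually uses (function names are illustrative):

```python
import gzip
import json


def write_bronze(records: list[dict], path: str) -> None:
    """Serialize records as JSON Lines, then compress.

    BioETL's Bronze layer uses zstd; gzip stands in here so the
    sketch needs only the standard library. Keys are sorted so the
    output bytes are deterministic for identical inputs.
    """
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write(payload)


def read_bronze(path: str) -> list[dict]:
    """Decompress and parse a Bronze JSONL file back into records."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Silver and Gold then promote these raw records into validated Delta Lake tables and analytics outputs.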
The domain layer implements Domain-Driven Design patterns:
| Component | Description |
|---|---|
| Ports | Protocol interfaces for dependency inversion (domain/ports/) |
| Aggregates | Domain aggregates with invariant protection (domain/aggregates/) |
| Value Objects | Immutable domain primitives (domain/value_objects/) |
| Entities | Domain entities per provider (domain/entities/) |
| Schemas | Pydantic models for data validation (domain/schemas/) |
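As a rough sketch of how Ports and Value Objects from the table fit together (all names here are illustrative, not BioETL's actual interfaces):

```python
from dataclasses import dataclass
from typing import Protocol


class DataSourcePort(Protocol):
    """Port: the domain depends on this Protocol, never on a concrete client."""

    def fetch(self, entity_id: str) -> dict: ...


@dataclass(frozen=True)
class ChemblId:
    """Value object: immutable, validating its own invariant on construction."""

    value: str

    def __post_init__(self) -> None:
        if not self.value.startswith("CHEMBL"):
            raise ValueError(f"invalid ChEMBL ID: {self.value!r}")


class FakeChemblSource:
    """Adapter satisfying the port structurally (no inheritance required)."""

    def fetch(self, entity_id: str) -> dict:
        return {"id": entity_id}
```

Because `Protocol` uses structural typing, infrastructure adapters can satisfy a port without importing anything from the domain layer, which is what keeps the dependency arrow pointing inward.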
| Provider | Entity Types | Status | Rate Limit |
|---|---|---|---|
| ChEMBL | Activity, Assay, Molecule, Target, Target Component, Protein Class, Cell Line, Compound Record, Publication, Publication Term/Similarity, Subcellular Fraction, Tissue | Production | None |
| PubChem | Compound | Production | 5 req/sec |
| UniProt | Protein, ID Mapping | Production | 100 req/sec |
| PubMed | Publication | Production | 3 req/sec |
| CrossRef | Publication | Production | Polite pool |
| OpenAlex | Publication | Production | ~10 req/sec |
| Semantic Scholar | Publication | Production | 100 req/5min |
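Per-provider limits like those above are commonly enforced with a token-bucket limiter. A minimal sketch, not BioETL's actual implementation (the real rate limiting lives in the unified HTTP client, ADR-032):

```python
import time


class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.clock = clock        # injectable for deterministic tests
        self.last = clock()

    def try_acquire(self) -> bool:
        """Consume one token if available; False means the caller must wait."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `rate=5.0`, this models PubChem's 5 req/sec limit; quota-style limits such as Semantic Scholar's 100 req/5min map to `rate=100/300`.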
| Document | Description |
|---|---|
| API Reference | Full API documentation with mkdocstrings |
| Architecture Decisions | 39 ADRs explaining design choices |
| Ubiquitous Language | Domain terminology and canonical naming |
| RULES.md | Project governance and requirements (v5.22) |
| Project Map | Documentation navigator and code map |
| CLI Reference | Command-line interface documentation |
| Operations Runbooks | Incident response and procedures |
- Python: Version 3.11 or higher.
- Make: For running automation commands.
- Docker: Optional, legacy-only (see Legacy Distributed Mode).
Use the `scripts/dev/dev_setup.sh` script for a complete automated setup:

```shell
git clone https://github.com/SatoryKono/BioactivityDataAcquisition2.git
cd BioactivityDataAcquisition2
./scripts/dev/dev_setup.sh
```

The script will:
- Check prerequisites (Python 3.11+, Git, Make)
- Create virtual environment and install dependencies
- Set up pre-commit hooks
- Configure environment variables
- Run verification checks
For quick setup without tests: `./scripts/dev/dev_setup.sh --quick`
- Clone and Install: Initialize the virtual environment and install project dependencies.

  ```shell
  git clone https://github.com/SatoryKono/BioactivityDataAcquisition2.git
  cd BioactivityDataAcquisition2
  make install
  ```

- Configure Environment (optional): Copy the example configuration if you need API keys for providers.

  ```shell
  cp .env.example .env
  ```

  Note: Secrets follow the pattern `BIOETL_{PROVIDER}_{KEY}`.

- Verify Installation: Run tests to ensure everything works.

  ```shell
  make lint && make test
  ```

Note: BioETL uses local file storage by default (the `data/` directory). No Docker or external services are required. See Local Storage Layout and ADR-010 for details.
Activate the virtual environment:

```shell
# Linux/macOS
source .venv/bin/activate

# Windows
.venv\Scripts\activate
```

Run the ETL pipeline using the CLI:

```shell
# Run incremental update for ChEMBL
bioetl run --pipeline chembl_activity --run-type incremental

# Run backfill with resume capability
bioetl run --pipeline chembl_activity --run-type backfill --resume

# Inspect quarantined records
bioetl quarantine inspect --pipeline chembl_activity --limit 10

# List checkpoints
bioetl checkpoint list
```

- Do not store domain datasets or reference data files in the repository root.
- Keep machine-consumed reference datasets under semantic paths in `data/` (for example, `data/input/reference/`).
- Keep optional human-facing spreadsheet copies under `docs/reference/`.
- The canonical format for the unified publication classifier is CSV at `data/input/reference/unified_classification.csv`; the Excel file at `docs/reference/unified_classification.xlsx` is an optional documentation copy.
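Consuming the canonical classification CSV needs nothing beyond the stdlib `csv` module; note that the column names in this sketch are hypothetical, not the file's actual header:

```python
import csv


def load_classification(fh) -> list[dict]:
    """Parse a unified-classification CSV (open text file or file-like
    object) into a list of row dicts keyed by the header row."""
    return list(csv.DictReader(fh))
```

Keeping CSV as the machine-consumed canonical format (and Excel only as a human-facing copy) means this loader, and diffs in code review, stay trivial.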
Local diagnostic files (for example, `git_commit_*.txt`, `*_gitshow_err.txt`, `log_test.txt`) must not be stored in the repository root and must not be committed to Git.
- Save temporary diagnostic dumps to `tmp/`.
- Save logs from local runs to `logs/`.
- For ad-hoc commands, use explicit redirection (`> logs/<name>.log 2>&1` or `> tmp/<name>.txt 2>&1`).
The project uses pytest for testing, split into Unit, Integration, and Architecture tests.
- Setup Plugins (pytest + pre-commit):

  ```shell
  make setup-plugins
  ```

  This command validates required pytest plugins and installs pre-commit hooks.

- Quick Check (with dependencies auto-synced and coverage):

  ```shell
  ./scripts/run_pytest.sh
  ```

  The helper bootstraps the virtual environment (installs `pytest-cov`, `orjson`, `syrupy`, and other test-only dependencies) and reproduces the default CI command with coverage output. If you prefer to run the command manually, activate the local virtual environment first to avoid `--cov` argument errors:

  ```shell
  source .venv/bin/activate
  # Install test extras so pytest-asyncio/pytest-cov options are available
  pip install -e ".[dev,tests]"
  python -m pytest tests --cov=src/bioetl --cov-report=term
  ```

  With `uv`, the equivalent is:

  ```shell
  uv sync --extra dev --extra tests
  uv run python -m pytest tests --cov=src/bioetl --cov-report=term
  ```

  To include tracing and pre-commit plugin setup:

  ```shell
  uv sync --extra dev --extra tests --extra tracing
  uv run python -m pre_commit install --install-hooks
  ```

  If `pytest` reports that required plugins (`pytest-asyncio`, `pytest-cov`) are missing, re-run the sync:

  ```shell
  uv sync --extra dev --extra tests --extra tracing
  ```

  The `./scripts/run_pytest.sh` script checks for these plugins and installs any that are missing automatically.

- Run All Tests:

  ```shell
  make test
  ```

- Run Unit Tests Only (fast, no I/O):

  ```shell
  make test-unit
  ```

- Run Integration Tests (uses VCR.py cassettes, no network required):

  ```shell
  make test-integration
  ```

- Run Architecture Tests:

  ```shell
  make arch-test
  ```

- Sync project skills into Codex:

  ```shell
  make setup-skills
  ```

  This syncs local project skills from `.codex/skills` into `$CODEX_HOME/skills` (default `~/.codex/skills`).
Strict quality standards are enforced using ruff, mypy, and other tools.

- Linting & Formatting:

  ```shell
  make lint      # Check only
  make lint-fix  # Auto-fix and format
  ```

- Type Checking:

  ```shell
  make typecheck  # Strict mypy
  ```

- Complexity Check:

  ```shell
  make complexity
  ```
Build and serve local documentation:

```shell
make docs-serve
```

Access the docs at http://localhost:8000.
.
├── configs/ # YAML pipeline configurations
├── docs/ # Documentation (Architecture, Guides, Runbooks)
│ ├── 02-architecture/ # Layer docs, diagrams, ADRs (39 decisions)
│ ├── 00-project/
│ │ ├── glossary.md # Ubiquitous Language glossary
│ │ └── RULES.md # Project governance (v5.22)
│ └── ...
├── src/
│ └── bioetl/
│ ├── domain/ # Pure business logic (DDD), NO I/O
│ │ ├── ports/ # Protocol interfaces (Ports)
│ │ ├── aggregates/ # DDD Aggregates with invariants
│ │ ├── value_objects/ # Immutable domain primitives
│ │ ├── entities/ # Domain entities per provider
│ │ ├── schemas/ # Pydantic/Pandera validation schemas
│ │ └── exceptions/ # Classified exceptions (Critical/Recoverable/DQ)
│ ├── application/ # Pipeline orchestration & services
│ │ ├── core/ # PipelineRunner, Executor, BaseTransformer
│ │ ├── pipelines/ # ChEMBL, PubChem, UniProt, PubMed, CrossRef, OpenAlex, Semantic Scholar (+ common utilities)
│ │ └── services/ # Application services (lifecycle, vacuum, cleanup)
│ ├── composition/ # Composition Root (DI, bootstrap)
│ │ ├── factories/ # Pipeline, storage, data source factories
│ │ └── providers/ # Provider registry
│ ├── infrastructure/ # Adapters (API clients, Delta Lake, Storage)
│ │ ├── adapters/ # HTTP clients with unified resilience
│ │ ├── storage/ # Bronze/Silver/Gold writers
│ │ ├── locking/ # In-memory locks (MemoryLock)
│ │ └── observability/ # Metrics, tracing, logging
│ └── interfaces/ # External interfaces
│ ├── cli/ # Click CLI commands
│ └── orchestration/ # Reserved (empty; signal handlers removed 2025-12-31, shutdown logic in application/core/shutdown.py)
├── tests/ # Unit, Integration, Architecture & E2E tests
├── scripts/ # Utility scripts (lint_terminology.py, etc.)
├── Makefile # Automation commands
└── pyproject.toml # Dependencies & Tool configuration
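The `domain/exceptions/` classification shown in the tree above (Critical / Recoverable / Data-Quality) can be sketched as a small hierarchy; these class and function names are illustrative, not BioETL's actual exceptions:

```python
class BioETLError(Exception):
    """Base class; subclasses encode how the pipeline should react."""


class CriticalError(BioETLError):
    """Unrecoverable failure: abort the run immediately."""


class RecoverableError(BioETLError):
    """Transient failure (e.g. HTTP timeout): safe to retry."""


class DataQualityError(BioETLError):
    """Bad record: route it to quarantine instead of failing the run."""


def classify(exc: Exception) -> str:
    """Map an exception to the pipeline's reaction."""
    if isinstance(exc, RecoverableError):
        return "retry"
    if isinstance(exc, DataQualityError):
        return "quarantine"
    return "abort"  # CriticalError and anything unclassified
```

Tying the reaction to the exception type keeps retry/quarantine policy in one place instead of scattered across adapters.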
Repository root is protected by `scripts/audit_root_cleanliness.py` (pre-commit + CI job `root-hygiene`).
Only approved top-level entries are allowed.

Core allowed root entries:

- Source and tests: `src/`, `tests/`
- Documentation and references: `docs/`, `README.md`, `CHANGELOG.md`
- Build/configuration: `pyproject.toml`, `uv.lock`, `Makefile`, `.pre-commit-config.yaml`, `.github/`
- Operational/project assets: `configs/`, `scripts/`, `assets/`, `data/`, `reports/`, `grafana/`
- Legacy tracked root artifacts listed in the allowlist inside `scripts/audit_root_cleanliness.py`

Where to place artifacts:

- Test artifacts and run reports → `reports/`
- Logs and diagnostic dumps → `reports/` (or a nested folder by run date/provider)
- Coverage artifacts (`coverage.xml`, `htmlcov/`, `.coverage*`) → keep out of Git; generate locally or in CI only
- Reference datasets and static lookup files → `docs/` (documentation reference) or `data/` (runtime/local data)
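Conceptually, the root-hygiene audit reduces to a set difference against an allowlist. A simplified sketch (the real allowlist and reporting live in `scripts/audit_root_cleanliness.py`, which may differ):

```python
# Abbreviated allowlist for illustration; the canonical one lives in
# scripts/audit_root_cleanliness.py.
ALLOWED_ROOT_ENTRIES = {
    "src", "tests", "docs", "README.md", "CHANGELOG.md",
    "pyproject.toml", "uv.lock", "Makefile", ".pre-commit-config.yaml",
    ".github", "configs", "scripts", "assets", "data", "reports", "grafana",
}


def audit_root(entries) -> list[str]:
    """Return the top-level entries that are NOT on the allowlist.

    An empty result means the repository root is clean; pre-commit/CI
    would fail on any non-empty result.
    """
    return sorted(set(entries) - ALLOWED_ROOT_ENTRIES)
```

Run against `os.listdir(".")`, a stray `coverage.xml` or `log_test.txt` in the root would surface immediately.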
CRITICAL WARNING: Distributed deployment and Redis locking are STRICTLY PROHIBITED by ADR-010. The instructions below are for historical reference only and must NOT be used for new deployments.

For distributed deployments with Redis locking and S3-compatible storage, you can use Docker Compose:

```shell
# Start infrastructure services (Postgres, Redis, MinIO)
make docker-up

# Run E2E tests with Docker
make test-e2e

# Stop services
make docker-down
```

Decision: We have officially abandoned Redis locks in favor of a strictly local-only architecture.
Please review our Security Policy for:
- Threat model and trust boundaries
- Secret management guidelines
- Data validation architecture
- Vulnerability reporting process
Please read RULES.md before contributing.
- Ensure all tests pass: `make test`
- Check types and linting: `make lint`
- Follow the RFC 2119 keywords in requirements.
This project is licensed under the MIT License.