
BioETL: Bioactivity Data Acquisition Pipeline

Python 3.11+ License: MIT Code Style: Ruff Checked with mypy Coverage Version Security Policy

BioETL is a robust, scalable data engineering framework designed to acquire, normalize, and process bioactivity data from major public repositories (ChEMBL, PubChem, UniProt, etc.) into a unified, analysis-ready Delta Lake warehouse.


Key Features

  • Medallion Architecture: Structured data flow (Bronze -> Silver -> Gold) ensuring data quality and traceability.
  • Delta Lake Core: ACID transactions, schema enforcement, and time travel capabilities.
  • Resilience: Built-in circuit breakers, exponential backoff retries, and dead-letter queues (Quarantine).
  • Local-First Design: In-memory locking, local file storage -- no external services required (ADR-010).
  • Deterministic Writes: Reproducible outputs and deterministic retries (ADR-014).
  • Observability by Design: Metrics, tracing, and logging ports (ADR-017).
  • Unified HTTP Client: Standardized rate limiting, retry, and telemetry (ADR-032).
  • Strict Governance: Comprehensive rules for schema evolution, data contracts, and operational procedures.
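The resilience features above (exponential backoff retries feeding a quarantine on exhaustion) can be sketched in a few lines. This is a minimal illustration, not BioETL's actual implementation; retry_with_backoff is a hypothetical helper:

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Call fn(), retrying on exception with exponential backoff and jitter.

    Hypothetical sketch: the real framework adds circuit breaking and
    routes exhausted records to a quarantine (dead-letter) store.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted; a real pipeline would quarantine the record
            # delay doubles each attempt, capped, with jitter to de-synchronize clients
            delay = min(max_delay, base_delay * 2**attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))
```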

Architecture Overview

BioETL follows Hexagonal Architecture (Ports & Adapters) with Domain-Driven Design patterns:

┌─────────────────────────────────────────────────────────────┐
│                     INTERFACES (CLI)                        │
├─────────────────────────────────────────────────────────────┤
│                    COMPOSITION (DI)                         │
│              bootstrap_pipeline() → Factories               │
├─────────────────────────────────────────────────────────────┤
│                     APPLICATION                             │
│         PipelineRunner → Executor → BaseTransformer         │
├─────────────────────────────────────────────────────────────┤
│                       DOMAIN (DDD)                          │
│     Ports │ Aggregates │ Value Objects │ Entities │ Schemas │
├─────────────────────────────────────────────────────────────┤
│                    INFRASTRUCTURE                           │
│    ChEMBL │ PubChem │ UniProt │ Delta Lake │ Observability  │
└─────────────────────────────────────────────────────────────┘

Data Flow: External API -> Bronze (JSONL+zstd) -> Silver (Delta Lake) -> Gold (Analytics)

Domain Layer (DDD)

The domain layer implements Domain-Driven Design patterns:

| Component | Description |
| --- | --- |
| Ports | Protocol interfaces for dependency inversion (domain/ports/) |
| Aggregates | Domain aggregates with invariant protection (domain/aggregates/) |
| Value Objects | Immutable domain primitives (domain/value_objects/) |
| Entities | Domain entities per provider (domain/entities/) |
| Schemas | Pydantic models for data validation (domain/schemas/) |
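In this style, a port is typically a typing.Protocol owned by the domain that infrastructure adapters satisfy structurally. The sketch below is illustrative only: ActivityRecord and ActivitySource are hypothetical names, and the project's real schemas use Pydantic rather than the dataclass used here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class ActivityRecord:  # hypothetical stand-in for a Pydantic schema
    activity_id: int
    standard_value: Optional[float] = None


@runtime_checkable
class ActivitySource(Protocol):
    """Hypothetical port: any object with a matching fetch() satisfies it."""

    def fetch(self, limit: int) -> Iterable[ActivityRecord]: ...


class InMemoryActivitySource:
    """Example adapter; a real one would wrap a provider HTTP client."""

    def fetch(self, limit: int) -> Iterable[ActivityRecord]:
        return [ActivityRecord(activity_id=i) for i in range(limit)]


def count_records(source: ActivitySource, limit: int) -> int:
    # Application code depends only on the port, never on the adapter.
    return sum(1 for _ in source.fetch(limit))
```

Because the port is a Protocol, the adapter needs no inheritance; the dependency arrow points from infrastructure toward the domain, as the layer diagram above requires.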

Supported Providers

| Provider | Entity Types | Status | Rate Limit |
| --- | --- | --- | --- |
| ChEMBL | Activity, Assay, Molecule, Target, Target Component, Protein Class, Cell Line, Compound Record, Publication, Publication Term/Similarity, Subcellular Fraction, Tissue | Production | None |
| PubChem | Compound | Production | 5 req/sec |
| UniProt | Protein, ID Mapping | Production | 100 req/sec |
| PubMed | Publication | Production | 3 req/sec |
| CrossRef | Publication | Production | Polite pool |
| OpenAlex | Publication | Production | ~10 req/sec |
| Semantic Scholar | Publication | Production | 100 req/5min |

Documentation

| Document | Description |
| --- | --- |
| API Reference | Full API documentation with mkdocstrings |
| Architecture Decisions | 39 ADRs explaining design choices |
| Ubiquitous Language | Domain terminology and canonical naming |
| RULES.md | Project governance and requirements (v5.22) |
| Project Map | Documentation navigator and code map |
| CLI Reference | Command-line interface documentation |
| Operations Runbooks | Incident response and procedures |

Quick Start

Prerequisites

  • Python: Version 3.11 or higher.
  • Make: For running automation commands.
  • Docker: Optional, legacy-only (see Legacy Distributed Mode).

Installation

Option A: Automated Setup (Recommended)

Run scripts/dev/dev_setup.sh for a complete automated setup:

git clone https://github.com/SatoryKono/BioactivityDataAcquisition2.git
cd BioactivityDataAcquisition2
./scripts/dev/dev_setup.sh

The script will:

  • Check prerequisites (Python 3.11+, Git, Make)
  • Create virtual environment and install dependencies
  • Set up pre-commit hooks
  • Configure environment variables
  • Run verification checks

For quick setup without tests: ./scripts/dev/dev_setup.sh --quick

Option B: Manual Setup

  1. Clone and Install: Initialize the virtual environment and install project dependencies.

    git clone https://github.com/SatoryKono/BioactivityDataAcquisition2.git
    cd BioactivityDataAcquisition2
    make install
  2. Configure Environment (optional): Copy the example configuration if you need API keys for providers.

    cp .env.example .env

    Note: Secrets follow the pattern BIOETL_{PROVIDER}_{KEY}.

  3. Verify Installation: Run tests to ensure everything works.

    make lint && make test

Note: BioETL uses local file storage by default (data/ directory). No Docker or external services required. See Local Storage Layout and ADR-010 for details.
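Assuming standard environment-variable lookup, the BIOETL_{PROVIDER}_{KEY} secret-naming convention from step 2 can be resolved like this (get_secret is an illustrative helper, not part of the BioETL API):

```python
import os
from typing import Optional


def get_secret(provider: str, key: str) -> Optional[str]:
    """Look up a secret following the BIOETL_{PROVIDER}_{KEY} convention.

    Illustrative only: e.g. get_secret("pubchem", "api_key") reads the
    hypothetical variable BIOETL_PUBCHEM_API_KEY from the environment.
    """
    return os.environ.get(f"BIOETL_{provider.upper()}_{key.upper()}")
```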

Running Pipelines

Activate the virtual environment:

# Linux/macOS
source .venv/bin/activate

# Windows
.venv\Scripts\activate

Run the ETL pipeline using the CLI:

# Run incremental update for ChEMBL
bioetl run --pipeline chembl_activity --run-type incremental

# Run backfill with resume capability
bioetl run --pipeline chembl_activity --run-type backfill --resume

# Inspect quarantined records
bioetl quarantine inspect --pipeline chembl_activity --limit 10

# List checkpoints
bioetl checkpoint list

Development

Repository Hygiene

  • Do not store domain datasets or reference data files in repository root.
  • Keep machine-consumed reference datasets under semantic paths in data/ (for example, data/input/reference/).
  • Keep optional human-facing spreadsheet copies under docs/reference/.
  • Unified publication classifier canonical format is CSV at data/input/reference/unified_classification.csv; Excel is optional documentation copy at docs/reference/unified_classification.xlsx.

Local diagnostic artifacts

Local diagnostic files (for example, git_commit_*.txt, *_gitshow_err.txt, log_test.txt) must not be kept in the repository root and must not be committed to Git.

  • Save temporary diagnostic dumps in tmp/.
  • Save logs from local runs in logs/.
  • For ad-hoc commands, use explicit redirection (> logs/<name>.log 2>&1 or > tmp/<name>.txt 2>&1).

Testing

The project uses pytest for testing, split into Unit, Integration, and Architecture tests.

  • Setup Plugins (pytest + pre-commit):

    make setup-plugins

    This command validates required pytest plugins and installs pre-commit hooks.

  • Quick Check (with dependencies auto-synced and coverage):

    ./scripts/run_pytest.sh

    The helper bootstraps the virtual environment (installs pytest-cov, orjson, syrupy, and other test-only dependencies) and reproduces the default CI command with coverage output.

    If you prefer to run the command manually, activate the local virtual environment first to avoid --cov argument errors:

    source .venv/bin/activate
    # Install test extras so pytest-asyncio/pytest-cov options are available
    pip install -e ".[dev,tests]"
    python -m pytest tests --cov=src/bioetl --cov-report=term

    With uv, the equivalent is:

    uv sync --extra dev --extra tests
    uv run python -m pytest tests --cov=src/bioetl --cov-report=term

    To include tracing and pre-commit plugin setup:

    uv sync --extra dev --extra tests --extra tracing
    uv run python -m pre_commit install --install-hooks

    If pytest reports that required plugins are missing (pytest-asyncio, pytest-cov), re-run the sync:

    uv sync --extra dev --extra tests --extra tracing

    The ./scripts/run_pytest.sh script checks for these plugins and installs any that are missing automatically.

  • Run All Tests:

    make test
  • Run Unit Tests Only (Fast, no I/O):

    make test-unit
  • Run Integration Tests (Uses VCR.py cassettes, no network required):

    make test-integration
  • Run Architecture Tests:

    make arch-test

Codex Skills

  • Sync project skills into Codex:

    make setup-skills

    This syncs local project skills from .codex/skills into $CODEX_HOME/skills (default ~/.codex/skills).

Code Quality

Strict quality standards are enforced using ruff, mypy, and other tools.

  • Linting & Formatting:
    make lint      # Check only
    make lint-fix  # Auto-fix and format
  • Type Checking:
    make typecheck # Strict mypy
  • Complexity Check:
    make complexity

Documentation

Build and serve local documentation:

make docs-serve

Access the docs at http://localhost:8000.

Project Structure

.
├── configs/                  # YAML pipeline configurations
├── docs/                     # Documentation (Architecture, Guides, Runbooks)
│   ├── 02-architecture/      # Layer docs, diagrams, ADRs (39 decisions)
│   ├── 00-project/
│   │   ├── glossary.md       # Ubiquitous Language glossary
│   │   └── RULES.md          # Project governance (v5.22)
│   └── ...
├── src/
│   └── bioetl/
│       ├── domain/           # Pure business logic (DDD), NO I/O
│       │   ├── ports/        # Protocol interfaces (Ports)
│       │   ├── aggregates/   # DDD Aggregates with invariants
│       │   ├── value_objects/ # Immutable domain primitives
│       │   ├── entities/     # Domain entities per provider
│       │   ├── schemas/      # Pydantic/Pandera validation schemas
│       │   └── exceptions/   # Classified exceptions (Critical/Recoverable/DQ)
│       ├── application/      # Pipeline orchestration & services
│       │   ├── core/         # PipelineRunner, Executor, BaseTransformer
│       │   ├── pipelines/    # ChEMBL, PubChem, UniProt, PubMed, CrossRef, OpenAlex, Semantic Scholar (+ common utilities)
│       │   └── services/     # Application services (lifecycle, vacuum, cleanup)
│       ├── composition/      # Composition Root (DI, bootstrap)
│       │   ├── factories/    # Pipeline, storage, data source factories
│       │   └── providers/    # Provider registry
│       ├── infrastructure/   # Adapters (API clients, Delta Lake, Storage)
│       │   ├── adapters/     # HTTP clients with unified resilience
│       │   ├── storage/      # Bronze/Silver/Gold writers
│       │   ├── locking/      # In-memory locks (MemoryLock)
│       │   └── observability/ # Metrics, tracing, logging
│       └── interfaces/       # External interfaces
│           ├── cli/          # Click CLI commands
│           └── orchestration/ # Reserved (empty; signal handlers removed 2025-12-31, shutdown logic in application/core/shutdown.py)
├── tests/                    # Unit, Integration, Architecture & E2E tests
├── scripts/                  # Utility scripts (lint_terminology.py, etc.)
├── Makefile                  # Automation commands
└── pyproject.toml            # Dependencies & Tool configuration

Root layout policy

Repository root is protected by scripts/audit_root_cleanliness.py (pre-commit + CI job root-hygiene). Only approved top-level entries are allowed.

Core allowed root entries:

  • Source and tests: src/, tests/
  • Documentation and references: docs/, README.md, CHANGELOG.md
  • Build/configuration: pyproject.toml, uv.lock, Makefile, .pre-commit-config.yaml, .github/
  • Operational/project assets: configs/, scripts/, assets/, data/, reports/, grafana/
  • Legacy tracked root artifacts listed in the allowlist inside scripts/audit_root_cleanliness.py

Where to place artifacts:

  • Test artifacts and run reports → reports/
  • Logs and diagnostic dumps → reports/ (or nested folder by run date/provider)
  • Coverage artifacts (coverage.xml, htmlcov/, .coverage*) → keep out of git, generate locally/CI only
  • Reference datasets and static lookup files → docs/ (documentation reference) or data/ (runtime/local data)

Legacy Distributed Mode (REJECTED / UNSUPPORTED)

CRITICAL WARNING: Distributed deployment and Redis Locking are STRICTLY PROHIBITED by ADR-010. The instructions below are for historical reference only and must NOT be used for new deployments.

For distributed deployments with Redis locking and S3-compatible storage, you can use Docker Compose:

# Start infrastructure services (Postgres, Redis, MinIO)
make docker-up

# Run E2E tests with Docker
make test-e2e

# Stop services
make docker-down

Decision: We have officially abandoned Redis Locks in favor of a strictly Local-Only architecture.

Security

Please review our Security Policy for:

  • Threat model and trust boundaries
  • Secret management guidelines
  • Data validation architecture
  • Vulnerability reporting process

Contributing

Please read RULES.md before contributing.

  1. Ensure all tests pass: make test
  2. Check linting and types: make lint && make typecheck
  3. Follow the RFC 2119 keywords used in the requirements.

License

This project is licensed under the MIT License.
