
BioETL: Bioactivity Data Acquisition Pipeline

Python 3.11+ License: MIT Code Style: Ruff Checked with mypy Coverage Version Security Policy

BioETL is a robust, scalable data engineering framework designed to acquire, normalize, and process bioactivity data from major public repositories (ChEMBL, PubChem, UniProt, etc.) into a unified, analysis-ready Delta Lake warehouse.


Key Features

  • Medallion Architecture: Structured data flow (Bronze -> Silver -> Gold) ensuring data quality and traceability.
  • Delta Lake Core: ACID transactions, schema enforcement, and time travel capabilities.
  • Resilience: Built-in circuit breakers, exponential backoff retries, and dead-letter queues (Quarantine).
  • Local-First Design: In-memory locking, local file storage -- no external services required (ADR-010).
  • Deterministic Writes: Reproducible outputs and deterministic retries (ADR-014).
  • Observability by Design: Metrics, tracing, and logging ports (ADR-017).
  • Unified HTTP Client: Standardized rate limiting, retry, and telemetry (ADR-032).
  • Strict Governance: Comprehensive rules for schema evolution, data contracts, and operational procedures.
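The resilience features above (exponential backoff retries feeding a quarantine on exhaustion) can be sketched in a few lines. This is a minimal illustration, not BioETL's actual implementation; retry_with_backoff is a hypothetical helper:

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Call fn(), retrying on exception with exponential backoff and jitter.

    Hypothetical sketch: the real framework adds circuit breaking and
    routes exhausted records to a quarantine (dead-letter) store.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted; a real pipeline would quarantine the record
            # delay doubles each attempt, capped, with jitter to de-synchronize clients
            delay = min(max_delay, base_delay * 2**attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))
```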

Architecture Overview

BioETL follows Hexagonal Architecture (Ports & Adapters) with Domain-Driven Design patterns:

┌─────────────────────────────────────────────────────────────┐
│                     INTERFACES (CLI)                        │
├─────────────────────────────────────────────────────────────┤
│                    COMPOSITION (DI)                         │
│              bootstrap_pipeline() → Factories               │
├─────────────────────────────────────────────────────────────┤
│                     APPLICATION                             │
│         PipelineRunner → Executor → BaseTransformer         │
├─────────────────────────────────────────────────────────────┤
│                       DOMAIN (DDD)                          │
│     Ports │ Aggregates │ Value Objects │ Entities │ Schemas │
├─────────────────────────────────────────────────────────────┤
│                    INFRASTRUCTURE                           │
│    ChEMBL │ PubChem │ UniProt │ Delta Lake │ Observability  │
└─────────────────────────────────────────────────────────────┘

Data Flow: External API -> Bronze (JSONL+zstd) -> Silver (Delta Lake) -> Gold (Analytics)

Domain Layer (DDD)

The domain layer implements Domain-Driven Design patterns:

| Component | Description |
| --- | --- |
| Ports | Protocol interfaces for dependency inversion (domain/ports/) |
| Aggregates | Domain aggregates with invariant protection (domain/aggregates/) |
| Value Objects | Immutable domain primitives (domain/value_objects/) |
| Entities | Domain entities per provider (domain/entities/) |
| Schemas | Pydantic models for data validation (domain/schemas/) |
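In this style, a port is typically a typing.Protocol owned by the domain that infrastructure adapters satisfy structurally. The sketch below is illustrative only: ActivityRecord and ActivitySource are hypothetical names, and the project's real schemas use Pydantic rather than the dataclass used here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class ActivityRecord:  # hypothetical stand-in for a Pydantic schema
    activity_id: int
    standard_value: Optional[float] = None


@runtime_checkable
class ActivitySource(Protocol):
    """Hypothetical port: any object with a matching fetch() satisfies it."""

    def fetch(self, limit: int) -> Iterable[ActivityRecord]: ...


class InMemoryActivitySource:
    """Example adapter; a real one would wrap a provider HTTP client."""

    def fetch(self, limit: int) -> Iterable[ActivityRecord]:
        return [ActivityRecord(activity_id=i) for i in range(limit)]


def count_records(source: ActivitySource, limit: int) -> int:
    # Application code depends only on the port, never on the adapter.
    return sum(1 for _ in source.fetch(limit))
```

Because the port is a Protocol, the adapter needs no inheritance; the dependency arrow points from infrastructure toward the domain, as the layer diagram above requires.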

Supported Providers

| Provider | Entity Types | Status | Rate Limit |
| --- | --- | --- | --- |
| ChEMBL | Activity, Assay, Molecule, Target, Target Component, Protein Class, Cell Line, Compound Record, Publication, Publication Term/Similarity, Subcellular Fraction, Tissue | Production | None |
| PubChem | Compound | Production | 5 req/sec |
| UniProt | Protein, ID Mapping | Production | 100 req/sec |
| PubMed | Publication | Production | 3 req/sec |
| CrossRef | Publication | Production | Polite pool |
| OpenAlex | Publication | Production | ~10 req/sec |
| Semantic Scholar | Publication | Production | 100 req/5min |

Documentation

| Document | Description |
| --- | --- |
| API Reference | Full API documentation with mkdocstrings |
| Architecture Decisions | 39 ADRs explaining design choices |
| Ubiquitous Language | Domain terminology and canonical naming |
| RULES.md | Project governance and requirements (v5.22) |
| Project Map | Documentation navigator and code map |
| CLI Reference | Command-line interface documentation |
| Operations Runbooks | Incident response and procedures |

Quick Start

Prerequisites

  • Python: Version 3.11 or higher.
  • Make: For running automation commands.
  • Docker: Optional, legacy-only (see Legacy Distributed Mode).

Installation

Option A: Automated Setup (Recommended)

Run scripts/dev/dev_setup.sh for a complete automated setup:

git clone https://github.com/SatoryKono/BioactivityDataAcquisition2.git
cd BioactivityDataAcquisition2
./scripts/dev/dev_setup.sh

The script will:

  • Check prerequisites (Python 3.11+, Git, Make)
  • Create virtual environment and install dependencies
  • Set up pre-commit hooks
  • Configure environment variables
  • Run verification checks

For quick setup without tests: ./scripts/dev/dev_setup.sh --quick

Option B: Manual Setup

  1. Clone and Install: Initialize the virtual environment and install project dependencies.

    git clone https://github.com/SatoryKono/BioactivityDataAcquisition2.git
    cd BioactivityDataAcquisition2
    make install
  2. Configure Environment (optional): Copy the example configuration if you need API keys for providers.

    cp .env.example .env

    Note: Secrets follow the pattern BIOETL_{PROVIDER}_{KEY}.

  3. Verify Installation: Run tests to ensure everything works.

    make lint && make test

Note: BioETL uses local file storage by default (data/ directory). No Docker or external services required. See Local Storage Layout and ADR-010 for details.
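Assuming standard environment-variable lookup, the BIOETL_{PROVIDER}_{KEY} secret-naming convention from step 2 can be resolved like this (get_secret is an illustrative helper, not part of the BioETL API):

```python
import os
from typing import Optional


def get_secret(provider: str, key: str) -> Optional[str]:
    """Look up a secret following the BIOETL_{PROVIDER}_{KEY} convention.

    Illustrative only: e.g. get_secret("pubchem", "api_key") reads the
    hypothetical variable BIOETL_PUBCHEM_API_KEY from the environment.
    """
    return os.environ.get(f"BIOETL_{provider.upper()}_{key.upper()}")
```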

Running Pipelines

Activate the virtual environment:

# Linux/macOS
source .venv/bin/activate

# Windows
.venv\Scripts\activate

Run the ETL pipeline using the CLI:

# Run incremental update for ChEMBL
bioetl run --pipeline chembl_activity --run-type incremental

# Run backfill with resume capability
bioetl run --pipeline chembl_activity --run-type backfill --resume

# Inspect quarantined records
bioetl quarantine inspect --pipeline chembl_activity --limit 10

# List checkpoints
bioetl checkpoint list

Development

Repository Hygiene

  • Do not store domain datasets or reference data files in repository root.
  • Keep machine-consumed reference datasets under semantic paths in data/ (for example, data/input/reference/).
  • Keep optional human-facing spreadsheet copies under docs/reference/.
  • Unified publication classifier canonical format is CSV at data/input/reference/unified_classification.csv; Excel is optional documentation copy at docs/reference/unified_classification.xlsx.

Local diagnostic artifacts

Local diagnostic files (for example, git_commit_*.txt, *_gitshow_err.txt, log_test.txt) must not be kept in the repository root and must not be committed to Git.

  • Save temporary diagnostic dumps in tmp/.
  • Save logs from local runs in logs/.
  • For ad-hoc commands, use explicit redirection (> logs/<name>.log 2>&1 or > tmp/<name>.txt 2>&1).

Testing

The project uses pytest for testing, split into Unit, Integration, and Architecture tests.

  • Setup Plugins (pytest + pre-commit):

    make setup-plugins

    This command validates required pytest plugins and installs pre-commit hooks.

  • Quick Check (with dependencies auto-synced and coverage):

    ./scripts/run_pytest.sh

    The helper bootstraps the virtual environment (installs pytest-cov, orjson, syrupy, and other test-only dependencies) and reproduces the default CI command with coverage output.

    If you prefer to run the command manually, activate the local virtual environment first to avoid --cov argument errors:

    source .venv/bin/activate
    # Install test extras so pytest-asyncio/pytest-cov options are available
    pip install -e ".[dev,tests]"
    python -m pytest tests --cov=src/bioetl --cov-report=term

    With uv, the equivalent is:

    uv sync --extra dev --extra tests
    uv run python -m pytest tests --cov=src/bioetl --cov-report=term

    To include tracing and pre-commit plugin setup:

    uv sync --extra dev --extra tests --extra tracing
    uv run python -m pre_commit install --install-hooks

    If pytest reports that required plugins are missing (pytest-asyncio, pytest-cov), re-run the sync:

    uv sync --extra dev --extra tests --extra tracing

    The ./scripts/run_pytest.sh script checks for these plugins and installs any that are missing automatically.

  • Run All Tests:

    make test
  • Run Unit Tests Only (Fast, no I/O):

    make test-unit
  • Run Integration Tests (Uses VCR.py cassettes, no network required):

    make test-integration
  • Run Architecture Tests:

    make arch-test

Codex Skills

  • Sync project skills into Codex:

    make setup-skills

    This syncs local project skills from .codex/skills into $CODEX_HOME/skills (default ~/.codex/skills).

Code Quality

Strict quality standards are enforced using ruff, mypy, and other tools.

  • Linting & Formatting:
    make lint      # Check only
    make lint-fix  # Auto-fix and format
  • Type Checking:
    make typecheck # Strict mypy
  • Complexity Check:
    make complexity

Documentation

Build and serve local documentation:

make docs-serve

Access the docs at http://localhost:8000.

Project Structure

.
├── configs/                  # YAML pipeline configurations
├── docs/                     # Documentation (Architecture, Guides, Runbooks)
│   ├── 02-architecture/      # Layer docs, diagrams, ADRs (39 decisions)
│   ├── 00-project/
│   │   ├── glossary.md       # Ubiquitous Language glossary
│   │   └── RULES.md          # Project governance (v5.22)
│   └── ...
├── src/
│   └── bioetl/
│       ├── domain/           # Pure business logic (DDD), NO I/O
│       │   ├── ports/        # Protocol interfaces (Ports)
│       │   ├── aggregates/   # DDD Aggregates with invariants
│       │   ├── value_objects/ # Immutable domain primitives
│       │   ├── entities/     # Domain entities per provider
│       │   ├── schemas/      # Pydantic/Pandera validation schemas
│       │   └── exceptions/   # Classified exceptions (Critical/Recoverable/DQ)
│       ├── application/      # Pipeline orchestration & services
│       │   ├── core/         # PipelineRunner, Executor, BaseTransformer
│       │   ├── pipelines/    # ChEMBL, PubChem, UniProt, PubMed, CrossRef, OpenAlex, Semantic Scholar (+ common utilities)
│       │   └── services/     # Application services (lifecycle, vacuum, cleanup)
│       ├── composition/      # Composition Root (DI, bootstrap)
│       │   ├── factories/    # Pipeline, storage, data source factories
│       │   └── providers/    # Provider registry
│       ├── infrastructure/   # Adapters (API clients, Delta Lake, Storage)
│       │   ├── adapters/     # HTTP clients with unified resilience
│       │   ├── storage/      # Bronze/Silver/Gold writers
│       │   ├── locking/      # In-memory locks (MemoryLock)
│       │   └── observability/ # Metrics, tracing, logging
│       └── interfaces/       # External interfaces
│           ├── cli/          # Click CLI commands
│           └── orchestration/ # Reserved (empty; signal handlers removed 2025-12-31, shutdown logic in application/core/shutdown.py)
├── tests/                    # Unit, Integration, Architecture & E2E tests
├── scripts/                  # Utility scripts (lint_terminology.py, etc.)
├── Makefile                  # Automation commands
└── pyproject.toml            # Dependencies & Tool configuration

Root layout policy

Repository root is protected by scripts/audit_root_cleanliness.py (pre-commit + CI job root-hygiene). Only approved top-level entries are allowed.

Core allowed root entries:

  • Source and tests: src/, tests/
  • Documentation and references: docs/, README.md, CHANGELOG.md
  • Build/configuration: pyproject.toml, uv.lock, Makefile, .pre-commit-config.yaml, .github/
  • Operational/project assets: configs/, scripts/, assets/, data/, reports/, grafana/
  • Legacy tracked root artifacts listed in the allowlist inside scripts/audit_root_cleanliness.py

Where to place artifacts:

  • Test artifacts and run reports → reports/
  • Logs and diagnostic dumps → reports/ (or nested folder by run date/provider)
  • Coverage artifacts (coverage.xml, htmlcov/, .coverage*) → keep out of git, generate locally/CI only
  • Reference datasets and static lookup files → docs/ (documentation reference) or data/ (runtime/local data)

Legacy Distributed Mode (REJECTED / UNSUPPORTED)

CRITICAL WARNING: Distributed deployment and Redis Locking are STRICTLY PROHIBITED by ADR-010. The instructions below are for historical reference only and must NOT be used for new deployments.

For distributed deployments with Redis locking and S3-compatible storage, you can use Docker Compose:

# Start infrastructure services (Postgres, Redis, MinIO)
make docker-up

# Run E2E tests with Docker
make test-e2e

# Stop services
make docker-down

Decision: We have officially abandoned Redis Locks in favor of a strictly Local-Only architecture.

Security

Please review our Security Policy for:

  • Threat model and trust boundaries
  • Secret management guidelines
  • Data validation architecture
  • Vulnerability reporting process

Contributing

Please read RULES.md before contributing.

  1. Ensure all tests pass: make test
  2. Check linting and types: make lint && make typecheck
  3. Follow the RFC 2119 keywords used in the requirements.

License

This project is licensed under the MIT License.
