BioETL integrates with the following public APIs:
- ChEMBL — Bioactivity data from scientific literature
- PubChem — Chemical compound information
- UniProt — Protein sequence and functional data
- PubMed — Scientific publication metadata
| Boundary | Risk Level | Mitigation |
|---|---|---|
| API responses | Medium | Schema validation, input sanitization |
| User configuration | Low | YAML schema validation |
| Local filesystem | Low | Path traversal prevention |
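The path traversal mitigation in the table above can be sketched as follows. This is a minimal illustration, not BioETL's actual code: the helper name `resolve_inside` and its arguments are hypothetical, and it assumes all file access is meant to stay under one base directory.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it.

    Hypothetical helper for illustration only.
    """
    base = Path(base_dir).resolve()
    # Joining and then resolving collapses ".." segments and follows symlinks.
    # An absolute user_path replaces base entirely and is rejected below.
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

Resolving before comparing is the important step: a plain string prefix check would accept inputs such as `data/../../etc/passwd`.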
| Vector | Description | Mitigation |
|---|---|---|
| API Injection | Malformed data in API responses | Strict schema validation at adapter level |
| Data Poisoning | Corrupted or malicious records | Multi-layer validation (Bronze → Silver → Gold) |
| Secrets Exposure | Credentials in logs/code/VCR cassettes | Environment variables, sanitization hooks |
| Dependency Vulnerabilities | CVEs in third-party packages | Regular pip-audit scans, Dependabot |
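To illustrate what "strict schema validation at adapter level" might look like, here is a small sketch in plain Python. The schema `ACTIVITY_SCHEMA`, its field names, and the function `validate_record` are assumptions made for this example; the project's real adapters may use a different mechanism.

```python
from typing import Any

# Hypothetical schema for illustration: field name -> expected Python type.
ACTIVITY_SCHEMA: dict[str, type] = {
    "molecule_id": str,
    "target_id": str,
    "value": float,
}


def validate_record(record: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = []
    for name, expected in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    # Strict mode: unknown fields are rejected rather than silently passed on.
    for name in record:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors
```

Rejecting unexpected fields is what makes the check strict: a malformed or hostile response cannot carry extra data past the adapter.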
All secrets MUST follow the naming convention:
BIOETL_{PROVIDER}_{KEY}
Examples:
- `BIOETL_CHEMBL_API_KEY`
- `BIOETL_PUBCHEM_API_KEY`
- `BIOETL_UNIPROT_API_KEY`
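One way to keep every lookup consistent with the naming convention is to build the variable name in a single place. The helpers below (`secret_env_name`, `load_secret`) are hypothetical names introduced for this sketch and are not part of BioETL's documented API.

```python
import os


def secret_env_name(provider: str, key: str) -> str:
    """Build the environment variable name BIOETL_{PROVIDER}_{KEY}."""
    return f"BIOETL_{provider.upper()}_{key.upper()}"


def load_secret(provider: str, key: str = "API_KEY") -> str:
    """Read a secret from the environment, failing loudly if it is absent."""
    name = secret_env_name(provider, key)
    value = os.environ.get(name)
    if not value:
        # Report the variable name only; never echo a secret value.
        raise RuntimeError(f"required secret {name} is not set")
    return value
```

For example, `load_secret("chembl")` reads `BIOETL_CHEMBL_API_KEY` and raises if it is missing or empty.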
| Requirement | Status |
|---|---|
| Never commit .env files | ✅ Enforced via .gitignore |
| VCR cassettes sanitized | ✅ before_record hooks remove secrets |
| No hardcoded credentials | ✅ Checked by linting rules |
| Secrets not in logs | ✅ Structlog filtering configured |
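As a rough illustration of log filtering, the function below follows the structlog processor signature `(logger, method_name, event_dict)` and masks values whose keys look sensitive. The fragment list and the function name `redact_secrets` are assumptions for this sketch; the project's actual Structlog configuration is not shown here and may differ.

```python
from typing import Any

# Assumed key fragments that mark a value as sensitive (illustrative only).
SENSITIVE_FRAGMENTS = ("api_key", "token", "password", "secret", "authorization")


def redact_secrets(logger: Any, method_name: str, event_dict: dict[str, Any]) -> dict[str, Any]:
    """structlog-style processor: mask values stored under sensitive-looking keys."""
    for key in event_dict:
        if any(fragment in key.lower() for fragment in SENSITIVE_FRAGMENTS):
            event_dict[key] = "***REDACTED***"
    return event_dict
```

A processor like this would typically be placed in the `processors` list passed to `structlog.configure`, ahead of the renderer, so that masking happens before anything is written.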
```python
import os

# CORRECT: Load from environment
api_key = os.environ.get("BIOETL_CHEMBL_API_KEY")

# INCORRECT: Hardcoded value
api_key = "sk-1234567890"  # NEVER do this
```

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Bronze    │ ──► │   Silver    │ ──► │    Gold     │
│  (Raw API)  │     │  (Cleaned)  │     │ (Validated) │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
 Schema checks      PyArrow schema       Pandera rules
 at adapters        enforcement        + business logic
```
| Layer | Tool | Validation Type |
|---|---|---|
| Adapters | Custom schemas | Input format, required fields |
| Silver | PyArrow/Delta | Schema enforcement, type coercion |
| Gold | Pandera | Business rules, cross-field validation |
- Records failing validation are sent to Quarantine
- Quarantine records include: original data, error details, timestamp
- Thresholds: >5% errors = warning, >20% errors = batch failure
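The quarantine rules above can be sketched in a few lines. This is an illustration under assumptions: the `QuarantineRecord` class, its field names, and `batch_status` are hypothetical, and only the 5% and 20% thresholds come from the policy itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

WARNING_THRESHOLD = 0.05  # more than 5% errors: warning
FAILURE_THRESHOLD = 0.20  # more than 20% errors: batch failure


@dataclass
class QuarantineRecord:
    """Hypothetical container: original data, error details, timestamp."""
    original: dict[str, Any]
    error: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def batch_status(total: int, failed: int) -> str:
    """Classify a batch as "ok", "warning", or "failed" by its error rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = failed / total
    if rate > FAILURE_THRESHOLD:
        return "failed"
    if rate > WARNING_THRESHOLD:
        return "warning"
    return "ok"
```

Both comparisons are strict, matching the ">" in the thresholds: a batch with exactly 5% errors is still "ok", and one with exactly 20% is a warning rather than a failure.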
```bash
# Run dependency audit
pip-audit

# Check for known vulnerabilities
make security-check
```

- Dependabot alerts: Enabled via `.github/dependabot.yml` (pip + GitHub Actions)
- detect-secrets: Runs in CI (`.github/workflows/security.yml`) to prevent credential leaks
- pip-audit: Runs in CI (`.github/workflows/security.yml`) for dependency vulnerability scanning
- Trivy: Container image scanning in `.github/workflows/docker.yml` (SHA-pinned)
- Update policy: Security patches applied within 72 hours
All production dependencies are pinned in pyproject.toml with exact versions to ensure reproducible builds and prevent supply chain attacks.
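A pinning policy like this can be checked mechanically. The sketch below flags any requirement string that is not of the form `name==version`; the regular expression and the function name `unpinned` are assumptions for illustration, and the package names in the usage note are placeholders, not BioETL's real dependencies.

```python
import re

# Matches "name==version" with optional extras and an optional environment
# marker, e.g. "example-pkg[extra]==1.2.3 ; python_version >= '3.11'".
# Wildcard versions such as "==1.*" are deliberately not accepted.
EXACT_PIN = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=*\s,;]+(\s*;.*)?$")


def unpinned(dependencies: list[str]) -> list[str]:
    """Return the requirement strings that are not pinned to one exact version."""
    return [dep for dep in dependencies if not EXACT_PIN.match(dep.strip())]
```

For example, `unpinned(["example-a==1.2.3", "example-b>=2.0"])` returns `["example-b>=2.0"]`. If the dependencies live under `[project].dependencies` (an assumption about the file layout), the list could be read with the standard-library `tomllib` module on Python 3.11+.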
- Email: security@example.com
- Response time: 72 hours for initial acknowledgment
- Report — Send details to security email (encrypted if needed)
- Acknowledgment — We respond within 72 hours
- Investigation — We assess severity and impact
- Fix — Patch developed and tested
- Disclosure — Coordinated public disclosure after fix
- Description of the vulnerability
- Steps to reproduce
- Potential impact assessment
- Suggested fix (if any)
| In Scope | Out of Scope |
|---|---|
| BioETL codebase | Third-party API security |
| Configuration handling | Infrastructure not owned by us |
| Data validation logic | Social engineering attacks |
| Secret management | Physical security |
- No hardcoded secrets or credentials
- Input validation for external data
- Proper error handling (no stack traces in production)
- VCR cassettes sanitized before commit
- Dependencies from trusted sources only
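The "no stack traces in production" item can be illustrated with a small wrapper that records the full traceback in the logs while returning only a generic message to the caller. The function `run_safely`, the logger name, and the result shape are hypothetical; the standard-library `logging` module is used here only to keep the sketch dependency-free.

```python
import logging

logger = logging.getLogger("bioetl.example")  # hypothetical logger name


def run_safely(task, *args):
    """Run task; on failure, log the traceback and return a generic error."""
    try:
        return {"ok": True, "result": task(*args)}
    except Exception:
        # logger.exception attaches the traceback to the log entry only;
        # nothing about the failure's internals is returned to the caller.
        logger.exception("task failed")
        return {"ok": False, "error": "Internal error; see logs for details."}
```

Operators keep the diagnostic detail they need, while external consumers never see file paths, query text, or other internals that a raw traceback would expose.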
```python
# ❌ Hardcoded secrets
API_KEY = "sk-secret-key-here"

# ❌ Unsanitized input in queries
query = f"SELECT * FROM {user_input}"

# ❌ Secrets in logs
logger.info(f"Using API key: {api_key}")

# ❌ Disabled SSL verification
httpx.get(url, verify=False)
```

```python
# ✅ Environment variables
api_key = os.environ["BIOETL_CHEMBL_API_KEY"]

# ✅ Parameterized queries
query = "SELECT * FROM compounds WHERE id = ?"

# ✅ Masked logging
logger.info("API key configured", has_key=bool(api_key))

# ✅ SSL verification enabled (default)
httpx.get(url)  # verify=True by default
```

Last updated: 2025-12-25