Add organism classification service (taxonomy-priority) with diagnostics and tests #2380

Closed
SatoryKono wants to merge 1 commit into main from codex/add-organism-classification-function

@SatoryKono
Owner

Motivation

  • Provide deterministic high-level organism classification (acellular / unicellular / multicellular) for assay records, using the existing assay_taxonomy_id and assay_organism inputs.
  • Treat the taxonomy ID as the source of truth when it is available, and record diagnostics when assay_organism and assay_taxonomy_id disagree.
  • Keep the logic as pure domain code (no I/O) and expose a typed, easily extensible API for downstream use in transformers and validation.

Description

  • Added a new pure-domain service, classify_organism(assay_organism, assay_taxonomy_id) -> OrganismClassificationResult, together with the OrganismClass enum and the OrganismClassificationResult dataclass, which captures organism_class, normalized_organism, taxonomy_id, source, source_conflict and reason (file: src/bioetl/domain/services/organism_classification_service.py).
  • Implemented deterministic normalization (trim, lower-case, remove parenthetical annotations), alias mapping (hiv, eel, rice, monkey), lightweight heuristic hints, and lookup tables for known taxonomy IDs and canonical names.
  • A valid taxonomy ID takes priority; a conflict with assay_organism sets source_conflict=True and populates reason; a valid but unmapped taxonomy ID returns source="unresolved" with an explanation.
  • Exported the new API from the services package and added focused unit tests (file: tests/unit/domain/services/test_organism_classification_service.py).
  • Made small ancillary fixes to formatting and imports, plus a few type/comment adjustments in config/system modules, to satisfy repo checks (files touched: src/bioetl/domain/services/__init__.py, src/bioetl/infrastructure/config/*, src/bioetl/infrastructure/system/memory_monitor.py, src/bioetl/__init__.py, and one test formatting change).
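
For readers without the diff at hand, the behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the OrganismClass.UNKNOWN member, the source values "taxonomy_id" and "organism_name", the "positive integer" validity rule, the class-mismatch definition of a conflict, and the contents of every lookup table (including what the hiv alias maps to) are assumptions made for the example. Only the function name, the result field names and source="unresolved" come from the description.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from enum import Enum


class OrganismClass(str, Enum):
    ACELLULAR = "acellular"
    UNICELLULAR = "unicellular"
    MULTICELLULAR = "multicellular"
    UNKNOWN = "unknown"  # assumption: placeholder when nothing resolves


@dataclass(frozen=True)
class OrganismClassificationResult:
    organism_class: OrganismClass
    normalized_organism: str | None
    taxonomy_id: int | None
    source: str  # "unresolved" is from the PR; other values are assumed
    source_conflict: bool = False
    reason: str | None = None


# Illustrative tables only; the real service's tables are not shown in the PR.
_ALIASES = {"hiv": "human immunodeficiency virus 1"}  # assumed alias target
_CLASS_BY_TAXID = {
    9606: OrganismClass.MULTICELLULAR,  # Homo sapiens
    562: OrganismClass.UNICELLULAR,     # Escherichia coli
    11676: OrganismClass.ACELLULAR,     # Human immunodeficiency virus 1
}
_CLASS_BY_NAME = {
    "homo sapiens": OrganismClass.MULTICELLULAR,
    "escherichia coli": OrganismClass.UNICELLULAR,
    "human immunodeficiency virus 1": OrganismClass.ACELLULAR,
}


def normalize_organism(raw: str | None) -> str | None:
    """Strip parenthetical annotations, trim, lower-case, then apply aliases."""
    if raw is None:
        return None
    text = re.sub(r"\([^)]*\)", " ", raw)
    text = " ".join(text.lower().split())
    if not text:
        return None
    return _ALIASES.get(text, text)


def classify_organism(
    assay_organism: str | None, assay_taxonomy_id: int | None
) -> OrganismClassificationResult:
    name = normalize_organism(assay_organism)
    name_class = _CLASS_BY_NAME.get(name) if name else None

    # Assumed validity rule: a positive integer (bool is excluded explicitly).
    valid_id = (
        isinstance(assay_taxonomy_id, int)
        and not isinstance(assay_taxonomy_id, bool)
        and assay_taxonomy_id > 0
    )
    if valid_id:
        id_class = _CLASS_BY_TAXID.get(assay_taxonomy_id)
        if id_class is None:
            return OrganismClassificationResult(
                OrganismClass.UNKNOWN, name, assay_taxonomy_id, "unresolved",
                reason=f"taxonomy id {assay_taxonomy_id} is valid but unmapped",
            )
        # Assumed conflict rule: the name resolves to a different class.
        conflict = name_class is not None and name_class is not id_class
        reason = (
            f"organism name suggests {name_class.value}, "
            f"taxonomy id gives {id_class.value}"
            if conflict else None
        )
        return OrganismClassificationResult(
            id_class, name, assay_taxonomy_id, "taxonomy_id", conflict, reason
        )
    if name_class is not None:
        return OrganismClassificationResult(name_class, name, None, "organism_name")
    return OrganismClassificationResult(
        OrganismClass.UNKNOWN, name, None, "unresolved",
        reason="no valid taxonomy id and organism name not recognized",
    )
```

Under these assumptions, classify_organism("Escherichia coli (K12)", 9606) would classify by the taxonomy ID and set source_conflict=True, while classify_organism(" HIV ", None) would fall back to the aliased name.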

Testing

  • Ran the unit tests for the new service with uv run python -m pytest tests/unit/domain/services/test_organism_classification_service.py -q; they passed. ✅
  • Verified architecture/style gates with uv run python -m pytest tests/architecture/test_code_formatting.py tests/architecture/test_code_metrics.py -q and both passed after formatting fixes. ✅
  • Type checking with uv run python -m mypy --strict src/bioetl/ completed with no errors. ✅
  • Ran the full test suite with uv run python -m pytest tests/ -x -q; the run stopped at an unrelated live e2e test that failed because of an upstream ChEMBL API 500 Internal Server Error (tests/e2e/test_full_pipeline.py::TestChEMBLPipelineE2E::test_chembl_activity_full_run). The failure is external to the classification logic and not caused by this change. ⚠️

Codex Task

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
