Bacdive media fix #174

Closed
realmarcin wants to merge 9 commits into master from bacdive_media_fix
@realmarcin
Collaborator

No description provided.

@turbomam requested a review from Copilot on August 13, 2025, 16:51
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes issues in the BacDive media processing pipeline by refactoring data accumulation and edge creation logic. The changes address data structure inconsistencies and improve the handling of metabolite utilization and enzyme activity data.

  • Removes unused NCBI_TO_ENZYME_EDGE import and refactors enzyme activity handling
  • Changes data structures from dictionaries to tuples for consistent processing
  • Adds accumulation logic to collect data per NCBITaxon before writing edges

)
for assay_id in info["assays"]:
    # Unpacking the assay information stored as tuples
    assay_curie, assay_value, utilization_type = assay_id

Copilot AI Aug 13, 2025

The tuple unpacking assumes all assay items have exactly 3 elements, but enzyme activities are stored as 2-element tuples (lines 749, 758) while metabolite utilizations are stored as 3-element tuples. This will cause a ValueError when processing enzyme data.

Suggested change
- assay_curie, assay_value, utilization_type = assay_id
+ # Unpacking the assay information stored as tuples (handle both 2- and 3-element tuples)
+ if len(assay_id) == 3:
+     assay_curie, assay_value, utilization_type = assay_id
+ elif len(assay_id) == 2:
+     assay_curie, assay_value = assay_id
+     utilization_type = None
+ else:
+     raise ValueError(f"Unexpected assay tuple length: {len(assay_id)} for {assay_id}")
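
A minimal, self-contained sketch of the failure mode described in this comment and of the normalization the suggestion proposes. The helper name `normalize_assay` and the sample tuple contents are made up for illustration and are not taken from bacdive.py.

```python
def normalize_assay(assay):
    """Return (curie, value, utilization_type) for a 2- or 3-element tuple."""
    if len(assay) == 3:
        return assay
    if len(assay) == 2:
        curie, value = assay
        return (curie, value, None)
    raise ValueError(f"Unexpected assay tuple length: {len(assay)} for {assay}")


# Hypothetical mixed collection: one metabolite tuple, one enzyme tuple.
assays = [("CHEBI:17234", "+", "utilization"), ("EC:3.2.1.23", "+")]

# Naive three-way unpacking fails on the 2-element enzyme tuple.
try:
    for curie, value, utilization_type in assays:
        pass
    naive_failed = False
except ValueError:
    naive_failed = True

# Normalizing first gives every item the same shape.
normalized = [normalize_assay(a) for a in assays]
```

Normalizing at the point where tuples are stored, rather than where they are read, would avoid the length check entirely; which option fits better depends on how the accumulator is populated.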

BACDIVE_PREFIX + key,
]
edge_writer.writerow(meta_util_edges_to_write)
for k, _, _ in positive_chebi_activity:

Copilot AI Aug 13, 2025

This line is incorrectly indented and creates a syntax error. It should be aligned with the previous if statement or properly nested within it.

writer_2.writerow(phys_and_meta_data)

if ncbitaxon_id:
    if ncbitaxon_id not in self.ncbitaxon_info:

Copilot AI Aug 13, 2025

This data accumulation logic is duplicated later in the code (lines 604-656 and 767-814). The duplicate code should be consolidated into a single location or extracted into a helper method to improve maintainability.
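
One way the consolidation could look, sketched under assumptions: the class name `TaxonAccumulator`, the method `accumulate`, and the category names are hypothetical, and only the attribute name `ncbitaxon_info` comes from the PR. Each call site would then reduce to a single method call.

```python
from collections import defaultdict


class TaxonAccumulator:
    """Hypothetical stand-in for the transform class that owns ncbitaxon_info."""

    def __init__(self):
        # taxon id -> category name -> set of unique values
        self.ncbitaxon_info = defaultdict(lambda: defaultdict(set))

    def accumulate(self, ncbitaxon_id, **values_by_category):
        """Merge per-strain values into the per-taxon record; skip empty ids."""
        if not ncbitaxon_id:
            return
        record = self.ncbitaxon_info[ncbitaxon_id]
        for category, values in values_by_category.items():
            record[category].update(values)


acc = TaxonAccumulator()
acc.accumulate(
    "NCBITaxon:562",
    media=["medium:1"],
    assays=[("CHEBI:17234", "+", "utilization")],
)
acc.accumulate("NCBITaxon:562", media=["medium:1", "medium:2"])
acc.accumulate(None, media=["medium:3"])  # ignored: no taxon id
```

Sets require hashable values, which is one practical argument for storing assays as tuples rather than dicts.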

edge_writer.writerow(
    [
        ncbitaxon_id,
        NCBI_TO_METABOLITE_UTILIZATION_EDGE,

Copilot AI Aug 13, 2025

All assay data is being written with NCBI_TO_METABOLITE_UTILIZATION_EDGE edge type, but enzyme activities should use a different edge type since they represent different biological relationships.

Suggested change
- NCBI_TO_METABOLITE_UTILIZATION_EDGE,
+ # Select edge type based on utilization_type
+ if utilization_type == "enzyme_activity":
+     edge_type = ENZYME_TO_ASSAY_EDGE
+ else:
+     edge_type = NCBI_TO_METABOLITE_UTILIZATION_EDGE
+ edge_writer.writerow(
+     [
+         ncbitaxon_id,
+         edge_type,
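
A hedged sketch of the same idea as standalone code. The selection has to run before the `edge_writer.writerow(` call opens its list, because an `if` statement cannot appear inside a list literal. The constant values below are placeholders, `ENZYME_TO_ASSAY_EDGE` is the name used in the suggestion and is not verified against the project, and `select_edge_type` and `build_edge_row` are hypothetical helpers.

```python
# Placeholder values; the real constants would come from the project's
# constants module. ENZYME_TO_ASSAY_EDGE is the name used in the suggestion.
NCBI_TO_METABOLITE_UTILIZATION_EDGE = "placeholder:metabolite_utilization"
ENZYME_TO_ASSAY_EDGE = "placeholder:enzyme_assay"


def select_edge_type(utilization_type):
    """Pick the predicate for an assay before the edge row is built."""
    if utilization_type == "enzyme_activity":
        return ENZYME_TO_ASSAY_EDGE
    return NCBI_TO_METABOLITE_UTILIZATION_EDGE


def build_edge_row(ncbitaxon_id, assay):
    """Build the leading columns of an edge row for one 3-element assay tuple."""
    assay_curie, _assay_value, utilization_type = assay
    return [ncbitaxon_id, select_edge_type(utilization_type), assay_curie]


enzyme_row = build_edge_row("NCBITaxon:562", ("EC:3.2.1.23", "+", "enzyme_activity"))
metabolite_row = build_edge_row("NCBITaxon:562", ("CHEBI:17234", "+", "utilization"))
```

This only works if enzyme tuples actually carry an `"enzyme_activity"` marker in the third position, which the 2-element enzyme tuples flagged earlier in this review do not.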

@turbomam
Collaborator

Closing this PR — bacdive.py has been rewritten extensively since June 2024 (from ~925 lines to 2,809 lines) and this PR has unresolvable merge conflicts.

What this PR was trying to fix

  1. Duplicate edges — when multiple BacDive strain records mapped to the same NCBITaxon ID, the transform wrote redundant medium/assay edges. The PR introduced a self.ncbitaxon_info accumulator dict to collect unique values per taxon and write them once at the end.

  2. Data structure simplification — refactored enzyme/metabolite data from single-entry dicts to tuples for cleaner iteration.

  3. Isolation source handling — added isolation source node/edge creation.

  4. Enzyme edge consolidation — removed NCBI_TO_ENZYME_EDGE and routed enzyme data through the same accumulator as metabolite utilization.
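
For readers unfamiliar with the approach in item 1, the following is a minimal two-pass sketch of collecting values per taxon and writing each edge once. The strain records, the `placeholder:grows_in` predicate, and the three-column row layout are invented for illustration; only the name `ncbitaxon_info` comes from the PR.

```python
import csv
import io

# Hypothetical strain records: two strains share one NCBITaxon ID and one medium.
strain_records = [
    {"ncbitaxon_id": "NCBITaxon:562", "media": ["medium:1"]},
    {"ncbitaxon_id": "NCBITaxon:562", "media": ["medium:1", "medium:2"]},
    {"ncbitaxon_id": "NCBITaxon:1423", "media": ["medium:1"]},
]

# Pass 1: collect unique media per taxon instead of writing per strain.
ncbitaxon_info = {}
for record in strain_records:
    ncbitaxon_info.setdefault(record["ncbitaxon_id"], set()).update(record["media"])

# Pass 2: write each (taxon, medium) edge exactly once, in a stable order.
buffer = io.StringIO()
edge_writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
for taxon_id in sorted(ncbitaxon_info):
    for medium in sorted(ncbitaxon_info[taxon_id]):
        edge_writer.writerow([taxon_id, "placeholder:grows_in", medium])

edge_rows = buffer.getvalue().splitlines()
```

Writing per strain would have produced four rows here, with the `NCBITaxon:562` to `medium:1` edge repeated; the two-pass version produces three.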

Why it's superseded

  • The accumulator pattern was not adopted on master. Deduplication is now handled differently.
  • Enzyme handling was completely rewritten with a richer data model (METPO predicates, CAPABLE_OF relations, dicts with ec_id/label/predicate keys).
  • Isolation source handling was implemented independently on master.
  • Copilot review (Aug 2025) flagged bugs that were never addressed: tuple unpacking mismatch between 2-element enzyme tuples and 3-element metabolite tuples, and all assays being written with NCBI_TO_METABOLITE_UTILIZATION_EDGE regardless of type.

Preserved as issues

The underlying problems (duplicate edge creation, enzyme/metabolite edge type conflation) may or may not still exist in the current code. Filing separate issues to verify.

Thank you @realmarcin for identifying these problems.
