Bacdive media fix by realmarcin · Pull Request #174 · Knowledge-Graph-Hub/kg-microbe

realmarcin · 2024-06-04T19:31:54Z

No description provided.

Copilot

Pull Request Overview

This PR fixes issues in the BacDive media processing pipeline by refactoring data accumulation and edge creation logic. The changes address data structure inconsistencies and improve the handling of metabolite utilization and enzyme activity data.

Removes unused NCBI_TO_ENZYME_EDGE import and refactors enzyme activity handling
Changes data structures from dictionaries to tuples for consistent processing
Adds accumulation logic to collect data per NCBITaxon before writing edges

Copilot · 2025-08-13T16:53:13Z

+                )
+            for assay_id in info["assays"]:
+                # Unpacking the assay information stored as tuples
+                assay_curie, assay_value, utilization_type = assay_id


The tuple unpacking assumes all assay items have exactly 3 elements, but enzyme activities are stored as 2-element tuples (lines 749, 758) while metabolite utilizations are stored as 3-element tuples. This will cause a ValueError when processing enzyme data.

Suggested change

-                assay_curie, assay_value, utilization_type = assay_id
+                # Unpacking the assay information stored as tuples (handle both 2- and 3-element tuples)
+                if len(assay_id) == 3:
+                    assay_curie, assay_value, utilization_type = assay_id
+                elif len(assay_id) == 2:
+                    assay_curie, assay_value = assay_id
+                    utilization_type = None
+                else:
+                    raise ValueError(f"Unexpected assay tuple length: {len(assay_id)} for {assay_id}")
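An alternative to branching on tuple length at the read site is to normalize every entry to one shape when it is stored. The following is a minimal sketch of that idea, not code from the PR: `AssayRecord` and `normalize_assay` are hypothetical names, and the CURIEs and labels are placeholder values.

```python
from typing import NamedTuple, Optional, Sequence


class AssayRecord(NamedTuple):
    """Hypothetical uniform shape for accumulated assay entries."""

    curie: str
    value: str
    utilization_type: Optional[str] = None  # absent for 2-element inputs


def normalize_assay(item: Sequence[str]) -> AssayRecord:
    """Accept a 2- or 3-element tuple and return a uniform 3-field record."""
    if len(item) not in (2, 3):
        raise ValueError(f"Unexpected assay tuple length: {len(item)} for {item!r}")
    return AssayRecord(*item)


# Illustrative values only; these are not taken from BacDive.
enzyme = normalize_assay(("EC:3.2.1.23", "beta-galactosidase"))
metabolite = normalize_assay(("CHEBI:17234", "glucose", "builds acid from"))
```

With every entry normalized this way, the loop over `info["assays"]` could unpack three fields unconditionally, and a malformed entry would fail where it is stored rather than where it is written.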

Copilot · 2025-08-13T16:53:13Z

-                                        BACDIVE_PREFIX + key,
-                                    ]
-                                    edge_writer.writerow(meta_util_edges_to_write)
+                        for k, _, _ in positive_chebi_activity:


This line is incorrectly indented and creates a syntax error. It should be aligned with the previous if statement or properly nested within it.

Copilot · 2025-08-13T16:53:14Z

                        writer_2.writerow(phys_and_meta_data)

+                    if ncbitaxon_id:
+                        if ncbitaxon_id not in self.ncbitaxon_info:


This data accumulation logic is duplicated later in the code (lines 604-656 and 767-814). The duplicate code should be consolidated into a single location or extracted into a helper method to improve maintainability.
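One way to consolidate the repeated accumulation blocks is a single helper that every call site goes through. This is a rough sketch under assumptions: `TaxonAccumulator` and its `add` method are hypothetical, the category names are invented, and the real `self.ncbitaxon_info` structure in the PR may differ.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set


class TaxonAccumulator:
    """Hypothetical stand-in for the self.ncbitaxon_info bookkeeping."""

    def __init__(self) -> None:
        # taxon id -> category (e.g. "media", "assays") -> unique values
        self.ncbitaxon_info: Dict[str, Dict[str, Set[Hashable]]] = defaultdict(
            lambda: defaultdict(set)
        )

    def add(self, ncbitaxon_id: str, category: str, values: Iterable[Hashable]) -> None:
        """Record values for a taxon; empty ids are skipped, repeats collapse."""
        if not ncbitaxon_id:
            return
        self.ncbitaxon_info[ncbitaxon_id][category].update(values)


acc = TaxonAccumulator()
acc.add("NCBITaxon:562", "media", ["medium:1"])
acc.add("NCBITaxon:562", "media", ["medium:1", "medium:2"])  # second strain, same taxon
acc.add("", "media", ["medium:3"])  # no taxon id, ignored
```

Because the values are held in sets, two strain records that map to the same taxon contribute each medium or assay only once, which is the deduplication the accumulator was introduced for.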

Copilot · 2025-08-13T16:53:14Z

+                edge_writer.writerow(
+                    [
+                        ncbitaxon_id,
+                        NCBI_TO_METABOLITE_UTILIZATION_EDGE,


All assay data is being written with NCBI_TO_METABOLITE_UTILIZATION_EDGE edge type, but enzyme activities should use a different edge type since they represent different biological relationships.

Suggested change

-                        NCBI_TO_METABOLITE_UTILIZATION_EDGE,
+                # Select edge type based on utilization_type
+                if utilization_type == "enzyme_activity":
+                    edge_type = ENZYME_TO_ASSAY_EDGE
+                else:
+                    edge_type = NCBI_TO_METABOLITE_UTILIZATION_EDGE
+                edge_writer.writerow(
+                    [
+                        ncbitaxon_id,
+                        edge_type,
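The selection has to happen before the `writerow` call rather than inside the list literal. A minimal sketch of that ordering follows; the two constants are given placeholder string values because their real values are not shown in this PR, and `edge_type_for`, `write_assay_edge`, and the three-column row layout are assumptions made for illustration.

```python
import csv
import io

# Placeholder values; the real constants are defined in kg-microbe.
NCBI_TO_METABOLITE_UTILIZATION_EDGE = "placeholder:metabolite_utilization"
ENZYME_TO_ASSAY_EDGE = "placeholder:enzyme_assay"


def edge_type_for(utilization_type):
    """Pick the edge type for one assay entry (hypothetical helper)."""
    if utilization_type == "enzyme_activity":
        return ENZYME_TO_ASSAY_EDGE
    return NCBI_TO_METABOLITE_UTILIZATION_EDGE


def write_assay_edge(edge_writer, ncbitaxon_id, assay_curie, utilization_type):
    """Resolve the edge type first, then write one row."""
    edge_type = edge_type_for(utilization_type)
    edge_writer.writerow([ncbitaxon_id, edge_type, assay_curie])


buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
write_assay_edge(writer, "NCBITaxon:562", "assay:1", "enzyme_activity")
write_assay_edge(writer, "NCBITaxon:562", "assay:2", None)
rows = [line.split("\t") for line in buffer.getvalue().splitlines()]
```

Note that if 2-element enzyme tuples are unpacked with `utilization_type = None`, as in the tuple-unpacking suggestion, the `"enzyme_activity"` comparison would never match them, so the enzyme entries would need an explicit marker for this branch to take effect.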

turbomam · 2026-03-18T19:23:56Z

Closing this PR — bacdive.py has been rewritten extensively since June 2024 (from ~925 lines to 2,809 lines) and this PR has unresolvable merge conflicts.

What this PR was trying to fix

Duplicate edges — when multiple BacDive strain records mapped to the same NCBITaxon ID, the transform wrote redundant medium/assay edges. The PR introduced a self.ncbitaxon_info accumulator dict to collect unique values per taxon and write them once at the end.
Data structure simplification — refactored enzyme/metabolite data from single-entry dicts to tuples for cleaner iteration.
Isolation source handling — added isolation source node/edge creation.
Enzyme edge consolidation — removed NCBI_TO_ENZYME_EDGE and routed enzyme data through the same accumulator as metabolite utilization.

Why it's superseded

The accumulator pattern was not adopted on master. Deduplication is now handled differently.
Enzyme handling was completely rewritten with a richer data model (METPO predicates, CAPABLE_OF relations, dicts with ec_id/label/predicate keys).
Isolation source handling was implemented independently on master.
Copilot review (Aug 2025) flagged bugs that were never addressed: tuple unpacking mismatch between 2-element enzyme tuples and 3-element metabolite tuples, and all assays being written with NCBI_TO_METABOLITE_UTILIZATION_EDGE regardless of type.

Preserved as issues

The underlying problems (duplicate edge creation, enzyme/metabolite edge type conflation) may or may not still exist in the current code. Filing separate issues to verify.
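A check for the duplicate-edge question could be as simple as counting repeated rows in the edges output. The sketch below assumes a tab-separated file with `subject`, `predicate`, and `object` columns; the function name, the column names, and the sample rows are illustrative and not taken from the repository.

```python
import csv
import io
from collections import Counter


def find_duplicate_edges(handle, key_columns=("subject", "predicate", "object")):
    """Return {edge_key: count} for every edge that appears more than once."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(tuple(row[col] for col in key_columns) for row in reader)
    return {edge: n for edge, n in counts.items() if n > 1}


# Made-up sample standing in for an edges file.
sample = io.StringIO(
    "subject\tpredicate\tobject\n"
    "NCBITaxon:562\tplaceholder:grows_in\tmedium:1\n"
    "NCBITaxon:562\tplaceholder:grows_in\tmedium:1\n"
    "NCBITaxon:562\tplaceholder:grows_in\tmedium:2\n"
)
duplicates = find_duplicate_edges(sample)
```

An empty result on the real output would suggest that multi-strain taxa no longer produce redundant edges; a non-empty one lists exactly which edges repeat.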

Thank you @realmarcin for identifying these problems.

realmarcin added 9 commits June 3, 2024 23:14

strain to taxon, strain to medium. also isolation source

7141959

error fix

8dbc5b5

error fix

308a243

error fix

88c5e31

error fix

4290d2c

error fix

1d1cb72

error fix

9412f78

error fix

b8562bd

error fix

cb65501

turbomam requested a review from Copilot August 13, 2025 16:51

Copilot AI reviewed Aug 13, 2025

realmarcin assigned sujaypatil96 Sep 5, 2025

turbomam closed this Mar 18, 2026

This was referenced Mar 18, 2026

Verify BacDive transform does not produce duplicate edges for multi-strain taxa #521

Open

Verify BacDive enzyme activities use correct edge type (not metabolite utilization) #522

Open

realmarcin wants to merge 9 commits into master from bacdive_media_fix

4 participants
