REFACTOR: Improve dna2protein bottleneck#422

Open
m-huertasp wants to merge 9 commits into dev from feat/229-improve-dna2protein-bottleneck
Conversation

Collaborator

@m-huertasp m-huertasp commented Feb 24, 2026

Summary

This PR optimizes the DNA2PROTEINMAPPING step by reducing its bottleneck. Previously, the pipeline executed multiple serial requests to Ensembl REST APIs (roughly 2× the number of genes in a panel), which frequently led to rate-limiting blocks and significant latency for large panels.

Based on suggestions from @FerriolCalvet, the individual calls were replaced with a bulk-retrieval strategy. We now fetch the full GFF annotation from the Ensembl FTP site and process it locally.

Main Changes

  • New logic: Replaced the many REST API calls with a single pass over the Ensembl GFF fetched from the Ensembl FTP site. The file is not saved to disk; it is streamed, filtered on the fly, and held in memory only until it is converted into a DataFrame.
  • Refactor: Updated downstream functions to handle DataFrames instead of the JSON objects previously returned by the API.
  • Flexibility: Linked the Ensembl release version to the vep_cache_version Nextflow parameter, allowing users to specify the genomic release dynamically.
  • Documentation: Added docstrings and internal comments to explain the new and old parsing logic.
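
As a rough illustration of the streaming approach described above, the sketch below reads a gzipped GFF3 line by line and keeps only exon and CDS records. It is not the code from `bin/panels_computedna2protein.py`: the function names (`stream_gff`, `filter_gff_lines`, `parse_attributes`) and the sample records are hypothetical, and it uses only the standard library. The resulting list of dicts could then be passed to `pandas.DataFrame(...)`.

```python
import gzip
import io
import urllib.request

# The nine tab-separated columns defined by the GFF3 format.
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def parse_attributes(field):
    """Turn a GFF3 attribute string such as 'ID=x;Parent=y' into a dict."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)


def filter_gff_lines(lines, feature_types=("exon", "CDS")):
    """Yield one dict per GFF3 record whose type is in feature_types."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in feature_types:
            continue  # skip malformed lines and unwanted feature types
        record = dict(zip(GFF_COLUMNS, fields))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        record["attributes"] = parse_attributes(record["attributes"])
        yield record


def stream_gff(url):
    """Yield decoded lines from a remote .gff3.gz without writing it to disk."""
    with urllib.request.urlopen(url) as response:
        with gzip.GzipFile(fileobj=response) as gz:
            for line in io.TextIOWrapper(gz, encoding="utf-8"):
                yield line


# Made-up records, used here instead of a network call.
sample = [
    "##gff-version 3\n",
    "1\thavana\tgene\t100\t900\t.\t+\t.\tID=gene:G1\n",
    "1\thavana\texon\t100\t200\t.\t+\t.\tParent=transcript:T1;exon_id=E1\n",
    "1\thavana\tCDS\t150\t200\t.\t+\t0\tID=CDS:P1;Parent=transcript:T1\n",
]
rows = list(filter_gff_lines(sample))
```

In real use, `filter_gff_lines(stream_gff(url))` would chain the two generators so that only the filtered records are ever held in memory.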

closes #229

- also simplified plot_coverage_per_gene
- Removed functions to get the api calls
- New functions to obtain the gff and filter it
- Modified functions to process the data
- Added docstrings
- Also modified the order
- Removed unused functions
- Improved efficiency of find_exon by applying vectorized dataframe
  operations
- Fix errors in docstrings and parameter definition
- Add log information and remove info from matplotlib
- Added click option to script
- Added definition to nextflow process
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the DNA2PROTEINMAPPING step to reduce Ensembl REST API bottlenecks by switching to a bulk retrieval + local parsing approach using the Ensembl GFF3 (fetched from Ensembl FTP), and threads an Ensembl release parameter into the mapping step.

Changes:

  • Add an optional --ensembl-release argument wiring in the Nextflow module.
  • Replace transcript/exon/CDS retrieval from Ensembl REST with streaming download + local filtering of Ensembl GFF3 and DataFrame-based downstream processing.
  • Refactor exon-ID assignment and related plotting/coverage helpers.

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| modules/local/dna2protein/main.nf | Adds --ensembl-release argument passing to the mapping script. |
| bin/panels_computedna2protein.py | Implements GFF3 streaming retrieval/parsing and refactors coordinate mapping, exon lookup, and coverage plotting. |

Collaborator

@FerriolCalvet FerriolCalvet left a comment

all good! great Marta!

generator
A generator that yields lines from the GFF file that correspond to exon and CDS features.
"""
url = f"https://ftp.ensembl.org/pub/release-{release}/gff3/homo_sapiens/Homo_sapiens.GRCh38.{release}.gff3.gz"
Collaborator

We might use this script for mouse as well, so we should update this to also accept mouse, release 111 (the assembly should already be mm39 by default).

Collaborator

GRCm39
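
One possible way to address the request in this thread is to build the URL from a species-to-assembly lookup instead of hardcoding human. The sketch below is only a suggestion: `ensembl_gff_url` and `ASSEMBLIES` are hypothetical names, not part of the PR. It also assumes that the requested release is recent enough to use these assemblies, because older Ensembl releases used earlier assembly names.

```python
# Hypothetical lookup: species directory name -> assembly used in the file name.
# Assumes a release in which these are the current assemblies.
ASSEMBLIES = {
    "homo_sapiens": "GRCh38",
    "mus_musculus": "GRCm39",
}


def ensembl_gff_url(release, species="homo_sapiens"):
    """Build the Ensembl GFF3 URL for a given release and species."""
    if species not in ASSEMBLIES:
        raise ValueError(f"Unsupported species: {species}")
    assembly = ASSEMBLIES[species]
    # File names capitalise the first letter only, e.g. 'Mus_musculus'.
    filename = f"{species.capitalize()}.{assembly}.{release}.gff3.gz"
    return f"https://ftp.ensembl.org/pub/release-{release}/gff3/{species}/{filename}"


mouse_url = ensembl_gff_url(111, "mus_musculus")
human_url = ensembl_gff_url(111)
```

With the default species, the function reproduces the human URL pattern from the reviewed line, so existing behaviour would be unchanged.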

Labels

code-review 👩‍💻 Tasks associated with the code-review efficiency-related

Development

Successfully merging this pull request may close these issues.

use panel annotation to define already the protein position of each CDS position

3 participants