Open
- also simplified plot_coverage_per_gene

- Removed functions to get the api calls
- New functions to obtain the gff and filter it
- Modified functions to process the data
- Added docstrings

- Also modified the order
- Removed unused functions

- Improved efficiency of find_exon by applying vectorized dataframe operations
- Fix errors in docstrings and parameter definition
- Add log information and remove info from matplotlib

- Added click option to script
- Added definition to nextflow process
Contributor
Pull request overview
This PR refactors the DNA2PROTEINMAPPING step to reduce Ensembl REST API bottlenecks by switching to a bulk retrieval + local parsing approach using the Ensembl GFF3 (fetched from Ensembl FTP), and threads an Ensembl release parameter into the mapping step.
Changes:
- Add an optional `--ensembl-release` argument wiring in the Nextflow module.
- Replace transcript/exon/CDS retrieval from Ensembl REST with streaming download + local filtering of Ensembl GFF3 and DataFrame-based downstream processing.
- Refactor exon-ID assignment and related plotting/coverage helpers.
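The streaming download + local filtering idea in the list above could look roughly like the following. This is a minimal sketch, not the code from this PR: the function name `iter_exon_cds_lines` is hypothetical, and it accepts any binary gzip stream so the filtering logic can be tried on an in-memory sample without network access.

```python
import gzip
import io


def iter_exon_cds_lines(gz_stream, feature_types=("exon", "CDS")):
    """Yield GFF3 lines whose feature type (column 3) is in feature_types.

    gz_stream is any binary file-like object holding gzip-compressed GFF3,
    so the data is decompressed and filtered line by line rather than
    being loaded into memory in one go.
    """
    wanted = set(feature_types)
    with gzip.open(gz_stream, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip GFF3 directives and comments
            fields = line.rstrip("\n").split("\t")
            # A GFF3 feature line has exactly nine tab-separated columns.
            if len(fields) == 9 and fields[2] in wanted:
                yield line.rstrip("\n")


# Tiny in-memory sample standing in for the downloaded annotation file.
_sample = (
    "##gff-version 3\n"
    "1\thavana\tgene\t100\t900\t.\t+\t.\tID=gene:G1\n"
    "1\thavana\texon\t100\t200\t.\t+\t.\tParent=transcript:T1\n"
    "1\thavana\tCDS\t150\t200\t.\t+\t0\tParent=transcript:T1\n"
)
sample_lines = list(iter_exon_cds_lines(io.BytesIO(gzip.compress(_sample.encode()))))
```

For a real download, the response object returned by `urllib.request.urlopen(url)` should be usable as `gz_stream`, since it is a binary file-like object; that wiring is an assumption and is not shown in the PR text.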
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `modules/local/dna2protein/main.nf` | Adds `--ensembl-release` argument passing to the mapping script. |
| `bin/panels_computedna2protein.py` | Implements GFF3 streaming retrieval/parsing and refactors coordinate mapping, exon lookup, and coverage plotting. |
FerriolCalvet (Collaborator) requested changes on Feb 24, 2026 and left a comment:
all good! great Marta!
bin/panels_computedna2protein.py (Outdated)

    generator
        A generator that yields lines from the GFF file that correspond to exon and CDS features.
    """
    url = f"https://ftp.ensembl.org/pub/release-{release}/gff3/homo_sapiens/Homo_sapiens.GRCh38.{release}.gff3.gz"
Collaborator
We might use this script for mice as well, so we should fix this to also accept mouse, also at version 111 (the assembly should already be mm39 by default).
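One way to address this would be to build the URL from a small species table instead of hardcoding the human path. The sketch below is only an illustration: `ENSEMBL_GFF3_SPECIES`, `build_gff3_url`, the `species` parameter, and the default release of 111 are all hypothetical names and values, and the mouse directory and file prefix are assumed to follow the same naming pattern as the human ones, so they should be checked against the Ensembl FTP listing.

```python
# Assumed species table: FTP directory, file prefix and assembly name.
# GRCm39 is Ensembl's name for the mouse assembly referred to as mm39 above;
# older Ensembl releases shipped an earlier mouse assembly, so this mapping
# is only meant for recent releases such as 111.
ENSEMBL_GFF3_SPECIES = {
    "human": ("homo_sapiens", "Homo_sapiens", "GRCh38"),
    "mouse": ("mus_musculus", "Mus_musculus", "GRCm39"),
}


def build_gff3_url(species="human", release=111):
    """Return the Ensembl GFF3 URL for a species and release (hypothetical helper)."""
    try:
        directory, prefix, assembly = ENSEMBL_GFF3_SPECIES[species]
    except KeyError:
        supported = ", ".join(sorted(ENSEMBL_GFF3_SPECIES))
        raise ValueError(
            f"Unsupported species {species!r}; expected one of: {supported}"
        ) from None
    return (
        f"https://ftp.ensembl.org/pub/release-{release}/gff3/"
        f"{directory}/{prefix}.{assembly}.{release}.gff3.gz"
    )


human_url = build_gff3_url("human", 111)
mouse_url = build_gff3_url("mouse", 111)
```

Keeping the assembly in the table rather than as a separate argument means a caller only picks the species and release, which matches the "mm39 by default" expectation in the comment.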
Summary
This PR optimizes the `DNA2PROTEINMAPPING` step by reducing its bottleneck. Previously, the pipeline executed multiple serial requests to Ensembl REST APIs (roughly 2× the number of genes in a panel), which frequently led to rate-limiting blocks and significant latency for large panels.

Based on suggestions from @FerriolCalvet, the individual calls were replaced with a bulk-retrieval strategy. We now download the full GFF annotation via FTP and process it locally.
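To illustrate what "process it locally" can mean in practice, the sketch below groups already-filtered exon/CDS lines by their parent transcript, so that a per-gene REST request becomes a dictionary lookup. It is a dependency-free illustration rather than the PR's DataFrame-based code: the helper names are hypothetical, and it assumes the attributes column carries a `Parent=transcript:<ID>` entry.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn a GFF3 attributes string ('key=value;key=value') into a dict."""
    return dict(item.split("=", 1) for item in column9.split(";") if "=" in item)


def group_features_by_transcript(gff_lines):
    """Map each parent transcript ID to its (type, start, end, strand) features."""
    features = defaultdict(list)
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        _seqid, _source, ftype, start, end, _score, strand, _phase, attrs = fields
        parent = parse_attributes(attrs).get("Parent", "")
        # Drop the assumed 'transcript:' prefix, keeping only the identifier.
        transcript_id = parent.split(":", 1)[-1]
        features[transcript_id].append((ftype, int(start), int(end), strand))
    return dict(features)


example_lines = [
    "1\thavana\texon\t100\t200\t.\t+\t.\tParent=transcript:T1;rank=1",
    "1\thavana\tCDS\t150\t200\t.\t+\t0\tParent=transcript:T1",
    "1\thavana\texon\t500\t650\t.\t+\t.\tParent=transcript:T2;rank=1",
]
by_transcript = group_features_by_transcript(example_lines)
```

With the whole annotation parsed once, the number of network requests no longer grows with the number of genes in the panel.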
Main Changes
closes #229