tabix/tabix: update module to support region-based VCF extraction#11064
tabix/tabix: update module to support region-based VCF extraction#11064luisas wants to merge 21 commits intonf-core:masterfrom
Conversation
- Reverts tabix/tabix to its original indexing behaviour - New tabix/extract module: takes a bgzipped VCF + tbi + optional regions file - Outputs bgzipped VCF (always) and tbi index (optional, controlled by create_index input) - Follows samtools/view pattern for optional outputs
Always runs extraction + bgzip + tabix index. Both vcf and tbi outputs are optional: true — callers use whichever channels they need.
- Drop tabix/extract — functionality merged into tabix/tabix - New input: optional regions file (pass [] to use indexing mode) - New input: index path (required for extraction mode) - index output: optional, emitted in indexing mode - vcf output: optional, emitted in extraction mode - tbi output: optional, emitted in extraction mode
There was a problem hiding this comment.
Ok I started this review but then I realized that this is in my eyes a different module. Maybe @nf-core/maintainers have the same opinion but I would rather have this separate because from tabix/tabix I just expect indexing and I would otherwise call it tabix/extract or something similar.
Edit: I just saw that that was you original idea and now I am a little confused what led to the change :D
modules/nf-core/tabix/tabix/main.nf
Outdated
| tuple val(meta), path(tab) | ||
| tuple val(meta2), path(regions) |
There was a problem hiding this comment.
Please make this into one input tuple :)
| tuple val(meta), path(tab) | |
| tuple val(meta2), path(regions) | |
| tuple val(meta), path(tab), path(regions) |
There was a problem hiding this comment.
They are actually 2 different files that I believe conceptually work better in 2 tuples: the first one is the input file (with its own meta) and the second one is the subsetting instructions file (also should probably have its own meta) - do you think I could leave it separate?
There was a problem hiding this comment.
One might want to do sample specific subsetting, which is much easier with 1 tuple.
It's easier to join inputs then it is to pass two inputs together in sync 😉
This is one of the reasons we are moving to a flatter input structure.
There was a problem hiding this comment.
Oh ok! Fixing it then :)
|
Hi @famosab, thanks for starting the review! I started doing it like tabix_extract but then I had a chat with @FriederikeHanssen on how to handle these modules that have different optional outputs depending on the input flags and because of maintainance burdens we thought it would be better to have it in one module? Similar to samtools/view I am torn tbh, I am also liking the tabix/extract version. |
|
I get the point! I am at times unsure maybe we need to discuss this also in the maintainers meeting. If I follow your arguments I would then vote for the other submodules of tabix being deprecated (because what makes them different "enough") or add this as a separate module. But I would find it confusing that some things are excluded and some are part of the "base" module. Do you get what I mean? Maybe I am a bit strict on this matter but I want to have a structure that is understandable and sort of follows the same arguments? We could also have a vote or ask for more opinions in the slack and then adjust all tabix modules? (sorry for making this bigger than it is) |
I see the point. Doing tabix/extract was also my original intuition but I also see the other side of having something like samtools/view. Pinging @FriederikeHanssen @adamrtalbot so we can decide how to proceed |
@famosab given the discussion on the team-maintainers team in slack there seems not to be the general solution for every module but there is a general agreement on trying to keep options in the same module. Here there is also 2 other modules under tabix but they are using different command lines (bgzip, bgzip + tabix, tabix) What do you think? |
famosab
left a comment
There was a problem hiding this comment.
Few more requests, I am fine with keeping this in tabix/tabix
| then { | ||
| assertAll ( | ||
| { assert process.success }, | ||
| { assert snapshot(sanitizeOutput(process.out, unstableKeys: ["vcf"])).match() } |
There was a problem hiding this comment.
Can we not use nft-vcf to test the md5sum of the vcf?
There was a problem hiding this comment.
I did not know about that - thank you! Will add now
Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
|
@famosab ok now it should be all done! Thank you a lot for the help! |
Add support for region-based VCF extraction
PR checklist
topic: versions- See version_topicslabelnf-core modules test <MODULE> --profile dockernf-core modules test <MODULE> --profile singularitynf-core modules test <MODULE> --profile condanf-core subworkflows test <SUBWORKFLOW> --profile dockernf-core subworkflows test <SUBWORKFLOW> --profile singularitynf-core subworkflows test <SUBWORKFLOW> --profile conda