Conversation
|
Technically shouldn't it be the symbolic allele <*>, the dot allele being missing data? |
|
Not |
|
Ideally, yes. Unfortunately the VCF specification only allows non-negative integers or |
|
I think @mp15 meant to use |
|
I am using bcftools norm at the moment and I think I have observed a problematic case during my use of the behaviour of bcftools norm for GT splitting, just as @jpiper described. I am splitting multi allelic haploid sites to filter on the ratio of alternate allele observation counts over read depth. However when I try merging bi allelic sites into multi allelic sites after the filtering, the conflicting GT are badly resolved for everything but the first allele I think. Consider this : (I modified by hand the AD value for my test cases) If I filter to keep only calls supported by more than 0.75 of the observation with the following command : I get the following result: The third sample was correctly set to GT='.' whether the fourth one is now "0" where I would like him to be "." By any chance do you think the proposed enhancement here is coming for bcftools ? Or by chance would you know an easier way of filtering each sample based with an ALT call based on the frequency of the most frequent allele observation like this ? For now the only fix I have thought about is adding an extra filtering for the reference calls that seem to be "erroneous", such as this : but I think this might be a bit dangerous. Like this, both entries for the fourth sample in the MWE I presented are set to "." and stay "." when merged. Thanks a lot for your help and thanks a lot for the tool ! |
|
I think its a fundamental problem with representing multi allelics as binary.
If we try and convert to two bi-allelics using A as the reference we can represent the following genotypes
We lose the non-reference mutual heterozygotes i.e. the B/C genotype class The * allele code is for long deletions causing a third allele. Variant graphs will sort this out in the long term. Short term: Filter them out and hope it's not a signal lost for analysis in GWAS. |
Hi All,
Second time lucky (first pull request was polluted with stuff). It appears to me there's a pretty serious flaw with GT flag splitting in the normalisation. This is what I'm seeing:
Splitting multiallelic sites such as
with a GT of
1/2withbcftools norm -m- <in>incorrectly yields two records with a GT1/0and0/1, which should really be be1/.and./1In the integration tests, I also notice sites with a
2getting split to a0and a1site...to
which should surely be
.and1(For humans, you wouldn't expect a haploid call in this specific region - the spec says we would only expect this in M or Y for humans)I have patched here to replace
with
and altered the expected output from the unit tests accordingly, in case you find this is actually a bug rather than intended behaviour.