Fast and useful helper scripts for analysis of genotype data in VCF files

WGLab/GenoSnap

VCF Extract — Genotype Data Helper Scripts

A small collection of fast, standalone Python scripts for processing genotype data from VCF files and related resources (reference FASTA, HPO, ClinVar).

Requirements: Python 3. The scripts use only the standard library; vcf_clinvar_pathogenic.py additionally needs network access to query NCBI.


Scripts

vcf_to_flanks_fasta.py

Extracts flanking reference sequence for each variant in a VCF.

  • Input: VCF + whole-genome FASTA
  • Output: FASTA with one record per variant (configurable window, e.g. 100 bp up + ref + 100 bp down)
  • Features: Uses a .fai index (built if missing), supports --include-ref, chr1/1 normalization, REF validation
python vcf_to_flanks_fasta.py --fasta genome.fa --vcf variants.vcf --out flanks.fasta
python vcf_to_flanks_fasta.py --fasta genome.fa --vcf variants.vcf --out flanks.fasta --include-ref --up 100 --down 100
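
To illustrate how indexed flank extraction can work, here is a minimal sketch, not the script's actual code. It assumes the index holds the same four numbers per contig that a samtools-style .fai file stores (length, byte offset, bases per line, bytes per line). The function names `build_fai`, `fetch`, and `flanks` and the toy FASTA are hypothetical.

```python
import os
import tempfile


def build_fai(fasta_path):
    """Scan a FASTA once; return {contig: (length, offset, line_bases, line_bytes)}."""
    index = {}
    name = None
    with open(fasta_path, "rb") as fh:
        pos = 0
        for raw in fh:
            if raw.startswith(b">"):
                name = raw[1:].split()[0].decode()
                # The sequence starts right after the header line.
                index[name] = [0, pos + len(raw), 0, 0]
            elif name is not None and raw.strip():
                entry = index[name]
                if entry[2] == 0:  # the first sequence line fixes the line geometry
                    entry[2] = len(raw.strip())
                    entry[3] = len(raw)
                entry[0] += len(raw.strip())
            pos += len(raw)
    return {k: tuple(v) for k, v in index.items()}


def fetch(fh, entry, start, end):
    """Return bases in the 0-based half-open interval [start, end), clipped to the contig."""
    length, offset, line_bases, line_bytes = entry
    start, end = max(0, start), min(length, end)
    if start >= end:
        return ""
    # Byte position of a base: whole lines before it plus its column within the line.
    first = offset + (start // line_bases) * line_bytes + start % line_bases
    last = offset + ((end - 1) // line_bases) * line_bytes + (end - 1) % line_bases
    fh.seek(first)
    chunk = fh.read(last - first + 1)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode().upper()


def flanks(fh, index, chrom, pos, ref, up=100, down=100):
    """Return (upstream, ref_from_fasta, downstream) for a 1-based VCF position."""
    # Accept "chr20" or "20", whichever naming the FASTA uses.
    if chrom not in index:
        chrom = chrom[3:] if chrom.startswith("chr") else "chr" + chrom
    entry = index[chrom]
    start = pos - 1  # VCF POS is 1-based; convert to 0-based
    observed = fetch(fh, entry, start, start + len(ref))
    if observed != ref.upper():
        raise ValueError(f"REF mismatch at {chrom}:{pos}: VCF {ref}, FASTA {observed}")
    return (
        fetch(fh, entry, start - up, start),
        observed,
        fetch(fh, entry, start + len(ref), start + len(ref) + down),
    )


# Toy FASTA standing in for a real genome (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.fa")
    with open(path, "w", newline="\n") as out:
        out.write(">20 toy contig\nACGTACGTAC\nGGGTTTCCCA\nAT\n")
    index = build_fai(path)
    with open(path, "rb") as fh:
        result = flanks(fh, index, "chr20", 11, "G", up=3, down=4)
print(result)
```

Seeking by byte offset means only the requested window is read, which is why an index makes per-variant lookups cheap even on a whole-genome FASTA.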

vcf_clinvar_pathogenic.py

Checks VCF variants against ClinVar via NCBI E-utilities and reports pathogenic / likely pathogenic hits.

  • Input: VCF
  • Output: Tab-delimited file of variants that are pathogenic or likely pathogenic in ClinVar
  • Features: GRCh37 or GRCh38 via --assembly; rate-limited NCBI requests
python vcf_clinvar_pathogenic.py --vcf input.vcf --assembly GRCh38 --out clinvar_hits.tsv
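
As a rough illustration of a rate-limited ClinVar lookup, the sketch below builds an E-utilities esearch URL and spaces out requests. It is an assumption-laden example rather than the script's implementation: the `chr`, `chrpos37`, and `chrpos38` search fields are my reading of ClinVar's search field list, and `clinvar_search_url`, `RateLimiter`, and `is_reportable` are hypothetical names. No request is actually sent here.

```python
import time
import urllib.parse

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def clinvar_search_url(chrom, pos, assembly="GRCh38"):
    """Build an esearch URL for ClinVar records at one position (assumed field names)."""
    field = {"GRCh37": "chrpos37", "GRCh38": "chrpos38"}[assembly]
    chrom = chrom[3:] if chrom.startswith("chr") else chrom
    term = f"{chrom}[chr] AND {pos}[{field}]"
    query = urllib.parse.urlencode({"db": "clinvar", "term": term, "retmode": "json"})
    return f"{EUTILS_ESEARCH}?{query}"


class RateLimiter:
    """Sleep as needed so successive wait() calls are at least `interval` seconds apart."""

    def __init__(self, interval=0.34):
        # About 3 requests per second, NCBI's documented limit without an API key.
        self.interval = interval
        self._next = 0.0

    def wait(self):
        delay = self._next - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        self._next = time.monotonic() + self.interval


def is_reportable(significance):
    """True if a clinical-significance string includes pathogenic or likely pathogenic."""
    labels = {part.strip().lower() for part in significance.replace(";", "/").split("/")}
    return bool(labels & {"pathogenic", "likely pathogenic"})


url = clinvar_search_url("chr20", 14370, "GRCh38")
print(url)
```

A caller would invoke `limiter.wait()` before each request, then keep only records whose significance passes `is_reportable`.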

hpo_terms_to_gene.py

Predicts likely causal genes from a list of HPO phenotype terms using an HPO annotation database.

  • Input: File of HPO term IDs (one per line) + phenotype annotation (HPOA or custom TSV)
  • Output: Ranked candidate genes (and optionally top diseases)
  • Features: Binary or IC-weighted scoring, optional OBO propagation, disease→gene mapping
python hpo_terms_to_gene.py --hpo_terms patient_hpo.txt --hpoa phenotype.hpoa --topk 20 --show_diseases
python hpo_terms_to_gene.py --hpo_terms terms.txt --tsv disease_hpo_gene.tsv --out results.tsv --weights ic
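
The difference between binary and IC-weighted scoring can be sketched as follows. This is one plausible reading, not necessarily the script's scoring: information content is taken as the negative log of the fraction of diseases annotated with a term, and a gene inherits the best score among its diseases. The disease IDs, gene names, and annotations are invented toy data, and OBO propagation is omitted.

```python
import math
from collections import defaultdict


def information_content(disease_terms):
    """IC(term) = -log(fraction of diseases annotated with the term)."""
    n = len(disease_terms)
    counts = defaultdict(int)
    for terms in disease_terms.values():
        for term in set(terms):
            counts[term] += 1
    return {term: -math.log(count / n) for term, count in counts.items()}


def rank_genes(patient_terms, disease_terms, disease_genes, weights="binary"):
    """Score each disease by shared terms, then map disease scores onto genes."""
    ic = information_content(disease_terms) if weights == "ic" else {}
    patient = set(patient_terms)
    gene_scores = {}
    for disease, terms in disease_terms.items():
        shared = patient & set(terms)
        if weights == "ic":
            score = sum(ic[t] for t in shared)  # rare terms count for more
        else:
            score = float(len(shared))  # every shared term counts as 1
        for gene in disease_genes.get(disease, ()):
            gene_scores[gene] = max(gene_scores.get(gene, 0.0), score)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(gene_scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy annotations (hypothetical diseases and genes).
disease_terms = {
    "D1": ["HP:0001250", "HP:0001263"],
    "D2": ["HP:0001250", "HP:0000252"],
    "D3": ["HP:0001250"],
}
disease_genes = {"D1": ["GENE_A"], "D2": ["GENE_B"], "D3": ["GENE_C"]}
patient = ["HP:0001250", "HP:0000252"]

binary_rank = rank_genes(patient, disease_terms, disease_genes, weights="binary")
ic_rank = rank_genes(patient, disease_terms, disease_genes, weights="ic")
print(binary_rank)
print(ic_rank)
```

In this toy set, a term annotated to every disease has an IC of zero, so under IC weighting it no longer separates candidates, whereas binary scoring still gives it full credit.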

make_example_genome.py

Builds a small example reference FASTA (contigs 16 and 20) with REF bases matching ex2.vcf, for testing vcf_to_flanks_fasta.py without a full genome.

python make_example_genome.py
# Creates genome.fa in the current directory
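
One way to build such a test genome is to fill each contig with random bases and then overwrite the variant positions with the expected REF alleles. The sketch below illustrates that idea only; the positions and alleles are hypothetical and are not taken from ex2.vcf, and `make_contig` and `write_fasta` are illustrative names.

```python
import os
import random
import tempfile


def make_contig(length, ref_sites, seed=0):
    """Random sequence with given REF alleles planted at 1-based positions."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    bases = [rng.choice("ACGT") for _ in range(length)]
    for pos, ref in ref_sites.items():
        start = pos - 1
        if start < 0 or start + len(ref) > length:
            raise ValueError(f"REF at position {pos} does not fit in {length} bp")
        bases[start:start + len(ref)] = ref.upper()
    return "".join(bases)


def write_fasta(path, contigs, width=60):
    """Write contigs as FASTA with fixed-width sequence lines."""
    with open(path, "w", newline="\n") as out:
        for name, seq in contigs.items():
            out.write(f">{name}\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")


# Hypothetical sites; a real run would read POS and REF from the VCF.
contigs = {
    "16": make_contig(150, {75: "A"}, seed=16),
    "20": make_contig(150, {100: "G", 120: "TAC"}, seed=20),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "genome.fa")
    write_fasta(path, contigs)
    with open(path) as fh:
        lines = fh.read().splitlines()
print(lines[0], len(lines))
```

Uniform line width matters here because offset-based FASTA indexing assumes every sequence line in a contig, except the last, has the same length.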

Quick start

  1. Clone or download this repo.
  2. Run any script with --help for full options, e.g.
    python vcf_to_flanks_fasta.py --help
  3. Use ex2.vcf and the genome produced by make_example_genome.py to try the flank extraction pipeline.

License

Use and modify as needed for your projects.
