TE Pipeline

Snakemake pipeline that runs RepeatMasker on GENCODE transcripts, extracts TE features per transcript, and compares distributions between protein-coding and lncRNA classes via statistical tests and plots.

Requirements

Conda / Mamba (environment: workflow/envs/te_analysis.yaml)
RepeatMasker (installed in the conda env)
Snakemake ≥ 9

Quick start

# copy and edit the default config
cp config/config_v47.yaml config/config_mine.yaml
# edit paths: gencode_gtf, gencode_fasta, pc/lncrna FASTAs or ID files

snakemake --snakefile Snakefile \
          --configfile config/config_mine.yaml \
          --profile profiles/default \
          --use-conda

# or simply
snakemake

Classification modes

Set classification_mode in the config to one of:

Mode	Required keys
`fasta`	`pc_transcripts_fasta`, `lncrna_transcripts_fasta`
`id_file`	`pc_transcript_ids_file`, `lncrna_transcript_ids_file`

Pipeline steps

Step	Rule	Output
1	`parse_gencode_gtf`	`annotation/transcripts_from_gtf.bed`, `transcript_biotypes.txt`
1.5	`extract_transcript_lengths`	`annotation/transcript_lengths.txt`
2	`extract_ids_from_fasta`	`annotation/pc_transcript_ids.txt`, `lncrna_transcript_ids.txt`
4	`index_transcripts`	`annotation/all_transcripts.fa.fai`
5	`check_fasta_headers`	`annotation/all_transcripts_headers_checked.fa`
5	`run_repeatmasker_full`	`repeatmasker/all_transcripts.out{,.gff}`
6	`extract_all_features`	`features/all_transcripts_te_features.csv`
7	`filter_features_by_class`	`features/{pc,lncrna}_transcripts_te_features.csv`
8	`combine_classified_features`	`combined/classified_te_features.csv`
9	`statistical_analysis`	`analysis/univariate_tests.csv`, `summary_report.txt`, …
10	`generate_visualizations`	`plots/te_presence_comparison.png`, `volcano_plot.png`, `pca_plot.png`

All outputs are namespaced under results/{dataset}/.

Configs

File	Purpose
`config/config_47.yaml`	GENCODE v47 run
`config/config_49.yaml`	GENCODE v49 run

Dev info

See te_pipeline_manifest.yaml for the full rule catalogue, script inventory, dataset run status, and cleanup candidates.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
archive		archive
config		config
profiles/default		profiles/default
workflow		workflow
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
README.md		README.md
Snakefile		Snakefile
te_pipeline_manifest.yaml		te_pipeline_manifest.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TE Pipeline

Requirements

Quick start

Classification modes

Pipeline steps

Configs

Dev info

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TE Pipeline

Requirements

Quick start

Classification modes

Pipeline steps

Configs

Dev info

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages