Skip to content

blab/esm-selection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

186 Commits
 
 
 
 
 
 
 
 

Repository files navigation

ESM Selection

This is a colleciton of scripts and figures that went into the ESM selection project, currently in progress.

Table of Contents

Flu Snakemake Pipeline

Workflow

flowchart TD;
    translate_nucleotide_json-->extract_node_fastas;
    extract_node_fastas-->extract_max_freq;
    extract_max_freq-->calc_ll_esm;
    calc_ll_esm-->generate_max_freq_v_ll_plots;
Loading

Setup

Install nextstrain apptainer container and place in envs folder

apptainer build nextstrain-base.sif docker://nextstrain/base

For running on Conatus Server:

Load Snakemake and PyTorch Module

module load snakemake PyTorch/2.1.2-foss-2023a-CUDA-12.1.1

Example

Run on 650M Model in parallel

snakemake --cores 8 --resources gpu=8 --use-singularity --config model=650M

Run on 3B Model in parallel

snakemake --cores 8 --resources gpu=8 --use-singularity --config model=3B

Warning

Running the 15B model is memory intensive, it is recomended to not run all segments at once

Run on 15B Model

snakemake --cores 1 --resources gpu=1 --use-singularity --config model=15B

Options

List commands for the pipline:

Command Description
--cores How many cpus to allocate, more cores is faster as it will run processes in parallel
--resources Choose how many GPUs to allocate for ESM Log Likelihood Calculation, use gpu=
--use-singularity If running on conatus server use singularity
--use-conda Run Workflow with conda, used for local development on mac
--config Choose what model to run ESM with, options: 650M, 3B, 15B

Output

├── max_freqs                                       # Extracted Max Frequencies from nextrain trees
│   ├── Max_Freq_Fasta_ha.csv
│   ├── Max_Freq_Fasta_mp.csv
│   ├── Max_Freq_Fasta_na.csv
│   ├── Max_Freq_Fasta_np.csv
│   ├── Max_Freq_Fasta_ns.csv
│   ├── Max_Freq_Fasta_pa.csv
│   ├── Max_Freq_Fasta_pb1.csv
│   └── Max_Freq_Fasta_pb2.csv
├── max_freqs_log_likelyhood_esm2_t33_650M_UR50D    # Extracted Log likelihood for 650M Model
│   ├── Max_Freq_Fasta_LL_esm2_t33_650M_UR50D_ha.csv
│   ├── Max_Freq_Fasta_LL_esm2_t33_650M_UR50D_mp.csv
│   ├── Max_Freq_Fasta_LL_esm2_t33_650M_UR50D_na.csv
│   ├── Max_Freq_Fasta_LL_esm2_t33_650M_UR50D_np.csv
│   ├── Max_Freq_Fasta_LL_esm2_t33_650M_UR50D_ns.csv
│   ├── Max_Freq_Fasta_LL_esm2_t33_650M_UR50D_pa.csv
│   ├── Max_Freq_Fasta_LL_esm2_t33_650M_UR50D_pb1.csv
│   └── Max_Freq_Fasta_LL_esm2_t33_650M_UR50D_pb2.csv
├── max_freqs_log_likelyhood_esm2_t36_3B_UR50D      # Extracted Log likelihood for 3B Model
│   ├── Max_Freq_Fasta_LL_esm2_t36_3B_UR50D_ha.csv
│   ├── Max_Freq_Fasta_LL_esm2_t36_3B_UR50D_mp.csv
│   ├── Max_Freq_Fasta_LL_esm2_t36_3B_UR50D_na.csv
│   ├── Max_Freq_Fasta_LL_esm2_t36_3B_UR50D_np.csv
│   ├── Max_Freq_Fasta_LL_esm2_t36_3B_UR50D_ns.csv
│   ├── Max_Freq_Fasta_LL_esm2_t36_3B_UR50D_pa.csv
│   ├── Max_Freq_Fasta_LL_esm2_t36_3B_UR50D_pb1.csv
│   └── Max_Freq_Fasta_LL_esm2_t36_3B_UR50D_pb2.csv
├── max_freqs_log_likelyhood_esm2_t48_15B_UR50D     # Extracted Log likelihood for 15B Model
│   ├── Max_Freq_Fasta_LL_esm2_t48_15B_UR50D_ha.csv
│   ├── Max_Freq_Fasta_LL_esm2_t48_15B_UR50D_mp.csv
│   ├── Max_Freq_Fasta_LL_esm2_t48_15B_UR50D_na.csv
│   ├── Max_Freq_Fasta_LL_esm2_t48_15B_UR50D_np.csv
│   ├── Max_Freq_Fasta_LL_esm2_t48_15B_UR50D_ns.csv
│   ├── Max_Freq_Fasta_LL_esm2_t48_15B_UR50D_pa.csv
│   ├── Max_Freq_Fasta_LL_esm2_t48_15B_UR50D_pb1.csv
│   └── Max_Freq_Fasta_LL_esm2_t48_15B_UR50D_pb2.csv
├── node_fastas                                     # Extracted Fasta per Node
│   ├── nodeSeqs_ha.fasta
│   ├── nodeSeqs_mp.fasta
│   ├── nodeSeqs_na.fasta
│   ├── nodeSeqs_np.fasta
│   ├── nodeSeqs_ns.fasta
│   ├── nodeSeqs_pa.fasta
│   ├── nodeSeqs_pb1.fasta
│   └── nodeSeqs_pb2.fasta
├── root_tree_translated                            # Translated Root Sequences
│   ├── h3n2_60y_ha_root-sequence.json
│   ├── h3n2_60y_mp_root-sequence.json
│   ├── h3n2_60y_na_root-sequence.json
│   ├── h3n2_60y_np_root-sequence.json
│   ├── h3n2_60y_ns_root-sequence.json
│   ├── h3n2_60y_pa_root-sequence.json
│   ├── h3n2_60y_pb1_root-sequence.json
│   └── h3n2_60y_pb2_root-sequence.json

Figures

Scripts to generate figures are in Figures.ipynb notebook

ESM_vs_Max_Freq_Plots Contains figures comparing Max Freq to LL for all three models.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages