fastder

fastder is a C++-based tool for detecting expressed regions in RNA-seq data. It is intended to build on the recount3 resource, which consists of over 750'000 uniformly processed RNA-seq samples from mouse and human studies. The tool aims to reconstruct expressed genes prior to splicing using an annotation-agnostic approach.

fastder takes genome-wide coverage bigWig files and splice junction coordinates as input. The tool averages coverage across samples and applies a threshold to identify runs of consecutive positions with above-threshold expression. fastder then attempts to stitch these expressed regions (ERs) together by searching for splice junctions whose coordinates coincide with the end position of one ER and the start position of the next.
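The averaging and thresholding step can be sketched as follows. This is a minimal illustration, not fastder's actual implementation: the names `ExpressedRegion`, `average_coverage` and `find_expressed_regions` are hypothetical, coverage is assumed to be already normalized to CPM and stored as one value per base pair, and regions are reported as half-open intervals.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type: half-open interval [start, end) in per-base coordinates.
struct ExpressedRegion {
    std::size_t start;
    std::size_t end;
    double mean_coverage;
};

// Average per-base coverage across samples.
// Assumption: every sample covers the same positions (equal vector lengths).
std::vector<double> average_coverage(const std::vector<std::vector<double>>& samples) {
    if (samples.empty()) return {};
    std::vector<double> mean(samples.front().size(), 0.0);
    for (const auto& sample : samples) {
        assert(sample.size() == mean.size());
        for (std::size_t i = 0; i < mean.size(); ++i) mean[i] += sample[i];
    }
    for (double& value : mean) value /= static_cast<double>(samples.size());
    return mean;
}

// Collect runs of consecutive positions with coverage >= min_coverage
// and keep only runs that are at least min_length positions long.
std::vector<ExpressedRegion> find_expressed_regions(const std::vector<double>& coverage,
                                                    double min_coverage,
                                                    std::size_t min_length) {
    std::vector<ExpressedRegion> regions;
    std::size_t start = 0;
    double sum = 0.0;
    bool open = false;
    // Iterate one step past the end so that a run touching the last position is closed.
    for (std::size_t i = 0; i <= coverage.size(); ++i) {
        const bool above = i < coverage.size() && coverage[i] >= min_coverage;
        if (above) {
            if (!open) {
                start = i;
                sum = 0.0;
                open = true;
            }
            sum += coverage[i];
        } else if (open) {
            const std::size_t length = i - start;
            if (length >= min_length) {
                regions.push_back({start, i, sum / static_cast<double>(length)});
            }
            open = false;
        }
    }
    return regions;
}
```

Calling `find_expressed_regions(average_coverage(samples), 0.25, 5)` corresponds to the default thresholds described in the Usage section. A single pass over the averaged coverage is enough, because each position is either appended to the currently open run or closes it.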

Installation

Recount3 Background

recount3 provides RNA-seq data for over 8'000 human and over 10'000 mouse studies. Each study consists of multiple per-sample coverage bigWig files and one set of per-study splice junction coordinate files, among other files. These datasets can be downloaded from the recount3 online platform. Thus, the user can either provide data from one of the existing studies or run the recount3 pipeline on new RNA-seq data.

Input data

recount3 provides uniformly processed RNA-seq data for over 8'000 human and over 10'000 mouse studies. Each study consists of several thousand samples. Existing input files can be retrieved from the recount3 online platform. To run fastder on new RNA-seq data, the easiest way to obtain the required input files is to run the recount3 pipeline on that data.

Recount3 Pipeline

fastder builds on the Monorail pipeline used by recount3. Monorail takes the FASTQ files produced by Illumina sequencing as input. A brief summary of the relevant steps in the Monorail pipeline (used to create the recount3 resources) is provided below:

1. Input data:
   - unpaired or paired-end FASTQ files
   - suffix-array-based index of the reference genome sequence
2. Perform spliced alignment with STAR to obtain:
   - a BAM file with the spliced alignment
   - a summary of detected splice junctions (SJ.out.tab)
3. Use Megadepth to produce bigWig coverage files.
4. Aggregate the SJ.out.tab files into:
   - an MM file
   - an RR file

Code Structure

Relational Database Model

The following diagram provides an overview of the tables and objects used in fastder. The _File suffix indicates that the table is one of the input files. All other tables are objects created by the Parser class to map between the three different sample IDs (in lilac) used by the splice junction and coverage files respectively.

Sequence Diagram

The following sequence diagram provides an abstracted overview of the three main functional stages of fastder.

Usage

fastder can currently take only one RR file and one MM file as input. Thus, users working directly with recount3 resources can only provide samples from a single study.

fastder expects all input files to be in the same folder (provided as a relative path from the build directory with --dir).
fastder allows users to optionally specify which chromosomes they wish to analyze. For example, --chr chr1 means that the tool will only output expressed regions on chromosome 1 and will ignore all coverage and splice junction information from other chromosomes.
fastder allows optionally specifying four different thresholds:
- --min-coverage 0.25 describes the coverage threshold of an expressed region (ER). A base-pair position must have at least 0.25 CPM coverage to be added to an ER.
- --min-length 5 describes the minimum length (in bp) that an ER must have. For instance, three consecutive base pairs with coverage above 0.25 CPM will be ignored if the minimum length is set to 5 bp.
- --position-tolerance 5 describes the maximum permitted offset between the end position of an ER and the start position of a splice junction. If this tolerance is set to 5, an ER with end position = 1000 bp and a splice junction with start position = 1005 bp will be stitched together (provided that the coverage and the end of the junction also match).
- --coverage-tolerance 0.1 describes the maximum permitted relative coverage deviation between two ERs that are separated by a spliced region. For a coverage tolerance of 0.1, two ERs with coverage = 10 CPM and 11 CPM will be stitched together (provided that there is a matching splice junction).
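The two tolerance parameters can be read as a single stitching test on an upstream ER, a splice junction and a downstream ER. The sketch below is an illustration under stated assumptions, not fastder's actual code: `Region`, `Junction` and `can_stitch` are hypothetical names, the junction end is assumed to be compared against the start of the downstream ER with the same position tolerance, and the coverage deviation is assumed to be measured relative to the smaller of the two coverages.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>

// Hypothetical types used only for this illustration.
struct Region {
    std::int64_t start;  // first position of the ER [bp]
    std::int64_t end;    // last position of the ER [bp]
    double coverage;     // average coverage of the ER [CPM]
};

struct Junction {
    std::int64_t start;  // splice junction start [bp]
    std::int64_t end;    // splice junction end [bp]
};

// Returns true if the upstream and downstream ERs may be stitched across the junction.
bool can_stitch(const Region& upstream, const Junction& junction, const Region& downstream,
                std::int64_t position_tolerance, double coverage_tolerance) {
    assert(position_tolerance >= 0 && coverage_tolerance >= 0.0);

    // The junction must start near the end of the upstream ER ...
    const bool start_ok = std::abs(junction.start - upstream.end) <= position_tolerance;
    // ... and end near the start of the downstream ER (assumed symmetric check).
    const bool end_ok = std::abs(downstream.start - junction.end) <= position_tolerance;

    // Relative coverage deviation, written as a product to avoid dividing by zero.
    // Assumption: the deviation is taken relative to the smaller coverage.
    const double difference = std::fabs(upstream.coverage - downstream.coverage);
    const double reference = std::min(upstream.coverage, downstream.coverage);
    const bool coverage_ok = difference <= coverage_tolerance * reference;

    return start_ok && end_ok && coverage_ok;
}
```

With the default values, an ER ending at 1000 bp with 10 CPM, a junction starting at 1005 bp and a downstream ER with 11 CPM that starts within 5 bp of the junction end pass all three checks. Moving the junction start to 1006 bp or raising the downstream coverage to 12 CPM makes the test fail.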

A visualization of the different parameters is provided below.

Usage:
   fastder \
      --dir <path> ... \
      [--chr <chr1> <chr2> ...] \
      [--min-length <float>] \
      [--min-coverage <float>] \
      [--position-tolerance <int>] \
      [--coverage-tolerance <float>] \
      [--help]

Required inputs:

   --dir <path> ...                             Relative path from the build directory to the directory containing the input files.
                                                Example: --dir ../../data/test_exon_skipping

Optional inputs:

   --chr <chr1> <chr2> ...                      List of chromosomes to process.
                                                Default: all (chr1-chr22, chrX)
                                                Example: --chr chr1 chr2 chr3
                                                
   --min-length <float>                         Minimum length [#bp] required for a region to qualify as an expressed region (ER).
                                                Default: 5 bp
                                                Example: --min-length 5
                                                
   --min-coverage <float>                       Minimum coverage [CPM] required for a region to qualify as an ER.
                                                Normalized in-place by library size.
                                                Default: 0.25 CPM
                                                Example: --min-coverage 0.25
   
   --position-tolerance <int>                   Maximum allowed positional deviation between splice junction and ER coordinates [bp].
                                                Default: 5 bp
                                                Example: --position-tolerance 5
   
   --coverage-tolerance <float>                 Allowed relative deviation in coverage between stitched ERs (e.g. 0.1 = 10%).
                                                Default: 0.1
                                                Example: --coverage-tolerance 0.1
   
   --help                                       Show this help message.
   

Example:
   
   fastder \
   --dir ../../data/input \
   --chr chr1 chr2 \
   --position-tolerance 5 \
   --min-length 5 \
   --min-coverage 0.25 \
   --coverage-tolerance 0.1

License

GPLv3

Contact

martina.lavanya@gmail.com

TODO

Snakemake pipeline
Installation requirements: CMake version 4 or newer
