dicepro

Deconvolution with Iterative Completion for Estimating cellular Proportions from RNA-seq data.


Overview

Bulk RNA-seq deconvolution infers the proportions of distinct cell types from a mixed gene expression profile. Most existing methods assume that the reference signature matrix is complete — i.e., that every cell population present in the bulk sample is represented. In practice this assumption rarely holds, leading to biased estimates.
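The bias described above can be reproduced on toy data. The following is a minimal sketch in base R, not dicepro code: the matrix sizes, the names `W_full`, `p_true`, and `fit_props`, and the plain non-negative least-squares fit are all assumptions made for illustration.

```r
set.seed(1)

# Hypothetical toy setup: 50 genes, 4 cell types, one noise-free bulk sample.
n_genes <- 50
W_full  <- matrix(rexp(n_genes * 4), nrow = n_genes,
                  dimnames = list(NULL, paste0("ct", 1:4)))
p_true  <- c(ct1 = 0.4, ct2 = 0.2, ct3 = 0.1, ct4 = 0.3)
bulk    <- as.vector(W_full %*% p_true)

# Non-negative least squares via optim(), then renormalise to sum to 1.
fit_props <- function(W, b) {
  loss <- function(p) sum((b - W %*% p)^2)
  grad <- function(p) as.vector(-2 * crossprod(W, b - W %*% p))
  est  <- optim(rep(1 / ncol(W), ncol(W)), loss, grad,
                method = "L-BFGS-B", lower = 0)$par
  setNames(est / sum(est), colnames(W))
}

p_complete   <- fit_props(W_full, bulk)         # all four types in the reference
p_incomplete <- fit_props(W_full[, 1:3], bulk)  # ct4 missing from the reference

# Compare the three known cell types under both references.
round(rbind(true       = p_true[1:3],
            complete   = p_complete[1:3],
            incomplete = p_incomplete[1:3]), 3)
```

With the incomplete reference the three known proportions are forced to sum to 1, although their true total is 0.7, so the share of the missing population is redistributed among them.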

dicepro addresses this limitation through an iterative joint optimisation that simultaneously:

  • estimates cell-type proportions for known populations (supervised step, via CIBERSORTx, FARDEEP, DCQ, CDSeq, or BayesPrism), and
  • discovers and quantifies unknown populations using Non-Negative Matrix Factorisation (NMF) with L-BFGS-B optimisation (unsupervised step).

Hyperparameters $(\lambda, \gamma, p')$ controlling the NMF regularisation are selected automatically via a Pareto-frontier + knee-point procedure, so no manual tuning is required.
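The idea of a Pareto frontier followed by a knee-point choice can be sketched in base R. This is an illustration under assumptions, not the package's implementation: the README does not state which knee criterion dicepro uses, so the sketch takes one common definition (the front point farthest from the chord joining the two extremes), and the function names and trial values are hypothetical.

```r
# Indices of non-dominated trials when both objectives are minimised.
pareto_front <- function(loss, constraint) {
  keep <- vapply(seq_along(loss), function(i) {
    dominates_i <- loss <= loss[i] & constraint <= constraint[i] &
      (loss < loss[i] | constraint < constraint[i])
    !any(dominates_i)
  }, logical(1))
  which(keep)
}

# Knee point: front member farthest from the chord between the two extreme
# front points, after rescaling both objectives to [0, 1].
knee_point <- function(loss, constraint) {
  front <- pareto_front(loss, constraint)
  front <- front[order(loss[front])]
  if (length(front) < 3) return(front[1])
  rescale <- function(x) (x - min(x)) / (max(x) - min(x))
  x <- rescale(loss[front])
  y <- rescale(constraint[front])
  n <- length(front)
  # Quantity proportional to the point-to-chord distance
  d <- abs((y[n] - y[1]) * x - (x[n] - x[1]) * y + x[n] * y[1] - y[n] * x[1])
  front[which.max(d)]
}

# Hypothetical trial results: one row per evaluated configuration.
trials <- data.frame(
  loss       = c(1, 2, 3, 6, 10, 5),
  constraint = c(10, 4, 2, 1.5, 1, 5)
)
pareto_front(trials$loss, trials$constraint)  # trial 6 is dominated
knee_point(trials$loss, trials$constraint)
```

The knee is the trial where further reducing one objective starts to cost disproportionately more in the other, which is why it serves as a reasonable default when no manual trade-off is given.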


Key Features

  • Incomplete-reference robustness — recovers cell types absent from the reference matrix.
  • Method-agnostic supervised step — plug in any supported deconvolution backend.
  • Automated hyperparameter search — random search over a log-uniform grid with Pareto-optimal selection.
  • Bundled benchmark data: BlueCode (34-cell-type reference) and CellMixtures (12 experimentally mixed bulk samples) are included.
  • Rich diagnostics — Pareto plot and hyperparameter scatter matrix saved automatically to output_path/report/.
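For readers unfamiliar with log-uniform sampling, the sketch below shows what drawing hyperparameter candidates on a log scale looks like in base R. The helper `sample_log_uniform` and the search ranges are assumptions; the ranges dicepro uses internally are not stated here.

```r
set.seed(42)

# Draw n values whose base-10 logarithm is uniform on [log10(lower), log10(upper)].
sample_log_uniform <- function(n, lower, upper) {
  stopifnot(lower > 0, upper > lower)
  10^runif(n, min = log10(lower), max = log10(upper))
}

# Hypothetical ranges for two regularisation weights.
candidates <- data.frame(
  lambda = sample_log_uniform(100, 1e-4, 1e2),
  gamma  = sample_log_uniform(100, 1e-4, 1e2)
)

# Each decade receives a comparable number of draws.
table(cut(log10(candidates$lambda), breaks = -4:2))
```

Sampling on a log scale spreads the candidates evenly across orders of magnitude, which suits regularisation weights whose useful values may span several decades.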

Installation

From CRAN (stable)

install.packages("dicepro")

From GitHub (development)

# install.packages("remotes")
remotes::install_github("kalidouBA/dicepro")

Quick Start

Simulated Data

library(dicepro)
set.seed(2101)

# 1. Simulate reference, proportions, and noisy bulk
sim <- simulation(
  scenario   = "hierarchical",
  nSample    = 30,
  nGenes     = 200,
  nCellsType = 10,
  sigma_bio  = 0.07,
  sigma_tech = 0.07
)

# 2. Run dicepro
out <- dicepro(
  reference             = as.matrix(sim$W),
  bulk                  = as.matrix(sim$B),
  methodDeconv          = "FARDEEP",
  bulkName              = "SimBulk",
  refName               = "SimRef",
  hp_max_evals          = 100L,
  hspaceTechniqueChoose = "all",
  output_path           = tempdir()
)

# 3. Inspect results
class(out)
out$hyperparameters    # best lambda / gamma
head(out$H)            # estimated proportions
out$plot               # interactive Pareto plot
out$plot_hyperopt      # hyperparameter scatter matrix
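With simulated data the true proportions are known, so estimates such as `out$H` can be scored against them. The helper below is a hypothetical sketch in base R: `score_proportions` is not part of dicepro, and the two matrices are stand-ins generated here, since this README does not name the element of `sim` that holds the true proportions.

```r
# Per-cell-type agreement between two samples x cell types matrices.
score_proportions <- function(est, truth) {
  shared <- intersect(colnames(est), colnames(truth))
  stopifnot(length(shared) > 0, nrow(est) == nrow(truth))
  data.frame(
    cell_type = shared,
    pearson   = vapply(shared, function(ct) cor(est[, ct], truth[, ct]),
                       numeric(1)),
    rmse      = vapply(shared, function(ct) sqrt(mean((est[, ct] - truth[, ct])^2)),
                       numeric(1)),
    row.names = NULL
  )
}

# Stand-in data: 30 samples, 3 cell types, estimates = truth + small noise.
set.seed(7)
truth <- matrix(runif(30 * 3), nrow = 30,
                dimnames = list(NULL, c("A", "B", "C")))
truth <- truth / rowSums(truth)
est   <- truth + matrix(rnorm(30 * 3, sd = 0.02), nrow = 30)

score_proportions(est, truth)
```

Only columns present in both matrices are scored, so extra columns for unknown populations in the estimate do not break the comparison.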

Real Data (BlueCode + CellMixtures)

library(dicepro)

data(BlueCode)      # 34-cell-type reference (G x 34)
data(CellMixtures)  # 12 mixed bulk samples  (G x 12)

out <- dicepro(
  reference             = BlueCode,
  bulk                  = CellMixtures,
  methodDeconv          = "FARDEEP",
  bulkName              = "CellMixtures",
  refName               = "BlueCode",
  hp_max_evals          = 100L,
  hspaceTechniqueChoose = "all",
  output_path           = tempdir()
)

head(out$H)
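When pairing your own reference with your own bulk matrix, it can help to confirm that the two share gene identifiers before a call like the one above. The helper below is a hypothetical base-R sketch; whether dicepro() performs this alignment internally is not stated here, and the small matrices are made up for illustration.

```r
# Restrict both matrices to their shared genes, in the same row order.
align_genes <- function(reference, bulk) {
  shared <- intersect(rownames(reference), rownames(bulk))
  if (length(shared) == 0) {
    stop("reference and bulk share no gene identifiers")
  }
  list(reference = reference[shared, , drop = FALSE],
       bulk      = bulk[shared, , drop = FALSE])
}

# Made-up example: the reference has genes g1..g3, the bulk has g3, g1, g4, g5.
ref  <- matrix(1:6, nrow = 3,
               dimnames = list(c("g1", "g2", "g3"), c("ctA", "ctB")))
bulk <- matrix(1:8, nrow = 4,
               dimnames = list(c("g3", "g1", "g4", "g5"), c("s1", "s2")))

aligned <- align_genes(ref, bulk)
rownames(aligned$reference)
rownames(aligned$bulk)
```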

CIBERSORTx Setup (optional)

CIBERSORTx (methodDeconv = "CSx") requires Docker and a personal token.

Step 1 — Install Docker Desktop

Download from https://www.docker.com/products/docker-desktop, open it, log in, then pull the CIBERSORTx image from a terminal:

docker pull cibersortx/fractions
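From within R, a quick check that the `docker` executable is on the search path can save a failed run later. This is a small sketch using base R only; `docker_available` is a hypothetical helper, and it confirms only that the executable is found, not that the Docker daemon is running or that the image has been pulled.

```r
# TRUE if an executable named "docker" is found on the PATH.
docker_available <- function() {
  unname(nzchar(Sys.which("docker")))
}

if (!docker_available()) {
  message("Docker was not found on the PATH; install Docker Desktop first.")
}
```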

Step 2 — Obtain a token

Request a token at https://cibersortx.stanford.edu/getoken.php. Tokens are tied to your account and expire periodically; request a new one when the existing token has expired.

Step 3 — Run dicepro with CIBERSORTx

out <- dicepro(
  reference        = BlueCode,
  bulk             = CellMixtures,
  methodDeconv     = "CSx",
  cibersortx_email = "your@email.com",
  cibersortx_token = "your_token_here",
  bulkName         = "CellMixtures",
  refName          = "BlueCode",
  output_path      = tempdir()
)
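To keep the token out of scripts and version control, one option is to read the credentials from environment variables. The sketch below uses base R only; the helper `get_csx_credentials` and the variable names `CIBERSORTX_EMAIL` and `CIBERSORTX_TOKEN` are assumptions, not something dicepro defines.

```r
# Read CIBERSORTx credentials from (hypothetical) environment variables.
get_csx_credentials <- function() {
  email <- Sys.getenv("CIBERSORTX_EMAIL", unset = "")
  token <- Sys.getenv("CIBERSORTX_TOKEN", unset = "")
  if (!nzchar(email) || !nzchar(token)) {
    stop("Set CIBERSORTX_EMAIL and CIBERSORTX_TOKEN before running.")
  }
  list(email = email, token = token)
}

# Intended use (not run here):
# cred <- get_csx_credentials()
# out  <- dicepro(..., methodDeconv = "CSx",
#                 cibersortx_email = cred$email,
#                 cibersortx_token = cred$token)
```

The variables can be set once per session with Sys.setenv() or stored in a user-level .Renviron file.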

Other supported deconvolution backends can be listed with ?running_method.


Output Structure

dicepro() returns an S3 object of class "dicepro" with the following elements:

| Element | Description |
|---|---|
| $hyperparameters | Best $\lambda$ and $\gamma$ found by the search |
| $metrics | Loss and constraint value at the optimum |
| $trials | data.frame of all evaluated hyperparameter configurations |
| $W | Optimised reference matrix (including unknown cell types) |
| $H | Estimated cell-type proportions (samples x cell types) |
| $plot | Pareto frontier |
| $plot_hyperopt | Hyperparameter scatter matrix (ggplot2) |
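Because $H is laid out as samples x cell types, ordinary matrix operations apply to it. The sketch below finds the dominant cell type per sample using a small stand-in matrix; the sample names, the cell-type names, and the column labelled "Unknown" are hypothetical and do not reflect how dicepro names its columns.

```r
# Stand-in for out$H: 2 samples x 3 cell types.
H <- matrix(c(0.6, 0.3, 0.1,
              0.2, 0.5, 0.3),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("s1", "s2"), c("ctA", "ctB", "Unknown")))

# Column index of the largest proportion in each row.
top <- max.col(H, ties.method = "first")

summary_df <- data.frame(
  sample   = rownames(H),
  dominant = colnames(H)[top],
  share    = H[cbind(seq_len(nrow(H)), top)]
)
summary_df
```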

Bundled Datasets

BlueCode

A gene x 34 cell-type reference signature matrix derived from sorted bulk RNA-seq profiles spanning five major tissue compartments: Immune (9), Stromal (8), Endothelial (3), Epithelial (5), and Muscle (9).

data(BlueCode)
dim(BlueCode)
colnames(BlueCode)

CellMixtures

A gene x 12 bulk RNA-seq matrix of experimentally constructed cell mixtures (samples A–L), paired with BlueCode for benchmarking.

data(CellMixtures)
dim(CellMixtures)
colnames(CellMixtures)

See ?BlueCode and ?CellMixtures for full documentation.


Vignettes

Two vignettes provide step-by-step walkthroughs:

vignette("vignette-simulation", package = "dicepro")
vignette("vignette-real-data",  package = "dicepro")

Citation

If you use dicepro in your research, please cite:

When Less Is Not More: Mitigates the Impact of Incomplete Reference Matrices on Cellular Frequency Deconvolution. Bioinformatics. doi:XXXX

