dicepro

Deconvolution with Iterative Completion for Estimating Cellular Proportions from RNA-seq Data


Overview

Bulk RNA-seq deconvolution infers the proportions of distinct cell types from a mixed gene expression profile. Most existing methods assume that the reference signature matrix is complete — i.e., that every cell population present in the bulk sample is represented. In practice this assumption rarely holds, leading to biased estimates.
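To make the bias concrete, here is a minimal base-R sketch on made-up data. Ordinary least squares (via `qr.solve`) stands in for a real deconvolution method; it is not how dicepro or any of its backends works, and all object names (`W`, `H`, `B`, `estimate`) are illustrative.

```r
set.seed(1)
n_genes <- 200; n_types <- 4; n_samples <- 20

# Toy signature matrix W (genes x cell types) and true proportions H
# (cell types x samples); each column of H sums to 1.
W <- matrix(rexp(n_genes * n_types), n_genes, n_types)
H <- matrix(runif(n_types * n_samples), n_types, n_samples)
H <- sweep(H, 2, colSums(H), "/")
B <- W %*% H                      # noiseless bulk expression

# Least-squares fit of the bulk on a reference (stand-in estimator).
estimate <- function(W_ref, B) qr.solve(W_ref, B)

H_full <- estimate(W, B)          # complete reference
H_miss <- estimate(W[, -4], B)    # fourth cell type missing from the reference

err_full <- max(abs(H_full - H))          # numerically zero
err_miss <- max(abs(H_miss - H[-4, ]))    # clearly non-zero
mean_shift <- mean(H_miss - H[-4, ])      # remaining types absorb the missing signal
```

With the full reference the proportions are recovered up to rounding error. Once one column is dropped, the expression of the missing population is attributed to the remaining cell types, whose estimated proportions are inflated.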

dicepro addresses this limitation through an iterative joint optimization that simultaneously:

  • estimates cell-type proportions for known populations (supervised step, via CIBERSORTx, FARDEEP, DCQ, CDSeq, or BayesPrism), and
  • discovers and quantifies unknown populations using Non-Negative Matrix Factorization (NMF) with L-BFGS-B optimization (unsupervised step).
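The unsupervised step can be pictured with a small base-R sketch that uses `optim(method = "L-BFGS-B")` with non-negativity bounds. This is an assumption-laden illustration, not dicepro's actual objective: it uses a plain squared-error loss with no regularization, a single unknown component, and made-up data. The names `W_known`, `unpack`, `loss`, and `grad` are hypothetical.

```r
set.seed(42)
G <- 50; K <- 3; N <- 15          # genes, known cell types, samples

# Made-up data: K known signatures plus one hidden signature.
W_known <- matrix(rexp(G * K), G, K)
w_true  <- rexp(G)
H_true  <- matrix(runif((K + 1) * N), K + 1, N)
H_true  <- sweep(H_true, 2, colSums(H_true), "/")
B       <- cbind(W_known, w_true) %*% H_true

# Parameter vector = unknown signature (length G) followed by H (column-major).
unpack <- function(par) {
  list(w = par[seq_len(G)],
       H = matrix(par[-seq_len(G)], K + 1, N))
}

# Squared reconstruction error ||B - [W_known, w] H||^2.
loss <- function(par) {
  p <- unpack(par)
  R <- cbind(W_known, p$w) %*% p$H - B
  sum(R^2)
}

# Analytic gradient, in the same order as the parameter vector.
grad <- function(par) {
  p <- unpack(par)
  W <- cbind(W_known, p$w)
  R <- W %*% p$H - B
  c(2 * R %*% p$H[K + 1, ],       # derivative w.r.t. the unknown signature
    2 * crossprod(W, R))          # derivative w.r.t. H
}

par0 <- c(rep(1, G), rep(1 / (K + 1), (K + 1) * N))
fit  <- optim(par0, loss, grad, method = "L-BFGS-B",
              lower = 0, control = list(maxit = 500))
est  <- unpack(fit$par)
```

Only the unknown signature and the proportions are free parameters here; the known signatures stay fixed. The scale of the unknown column and its row in `H` is not identifiable in this toy version, which is one reason a real method needs additional constraints or regularization.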

Hyper-parameters $(\lambda, \gamma, p')$ controlling the NMF regularization are selected automatically via a Pareto-frontier + knee-point procedure, so no manual tuning is required.
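As a rough illustration of Pareto-frontier plus knee-point selection, the base-R sketch below works on a made-up table of trials with two objectives to minimize. The knee definition used here (the front point farthest below the chord joining the two extremes, after rescaling) is one common heuristic and an assumption on our part; the README does not state which definition dicepro uses. The function and column names are hypothetical.

```r
set.seed(7)
n <- 60
t <- runif(n)
trials <- data.frame(
  lambda     = 10^runif(n, -3, 2),        # log-uniform draws
  gamma      = 10^runif(n, -3, 2),
  loss       = t + rexp(n, 5),            # made-up objective 1
  constraint = (1 - t)^2 + rexp(n, 5)     # made-up objective 2
)

# Indices of non-dominated points when both objectives are minimized.
pareto_front <- function(f1, f2) {
  dominated <- vapply(seq_along(f1), function(i) {
    any(f1 <= f1[i] & f2 <= f2[i] & (f1 < f1[i] | f2 < f2[i]))
  }, logical(1))
  which(!dominated)
}

# Knee of a two-objective front. After rescaling to the unit square the
# extremes sit at (0, 1) and (1, 0), so the point farthest below the chord
# joining them is the one that minimizes x + y.
knee_point <- function(f1, f2) {
  rescale <- function(x) {
    r <- diff(range(x))
    if (r > 0) (x - min(x)) / r else rep(0, length(x))
  }
  which.min(rescale(f1) + rescale(f2))
}

front <- pareto_front(trials$loss, trials$constraint)
best  <- front[knee_point(trials$loss[front], trials$constraint[front])]
trials[best, ]
```

The selected row is a compromise: moving along the front in either direction from it buys a small gain in one objective at a comparatively large cost in the other.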


Key Features

  • Incomplete-reference robustness — recovers cell types absent from the reference matrix.
  • Method-agnostic supervised step — plug in any supported deconvolution backend.
  • Automated hyper-parameter search — random search over a log-uniform grid with Pareto-optimal selection.
  • Bundled benchmark data — BlueCode (34-cell-type reference) and CellMixtures (12 experimentally mixed bulk samples) are included.
  • Rich diagnostics — Pareto plot and hyper-parameter scatter matrix saved automatically to output_path/report/.

Installation

Development version (GitHub)

# install.packages("remotes")
remotes::install_github("kalidouBA/dicepro")

Quick Start

Simulated Data

library(dicepro)
set.seed(2101)

# 1. Simulate reference, proportions, and noisy bulk
sim <- simulation(
  scenario   = "hierarchical",
  nSample    = 30,
  nGenes     = 200,
  nCellsType = 10,
  sigma_bio  = 0.07,
  sigma_tech = 0.07
)

# 2. Run dicepro
out <- dicepro(
  reference             = as.matrix(sim$W),
  bulk                  = as.matrix(sim$B),
  methodDeconv          = "FARDEEP",
  bulkName              = "SimBulk",
  refName               = "SimRef",
  hp_max_evals          = 100L,
  hspaceTechniqueChoose = "all",
  output_path           = tempdir()
)

# 3. Inspect results
class(out)
out$hyperparameters    # best lambda / gamma
head(out$H)            # estimated proportions
out$plot               # interactive Pareto plot
out$plot_hyperopt      # hyper-parameter scatter matrix

Real Data (BlueCode + CellMixtures)

library(dicepro)

data(BlueCode)      # 34-cell-type reference (G x 34)
data(CellMixtures)  # 12 mixed bulk samples  (G x 12)

out <- dicepro(
  reference             = BlueCode,
  bulk                  = CellMixtures,
  methodDeconv          = "FARDEEP",
  bulkName              = "CellMixtures",
  refName               = "BlueCode",
  hp_max_evals          = 100L,
  hspaceTechniqueChoose = "all",
  output_path           = tempdir()
)

head(out$H)

CIBERSORTx Setup (optional)

CIBERSORTx (methodDeconv = "CSx") requires Docker and a personal token.

Step 1 — Install Docker Desktop

Download from https://www.docker.com/products/docker-desktop, open it, log in, then pull the CIBERSORTx image from a terminal:

docker pull cibersortx/fractions

Step 2 — Obtain a token

Request a token at https://cibersortx.stanford.edu/getoken.php. Tokens are tied to your account and expire periodically; request a new one when the existing token has expired.

Step 3 — Run dicepro with CIBERSORTx

out <- dicepro(
  reference        = BlueCode,
  bulk             = CellMixtures,
  methodDeconv     = "CSx",
  cibersortx_email = "your@email.com",
  cibersortx_token = "your_token_here",
  bulkName         = "CellMixtures",
  refName          = "BlueCode",
  output_path      = tempdir()
)
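Before a CIBERSORTx run it can help to check that Docker is on the PATH and to read the credentials from environment variables instead of hard-coding them in a script. The sketch below uses base R only. The variable names `CIBERSORTX_EMAIL` and `CIBERSORTX_TOKEN` and both helper functions are hypothetical conventions; they are not part of dicepro.

```r
# Read credentials from environment variables (hypothetical names, e.g. set
# in ~/.Renviron); stop with a clear message if either one is missing.
get_cibersortx_credentials <- function(email_var = "CIBERSORTX_EMAIL",
                                       token_var = "CIBERSORTX_TOKEN") {
  email <- Sys.getenv(email_var, unset = "")
  token <- Sys.getenv(token_var, unset = "")
  if (!nzchar(email) || !nzchar(token)) {
    stop("Set ", email_var, " and ", token_var,
         " before using methodDeconv = \"CSx\".", call. = FALSE)
  }
  list(email = email, token = token)
}

# TRUE if a `docker` executable can be found on the PATH.
docker_available <- function() unname(nzchar(Sys.which("docker")))

# Possible usage with the call shown above:
# stopifnot(docker_available())
# creds <- get_cibersortx_credentials()
# out <- dicepro(..., methodDeconv = "CSx",
#                cibersortx_email = creds$email,
#                cibersortx_token = creds$token)
```

Keeping the token out of the script also makes it easier to swap in a new one when the old token expires.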

Other supported deconvolution backends can be listed with ?running_method.


Output Structure

dicepro() returns an S3 object of class "dicepro" with the following elements:

| Element | Description |
| --- | --- |
| $hyperparameters | Best $\lambda$ and $\gamma$ found by the search |
| $metrics | Loss and constraint value at the optimum |
| $trials | data.frame of all evaluated hyper-parameter configurations |
| $W | Optimized reference matrix (including unknown cell types) |
| $H | Estimated cell-type proportions (samples x cell types) |
| $plot | Pareto frontier plot |
| $plot_hyperopt | Hyper-parameter scatter matrix (ggplot2) |

Bundled Datasets

BlueCode

A gene x 34 cell-type reference signature matrix derived from sorted bulk RNA-seq profiles spanning five major tissue compartments: Immune (9), Stromal (8), Endothelial (3), Epithelial (5), and Muscle (9).

data(BlueCode)
dim(BlueCode)
colnames(BlueCode)

CellMixtures

A gene x 12 bulk RNA-seq matrix of experimentally constructed cell mixtures (samples A–L), paired with BlueCode for benchmarking.

data(CellMixtures)
dim(CellMixtures)
colnames(CellMixtures)

See ?BlueCode and ?CellMixtures for full documentation.


Vignettes

Two vignettes provide step-by-step walkthroughs:

vignette("vignette-simulation", package = "dicepro")
vignette("vignette-real-data",  package = "dicepro")

Citation

If you use dicepro in your research, please cite: When Less Is Not More: dicepro Mitigates the Impact of Incomplete Reference Matrices on Cellular Frequency Deconvolution.
Bioinformatics. doi:


License

The repository contains two license files: LICENSE (license type not detected) and LICENSE.md (MIT).
