GEfetch2R - Access Single-cell/Bulk RNA-seq Data from Public Repositories and Benchmark the Subsequent Format Conversion Tools

Quick start

Introduction

GEfetch2R is designed to accelerate users' downloading and preparation of single-cell/bulk RNA-seq datasets from public resources. It can be used to:

For bulk RNA-seq:

Download raw data (sra/fastq/bam) from SRA/ENA with GEO accession:
- Parallel download sra/fastq/bam files from SRA/ENA
- Support Aspera (ENA)
- Parallel split sra to fastq files
- Convert bam to fastq files
- Mapping with STAR and load the output to DESeq2
Download count matrix from GEO with GEO accession:
- Generate count matrix from supplementary files
- Load the count matrix to DESeq2

For scRNA-seq:

Download raw data (sra/fastq/bam) from SRA/ENA with GEO accession:
- Parallel download sra/fastq/bam files from SRA/ENA
- Foramt downloaded fastq files to standard style that can be identified by 10x softwares (e.g. CellRanger).
- Support Aspera (ENA)
- Parallel split sra to fastq files
- Download original 10x generated bam files (with custom tags)
- Convert bam to fastq files (samtools/bamtofastq_linux)
- Mapping with STAR/CellRanger and load the output to Seurat
Download count matrix:
- Download count matrix from GEO with GEO accession
- Download count matrix and annotation (e.g. cell type) information from PanglaoDB and UCSC Cell Browser with key words (filter criteria)
- Extract subset with annotation and gene (PanglaoDB and UCSC Cell Browser)
- Load the count matrix and annotation to Seurat
Download processed object:
- Download processed object from GEO with GEO accession
- Download processed object from Zenodo with DOI
- Download processed object from CELLxGENE and Human Cell Atlas with key words (filter criteria)
- Parallel download
- Extract subset with annotation and gene (CELLxGENE)
- Load the processed object (rds/rdata) to Seurat
- Extract count matrix and metadata from processed object (rds/rdata), including Seurat, seurat, SingleCellExperiment, cell_data_set, CellDataSet, DESeqDataSet, DGEList.
Format conversion and benchmark related tools:
- Format conversion between widely used single cell objects (SeuratObject, AnnData, SingleCellExperiment, CellDataSet/cell_data_set and loom)
- Benchmark the tools (more than two tools available for the conversion):
  - SeuratObject to AnnData: SeuratDisk, sceasy, scDIOR
  - SingleCellExperiment to AnnData: sceasy, scDIOR, zellkonverter
  - AnnData to SeuratObject: SeuratDisk, sceasy, scDIOR, schard
  - AnnData to SingleCellExperiment: schard, scDIOR, zellkonverter

Framework

Installation

Manual installation

To install GEfetch2R, start R and enter:

# install from GitHub
# install.packages("devtools")
devtools::install_github("showteeth/GEfetch2R")

There are some conditionally used R packages:

# install.packages("devtools") #In case you have not installed it.
devtools::install_github("alexvpickering/GEOfastq") # download fastq
install.packages('tiledbsoma', repos = c('https://tiledb-inc.r-universe.dev', 'https://cloud.r-project.org')) # download from CELLxGENE
install.packages('cellxgene.census', repos=c('https://chanzuckerberg.r-universe.dev', 'https://cloud.r-project.org')) # download from CELLxGENE
devtools::install_github("cellgeni/sceasy") # format conversion
devtools::install_github("mojaveazure/seurat-disk") # format conversion
devtools::install_github("satijalab/seurat-wrappers") # format conversion
devtools::install_github('theislab/zellkonverter@7b118653a471330b3734dcfee60c3537352ecb8d', upgrade = 'never') # format conversion
devtools::install_github('cellgeni/schard', upgrade = 'never') # format conversion
devtools::install_github('JiekaiLab/dior', upgrade = 'never')  # format conversion

For possible issues about installation, please refer INSTALL.md.

For format conversion and downloading fastq/bam files, GEfetch2R requires additional tools, you can install with:

# install additional packages for format conversion
pip install diopy
conda install -c bioconda loompy anndata 
# or
pip install anndata loompy

# install additional packages for downloading fastq/bam files
conda install -c bioconda 'parallel-fastq-dump' 'sra-tools>=3.0.0'

# install bamtofastq, the following installs linux version
wget --quiet https://github.com/10XGenomics/bamtofastq/releases/download/v1.4.1/bamtofastq_linux && chmod +x bamtofastq_linux

# install ascp
conda install -c hcc aspera-cli -y
# ascp path: ~/miniconda3/bin/ascp (path/bin/ascp)
# private-key file : ~/miniconda3/etc/asperaweb_id_dsa.openssh (path/etc/asperaweb_id_dsa.openssh)

Docker image

We also provide a docker image to use:

# pull the image
docker pull soyabean/gefetch2r:1.2

# run the image
docker run --rm -p 8888:8787 -e PASSWORD=passwd -e ROOT=TRUE -it soyabean/gefetch2r:1.2

Notes:

After running the above codes, open browser and enter http://localhost:8888/, the user name is rstudio, the password is passwd (set by -e PASSWORD=passwd)
If port 8888 is in use, change -p 8888:8787
The conda.path in ExportSeurat and ImportSeurat can be set /opt/conda.
The sra-tools can be found in /opt/sratoolkit.3.0.6-ubuntu64/bin.
The parallel-fastq-dump path: /opt/conda/bin/parallel-fastq-dump.
The bamtofastq_linux path: /opt/bamtofastq_linux.
The samtools path: /opt/conda/bin/samtools.
The STAR and Cell Ranger is not available in the image because customized reference genome is required.

Codes used to test the usability of the Docker image: Docker_test.R

Vignette

Detailed usage is available in website:

Comparison with similar tools

Quick start

Check API

Check the availability of APIs used:

# check all databases: "SRA/ENA", "GEO", "PanglaoDB", "UCSC Cell Browser", "Zenodo", "CELLxGENE", "Human Cell Atlas"
CheckAPI()

# for a given database
CheckAPI(database = "GEO")

SRA/ENA (sra/fastq/bam)

Extract all runs, automatically identify the RNA-seq type (10x Genomics scRNA-seq, bulk RNA-seq, Smart-seq2 scRNA-seq/mini-bulk RNA-seq) of each sample, download fastq files from ENA, perform read mapping, and load the results to R:

# mixture of 10x Genomics scRNA-seq and bulk RNA-seq
GSE305141.list <- DownloadFastq2R(
  acce = "GSE305141", skip.gsm = c("GSM9162729", "GSM9162725"),
  star.ref = "/path/to/star/ref", cellranger.ref = "/path/to/cellranger/ref",
  star.path = "/path/to/STAR", cellranger.path = "/path/to/cellranger"
)

# given GSM number
GSE127942.list <- DownloadFastq2R(
  gsm = c("GSM3656922", "GSM3656923"), star.ref = "/path/to/ref",
  star.path = "/path/to/STAR"
)

Key parameters:

acce: the GEO accession.
skip.gsm: vector of GSM numbers to skip.
gsm: the GSM accession (given sample).
star.ref: path to STAR reference. Used when bulk/Smart-seq2 RNA-seq samples exist.
star.path: path to STAR, can be detected automatically by Sys.which("STAR"). Used when bulk/Smart-seq2 RNA-seq samples exist.
cellranger.ref: path to cellranger reference. Used when 10x Genomics scRNA-seq samples exist.
cellranger.path: path to cellranger, can be detected automatically by Sys.which("cellranger"). Used when 10x Genomics scRNA-seq samples exist.
out.folder: the output folder, current working directory by default.
count.col: column contains used count data (2: unstranded (default value); 3: stranded=yes / 1st read strand; 4: stranded=reverse/2nd read strand), use when bulk RNA-seq or Smart-seq2 scRNA-seq/mini-bulk RNA-seq.

Note

The way GEfetch2R automatically identify the RNA-seq type can be find here. Besides, users can specify the RNA-seq type via force.type, choose one from "10x", "Smart-seq2", "bulk".

The above code equals the following:

# extract all runs
GSE127942.runs <- ExtractRun(acce = "GSE127942")
GSE127942.runs <- GSE127942.runs[GSE127942.runs$gsm_name %in% c("GSM3656922", "GSM3656923"), ]
# download fastq from ENA
GSE127942.down <- DownloadFastq(
  gsm.df = GSE127942.runs, out.folder = "/path/to/fastq_out",
  download.method = "wget", # available method: "download.file", "ascp", "wget"
  parallel = FALSE, format.10x = FALSE # 10x-specific
)
# read mapping, load to R
GSE127942.gsms <- file.path("/path/to/fastq_out", c("GSM3656922", "GSM3656923"))
GSE127942.obj <- Fastq2R(
  sample.dir = GSE127942.gsms,
  method = "STAR", st.path = "/path/to/STAR", ref = "/path/to/STAR.reference", # STAR reference
  out.folder = "/path/to/mapping_out", count.col = 2, # strand-specific (2: unstranded; 3: stranded=yes; 4: stranded=reverse)
  localcores = 4
)
# GSE127942.obj is a DESeqDataSet object

GEO (count matrix, metadata, processed object)

Count matrix (bulk RNA-seq/Smart-seq2)

Supplementary file in csv(.gz)/tsv(.gz)/txt(.gz)/tab(.gz)/xlsx(.gz)/xls(.gz) format or tar(.gz) format (contain csv(.gz)/tsv(.gz)/txt(.gz)/tab(.gz)/xlsx(.gz)/xls(.gz) files):

# return SeuratObject
GSE297431.seu <- ParseGEO(
  acce = "GSE297431",
  supp.idx = 1, # specify the index of used supplementary file
  down.supp = TRUE, supp.type = "count",
  data.type = "sc", # scRNA-seq, Smart-seq2 here
  load2R = TRUE, merge = TRUE
)

Key parameters:

down.supp = TRUE: generate count matrix from supplementary file
supp.idx = 1: the first supplementary file containing the count matrix
supp.type = "count": the file containing count matrix is in csv(.gz)/tsv(.gz)/txt(.gz)/tab(.gz)/xlsx(.gz)/xls(.gz) format
data.type = "sc": the data type of the dataset, choose from "sc" (single-cell) and "bulk" (bulk)
load2R = F: return count matrix; load2R = T and data.type = "sc", return SeuratObject; load2R = T and data.type = "bulk", return DESeqDataSet

Note

ParseGEO is compatible with count matrices generated by htseq-count, featureCounts (contain extra columns: "Chr", "Start", "End", "Strand", "Length", e.g. GSE182219), and STAR with --quantMode (contain three count columns: unstranded, 1st read strand, 2nd read strand, e.g. GSE195839).
ParseGEO is also compatible with count matrix files containing irregular extra columns, e.g. GSE268038 contains "chromosome", "start", "end", "strand". Users can specify the extra columns to ignore by setting extra.cols (default: "chr", "start", "end", "strand", "length", "width", "chromosome", "seqnames", "seqname", "chrom", "chromosome_name", "seqid", "stop")
In general, the rows of the count matrix represent genes, and the columns represent samples. ParseGEO can deal with the transposed count matrix (the number of rows is less than the number of columns) by setting transpose to TRUE.

Count matrix (scRNA-seq)

Supplementary files in h5(.gz) format or composed of barcodes.tsv(.gz)/genes.tsv(.gz), matrix.mtx(.gz), features.tsv(.gz) files (set supp.type = "10xSingle"):

GSE278892.seu <- ParseGEO(
  acce = "GSE278892", down.supp = TRUE,
  supp.type = "10xSingle", timeout = 36000,
  out.folder = "/path/to/store/count_matrix"
)

Key parameters:

down.supp = TRUE: generate count matrix from supplementary file
supp.type = "10xSingle": the file containing count matrix is in separate files (barcodes.tsv(.gz)/genes.tsv(.gz), matrix.mtx(.gz), features.tsv(.gz)) or h5(.gz) file(s)
load2R = F: return count matrix; load2R = T and data.type = "sc", return SeuratObject

Supplementary file in tar(.gz) format (set supp.type = "10x"):

GSE292908.seu <- ParseGEO(
  acce = "GSE292908", down.supp = TRUE,
  supp.type = "10x", timeout = 36000,
  supp.idx = 1, # specify the index of used supplementary file
  out.folder = "/path/to/store/count_matrix"
)

Key parameters:

down.supp = TRUE: generate count matrix from supplementary file
supp.idx = 1: the first supplementary file containing the count matrix
supp.type = "10x": the file containing count matrix is in tar(.gz) format. The files in tar(.gz) can be in zip, tar(.gz), h5(.gz) format, or separate files (barcodes.tsv(.gz)/genes.tsv(.gz), matrix.mtx(.gz), features.tsv(.gz)).
load2R = F: return count matrix; load2R = T and data.type = "sc", return SeuratObject

Note

ParseGEO can also load count matrices generated by scRNA-seq platforms other than 10x Genomics, which have a similar output structure to 10x (CellRanger), e.g., SeekOne and MobiDrop and DNBelab C4
ParseGEO can handle files with very deep hierarchical structures, e.g. Compressed files (zip, tar.gz, tar) in downloaded supplemental files (GEO, 10x)
ParseGEO can identify sample name before or after the fixed name, e.g. Sample name is after the fixed name (GEO, 10x)

Metadata

The sample metadata can be obtained in two ways:

user-provided sample metadata when uploading to GEO (applicable to all GEO accessions):

# set VROOM_CONNECTION_SIZE to avoid error: Error: The size of the connection buffer (786432) was not large enough
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 60)
# extract metadata
GSE297431.meta <- ExtractGEOMeta(acce = "GSE297431")

metadata in supplementary file:

GSE297431.meta.supp <- ExtractGEOMeta(
  acce = "GSE297431", down.supp = TRUE,
  supp.idx = 2 # specify the index of used supplementary file
)

Processed object

Supplementary file in rdata(.gz)/rds(.gz)/h5ad(.gz)/loom(.gz) format or tar(.gz) format (contain rdata(.gz)/rds(.gz)/h5ad(.gz)/loom(.gz) files):

# return SeuratObject (rds)
GSE285723.seu <- ParseGEOProcessed(
  acce = "GSE285723", supp.idx = 1,
  file.ext = c("rdata", "rds"), return.seu = T, timeout = 36000000,
  out.folder = "/path/to/outfoder"
)

Key parameters:

supp.idx = 1: the first supplementary file containing the processed object.
file.ext = c("rdata", "rds"): download/keep files in rdata(.gz) and rds(.gz) formats (case-insensitive).
return.seu = T: load downloaded objects to Seurat.

Dissect and extract the RData files:

# download the object
ParseGEOProcessed(acce = "GSE244572", timeout = 360000, supp.idx = 1, file.ext = c("rdata", "rds", "h5ad", "loom"))
# process the object
GSE244572.list <- LoadRData(
  rdata = "GSE244572/GSE244572_RPE_CITESeq.RData",
  accept.fmt = c("Seurat", "seurat", "SingleCellExperiment", "cell_data_set", "CellDataSet", "DESeqDataSet", "DGEList"),
  slot = "counts", return.obj = TRUE
)

Key parameters:

accept.fmt: vector, the format of objects for dissecting and extracting.
slot: vector, the type of count matrix to pull. 'counts': raw, un-normalized counts, 'data': normalized data, scale.data: z-scored/variance-stabilized data.
return.obj: logical value, whether to load the available objects in accept.fmt to global environment.

Note

The way GEfetch2R dissect and extract the RData files can be find here.

PanglaoDB (count matrix, cell type composition)

Given dataset

# extract cell type composition 
lung.composition <- ExtractPanglaoDBComposition(sra = "SRA570744")

# extract count matrix and load to Seurat
lung.seu <- ParsePanglaoDB(sra = "SRA570744", srs = "SRS2253536")

Filter samples based on metadata

# summarise attributes
StatDBAttribute(df = PanglaoDBMeta, filter = c("species", "protocol"), database = "PanglaoDB")

# filter metadata
hsa.meta <- ExtractPanglaoDBMeta(species = "Homo sapiens", protocol = c("Smart-seq2", "10x chromium"),
                                 show.cell.type = TRUE, cell.num = c(1000, 2000))

# extract cell type composition
hsa.composition <- ExtractPanglaoDBComposition(meta = hsa.meta)

# download matrix and load to Seurat, small test
hsa.seu <- ParsePanglaoDB(hsa.meta[1:3,], merge = TRUE)

UCSC Cell Browser (count matrix, cell type composition)

Given dataset

# extract cell type composition 
ut.sample.ct <- ExtractCBComposition(link = c(
  "https://cells.ucsc.edu/?ds=adult-ureter", # collection
  "https://cells.ucsc.edu/?ds=adult-testis" # dataset
))

# extract count matrix and load to Seurat
ut.seu <- ParseCBDatasets(link = c(
  "https://cells.ucsc.edu/?ds=adult-ureter", # collection
  "https://cells.ucsc.edu/?ds=adult-testis" # dataset
), merge = TRUE)

merge = TRUE: whether to merge Seurat list.

Filter samples based on metadata

# first-time run, get all samples and store json to json.folder
ucsc.cb.samples = ShowCBDatasets(lazy = TRUE, json.folder = "/path/to/json", update = TRUE)
# second-time run, use stored json
# ucsc.cb.samples = ShowCBDatasets(lazy = TRUE, json.folder = "/path/to/json", update = FALSE)

# summarise attributes
StatDBAttribute(
  df = ucsc.cb.samples, filter = c("organism", "organ"),
  database = "UCSC", combine = TRUE
)

# filter metadata
hbb.sample.df <- ExtractCBDatasets(
  all.samples.df = ucsc.cb.samples, organ = c("skeletal muscle"),
  organism = "Human (H. sapiens)", cell.num = c(1000, 2000)
)

# extract cell type
hbb.sample.ct <- ExtractCBComposition(
  json.folder = "/path/to/json",
  meta = hbb.sample.df
)

# parse the whole datasets
hbb.sample.seu <- ParseCBDatasets(meta = hbb.sample.df)
# subset metadata and gene
hbb.sample.seu <- ParseCBDatasets(
  meta = hbb.sample.df, obs.value.filter = "Cell.Type == 'MP' & Phase == 'G2M'",
  include.genes = c(
    "PAX7", "MYF5", "C1QTNF3", "MYOD1", "MYOG", "RASSF4", "MYH3", "MYL4",
    "TNNT3", "PDGFRA", "OGN", "COL3A1"
  )
)

Zenodo (processed object)

# extract metadata
multi.dois <- ExtractZenodoMeta(doi = c("1111", "10.5281/zenodo.7243603", "10.5281/zenodo.7244441"))

# download objects
multi.dois.parse <- ParseZenodo(
  doi = c("1111", "10.5281/zenodo.7243603", "10.5281/zenodo.7244441"),
  file.ext = c("rdata"), timeout = 36000000,
  out.folder = "/path/to/download_zenodo"
)

# return SeuratObject
sinle.doi.parse.seu <- ParseZenodo(
  doi = "10.5281/zenodo.8011282",
  file.ext = c("rds"), return.seu = TRUE, timeout = 36000000,
  out.folder = "/path/to/download_zenodo"
)

return.seu = T: load downloaded objects to Seurat.

dissect and extract the RData files

CELLxGENE (processed object)

Given dataset

CELLxGENE does not support downloading SeuratObject in versions after 2025. The following code can only download h5ad files.

# download h5ad files
cellxgene.given.h5ad <- ParseCELLxGENE(
  link = c(
    "https://cellxgene.cziscience.com/collections/77f9d7e9-5675-49c3-abed-ce02f39eef1b", # collection
    "https://cellxgene.cziscience.com/e/e12eb8a9-5e8b-4b59-90c8-77d29a811c00.cxg/" # dataset
  ),
  timeout = 36000000,
  out.folder = "/path/to/download_cellxgene"
)

Filter samples based on metadata

We have downloaded all the CELLxGENE datasets in May 2025 and stored in all.cellxgene.datasets.rds. The all.cellxgene.datasets.rds contains the SeuratObject.

# all available datasets
all.cellxgene.datasets <- ShowCELLxGENEDatasets()

# the datasets with SeuratObject
# wget https://github.com/showteeth/GEfetch2R/raw/ff2f19f3b557f90fce5f8bf2f8662cebdfd04298/man/benchmark/all.cellxgene.datasets.rds
all.cellxgene.datasets <- readRDS("all.cellxgene.datasets.rds")

# summarise attributes
StatDBAttribute(
  df = all.cellxgene.datasets, filter = c("organism", "sex", "disease"),
  database = "CELLxGENE", combine = TRUE
)
# use cellxgene.census
# StatDBAttribute(filter = c("disease", "tissue", "cell_type"), database = "CELLxGENE", use.census = TRUE, organism = "homo_sapiens")

# human 10x v2 and v3 datasets
human.10x.cellxgene.meta <- ExtractCELLxGENEMeta(
  all.samples.df = all.cellxgene.datasets,
  assay = c("10x 3' v2", "10x 3' v3"), organism = "Homo sapiens"
)

# subset
cellxgene.down.meta <- human.10x.cellxgene.meta[human.10x.cellxgene.meta$cell_type == "oligodendrocyte" &
                                                  human.10x.cellxgene.meta$tissue == "entorhinal cortex", ]

# download objects
cellxgene.down <- ParseCELLxGENE(
  meta = cellxgene.down.meta, file.ext = "rds", timeout = 36000000,
  out.folder = "/path/to/download_cellxgene"
)

# retuen SeuratObject
cellxgene.down.seu <- ParseCELLxGENE(
  meta = cellxgene.down.meta, file.ext = "rds", return.seu = TRUE, timeout = 36000000,
  obs.value.filter = "cell_type == 'oligodendrocyte' & disease == 'Alzheimer disease'",
  obs.keys = c("cell_type", "disease", "sex", "suspension_type", "development_stage"),
  out.folder = "/path/to/download_cellxgene"
)

Human Cell Atlas (processed object)

Given dataset

# download objects
hca.given.download <- ParseHCA(
  link = c(
    "https://explore.data.humancellatlas.org/projects/902dc043-7091-445c-9442-d72e163b9879",
    "https://explore.data.humancellatlas.org/projects/cdabcf0b-7602-4abf-9afb-3b410e545703"
  ), timeout = 36000000, 
  out.folder = "/path/to/download_hca"
)

dissect and extract the RData files

Filter samples based on metadata

# all available datasets
all.hca.projects <- ShowHCAProjects()

# summarise attributes
StatDBAttribute(df = all.hca.projects, filter = c("organism", "sex"), database = "HCA")

# filter metadata
hca.human.10x.projects <- ExtractHCAMeta(
  all.projects.df = all.hca.projects, organism = "Homo sapiens",
  protocol = c("10x 3' v2", "10x 3' v3")
)

# small test
hca.human.10x.down = ParseHCA(meta = hca.human.10x.projects[1:3,], 
                              out.folder = "/path/to/download_hca", 
                              file.ext = c("h5ad", "rds"), timeout = 36000000)

dissect and extract the RData files

Format conversion

There are many tools have been developed to process scRNA-seq data, such as Scanpy, Seurat, scran and Monocle. These tools have their own object formats, such as Anndata of Scanpy, SeuratObject of Seurat, SingleCellExperiment of scran and CellDataSet/cell_data_set of Monocle2/Monocle3. There are also some file formats designed for large omics datasets, such as loom. To perform a comprehensive scRNA-seq data analysis, we usually need to combine multiple tools, which means we need to perform object format conversion frequently. To facilitate user analysis of scRNA-seq data, GEfetch2R benchmarked the format conversion tools (Anndata -> SeuratObject, SeuratObject to Anndata, Anndata to SingleCellExperiment, SingleCellExperiment to Anndata), and provides multiple functions to perform format conversion.

Test data

# library
library(Seurat) # pbmc_small
library(scRNAseq) # seger

SeuratObject:

# object
pbmc_small

SingleCellExperiment:

seger <- scRNAseq::SegerstolpePancreasData()

Convert SeuratObject to other objects

Here, we will convert SeuratObject to SingleCellExperiment, CellDataSet/cell_data_set, Anndata, loom.

SeuratObject to SingleCellExperiment

The conversion is performed with functions implemented in Seurat:

sce.obj <- ExportSeurat(seu.obj = pbmc_small, assay = "RNA", to = "SCE")

SeuratObject to CellDataSet/cell_data_set

To CellDataSet (The conversion is performed with functions implemented in Seurat):

# BiocManager::install("monocle") # reuqire monocle
cds.obj <- ExportSeurat(seu.obj = pbmc_small, assay = "RNA", reduction = "tsne", to = "CellDataSet")

To cell_data_set (The conversion is performed with functions implemented in SeuratWrappers):

# remotes::install_github('cole-trapnell-lab/monocle3') # reuqire monocle3
cds3.obj <- ExportSeurat(seu.obj = pbmc_small, assay = "RNA", to = "cell_data_set")

SeuratObject to AnnData

There are multiple tools available for format conversion from SeuratObject to Anndata:

scDIOR is the best method in terms of information kept and usability
sceasy has best performance in running time and disk usage.

# SeuratDisk
Seu2AD(seu.obj = pbmc_small, method = "SeuratDisk", out.folder = "out.folder", 
       assay = "RNA", save.scale = TRUE)
# sceasy
Seu2AD(seu.obj = pbmc_small, method = "sceasy", out.folder = "out.folder", 
       assay = "RNA", slot = "counts", conda.path = "/path/to/conda")
# scDIOR
Seu2AD(seu.obj = pbmc_small, method = "scDIOR", 
       out.folder = "out.folder", assay = "RNA", save.scale = TRUE)

SeuratObject to loom

The conversion is performed with functions implemented in SeuratDisk:

loom.file <- tempfile(pattern = "pbmc_small_", fileext = ".loom")
ExportSeurat(
  seu.obj = pbmc_small, assay = "RNA", to = "loom",
  loom.file = loom.file
)

Convert other objects to SeuratObject

SingleCellExperiment to SeuratObject

The conversion is performed with functions implemented in Seurat:

seu.obj.sce <- ImportSeurat(obj = sce.obj, from = "SCE", 
                            count.assay = "counts", data.assay = "logcounts", 
                            assay = "RNA")

CellDataSet/cell_data_set to SeuratObject

CellDataSet to SeuratObject (The conversion is performed with functions implemented in Seurat):

seu.obj.cds <- ImportSeurat(obj = cds.obj, from = "CellDataSet", 
                            count.assay = "counts", assay = "RNA")

cell_data_set to SeuratObject (The conversion is performed with functions implemented in Seurat):

seu.obj.cds3 <- ImportSeurat(obj = cds3.obj, from = "cell_data_set", 
                             count.assay = "counts", data.assay = "logcounts",
                             assay = "RNA")

AnnData to SeuratObject

There are multiple tools available for format conversion from AnnData to SeuratObject:

scDIOR is the best method in terms of information kept (GEfetch2R integrates scDIOR and SeuratDisk to achieve the best performance in information kept)
schard is the best method in terms of usability
schard and sceasy have comparable performance when cell number below 200k, but sceasy has better performance in scalability
sceasy has better performance in disk usage

# SeuratDisk
ann.seu <- AD2Seu(anndata.file = "pbmc3k.h5ad", method = "SeuratDisk", 
                  assay = "RNA", load.assays = c("RNA"))
# sceasy
ann.sceasy <- AD2Seu(anndata.file = "pbmc3k.h5ad", method = "sceasy",
                     assay = "RNA", slot = "scale.data")
# scDIOR
ann.scdior <- AD2Seu(anndata.file = "pbmc3k.h5ad", method = "scDIOR", 
                     assay = "RNA")
# schard
ann.schard <- AD2Seu(anndata.file = "pbmc3k.h5ad", 
                     method = "schard", assay = "RNA", use.raw = T)
# SeuratDisk+scDIOR
ann.seuscdior <- AD2Seu(anndata.file = "pbmc3k.h5ad", method = "SeuratDisk+scDIOR", 
                        assay = "RNA", load.assays = c("RNA"))

loom to SeuratObject

The conversion is performed with functions implemented in SeuratDisk and Seurat:

# loom will lose reduction
seu.obj.loom <- ImportSeurat(loom.file = loom.file, from = "loom")

Conversion between SingleCellExperiment and AnnData

SingleCellExperiment to AnnData

There are multiple tools available for format conversion from SingleCellExperiment to AnnData:

zellkonverter is the best method in terms of information kept and running time
scDIOR is the best method in terms of usability and disk usage

# sceasy
SCE2AD(sce.obj = seger, method = "sceasy", out.folder = "benchmark", 
       slot = "rawcounts", conda.path = "/path/to/conda")
       
# scDIOR
seger.scdior <- seger
library(SingleCellExperiment)
# scDIOR does not support varm in rowData
rowData(seger.scdior)$varm <- NULL
SCE2AD(sce.obj = seger.scdior, method = "scDIOR", out.folder = "benchmark")

# zellkonverter
SCE2AD(sce.obj = seger, method = "zellkonverter", 
       out.folder = "benchmark", slot = "rawcounts", 
       conda.path = "/path/to/conda")

AnnData to SingleCellExperiment

There are multiple tools available for format conversion from AnnData to SingleCellExperiment:

zellkonverter is the best method in terms of information kept
schard is the best method in terms of usability and running time
schard and scDIOR have comparable performance in disk usage

# scDIOR
sce.scdior <- AD2SCE(anndata.file = "pbmc3k.h5ad", method = "scDIOR", 
                     assay = "RNA", use.raw = TRUE, conda.path = "/path/to/conda")
# zellkonverter
sce.zell <- AD2SCE(anndata.file = "pbmc3k.h5ad", method = "zellkonverter",
                   slot = "scale.data", use.raw = TRUE, conda.path = "/path/to/conda")
# schard
sce.schard <- AD2SCE(anndata.file = "pbmc3k.h5ad", 
                     method = "schard", use.raw = TRUE)

Conversion between SingleCellExperiment and loom

The conversion is performed with functions implemented in LoomExperiment.

SingleCellExperiment to loom

# remove seger.loom first
seger.loom.file <- tempfile(pattern = "seger_", fileext = ".loom")
SCELoom(
  from = "SingleCellExperiment", to = "loom", sce = seger,
  loom.file = seger.loom.file
)

loom to SingleCellExperiment

seger.loom <- SCELoom(
  from = "loom", to = "SingleCellExperiment",
  loom.file = seger.loom.file
)

Contact

For any question, feature request or bug report please write an email to songyb0519@gmail.com.

Code of Conduct

Please note that the GEfetch2R project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Name		Name	Last commit message	Last commit date
Latest commit History 157 Commits
Docker		Docker
R		R
data		data
docs		docs
man		man
vignettes		vignettes
.DS_Store		.DS_Store
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
DESCRIPTION		DESCRIPTION
GEfetch2R.Rproj		GEfetch2R.Rproj
INSTALL.md		INSTALL.md
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.md		README.md
_pkgdown.yml		_pkgdown.yml
boot.memprof		boot.memprof

License

showteeth/GEfetch2R

Folders and files

Latest commit

History

Repository files navigation

GEfetch2R - Access Single-cell/Bulk RNA-seq Data from Public Repositories and Benchmark the Subsequent Format Conversion Tools

Quick start

Introduction

Framework

Installation

Manual installation

Docker image

Vignette

Comparison with similar tools

Quick start

Check API

SRA/ENA (sra/fastq/bam)

GEO (count matrix, metadata, processed object)

Count matrix (bulk RNA-seq/Smart-seq2)

Count matrix (scRNA-seq)

Metadata

Processed object

PanglaoDB (count matrix, cell type composition)

Given dataset

Filter samples based on metadata

UCSC Cell Browser (count matrix, cell type composition)

Given dataset

Filter samples based on metadata

Zenodo (processed object)

CELLxGENE (processed object)

Given dataset

Filter samples based on metadata

Human Cell Atlas (processed object)

Given dataset

Filter samples based on metadata

Format conversion

Test data

Convert SeuratObject to other objects

SeuratObject to SingleCellExperiment

SeuratObject to CellDataSet/cell_data_set

SeuratObject to AnnData

SeuratObject to loom

Convert other objects to SeuratObject

SingleCellExperiment to SeuratObject

CellDataSet/cell_data_set to SeuratObject

AnnData to SeuratObject

loom to SeuratObject

Conversion between SingleCellExperiment and AnnData

SingleCellExperiment to AnnData

AnnData to SingleCellExperiment

Conversion between SingleCellExperiment and loom

SingleCellExperiment to loom

loom to SingleCellExperiment

Contact

Code of Conduct

About

Topics

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages