SpaIM/data/Data_Processing_README.txt at main · QSong-github/SpaIM · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
This data processing workflow focuses on preprocessing single-cell RNA sequencing data and performing multi-resolution clustering analysis, with the following key steps:

1. Data Preparation: The workflow starts by importing essential libraries (Scanpy for single-cell analysis, NumPy, Pandas, etc.) and defining a function (`adata_to_cluster_expression`) for aggregating gene expression values by cell clusters. This function identifies unique cluster labels, creates a new AnnData object at the cluster level, and computes summed expression values for each gene within each cluster.

2. Data Loading: The single-cell RNA sequencing data (in `.h5ad` format) is loaded from the specified dataset directory (`dataset/Dataset1`).

3. Preprocessing Pipeline:
   - A copy of the original data is created for clustering analysis.
   - The data undergoes normalization to scale total gene expression per cell.
   - Log transformation is applied to stabilize variance.
   - Highly variable genes are identified and retained to focus on biologically meaningful signals.
   - Expression values are scaled to have zero mean and unit variance, with outliers capped at 10.

4. Dimensionality Reduction and Clustering:
   - Principal Component Analysis (PCA) is performed to reduce data dimensionality.
   - A neighborhood graph of cells is constructed based on PCA components to capture cell-cell relationships.
   - Leiden clustering is applied across multiple resolutions (0.05, 0.10, 0.30, 0.50, 0.70, 0.90) to generate varying numbers of cell clusters. Each resolution's cluster labels are stored in the original dataset with a resolution-specific column name.

5. Result Storage: The modified dataset, now containing all cluster labels from different resolutions, is written back to the original `.h5ad` file to preserve the processed information for subsequent analyses.

This structured workflow ensures consistent preprocessing and provides flexible clustering results to explore cell population structures at different granularities.