guomics-lab/GNHSF
# GNHSF: Large Scale Metaproteomics Reveals Key Microbial Functions in Metabolic Diseases and Aging

This repository contains the main analysis code used to generate figures in the GNHSF study.

## Code Description

### figure_R

This directory contains the R scripts used for generating the figures in the research paper.

- `Generate_matrix.R`: Prepares specific sample cohorts for downstream processing: N=1385 for cross-sectional analysis, N=954 for longitudinal analysis, and N=1039 for metagenomic integration.
- `arrange_link.R`: This script establishes the correspondence relationships among peptides, proteins, and taxa, generating the protein-taxa mapping files required by all subsequent analysis scripts.
- `Fig1D_FigS1ABC.R`: This script generates Figure 1D and Supplementary Figure 1. Note: Run `figs1_part_getBC.R` first to calculate Bray-Curtis distance matrices for all replicate types. Includes:
  - Fig 1D: Spearman correlations across all types of QCs
  - Fig S1A-B, Fig 1C: Correlation coefficients and Bray-Curtis distances for all replicates
  - Fig S1C: PCoA of all 2,512 samples
- `Fig1E_FigS1FG.R`: This script generates:
  - Fig 1E: Average identification counts per sample at each taxonomic level
  - Fig S1F: Distribution histogram of sample identification counts
  - Fig S1G: Breakdown of microbial proteins versus human proteins
- `Fig2.R`: This script processes and visualizes results from Generalized Linear Model (GLM) analysis. Note: Run `fig2_glm_1385.R` first to obtain complete GLM results. Includes:
  - Fig S3D: GLM associations grouped by clinical categories
  - Fig 2A: Summary of top 6 associations
  - Fig 2B-D: Heatmap visualization of the most significant associations at different taxonomic levels
- `Fig3.R`: This script analyzes metaproteomic features associated with aging. Note: Run `fig3_glmm_954.R` first to calculate within-subject associations using Generalized Linear Mixed Models (GLMM). Includes:
  - Fig 3A-F: Aging-associated metaproteomic features
- `Fig4.R`: This script identifies and visualizes metaproteomic features commonly associated with metabolic diseases. Includes:
  - Fig 4A-C: Shared metaproteomic signatures across metabolic diseases
- `Fig5.R`: This script performs medication-weighted GLM calculations and generates related visualizations. Includes:
  - Fig 5B-G: Medication-responsive metaproteomic features in metabolic diseases
  - Fig S7D: Medication-specific proteins and their corresponding species in T2D
- `Fig6_FigS6_FigS7.R`: This script analyzes and visualizes T2D-associated features. Note: Perform GLM analysis on the FH cohort and run the machine learning code to export ML-related features before executing the relevant sections. Includes:
  - Fig 6A: Network visualization of T2D-associated species
  - Fig 6C: Network visualization of T2D-associated metaproteomic features
  - Fig S6B: Comparison between metaproteomics and metagenomics data
  - Fig S6C: T2D-related species and their produced microbial protein groups
  - Fig S7C: GLM associations of M. elsdenii proteins with T2D and T2D medication
- `Fig7.R`: This script visualizes in vivo and in vitro biological validation data. Includes:
  - Fig 7: All panels for biological validation experiments
- `FigS1DE_mapping.R`: This script calculates the proportion of each sample annotated to taxa or functions for Fig S1D-E: Annotation coverage statistics.
- `FigS2_count.R`: This script generates Fig S2A-H: Count statistics of top features.
- `FigS2A-H.R`: This script generates all panels in Fig S2 A-H.
- `FigS2I-K.R`: This script generates all panels in Fig S2 I-K.
- `FigS3_FigS4.R`: This script calculates and visualizes all core features of the GNHSF metaproteomic dataset. Includes:
  - Fig S3 and Fig S4: All panels showing core metaproteomic features
- `FigS5.R`: This script performs Fig S5A: PERMANOVA analysis.

### ML_py

This directory contains the Python scripts used for machine learning in the research paper. Includes:

- `evaluate.py`: Calculate AUC and other metrics from predicted probabilities and true labels.
- `test_model_extra.py`: Evaluate model performance on the external test set.
- `test_model_inter.py`: Evaluate model performance on the internal test set.
- `train_model.py`: Train models on different proteomics datasets.
- `validation_analysis.py`: Model validation metric calculation script.
- `figures`: This directory contains the Python scripts and results of ROC-AUC and PR-AUC of machine learning models. Includes:
  - `auc_curves`: This directory contains the ROC-AUC curve plots of internal and external tests.
    - `external_roc.pdf`
    - `internal_roc.pdf`
  - `boxplot`: This directory contains the scripts and box plots of ROC-AUC and PR-AUC. Includes:
    - `seed_auc_boxplot_external.pdf`
    - `seed_auc_boxplot_internal.pdf`
    - `seed_prauc_boxplot_external.pdf`
    - `seed_prauc_boxplot_internal.pdf`
    - `plot_auc_boxplots.py`: Generate boxplots illustrating the distribution of ROC-AUC scores across 20 random seeds.
    - `plot_pr_auc_boxplots.py`: Create boxplots that depict the distribution of PR-AUC (Precision-Recall Area Under the Curve) scores across 20 random seeds.

## Usage Notes

- Some scripts require prerequisite scripts to be run first (as noted in the descriptions above).
- Ensure all dependency files and intermediate results are properly generated before running downstream analyses.
- Scripts are named according to their corresponding figures in the manuscript.
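The prerequisite notes in the descriptions above can be turned into a run order. The following is a minimal sketch, not part of the repository: it encodes only the three explicit "Run ... first" notes for `Fig1D_FigS1ABC.R`, `Fig2.R`, and `Fig3.R`. The names `PREREQUISITES` and `run_order` are hypothetical, and the script only prints the order rather than executing anything.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Prerequisites copied from the "Note: Run ... first" remarks in this README.
# Mapping: script -> set of scripts that must be run before it.
# Other dependencies (e.g. on Generate_matrix.R or arrange_link.R) are not
# encoded here and would need to be added by hand.
PREREQUISITES = {
    "Fig1D_FigS1ABC.R": {"figs1_part_getBC.R"},
    "Fig2.R": {"fig2_glm_1385.R"},
    "Fig3.R": {"fig3_glmm_954.R"},
}


def run_order(prerequisites):
    """Return the scripts in an order where each prerequisite precedes its dependent."""
    return list(TopologicalSorter(prerequisites).static_order())


if __name__ == "__main__":
    for script in run_order(PREREQUISITES):
        # Printing only. To execute instead, one could call
        # subprocess.run(["Rscript", script], check=True), assuming Rscript is
        # on PATH and the working directory is figure_R.
        print(script)
```

A topological sort is used so that further dependencies can be added to the mapping without reordering anything manually; it also raises an error if the mapping accidentally contains a cycle.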
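For readers unfamiliar with the ROC-AUC values reported by the `ML_py` scripts, the sketch below shows one way to compute ROC-AUC from true labels and predicted probabilities. It is an illustration only and is not taken from `evaluate.py`; the function name `roc_auc` and the example data are made up. It uses the pairwise definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a random positive outscores a random negative.

    y_true holds 0/1 labels; y_score holds predicted probabilities.
    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Made-up labels and probabilities, for illustration only.
    labels = [0, 0, 1, 1]
    probabilities = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, probabilities))  # 3 of 4 pairs are ranked correctly: 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for a small check; for real datasets a library routine such as scikit-learn's `roc_auc_score` is the usual choice.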