Feat: Add machine learning module in TCMDATA for key targets screening #15

Merged
GuangchuangYu merged 13 commits into YuLab-SMU:master from Hinna0818:ml
Mar 23, 2026

@Hinna0818
Contributor

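This PR adds a machine-learning module for key target screening to TCMDATA. It sits downstream of network pharmacology analysis: starting from a candidate gene set produced by PPI / WGCNA, it uses RNA-seq expression profiles to screen for key genes. Based on what the literature actually uses, this PR includes the most widely applied models: elastic net, random forest with Boruta, SVM-RFE, and XGBoost. The usual workflow is to run these models independently, perform feature selection with each, and keep the feature genes they identify in common (by intersection) for downstream validation.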

What's new

Core functions: R/ml_models.r, R/ml_utils.R

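  • prepare_ml_data(): data preprocessing, with three validation modes:
    • Mode A: cross-validation on the full dataset (no hold-out set; commonly used for feature selection. Performance is estimated from, for example, random forest out-of-bag (OOB) predictions or out-of-fold CV predictions, so there is no data leakage)
    • Mode B: random internal train/test split
    • Mode C: external independent validation set (for example, when the user wants to test on a separate, independent dataset to evaluate how well the selected features perform)
  • Six models: ml_lasso(), ml_enet(), ml_ridge(), ml_rf(), ml_svm_rfe(), ml_xgboost()
  • run_ml_screening(): runs several methods in one batch
  • get_ml_consensus(): extracts the genes selected by ≥ k models
  • select_features(): post hoc trimming for models that retain all features, such as Ridge and XGBoost (the cutoff is a subjective choice)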
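
To make the "selected by ≥ k models" rule concrete, here is a minimal base R sketch of the voting logic. It is not the package's get_ml_consensus() implementation: the helper name consensus_genes(), its arguments, and the gene names are all assumptions made for illustration.

```r
# Hypothetical helper illustrating ">= k models" voting.
# NOT the package's get_ml_consensus(); names and arguments are assumed.
consensus_genes <- function(selected, k = 2) {
  stopifnot(is.list(selected), k >= 1)
  # Count each gene at most once per model, then tally across models.
  votes <- table(unlist(lapply(selected, unique)))
  sort(names(votes)[votes >= k])
}

# Toy per-model selections; gene names are placeholders only.
selected <- list(
  enet    = c("IL6", "TNF", "STAT3"),
  rf      = c("IL6", "STAT3", "AKT1"),
  svm_rfe = c("IL6", "TNF"),
  xgboost = c("IL6", "EGFR")
)

consensus_k2  <- consensus_genes(selected, k = 2)                 # at least 2 models
consensus_all <- consensus_genes(selected, k = length(selected))  # strict intersection
```

Setting k to the number of models reproduces the strict intersection described in the PR summary; a smaller k trades stringency for a larger candidate set.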

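Visualization: R/ml_plots.R

  • CV curves, coefficient path plots, and importance bar charts for each model
  • plot_ml_roc(): overlaid ROC curves for multiple methods (supported in all three modes)
  • plot_ml_venn(): Venn diagram of the gene sets selected by each method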

Data and documentation

  • covid19 dataset (GSE157103, 100 samples × 5000 genes), used for examples and tests
  • bookdown chapter 08-Machine-Learning-analysis.Rmd, with complete usage examples covering all three modes

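Design notes

  • All model-building functions return the same S3 object (tcm_ml) with a consistent access interface ($selected_features, $importance, $cv_performance, $test_performance).
  • All hyperparameter tuning is done by cross-validation inside the training set; the test set takes no part in model selection.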
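
As a rough illustration of why a shared S3 class helps, the sketch below builds two toy objects with the slots listed above and combines them with generic list code. The constructor new_tcm_ml_sketch() is hypothetical, and the package's real internals may well differ; only the class name and slot names come from this PR.

```r
# Hypothetical constructor; only the class name "tcm_ml" and the slot names
# are taken from the PR description. Everything else is assumed.
new_tcm_ml_sketch <- function(method, selected_features, importance,
                              cv_performance = NULL, test_performance = NULL) {
  structure(
    list(
      method            = method,
      selected_features = selected_features,
      importance        = importance,
      cv_performance    = cv_performance,
      test_performance  = test_performance  # left NULL in this toy example
    ),
    class = "tcm_ml"
  )
}

# Toy fits; values are made up for illustration.
fit_a <- new_tcm_ml_sketch("lasso", c("IL6", "TNF"), c(IL6 = 0.8, TNF = 0.3))
fit_b <- new_tcm_ml_sketch("rf", "IL6", c(IL6 = 12.1))

# Because every object exposes the same slot, downstream code stays generic.
fits <- list(fit_a, fit_b)
stopifnot(all(vapply(fits, inherits, logical(1), what = "tcm_ml")))
shared <- Reduce(intersect, lapply(fits, `[[`, "selected_features"))
```

The same pattern is what lets plotting and consensus helpers accept any mix of models without method-specific branches.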

For details, see the 8-ML section of the documentation: 8-ML

- ggdot_sankey(): new gene_fontface param (default 'italic') for Gene axis
- tcm_sankey(): new target_fontface param (default 'italic') for target axis
- R CMD check: Status OK
- Default normalize=FALSE to keep raw betweenness/closeness values
- Avoids compressed color gradients in ggtangle node coloring
- Affects 4 calls: betweenness/betweenness_w/closeness/closeness_w
- Add ml_lasso, ml_enet, ml_ridge, ml_rf, ml_svm_rfe, ml_xgboost
- Add run_ml_screening for batch execution (Mode A/B/C)
- Add visualization: ROC, Venn, importance, CV curves
- Add covid19 demo dataset (GSE157103)
- Add bookdown chapter 08-Machine-Learning-analysis.Rmd
- Rename 08-Other-resources → 09-Other-resources
- Update .gitignore to exclude build artefacts
- Remove tracked tar.gz from index
- best_iteration: try $best_iteration then $early_stop$best_iteration
- oof predictions: try $pred then $cv_predict$pred
- xgb.importance: handle both 'Feature' and 'Features' column names
- Add fallback to evaluation_log when best_iteration is NULL
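
The fallback order in these notes can be sketched with plain lists standing in for cross-validation results. The helper resolve_best_iter(), the fallback argument, and the test_logloss_mean column name are assumptions for illustration, not the code in ml_models.R.

```r
# Hypothetical helper mirroring the fallback order described in the notes.
# `cv` is any list-like CV result; `metric_col` names a column of its
# evaluation log (assumed here to be "test_logloss_mean").
resolve_best_iter <- function(cv, metric_col, maximize = FALSE, fallback = 100L) {
  best <- cv$best_iteration                                   # first location
  if (is.null(best)) best <- cv$early_stop$best_iteration     # second location
  if (is.null(best) || length(best) == 0L) {
    # Last resort: pick the best row of the evaluation log.
    scores <- cv$evaluation_log[[metric_col]]
    best <- if (maximize) which.max(scores) else which.min(scores)
  }
  # Final guard: which.min(NULL) yields integer(0), so fall back to a default.
  if (length(best) == 0L) best <- fallback
  as.integer(best)
}

# Mock results covering each branch (no xgboost needed to run this).
first_slot  <- list(best_iteration = 42L)
second_slot <- list(early_stop = list(best_iteration = 17L))
log_only    <- list(evaluation_log = data.frame(
  iter = 1:4, test_logloss_mean = c(0.60, 0.48, 0.45, 0.47)
))

it_first  <- resolve_best_iter(first_slot,  "test_logloss_mean")
it_second <- resolve_best_iter(second_slot, "test_logloss_mean")
it_log    <- resolve_best_iter(log_only,    "test_logloss_mean")
it_empty  <- resolve_best_iter(list(),      "test_logloss_mean")
```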
…e roxygen docs

- ml_enet(): relevel(y, ref=levels[2]) so glmnet event = levels[1] (positive);
  removes scattered 1-p inversions in oof_prob and plot_ml_roc
- plot_ml_roc(): SVM-RFE Mode B/C uses pre-computed test_performance
  probabilities instead of predict.rfe() (which requires all features)
- NAMESPACE: sort exports, remove unused ggplot2 imports
- man/: regenerated roxygen docs with examples and expanded descriptions
- ml_utils.R: defer caret dependency check to Mode B only
- Fix roxygen: replace bare @details with proper title/description block
- Add input validation (non-tcm_ml inputs raise informative error)
- Auto-name unnamed arguments from $method slot (dedup suffix)
- Set 'ml_data' attribute from first element (mirrors run_ml_screening)
- Rebuild man/create_tcm_ml_list.Rd via devtools::document()
- 08-Rmd: add '## Assembling models manually' section with worked example
- 08-Rmd: add '## Session Information' chunk at end of chapter
- Adds R/ml_gene_diag.R with .resolve_expr_group(), get_gene_auc(), plot_gene_roc(), and plot_gene_boxplot()
- Provides AUC computation with DeLong CIs via pROC
- Creates overlaid multi-gene ROC curves and dynamically faceted statistical boxplots from raw datasets or ML outputs, without adding new dependencies (uses only base stats and ggplot2 primitives)
- Documents functions and updates man pages / NAMESPACE
- Included evaluated examples in new 'Single-gene diagnostic analysis' subsection of bookdown
…e_roc / plot_gene_boxplot

- Supports all stats::p.adjust.methods: BH, bonferroni, holm, etc.
- Default remains 'BH' (Benjamini-Hochberg FDR)
- Single-gene case: p_adj == p_value automatically (no special handling needed)
- Validated via match.arg() against stats::p.adjust.methods
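
A minimal sketch of the validation and adjustment step described above, using only base stats. The function name adjust_gene_pvalues() and the argument name p_adjust_method are hypothetical; the package's own signature may differ.

```r
# Hypothetical helper: validate the method name, then adjust p-values.
adjust_gene_pvalues <- function(p_value, p_adjust_method = "BH") {
  # match.arg() raises an informative error for unsupported method names.
  p_adjust_method <- match.arg(p_adjust_method, stats::p.adjust.methods)
  data.frame(
    gene    = names(p_value),
    p_value = unname(p_value),
    p_adj   = unname(stats::p.adjust(p_value, method = p_adjust_method))
  )
}

# Toy p-values; gene names are placeholders.
multi  <- adjust_gene_pvalues(c(IL6 = 0.01, TNF = 0.02, STAT3 = 0.5), "bonferroni")
single <- adjust_gene_pvalues(c(IL6 = 0.03))  # default "BH"

# An unsupported method fails at the match.arg() step.
bad <- try(adjust_gene_pvalues(c(IL6 = 0.03), "not_a_method"), silent = TRUE)
```

With a single gene there is only one test, so every adjustment method returns the raw p-value unchanged, which is why no special case is needed.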
… xgboost nrounds guard; add pROC citation in docs

- .fmt_p() and .signif_stars() moved from inside plot_gene_boxplot() to file top-level with @keywords internal / @noRd
- ml_models.R: add final safety guard for xgboost best_iter (integer(0) case)
- docs/08: add pROC CRAN reference under Single-gene ROC curves section
- Fix en-dash (U+2013) -> ASCII '--' in roxygen to avoid Rd parse warnings
@Hinna0818
Contributor Author

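Addendum: get_gene_auc() uses the pROC package to compute a gene's AUC on the expression data; plot_gene_roc() draws the corresponding single-gene or multi-gene diagnostic ROC curves; plot_gene_boxplot() plots the gene's expression across the groups in the expression data. See https://hinna0818.github.io/TCMDATA/ml-analysis.html#ml-gene-diag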
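
For readers who want to see what a single-gene AUC measures, here is a base R sketch of the point estimate using the rank-sum (Mann-Whitney) identity. This is a swapped-in technique for illustration: the package itself relies on pROC, and this sketch provides no DeLong confidence interval. The helper rank_auc() and the toy data are assumptions, and the sketch assumes expression is higher in the positive group.

```r
# Hypothetical helper: AUC as the probability that a randomly chosen positive
# sample has higher expression than a randomly chosen negative sample.
# Base R only; not the pROC-based get_gene_auc() from this PR.
rank_auc <- function(expr, group, positive) {
  stopifnot(length(expr) == length(group))
  is_pos <- group == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(expr)  # ties receive average ranks
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy expression values for one gene across six samples.
expr  <- c(5.1, 6.3, 7.0, 4.2, 3.9, 5.5)
group <- c("case", "case", "case", "control", "control", "control")

auc <- rank_auc(expr, group, positive = "case")
```

In this toy data, 8 of the 9 case/control pairs are ordered correctly, so the estimate is 8/9.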

@GuangchuangYu merged commit e4ffa8a into YuLab-SMU:master on Mar 23, 2026