Feat: Add machine learning module in `TCMDATA` for key targets screening by Hinna0818 · Pull Request #15 · YuLab-SMU/TCMDATA

Hinna0818 · 2026-03-19T06:14:47Z

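This PR adds a machine-learning module to TCMDATA for screening key targets. It is meant as a downstream step of network pharmacology analysis: starting from a PPI / WGCNA candidate gene set, it uses RNA-seq expression profiles to screen for key genes. Based on what is actually used in the literature, this PR includes the most widely applied models: elastic net, random forest (Boruta), SVM-RFE, and XGBoost. The usual workflow is to run these models independently, perform feature selection with each, and keep the genes identified in common (by intersection) for downstream validation.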

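New content

Core functions (R/ml_models.r, R/ml_utils.R)

prepare_ml_data(): data preprocessing, with three validation modes:
- Mode A: cross-validation on the full dataset (no held-out set; commonly used for feature selection. Metrics come from out-of-bag (OOB) predictions, as in random forest, or from out-of-fold CV predictions, so the evaluation does not suffer from data leakage)
- Mode B: random internal train/test split
- Mode C: external independent validation set (for example, when the user wants to evaluate the selected features on a separate, independent dataset)
Six models: ml_lasso(), ml_enet(), ml_ridge(), ml_rf(), ml_svm_rfe(), ml_xgboost()
run_ml_screening(): runs multiple methods in batch
get_ml_consensus(): extracts the genes selected by ≥ k models
select_features(): post-hoc trimming for models such as Ridge and XGBoost that retain all features (the cutoff is a subjective choice)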
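The consensus step can be illustrated with a minimal base-R sketch. `consensus_genes()` and the gene names below are hypothetical and only mirror the idea behind get_ml_consensus(); the package function may take different arguments and return a different structure.

```r
# Hypothetical helper (not the TCMDATA implementation):
# keep genes that were selected by at least k methods.
consensus_genes <- function(selected, k = 2) {
  stopifnot(is.list(selected), k >= 1)
  # Count each gene once per method, even if a method lists it twice.
  votes <- table(unlist(lapply(selected, unique)))
  sort(names(votes)[votes >= k])
}

# Toy selections from three methods (made-up gene names).
selected <- list(
  lasso   = c("GENE1", "GENE2", "GENE3"),
  rf      = c("GENE2", "GENE3", "GENE4"),
  svm_rfe = c("GENE3", "GENE5")
)

consensus_genes(selected, k = 2)  # "GENE2" "GENE3"
consensus_genes(selected, k = 3)  # "GENE3"
```

Setting k to the number of methods gives the strict intersection; a smaller k trades stringency for a larger candidate list.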

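Visualization (R/ml_plots.R)

Per-model CV curves, coefficient path plots, and importance bar charts
plot_ml_roc(): overlaid ROC curves for multiple methods (all three modes supported)
plot_ml_venn(): Venn diagram of the gene sets selected by each method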

Data and documentation

covid19 dataset (GSE157103, 100 samples × 5000 genes), used for examples and tests
bookdown chapter 08-Machine-Learning-analysis.Rmd, with complete usage examples covering all three modes

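Design notes

All model-building functions return a unified S3 object (tcm_ml) with a consistent access interface ($selected_features, $importance, $cv_performance, $test_performance).
All hyperparameter tuning is done by cross-validation inside the training set; the test set takes no part in any model selection.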
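A minimal sketch of what such a unified S3 object could look like. `new_tcm_ml()` is a hypothetical constructor written for illustration; the four `$` slots and the `$method` slot are the ones named in this PR, while everything else (argument order, the toy values) is an assumption.

```r
# Hypothetical constructor; the real tcm_ml object may carry more fields.
new_tcm_ml <- function(method, selected_features, importance,
                       cv_performance = NULL, test_performance = NULL) {
  structure(
    list(
      method            = method,
      selected_features = selected_features,
      importance        = importance,
      cv_performance    = cv_performance,
      test_performance  = test_performance
    ),
    class = "tcm_ml"
  )
}

# Two toy results with made-up genes and importance scores.
fit_lasso <- new_tcm_ml(
  "lasso", c("GENE1", "GENE2"),
  data.frame(feature = c("GENE1", "GENE2"), importance = c(0.8, 0.3))
)
fit_rf <- new_tcm_ml(
  "rf", c("GENE2", "GENE3"),
  data.frame(feature = c("GENE2", "GENE3"), importance = c(0.6, 0.5))
)

# Because every model exposes the same slots, downstream code
# can treat all of them alike.
models <- list(fit_lasso, fit_rf)
names(models) <- vapply(models, function(m) m$method, character(1))
lapply(models, function(m) m$selected_features)
```

The point of the shared interface is that plotting and consensus helpers never need to know which algorithm produced a given result.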

For details, see the 8-ML section of the documentation: 8-ML

- Add ml_lasso, ml_enet, ml_ridge, ml_rf, ml_svm_rfe, ml_xgboost
- Add run_ml_screening for batch execution (Mode A/B/C)
- Add visualization: ROC, Venn, importance, CV curves
- Add covid19 demo dataset (GSE157103)
- Add bookdown chapter 08-Machine-Learning-analysis.Rmd
- Rename 08-Other-resources → 09-Other-resources
- Update .gitignore to exclude build artefacts
- Remove tracked tar.gz from index

- best_iteration: try $best_iteration then $early_stop$best_iteration
- oof predictions: try $pred then $cv_predict$pred
- xgb.importance: handle both 'Feature' and 'Features' column names
- Add fallback to evaluation_log when best_iteration is NULL
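The lookup order described in this commit can be sketched in base R as follows. Plain lists stand in for the cross-validation result; `first_usable()`, `resolve_best_iter()`, the `metric_col` argument, and the `nrounds` fallback are hypothetical names, and the sketch assumes a lower-is-better metric. It is not the code in ml_models.R.

```r
# Return the first argument that is neither NULL nor zero-length.
first_usable <- function(...) {
  for (x in list(...)) {
    if (length(x) > 0) return(x)
  }
  NULL
}

# Hypothetical resolver: try the two known slots, then the evaluation log,
# then fall back to the requested number of rounds.
resolve_best_iter <- function(cv, metric_col, nrounds) {
  best <- first_usable(
    cv[["best_iteration"]],
    cv[["early_stop"]][["best_iteration"]]
  )
  if (is.null(best)) {
    eval_log <- cv[["evaluation_log"]]
    # Assumes lower is better (e.g. logloss); use which.max() for AUC.
    best <- which.min(eval_log[[metric_col]])
  }
  if (length(best) == 0) best <- nrounds  # guard for the integer(0) case
  as.integer(best)
}

# Mock results covering each branch.
cv_old  <- list(best_iteration = 12L)
cv_new  <- list(early_stop = list(best_iteration = 7L))
cv_log  <- list(evaluation_log = data.frame(
  iter = 1:4, test_logloss_mean = c(0.60, 0.40, 0.35, 0.50)
))

resolve_best_iter(cv_old, "test_logloss_mean", 50L)  # 12
resolve_best_iter(cv_new, "test_logloss_mean", 50L)  # 7
resolve_best_iter(cv_log, "test_logloss_mean", 50L)  # 3
resolve_best_iter(list(), "test_logloss_mean", 50L)  # 50
```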

…e roxygen docs
- ml_enet(): relevel(y, ref=levels[2]) so glmnet event = levels[1] (positive); removes scattered 1-p inversions in oof_prob and plot_ml_roc
- plot_ml_roc(): SVM-RFE Mode B/C uses pre-computed test_performance probabilities instead of predict.rfe() (which requires all features)
- NAMESPACE: sort exports, remove unused ggplot2 imports
- man/: regenerated roxygen docs with examples and expanded descriptions
- ml_utils.R: defer caret dependency check to Mode B only
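The effect of the relevel() call can be seen with a small base-R sketch. The class labels are made up, and the sketch only shows the level reordering; per the commit message, the model then treats the second level as the event, so the original first level becomes the positive class without any 1 - p inversion afterwards.

```r
# Toy outcome with the positive class listed first (made-up labels).
y <- factor(c("Severe", "Mild", "Severe", "Mild"),
            levels = c("Severe", "Mild"))
positive <- levels(y)[1]
levels(y)                  # "Severe" "Mild"

# Move the second level to the reference (first) position.
y_model <- relevel(y, ref = levels(y)[2])
levels(y_model)            # "Mild" "Severe"

# The original positive class is now the second level,
# i.e. the one modelled as the event.
levels(y_model)[2] == positive   # TRUE

# The observations themselves are unchanged, only the level order moved.
identical(as.character(y), as.character(y_model))   # TRUE
```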

- Fix roxygen: replace bare @details with proper title/description block
- Add input validation (non-tcm_ml inputs raise informative error)
- Auto-name unnamed arguments from $method slot (dedup suffix)
- Set 'ml_data' attribute from first element (mirrors run_ml_screening)
- Rebuild man/create_tcm_ml_list.Rd via devtools::document()
- 08-Rmd: add '## Assembling models manually' section with worked example
- 08-Rmd: add '## Session Information' chunk at end of chapter

- Adds R/ml_gene_diag.R with .resolve_expr_group(), get_gene_auc(), plot_gene_roc(), and plot_gene_boxplot()
- Provides AUC computation with DeLong CIs via pROC
- Creates multiplexed ROC overlays and statistical boxplots dynamically faceted, mapped to raw datasets or ML outputs without imposing additional dependency footprint (using base stats and strictly ggplot2 primitives/geom computations)
- Documents functions and updates man pages / NAMESPACE
- Included evaluated examples in new 'Single-gene diagnostic analysis' subsection of bookdown

…e_roc / plot_gene_boxplot
- Supports all stats::p.adjust.methods: BH, bonferroni, holm, etc.
- Default remains 'BH' (Benjamini-Hochberg FDR)
- Single-gene case: p_adj == p_value automatically (no special handling needed)
- Validated via match.arg() against stats::p.adjust.methods
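A minimal sketch of the validation and adjustment pattern described here. `adjust_gene_p()` and its argument names are hypothetical; only stats::p.adjust(), stats::p.adjust.methods, and match.arg() are taken from the commit message.

```r
# Hypothetical wrapper: validate the method name, then adjust.
adjust_gene_p <- function(p_value, p_adjust_method = "BH") {
  # Errors with an informative message if the method is not supported.
  p_adjust_method <- match.arg(p_adjust_method, stats::p.adjust.methods)
  stats::p.adjust(p_value, method = p_adjust_method)
}

# Several genes: Benjamini-Hochberg by default.
adjust_gene_p(c(0.01, 0.04, 0.03))                 # 0.03 0.04 0.04

# Single gene: with one test the adjusted value equals the raw p-value,
# so no special case is needed.
adjust_gene_p(0.03)                                # 0.03
adjust_gene_p(0.03, p_adjust_method = "bonferroni")  # 0.03

# An unsupported method name is rejected by match.arg().
bad <- tryCatch(adjust_gene_p(0.03, "not_a_method"),
                error = function(e) "rejected")
bad                                                # "rejected"
```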

… xgboost nrounds guard; add pROC citation in docs
- .fmt_p() and .signif_stars() moved from inside plot_gene_boxplot() to file top-level with @keywords internal / @noRd
- ml_models.R: add final safety guard for xgboost best_iter (integer(0) case)
- docs/08: add pROC CRAN reference under Single-gene ROC curves section
- Fix en-dash (U+2013) -> ASCII '--' in roxygen to avoid Rd parse warnings

Hinna0818 · 2026-03-21T13:35:22Z

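Addendum: get_gene_auc() uses the pROC package to compute a gene's AUC on the expression data; plot_gene_roc() draws the corresponding single-gene or multi-gene diagnostic ROC curves; plot_gene_boxplot() plots a gene's expression across the groups in the expression data. See https://hinna0818.github.io/TCMDATA/ml-analysis.html#ml-gene-diag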
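To make the single-gene AUC concrete, here is a dependency-free sketch that computes it from ranks (the Mann-Whitney formulation). This is a swapped-in technique for illustration, not what get_gene_auc() does: the package relies on pROC, which also provides DeLong confidence intervals. `auc_rank()` and the toy values are hypothetical, and the sketch assumes higher expression in the positive group.

```r
# Hypothetical helper: AUC of one gene as the probability that a random
# positive sample has higher expression than a random negative sample.
auc_rank <- function(expr, group, positive) {
  is_pos <- group == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(expr)  # ties receive average ranks
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy expression values for one gene in six samples.
expr  <- c(5.1, 6.3, 7.0, 4.2, 4.8, 5.5)
group <- c("case", "case", "case", "control", "control", "control")

auc_rank(expr, group, positive = "case")  # 8/9, about 0.889
```

Eight of the nine case/control pairs are ordered correctly here, which is where the 8/9 comes from.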

Hinna0818 added 13 commits March 12, 2026 01:21

feat(sankey): add gene_fontface/target_fontface for italic gene labels

bd94c84

- ggdot_sankey(): new gene_fontface param (default 'italic') for Gene axis
- tcm_sankey(): new target_fontface param (default 'italic') for target axis
- R CMD check: Status OK

add branch 'ml' for deployment

5a90be7

set font_face as plain in default

6978813

feat(PPI): add normalize param to compute_nodeinfo()

9d5760a

- Default normalize=FALSE to keep raw betweenness/closeness values
- Avoids compressed color gradients in ggtangle node coloring
- Affects 4 calls: betweenness/betweenness_w/closeness/closeness_w

Merge remote-tracking branch 'upstream/master' into ml

6a9ca16

update readme

7145b6f

GuangchuangYu merged commit e4ffa8a into YuLab-SMU:master Mar 23, 2026

GuangchuangYu merged 13 commits into YuLab-SMU:master from Hinna0818:ml
