Feat: Add machine learning module in TCMDATA for key targets screening #15
Merged
GuangchuangYu merged 13 commits into YuLab-SMU:master on Mar 23, 2026
- ggdot_sankey(): new gene_fontface param (default 'italic') for Gene axis
- tcm_sankey(): new target_fontface param (default 'italic') for target axis
- R CMD check: Status OK

- Default normalize=FALSE to keep raw betweenness/closeness values
- Avoids compressed color gradients in ggtangle node coloring
- Affects 4 calls: betweenness/betweenness_w/closeness/closeness_w

- Add ml_lasso, ml_enet, ml_ridge, ml_rf, ml_svm_rfe, ml_xgboost
- Add run_ml_screening for batch execution (Mode A/B/C)
- Add visualization: ROC, Venn, importance, CV curves
- Add covid19 demo dataset (GSE157103)
- Add bookdown chapter 08-Machine-Learning-analysis.Rmd
- Rename 08-Other-resources → 09-Other-resources
- Update .gitignore to exclude build artefacts
- Remove tracked tar.gz from index

- best_iteration: try $best_iteration then $early_stop$best_iteration
- oof predictions: try $pred then $cv_predict$pred
- xgb.importance: handle both 'Feature' and 'Features' column names
- Add fallback to evaluation_log when best_iteration is NULL
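
The lookup order in the commit above (try one slot, then a nested slot, then fall back to the evaluation log) can be sketched as follows. This is a language-neutral illustration in Python, not the package's R code. The dictionary layout, the `test_metric_mean` column name, and the "lowest metric wins" fallback rule are assumptions made for the example.

```python
from collections.abc import Mapping


def dig(obj, path):
    """Follow a sequence of keys; return None as soon as one is missing."""
    current = obj
    for key in path:
        if not isinstance(current, Mapping) or key not in current:
            return None
        current = current[key]
    return current


def first_present(obj, paths):
    """Return the first non-None value found along the candidate paths."""
    for path in paths:
        value = dig(obj, path)
        if value is not None:
            return value
    return None


def resolve_best_iteration(cv_result, metric="test_metric_mean"):
    """Resolve the best boosting round from a CV result.

    Tries `best_iteration`, then `early_stop.best_iteration`, then falls
    back to the evaluation log. The metric column name is hypothetical,
    and the fallback assumes a loss-like metric (lower is better).
    """
    best = first_present(
        cv_result,
        [("best_iteration",), ("early_stop", "best_iteration")],
    )
    if best is not None:
        return int(best)

    log = cv_result.get("evaluation_log") or []
    if not log:
        return None  # the caller decides on a final default
    # 1-based round index of the smallest logged metric value
    return min(range(len(log)), key=lambda i: log[i][metric]) + 1
```

Returning `None` when nothing is available leaves the final safety default to the caller, which keeps the lookup itself free of magic numbers.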
…e roxygen docs
- ml_enet(): relevel(y, ref=levels[2]) so glmnet event = levels[1] (positive); removes scattered 1-p inversions in oof_prob and plot_ml_roc
- plot_ml_roc(): SVM-RFE Mode B/C uses pre-computed test_performance probabilities instead of predict.rfe() (which requires all features)
- NAMESPACE: sort exports, remove unused ggplot2 imports
- man/: regenerated roxygen docs with examples and expanded descriptions
- ml_utils.R: defer caret dependency check to Mode B only

- Fix roxygen: replace bare @details with proper title/description block
- Add input validation (non-tcm_ml inputs raise informative error)
- Auto-name unnamed arguments from $method slot (dedup suffix)
- Set 'ml_data' attribute from first element (mirrors run_ml_screening)
- Rebuild man/create_tcm_ml_list.Rd via devtools::document()
- 08-Rmd: add '## Assembling models manually' section with worked example
- 08-Rmd: add '## Session Information' chunk at end of chapter
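
The auto-naming step in the commit above (take the name from the method slot when none is given, then deduplicate) might look roughly like this. It is a Python sketch for illustration only. The `_2`, `_3` suffix format and the function name `auto_name` are assumptions, not the behaviour of create_tcm_ml_list().

```python
def auto_name(methods, names=None):
    """Name each model, falling back to its method label when unnamed.

    `methods` holds each model's method label; `names` holds optional
    user-supplied names (None or "" means unnamed). Repeated names get a
    numeric suffix. The suffix format here is hypothetical.
    """
    if names is None:
        names = [None] * len(methods)
    if len(names) != len(methods):
        raise ValueError("names and methods must have the same length")

    raw = [name if name else method for name, method in zip(names, methods)]

    seen = {}
    result = []
    for name in raw:
        count = seen.get(name, 0)
        # The first occurrence keeps the bare name; later ones get _2, _3, ...
        result.append(name if count == 0 else f"{name}_{count + 1}")
        seen[name] = count + 1
    return result


print(auto_name(["lasso", "rf", "lasso"]))  # ['lasso', 'rf', 'lasso_2']
```

The sketch does not guard against a user-supplied name that happens to equal a generated one (for example an explicit `lasso_2`). A production version would need to re-check the results for uniqueness.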
- Adds R/ml_gene_diag.R with .resolve_expr_group(), get_gene_auc(), plot_gene_roc(), and plot_gene_boxplot()
- Provides AUC computation with DeLong CIs via pROC
- Creates overlaid multi-gene ROC curves and faceted statistical boxplots from raw datasets or ML outputs, without adding dependencies (base stats and ggplot2 primitives only)
- Documents functions and updates man pages / NAMESPACE
- Includes evaluated examples in new 'Single-gene diagnostic analysis' subsection of bookdown
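
For readers unfamiliar with single-gene AUC, the quantity can be sketched as a pairwise comparison: the fraction of (case, control) pairs in which the case has the higher expression value, with ties counted as one half. The Python snippet below illustrates that definition only. It is not how get_gene_auc() or pROC computes it, it omits the DeLong confidence interval, and it assumes higher expression in the positive group.

```python
def auc_pairwise(pos_scores, neg_scores):
    """AUC as the probability that a positive sample outranks a negative one.

    Ties contribute 0.5. This is O(n_pos * n_neg), which is fine for a
    sketch but not for large cohorts.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both groups need at least one sample")

    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up expression values for one gene, for illustration only
print(auc_pairwise([3.0, 4.0], [1.0, 2.0]))  # 1.0 (perfect separation)
print(auc_pairwise([1.0, 2.0], [1.0, 2.0]))  # 0.5 (no separation)
```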
…e_roc / plot_gene_boxplot
- Supports all stats::p.adjust.methods: BH, bonferroni, holm, etc.
- Default remains 'BH' (Benjamini-Hochberg FDR)
- Single-gene case: p_adj == p_value automatically (no special handling needed)
- Validated via match.arg() against stats::p.adjust.methods
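
The single-gene remark above follows directly from the Benjamini-Hochberg formula: with m = 1 the only p-value has rank 1, so p × m / rank equals p. A minimal Python sketch of the BH adjustment, written for illustration (the package itself relies on R's stats::p.adjust), makes this visible:

```python
def p_adjust_bh(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order.

    Walks from the largest p-value to the smallest, scaling each by
    m / rank and enforcing monotonicity with a running minimum.
    """
    m = len(pvalues)
    descending = sorted(range(m), key=lambda i: pvalues[i], reverse=True)

    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(descending):
        rank = m - position  # ascending rank, from m down to 1
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


print(p_adjust_bh([0.01]))  # [0.01]: one test, nothing to adjust
```

With several genes the adjustment does take effect. For example, the inputs 0.01, 0.04, 0.03 give approximately 0.03, 0.04, 0.04.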
… xgboost nrounds guard; add pROC citation in docs
- .fmt_p() and .signif_stars() moved from inside plot_gene_boxplot() to file top-level with @keywords internal / @noRd
- ml_models.R: add final safety guard for xgboost best_iter (integer(0) case)
- docs/08: add pROC CRAN reference under Single-gene ROC curves section
- Fix en-dash (U+2013) -> ASCII '--' in roxygen to avoid Rd parse warnings
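This PR adds a machine-learning module for key target screening to TCMDATA. It sits downstream of network pharmacology analysis: starting from a PPI / WGCNA candidate gene set, it screens for key genes using RNA-seq expression profiles. Based on what is common in the literature, this PR includes the most widely used models: elastic net, random forest with Boruta, SVM-RFE, and XGBoost. The usual practice is to run these models independently, perform feature selection with each, and keep the genes they identify in common (by intersection) for downstream validation.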
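
The intersection step described above generalizes to "selected by at least k models". The Python sketch below illustrates that idea only. It is not the implementation of get_ml_consensus(), and the gene symbols and model keys are placeholders, not results from this PR.

```python
from collections import Counter


def consensus_genes(selected, k):
    """Return genes chosen by at least k models, sorted alphabetically.

    `selected` maps a model name to the genes that model kept. Setting k
    to the number of models gives the strict intersection.
    """
    if not 1 <= k <= len(selected):
        raise ValueError("k must be between 1 and the number of models")

    # set() ensures a gene counts once per model even if listed twice
    counts = Counter(gene for genes in selected.values() for gene in set(genes))
    return sorted(gene for gene, n in counts.items() if n >= k)


# Placeholder selections, for illustration only
selected = {
    "enet": ["IL6", "TNF", "STAT3"],
    "rf": ["IL6", "STAT3"],
    "svm_rfe": ["IL6", "TP53"],
    "xgboost": ["IL6", "TNF"],
}
print(consensus_genes(selected, k=4))  # ['IL6']
print(consensus_genes(selected, k=2))  # ['IL6', 'STAT3', 'TNF']
```

Lowering k trades strictness for recall: the strict intersection can be empty when the models disagree, while k = 2 keeps any gene that two methods support.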
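What's new

Core functions (R/ml_models.r, R/ml_utils.R)
- prepare_ml_data(): data preprocessing; supports three validation modes
- ml_lasso(), ml_enet(), ml_ridge(), ml_rf(), ml_svm_rfe(), ml_xgboost()
- run_ml_screening(): batch execution of multiple methods
- get_ml_consensus(): extracts the genes selected by ≥ k models
- select_features(): post hoc trimming for models that retain all features, such as Ridge and XGBoost (the cutoff is a subjective choice)

Visualization (R/ml_plots.R)
- plot_ml_roc(): overlaid ROC curves for multiple methods (all three modes supported)
- plot_ml_venn(): Venn diagram of the gene sets selected by each method

Data and documentation
- covid19 dataset (GSE157103, 100 samples × 5000 genes), used for examples and tests
- 08-Machine-Learning-analysis.Rmd, with complete usage examples covering all three modes

Design notes
- …tcm_ml), with a consistent access interface ($selected_features, $importance, $cv_performance, $test_performance). For details, see the 8-ML documentation section: 8-ML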