Refactor Multimodal by andrewscouten · Pull Request #15 · collaborativebioinformatics/OncoLearn

andrewscouten · 2026-02-26T19:23:08Z

Refactors multimodal code to work with:

- uv dependency management
- a git submodule located in "submodules/biomed-multi-omic" (instead of locally copied code with no git history)

Closes #5, #13


andrewscouten · 2026-02-27T18:37:12Z

It looks like some source code files weren't committed properly. There are multiple references to a src/multimodal/src/data module, but src/multimodal/.gitignore contains the pattern /src/data/*. These files were most likely ignored unintentionally, and as a result the code cannot run as-is.

@seohyun408, @seungjindes, @y00628, if any of you still have the source code, I'd appreciate it if you could push it. If not, I can try to reverse-engineer it from the existing code.
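
To illustrate why that pattern would hide the module, here is a minimal sketch of how a leading-slash .gitignore pattern is anchored to the directory that holds the .gitignore file. It is a simplified model, not git's full matching logic; the function name and the two example file names are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def ignored_by_anchored_pattern(repo_path: str, gitignore_dir: str, pattern: str) -> bool:
    """Simplified model of a leading-slash .gitignore pattern.

    A pattern starting with "/" is anchored to the directory holding the
    .gitignore file, so it is matched against the path relative to that
    directory. Real git matching has more rules (negation, "**",
    directory-only patterns) that this sketch does not cover.
    """
    try:
        relative = PurePosixPath(repo_path).relative_to(gitignore_dir)
    except ValueError:
        # The path lies outside the directory that owns this .gitignore.
        return False
    return fnmatchcase("/" + relative.as_posix(), pattern)


pattern = "/src/data/*"           # pattern quoted in the comment above
gitignore_dir = "src/multimodal"  # directory containing the .gitignore

for path in [
    "src/multimodal/src/data/dataset.py",   # hypothetical file name
    "src/multimodal/src/models/fusion.py",  # hypothetical file name
]:
    print(path, ignored_by_anchored_pattern(path, gitignore_dir, pattern))
```

On a real checkout, `git check-ignore -v <path>` reports which ignore rule, if any, matches a given path.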

seungjindes · 2026-02-27T23:14:49Z

The original source code will likely be difficult to obtain. Even if it is not a perfect reproduction, I think it would be best to reimplement it in a simplified form based on the existing code. Thank you.

Seungjin Han, M.S.
Data eXperience Laboratory, Department of Applied Artificial Intelligence
Sungkyunkwan University, Seoul, Republic of Korea
Web: https://seungjindes.github.io/about/ / Lab: http://dsl.skku.edu


seungjindes · 2026-02-28T02:50:40Z

I have uploaded the preprocessing code used for Andrew’s training as well as the README file. Although I could not find the original raw data files, I hope these materials will still be helpful. Also, regarding the hackathon paper, would it be okay if I make some edits to the draft we have written? Thank you.


seungjindes · 2026-02-28T02:51:51Z

The content to be included in the paper has been uploaded to Discord by River.


andrewscouten · 2026-02-28T15:27:41Z

@seungjindes I am not the team's writer. I would ask one of the two of them, assuming the writers have not changed since. In my opinion this is okay, but I would confirm with them first.


- Added a new `encoders.py` file to implement an encoder registration system.
- Implemented `register_encoder`, `get_encoder`, and `get_all_encoders` functions for managing encoders.
- Updated imports in `train_nvflare.py` to reflect new encoder structure.
- Refactored `trainer.py` to utilize PyTorch Lightning for training orchestration.
- Removed legacy configuration management code from `config.py`.
- Added tests for configuration loading and validation in `test_config.py`.
- Created example configuration files for testing purposes.
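
As a rough illustration of what an encoder registration system can look like, here is a minimal sketch. Only the three function names come from the commit message; their signatures, the decorator style, the module-level dictionary, and `IdentityEncoder` are assumptions, and the actual `encoders.py` may differ.

```python
from typing import Callable, Dict

# Hypothetical module-level registry; the real encoders.py may store this differently.
_ENCODERS: Dict[str, type] = {}


def register_encoder(name: str) -> Callable[[type], type]:
    """Class decorator that records an encoder class under `name`."""
    def decorator(cls: type) -> type:
        if name in _ENCODERS:
            raise ValueError(f"Encoder '{name}' is already registered")
        _ENCODERS[name] = cls
        return cls
    return decorator


def get_encoder(name: str) -> type:
    """Look up a registered encoder class, listing known names on failure."""
    try:
        return _ENCODERS[name]
    except KeyError:
        known = ", ".join(sorted(_ENCODERS)) or "<none>"
        raise KeyError(f"Unknown encoder '{name}'. Registered: {known}") from None


def get_all_encoders() -> Dict[str, type]:
    """Return a copy so callers cannot mutate the registry by accident."""
    return dict(_ENCODERS)


@register_encoder("identity")  # hypothetical encoder used only for this sketch
class IdentityEncoder:
    def __call__(self, features):
        return features


encoder = get_encoder("identity")()
print(encoder([1.0, 2.0]))
```

A registry like this lets a configuration file refer to encoders by string name, which fits the config-driven setup described elsewhere in this PR.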

- Added YAML error handling in XenaCohortBuilder to raise ValueError for invalid configurations.
- Filtered empty cohort names in download script to prevent processing errors.
- Initialized _full_dataset in ImageDataModule and ClinicalDataModule to improve data handling.
- Updated PillowLoader to provide more informative error messages for image loading failures.
- Improved dataset validation in MultimodalDataModule to ensure only valid labels are processed.
- Enhanced encoder classes to conditionally freeze models based on configuration settings.

- Enhanced OncoTrainer to support hyperparameter optimization (HPO) using Optuna, including a new method to run HPO and apply best parameters to the training configuration.
- Updated L1 regularization calculation in BaseOncoClassifier to only include parameters that require gradients.
- Changed the registration string for GatedLateFusionConfig to include the full module path.
- Adjusted gradient clipping value handling in OncoTrainer to ensure it is only applied when greater than zero.
- Updated dependency management in `uv.lock` to include new packages: alembic, colorlog, greenlet, mako, optuna, and sqlalchemy, along with their respective versions and dependencies.
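
The L1 regularization change can be sketched without any deep learning framework. The `Param` dataclass below is a hypothetical stand-in for a framework parameter with a `requires_grad` flag, and `l1_penalty` is an illustrative name; this is not the code in BaseOncoClassifier.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Param:
    """Hypothetical stand-in for a framework parameter (values plus a requires_grad flag)."""
    values: List[float]
    requires_grad: bool = True


def l1_penalty(params: Iterable[Param], l1_lambda: float) -> float:
    """Sum |w| over trainable parameters only, scaled by l1_lambda."""
    total = sum(abs(v) for p in params if p.requires_grad for v in p.values)
    return l1_lambda * total


params = [
    Param([0.5, -1.5]),                         # trainable weights, included
    Param([10.0, -20.0], requires_grad=False),  # frozen weights, skipped
]
print(l1_penalty(params, l1_lambda=0.01))
```

Skipping frozen parameters keeps the penalty from being dominated by weights that the optimizer cannot change, such as a frozen pretrained encoder.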


- Introduced unit tests for the pipeline executor in `test_pipeline_executor.py`, covering various scenarios including loading data, joining datasets, and handling errors.
- Added unit tests for pipeline nodes in `test_pipeline_nodes.py`, validating default behaviors and configurations for `DataSource`, `Load`, `Join`, `Sequence`, and modality classes.
- Refactored image and multimodal data modules to improve structure and consistency in `test_image_e2e.py`, `test_multimodal_e2e.py`, and `test_tabular_e2e.py`.
- Updated configuration tests in `test_config.py` to reflect changes in the pipeline-based schema and removed deprecated modality tests.
- Consolidated data module tests in `test_datamodules.py` to focus on the new `ImageDataModule` and removed legacy tests for `GeneDataModule` and `ClinicalDataModule`.
- Enhanced the dataset registry tests in `test_registry.py` to include dataset registration and retrieval functionalities.


- Created train and test split files for fold 0 to fold 4 in the PAM50 and stage datasets.
- Implemented logging functionality to capture the KFold generation process, including patient counts and splits.
- Updated Docker Compose configuration to mount the configs directory for easier access within containers.
- Enhanced the kfold.py script to log output to a file while also displaying it in the console.
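
A patient-level K-fold split with logging to both a file and the console can be sketched with the standard library alone. This is a minimal sketch under assumptions: the function names, the synthetic patient IDs, the log file location, and the plain (non-stratified) shuffle are all illustrative, and the actual kfold.py may stratify by label or use a library splitter.

```python
import logging
import os
import random
import sys
import tempfile
from typing import Dict, List, Sequence


def make_logger(log_path: str) -> logging.Logger:
    """Logger that writes each message to a file and to the console."""
    logger = logging.getLogger("kfold_sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False
    logger.addHandler(logging.FileHandler(log_path, mode="w"))
    logger.addHandler(logging.StreamHandler(sys.stdout))
    return logger


def kfold_splits(patient_ids: Sequence[str], n_folds: int = 5, seed: int = 0) -> List[Dict[str, List[str]]]:
    """Shuffle unique patient IDs and assign each to exactly one test fold."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        train = [pid for j, fold in enumerate(folds) if j != k for pid in fold]
        splits.append({"train": train, "test": folds[k]})
    return splits


patients = [f"P{i:03d}" for i in range(23)]  # synthetic IDs for illustration
log_path = os.path.join(tempfile.gettempdir(), "kfold_sketch.log")
logger = make_logger(log_path)
splits = kfold_splits(patients, n_folds=5)
for k, split in enumerate(splits):
    logger.info("fold %d: %d train / %d test patients", k, len(split["train"]), len(split["test"]))
```

Splitting on patient IDs rather than on individual samples keeps all data from one patient in the same fold, which avoids leakage between train and test sets.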


andrewscouten · 2026-03-15T13:41:51Z

Final verification still needs to be run to confirm that the code matches the metrics and methods described in the paper, but I will have to take an extended break for exams. As our manuscript will be published soon, I want to push this out beforehand. The current code in this PR:

- Trains models and runs inference.
- Reproduces the multimodal model's missing code accurately, to my understanding.
- Defines a config / DSL system for reproducibility.
- Defines a CLI for downloading and preprocessing data.
- Documents everything.

As such, I am going to merge this branch into main so that the project is in a more complete state when the manuscript is released.

andrewscouten added 6 commits February 26, 2026 13:11

Move multimodal scripts into scripts folder

2424948

Removed copied github repo to just use submodule instead.

1d15527

Assume we have already installed the submodule as a dependency

283a4b5

feat: Add libcurl to Dockerfile and update pyproject.toml to support multimodal dependencies.

d479437

Move "src/multimodal/src" into "src/multimodal"

85ab531

Fix uv header ordering

ac3fa16

andrewscouten self-assigned this Feb 27, 2026

andrewscouten added 18 commits February 28, 2026 11:23

Merge remote-tracking branch 'origin/main' into refactor/oncolearn-modules

6407247

Add Multimodal's README.md so that refactor can reference it more easily.

Refactor project to be more cohesive as a library

08c6905

Add tests for DL pipelines

5545199

Merge pull request #16 from collaborativebioinformatics/refactor/oncolearn-modules

6c01cb6

[Refactor] Oncolearn

More multimodal integration work, trying to use the API and dataloaders while keeping original functionality. Still needs more testing.

8e54177

Reformat tests

238c5c6

Remove unused code

8683d03

Fix Dockerfile for multimodal extras and biomed submodule

b4824f1

Remove outdated documentation

3a5f0a1

Add cache files to .gitignore

6d6df24

Better split project dependencies

9acb06a

Dataset fixes

b91d78a

Multimodal pipeline bugfixes

6790c09

Update docs to reflect multimodal refactor

82927cf

Create data/.gitignore and move relevant entries

9dc3138

Huggingface config replace with encoder config

4426dc3

Working pipeline, now just need to run it

a18bba5

andrewscouten added 14 commits March 8, 2026 18:05

Update config system to include optimizer, scheduler, and regularization

85c18e6

Move split files

77f7920

cBioPortal API and CLI

d3dd792

Refactor configuration loading and validation; remove legacy modality support

6bf0ecd

cBioPortal: Add support for downloading structural variants and generic assay data; enhance API client with retry logic

921d4aa

Refactor hyperparameter optimization configuration; streamline optimizer and loss parameter handling in YAML and code

6ba4266

Add MultiMarginLoss configuration options for training; enhance loss function flexibility

a82e753

Move all data source configs and splits to data/configs

6c6da46

Refactor CLI command structure to make it less monolithic

baabed0

cBioPortal download optimizations

6f7d95a

Tensorboard logging

59617d6

Add "label_col" to modality parsers

7593c68

andrewscouten linked an issue Mar 11, 2026 that may be closed by this pull request

[Docs] Flesh out the Wiki more #13

Closed

andrewscouten added 11 commits March 11, 2026 07:14

Move config files to "data/configs", where the data will be downloaded later to "data/source"

73a22b3

Implement multimodal pipelines for TCGA-BRCA with PAM50 and AJCC stage labels; refactor data modules and add Log2Normalization support

38c4305

Add support for learning rate schedulers in hyperparameter optimization configuration

0958537

Add preprocess subcommand for multimodal data with K-fold split generation

791c8ef

Implement cross-validation support in training configuration and add metrics logging callback

0464ae6

Refactor code structure for improved readability and maintainability

442d6c8

Bump Python version in Jupyter notebooks to 3.12.13 for compatibility

54ba0f0

Update README.md for clarity and structure; enhance quickstart instructions and key capabilities section

3df7928

Update submodule reference for wiki to latest commit

b7cf236

andrewscouten marked this pull request as ready for review March 15, 2026 13:42

andrewscouten merged commit cd90062 into main Mar 15, 2026

andrewscouten deleted the refactor/multiomnics branch March 15, 2026 13:42

andrewscouten merged 58 commits into main from refactor/multiomnics
