Include challenge teams1 5 by Ultimate-Storm · Pull Request #226 · KatherLab/MediSwarm

Ultimate-Storm · 2026-03-27T09:00:07Z

Summary

Integrates five ODELIA challenge models into MediSwarm for federated swarm training, along with infrastructure improvements for testing, deployment, and CI reliability.

Challenge Models Added

Job	Model Architecture	Directory
`challenge_1DivideAndConquer`	ResidualEncoder	`application/jobs/challenge_1DivideAndConquer`
`challenge_2BCN_AIM`	SwinUNETR	`application/jobs/challenge_2BCN_AIM`
`challenge_3agaldran`	MViT v2	`application/jobs/challenge_3agaldran`
`challenge_4abmil`	CrossModalAttentionABMIL + Swin	`application/jobs/challenge_4abmil`
`challenge_5pimed`	ResNet18	`application/jobs/challenge_5pimed`

Each challenge job has its own config_fed_client.conf, model code, main.py, data pipeline, and synthetic dataset generator.

Key Changes

Challenge model integration:

Each job is a self-contained NVFlare application under application/jobs/
Hardcoded MODEL_NAME in each challenge's main.py to prevent the MODEL_NAME=${MODEL_NAME:-MST} env var override from docker.sh silently selecting the wrong model
Pretrained weights (checkpoint_final.pth for challenge 1, mvit_v2_s-ae3be167.pth for challenge 3) are cached to /MediSwarm/pretrained_weights/ in the Docker image, outside job directories, to prevent NVFlare from bundling large .pth files during job submission

--job flag for docker.sh:

Added --job <job_name> parameter to docker.sh (via master_template.yml) for --preflight_check and --local_training modes
Defaults to ODELIA_ternary_classification (backward compatible)
Allows participants to test individual challenge models locally before joining the swarm

Deployment tooling:

New deploy_and_test.sh script automating multi-site Docker image push, startup kit deployment, server/client lifecycle, and job submission
New kit_live_sync/ for live sync of startup kits with heartbeat monitoring
New DEPLOY_README.md documenting the deployment workflow

CI fixes:

Fixed executable permissions on runIntegrationTests.sh and 24 other .sh/.exp scripts (644 → 755)
Auto-install gdown into a temporary venv when not found during Docker build (CI runners don’t have it pre-installed)
Pushed NVFlare submodule commits (timeout increase, dashboard removal) so CI can fetch the referenced commit

Documentation updates:

README.participant.md: Added --job flag usage, dataset validation steps with --log_dataset_details, timezone troubleshooting
README.developer.md: Added --job flag docs, challenge jobs table, configurable docker.sh parameters reference

Testing

All CI integration tests passing (startup kit generation, standalone training, simulation mode, PoC mode, 3DCNN simulation, license checks, preflight checks, swarm training)
Challenge models tested in simulation mode and actual swarm deployment

NVFlare copies job Python files to a workspace directory at deploy time, but large .pth weight files are not included in the job bundle. The model code was resolving checkpoint paths relative to __file__, which pointed to the workspace copy where the weights don't exist, causing FileNotFoundError or resource reservation timeouts ("No reserved resources").

- Add Docker baked-in path fallback to challenge_3agaldran model_factory and challenge_1DivideAndConquer model.py
- Update _cacheAndCopyPretrainedModelWeights.sh to download checkpoint_final.pth from Google Drive via gdown when not cached locally
- Add deploy_and_test.sh automation script for build/push/deploy/test
- Gitignore deploy_sites.conf (deployment credentials)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The build script passes the project YAML path into a Docker container where the absolute host path doesn't exist. Use a relative path and run from the repo root instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

NVFlare packages the entire job folder when submitting, so .pth files inside job dirs caused ~800MB transfers to each client. Now weights are stored at /MediSwarm/pretrained_weights/ in the Docker image and model code falls back to that path at runtime. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
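
The runtime fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `resolve_checkpoint` and its parameters are hypothetical, and only the directory `/MediSwarm/pretrained_weights/` and the file name `checkpoint_final.pth` come from the commit messages.

```python
from pathlib import Path

# Directory named in this PR; the weights are baked into the Docker image there.
BAKED_IN_WEIGHTS_DIR = Path("/MediSwarm/pretrained_weights")


def resolve_checkpoint(filename: str, module_file: str,
                       baked_in_dir: Path = BAKED_IN_WEIGHTS_DIR) -> Path:
    """Return the first existing location of `filename` (hypothetical helper).

    `module_file` is the caller's __file__. In an NVFlare workspace copy the
    weights are usually not next to the code, so the baked-in directory is
    tried as a second candidate.
    """
    candidates = [
        Path(module_file).resolve().parent / filename,  # next to the model code
        baked_in_dir / filename,                        # Docker baked-in fallback
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"{filename} not found in any of: {[str(c) for c in candidates]}"
    )
```

A model module would call something like `resolve_checkpoint("checkpoint_final.pth", __file__)`. Raising with the full candidate list keeps the failure visible instead of surfacing later as a timeout.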

- config_fed_client.conf: fix typo "models.odel.create_model" → "models.model.create_model"
- model.py: update create_model signature to accept factory-pattern args (logger, loss_kwargs, env_vars, **kwargs) matching threedcnn_ptl.py caller

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The __init__.py imported .resnet, .mst, .swin3D which don't exist in the models directory, causing ModuleNotFoundError when NVFlare's PTFileModelPersistor tries to import models.model.create_model. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

get_unified_model_name() was only checking for "5Pimed" but main.py passes "challenge_5Pimed". The function now resolves both variants to the actual architecture name "resnet18" via get_model_config(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
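
One way to resolve both spellings is to strip the job prefix before the lookup. The sketch below reuses the function names from the commit message purely for illustration; the function bodies, the `_MODEL_CONFIGS` dictionary, and the `"architecture"` key are assumptions, and only the mapping to `"resnet18"` is stated in the PR.

```python
# Hypothetical registry; the real configuration in the repository may differ.
_MODEL_CONFIGS = {
    "5Pimed": {"architecture": "resnet18"},
}
_JOB_PREFIX = "challenge_"


def get_model_config(name: str) -> dict:
    """Look up a model config, accepting names with or without the job prefix."""
    key = name[len(_JOB_PREFIX):] if name.startswith(_JOB_PREFIX) else name
    try:
        return _MODEL_CONFIGS[key]
    except KeyError:
        raise ValueError(f"Unknown model name: {name!r}") from None


def get_unified_model_name(name: str) -> str:
    """Map "5Pimed" and "challenge_5Pimed" to the same architecture name."""
    return get_model_config(name)["architecture"]


print(get_unified_model_name("5Pimed"))            # resnet18
print(get_unified_model_name("challenge_5Pimed"))  # resnet18
```

Failing loudly on unknown names means a stray value such as `"MST"` is rejected at lookup time rather than reaching a later assertion.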

The Docker container sets MODEL_NAME=MST globally via docker.sh. os.getenv("MODEL_NAME", "challenge_5Pimed") was picking up "MST" instead of the intended default, causing Resnet assertion failure. Hardcode the model name since each challenge job knows its own model. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
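
The override is easy to reproduce outside Docker. The snippet below simulates the container environment by setting the variable by hand; the two variable names are chosen for this illustration, while the values `MST` and `challenge_5Pimed` are the ones named in the commit message.

```python
import os

# Simulate the container: docker.sh exports MODEL_NAME=MST for every job.
os.environ["MODEL_NAME"] = "MST"

# Before the fix: the default only applies when the variable is unset,
# so the globally exported value wins.
model_name_before = os.getenv("MODEL_NAME", "challenge_5Pimed")

# After the fix: the job names its own model and ignores the environment.
MODEL_NAME = "challenge_5Pimed"

print(model_name_before)  # MST
print(MODEL_NAME)         # challenge_5Pimed
```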

…pdate docs

- Fix MODEL_NAME env var override bug in challenge_1-4 main.py (Docker env sets MODEL_NAME=MST, overriding defaults). Now hardcoded like challenge_5.
- Add --job parameter to docker.sh (master_template.yml) so preflight_check and local_training can target any challenge model.
- Create DEPLOY_README.md documenting the deploy_and_test.sh workflow.
- Update README.participant.md: replace MODEL_NAME export with --job examples.
- Update README.developer.md: fix params table, add --job docs, add challenge jobs table.
- Update root README.md: add link to DEPLOY_README.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Resolve conflicts:

- ODELIA app files (env_config, main, threedcnn_ptl): take main's version (dataset validation, log_dataset_details, test split support)
- Dockerfile_ODELIA: take main's pinned apt versions
- odelia_image.version: take main's 1.0.3
- master_template.yml: merge --job flag with --log_dataset_details
- READMEs: merge challenge --job docs with dataset validation steps
- remove_old_odelia_docker_images.sh: take main's clean version

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The CI runners don't have gdown installed globally. Instead of failing, the script now creates a temporary venv and installs gdown automatically when needed to download the 1DivideAndConquer checkpoint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

runIntegrationTests.sh and several test scripts were tracked as 644 (not executable), causing "Permission denied" failures on CI runners. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The previous commit fixed 7 scripts but missed 17 others. This sets +x on every shell/expect script in the repo so Docker entrypoints and CI steps (e.g. _list_licenses.sh) no longer fail with "Permission denied". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
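
A repo-wide sweep of this kind could look like the sketch below. It is an illustration rather than the command used in the PR: the function name and the choice of suffixes (`.sh`, `.exp`, taken from the PR summary) are assumptions, and it only changes the mode on disk, so the change still has to be staged and committed for Git to record it.

```python
import os
import stat
from pathlib import Path

SCRIPT_SUFFIXES = {".sh", ".exp"}
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def make_scripts_executable(root: str) -> list[Path]:
    """Add u+x, g+x, o+x to every shell/expect script under `root`.

    Returns the paths whose mode was changed (e.g. 644 becomes 755).
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in SCRIPT_SUFFIXES:
            continue
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & EXEC_BITS != EXEC_BITS:
            os.chmod(path, mode | EXEC_BITS)
            changed.append(path)
    return changed
```

Returning the list of changed files makes it easy to check that no script was missed, which is the gap the follow-up commit had to close.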

oleschwen and others added 30 commits October 29, 2025 14:15

added clients

2464c78

chore: update apt versions in Dockerfile_ODELIA

4aa0a27

Merge pull request #148 from KatherLab/ci/apt-update-1761797383

7c11bb0

chore: Update APT versions in Dockerfile

Merge branch 'main' into dev-144-extend-integration-tests

ed162ba

simplified check for expected output in server and client logs

559b5d9

Merge pull request #146 from KatherLab/dev-144-extend-integration-tests

99266bd

Extend integration tests

Merge branch 'main' into 147-check-if-custom-swarm-controllers-are-needed

86ecf1c

chore: update apt versions in Dockerfile_ODELIA

5902593

Merge pull request #154 from KatherLab/ci/apt-update-1762229379

007cf3a

chore: Update APT versions in Dockerfile

Merge branch 'main' into 147-check-if-custom-swarm-controllers-are-needed

5ae8849

added ubuntu versions known to work

c559adc

git not required for participating

16920f9

example as reminder that site name should end in "_1"

6fb7ab3

consistently end sentences

1533744

added VPN pitfall

63f6c0c

throttle local VPN to 60 Mbit/s, matching production setup more closely

62e2396

changed test site name

adaaebf

Merge pull request #155 from KatherLab/102-implement-further-feedback-on-readme-files-D

76879de

Extend Participant README

use git archive rather than copy source code directory and clean it up afterwards

8021369

extracted copying cached pretrained model weights to separate script

82ebd3a

refactored to split steps

ece267e

meaningful error message

2498b00

removed removed test also from CI

ff8cd31

Merge branch 'main' into 147-check-if-custom-swarm-controllers-are-needed

4378d63

changed capitalization of expected output to what the NVFlare classes print

c61e2ba

print error message after output to keep error visible

1552c75

Merge pull request #157 from KatherLab/156-build-docker-image-using-git-archive

0a2b191

Build Docker image using git archive

Merge branch 'main' into 147-check-if-custom-swarm-controllers-are-needed

1cd80e1

resolved conflict in buildDockerImageAndStartupKits.sh

swarm config file for testing controller changes (should be removed again before merging to main)

20dfaa1

Revert "swarm config file for testing controller changes (should be removed again before merging to main)"

0c04579

This reverts commit e9c69d0.

deboraJ1 and others added 7 commits March 26, 2026 00:14

download team 3 model in model_factory

0ce64d1

download team 3 model in model_factory

f0458aa

updates for log auto collection and enable scp ckpts

b921ad9

Make startup kit live sync injector executable

69e93b4

updates for log auto collection

d9c4cde

updates auto download ckpt

5d4f6ce

add scp installation

ad97a96

github-advanced-security AI found potential problems Mar 27, 2026

Comment thread server_tools/app.py Fixed

Comment thread server_tools/app.py Fixed

Comment thread server_tools/app.py Fixed

Comment thread server_tools/app.py Fixed

Comment thread server_tools/app.py Fixed

Ultimate-Storm and others added 13 commits March 27, 2026 15:00

MediSwarm Live Monitor: installation and usage

4c8dfe1

Delete TESTING_SUMMARY.md

27a6a83

Update README.participant.md: add MODEL_NAME export

4aaf6f2

Add MODEL_NAME variable and update SCRATCHDIR path.

typo in readme

4358e4d

increase time out for model downloading and remove gdown or scp

3ea5fb8

nvflare diff

5b3f1fe

Ultimate-Storm self-assigned this Apr 2, 2026

Ultimate-Storm and others added 2 commits April 2, 2026 21:30

Ultimate-Storm changed the title ~~WIP Include challenge teams1 5~~ Include challenge teams1 5 Apr 2, 2026

Ultimate-Storm and others added 3 commits April 2, 2026 21:53

Fix executable permissions on scripts used by CI

13ea460

Ultimate-Storm merged commit d7a3264 into main Apr 2, 2026
5 checks passed

Ultimate-Storm deleted the include_challenge_teams1-5 branch April 2, 2026 21:08

Ultimate-Storm merged 1154 commits into main from include_challenge_teams1-5
5 participants