
Include challenge teams1 5 #226

Merged
Ultimate-Storm merged 1154 commits into main from include_challenge_teams1-5 on Apr 2, 2026

Conversation

Ultimate-Storm commented Mar 27, 2026

Summary

Integrates five ODELIA challenge models into MediSwarm for federated swarm training, along with infrastructure improvements for testing, deployment, and CI reliability.

Challenge Models Added

| Job | Model Architecture | Directory |
| --- | --- | --- |
| challenge_1DivideAndConquer | ResidualEncoder | application/jobs/challenge_1DivideAndConquer |
| challenge_2BCN_AIM | SwinUNETR | application/jobs/challenge_2BCN_AIM |
| challenge_3agaldran | MViT v2 | application/jobs/challenge_3agaldran |
| challenge_4abmil | CrossModalAttentionABMIL + Swin | application/jobs/challenge_4abmil |
| challenge_5pimed | ResNet18 | application/jobs/challenge_5pimed |

Each challenge job has its own config_fed_client.conf, model code, main.py, data pipeline, and synthetic dataset generator.

Key Changes

Challenge model integration:

  • Each job is a self-contained NVFlare application under application/jobs/
  • Hardcoded MODEL_NAME in each challenge's main.py, because docker.sh exports MODEL_NAME=${MODEL_NAME:-MST} and that value would otherwise silently override the job's own default and select the wrong model
  • Pretrained weights (checkpoint_final.pth for challenge 1, mvit_v2_s-ae3be167.pth for challenge 3) are cached to /MediSwarm/pretrained_weights/ in the Docker image, outside job directories, to prevent NVFlare from bundling large .pth files during job submission

--job flag for docker.sh:

  • Added --job <job_name> parameter to docker.sh (via master_template.yml) for --preflight_check and --local_training modes
  • Defaults to ODELIA_ternary_classification (backward compatible)
  • Allows participants to test individual challenge models locally before joining the swarm
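
The default-and-override behaviour of --job can be illustrated with a small Python sketch. This is not the implementation: the real flag lives in the shell script docker.sh, and the parser and the resolve_job helper below are hypothetical. Only the flag name, the default ODELIA_ternary_classification, and the application/jobs/ layout are taken from this PR.

```python
from __future__ import annotations

import argparse
from pathlib import PurePosixPath

DEFAULT_JOB = "ODELIA_ternary_classification"  # default named in this PR
JOBS_ROOT = PurePosixPath("application/jobs")   # job layout named in this PR


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the --job flag of docker.sh."""
    parser = argparse.ArgumentParser(prog="docker.sh (sketch)")
    parser.add_argument(
        "--job",
        default=DEFAULT_JOB,
        help="job directory name under application/jobs/",
    )
    return parser


def resolve_job(argv: list[str]) -> PurePosixPath:
    """Return the job directory selected by argv, or the default job."""
    args = build_parser().parse_args(argv)
    return JOBS_ROOT / args.job
```

Calling resolve_job([]) yields the default job directory, so existing invocations keep working, while resolve_job(["--job", "challenge_3agaldran"]) selects a challenge model.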

Deployment tooling:

  • New deploy_and_test.sh script automating multi-site Docker image push, startup kit deployment, server/client lifecycle, and job submission
  • New kit_live_sync/ for live sync of startup kits with heartbeat monitoring
  • New DEPLOY_README.md documenting the deployment workflow

CI fixes:

  • Fixed executable permissions on runIntegrationTests.sh and 24 other .sh/.exp scripts (644 → 755)
  • Auto-install gdown into a temporary venv when not found during Docker build (CI runners don’t have it pre-installed)
  • Pushed NVFlare submodule commits (timeout increase, dashboard removal) so CI can fetch the referenced commit

Documentation updates:

  • README.participant.md: Added --job flag usage, dataset validation steps with --log_dataset_details, timezone troubleshooting
  • README.developer.md: Added --job flag docs, challenge jobs table, configurable docker.sh parameters reference

Testing

  • All CI integration tests passing (startup kit generation, standalone training, simulation mode, PoC mode, 3DCNN simulation, license checks, preflight checks, swarm training)
  • Challenge models tested in simulation mode and actual swarm deployment

oleschwen and others added 30 commits October 29, 2025 14:15
chore: Update APT versions in Dockerfile
chore: Update APT versions in Dockerfile
…-on-readme-files-D

Extend Participant README
…it-archive

Build Docker image using git archive
…eded

resolved conflict in
	buildDockerImageAndStartupKits.sh
…emoved again before merging to main)"

This reverts commit e9c69d0.
Comment thread server_tools/app.py Fixed (5 threads)
Ultimate-Storm and others added 13 commits March 27, 2026 15:00
Add MODEL_NAME variable and update SCRATCHDIR path.
NVFlare copies job Python files to a workspace directory at deploy time,
but large .pth weight files are not included in the job bundle. The model
code was resolving checkpoint paths relative to __file__, which pointed
to the workspace copy where the weights don't exist — causing
FileNotFoundError or resource reservation timeouts ("No reserved
resources").

- Add Docker baked-in path fallback to challenge_3agaldran model_factory
  and challenge_1DivideAndConquer model.py
- Update _cacheAndCopyPretrainedModelWeights.sh to download
  checkpoint_final.pth from Google Drive via gdown when not cached locally
- Add deploy_and_test.sh automation script for build/push/deploy/test
- Gitignore deploy_sites.conf (deployment credentials)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
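
The baked-in path fallback described in the commit above could be sketched roughly as follows. The helper name resolve_checkpoint, its parameters, and the search order are assumptions made for illustration; the actual code in model_factory and model.py is not shown in this PR. Only the directory /MediSwarm/pretrained_weights/ comes from the PR description.

```python
from pathlib import Path

# Location of the weights inside the Docker image, as stated in the PR description.
BAKED_IN_DIR = Path("/MediSwarm/pretrained_weights")


def resolve_checkpoint(filename, module_file, extra_dirs=(BAKED_IN_DIR,),
                       is_file=Path.is_file):
    """Return the first existing location of a checkpoint file.

    Hypothetical helper: look next to the calling module first (which is the
    NVFlare workspace copy at runtime), then fall back to the baked-in
    directories. `is_file` is injectable so the lookup can be tested without
    touching the file system.
    """
    candidates = [Path(module_file).parent / filename]
    candidates += [Path(directory) / filename for directory in extra_dirs]
    for candidate in candidates:
        if is_file(candidate):
            return candidate
    searched = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"{filename} not found; searched: {searched}")
```

In a model file this would be called as resolve_checkpoint("checkpoint_final.pth", __file__). Listing every searched location in the error message makes a missing-weights failure easier to diagnose than a bare FileNotFoundError.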
The build script passes the project YAML path into a Docker container
where the absolute host path doesn't exist. Use a relative path and
run from the repo root instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
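
The idea behind this fix can be sketched in a few lines: a path that is only valid on the host is rewritten relative to the repository root, so it resolves identically inside a container that runs from the repo root. The helper name and the example paths below are hypothetical; the build script itself is a shell script and is not reproduced here.

```python
from pathlib import PurePosixPath


def to_repo_relative(project_yaml: str, repo_root: str) -> str:
    """Express an absolute host path relative to the repository root.

    Raises ValueError if project_yaml is not located under repo_root.
    """
    return str(PurePosixPath(project_yaml).relative_to(PurePosixPath(repo_root)))
```

For example, with a hypothetical checkout at /home/ci/MediSwarm, the file /home/ci/MediSwarm/provision/project.yml becomes provision/project.yml, which is valid wherever the repository is mounted.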
NVFlare packages the entire job folder when submitting, so .pth files
inside job dirs caused ~800MB transfers to each client. Now weights
are stored at /MediSwarm/pretrained_weights/ in the Docker image and
model code falls back to that path at runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
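
A pre-submission check along the following lines could catch weight files that end up inside a job folder again. It is only a sketch: the function names and the 10 MiB threshold are assumptions, and nothing like it is part of this PR.

```python
from pathlib import Path

WEIGHT_SUFFIXES = (".pth",)


def scan_job_dir(job_dir):
    """Yield (relative POSIX path, size in bytes) for every file under job_dir."""
    root = Path(job_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            yield path.relative_to(root).as_posix(), path.stat().st_size


def bundled_weights(files, limit_bytes=10 * 1024 * 1024):
    """Return the weight files above limit_bytes that would ship with the job.

    `files` is an iterable of (name, size) pairs, for example the output of
    scan_job_dir(). The threshold is an arbitrary illustrative value.
    """
    return [
        (name, size)
        for name, size in files
        if name.endswith(WEIGHT_SUFFIXES) and size > limit_bytes
    ]
```

Running bundled_weights(scan_job_dir("application/jobs/challenge_1DivideAndConquer")) should return an empty list once the weights live outside the job directory.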
- config_fed_client.conf: fix typo "models.odel.create_model" → "models.model.create_model"
- model.py: update create_model signature to accept factory-pattern args
  (logger, loss_kwargs, env_vars, **kwargs) matching threedcnn_ptl.py caller

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
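
A factory with the signature named in this commit might look like the sketch below. The parameter names logger, loss_kwargs, env_vars, and **kwargs come from the commit message; the None defaults, the placeholder model class, and the logging call are assumptions, since neither model.py nor the caller in threedcnn_ptl.py is shown here.

```python
class PlaceholderModel:
    """Stand-in for the real network; only records its configuration."""

    def __init__(self, loss_kwargs, env_vars):
        self.loss_kwargs = loss_kwargs
        self.env_vars = env_vars


def create_model(logger=None, loss_kwargs=None, env_vars=None, **kwargs):
    """Build a model from factory-pattern arguments.

    Unknown keyword arguments are accepted and ignored so that a shared
    caller can pass the same arguments to every challenge model.
    """
    loss_kwargs = dict(loss_kwargs or {})
    env_vars = dict(env_vars or {})
    if logger is not None and kwargs:
        logger.info("create_model ignoring extra arguments: %s", sorted(kwargs))
    return PlaceholderModel(loss_kwargs, env_vars)
```

Accepting **kwargs keeps the factory tolerant of arguments that only some models need, at the cost of silently ignoring misspelled ones, which is why the sketch logs them.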
The __init__.py imported .resnet, .mst, .swin3D which don't exist in
the models directory, causing ModuleNotFoundError when NVFlare's
PTFileModelPersistor tries to import models.model.create_model.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
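
This class of error can be detected before a job is submitted by comparing the submodules a package's __init__.py imports against the files that are actually present. The helpers below are a hypothetical sketch of such a check and are not part of this PR.

```python
from pathlib import Path


def available_submodules(package_dir):
    """Return the names of modules and subpackages found in package_dir."""
    root = Path(package_dir)
    names = {p.stem for p in root.glob("*.py") if p.stem != "__init__"}
    names |= {p.name for p in root.iterdir() if (p / "__init__.py").is_file()}
    return names


def missing_submodules(imported, available):
    """Return the imported names with no matching module, sorted."""
    return sorted(set(imported) - set(available))
```

With the situation described in the commit above, missing_submodules(["resnet", "mst", "swin3D"], available_submodules("models")) would list all three names, pointing at the stale imports.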
get_unified_model_name() was only checking for "5Pimed" but main.py
passes "challenge_5Pimed". The function now resolves both variants
to the actual architecture name "resnet18" via get_model_config().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
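
The alias resolution described in this commit can be sketched as below. The function names get_unified_model_name and get_model_config and the strings "5Pimed", "challenge_5Pimed", and "resnet18" come from the commit message; the table layout and the lookup logic are assumptions, as the real implementation is not shown.

```python
# Hypothetical configuration table: architecture name -> accepted aliases.
MODEL_CONFIGS = {
    "resnet18": {"aliases": ("5Pimed", "challenge_5Pimed")},
}


def get_model_config(name):
    """Return the config whose architecture name or alias matches name."""
    for architecture, config in MODEL_CONFIGS.items():
        if name == architecture or name in config["aliases"]:
            return {"architecture": architecture, **config}
    raise KeyError(f"unknown model name: {name!r}")


def get_unified_model_name(name):
    """Map any accepted variant of a model name to its architecture name."""
    return get_model_config(name)["architecture"]
```

Keeping the aliases in one table means a new spelling only needs to be added in a single place, and an unknown name fails loudly instead of falling through to a wrong architecture.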
The Docker container sets MODEL_NAME=MST globally via docker.sh.
os.getenv("MODEL_NAME", "challenge_5Pimed") was picking up "MST"
instead of the intended default, causing Resnet assertion failure.
Hardcode the model name since each challenge job knows its own model.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
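
The pitfall is easy to reproduce in isolation. In the sketch below the two function names are hypothetical; the variable name MODEL_NAME and the values "MST" and "challenge_5Pimed" come from the commit message. A default passed to an environment lookup only applies when the variable is unset, so a globally exported value always wins.

```python
import os

JOB_MODEL_NAME = "challenge_5Pimed"  # what this job is meant to train


def model_name_from_env(environ=os.environ):
    """Old behaviour: the default is used only if MODEL_NAME is unset."""
    return environ.get("MODEL_NAME", JOB_MODEL_NAME)


def model_name_hardcoded(environ=os.environ):
    """New behaviour: the job ignores the environment and names its own model."""
    return JOB_MODEL_NAME
```

With the container's environment containing MODEL_NAME=MST, the first function returns "MST" while the second still returns "challenge_5Pimed".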
Ultimate-Storm self-assigned this Apr 2, 2026
Ultimate-Storm and others added 2 commits April 2, 2026 21:30
…pdate docs

- Fix MODEL_NAME env var override bug in challenge_1-4 main.py (Docker env
  sets MODEL_NAME=MST, overriding defaults). Now hardcoded like challenge_5.
- Add --job parameter to docker.sh (master_template.yml) so preflight_check
  and local_training can target any challenge model.
- Create DEPLOY_README.md documenting the deploy_and_test.sh workflow.
- Update README.participant.md: replace MODEL_NAME export with --job examples.
- Update README.developer.md: fix params table, add --job docs, add challenge
  jobs table.
- Update root README.md: add link to DEPLOY_README.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts:
- ODELIA app files (env_config, main, threedcnn_ptl): take main's version
  (dataset validation, log_dataset_details, test split support)
- Dockerfile_ODELIA: take main's pinned apt versions
- odelia_image.version: take main's 1.0.3
- master_template.yml: merge --job flag with --log_dataset_details
- READMEs: merge challenge --job docs with dataset validation steps
- remove_old_odelia_docker_images.sh: take main's clean version

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ultimate-Storm changed the title from "WIP Include challenge teams1 5" to "Include challenge teams1 5" Apr 2, 2026
Ultimate-Storm and others added 3 commits April 2, 2026 21:53
The CI runners don't have gdown installed globally. Instead of
failing, the script now creates a temporary venv and installs gdown
automatically when needed to download the 1DivideAndConquer checkpoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
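
The bootstrap logic described in this commit can be sketched in Python as follows. The real change is in a shell script, so this is an illustration of the approach rather than the code that was merged; the function names are hypothetical, and the lookup and install steps are injectable so that the control flow can be exercised without network access.

```python
import shutil
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def _venv_bin(env_dir: Path) -> Path:
    """Directory that holds executables inside a venv (platform dependent)."""
    return env_dir / ("Scripts" if sys.platform == "win32" else "bin")


def install_gdown(env_dir: Path) -> None:
    """Create a venv at env_dir and pip-install gdown into it (needs network)."""
    venv.EnvBuilder(with_pip=True).create(env_dir)
    python = _venv_bin(env_dir) / "python"
    subprocess.run([str(python), "-m", "pip", "install", "gdown"], check=True)


def ensure_gdown(which=shutil.which, install=install_gdown) -> str:
    """Return a path to gdown, bootstrapping a temporary venv if it is missing."""
    found = which("gdown")
    if found:
        return found
    env_dir = Path(tempfile.mkdtemp(prefix="gdown-venv-"))
    install(env_dir)
    return str(_venv_bin(env_dir) / "gdown")
```

Installing into a throwaway venv avoids modifying the runner's system Python, which matters on shared CI machines.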
runIntegrationTests.sh and several test scripts were tracked as 644
(not executable), causing "Permission denied" failures on CI runners.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit fixed 7 scripts but missed 17 others. This sets
+x on every shell/expect script in the repo so Docker entrypoints and
CI steps (e.g. _list_licenses.sh) no longer fail with "Permission denied".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
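
The 644 to 755 change amounts to setting the three execute bits on every shell and expect script. A minimal sketch, assuming the scripts are identified by the suffixes .sh and .exp as the PR description states (the function names are hypothetical):

```python
import stat
from pathlib import Path

SCRIPT_SUFFIXES = (".sh", ".exp")
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH  # octal 111


def add_exec_bits(mode: int) -> int:
    """Return mode with user, group, and other execute bits set."""
    return mode | EXEC_BITS


def make_scripts_executable(repo_root):
    """Set +x on every .sh/.exp file under repo_root; return the changed paths."""
    changed = []
    for path in sorted(Path(repo_root).rglob("*")):
        if path.is_file() and path.suffix in SCRIPT_SUFFIXES:
            mode = stat.S_IMODE(path.stat().st_mode)
            new_mode = add_exec_bits(mode)
            if new_mode != mode:
                path.chmod(new_mode)
                changed.append(path)
    return changed
```

Because Git tracks the executable bit, the change still has to be committed; on file systems that do not preserve mode bits, git update-index --chmod=+x <file> records it in the index directly.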
Ultimate-Storm merged commit d7a3264 into main on Apr 2, 2026
5 checks passed
Ultimate-Storm deleted the include_challenge_teams1-5 branch on April 2, 2026 21:08
5 participants