Merged
chore: Update APT versions in Dockerfile
Extend integration tests
chore: Update APT versions in Dockerfile
…-on-readme-files-D Extend Participant README
…it-archive Build Docker image using git archive
…eded resolved conflict in buildDockerImageAndStartupKits.sh
…gain before merging to main)
…emoved again before merging to main)" This reverts commit e9c69d0.
Add MODEL_NAME variable and update SCRATCHDIR path.
NVFlare copies job Python files to a workspace directory at deploy time,
but large .pth weight files are not included in the job bundle. The model
code was resolving checkpoint paths relative to __file__, which pointed
to the workspace copy where the weights don't exist — causing
FileNotFoundError or resource reservation timeouts ("No reserved
resources").
- Add Docker baked-in path fallback to challenge_3agaldran model_factory
and challenge_1DivideAndConquer model.py
- Update _cacheAndCopyPretrainedModelWeights.sh to download
checkpoint_final.pth from Google Drive via gdown when not cached locally
- Add deploy_and_test.sh automation script for build/push/deploy/test
- Gitignore deploy_sites.conf (deployment credentials)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
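The fallback described above can be sketched as follows. This is illustrative only: the function name, `BAKED_IN_DIR` constant, and error message are assumptions, not the exact MediSwarm code.

```python
import os

# Hypothetical sketch of the checkpoint-path fallback.
BAKED_IN_DIR = "/MediSwarm/pretrained_weights"  # path baked into the Docker image


def resolve_checkpoint(filename: str) -> str:
    """Prefer a checkpoint next to this file (a dev checkout); fall back to
    the weights directory baked into the Docker image at build time."""
    local_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), filename)
    if os.path.exists(local_path):
        return local_path
    fallback = os.path.join(BAKED_IN_DIR, filename)
    if os.path.exists(fallback):
        return fallback
    raise FileNotFoundError(
        f"{filename} found neither next to the model code nor in {BAKED_IN_DIR}"
    )
```

Because NVFlare copies the job folder (but not the weights) to the workspace, only the second branch can succeed at deploy time.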
The build script passes the project YAML path into a Docker container where the absolute host path doesn't exist. Use a relative path and run from the repo root instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NVFlare packages the entire job folder when submitting, so .pth files inside job dirs caused ~800MB transfers to each client. Now weights are stored at /MediSwarm/pretrained_weights/ in the Docker image and model code falls back to that path at runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- config_fed_client.conf: fix typo "models.odel.create_model" → "models.model.create_model"
- model.py: update create_model signature to accept factory-pattern args (logger, loss_kwargs, env_vars, **kwargs) matching the threedcnn_ptl.py caller

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The __init__.py imported .resnet, .mst, .swin3D which don't exist in the models directory, causing ModuleNotFoundError when NVFlare's PTFileModelPersistor tries to import models.model.create_model. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
get_unified_model_name() was only checking for "5Pimed" but main.py passes "challenge_5Pimed". The function now resolves both variants to the actual architecture name "resnet18" via get_model_config(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
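A minimal sketch of this normalization, assuming a prefix-stripping lookup; the `MODEL_CONFIGS` table stands in for whatever `get_model_config()` actually consults:

```python
# Illustrative config table; the real get_model_config() lookup may differ.
MODEL_CONFIGS = {"5Pimed": {"architecture": "resnet18"}}


def get_unified_model_name(model_name: str) -> str:
    # Strip the optional "challenge_" prefix so both variants hit the same entry.
    key = model_name.removeprefix("challenge_")
    config = MODEL_CONFIGS.get(key)
    if config is None:
        raise KeyError(f"unknown model name: {model_name}")
    return config["architecture"]
```

With this shape, both `"5Pimed"` and `"challenge_5Pimed"` resolve to `"resnet18"`.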
The Docker container sets MODEL_NAME=MST globally via docker.sh.
os.getenv("MODEL_NAME", "challenge_5Pimed") was picking up "MST"
instead of the intended default, causing Resnet assertion failure.
Hardcode the model name since each challenge job knows its own model.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
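The pitfall is easy to reproduce: `os.getenv` only falls back to its default when the variable is entirely unset, so a container-wide export always wins over the per-job default.

```python
import os

# Simulate the global export from docker.sh.
os.environ["MODEL_NAME"] = "MST"

# The default is ignored because the variable is set.
picked = os.getenv("MODEL_NAME", "challenge_5Pimed")
assert picked == "MST"  # not the intended "challenge_5Pimed" default

# The fix applied here: hardcode the name, since each job knows its own model.
MODEL_NAME = "challenge_5Pimed"
```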
…pdate docs

- Fix MODEL_NAME env var override bug in challenge_1-4 main.py (Docker env sets MODEL_NAME=MST, overriding defaults). Now hardcoded like challenge_5.
- Add --job parameter to docker.sh (master_template.yml) so preflight_check and local_training can target any challenge model.
- Create DEPLOY_README.md documenting the deploy_and_test.sh workflow.
- Update README.participant.md: replace MODEL_NAME export with --job examples.
- Update README.developer.md: fix params table, add --job docs, add challenge jobs table.
- Update root README.md: add link to DEPLOY_README.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts:

- ODELIA app files (env_config, main, threedcnn_ptl): take main's version (dataset validation, log_dataset_details, test split support)
- Dockerfile_ODELIA: take main's pinned apt versions
- odelia_image.version: take main's 1.0.3
- master_template.yml: merge --job flag with --log_dataset_details
- READMEs: merge challenge --job docs with dataset validation steps
- remove_old_odelia_docker_images.sh: take main's clean version
The CI runners don't have gdown installed globally. Instead of failing, the script now creates a temporary venv and installs gdown automatically when needed to download the 1DivideAndConquer checkpoint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
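A sketch of such a fallback, with assumed variable names; the real script's wiring may differ.

```shell
# If gdown is not on PATH, install it into a throwaway venv and use that copy.
ensure_gdown() {
    if command -v gdown >/dev/null 2>&1; then
        GDOWN=gdown
    else
        VENV_DIR="$(mktemp -d)/gdown-venv"
        python3 -m venv "$VENV_DIR"
        "$VENV_DIR/bin/pip" install --quiet gdown
        GDOWN="$VENV_DIR/bin/gdown"
    fi
}

# usage (URL elided): ensure_gdown && "$GDOWN" --fuzzy "<drive-url>" -O checkpoint_final.pth
```

The venv keeps the CI runner's global Python environment untouched.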
runIntegrationTests.sh and several test scripts were tracked as 644 (not executable), causing "Permission denied" failures on CI runners. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous commit fixed 7 scripts but missed 17 others. This sets +x on every shell/expect script in the repo so Docker entrypoints and CI steps (e.g. _list_licenses.sh) no longer fail with "Permission denied". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
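For scripts tracked by Git, `chmod +x` alone is not enough: the mode bit must also be recorded in the index, which is what this kind of fix amounts to. A small self-contained demonstration (paths illustrative):

```shell
# Demonstrate recording the executable bit in Git's index.
repo="$(mktemp -d)"
cd "$repo" && git init -q .
printf '#!/bin/sh\necho ok\n' > demo.sh
git add demo.sh

# Set the filesystem bit AND record mode 100755 in the index.
chmod +x demo.sh
git update-index --chmod=+x demo.sh
git ls-files -s demo.sh
```

Applied over `find . -name '*.sh' -o -name '*.exp'`, this turns every 644 script into 755 so entrypoints and CI steps can execute them.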
Summary
Integrates five ODELIA challenge models into MediSwarm for federated swarm training, along with infrastructure improvements for testing, deployment, and CI reliability.
Challenge Models Added
- challenge_1DivideAndConquer: application/jobs/challenge_1DivideAndConquer
- challenge_2BCN_AIM: application/jobs/challenge_2BCN_AIM
- challenge_3agaldran: application/jobs/challenge_3agaldran
- challenge_4abmil: application/jobs/challenge_4abmil
- challenge_5pimed: application/jobs/challenge_5pimed

Each challenge job has its own config_fed_client.conf, model code, main.py, data pipeline, and synthetic dataset generator.

Key Changes
Challenge model integration:

- Five new jobs under application/jobs/, one per challenge
- Hardcoded MODEL_NAME in each challenge's main.py to prevent the MODEL_NAME=${MODEL_NAME:-MST} env var override from docker.sh silently selecting the wrong model
- Pretrained weights (checkpoint_final.pth for challenge 1, mvit_v2_s-ae3be167.pth for challenge 3) are cached to /MediSwarm/pretrained_weights/ in the Docker image, outside job directories, to prevent NVFlare from bundling large .pth files during job submission
- New --job flag for docker.sh: a --job <job_name> parameter (via master_template.yml) for the --preflight_check and --local_training modes; the default job remains ODELIA_ternary_classification (backward compatible)

Deployment tooling:

- deploy_and_test.sh: script automating multi-site Docker image push, startup kit deployment, server/client lifecycle, and job submission
- kit_live_sync/: live sync of startup kits with heartbeat monitoring
- DEPLOY_README.md: documents the deployment workflow

CI fixes:

- Made runIntegrationTests.sh and 24 other .sh/.exp scripts executable (644 → 755)
- Install gdown into a temporary venv when it is not found during the Docker build (CI runners don't have it pre-installed)

Documentation updates:

- README.participant.md: added --job flag usage, dataset validation steps with --log_dataset_details, timezone troubleshooting
- README.developer.md: added --job flag docs, challenge jobs table, configurable docker.sh parameters reference

Testing