refactor: updated pyproject toml dependencies, changed readme file name conventions, added new bash scripts #50
nazlicantrn wants to merge 1 commit into EPFLiGHT:experts_new from …
Conversation
Pull request overview
This PR aims to make the OpenAI Batch augmentation workflow fully configurable via env/CLI (no code edits), consolidate “experts” dependencies into the root pyproject.toml, and clean up documentation across the experts pipelines.
Changes:
- Added shell scripts to run the GPT augmentation pipeline for full runs and cost-estimation runs via environment variables.
- Refactored parts of the GPT augmentation batch builder to be env-driven and updated related docs.
- Added an `experts` optional-dependency group in `pyproject.toml` and updated the experts READMEs to install via `pip install -e ".[experts]"` (see the install example just below this list).
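For reference, this is what the documented install command looks like (a minimal sketch, presumably run from the repository root; the quotes around the extras spec are needed in some shells):

```bash
# Editable install of the package plus the new "experts" optional dependencies.
pip install -e ".[experts]"
```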
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/multimeditron/experts/gpt_augmentation_pipeline/submit_batches.py | Removes shebang (execution now assumed via python ...). |
| src/multimeditron/experts/gpt_augmentation_pipeline/scripts/run_estimate.sh | New env-driven script to run a small sample and estimate cost. |
| src/multimeditron/experts/gpt_augmentation_pipeline/scripts/run_all.sh | New env-driven script to run the full pipeline end-to-end. |
| src/multimeditron/experts/gpt_augmentation_pipeline/make_batches.py | Refactors entrypoint to use env-driven settings (currently broken due to missing helpers). |
| src/multimeditron/experts/gpt_augmentation_pipeline/README.md | Documents new scripts + env vars (has a script-name mismatch). |
| src/multimeditron/experts/evaluation_pipeline/readme.md | Updates install instructions and fixes filename references. |
| src/multimeditron/experts/evaluation_pipeline/hard_benchmark_skin_tone_stratified.py | Adds a new SCIN skin-tone stratified hard benchmark script. |
| src/multimeditron/experts/evaluation_pipeline/disease_classification_pipeline/readme.md | Updates install instructions to use the new experts extra. |
| src/multimeditron/experts/dataset_processing/skin_expert/process_isic.py | Removes paraphrasing flags and simplifies ISIC processing flow. |
| src/multimeditron/experts/dataset_processing/readme.md | Updates script names/examples (contains path + command typos/inconsistencies). |
| pyproject.toml | Introduces project.optional-dependencies.experts with pinned versions. |
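As a rough illustration of how the env-driven cost-estimation run from the table above might be invoked (a sketch only: TASK_TYPE, NB_SAMPLES, and BATCHES_DIR are the variables referenced in the review comments below, their values here are made up, and the OPENAI_API_KEY requirement is an assumption for any OpenAI Batch API call):

```bash
# Hypothetical invocation of scripts/run_estimate.sh (see lead-in for assumptions).
export OPENAI_API_KEY=sk-...      # assumed: required for the OpenAI Batch API
TASK_TYPE=skin NB_SAMPLES=100 BATCHES_DIR=batches ./scripts/run_estimate.sh
```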
src/multimeditron/experts/dataset_processing
├── skin/
│   ├── process_skin10.py
│   ├── process_dermnet.py
│   ├── process_fitzpatrick.py
│   └── process_isic4.py
│   └── process_isic.py
├── ophthalmology/
│   ├── process_rfmid2.py
│   ├── process_messidor2.py
The directory structure shown here uses skin/ and ophthalmology/, but the actual directories in this repo are skin_expert/ and ophthalmology_expert/ (e.g., src/multimeditron/experts/dataset_processing/skin_expert). Please update this tree so users can find the scripts at the documented paths.
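For illustration, the documented tree might instead look like this (a sketch that only swaps in the actual directory names; the file lists are copied from the quoted hunk, assuming process_isic.py replaces process_isic4.py as the diff suggests):

```
src/multimeditron/experts/dataset_processing
├── skin_expert/
│   ├── process_skin10.py
│   ├── process_dermnet.py
│   ├── process_fitzpatrick.py
│   └── process_isic.py
├── ophthalmology_expert/
│   ├── process_rfmid2.py
│   ├── process_messidor2.py
```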
**Usage:**
```bash
python split_jsonl_train_val.py \
python train_val_split.py.py \
```
The usage snippet has a typo: python train_val_split.py.py will fail. It should invoke the actual script filename once (e.g., python train_val_split.py ...).
Suggested change:
python train_val_split.py.py \
python train_val_split.py \
python skin/process_isic.py \
  --output_images_root data/isic4/images \
  --output_jsonl data/isic4/manifest.jsonl \
  --paraphrase \
  --output_jsonl data/isic/manifest.jsonl

python paraphrase_jsonl.py \
  --in_jsonl data/isic4/manifest.jsonl \
  --out_jsonl data/isic/manifest_paraphrased.jsonl \
  --seed 42
This example uses python skin/process_isic.py, but the script in this repo is skin_expert/process_isic.py (under src/multimeditron/experts/dataset_processing/skin_expert). Also the paths are inconsistent (--output_images_root data/isic4/... vs --output_jsonl data/isic/..., and the paraphrasing step reads data/isic4/manifest.jsonl even though the previous command writes data/isic/manifest.jsonl). Please align the example commands/paths with the actual script location and produced files. Note: process_isic.py no longer supports --paraphrase/--seed, so the docs should reflect that paraphrasing happens only via paraphrase_jsonl.py.
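A possible aligned version of the example, assuming the intended output root is data/isic and that paraphrasing now happens only via paraphrase_jsonl.py (flag names and the --seed value are taken from the quoted hunk; the consolidated paths are an assumption):

```bash
# Hypothetical aligned example; see lead-in for assumptions.
python skin_expert/process_isic.py \
  --output_images_root data/isic/images \
  --output_jsonl data/isic/manifest.jsonl

# Paraphrasing is now a separate, explicit step.
python paraphrase_jsonl.py \
  --in_jsonl data/isic/manifest.jsonl \
  --out_jsonl data/isic/manifest_paraphrased.jsonl \
  --seed 42
```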
def main() -> None:
    task = _get_env_task()
    nb_samples = _get_env_nb_samples()
    batches_dir = _get_env_batches_dir()
main() calls _get_env_task(), _get_env_nb_samples(), and _get_env_batches_dir(), but these helpers are not defined or imported anywhere in this module. As-is, python make_batches.py will raise NameError and the new run_all.sh / run_estimate.sh scripts will fail. Define these functions (with validation/parsing for TASK_TYPE, NB_SAMPLES, BATCHES_DIR) or switch back to using os.getenv(...)/config.BATCHES_DIR directly in main().
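One way the missing helpers could be filled in (a minimal sketch, not the author's implementation: the helper names match the calls in main(), but the defaults, the allowed task types, and the validation rules are assumptions):

```python
import os
from pathlib import Path

# Assumed set of valid task types; adjust to whatever the pipeline actually supports.
_ALLOWED_TASKS = {"skin", "ophthalmology"}


def _get_env_task() -> str:
    """Read and validate TASK_TYPE from the environment."""
    task = os.getenv("TASK_TYPE", "skin").strip().lower()
    if task not in _ALLOWED_TASKS:
        raise ValueError(f"TASK_TYPE must be one of {sorted(_ALLOWED_TASKS)}, got {task!r}")
    return task


def _get_env_nb_samples() -> int | None:
    """Read NB_SAMPLES; unset or empty means 'use the full dataset'."""
    raw = os.getenv("NB_SAMPLES")
    if not raw:
        return None
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"NB_SAMPLES must be an integer, got {raw!r}") from exc


def _get_env_batches_dir() -> Path:
    """Read BATCHES_DIR, falling back to a local 'batches/' directory."""
    return Path(os.getenv("BATCHES_DIR", "batches"))
```

The alternative mentioned in the comment, reading os.getenv(...) and config.BATCHES_DIR directly inside main(), would avoid the helpers entirely.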
Use the full pipeline script to process the entire dataset:

```bash
python submit_batches.py
```

```bash
# Default runs full skin dataset
./scripts/run_full.sh

# Customize task type
TASK_TYPE=ophthalmology ./scripts/run_full.sh
```

* Uploads each batch file to OpenAI
* Submits Batch API jobs
* Saves returned batch IDs for later retrieval
The README instructs running ./scripts/run_full.sh, but the repo adds scripts/run_all.sh (no run_full.sh). Update the documentation to reference the actual script name so the full-dataset instructions work as written.
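The full-dataset instructions could then read along these lines (a sketch assuming run_all.sh honours the same TASK_TYPE override the README already shows):

```bash
# Default runs the full skin dataset
./scripts/run_all.sh

# Customize the task type
TASK_TYPE=ophthalmology ./scripts/run_all.sh
```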
PR summary
This PR makes the OpenAI Batch augmentation pipeline fully env/CLI-driven, removing the need for code edits. It also consolidates dependencies in the root pyproject.toml and cleans up readme documentation inconsistencies across the repo.