
refactor: updated pyproject toml dependencies, changed readme file na… #50

Open

nazlicantrn wants to merge 1 commit into EPFLiGHT:experts_new from nazlicanturan:experts_new

Conversation

@nazlicantrn

PR summary

This PR makes the OpenAI Batch augmentation pipeline fully env/CLI-driven, removing the need for code edits. It also consolidates dependencies in the root pyproject.toml and cleans up readme documentation inconsistencies across the repo.
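For illustration, an invocation under the new setup might look like this (the script names and environment variables are taken from this PR; the concrete values and defaults shown are assumptions):

```bash
# Cost-estimation pass on a small sample (NB_SAMPLES is assumed to cap the sample size)
TASK_TYPE=skin NB_SAMPLES=100 ./scripts/run_estimate.sh

# Full end-to-end run, writing batch files to a custom directory
TASK_TYPE=skin BATCHES_DIR=batches/skin ./scripts/run_all.sh
```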


Copilot AI left a comment


Pull request overview

This PR aims to make the OpenAI Batch augmentation workflow fully configurable via env/CLI (no code edits), consolidate “experts” dependencies into the root pyproject.toml, and clean up documentation across the experts pipelines.

Changes:

  • Added shell scripts to run the GPT augmentation pipeline for full runs and cost-estimation runs via environment variables.
  • Refactored parts of the GPT augmentation batch builder to be env-driven and updated related docs.
  • Added an experts optional-dependency group in pyproject.toml and updated the experts READMEs to install via `pip install -e ".[experts]"` (a rough sketch of the extra follows below).
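As a sketch of that extra (the package names and pins below are placeholders, not the ones committed in this PR), the group would be declared in pyproject.toml roughly as:

```toml
[project.optional-dependencies]
# Placeholder entries; the PR pins specific versions for the experts pipelines.
experts = [
    "openai>=1.0",
    "pandas>=2.0",
]
```

It is then installed from the repo root with `pip install -e ".[experts]"`.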

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/multimeditron/experts/gpt_augmentation_pipeline/submit_batches.py | Removes the shebang (execution is now assumed via `python ...`). |
| src/multimeditron/experts/gpt_augmentation_pipeline/scripts/run_estimate.sh | New env-driven script to run a small sample and estimate cost. |
| src/multimeditron/experts/gpt_augmentation_pipeline/scripts/run_all.sh | New env-driven script to run the full pipeline end-to-end. |
| src/multimeditron/experts/gpt_augmentation_pipeline/make_batches.py | Refactors the entrypoint to use env-driven settings (currently broken due to missing helpers). |
| src/multimeditron/experts/gpt_augmentation_pipeline/README.md | Documents the new scripts and env vars (has a script-name mismatch). |
| src/multimeditron/experts/evaluation_pipeline/readme.md | Updates install instructions and fixes filename references. |
| src/multimeditron/experts/evaluation_pipeline/hard_benchmark_skin_tone_stratified.py | Adds a new SCIN skin-tone stratified hard benchmark script. |
| src/multimeditron/experts/evaluation_pipeline/disease_classification_pipeline/readme.md | Updates install instructions to use the new `experts` extra. |
| src/multimeditron/experts/dataset_processing/skin_expert/process_isic.py | Removes paraphrasing flags and simplifies the ISIC processing flow. |
| src/multimeditron/experts/dataset_processing/readme.md | Updates script names and examples (contains path and command typos/inconsistencies). |
| pyproject.toml | Introduces `project.optional-dependencies.experts` with pinned versions. |


Comment on lines 8 to 16
src/multimeditron/experts/dataset_processing
├── skin/
│ ├── process_skin10.py
│ ├── process_dermnet.py
│ ├── process_fitzpatrick.py
│ └── process_isic4.py
│ └── process_isic.py
├── ophthalmology/
│ ├── process_rfmid2.py
│ ├── process_messidor2.py

Copilot AI Mar 3, 2026


The directory structure shown here uses skin/ and ophthalmology/, but the actual directories in this repo are skin_expert/ and ophthalmology_expert/ (e.g., src/multimeditron/experts/dataset_processing/skin_expert). Please update this tree so users can find the scripts at the documented paths.

**Usage:**
```bash
python split_jsonl_train_val.py \
python train_val_split.py.py \
```

Copilot AI Mar 3, 2026


The usage snippet has a typo: python train_val_split.py.py will fail. It should invoke the actual script filename once (e.g., python train_val_split.py ...).

Suggested change: `python train_val_split.py.py \` → `python train_val_split.py \`

Comment on lines +87 to 94
python skin/process_isic.py \
--output_images_root data/isic4/images \
--output_jsonl data/isic4/manifest.jsonl \
--paraphrase \
--output_jsonl data/isic/manifest.jsonl

python paraphrase_jsonl.py \
--in_jsonl data/isic4/manifest.jsonl \
--out_jsonl data/isic/manifest_paraphrased.jsonl \
--seed 42

Copilot AI Mar 3, 2026


This example uses python skin/process_isic.py, but the script in this repo is skin_expert/process_isic.py (under src/multimeditron/experts/dataset_processing/skin_expert). Also the paths are inconsistent (--output_images_root data/isic4/... vs --output_jsonl data/isic/..., and the paraphrasing step reads data/isic4/manifest.jsonl even though the previous command writes data/isic/manifest.jsonl). Please align the example commands/paths with the actual script location and produced files. Note: process_isic.py no longer supports --paraphrase/--seed, so the docs should reflect that paraphrasing happens only via paraphrase_jsonl.py.
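For reference, a corrected sequence could look like the following; only flags already shown in this diff are used, and the unified data/isic/ output paths are an assumption:

```bash
# Process ISIC; the paraphrasing flags were removed from process_isic.py in this PR
python skin_expert/process_isic.py \
  --output_images_root data/isic/images \
  --output_jsonl data/isic/manifest.jsonl

# Paraphrase in a separate step
python paraphrase_jsonl.py \
  --in_jsonl data/isic/manifest.jsonl \
  --out_jsonl data/isic/manifest_paraphrased.jsonl \
  --seed 42
```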

Comment on lines +254 to +257
def main() -> None:
    task = _get_env_task()
    nb_samples = _get_env_nb_samples()
    batches_dir = _get_env_batches_dir()

Copilot AI Mar 3, 2026


main() calls _get_env_task(), _get_env_nb_samples(), and _get_env_batches_dir(), but these helpers are not defined or imported anywhere in this module. As-is, python make_batches.py will raise NameError and the new run_all.sh / run_estimate.sh scripts will fail. Define these functions (with validation/parsing for TASK_TYPE, NB_SAMPLES, BATCHES_DIR) or switch back to using os.getenv(...)/config.BATCHES_DIR directly in main().
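One possible shape for the missing helpers (the defaults and the set of accepted task types below are assumptions, not taken from this PR):

```python
import os
from pathlib import Path

# Hypothetical set of supported tasks; only "skin" and "ophthalmology" appear in this PR's docs.
_VALID_TASKS = {"skin", "ophthalmology"}


def _get_env_task() -> str:
    """Read and validate TASK_TYPE (assumed default: skin)."""
    task = os.getenv("TASK_TYPE", "skin").strip().lower()
    if task not in _VALID_TASKS:
        raise ValueError(f"TASK_TYPE must be one of {sorted(_VALID_TASKS)}, got {task!r}")
    return task


def _get_env_nb_samples() -> int | None:
    """Parse NB_SAMPLES as an int; unset or empty is treated here as 'use the full dataset'."""
    raw = os.getenv("NB_SAMPLES", "").strip()
    if not raw:
        return None
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"NB_SAMPLES must be an integer, got {raw!r}") from exc


def _get_env_batches_dir() -> Path:
    """Resolve BATCHES_DIR (assumed default: ./batches) and create it if needed."""
    batches_dir = Path(os.getenv("BATCHES_DIR", "batches"))
    batches_dir.mkdir(parents=True, exist_ok=True)
    return batches_dir
```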

Comment on lines +94 to +101
Use the full pipeline script to process the entire dataset:
```bash
python submit_batches.py
```

* Uploads each batch file to OpenAI
* Submits Batch API jobs
* Saves returned batch IDs for later retrieval

```bash
# Default runs full skin dataset
./scripts/run_full.sh

# Customize task type
TASK_TYPE=ophthalmology ./scripts/run_full.sh
```

Copilot AI Mar 3, 2026


The README instructs running ./scripts/run_full.sh, but the repo adds scripts/run_all.sh (no run_full.sh). Update the documentation to reference the actual script name so the full-dataset instructions work as written.
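Assuming the committed script name is kept, the snippet would read:

```bash
# Default runs the full skin dataset
./scripts/run_all.sh

# Customize task type
TASK_TYPE=ophthalmology ./scripts/run_all.sh
```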
