Conversation
- Add HALO (Healthcare generative model using transformers) implementation
- Include example training script with configurable parameters
- Include example generation script for synthetic patient data
- Add canonical SLURM scripts with optimal parameters (80 epochs, batch_size 48, lr 0.0001)
- Register HALO in generators module
- Update HALO_MIMIC3Dataset with latest preprocessing
- Update README with HALO documentation
Remove README.rst changes that only documented CorGAN, not HALO. This PR should focus solely on HALO implementation.
…ls to HALO notebook

Complete Tasks 3-7:
- Configuration panel with demo defaults
- Data upload with validation
- Training logic with checkpoint management
- Generation with CSV conversion
- Results display with quality checks and download

Notebook now has 24 cells with a complete end-to-end workflow.
- Replace `!pip install` with subprocess.run() for error checking
- Show a clear error message if installation fails
- Raise RuntimeError to stop notebook execution on failure

Fixes #1
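A minimal sketch of the change described above (helper name and package spec are illustrative, not the notebook's exact code): run pip through `subprocess.run` so the exit code can be inspected, since a bare `!pip install` in Colab silently ignores failures.

```python
import subprocess
import sys

def pip_install(spec: str) -> None:
    """Install a package via pip, raising if the install fails.

    Sketch of the fix for a bare `!pip install`, whose non-zero exit
    codes a notebook cell does not surface.
    """
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", spec],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface pip's own error output before halting the notebook.
        print("Installation failed:")
        print(result.stderr)
        raise RuntimeError(f"pip install {spec} failed")
```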
- Remove PATIENTS.csv and patient_ids.txt (not used by HALO_MIMC3Dataset)
- Handle Colab file renaming (ADMISSIONS (1).csv -> ADMISSIONS.csv)
- Allow uploading files one at a time with progress tracking
- Check Google Drive for existing files before requesting upload
- Add FORK variable to installation cell for easier testing

Fixes #4, #5, #6
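The Colab renaming fix can be sketched as a small name-normalizing helper (the function name is hypothetical; Colab appends " (1)", " (2)", … to duplicate uploads):

```python
import re

def normalize_colab_name(filename: str) -> str:
    """Strip Colab's duplicate-upload suffix.

    Hypothetical helper: Colab renames re-uploaded files like
    'ADMISSIONS (1).csv'; strip ' (N)' just before the extension
    so downstream code sees the canonical 'ADMISSIONS.csv'.
    """
    return re.sub(r" \(\d+\)(?=\.[^.]+$)", "", filename)
```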
Ensures Colab users always get the latest version from GitHub without using cached packages. Critical for picking up recent fixes like the halo_resources __init__.py. Fixes #18
Use os.path.join() instead of string concatenation to properly handle directory paths with or without trailing slashes. Fixes #19
Fixes #21 The YAML config files in pyhealth/datasets/configs/ were not being included when the package was installed via pip. This caused FileNotFoundError for multiple datasets including HALO, MIMIC3, MIMIC4, EHRShot, COVID-19 CXR, and Medical Transcriptions. Added MANIFEST.in to specify which non-Python files should be included in the package distribution.
Fixes #21 MANIFEST.in only affects sdist source distributions. When installing via `pip install git+https://...` (as in Colab), pip relies on package_data in setup.py to include non-Python files. Added explicit package_data to ensure YAML configs in pyhealth/datasets/configs/ are included in all install paths. Removed MANIFEST.in as it provided no benefit for pip-from-git installs.
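The setup.py change described here might look like the following fragment (a sketch based on the PR text, not the exact diff; only the data-file lines matter):

```python
# setup.py (fragment) — sketch of the package_data fix.
# MANIFEST.in only covers sdists; pip-from-git builds a wheel directly,
# so non-Python files must be declared via package_data to be installed.
from setuptools import setup, find_packages

setup(
    name="pyhealth",
    packages=find_packages(),
    package_data={
        # Ship the YAML dataset configs that were missing from
        # `pip install git+https://...` installs.
        "pyhealth.datasets": ["configs/*.yaml"],
    },
)
```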
Timestamp reflects when notebook was last modified so users can verify they are running the correct version. Reverts dynamic install-time timestamp in favor of this static header approach.
When pkl_data_dir has no trailing slash (e.g. "/path/to/pkl_data"), raw f-string concatenation produced invalid paths like "/path/to/pkl_datacodeToIndex.pkl" instead of "/path/to/pkl_data/codeToIndex.pkl". Replace all pickle save paths with os.path.join(). Also add os.makedirs() so the output directory is created if missing.
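The path fix can be sketched as a small helper (the function name is illustrative; the actual change replaces the f-string concatenations inline):

```python
import os

def pickle_path(pkl_data_dir: str, name: str) -> str:
    """Build a pickle path that works with or without a trailing slash.

    f-string concatenation gives "/path/to/pkl_datacodeToIndex.pkl";
    os.path.join inserts the separator only when needed.
    """
    os.makedirs(pkl_data_dir, exist_ok=True)  # create the output dir if missing
    return os.path.join(pkl_data_dir, name)
```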
…rom test_halo_model
…rd unique_patients in cell 22
jhnwu3
left a comment
I've made some small comments here, but will summarize more on Discord since I think you're actually really close. There are a few quick discussions to have about what I think is arguably more efficient and effective, and, honestly, about what will confuse the users who build on this work later.
    # at least one ICD-9 code). Patients with fewer than 2 qualifying visits
    # are excluded.
    print("Setting HALO generation task...")
    sample_dataset = base_dataset.set_task(halo_generation_mimic3_fn)
Can we use the TaskClass you wrote here for this instead of the function that I can't see anymore?
    with torch.no_grad():
        for val_batch in val_loader:
            visits = val_batch["visits"].to(self.device)
            batch_ehr, batch_mask = self._encode_visits(visits)
I see there's an encode function. This can be prohibitively expensive during training if it's doing some form of tokenization.
I guess the question is: is batch_ehr an embedding here, a tokenization, or the output of a collate function?
    """Forward pass.

    Accepts the padded index tensor produced by NestedSequenceProcessor,
    converts it to HALO multi-hot format, and runs the transformer.
If you need a multi-hot format, we have a multi-hot processor for that. We should probably chat about next steps.
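For context, the conversion under discussion can be sketched as follows (a hypothetical NumPy helper, not PyHealth's multi-hot processor or the model's actual `forward`; it assumes `pad_idx` is reserved for padding and never a real code):

```python
import numpy as np

def indices_to_multihot(indices: np.ndarray, vocab_size: int,
                        pad_idx: int = 0) -> np.ndarray:
    """Convert padded code indices (batch, visits, codes) into
    multi-hot vectors (batch, visits, vocab_size)."""
    batch, visits, _ = indices.shape
    multihot = np.zeros((batch, visits, vocab_size), dtype=np.float32)
    b, v = np.meshgrid(np.arange(batch), np.arange(visits), indexing="ij")
    for c in range(indices.shape[-1]):
        # Set a 1 at each code position; pad entries all land on pad_idx.
        multihot[b, v, indices[..., c]] = 1.0
    multihot[..., pad_idx] = 0.0  # padding carries no code
    return multihot
```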
See notes for #878
Notebook permalink: https://colab.research.google.com/github/jalengg/PyHealth/blob/halo-pr-integration/examples/halo_mimic3_colab.ipynb