Conversation
- Add HALO (Healthcare generative model using transformers) implementation
- Include example training script with configurable parameters
- Include example generation script for synthetic patient data
- Add canonical SLURM scripts with optimal parameters (80 epochs, batch_size 48, lr 0.0001)
- Register HALO in generators module
- Update HALO_MIMIC3Dataset with latest preprocessing
- Update README with HALO documentation
Remove README.rst changes that only documented CorGAN, not HALO. This PR should focus solely on HALO implementation.
…ls to HALO notebook

Complete Tasks 3-7:
- Configuration panel with demo defaults
- Data upload with validation
- Training logic with checkpoint management
- Generation with CSV conversion
- Results display with quality checks and download

Notebook now has 24 cells with a complete end-to-end workflow.
- Replace `!pip install` with subprocess.run() for error checking
- Show a clear error message if installation fails
- Raise RuntimeError to stop notebook execution on failure

Fixes #1
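A minimal sketch of the change described above (helper name and package spec are illustrative, not the notebook's exact code): run pip through `subprocess.run` so the exit code can be inspected, since a bare `!pip install` in Colab silently ignores failures.

```python
import subprocess
import sys

def pip_install(spec: str) -> None:
    """Install a package via pip, raising if the install fails.

    Sketch of the fix for a bare `!pip install`, whose non-zero exit
    codes a notebook cell does not surface.
    """
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", spec],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface pip's own error output before halting the notebook.
        print("Installation failed:")
        print(result.stderr)
        raise RuntimeError(f"pip install {spec} failed")
```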
- Remove PATIENTS.csv and patient_ids.txt (not used by HALO_MIMC3Dataset)
- Handle Colab file renaming (ADMISSIONS (1).csv -> ADMISSIONS.csv)
- Allow uploading files one at a time with progress tracking
- Check Google Drive for existing files before requesting upload
- Add FORK variable to installation cell for easier testing

Fixes #4, #5, #6
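The Colab renaming fix can be sketched as a small name-normalizing helper (the function name is hypothetical; Colab appends " (1)", " (2)", … to duplicate uploads):

```python
import re

def normalize_colab_name(filename: str) -> str:
    """Strip Colab's duplicate-upload suffix.

    Hypothetical helper: Colab renames re-uploaded files like
    'ADMISSIONS (1).csv'; strip ' (N)' just before the extension
    so downstream code sees the canonical 'ADMISSIONS.csv'.
    """
    return re.sub(r" \(\d+\)(?=\.[^.]+$)", "", filename)
```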
Ensures Colab users always get the latest version from GitHub without using cached packages. Critical for picking up recent fixes like the halo_resources __init__.py. Fixes #18
Use os.path.join() instead of string concatenation to properly handle directory paths with or without trailing slashes. Fixes #19
Fixes #21 The YAML config files in pyhealth/datasets/configs/ were not being included when the package was installed via pip. This caused FileNotFoundError for multiple datasets including HALO, MIMIC3, MIMIC4, EHRShot, COVID-19 CXR, and Medical Transcriptions. Added MANIFEST.in to specify which non-Python files should be included in the package distribution.
Fixes #21 MANIFEST.in only affects sdist source distributions. When installing via `pip install git+https://...` (as in Colab), pip relies on package_data in setup.py to include non-Python files. Added explicit package_data to ensure YAML configs in pyhealth/datasets/configs/ are included in all install paths. Removed MANIFEST.in as it provided no benefit for pip-from-git installs.
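The setup.py change described here might look like the following fragment (a sketch based on the PR text, not the exact diff; only the data-file lines matter):

```python
# setup.py (fragment) — sketch of the package_data fix.
# MANIFEST.in only covers sdists; pip-from-git builds a wheel directly,
# so non-Python files must be declared via package_data to be installed.
from setuptools import setup, find_packages

setup(
    name="pyhealth",
    packages=find_packages(),
    package_data={
        # Ship the YAML dataset configs that were missing from
        # `pip install git+https://...` installs.
        "pyhealth.datasets": ["configs/*.yaml"],
    },
)
```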
Timestamp reflects when notebook was last modified so users can verify they are running the correct version. Reverts dynamic install-time timestamp in favor of this static header approach.
When pkl_data_dir has no trailing slash (e.g. "/path/to/pkl_data"), raw f-string concatenation produced invalid paths like "/path/to/pkl_datacodeToIndex.pkl" instead of "/path/to/pkl_data/codeToIndex.pkl". Replace all pickle save paths with os.path.join(). Also add os.makedirs() so the output directory is created if missing.
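The path fix can be sketched as a small helper (the function name is illustrative; the actual change replaces the f-string concatenations inline):

```python
import os

def pickle_path(pkl_data_dir: str, name: str) -> str:
    """Build a pickle path that works with or without a trailing slash.

    f-string concatenation gives "/path/to/pkl_datacodeToIndex.pkl";
    os.path.join inserts the separator only when needed.
    """
    os.makedirs(pkl_data_dir, exist_ok=True)  # create the output dir if missing
    return os.path.join(pkl_data_dir, name)
```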
…rom test_halo_model
…rd unique_patients in cell 22
jhnwu3
left a comment
I've made some small comments here, but will summarize more on Discord since I think you're actually really close. There are a few quick discussions to have about what I think is arguably more efficient and effective, and, honestly, about what will confuse the users who build on this work later.
    # at least one ICD-9 code). Patients with fewer than 2 qualifying visits
    # are excluded.
    print("Setting HALO generation task...")
    sample_dataset = base_dataset.set_task(halo_generation_mimic3_fn)
Can we use the TaskClass you wrote here for this instead of the function that I can't see anymore?
    with torch.no_grad():
        for val_batch in val_loader:
            visits = val_batch["visits"].to(self.device)
            batch_ehr, batch_mask = self._encode_visits(visits)
I see there's an encode function. This can be prohibitively expensive during training if it's doing some form of tokenization.
I guess the question is: is batch_ehr an embedding here, a tokenization, or the output of a collate function?
    """Forward pass.

    Accepts the padded index tensor produced by NestedSequenceProcessor,
    converts it to HALO multi-hot format, and runs the transformer.
If you need a multi-hot format, we have a multi-hot processor for that. We should probably chat about next steps.
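For context, the conversion under discussion can be sketched as follows (a hypothetical NumPy helper, not PyHealth's multi-hot processor or the model's actual `forward`; it assumes `pad_idx` is reserved for padding and never a real code):

```python
import numpy as np

def indices_to_multihot(indices: np.ndarray, vocab_size: int,
                        pad_idx: int = 0) -> np.ndarray:
    """Convert padded code indices (batch, visits, codes) into
    multi-hot vectors (batch, visits, vocab_size)."""
    batch, visits, _ = indices.shape
    multihot = np.zeros((batch, visits, vocab_size), dtype=np.float32)
    b, v = np.meshgrid(np.arange(batch), np.arange(visits), indexing="ij")
    for c in range(indices.shape[-1]):
        # Set a 1 at each code position; pad entries all land on pad_idx.
        multihot[b, v, indices[..., c]] = 1.0
    multihot[..., pad_idx] = 0.0  # padding carries no code
    return multihot
```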
See notes for #878
Notebook permalink: https://colab.research.google.com/github/jalengg/PyHealth/blob/halo-pr-integration/examples/halo_mimic3_colab.ipynb