IBD Cohort Identification Code To Go with Accompanying Papers and Models
- BMJ Open Gastroenterology. Citation: Stammers M, Gwiggner M, Nouraei R, Metcalf C, Batchelor J. Robust comparative evaluation of 15 natural language processing algorithms to positively identify patients with inflammatory bowel disease from secondary care records. BMJ Open Gastroenterology. 2025 Oct 10;12(1).
- Digestive Diseases and Sciences. Citation: Stammers M, Sartain S, Cummings JF, Kipps C, Nouraei R, Gwiggner M, Metcalf C, Batchelor J. Identification of cohorts with inflammatory bowel disease amidst fragmented clinical databases via machine learning. Digestive Diseases and Sciences. 2025 Oct;70(10):3309-22.
- BMC Gastroenterology. Citation: Stammers M, Ramgopal B, Owusu Nimako A, Vyas A, Nouraei R, Metcalf C, Batchelor J, Shepherd J, Gwiggner M. A foundation systematic review of natural language processing applied to gastroenterology & hepatology. BMC Gastroenterology. 2025 Feb 6;25(1):58.
- The models are contained in this repo and associated repositories as .joblib/.pkl files. They carry the associated disclaimers but come from a trusted source. Use at your own discretion.
- Collection of BERT-based models on HuggingFace: BERT Based Models
- Model demo on HuggingFace: IBD Cohort Identification Demo
- Python Difficulty Level: Fairly Advanced (Not Particularly Recommended for Beginners)
- Primary Code Purpose: Code Your Own Versions. Transparency for paper. Maximising generalisability and replicability.
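Because .joblib and .pkl files can execute arbitrary code when they are unpickled, it may be worth confirming that a downloaded model file is the one you expect before loading it. The following is a minimal sketch using only the standard library. The helper names, the file path and the digest are hypothetical placeholders, and it assumes you have recorded a SHA-256 digest for the file yourself.

```python
import hashlib
import pickle
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_verified_pickle(path: Path, expected_sha256: str):
    """Unpickle `path` only if its digest matches `expected_sha256`."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"Checksum mismatch for {path}: got {actual}")
    with path.open("rb") as handle:
        return pickle.load(handle)


# Hypothetical usage; replace the path and digest with your own values.
# model = load_verified_pickle(Path("models/example_model.pkl"), "<your sha256>")
```

For a .joblib file, `joblib.load(path)` can take the place of the `pickle.load` call once the digest has been checked.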
To run the code you will have to prepare your environment (ideally a poetry environment) and arrange your study_ids and string data into separate columns in a dataframe. I recommend using .py files rather than .ipynb notebooks for this, but the choice is up to you and will to some degree depend upon your level of experience. For a basic primer on using Python and setting it up for the first time: Python Starter Guide
Analysts must prepare the environment appropriately. I have previously written a guide, which I will link into this repo. Alternatively, if you are new to Python and working in a healthcare context, I recommend visiting this basic-to-advanced quick intro: NHS BI Analysts Python for Data Science Intro
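As a rough illustration of the dataframe layout described above, the sketch below builds a two-column pandas dataframe with one identifier column and one free-text column. The column name `text`, the helper name and the example records are assumptions made for illustration; check the pipeline code for the column names it actually expects.

```python
import pandas as pd

# Hypothetical records for illustration only; use your own extracts in practice.
records = [
    {"study_id": "A001", "text": "  Colonoscopy: features of Crohn's disease. "},
    {"study_id": "A002", "text": None},
    {"study_id": "A003", "text": "Histology: normal colonic mucosa."},
]


def prepare_frame(rows):
    """Return a dataframe with string-typed `study_id` and `text` columns."""
    df = pd.DataFrame(rows, columns=["study_id", "text"])
    df["study_id"] = df["study_id"].astype("string")
    # Missing documents become empty strings; surrounding whitespace is trimmed.
    df["text"] = df["text"].astype("string").fillna("").str.strip()
    return df.reset_index(drop=True)


df = prepare_frame(records)
print(df.dtypes)
```

Keeping the identifier as a string rather than an integer avoids losing leading zeros, which is a common problem with study identifiers.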
- Install environments.
The first thing to flag is that this pipeline works best in Linux environments. It does run in Windows but less successfully. All Windows dependencies have been removed to make it interoperable.
The recommendation is to use poetry to install a CUDA-enabled environment; otherwise the pipeline will take a long time to run. This can be achieved as follows:
pip install poetry
cd src
poetry env activate
poetry install --extras "cuda"
This installs all the base packages. However, to complete the process it is easier to use pip to complete the installation. If you find a good way to do it all with poetry please send me a pull request.
pip install -r requirements.txt
- Run Tests.
Before you run the pipeline it is recommended to run the test suite. This can be achieved with the following if you want to see the warnings as well:
python main.py --test --disable-warnings --capture=tee-sys -rw
If all the tests succeed then it is likely the pipeline will run successfully.
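The order suggested here, tests first and the pipeline only if they pass, can be scripted. The following is a minimal sketch using `subprocess`. The two commands are the ones quoted in this README, the helper name `run_in_order` is hypothetical, and it assumes that `main.py --test` exits with a non-zero code when a test fails, which I have not confirmed here.

```python
import subprocess
import sys

# Commands as quoted in this README, run with the current interpreter.
TEST_CMD = [sys.executable, "main.py", "--test", "--disable-warnings",
            "--capture=tee-sys", "-rw"]
PIPELINE_CMD = [sys.executable, "main.py", "--disable-umls"]


def run_in_order(commands):
    """Run each command in turn and stop at the first non-zero exit code.

    Returns 0 if every command succeeded, otherwise the failing exit code.
    """
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            return result.returncode
    return 0


# Hypothetical usage from the directory containing main.py:
# exit_code = run_in_order([TEST_CMD, PIPELINE_CMD])
```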
- Run The Pipeline
Now you are ready to run the full pipeline (provided you have sufficient VRAM and compute available). If not, either shift everything to the CPU or configure accelerate to load balance the training process.
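To decide between GPU and CPU before starting, a quick check along the following lines may help. This is a sketch that assumes the models run on PyTorch, which is an inference from the BERT and accelerate references rather than something stated in this README. The function name and the 8 GiB threshold are arbitrary placeholders, not measured requirements of the pipeline.

```python
import importlib.util


def pick_device(min_free_gib: float = 8.0) -> str:
    """Return "cuda" if a GPU with enough free memory is visible, else "cpu".

    `min_free_gib` is a placeholder threshold, not a requirement of this repo.
    """
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment.
    import torch

    if not torch.cuda.is_available():
        return "cpu"
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return "cuda" if free_bytes / 2**30 >= min_free_gib else "cpu"


print(pick_device())
```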
To run the pipeline run:
python main.py --disable-umls
These models contain several known biases inherent in the training cohort itself. Most notably they underselect IBD diagnoses for female, wealthy and African patients. They are also overfitted to the training data and may underperform in other contexts. Use with caution. If you can improve them for the benefit of the world then please do so but not for profit - that is against the terms of the licence.
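One way to check whether the under-selection described above also appears in your own cohort is to compare recall (sensitivity) across subgroups. The following is a minimal standard-library sketch. The function name, the tuple layout and the example rows are made up for illustration and are not data or code from the study.

```python
from collections import defaultdict


def recall_by_group(rows):
    """Return {group: recall}, where recall = true positives / actual positives.

    Each row is a (group, true_label, predicted_label) tuple with 0/1 labels.
    Groups with no actual positives are omitted.
    """
    positives = defaultdict(int)
    hits = defaultdict(int)
    for group, truth, predicted in rows:
        if truth == 1:
            positives[group] += 1
            hits[group] += int(predicted == 1)
    return {group: hits[group] / positives[group] for group in positives}


# Made-up rows for illustration only.
example = [
    ("female", 1, 1), ("female", 1, 0), ("female", 0, 0),
    ("male", 1, 1), ("male", 1, 1), ("male", 0, 1),
]
print(recall_by_group(example))
```

A markedly lower recall for one group than another would suggest that the model misses true IBD cases more often in that group.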
If you would like to contribute further to this project you can do so by submitting a pull request to this repo. If you remix or fork the project please attribute appropriately. These models should not be used commercially. Obtaining profit by using them is forbidden as per the licence. If they are improved then they must be shared open source with the community.
This project and the associated models are licensed under Attribution-NonCommercial 4.0 International. The copyright holders are Matt Stammers and University Hospital Southampton NHS Foundation Trust.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
No guarantee is given of model performance in any production capacity whatsoever. These models should be used in full accordance with the EU AI Act - Regulation 2024/1689. These are not CE marked medical devices and are suitable at this point only for research and development / experimentation at the user's own discretion. They can be improved, but any improvements should be published and shared openly with the community. UHSFT and the author own the copyright and are choosing to share the models freely under a CC BY-NC 4.0 Licence for the benefit of the wider research community. This does not extend to commercial organisations, who would be breaking copyright law and infringing upon NHS intellectual property if they tried to sell or market these models for profit.
