MattStammers/An_Open_Source_Collection_Of_IBD_Cohort_Identification_Models

IBD_NLP_Cohort_Identification_Models_IC-IBD_Part_2

An_Open_Source_Collection_Of_IBD_Cohort_Identification_Models

By Matt Stammers

22/06/2025

IBD Cohort Identification Code To Go with Accompanying Papers and Models

Papers

  1. BMJ Open Gastroenterology. Citation: Stammers M, Gwiggner M, Nouraei R, Metcalf C, Batchelor J. Robust comparative evaluation of 15 natural language processing algorithms to positively identify patients with inflammatory bowel disease from secondary care records. BMJ Open Gastroenterology. 2025 Oct 10;12(1).
  2. Digestive Diseases and Sciences. Citation: Stammers M, Sartain S, Cummings JF, Kipps C, Nouraei R, Gwiggner M, Metcalf C, Batchelor J. Identification of cohorts with inflammatory bowel disease amidst fragmented clinical databases via machine learning. Digestive Diseases and Sciences. 2025 Oct;70(10):3309-22.
  3. BMC Gastroenterology. Citation: Stammers M, Ramgopal B, Owusu Nimako A, Vyas A, Nouraei R, Metcalf C, Batchelor J, Shepherd J, Gwiggner M. A foundation systematic review of natural language processing applied to gastroenterology & hepatology. BMC Gastroenterology. 2025 Feb 6;25(1):58.

Models

  1. Contained in this repo and associated repositories as .joblib/.pkl files, with associated disclaimers. They come from a trusted source, but use them at your own discretion.
  2. Collection of BERT-based models on HuggingFace: BERT Based Models
  3. Model demo on HuggingFace: IBD Cohort Identification Demo

Ratings/Features

  • Python Difficulty Level: Fairly Advanced (Not Particularly Recommended for Beginners)
  • Primary Code Purpose: Code Your Own Versions. Transparency for paper. Maximising generalisability and replicability.

How to Use Yourself

To run the code you will have to prepare your environment (ideally with poetry) and arrange your study_id values and string data into separate columns in a dataframe. I recommend using .py files rather than .ipynb notebooks for this, but the choice is yours and will depend to some degree on your level of experience. For a basic primer on using Python and setting it up for the first time, see: Python Starter Guide
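As a rough illustration of the data layout described above, the sketch below cleans a few made-up records and writes them out as two columns. The column names `study_id` and `text` and the helper functions are assumptions for illustration only; this README does not state the names the pipeline expects, so check the source before relying on them. It uses only the standard library, and the resulting CSV can be loaded into a dataframe with `pandas.read_csv`.

```python
import csv
import io

# Hypothetical column names; the pipeline's real expected names are not
# stated in this README, so check the source before relying on them.
COLUMNS = ("study_id", "text")


def build_rows(records):
    """Return one cleaned (study_id, text) pair per non-empty record."""
    rows = []
    for study_id, text in records:
        # Collapse runs of whitespace so each document sits on one line.
        cleaned = " ".join(str(text).split())
        if cleaned:
            rows.append((str(study_id), cleaned))
    return rows


def to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example records, not real patient data.
records = [
    ("A001", "Colonoscopy:  findings consistent\nwith Crohn's disease."),
    ("A002", "   "),
    ("A003", "No evidence of colitis."),
]
rows = build_rows(records)
csv_text = to_csv(rows)
print(csv_text)
```

Dropping empty documents up front, as done here, is a design choice rather than a requirement of the pipeline; you may prefer to keep them and handle them downstream.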

Analysts must prepare the environment appropriately. I have written a guide before which I will link into this repo. Alternatively, if you are new to Python and working in a healthcare context, I recommend this basic-to-advanced quick intro: NHS BI Analysts Python for Data Science Intro

  1. Install environments.

The first thing to flag is that this pipeline works best in Linux environments. It does run in Windows but less successfully. All Windows dependencies have been removed to make it interoperable.

The recommendation is to use poetry to install a CUDA-enabled environment; otherwise the pipeline will take a long time to run. This can be achieved as follows:

pip install poetry
cd src
poetry env activate
poetry install --extras "cuda"

This installs all the base packages. However, it is easier to finish the installation with pip. If you find a good way to do it all with poetry, please send me a pull request.

pip install -r requirements.txt
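
Once installation finishes, it can be worth confirming that the environment can actually see a GPU. The sketch below is a minimal check under the assumption that the pipeline runs on PyTorch, which this README does not state explicitly; `detect_device` is a hypothetical helper name and not part of this repository.

```python
import importlib.util


def detect_device():
    """Report 'cuda', 'cpu', or 'torch-missing' for the current environment."""
    # find_spec avoids an ImportError when PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return "torch-missing"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = detect_device()
print(f"Detected device: {device}")
```

If this reports `cpu` on a machine with a GPU, the CUDA extras probably did not install as intended.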

  2. Run Tests.

Before you run the pipeline it is recommended to run the test suite. This can be achieved with the following if you want to see the warnings as well:

python main.py --test --disable-warnings --capture=tee-sys -rw

If all the tests succeed then it is likely the pipeline will run successfully.

  3. Run The Pipeline

Now you are ready to run the full pipeline, provided you have sufficient VRAM and compute available. If not, either shift everything to the CPU or configure accelerate to load-balance the training process.

To run the pipeline run:

python main.py --disable-umls
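
One possible way to shift everything to the CPU, sketched here as an assumption rather than the author's method, is to hide the GPUs from the process by setting `CUDA_VISIBLE_DEVICES` to an empty string before launching the command above. Whether the pipeline falls back to the CPU cleanly in that case is not confirmed by this README. The helper names are hypothetical.

```python
import os
import subprocess
import sys


def cpu_only_env(base_env=None):
    """Copy an environment and hide all CUDA devices from child processes."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty CUDA_VISIBLE_DEVICES makes CUDA report no available GPUs.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def pipeline_command():
    """Build the pipeline command from the README, using the current interpreter."""
    return [sys.executable, "main.py", "--disable-umls"]


def run_pipeline_on_cpu():
    """Launch the pipeline with GPUs hidden; call from the folder holding main.py."""
    return subprocess.run(pipeline_command(), env=cpu_only_env(), check=True)


# Show what would be run without actually launching the pipeline.
print(" ".join(pipeline_command()))
```

Calling `run_pipeline_on_cpu()` starts the run. Expect it to be considerably slower than a GPU run.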

Warnings 👀

These models contain several known biases inherent in the training cohort itself. Most notably they underselect IBD diagnoses for female, wealthy and African patients. They are also overfitted to the training data and may underperform in other contexts. Use with caution. If you can improve them for the benefit of the world then please do so but not for profit - that is against the terms of the licence.

Contributing

If you would like to contribute further to this project you can do so by submitting a pull request to this repo. If you remix or fork the project, please attribute appropriately. These models should not be used commercially. Obtaining profit by using them is forbidden as per the licence. If they are improved then they must be shared open source with the community.

Licence

This project and the associated models are licensed under Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). The copyright holders are Matt Stammers and University Hospital Southampton NHS Foundation Trust.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Legal

No guarantee is given of model performance in any production capacity whatsoever. These models should be used in full accordance with the EU AI Act (Regulation 2024/1689). They are not CE-marked medical devices and are suitable at this point only for research and development or experimentation, at the user's own discretion. They can be improved, but any improvements should be published and shared openly with the community. UHSFT and the author own the copyright and are choosing to share the models freely under a CC BY-NC 4.0 licence for the benefit of the wider research community. Commercial organisations that try to sell or market these models for profit are breaking copyright law and infringing upon NHS intellectual property.
