AI-powered model auditing agent with multi-agent debate for robust evaluation of machine learning models.
This repository has been tested extensively with Python 3.10.15. Typical install time via uv is less than a minute.
## Installation

Using uv:

```bash
uv sync
uv run python main.py --model resnet50 --dataset CIFAR10 --weights path/to/weights.pth
```

Or using pip:

```bash
pip install -e .
python main.py --model resnet50 --dataset CIFAR10 --weights path/to/weights.pth
```

Optional medical dependencies:

```bash
uv sync --extra medical  # or pip install -e ".[medical]"
```
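To confirm the environment resolved correctly, a quick import check can help. This is a minimal sketch; it assumes PyTorch is among the project's dependencies, which is not stated explicitly above.

```bash
# Sanity check: verify the environment created by `uv sync` can import PyTorch
uv run python -c "import torch; print('torch', torch.__version__)"
```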
## Data and Model Downloads

Download pre-trained ResNet50 models from HuggingFace:

```bash
# Install huggingface_hub if needed
pip install huggingface_hub

# Download all models
python -c "from huggingface_hub import hf_hub_download; [hf_hub_download('lukaskuhndkfz/ModelAuditor', f'{name}_resnet50_1_224.pt', local_dir='models') for name in ['camelyon17', 'chexpert', 'ham10000']]"
```

Or download individually:
```bash
# Camelyon17 (pathology)
huggingface-cli download lukaskuhndkfz/ModelAuditor camelyon17_resnet50_1_224.pt --local-dir models

# CheXpert (chest X-ray)
huggingface-cli download lukaskuhndkfz/ModelAuditor chexpert_resnet50_1_224.pt --local-dir models

# HAM10000 (dermatology)
huggingface-cli download lukaskuhndkfz/ModelAuditor ham10000_resnet50_1_224.pt --local-dir models
```
## HAM10000 Dataset Setup

- Download the dataset from Harvard Dataverse:
  - `HAM10000_images_part_1.zip`
  - `HAM10000_images_part_2.zip`
  - `HAM10000_metadata.tab` (as CSV)
- Extract and organize:

  ```bash
  # Create data directory
  mkdir -p data/ham10000

  # Extract images (both parts) into data/ham10000/
  unzip HAM10000_images_part_1.zip -d data/ham10000/
  unzip HAM10000_images_part_2.zip -d data/ham10000/

  # Copy metadata
  cp HAM10000_metadata.csv data/ham10000/
  ```

- Split into vidir_modern and rosendahl subsets:

  ```bash
  python setup_ham10000.py
  ```

This creates the following structure:
```
data/ham10000/
├── vidir_modern/
│   ├── bkl/ (475 images)
│   └── mel/ (680 images)
└── rosendahl/
    ├── bkl/ (490 images)
    └── mel/ (342 images)
```
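To sanity-check the result against the counts above, the per-class image totals can be listed directly. This is a small sketch that only assumes the directory layout shown above:

```bash
# Count files in each subset/class folder created by setup_ham10000.py
for d in data/ham10000/*/*/; do
  printf '%s: %s images\n' "$d" "$(find "$d" -maxdepth 1 -type f | wc -l)"
done
```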
## Usage

```bash
python main.py --model resnet50 --dataset CIFAR10 --weights models/model.pth

# ISIC skin lesion classification
python main.py --model siim-isic --dataset isic --weights models/isic/model.pth

# HAM10000 dataset
python main.py --model deepderm --dataset ham10000 --weights models/ham10000.pth
```

The auditor can be tested with the DeepDerm classifier, which can be downloaded here. All that is needed is a valid Anthropic API key (see the "Environment Variables" section below) and a test dataset such as HAM10000:
```bash
python main.py --model deepderm --dataset ham10000 --weights models/deepderm_isic.pth
```

Alternatively, use our pre-trained ResNet50 model from HuggingFace (see "Data and Model Downloads" above):
```bash
python main.py --model resnet50 --dataset ham10000 --weights models/ham10000_resnet50_1_224.pt
```

Expected runtime varies with user response speed and the chosen subset size, but a full audit should take less than 10 minutes.
We also provide a small toy model trained on CIFAR10 so the auditor can be evaluated on natural images. Again, all that is needed is a valid Anthropic API key (see the "Environment Variables" section below):
```bash
python main.py --model resnet18 --dataset CIFAR10 --weights examples/cifar10/cifar10.pth
```

As above, expected runtime depends on user response speed and subset size but should stay under 10 minutes in total.
## Command-Line Options

`main.py` accepts the following optional flags (a combined example follows the list):

- `--subset N`: Use N samples for faster evaluation
- `--no-debate`: Disable multi-agent debate
- `--single-agent`: Use a single agent instead of multi-agent debate
- `--device`: Specify device (`cpu`, `cuda`, `mps`)
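For instance, the flags can be combined for a quick, low-cost pass on the CIFAR10 toy model. The specific values here, such as `--subset 100`, are illustrative choices rather than repository defaults:

```bash
# Quick audit: 100 samples, single agent, CPU only
python main.py --model resnet18 --dataset CIFAR10 --weights examples/cifar10/cifar10.pth \
    --subset 100 --single-agent --device cpu
```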
## Environment Variables

Set your API keys:

```bash
export ANTHROPIC_API_KEY="your-key"
export OPENAI_API_KEY="your-key"  # if using non-Anthropic models
```

## Project Structure

- `main.py` - Interactive model auditor with multi-agent debate
- `testbench.py` - Automated evaluation script
- `utils/agent.py` - Multi-agent conversation system
- `architectures/` - Custom model architectures
- `prompts/` - System prompts for different evaluation phases
- `models/` - Pre-trained model weights
- `results/` - Evaluation results and conversation logs