AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison
Xi Jiang, Yue Guo, Jian Li, Yong Liu, Bin-Bin Gao, Hanqiu Deng, Jun Liu, Heng Zhao, Chengjie Wang, Feng Zheng
Southern University of Science and Technology | Macau University of Science and Technology | Tencent YouTu Lab | A*STAR
Multimodal Large Language Models (MLLMs) have achieved impressive success in natural visual understanding, yet they consistently underperform in industrial anomaly detection (IAD). This is because the general web data on which MLLMs are mostly trained differs significantly from industrial images. Moreover, MLLMs encode each image independently and can only compare images in the language space, making them insensitive to the subtle visual differences that are key to IAD. To tackle these issues, we present AD-Copilot, an interactive MLLM specialized for IAD via visual in-context comparison. We first design a novel data curation pipeline that mines inspection knowledge from sparsely labeled industrial images, yielding Chat-AD, a large-scale multimodal dataset (620k+ samples, 327 categories). AD-Copilot incorporates a novel Comparison Encoder that applies cross-attention between paired image features to enhance fine-grained multi-image perception, and is trained with a multi-stage strategy that injects domain knowledge and gradually strengthens IAD skills. We also introduce MMAD-BBox, an extended benchmark for anomaly localization with bounding-box evaluation.
AD-Copilot achieves 82.3% accuracy on the MMAD benchmark, outperforming all proprietary and open-source models (including GPT-4o and Gemini 1.5 Pro) without any data leakage, and surpasses ordinary human performance on several IAD tasks.
- [2026-04] Online demo available: HuggingFace Space
- [2026-03] Paper released on arXiv: arXiv:2603.13779
- [2026-03] Model weights released on HuggingFace
- [Coming Soon] Chat-AD dataset and MMAD-BBox benchmark
The Comparison Encoder performs cross-attention between paired image features to generate comparison tokens, enabling visual in-context comparison at the encoding stage. This design preserves original image representations while producing compact comparison tokens for downstream reasoning.
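Conceptually, this can be sketched as a small set of learnable query tokens cross-attending to the concatenated features of the reference/test pair. The module and argument names below (`ComparisonEncoderSketch`, `num_comparison_tokens`) are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class ComparisonEncoderSketch(nn.Module):
    """Minimal sketch of cross-attention comparison between paired images.

    Hypothetical names and hyperparameters; the paper's actual architecture
    (heads, token count, normalization) may differ.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4, num_comparison_tokens: int = 16):
        super().__init__()
        # Learnable queries that will become compact comparison tokens.
        self.queries = nn.Parameter(torch.randn(num_comparison_tokens, dim) * 0.02)
        # Cross-attention: queries attend to the concatenated pair features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_feats: torch.Tensor, test_feats: torch.Tensor) -> torch.Tensor:
        # ref_feats / test_feats: (B, N, dim) patch features from a vision encoder.
        pair = torch.cat([ref_feats, test_feats], dim=1)             # (B, 2N, dim)
        q = self.queries.unsqueeze(0).expand(pair.size(0), -1, -1)   # (B, T, dim)
        tokens, _ = self.cross_attn(q, pair, pair)                   # (B, T, dim)
        # Original image representations are untouched; only compact
        # comparison tokens are produced for downstream reasoning.
        return self.norm(tokens)
```

Because the queries attend jointly over both images, each output token can encode a difference between the pair, rather than a property of either image alone.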
| Model | Size | Backbone | Description | Link |
|---|---|---|---|---|
| AD-Copilot | 7B | Qwen2.5-VL-7B | Base model for anomaly detection | jiang-cc/AD-Copilot |
| AD-Copilot-Thinking | 7B | Qwen2.5-VL-7B | Thinking variant with chain-of-thought reasoning | jiang-cc/AD-Copilot-Thinking |
| Model | Scale | Anomaly Disc. | Defect Cls. | Defect Loc. | Defect Desc. | Defect Ana. | Object Cls. | Object Ana. | Average |
|---|---|---|---|---|---|---|---|---|---|
| Human (expert) | - | 95.24 | 75.00 | 92.31 | 83.33 | 94.20 | 86.11 | 80.37 | 86.65 |
| GPT-4o | - | 68.63 | 65.80 | 55.62 | 73.21 | 83.41 | 94.98 | 82.80 | 74.92 |
| Qwen2.5-VL | 7B | 71.10 | 56.02 | 60.69 | 64.13 | 78.26 | 91.49 | 83.67 | 72.19 |
| AD-Copilot | 7B | 73.64 | 67.89 | 64.08 | 80.60 | 85.91 | 91.06 | 87.78 | 78.71 |
| AD-Copilot-Thinking | 7B | 73.95 | 74.29 | 76.40 | 84.92 | 86.93 | 91.86 | 87.67 | 82.29 |
| Model | Scale | mIoU (%) | ACC@IoU 0.1 | ACC@IoU 0.3 | ACC@IoU 0.5 |
|---|---|---|---|---|---|
| Qwen2.5-VL | 7B | 10.47 | 28.39 | 18.83 | 13.39 |
| AD-Copilot | 7B | 24.46 | 55.66 | 43.77 | 34.78 |
| AD-Copilot-Thinking | 7B | 25.30 | 55.22 | 44.76 | 35.88 |
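The metrics above can be computed from per-sample IoUs; a minimal sketch assuming axis-aligned boxes in `(x1, y1, x2, y2)` pixel coordinates (the helper names are ours, not from the benchmark code):

```python
def box_iou(a, b):
    """IoU between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(ious, thresh):
    """ACC@IoU: fraction of predictions whose IoU clears the threshold."""
    return sum(i >= thresh for i in ious) / len(ious)
```

mIoU is then simply the mean of the per-sample IoUs, while ACC@IoU 0.1/0.3/0.5 applies increasingly strict thresholds to the same values.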
# Clone the repository
git clone https://github.com/jam-cc/AD-Copilot.git
cd AD-Copilot
# Create conda environment
conda create -n ad-copilot python=3.10 -y
conda activate ad-copilot
# Install dependencies
pip install -r requirements.txt

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
model_path = "jiang-cc/AD-Copilot"
processor = AutoProcessor.from_pretrained(
model_path,
min_pixels=64 * 28 * 28,
max_pixels=1280 * 28 * 28,
trust_remote_code=True,
)
model = AutoModelForImageTextToText.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
trust_remote_code=True,
).eval()
# Load images
good_image = Image.open("path/to/good_image.png")
test_image = Image.open("path/to/test_image.png")
# Build messages
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": good_image},
{"type": "image", "image": test_image},
{"type": "text", "text": "The first image is a normal sample. "
"Is there any anomaly in the second image? A. Yes B. No. "
"Please answer the letter only."},
],
}
]
# Run inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text], images=image_inputs, videos=video_inputs,
padding=True, return_tensors="pt",
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
generated_ids_trimmed = [
out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
output = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True,
clean_up_tokenization_spaces=False,
)[0]
print(output)

# Basic inference
python scripts/inference.py \
--model_path jiang-cc/AD-Copilot \
--good_image path/to/good.png \
--bad_image path/to/test.png
# With custom prompt
python scripts/inference.py \
--model_path jiang-cc/AD-Copilot \
--good_image path/to/good.png \
--bad_image path/to/test.png \
--prompt "The first image is a normal sample. Please describe the anomaly in the second image." \
--max_new_tokens 256
# Benchmark mode (latency + throughput)
python scripts/inference.py \
--model_path jiang-cc/AD-Copilot \
--good_image path/to/good.png \
--bad_image path/to/test.png \
--benchmark

| Task | Example Prompt |
|---|---|
| Anomaly Discrimination | The first image is a normal sample. Is there any anomaly in the second image? A. Yes B. No. |
| Defect Classification | The first image is a normal sample. What is the type of defect? A. Contamination B. Broken C. Scratch D. No defect. |
| Defect Description | The first image is a normal sample. Please describe the anomaly in the second image in detail. |
| Defect Localization | The first image is a normal sample. Please locate the defects within the second image with bounding box in JSON format. |
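For the localization prompt, the model's response must be parsed back into boxes. Below is a tolerant parser assuming a Qwen2.5-VL-style JSON list with a `bbox_2d` key; the exact schema AD-Copilot emits may differ, so treat this as a sketch:

```python
import json
import re

def parse_bboxes(text: str):
    """Extract (x1, y1, x2, y2) boxes from a model response.

    Assumes a JSON list like [{"bbox_2d": [x1, y1, x2, y2], "label": "..."}]
    (the Qwen2.5-VL convention); adapt the key if the model uses another schema.
    """
    match = re.search(r"\[.*\]", text, re.DOTALL)  # tolerate surrounding prose or fences
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [tuple(item["bbox_2d"]) for item in items if "bbox_2d" in item]
```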
# Run evaluation on MMAD benchmark
python scripts/evaluate_mmad.py \
--model_path jiang-cc/AD-Copilot \
--data_root /path/to/MMAD/dataset \
--output_dir results/

See scripts/ for more details.
Key dependencies (see requirements.txt for full list):
| Package | Version |
|---|---|
| torch | >= 2.4.0 |
| transformers | == 4.57.3 |
| flash-attn | >= 2.7.0 (recommended) |
| accelerate | >= 0.30.0 |
| qwen_vl_utils | latest |
Note:
`flash-attn` significantly speeds up inference but requires CUDA development tools to compile. If it is unavailable, the model falls back to SDPA attention automatically.
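If you prefer to pick the attention backend explicitly instead of relying on the automatic fallback, a small import probe works; `pick_attn_implementation` is our helper name, and its result is meant to be passed as `attn_implementation=` to `from_pretrained`:

```python
def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash-attn imports cleanly, else "sdpa"."""
    try:
        import flash_attn  # noqa: F401 -- availability probe only
        return "flash_attention_2"
    except ImportError:
        return "sdpa"
```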
Try AD-Copilot online: HuggingFace Space
The demo supports:
- Comparison-based anomaly detection (reference + test image)
- Single-image tasks (object counting, OCR)
- Automatic bounding box visualization
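Outside the demo, predicted boxes can also be visualized locally with Pillow. This is an illustrative sketch, not the demo's own rendering code:

```python
from PIL import Image, ImageDraw

def draw_bboxes(image: Image.Image, boxes, color="red", width=3):
    """Draw (x1, y1, x2, y2) boxes on a copy of the image and return it."""
    out = image.copy()  # leave the input image untouched
    draw = ImageDraw.Draw(out)
    for x1, y1, x2, y2 in boxes:
        draw.rectangle([x1, y1, x2, y2], outline=color, width=width)
    return out
```

Feed it the boxes returned by the model (after parsing the JSON response) together with the test image to get an annotated copy you can save or display.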
This project is released under the Apache 2.0 License.
If you find this work useful, please cite our paper:
@article{jiang2026adcopilot,
title = {AD-Copilot: A Vision-Language Assistant for Industrial Anomaly
Detection via Visual In-context Comparison},
author = {Jiang, Xi and Guo, Yue and Li, Jian and Liu, Yong and
Gao, Bin-Bin and Deng, Hanqiu and Liu, Jun and Zhao, Heng
and Wang, Chengjie and Zheng, Feng},
journal = {arXiv preprint arXiv:2603.13779},
year = {2026}
}

AD-Copilot is built upon Qwen2.5-VL. We thank the Qwen team for their excellent foundation model.

