Remove Headers, Footers, and Footnotes from scanned book images using DocLayout-YOLO.
HFF Remover is a Python package that uses the DocLayout-YOLO model to automatically detect and mask headers, footers, and footnotes in scanned book images. It's designed to process large batches of images (200K+) efficiently using GPU acceleration.
- Automatic Detection: Uses DocLayout-YOLO to detect document layout elements
- Batch Processing: Process thousands of images with GPU acceleration
- Resumable: Checkpoint support for interruption recovery
- Mixed Formats: Supports JPEG, PNG, TIFF, BMP, and WebP
- Configurable: Adjustable confidence thresholds, padding, and output quality
- CLI & API: Both command-line and Python API interfaces
# Clone the repository
git clone https://github.com/OpenPecha/HFF-Remover.git
cd HFF-Remover
# Install in development mode
pip install -e ".[dev]"
- Python >= 3.8
- CUDA-capable GPU (recommended for large batches)
- PyTorch >= 2.0.0
# Process all images in a directory
hff-remover process /path/to/images --output /path/to/output
# Process with GPU and custom settings
hff-remover process /path/to/images --output /path/to/output \
--device cuda \
--batch-size 32 \
--confidence 0.5 \
--padding 5
# Resume interrupted processing
hff-remover process /path/to/images --output /path/to/output --resume
# Process a single image
hff-remover single input.jpg output.jpg
# Just detect without masking (view coordinates)
hff-remover detect input.jpg
from hff_remover import HFFDetector, HFFProcessor, BatchProcessor
from hff_remover.utils import load_image, save_image
# Initialize components
detector = HFFDetector(device="cuda", confidence_threshold=0.5)
processor = HFFProcessor(padding=5)
# Process a single image
image = load_image("input.jpg")
detections = detector.detect(image)
result = processor.mask_regions(image, detections)
save_image(result, "output.jpg")
# Batch processing
batch_processor = BatchProcessor(
detector=detector,
processor=processor,
batch_size=32,
)
stats = batch_processor.process_directory(
input_dir="/path/to/images",
output_dir="/path/to/output",
resume=True,
)
print(f"Processed {stats.processed_images} images")
print(f"Speed: {stats.images_per_second:.1f} images/sec")
Process all images in a directory.
| Option | Default | Description |
|---|---|---|
| --output, -o | Required | Output directory |
| --device | cuda | Device for inference (cuda or cpu) |
| --batch-size | 8 | Images per batch |
| --confidence | 0.5 | Minimum detection confidence |
| --padding | 0 | Extra pixels around detected regions |
| --image-size | 1024 | Model input size |
| --quality | 95 | Output image quality (0-100) |
| --resume | false | Resume from checkpoint |
| --no-recursive | false | Don't search subdirectories |
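The `--padding` option grows each detected region before it is masked. The snippet below is a minimal sketch of that idea, assuming boxes are `(x1, y1, x2, y2)` pixel tuples and that a padded box is clamped to the image bounds. The `pad_box` helper is hypothetical and is not part of the package's API.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels; assumed layout


def pad_box(box: Box, padding: int, width: int, height: int) -> Box:
    """Grow a box by `padding` pixels on every side, clamped to the image."""
    x1, y1, x2, y2 = box
    return (
        max(0, x1 - padding),
        max(0, y1 - padding),
        min(width, x2 + padding),
        min(height, y2 + padding),
    )


# A header region near the top edge of a 1000x1400 page, padded by 5 pixels.
padded = pad_box((100, 2, 900, 60), padding=5, width=1000, height=1400)
print(padded)  # (95, 0, 905, 65)
```

A few pixels of padding can help cover stray ink at the edge of a detected region, at the cost of masking slightly more of the page.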
Process a single image.
hff-remover single input.jpg output.jpg --device cuda
Detect HFF regions without masking.
hff-remover detect input.jpg --output detections.json
DocLayout-YOLO detects several classes of document elements. HFF Remover targets the following:
| Class | Name | Description |
|---|---|---|
| 2 | abandon | Headers, footers, page numbers |
| 7 | table_footnote | Footnotes in tables |
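Selecting the targeted classes from raw detections could look like the sketch below. It assumes each detection is a plain dictionary with `class_id`, `confidence`, and `box` keys; those field names and the `select_hff` helper are hypothetical, chosen for illustration, and may not match what `HFFDetector.detect` actually returns. The class IDs come from the table above, and the 0.5 default mirrors the `--confidence` default.

```python
from typing import Dict, List

# Class IDs taken from the table above.
HFF_CLASS_IDS = {2: "abandon", 7: "table_footnote"}


def select_hff(detections: List[Dict], min_confidence: float = 0.5) -> List[Dict]:
    """Keep detections whose class is targeted and whose score passes the threshold."""
    return [
        d for d in detections
        if d["class_id"] in HFF_CLASS_IDS and d["confidence"] >= min_confidence
    ]


# Hypothetical detector output for one page.
detections = [
    {"class_id": 2, "confidence": 0.91, "box": (100, 2, 900, 60)},       # targeted, confident
    {"class_id": 1, "confidence": 0.97, "box": (100, 80, 900, 1200)},    # not a targeted class
    {"class_id": 7, "confidence": 0.32, "box": (100, 1250, 900, 1300)},  # targeted, below threshold
]
kept = select_hff(detections)
print([HFF_CLASS_IDS[d["class_id"]] for d in kept])  # ['abandon']
```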
On a modern GPU (NVIDIA V100/T4):
- Speed: 50-100 images/second
- 200K images: ~30-60 minutes
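The batch estimate follows directly from the throughput range. As a quick check, the arithmetic below assumes a steady rate with no startup or I/O stalls; `estimated_minutes` is an illustrative helper, not part of the package.

```python
def estimated_minutes(num_images: int, images_per_second: float) -> float:
    """Wall-clock minutes to process `num_images` at a steady throughput."""
    return num_images / images_per_second / 60


for rate in (50, 100):
    print(f"{rate} images/sec -> {estimated_minutes(200_000, rate):.0f} minutes")
# 50 images/sec -> 67 minutes
# 100 images/sec -> 33 minutes
```

Actual run time depends on image size, storage speed, and batch size, so treat these figures as rough bounds.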
For best performance:
- Use SSD storage for faster I/O
- Increase batch size based on GPU memory
- Use multiple I/O workers (--io-workers)
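The benefit of multiple I/O workers is that file reads overlap instead of running one after another, which keeps the GPU fed. The sketch below shows the general pattern with a thread pool from the standard library. How HFF Remover implements `--io-workers` internally is not described here, so `read_files` is a hypothetical stand-in rather than the package's code.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, List


def read_files(paths: Iterable[Path], io_workers: int = 8) -> List[bytes]:
    """Read files concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=io_workers) as pool:
        return list(pool.map(Path.read_bytes, paths))


# Demo with small temporary files standing in for page images.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(4):
        p = Path(tmp) / f"page_{i}.bin"
        p.write_bytes(bytes([i]) * 16)
        paths.append(p)
    blobs = read_files(paths, io_workers=2)

print([len(b) for b in blobs])  # [16, 16, 16, 16]
```

Threads suit this workload because reading from disk releases the interpreter lock while waiting on the operating system.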
For processing large batches on cloud GPUs:
# AWS/GCP with NVIDIA GPU
hff-remover process /data/images --output /data/output \
--device cuda \
--batch-size 64 \
--io-workers 8 \
--checkpoint-interval 1000
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=hff_remover
MIT License - see LICENSE for details.
- DocLayout-YOLO for the document layout model
- OpenPecha for supporting this project