This tool generates weak OCR labels per patch image and stores them in `weak_ocr.parquet`, keyed by `patch_id`.
macOS (Homebrew):

```
brew install tesseract
brew install tesseract-lang
```

Ubuntu/Debian:

```
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-bod
```

Notes:
- Tibetan language data (`bod`) is not always installed by default.
- If `bod` is unavailable, the backend automatically falls back to `eng`.
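The fallback logic above can be sketched as a small helper. This is an illustrative function, not code from the repo; in practice the available languages would come from the local Tesseract install (e.g. via `pytesseract.get_languages`):

```python
def pick_lang(available: set, preferred: str = "bod", fallback: str = "eng") -> str:
    """Return the preferred OCR language if its data is installed, else fall back.

    `available` is the set of language codes the local Tesseract reports.
    """
    return preferred if preferred in available else fallback
```

With only English data installed, `pick_lang({"eng"})` returns `"eng"`; once `tesseract-ocr-bod` is present, `pick_lang({"eng", "bod"})` returns `"bod"`.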
python -m pechabridge.cli.weak_ocr_label \
--dataset /path/to/out_dataset \
--meta /path/to/out_dataset/meta/patches.parquet \
--out /path/to/out_dataset/meta/weak_ocr.parquet \
--backend tesseract \
--config configs/weak_ocr.yaml \
--num_workers 8 \
--shard_id 0 --num_shards 1 \
  --resume

Debug dumps:

```
python -m pechabridge.cli.weak_ocr_label ... --debug_dump 20
```

This writes:

```
<dataset>/debug/weak_ocr/*_orig.png
<dataset>/debug/weak_ocr/*_prep.png
<dataset>/debug/weak_ocr/*_text.txt
```
`confidence` is computed from `image_to_data` word confidences:
- `mean(valid_word_confidence) / 100`
- `-1` values are ignored
- if no valid word confidence exists, `confidence` is `NaN`
- a fallback heuristic is still computed and stored in the backend raw payload when raw storage is enabled
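The confidence rule above amounts to a short pure function. A minimal sketch (the function name is illustrative, not from the repo):

```python
import math

def weak_confidence(word_confs: list) -> float:
    """Mean of valid Tesseract word confidences, scaled to [0, 1].

    Tesseract's image_to_data reports -1 for non-word rows; those are
    ignored. If no valid word confidence remains, the result is NaN.
    """
    valid = [c for c in word_confs if c >= 0]
    if not valid:
        return float("nan")
    return sum(valid) / len(valid) / 100.0
```

For example, `weak_confidence([90, -1, 80])` is `0.85`, while `weak_confidence([-1, -1])` is `NaN`.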
- Resume: use `--resume` to skip `patch_id`s already present in the output parquet.
- Overwrite: use `--overwrite` to recompute this shard's patch ids.
- Shard rule: `patch_id % num_shards == shard_id`.
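Combining the shard rule with resume filtering gives a selection step like the following. This is a sketch of the behavior described above, not the repo's actual implementation:

```python
def select_patch_ids(all_ids, done_ids, shard_id: int, num_shards: int,
                     resume: bool = True) -> list:
    """Patch ids this shard should process.

    Keeps ids where patch_id % num_shards == shard_id, then (with --resume)
    drops ids already present in the output parquet.
    """
    ids = [p for p in all_ids if p % num_shards == shard_id]
    if resume:
        done = set(done_ids)
        ids = [p for p in ids if p not in done]
    return ids
```

For instance, shard 1 of 2 over ids 0..9 with id 3 already done would process `[1, 5, 7, 9]`.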
weak_ocr.parquet includes:
- `patch_id`, `doc_id`, `page_id`, `line_id`, `scale_w`
- `text`, `confidence`, `char_count`, `word_count`
- `lang_used`, `backend`
- `preprocess_hash`, `ocr_config_hash`
- `error_code`, `error_msg`
- optional `raw_json` when `output.store_raw=true`
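A typical downstream step is joining the labels back onto the patch metadata on `patch_id`. The sketch below uses small in-memory frames for illustration; real code would call `pd.read_parquet` on `patches.parquet` and `weak_ocr.parquet` (the column values here are made up):

```python
import pandas as pd

# Stand-ins for pd.read_parquet(".../patches.parquet") and
# pd.read_parquet(".../weak_ocr.parquet").
patches = pd.DataFrame({"patch_id": [1, 2], "doc_id": ["d0", "d0"]})
weak = pd.DataFrame({"patch_id": [1], "text": ["bod text"], "confidence": [0.85]})

# Left join keeps every patch; patches without a weak label get NaN fields.
merged = patches.merge(weak, on="patch_id", how="left")
```

A left join is the safe default here: it surfaces patches the labeler skipped or failed on as rows with missing `text`/`confidence`.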
The pipeline calls only the backend interface in:
pechabridge/ocr/backends/base.py
Current VLM placeholder:
pechabridge/ocr/backends/vlm_backend_stub.py
To switch from Tesseract to VLM later:
- Implement the API call logic inside `VLMBackendStub.ocr_image`.
- Keep the return type as `OCRResult`.
- Run the CLI with `--backend vlm` and the corresponding backend config.