PechaBridge is a workflow for Tibetan pecha document understanding. It covers layout detection, synthetic data generation, SBB data ingestion, OCR/VLM-assisted parsing, diffusion-based texture augmentation, and retrieval-ready patch pipelines.
The project combines:
- synthetic YOLO dataset generation for Tibetan/number classes,
- training and evaluation of detection models,
- large-scale processing of SBB page images,
- optional VLM backends for layout extraction,
- SDXL/SD2.1 + ControlNet + LoRA texture adaptation,
- line sub-patch dataset generation with Option-A neighborhood metadata,
- weak OCR labeling + robust MNN mining for cross-page positives,
- ViT/DINOv2 retrieval training (mp-InfoNCE) and FAISS-based evaluation/search,
- line-level dual vision-text encoder training (CLIP-style) on OCR manifests,
- Donut/TroCR OCR training on OpenPecha/BDRC-style line manifests with optional BDRC preprocessing.
The primary entrypoint for end-to-end usage is the Workbench UI (ui_workbench.py).
- Synthetic multi-class dataset generation: Creates YOLO-ready pages for Tibetan number words, Tibetan text blocks, and Chinese number words.
- OCR-ready target export: Optionally saves rendered OCR targets with deterministic line linearization and optional OCR crop export by label.
- Detection training and inference: Provides Ultralytics YOLO training, validation, and inference workflows for local data and SBB pages.
- Pseudo-labeling and rule-based filtering: Supports VLM-assisted layout extraction plus post-filtering before annotation review.
- Donut-style OCR workflow (Label 1): Runs generation, manifest preparation, tokenizer handling, and Vision Transformer encoder + autoregressive decoder training.
- Standalone DONUT/TroCR OCR training: Trains OCR directly on OpenPecha/BDRC line manifests (`train-donut-ocr`) with `none|pb|bdrc` image preprocessing and CER evaluation.
- Diffusion texture adaptation: Includes SDXL/SD2.1 + ControlNet augmentation and optional LoRA integration for more realistic page textures.
- Patch retrieval dataset generation: Generates line sub-patches (`datasets/text_patches`) with parquet metadata for retrieval training.
- Weak supervision for retrieval: Supports weak OCR labels and robust MNN mining for cross-page positives.
- Retrieval encoder training + eval: Trains ViT/DINOv2 patch encoders with mp-InfoNCE and exports FAISS-ready embeddings plus cross-page evaluation.
- Dual vision-text encoder (CLIP-style) training: Trains DINOv2 + text encoder (e.g. ByT5) on line image/text manifests (`line_clip`) for text-to-line and line-to-text retrieval.
- Retrieval encoder preparation (legacy/base): Includes unpaired image/text encoder training as a base for broader Tibetan retrieval experiments.
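Deterministic line linearization here just means a reproducible reading order for detected boxes. A minimal sketch of such an ordering pass (the function name and the `row_tolerance` heuristic are illustrative assumptions, not the project's actual implementation):

```python
def linearize_lines(boxes, row_tolerance=0.5):
    """Order (x, y, w, h) text boxes deterministically: group boxes into
    rows by vertical-center proximity, then read rows top-to-bottom and
    boxes within a row left-to-right."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2):
        cy = box[1] + box[3] / 2
        for row in rows:
            ref = row[0]  # first box of the row anchors its vertical band
            if abs(cy - (ref[1] + ref[3] / 2)) <= row_tolerance * ref[3]:
                row.append(box)
                break
        else:
            rows.append([box])  # box starts a new row
    return [sorted(row, key=lambda b: b[0]) for row in rows]
```

Because ties are resolved purely by coordinates, re-running the export yields the same OCR target order every time.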
The figure below shows an example layout analysis result for a Tibetan pecha page from the Staatsbibliothek zu Berlin (SBB). Detected layout regions are overlaid on the original page image and illustrate the intended detection quality for real-world scans.
- Build a robust pipeline for Tibetan page layout analysis that works with limited labeled data.
- Improve model quality through synthetic data and realistic texture transfer from real scans.
- Support scalable ingestion and weak supervision on large historical collections (for example SBB PPNs).
- Prepare retrieval-ready representations (image and text encoders) for future Tibetan n-gram search.
- Keep all major workflows reproducible in both UI and CLI.
- Data Foundation: Synthetic generation, SBB download pipeline, and dataset QA/export workflows.
- Detection and Parsing: YOLO training/inference plus optional VLM-assisted layout parsing and pseudo-labeling.
- Realism and Domain Adaptation: Diffusion + LoRA texture workflows to bridge synthetic-to-real domain gaps.
- Retrieval Readiness: Train unpaired image/text encoders and establish schemas/pipelines for retrieval indexing.
- Retrieval System: Dual-encoder alignment, ANN indexing, provenance-aware search results, and iterative evaluation.
```bash
pip install -r requirements.txt
```

`requirements.txt` is now the unified dependency file for:
- Workbench UI
- VLM backends
- Diffusion + LoRA workflows
- Retrieval encoder training
Legacy files `requirements-ui.txt`, `requirements-vlm.txt`, and `requirements-lora.txt` remain as compatibility wrappers.
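A pip requirements wrapper needs nothing more than an include directive; the legacy files presumably reduce to something like this sketch (not the verbatim file contents):

```
# requirements-ui.txt (compatibility wrapper)
-r requirements.txt
```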
- CLI command reference and end-to-end examples: README_CLI.md
- Pseudo-labeling and Label Studio workflow: README_PSEUDO_LABELING_LABEL_STUDIO.md
- Patch dataset generation (YOLO textbox -> lines -> sub-patches): docs/dataset_generation.md
- Robust MNN mining for cross-page positives: docs/mnn_mining.md
- Retrieval training with mp-InfoNCE (MNN/OCR weak positives): docs/retrieval_mpnce_training.md
- DONUT/TroCR OCR training (OpenPecha/BDRC manifests, CER, checkpoints): README_DONUT_OCR.md
- Line-CLIP dual vision-text encoder training (DINOv2 + text encoder): README_LINE_CLIP_DUAL_ENCODER.md
- Weak OCR labeling for patch datasets: docs/weak_ocr.md
- Diffusion + LoRA details: docs/texture_augmentation.md
- Retrieval roadmap: docs/tibetan_ngram_retrieval_plan.md
- Chinese number corpus note: data/corpora/Chinese Number Words/README.md
```bash
python ui_workbench.py
```

Optional runtime flags via environment variables:
```bash
export UI_HOST=127.0.0.1   # use 0.0.0.0 for remote server binding
export UI_PORT=7860
export UI_SHARE=false      # set true only if you explicitly want a public Gradio link
python ui_workbench.py
```

If the Workbench runs on a remote host, keep `UI_SHARE=false` and use SSH forwarding:
```bash
ssh -L 7860:127.0.0.1:7860 <user>@<server>
```

Then open http://127.0.0.1:7860 on your laptop.
- Synthetic Data / PPN Downloader: build or ingest page images.
- Ultralytics Training + Model Inference: train and validate layout detection (YOLO).
- Workbench Preview Tabs: inspect detected text blocks, line segmentation, and hierarchy overlays on real pecha pages.
- Label Studio Export / pseudo-label workflow: generate reviewable labels when needed.
- `gen-patches` (CLI): build `datasets/text_patches` from page images using YOLO textbox detection + classical line segmentation.
- `weak-ocr-label` (CLI, optional): create weak OCR labels on patch crops.
- `mine-mnn-pairs` (CLI): mine robust cross-page positives for retrieval training.
- `train-text-hierarchy-vit` (CLI): train patch retrieval encoder with mp-InfoNCE using `MNN`, `OCR`, or `both`.
- `train-text-hierarchy-vit --train-mode line_clip` (CLI): train line-level dual vision-text encoder on OCR manifests.
- `train-donut-ocr` (CLI): train generative OCR model (TroCR/Donut-style VisionEncoderDecoder) on line manifests.
- `eval-faiss-crosspage` and `faiss-text-hierarchy-search` (CLI): evaluate and inspect retrieval behavior.
The project includes a unified CLI entrypoint:
```bash
python cli.py -h
```

Key commands:
```bash
# Texture LoRA dataset prep
python cli.py prepare-texture-lora-dataset --input_dir ./sbb_images --output_dir ./datasets/texture-lora-dataset

# Train texture LoRA (SDXL or SD2.1 via --model_family)
python cli.py train-texture-lora --dataset_dir ./datasets/texture-lora-dataset --output_dir ./models/texture-lora-sdxl

# Texture augmentation inference
python cli.py texture-augment --input_dir ./datasets/tibetan-yolo-ui/train/images --output_dir ./datasets/tibetan-yolo-ui-textured

# Train image encoder (self-supervised)
python cli.py train-image-encoder --input_dir ./sbb_images --output_dir ./models/image-encoder

# Train text encoder (unsupervised, Unicode-normalized)
python cli.py train-text-encoder --input_dir ./data/corpora --output_dir ./models/text-encoder
```
```bash
# Generate patch retrieval dataset (YOLO textbox -> lines -> multi-scale patches)
python cli.py gen-patches \
  --model ./models/layoutModels/layout_model.pt \
  --input-dir ./sbb_images \
  --output-dir ./datasets/text_patches \
  --no-samples 20 \
  --debug-dump 5
```
```bash
# Generate weak OCR labels for patch crops (optional retrieval weak positives)
python cli.py weak-ocr-label \
  --dataset ./datasets/text_patches \
  --meta ./datasets/text_patches/meta/patches.parquet \
  --out ./datasets/text_patches/meta/weak_ocr.parquet
```
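One plausible way weak OCR labels become retrieval positives is to pair patches whose normalized OCR strings agree across pages. The sketch below illustrates that idea only; the function name, the whitespace-insensitive key, and the `min_len` filter are assumptions, not the project's actual logic:

```python
from collections import defaultdict

def weak_ocr_positives(patch_ids, ocr_texts, min_len=3):
    """Group patches whose normalized weak-OCR strings agree; every pair
    inside a group becomes a weak positive for retrieval training."""
    groups = defaultdict(list)
    for pid, text in zip(patch_ids, ocr_texts):
        key = "".join(text.split())      # whitespace-insensitive key
        if len(key) >= min_len:          # drop very short, unreliable reads
            groups[key].append(pid)
    return [(a, b) for ids in groups.values() if len(ids) > 1
            for i, a in enumerate(ids) for b in ids[i + 1:]]
```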
```bash
# Mine cross-page MNN positives for retrieval training
python cli.py mine-mnn-pairs \
  --dataset ./datasets/text_patches \
  --meta ./datasets/text_patches/meta/patches.parquet \
  --out ./datasets/text_patches/meta/mnn_pairs.parquet \
  --config ./configs/mnn_mining.yaml \
  --num-workers 8
```
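At its core, mutual-nearest-neighbor (MNN) mining keeps a cross-page pair only when each patch is the other's nearest neighbor in embedding space. A numpy sketch of that criterion (the interface is illustrative; the project's "robust" miner adds filtering configured via `configs/mnn_mining.yaml`):

```python
import numpy as np

def mine_mnn_pairs(emb_a, emb_b, ids_a, ids_b):
    """Return (id_a, id_b, similarity) triples where patch i on page A
    and patch j on page B are mutual nearest neighbors under cosine
    similarity on L2-normalized embeddings."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                  # (n_a, n_b) cosine similarities
    nn_ab = sim.argmax(axis=1)     # best B-match for each A patch
    nn_ba = sim.argmax(axis=0)     # best A-match for each B patch
    return [(ids_a[i], ids_b[j], float(sim[i, j]))
            for i, j in enumerate(nn_ab) if nn_ba[j] == i]
```

The mutual constraint discards one-sided matches, which is what makes MNN pairs far more precise than plain nearest-neighbor retrieval.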
```bash
# Train patch retrieval encoder with mp-InfoNCE (MNN/OCR/both)
python cli.py train-text-hierarchy-vit \
  --dataset-dir ./datasets/text_patches \
  --output-dir ./models/text_hierarchy_vit_mpnce \
  --model-name-or-path facebook/dinov2-base \
  --train-mode patch_mpnce \
  --positive-sources both \
  --pairs-parquet ./datasets/text_patches/meta/mnn_pairs.parquet \
  --weak-ocr-parquet ./datasets/text_patches/meta/weak_ocr.parquet
```
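mp-InfoNCE generalizes InfoNCE to several positives per anchor: the probability mass of all of an anchor's positives (MNN partners and/or weak-OCR matches) is pulled against every other in-batch sample. One common formulation, sketched in numpy for readability (training code would use torch tensors and batched masks):

```python
import numpy as np

def mp_info_nce(sim, pos_mask, temperature=0.07):
    """Multi-positive InfoNCE over an in-batch similarity matrix.

    sim:      (N, N) cosine similarities within a batch
    pos_mask: (N, N) boolean, True where j is a positive for anchor i
              (diagonal excluded); anchors without positives are skipped.
    """
    logits = sim / temperature
    np.fill_diagonal(logits, -np.inf)            # drop self-similarity
    log_denom = np.log(np.exp(logits).sum(axis=1))
    losses = []
    for i in range(len(sim)):
        pos = logits[i][pos_mask[i]]
        if pos.size:
            # -log( sum_p exp(s_ip/t) / sum_j exp(s_ij/t) )
            losses.append(log_denom[i] - np.log(np.exp(pos).sum()))
    return float(np.mean(losses))
```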
```bash
# Train line-level dual vision-text encoder (CLIP-style) on OCR manifests
python cli.py train-text-hierarchy-vit \
  --dataset-dir ./datasets/openpecha_ocr_lines \
  --output-dir ./models/line_clip_openpecha_bdrc_dinov2_byt5 \
  --train-mode line_clip \
  --train-manifest ./datasets/openpecha_ocr_lines/train/meta/lines.jsonl \
  --val-manifest ./datasets/openpecha_ocr_lines/eval/meta/lines.jsonl \
  --model-name-or-path facebook/dinov2-base \
  --text-encoder-name-or-path google/byt5-small \
  --image-preprocess-pipeline bdrc
```
```bash
# Train Donut/TroCR OCR directly on line manifests (with CER on eval split)
python cli.py train-donut-ocr \
  --train_manifest ./datasets/openpecha_ocr_lines/train/meta/lines.jsonl \
  --val_manifest ./datasets/openpecha_ocr_lines/eval/meta/lines.jsonl \
  --output_dir ./models/donut_openpecha_bdrc \
  --model_name_or_path microsoft/trocr-base-stage1 \
  --tokenizer_path openpecha/BoSentencePiece \
  --image_preprocess_pipeline bdrc
```
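The CER reported on the eval split is the character-level Levenshtein (edit) distance divided by the reference length, so values above 1.0 are possible for very noisy hypotheses. A self-contained sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))                 # DP row for empty reference
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[n] / max(m, 1)
```

Since CER operates on Unicode code points, it applies to Tibetan line transcriptions without any tokenization step.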
```bash
# Cross-page FAISS evaluation from exported embeddings
python cli.py eval-faiss-crosspage \
  --embeddings-npy ./models/text_hierarchy_vit_mpnce/faiss_embeddings.npy \
  --embeddings-meta ./models/text_hierarchy_vit_mpnce/faiss_embeddings_meta.parquet \
  --mnn-pairs ./datasets/text_patches/meta/mnn_pairs.parquet \
  --output-dir ./models/text_hierarchy_vit_mpnce/eval_crosspage
```
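Conceptually, the cross-page evaluation reduces to retrieval metrics such as recall@k over the mined positives, with same-page neighbors masked out so the score reflects cross-page generalization. A brute-force numpy sketch of that idea (the actual command uses a FAISS index and its own metric set):

```python
import numpy as np

def crosspage_recall_at_k(emb, page_ids, pair_idx, k=5):
    """Recall@k for cross-page retrieval.

    emb:      (N, D) L2-normalized patch embeddings
    page_ids: (N,) page identifier per patch
    pair_idx: (query, positive) index pairs, e.g. from MNN mining
    A query counts as a hit when its positive appears among the k
    nearest patches drawn from *other* pages only.
    """
    sim = emb @ emb.T
    hits = 0
    for q, p in pair_idx:
        scores = sim[q].copy()
        scores[page_ids == page_ids[q]] = -np.inf  # mask same-page patches
        topk = np.argsort(-scores)[:k]
        hits += int(p in topk)
    return hits / len(pair_idx)
```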
```bash
# Full label-1 OCR workflow (generate -> prepare -> train)
python cli.py run-donut-ocr-workflow \
  --dataset_name tibetan-donut-ocr-label1 \
  --dataset_output_dir ./datasets \
  --font_path_tibetan "ext/Microsoft Himalaya.ttf" \
  --font_path_chinese ext/simkai.ttf \
  --model_output_dir ./models/donut-ocr-label1
```

Weak OCR label generation for patch datasets is available as a root CLI subcommand:
```bash
python cli.py weak-ocr-label \
  --dataset ./datasets/text_patches \
  --meta ./datasets/text_patches/meta/patches.parquet \
  --out ./datasets/text_patches/meta/weak_ocr.parquet
```

Typical `train-donut-ocr` outputs in `--output_dir`:
- `checkpoint-*` (step-based HF checkpoints)
- `checkpoint-epoch-<N>-cer-<X>` symlink aliases (if eval happened before save)
- `model/` (final `VisionEncoderDecoderModel`)
- `tokenizer/`
- `image_processor/`
- `train_summary.json`
Current Workbench support:
- The Workbench supports the DONUT OCR workflow runner (`run-donut-ocr-workflow`) and monitors training logs/output dirs.
- There is currently no dedicated standalone DONUT inference tab that loads a trained `model/` + `tokenizer/` + `image_processor/` triplet for ad-hoc OCR inference in the UI.
- Training and evaluation are fully supported via CLI (`cli.py train-donut-ocr`).
Typical `train-text-hierarchy-vit` outputs in `--output_dir`:
- `text_hierarchy_vit_backbone/` (image backbone + image processor)
- `text_hierarchy_projection_head.pt` (image projection head)
- `text_hierarchy_clip_text_encoder/` (HF text encoder + tokenizer)
- `text_hierarchy_clip_text_projection_head.pt` (text projection head)
- `faiss_embeddings.npy`, `faiss_embeddings_meta.parquet`
- `training_config.json`
- optional `checkpoint_step_*` snapshots (image backbone/head checkpoints)
Current Workbench support:
- The Workbench can scan and use the image backbone + image projection head for line/block encoding previews and FAISS-related UI flows.
- The text encoder part of `line_clip` is currently not yet consumed by the Workbench UI (text query encoding remains a CLI / future UI extension topic).
- Training artifacts are therefore usable in the UI for image-side encoding, while full dual-encoder evaluation is primarily CLI-driven.
For local file serving in Label Studio, set:

```bash
export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
export LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/absolute/path/to/your/dataset/root
```

Then use the Workbench export actions.
CLI usage is documented separately in:
- README_CLI.md
- README_PSEUDO_LABELING_LABEL_STUDIO.md
- docs/dataset_generation.md
- docs/mnn_mining.md
- docs/retrieval_mpnce_training.md
- docs/weak_ocr.md
- docs/texture_augmentation.md
- docs/tibetan_ngram_retrieval_plan.md
MIT, see LICENSE.

