OpenPecha/nalanda-note-normalizer

nalanda-note-normalizer

A collection of Python scripts for cleaning and normalising footnotes in Tibetan Buddhist .docx files from the Nalanda collection. The tools process books organised by pandita (scholar) under the data/ directory.

Scripts

1. clean_empty_footnotes.py — Clean empty & invalid footnotes

Walks all pandita folders under data/, opens each .docx, and performs the following cleanup operations:

  • Remove empty footnotes — Detects footnotes with no text content (only whitespace) and removes both the <w:footnote> entry and its reference marker in the document body.
  • Remove invalid footnotes — Detects footnotes whose text begins with ༼ (Tibetan opening bracket), which are considered malformed, and removes them.
  • Replace tsheg before footnote markers — Replaces ་ (tsheg, U+0F0B) with ༌ (tsheg-bstar, U+0F0C) immediately before each footnote reference marker, following Tibetan typographic convention.
  • Export footnotes page-wise — Exports all footnotes grouped by page number to a {book_name}.txt file alongside each .docx.
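
The first three operations above can be sketched with the standard library alone. This is a minimal illustration, not the script's actual code: the function names are hypothetical, xml.etree stands in for lxml, and U+0F3C is assumed to be the opening bracket that marks an invalid footnote.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used inside word/footnotes.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
OPEN_BRACKET = "\u0F3C"  # assumed marker of an invalid footnote
TSHEG = "\u0F0B"
TSHEG_BSTAR = "\u0F0C"


def read_footnotes_xml(docx_path: str) -> bytes:
    """Return the raw footnotes part of a .docx (which is a zip archive)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/footnotes.xml")


def classify_footnotes(footnotes_xml: bytes) -> dict[str, str]:
    """Map each footnote ID to 'empty', 'invalid' or 'ok' (hypothetical helper)."""
    root = ET.fromstring(footnotes_xml)
    statuses = {}
    for footnote in root.iter(W + "footnote"):
        # Word stores separator lines as typed footnotes; they are not notes.
        if footnote.get(W + "type") not in (None, "normal"):
            continue
        text = "".join(t.text or "" for t in footnote.iter(W + "t"))
        if not text.strip():
            status = "empty"
        elif text.lstrip().startswith(OPEN_BRACKET):
            status = "invalid"
        else:
            status = "ok"
        statuses[footnote.get(W + "id")] = status
    return statuses


def fix_trailing_tsheg(text_before_marker: str) -> str:
    """Swap a final tsheg for tsheg-bstar in the text preceding a marker."""
    if text_before_marker.endswith(TSHEG):
        return text_before_marker[:-1] + TSHEG_BSTAR
    return text_before_marker
```

Removing a footnote also means deleting its reference marker from the document body, which this sketch leaves out.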

Log files produced:

Log file                       Content
deleted_footnotes.log          Records of empty footnotes that were removed (book, page, footnote ID)
deleted_invalid_footnotes.log  Records of invalid footnotes starting with ༼ that were removed (book, page, full text)

Usage:

python clean_empty_footnotes.py

Set dry_run=True in the main() call to preview changes without modifying files.


2. normalise_notes.py — Normalise footnotes (3-pass pipeline)

The main normalisation script. It runs three sequential passes on every .docx under data/:

Pass 1 — Remove punctuation-only footnotes

Removes footnotes that contain only Tibetan punctuation marks (། ༄ ༅ ་ ༌), edition labels (༼…༽), Tibetan digits (༠–༩), and whitespace — i.e. no real textual content.
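
One way to express this test is a pair of regular expressions. The sketch below is an assumption about how such a check could look, not the script's own implementation; the name is_punctuation_only is hypothetical, and the digit range U+0F20 to U+0F29 is assumed to be what "Tibetan digits" refers to.

```python
import re

# Marks named in the pass description: shad, the two yig mgo marks,
# tsheg and tsheg-bstar.
PUNCTUATION = "\u0F0D\u0F04\u0F05\u0F0B\u0F0C"
TIBETAN_DIGITS = "\u0F20-\u0F29"

# An edition label is anything between U+0F3C and U+0F3D.
EDITION_LABEL = re.compile("\u0F3C[^\u0F3D]*\u0F3D")
FILLER_ONLY = re.compile("[" + PUNCTUATION + TIBETAN_DIGITS + r"\s]*")


def is_punctuation_only(note_text: str) -> bool:
    """True when nothing but punctuation, labels, digits and spaces remains."""
    without_labels = EDITION_LABEL.sub("", note_text)
    return FILLER_ONLY.fullmatch(without_labels) is not None
```

Under this definition an empty string also counts as punctuation-only, which is harmless if empty footnotes have already been removed by clean_empty_footnotes.py.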

Pass 2 — Fix incomplete-reference footnotes

Detects footnotes that start with a bare shad (།) before the edition label (e.g. ། ༼སྣར། པེ།༽ ཞིག །), indicating that the reference syllable from the main text is missing. Uses the botok-rs tokenizer to extract the last syllable from the main text preceding the footnote marker and prepends it to the footnote.

Tibetan orthographic rules are applied:

  • ང final: the tsheg is kept before the shad (e.g. ཡོང་། → extracts ཡོང་)
  • Other finals: the tsheg is omitted before the shad (e.g. ཡོད་པར། → extracts པར)
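
The two rules can be illustrated without the tokenizer. In this sketch, plain splitting on tsheg, shad and whitespace stands in for botok-rs, and both function names are hypothetical.

```python
import re

TSHEG = "\u0F0B"
SHAD = "\u0F0D"
NGA = "\u0F44"  # the letter nga

# Syllables are separated by tsheg, shad or whitespace in this simplification.
SEPARATORS = re.compile("[" + TSHEG + SHAD + r"\s]+")


def last_syllable(preceding_text: str) -> str:
    """Return the final syllable of the text, with a tsheg only after nga."""
    core = preceding_text.rstrip(TSHEG + SHAD + " \t\r\n")
    syllable = SEPARATORS.split(core)[-1]
    if syllable.endswith(NGA):
        syllable += TSHEG  # nga final: keep the tsheg before the shad
    return syllable


def fix_incomplete_note(note_text: str, preceding_text: str) -> str:
    """Prepend the missing reference syllable to a shad-starting footnote."""
    stripped = note_text.lstrip()
    if not stripped.startswith(SHAD):
        return note_text  # already has a reference syllable
    syllable = last_syllable(preceding_text)
    if not syllable:
        return note_text  # nothing to prepend; leave the note as it is
    return syllable + stripped
```

A real tokenizer matters when the preceding text contains non-Tibetan material or unusual punctuation, which this simple splitting does not handle.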

Pass 3 — Normalise archaic-word footnotes

Uses the archaic_words.yml dictionary to identify footnotes whose reference spelling is an archaic Tibetan word. For each match it:

  1. Swaps the archaic reference and the modern variant in the footnote text
  2. Replaces the edition label with its complement among the four editions (e.g. ༼སྣར། པེ།༽ becomes ༼སྡེ། ཅོ།༽), using the standard four-edition order: Derge (སྡེ།), Chone (ཅོ།), Narthang (སྣར།), Peking (པེ།)
  3. Replaces the archaic spelling in the main document body with the modern variant
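
The edition-label step in item 2 can be sketched as a set complement over the four editions. The label layout (abbreviations separated by spaces inside ༼…༽) is inferred from the example above, and complement_label is a hypothetical name.

```python
OPEN, CLOSE = "\u0F3C", "\u0F3D"  # the brackets around an edition label

DERGE, CHONE, NARTHANG, PEKING = "སྡེ།", "ཅོ།", "སྣར།", "པེ།"
EDITION_ORDER = [DERGE, CHONE, NARTHANG, PEKING]


def complement_label(label: str) -> str:
    """Return a label listing the editions that the given label omits."""
    present = set(label.strip().strip(OPEN + CLOSE).split())
    missing = [edition for edition in EDITION_ORDER if edition not in present]
    return OPEN + " ".join(missing) + CLOSE
```

A label that already names all four editions yields an empty pair of brackets here; how the script treats that case is not specified.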

Log files produced:

Log file                           Content
deleted_punctuation_footnotes.log  Punctuation-only footnotes that were removed
fixed_incomplete_footnotes.log     Incomplete references that were fixed (before → after), plus unfixable entries
normalised_archaic_footnotes.log   Archaic-word footnotes that were normalised (before → after)

Usage:

python normalise_notes.py

Set dry_run=True in the main() call to preview changes without modifying files.


3. tsek_ony_note.py — List shad-starting footnotes

A diagnostic/reporting script that scans every .docx under data/ and collects footnotes whose text begins with ། (Tibetan shad, U+0F0D). These are candidates for the incomplete-reference fix in normalise_notes.py.

Log file produced:

Log file                 Content
tsek_only_footnotes.log  All footnotes starting with ། (book, page, full text)

Usage:

python tsek_ony_note.py

Data Layout

data/
├── 01-Nagarjuna/
│   ├── Book No 15) ….docx
│   ├── Book No 16) ….docx
│   └── ...
├── 02-Aryadeva/
│   └── ...
├── ...
└── 17-Atisha/
    └── ...

Each subdirectory corresponds to a pandita (Indian Buddhist scholar). The .docx files are volumes of their collected works with Tibetan textual-critical footnotes referencing variant readings across four canonical editions.
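
A traversal of this layout might look like the following sketch. The helper name iter_docx is hypothetical, and it assumes books sit directly inside each pandita folder, as the tree above suggests; the scripts themselves may walk the tree differently.

```python
from pathlib import Path


def iter_docx(data_dir="data"):
    """Yield (pandita_folder_name, docx_path) pairs in sorted order."""
    root = Path(data_dir)
    for pandita_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for docx_path in sorted(pandita_dir.glob("*.docx")):
            # Word creates "~$name.docx" lock files while a book is open.
            if docx_path.name.startswith("~$"):
                continue
            yield pandita_dir.name, docx_path


if __name__ == "__main__":
    for pandita, book in iter_docx("data"):
        print(pandita, book.name)
```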

Supporting Files

File               Description
archaic_words.yml  YAML list of ~2,800 archaic Tibetan word forms used by Pass 3 of normalise_notes.py

Requirements

  • Python 3.10+
  • lxml — XML parsing of .docx internals
  • botok-rs — Tibetan word tokenizer (used for syllable extraction in Pass 2)

Install dependencies:

pip install -r requirement.txt

Recommended Run Order

  1. clean_empty_footnotes.py — Remove empty/invalid footnotes and fix tsheg-bstar first
  2. normalise_notes.py — Run the 3-pass normalisation pipeline on cleaned files
  3. tsek_ony_note.py — (Optional) Audit remaining shad-starting footnotes
