A collection of Python scripts for cleaning and normalising footnotes in Tibetan Buddhist .docx files from the Nalanda collection. The tools process books organised by pandita (scholar) under the data/ directory.
`clean_empty_footnotes.py` walks all pandita folders under `data/`, opens each `.docx`, and performs the following cleanup operations:
- Remove empty footnotes: detects footnotes with no text content (only whitespace) and removes both the `<w:footnote>` entry and its reference marker in the document body.
- Remove invalid footnotes: detects footnotes whose text begins with `༼` (Tibetan opening bracket), which are considered malformed, and removes them.
- Replace tsheg before footnote markers: replaces `་` (tsheg, U+0F0B) with `༌` (tsheg-bstar, U+0F0C) immediately before each footnote reference marker, following Tibetan typographic convention.
- Export footnotes page-wise: exports all footnotes grouped by page number to a `{book_name}.txt` file alongside each `.docx`.
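The empty/invalid checks and the tsheg replacement can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: it uses the standard-library `xml.etree.ElementTree` instead of `lxml`, the function names are hypothetical, and it assumes that leading whitespace should be ignored when testing for `༼` and that Word's separator footnotes (those carrying a `w:type` attribute) must be skipped.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

TSHEG = "\u0f0b"         # ordinary tsheg
TSHEG_BSTAR = "\u0f0c"   # tsheg-bstar (non-breaking tsheg)
OPEN_BRACKET = "\u0f3c"  # Tibetan opening bracket


def footnote_text(footnote):
    """Concatenate the text of every <w:t> run inside one <w:footnote>."""
    return "".join(t.text or "" for t in footnote.iterfind(".//w:t", NS))


def classify_footnotes(footnotes_xml):
    """Return the IDs of empty and invalid footnotes in a footnotes.xml string."""
    root = ET.fromstring(footnotes_xml)
    result = {"empty": [], "invalid": []}
    for footnote in root.iterfind("w:footnote", NS):
        # Separator / continuation-separator entries are not real footnotes.
        if footnote.get(f"{{{W}}}type"):
            continue
        footnote_id = footnote.get(f"{{{W}}}id")
        text = footnote_text(footnote)
        if not text.strip():
            result["empty"].append(footnote_id)
        elif text.lstrip().startswith(OPEN_BRACKET):  # assumption: ignore leading spaces
            result["invalid"].append(footnote_id)
    return result


def fix_tsheg_before_marker(run_text):
    """Swap a trailing tsheg for tsheg-bstar in the text run preceding a marker."""
    if run_text.endswith(TSHEG):
        return run_text[:-1] + TSHEG_BSTAR
    return run_text
```

Removing a footnote also requires deleting the matching reference marker from the document body, which this sketch leaves out.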
Log files produced:
| Log file | Content |
|---|---|
| `deleted_footnotes.log` | Records of empty footnotes that were removed (book, page, footnote ID) |
| `deleted_invalid_footnotes.log` | Records of invalid footnotes starting with ༼ that were removed (book, page, full text) |
Usage: `python clean_empty_footnotes.py`

Set `dry_run=True` in the `main()` call to preview changes without modifying files.
`normalise_notes.py` is the main normalisation script. It runs three sequential passes on every `.docx` under `data/`:
Pass 1 (punctuation-only footnotes): removes footnotes that contain only Tibetan punctuation marks (། ༄ ༅ ་ ༌), edition labels (༼…༽), Tibetan digits (༠–༩), and whitespace, i.e. no real textual content.
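One way to express the punctuation-only test is a single regular expression. The sketch below is an assumption about how such a check could look, not the script's actual implementation; `is_punctuation_only` is a hypothetical name.

```python
import re

# Matches a string made up solely of:
#   - shad (U+0F0D), ༄ (U+0F04), ༅ (U+0F05), tsheg (U+0F0B), tsheg-bstar (U+0F0C)
#   - Tibetan digits ༠–༩ (U+0F20..U+0F29)
#   - whitespace
#   - bracketed edition labels ༼…༽ (U+0F3C ... U+0F3D), whatever they contain
_NO_CONTENT = re.compile(
    r"(?:[\u0f0d\u0f04\u0f05\u0f0b\u0f0c\u0f20-\u0f29\s]"
    r"|\u0f3c[^\u0f3d]*\u0f3d)*"
)


def is_punctuation_only(text):
    """True when the footnote carries no real textual content."""
    return _NO_CONTENT.fullmatch(text) is not None
```

Note that an empty string also matches, so this check overlaps with the empty-footnote cleanup performed by `clean_empty_footnotes.py`.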
Pass 2 (incomplete references): detects footnotes that start with a bare `།` before the edition label (e.g. `། ༼སྣར། པེ།༽ ཞིག །`), indicating that the reference syllable from the main text is missing. The script uses the botok-rs tokenizer to extract the last syllable of the main text preceding the footnote marker and prepends it to the footnote.

Tibetan orthographic rules are applied:
- ང final: the tsheg `་` is kept before the shad (e.g. `ཡོང་།` → extracts `ཡོང་`)
- Other finals: the tsheg is omitted before the shad (e.g. `ཡོད་པར།` → extracts `པར`)
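The tsheg rule can be illustrated without a tokenizer. The sketch below deliberately swaps botok-rs for a naive split on tsheg, shad, and whitespace, so it only approximates what the script does; both function names are hypothetical.

```python
import re

TSHEG = "\u0f0b"
TSHEG_BSTAR = "\u0f0c"
SHAD = "\u0f0d"
NGA = "\u0f44"  # the letter ང

_TRAILING = TSHEG + TSHEG_BSTAR + SHAD + " \t\n"


def reference_syllable(preceding_text):
    """Last syllable of the main text before a footnote marker.

    A syllable ending in ང keeps its tsheg; any other final drops it.
    """
    core = preceding_text.rstrip(_TRAILING)
    if not core:
        return ""
    # Naive stand-in for the tokenizer: split on tsheg, shad and whitespace.
    syllable = re.split(r"[\u0f0b\u0f0c\u0f0d\s]+", core)[-1]
    return syllable + TSHEG if syllable.endswith(NGA) else syllable


def complete_footnote(preceding_text, footnote):
    """Prepend the reference syllable to a footnote that starts with a bare shad."""
    stripped = footnote.lstrip()
    syllable = reference_syllable(preceding_text)
    if not stripped.startswith(SHAD) or not syllable:
        return footnote  # nothing to fix, or unfixable
    return syllable + stripped
```

Because the footnote already begins with `།`, simple concatenation yields `ཡོང་།` for a ང final and `པར།` for other finals, matching the two rules above.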
Pass 3 (archaic words): uses the `archaic_words.yml` dictionary to identify footnotes whose reference spelling is an archaic Tibetan word. For each match it:
- Swaps the archaic reference and the modern variant in the footnote text
- Complements the edition label (e.g. `༼སྣར། པེ།༽` → `༼སྡེ། ཅོ།༽`) using the standard four-edition order: Derge (སྡེ།), Chone (ཅོ།), Narthang (སྣར།), Peking (པེ།)
- Replaces the archaic spelling in the main document body with the modern variant
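The edition-label complement is the most mechanical of these steps and can be sketched in isolation. The code below assumes that sigla inside a label are separated by whitespace, as in the example above; `complement_label` is a hypothetical name, and the swap of archaic and modern spellings is not covered.

```python
OPEN = "\u0f3c"   # ༼
CLOSE = "\u0f3d"  # ༽

# Standard order: Derge, Chone, Narthang, Peking
EDITIONS = ("སྡེ།", "ཅོ།", "སྣར།", "པེ།")


def complement_label(label):
    """Return a label listing the editions NOT named in `label`, in standard order."""
    inner = label.strip().strip(OPEN + CLOSE)
    present = set(inner.split())
    missing = [edition for edition in EDITIONS if edition not in present]
    return OPEN + " ".join(missing) + CLOSE
```

A label that already names all four editions yields an empty `༼༽` here; how the real script treats that case is not stated in this README.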
Log files produced:
| Log file | Content |
|---|---|
| `deleted_punctuation_footnotes.log` | Punctuation-only footnotes that were removed |
| `fixed_incomplete_footnotes.log` | Incomplete references that were fixed (before → after), plus unfixable entries |
| `normalised_archaic_footnotes.log` | Archaic-word footnotes that were normalised (before → after) |
Usage: `python normalise_notes.py`

Set `dry_run=True` in the `main()` call to preview changes without modifying files.
`tsek_ony_note.py` is a diagnostic/reporting script that scans every `.docx` under `data/` and collects footnotes whose text begins with `།` (Tibetan shad, U+0F0D). These are candidates for the incomplete-reference fix in `normalise_notes.py`.
Log file produced:
| Log file | Content |
|---|---|
| `tsek_only_footnotes.log` | All footnotes starting with ། (book, page, full text) |
Usage: `python tsek_ony_note.py`

Directory structure:

data/
├── 01-Nagarjuna/
│ ├── Book No 15) ….docx
│ ├── Book No 16) ….docx
│ └── ...
├── 02-Aryadeva/
│ └── ...
├── ...
└── 17-Atisha/
└── ...
Each subdirectory corresponds to a pandita (Indian Buddhist scholar). The .docx files are volumes of their collected works with Tibetan textual-critical footnotes referencing variant readings across four canonical editions.
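Given this layout, visiting every book reduces to a two-level directory walk. The helper below is a hypothetical sketch using `pathlib`; it assumes the `.docx` files sit directly inside each pandita folder and that Word lock files (names starting with `~$`) should be skipped.

```python
from pathlib import Path


def iter_books(data_dir):
    """Yield (pandita_folder_name, docx_path) for every book under data_dir."""
    pandita_dirs = sorted(p for p in Path(data_dir).iterdir() if p.is_dir())
    for pandita_dir in pandita_dirs:
        for docx_path in sorted(pandita_dir.glob("*.docx")):
            # Skip the temporary lock files Word creates while a document is open.
            if docx_path.name.startswith("~$"):
                continue
            yield pandita_dir.name, docx_path
```

Sorting both levels keeps the processing order, and therefore the log files, stable between runs.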
| File | Description |
|---|---|
| `archaic_words.yml` | YAML list of ~2,800 archaic Tibetan word forms used by Pass 3 of `normalise_notes.py` |
- Python 3.10+
- `lxml`: XML parsing of `.docx` internals
- `botok-rs`: Tibetan word tokenizer (used for syllable extraction in Pass 2)
Install dependencies: `pip install -r requirement.txt`

Recommended workflow:
1. `clean_empty_footnotes.py`: remove empty/invalid footnotes and fix tsheg-bstar first
2. `normalise_notes.py`: run the 3-pass normalisation pipeline on cleaned files
3. `tsek_ony_note.py`: (optional) audit remaining shad-starting footnotes