nalanda-note-normalizer

A collection of Python scripts for cleaning and normalising footnotes in Tibetan Buddhist .docx files from the Nalanda collection. The tools process books organised by pandita (scholar) under the data/ directory.

Scripts

1. `clean_empty_footnotes.py` — Clean empty & invalid footnotes

Walks all pandita folders under data/, opens each .docx, and performs the following cleanup operations:

- Remove empty footnotes: detects footnotes with no text content (only whitespace) and removes both the <w:footnote> entry and its reference marker in the document body.
- Remove invalid footnotes: detects footnotes whose text begins with ༼ (Tibetan opening bracket), which are considered malformed, and removes them.
- Replace tsheg before footnote markers: replaces ་ (tsheg, U+0F0B) with ༌ (tsheg-bstar, U+0F0C) immediately before each footnote reference marker, following Tibetan typographic convention.
- Export footnotes page-wise: exports all footnotes grouped by page number to a {book_name}.txt file alongside each .docx.
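The first three operations can be illustrated with a minimal sketch. It is not the script's actual code: it uses the standard-library `xml.etree.ElementTree` instead of lxml so that it runs without dependencies, and the helper names (`footnote_text`, `classify`, `fix_tsheg_before_marker`) and the sample XML are hypothetical. It also assumes that leading whitespace is ignored when checking for ༼.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/footnotes.xml inside a .docx
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TSHEG = "\u0f0b"         # ་ intersyllabic tsheg
TSHEG_BSTAR = "\u0f0c"   # ༌ tsheg-bstar
OPEN_BRACKET = "\u0f3c"  # ༼


def footnote_text(footnote):
    """Concatenate the text of every <w:t> run inside one <w:footnote>."""
    return "".join(t.text or "" for t in footnote.iter(f"{{{W}}}t"))


def classify(text):
    """Return 'empty', 'invalid', or 'keep' for one footnote's text."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if stripped.startswith(OPEN_BRACKET):
        return "invalid"
    return "keep"


def fix_tsheg_before_marker(text_before_marker):
    """Swap a trailing tsheg for tsheg-bstar in the text preceding a marker."""
    if text_before_marker.endswith(TSHEG):
        return text_before_marker[:-1] + TSHEG_BSTAR
    return text_before_marker


# Hypothetical sample data, shaped like a fragment of word/footnotes.xml.
SAMPLE = f"""<w:footnotes xmlns:w="{W}">
  <w:footnote w:id="2"><w:p><w:r><w:t xml:space="preserve">  </w:t></w:r></w:p></w:footnote>
  <w:footnote w:id="3"><w:p><w:r><w:t>༼སྣར།༽ ཞིག</w:t></w:r></w:p></w:footnote>
  <w:footnote w:id="4"><w:p><w:r><w:t>། ༼སྣར། པེ།༽ ཞིག །</w:t></w:r></w:p></w:footnote>
</w:footnotes>"""

root = ET.fromstring(SAMPLE)
results = {
    fn.get(f"{{{W}}}id"): classify(footnote_text(fn))
    for fn in root.iter(f"{{{W}}}footnote")
}
```

In this sample, footnote 2 is classified as empty, footnote 3 as invalid, and footnote 4 is kept. Removing the matching reference marker from the document body is not shown.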

Log files produced:

| Log file | Content |
| --- | --- |
| `deleted_footnotes.log` | Records of empty footnotes that were removed (book, page, footnote ID) |
| `deleted_invalid_footnotes.log` | Records of invalid footnotes starting with `༼` that were removed (book, page, full text) |

Usage:

python clean_empty_footnotes.py

Set dry_run=True in the main() call to preview changes without modifying files.

2. `normalise_notes.py` — Normalise footnotes (3-pass pipeline)

The main normalisation script that runs three sequential passes on every .docx under data/:

Pass 1 — Remove punctuation-only footnotes

Removes footnotes that contain only Tibetan punctuation marks (། ༄ ༅ ་ ༌), edition labels (༼…༽), Tibetan digits (༠–༩), and whitespace — i.e. no real textual content.
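A minimal sketch of this check, assuming it can be done with two regular expressions: one strips edition labels, the other tests that only punctuation, Tibetan digits, and whitespace remain. The name `is_punctuation_only` is hypothetical and the script's own logic may differ.

```python
import re

# An edition label: everything from ༼ (U+0F3C) to the next ༽ (U+0F3D).
EDITION_LABEL = re.compile("\u0f3c[^\u0f3d]*\u0f3d")

# Shad ། , the marks ༄ ༅ , tsheg ་ , tsheg-bstar ༌ , Tibetan digits ༠-༩, whitespace.
FILLER = re.compile("[\u0f0d\u0f04\u0f05\u0f0b\u0f0c\u0f20-\u0f29\\s]*")


def is_punctuation_only(text):
    """True if nothing but punctuation, digits, labels, and whitespace is present."""
    remainder = EDITION_LABEL.sub("", text)
    # Note: an empty string also matches; this sketch does not special-case it.
    return FILLER.fullmatch(remainder) is not None
```

For example, `is_punctuation_only("། ༼སྣར། པེ།༽ ༡༢ །")` is true, while `is_punctuation_only("། ༼སྣར། པེ།༽ ཞིག །")` is false because ཞིག remains after the label is stripped.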

Pass 2 — Fix incomplete-reference footnotes

Detects footnotes that start with a bare ། before the edition label (e.g. ། ༼སྣར། པེ།༽ ཞིག །), which indicates that the reference syllable from the main text is missing. The script uses the botok-rs tokenizer to extract the last syllable of the main text preceding the footnote marker and prepends it to the footnote.

Tibetan orthographic rules are applied:

- ང final: the tsheg ་ is kept before the shad (e.g. ཡོང་། → extracts ཡོང་)
- Other finals: the tsheg is omitted before the shad (e.g. ཡོད་པར། → extracts པར)
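The two rules above can be sketched as follows. This is an illustration only: it splits syllables on tsheg and shad with a regular expression rather than calling botok-rs, and the names `reference_syllable` and `fix_incomplete` are hypothetical.

```python
import re

TSHEG = "\u0f0b"        # ་
TSHEG_BSTAR = "\u0f0c"  # ༌
SHAD = "\u0f0d"         # །
NGA = "\u0f44"          # ང


def reference_syllable(text_before_marker):
    """Return the last syllable before the marker, or None if there is none."""
    core = text_before_marker.rstrip(" \t\n" + TSHEG + TSHEG_BSTAR + SHAD)
    parts = re.split(f"[{TSHEG}{TSHEG_BSTAR}{SHAD}\\s]+", core)
    syllables = [s for s in parts if s]
    if not syllables:
        return None
    last = syllables[-1]
    # A syllable ending in ང keeps its tsheg before the shad; others drop it.
    return last + TSHEG if last.endswith(NGA) else last


def fix_incomplete(footnote, text_before_marker):
    """Prepend the reference syllable to a footnote that starts with a bare shad."""
    if not footnote.startswith(SHAD):
        return footnote
    syllable = reference_syllable(text_before_marker)
    if syllable is None:
        return footnote  # unfixable: no preceding syllable found
    return syllable + footnote
```

With the examples above, `reference_syllable("ཡོང་།")` gives ཡོང་ and `reference_syllable("ཡོད་པར།")` gives པར, so a footnote ། ༼སྣར། པེ།༽ ཞིག ། that follows ཡོད་པར becomes པར། ༼སྣར། པེ།༽ ཞིག །.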

Pass 3 — Normalise archaic-word footnotes

Uses the archaic_words.yml dictionary to identify footnotes whose reference spelling is an archaic Tibetan word. For each match it:

- Swaps the archaic reference and the modern variant in the footnote text.
- Replaces the edition label with its complement among the four editions (e.g. ༼སྣར། པེ།༽ → ༼སྡེ། ཅོ།༽), listed in the standard order: Derge (སྡེ།), Chone (ཅོ།), Narthang (སྣར།), Peking (པེ།).
- Replaces the archaic spelling in the main document body with the modern variant.
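The edition-label step can be sketched as a set complement over the four editions. The function name `complement_label` is hypothetical, and the sketch assumes that edition names inside a label are separated by spaces, as in the example above.

```python
# Standard four-edition order: Derge, Chone, Narthang, Peking.
EDITION_ORDER = ["སྡེ།", "ཅོ།", "སྣར།", "པེ།"]

OPEN_BRACKET = "\u0f3c"   # ༼
CLOSE_BRACKET = "\u0f3d"  # ༽


def complement_label(label):
    """Return a label naming the editions that are absent from `label`."""
    present = set(label.strip(OPEN_BRACKET + CLOSE_BRACKET + " ").split())
    missing = [edition for edition in EDITION_ORDER if edition not in present]
    return OPEN_BRACKET + " ".join(missing) + CLOSE_BRACKET
```

Because the result is built by filtering `EDITION_ORDER`, the output always follows the standard order regardless of how the input label was ordered.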

Log files produced:

| Log file | Content |
| --- | --- |
| `deleted_punctuation_footnotes.log` | Punctuation-only footnotes that were removed |
| `fixed_incomplete_footnotes.log` | Incomplete references that were fixed (before → after), plus unfixable entries |
| `normalised_archaic_footnotes.log` | Archaic-word footnotes that were normalised (before → after) |

Usage:

python normalise_notes.py

Set dry_run=True in the main() call to preview changes without modifying files.

3. `tsek_ony_note.py` — List shad-starting footnotes

A diagnostic/reporting script that scans every .docx under data/ and collects footnotes whose text begins with ། (Tibetan shad, U+0F0D). These are candidates for the incomplete-reference fix in normalise_notes.py.
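The selection criterion amounts to a prefix test on each footnote's text. A minimal sketch, assuming footnotes have already been extracted into records holding the book, page, and full text; the `Footnote` dataclass and `shad_starting` function are hypothetical names, and ignoring leading whitespace is an assumption.

```python
from dataclasses import dataclass

SHAD = "\u0f0d"  # །


@dataclass
class Footnote:
    book: str
    page: int
    text: str


def shad_starting(footnotes):
    """Keep only footnotes whose text begins with a shad."""
    return [fn for fn in footnotes if fn.text.lstrip().startswith(SHAD)]


# Hypothetical sample records.
notes = [
    Footnote("Book A", 3, "། ༼སྣར། པེ།༽ ཞིག །"),
    Footnote("Book A", 4, "པར། ༼སྣར། པེ།༽ ཞིག །"),
]
candidates = shad_starting(notes)
```

Only the first sample record is selected, since the second already carries its reference syllable.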

Log file produced:

| Log file | Content |
| --- | --- |
| `tsek_only_footnotes.log` | All footnotes starting with `།` (book, page, full text) |

Usage:

python tsek_ony_note.py

Data Layout

data/
├── 01-Nagarjuna/
│   ├── Book No 15) ….docx
│   ├── Book No 16) ….docx
│   └── ...
├── 02-Aryadeva/
│   └── ...
├── ...
└── 17-Atisha/
    └── ...

Each subdirectory corresponds to a pandita (Indian Buddhist scholar). The .docx files are volumes of their collected works with Tibetan textual-critical footnotes referencing variant readings across four canonical editions.
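A traversal of this layout might look like the sketch below. The function name `iter_docx` is hypothetical, and skipping `~$` lock files left behind by Word is an added assumption rather than documented behaviour of the scripts.

```python
from pathlib import Path


def iter_docx(data_dir):
    """Yield (pandita_folder_name, docx_path) pairs in sorted order."""
    root = Path(data_dir)
    for pandita_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for docx in sorted(pandita_dir.glob("*.docx")):
            if docx.name.startswith("~$"):
                continue  # temporary lock file created while Word has the file open
            yield pandita_dir.name, docx


if __name__ == "__main__":
    if Path("data").is_dir():
        for pandita, path in iter_docx("data"):
            print(pandita, path.name)
```

Sorting both levels keeps the processing order stable between runs, which makes log files easier to compare.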

Supporting Files

| File | Description |
| --- | --- |
| `archaic_words.yml` | YAML list of ~2,800 archaic Tibetan word forms used by Pass 3 of `normalise_notes.py` |

Requirements

- Python 3.10+
- lxml: XML parsing of .docx internals
- botok-rs: Tibetan word tokenizer (used for syllable extraction in Pass 2)

Install dependencies:

pip install -r requirement.txt

Recommended Run Order

1. `clean_empty_footnotes.py`: remove empty and invalid footnotes and apply the tsheg-bstar replacement first.
2. `normalise_notes.py`: run the 3-pass normalisation pipeline on the cleaned files.
3. `tsek_ony_note.py`: (optional) audit any remaining shad-starting footnotes.
