Tutorials

A Story of Parsing, Annotating, and Serializing Tibetan Text

Let's follow the story of one Tibetan text as it moves through our pipeline: parsing, annotating, and serializing. We'll use a simple example, a short Tibetan verse with its English translation.

Our Sample Data

Let's say we have this Tibetan text with its English translation:

བདེ་གཤེགས་སྤྱན་རས་གཟིགས་དབང་ཕྱུག་ལ་ཕྱག་འཚལ་ལོ། །
I pay homage to the Lord Avalokiteśvara.

དེ་ཡི་མཚན་ཉིད་རྣམས་ནི་མཐོང་བ་མེད། །
His characteristics cannot be seen.

དེ་ཡི་སྐུ་ནི་མཐོང་བ་མེད། །
His body cannot be seen.

དེ་ཡི་ཡི་གེ་ནི་མཐོང་བ་མེད། །
His letters cannot be seen.

Chapter 1: The Parser's Tale

Our parser's job is to break this text into meaningful segments. Let's create a parser that understands Tibetan verses:

from typing import List, Dict, Any
from openpecha.pecha import Pecha
from openpecha.pecha.annotations import AnnotationModel, AnnotationType

class TibetanVerseParser:
    def __init__(self):
        self.segments = []
        self.current_position = 0
    
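    def parse(self, text: str) -> List[Dict[str, Any]]:
        """
        Parse Tibetan text into verses and their translations.

        Offsets are character positions in `text`. They are only correct if
        `text` starts with the first verse, each Tibetan line is followed by
        exactly one newline, and verses are separated by one blank line.
        """
        # Reset state so that calling parse() twice does not accumulate segments
        self.segments = []
        self.current_position = 0

        # Split by double newlines to separate verses
        verses = text.split('\n\n')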
        
        for verse in verses:
            # Split into Tibetan and English
            lines = verse.strip().split('\n')
            if len(lines) >= 2:
                tibetan = lines[0].strip()
                english = lines[1].strip()
                
                # Create segment for Tibetan text
                tibetan_segment = {
                    'text': tibetan,
                    'start': self.current_position,
                    'end': self.current_position + len(tibetan),
                    'type': 'tibetan'
                }
                self.current_position += len(tibetan) + 1
                
                # Create segment for English translation
                english_segment = {
                    'text': english,
                    'start': self.current_position,
                    'end': self.current_position + len(english),
                    'type': 'translation'
                }
                self.current_position += len(english) + 2  # +2 for the newlines
                
                self.segments.extend([tibetan_segment, english_segment])
        
        return self.segments

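# Let's try our parser on the sample data from above
our_tibetan_text = """\
བདེ་གཤེགས་སྤྱན་རས་གཟིགས་དབང་ཕྱུག་ལ་ཕྱག་འཚལ་ལོ། །
I pay homage to the Lord Avalokiteśvara.

དེ་ཡི་མཚན་ཉིད་རྣམས་ནི་མཐོང་བ་མེད། །
His characteristics cannot be seen.

དེ་ཡི་སྐུ་ནི་མཐོང་བ་མེད། །
His body cannot be seen.

དེ་ཡི་ཡི་གེ་ནི་མཐོང་བ་མེད། །
His letters cannot be seen."""

parser = TibetanVerseParser()
segments = parser.parse(our_tibetan_text)
print("Parsed segments:", segments)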
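Because the parser tracks positions by hand, it is worth checking that every segment's offsets really slice back to the segment's text. The helper below is a minimal sketch of such a check; `find_offset_mismatches` is a hypothetical name introduced here and is not part of openpecha. It uses one verse from the sample data and a deliberately shifted copy to show what drift looks like.

```python
from typing import Any, Dict, List


def find_offset_mismatches(text: str, segments: List[Dict[str, Any]]) -> List[int]:
    """Return the indices of segments whose offsets do not slice back to their text."""
    return [
        i
        for i, seg in enumerate(segments)
        if text[seg["start"]:seg["end"]] != seg["text"]
    ]


# One verse from the sample data, followed by its translation.
tibetan = "དེ་ཡི་སྐུ་ནི་མཐོང་བ་མེད། །"
english = "His body cannot be seen."
sample_text = tibetan + "\n" + english

good_segments = [
    {"text": tibetan, "start": 0, "end": len(tibetan), "type": "tibetan"},
    {
        "text": english,
        "start": len(tibetan) + 1,  # +1 skips the newline between the two lines
        "end": len(tibetan) + 1 + len(english),
        "type": "translation",
    },
]

# Simulate drift: shift the translation segment by one character.
drifted_segments = [
    good_segments[0],
    {
        **good_segments[1],
        "start": good_segments[1]["start"] + 1,
        "end": good_segments[1]["end"] + 1,
    },
]

print(find_offset_mismatches(sample_text, good_segments))     # prints []
print(find_offset_mismatches(sample_text, drifted_segments))  # prints [1]
```

Running the same check on the output of `TibetanVerseParser.parse` would catch drift caused by extra blank lines or leading whitespace in the input.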

Chapter 2: The Annotation Adventure

Now that we have our segments, let's add annotations to mark them as Tibetan verses and translations:

def create_verse_annotations(pecha: Pecha, segments: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """
    Create annotations for Tibetan verses and their translations.

    Each annotation is a plain dict. `pecha` is accepted for context but is
    not read by this function.
    """
    annotations = []
    
    for i, segment in enumerate(segments):
        # Create text selector
        text_selector = {
            "@type": "TextSelector",
            "resource": "base",
            "offset": {
                "@type": "Offset",
                "begin": {
                    "@type": "BeginAlignedCursor",
                    "value": segment['start']
                },
                "end": {
                    "@type": "BeginAlignedCursor",
                    "value": segment['end']
                }
            }
        }
        
        # Create annotation data
        annotation_data = {
            "@type": "AnnotationData",
            "@id": f"verse_{i}",
            "key": "verse_type",
            "value": {
                "@type": "String",
                "value": segment['type']
            }
        }
        
        # Create the annotation
        annotation = {
            "@type": "Annotation",
            "@id": f"ann_{i}",
            "target": text_selector,
            "data": [annotation_data]
        }
        
        annotations.append(annotation)
    
    return annotations

# Create annotations
# Note: `pecha` must already exist as a Pecha instance; the function does not read it yet
annotations = create_verse_annotations(pecha, segments)
print("Created annotations:", annotations)
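To see what a text selector actually points at, it helps to resolve one back into a string. The sketch below is an illustration under stated assumptions: `make_text_selector` and `resolve_text_selector` are hypothetical helpers, not openpecha functions, and they only handle the `BeginAlignedCursor` case used in this tutorial, with `end` treated as exclusive.

```python
from typing import Any, Dict


def make_text_selector(begin: int, end: int) -> Dict[str, Any]:
    """Build a selector dict in the same shape the tutorial uses."""
    return {
        "@type": "TextSelector",
        "resource": "base",
        "offset": {
            "@type": "Offset",
            "begin": {"@type": "BeginAlignedCursor", "value": begin},
            "end": {"@type": "BeginAlignedCursor", "value": end},
        },
    }


def resolve_text_selector(base_text: str, selector: Dict[str, Any]) -> str:
    """Return the span of base_text that the selector points at."""
    offset = selector["offset"]
    for cursor in (offset["begin"], offset["end"]):
        # Other cursor types would need different arithmetic, so refuse them.
        if cursor["@type"] != "BeginAlignedCursor":
            raise ValueError(f"unsupported cursor type: {cursor['@type']}")
    return base_text[offset["begin"]["value"]:offset["end"]["value"]]


tibetan = "དེ་ཡི་སྐུ་ནི་མཐོང་བ་མེད། །"
english = "His body cannot be seen."
base_text = tibetan + "\n" + english

# Select the English line: it starts right after the Tibetan line and its newline.
selector = make_text_selector(len(tibetan) + 1, len(tibetan) + 1 + len(english))
print(resolve_text_selector(base_text, selector))  # prints: His body cannot be seen.
```

Python string indices count Unicode code points, so a Tibetan stack such as a base letter plus a subjoined letter occupies more than one position.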

Chapter 3: The Serializer's Journey

Finally, let's create a serializer to package everything together:

class TibetanVerseSerializer:
    def __init__(self):
        self.annotation_store = {
            "@type": "AnnotationStore",
            "@id": "tibetan_verse_store",
            "resources": [
                {
                    "@type": "TextResource",
                    "@id": "base",
                    "@include": "verses.txt"
                }
            ],
            "annotationsets": [
                {
                    "@type": "AnnotationDataSet",
                    "@id": "verse_annotation",
                    "keys": [
                        {
                            "@type": "DataKey",
                            "@id": "verse_type"
                        }
                    ],
                    "data": []
                }
            ],
            "annotations": []
        }
    
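    def serialize(self, pecha: Pecha, annotations: List[Dict[str, Any]]) -> Dict[str, Any]:
        """
        Serialize the pecha and its annotations.

        The full annotation data goes into the dataset; each annotation keeps
        only a reference to it (its "@id" plus the "set" it belongs to).
        """
        dataset = self.annotation_store["annotationsets"][0]
        dataset["data"] = []  # start clean so repeated calls do not duplicate data
        stored_annotations = []

        for annotation in annotations:
            references = []
            for data in annotation["data"]:
                # Add annotation data to the dataset
                dataset["data"].append(data)
                references.append({
                    "@type": "AnnotationData",
                    "@id": data["@id"],
                    "set": dataset["@id"]
                })
            # Add the annotation to the store, pointing at the stored data
            stored_annotations.append({**annotation, "data": references})

        self.annotation_store["annotations"] = stored_annotations
        return self.annotation_store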

# Let's serialize our data
serializer = TibetanVerseSerializer()
serialized_data = serializer.serialize(pecha, annotations)

# Save the serialized data
import json
with open('tibetan_verses.json', 'w', encoding='utf-8') as f:
    json.dump(serialized_data, f, ensure_ascii=False, indent=2)
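The store refers to its base text through `"@include": "verses.txt"`, so the JSON file alone is not enough: the text file has to be saved too. The sketch below writes both files side by side. It assumes that the include path is resolved relative to the JSON file and that the store has a single resource; `write_store` is a hypothetical helper, not an openpecha function.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_store(directory: Path, base_text: str, store: Dict[str, Any]) -> Path:
    """Write the store as JSON plus the base text file named by "@include"."""
    directory.mkdir(parents=True, exist_ok=True)

    # Assumption: the store has exactly one resource, as in this tutorial.
    include_name = store["resources"][0]["@include"]

    # Write bytes rather than text so that no newline translation happens;
    # a "\n" turned into "\r\n" would shift every offset after it.
    (directory / include_name).write_bytes(base_text.encode("utf-8"))

    store_path = directory / "tibetan_verses.json"
    store_path.write_bytes(
        json.dumps(store, ensure_ascii=False, indent=2).encode("utf-8")
    )
    return store_path


base_text = "དེ་ཡི་སྐུ་ནི་མཐོང་བ་མེད། །\nHis body cannot be seen."
store = {
    "@type": "AnnotationStore",
    "@id": "tibetan_verse_store",
    "resources": [{"@type": "TextResource", "@id": "base", "@include": "verses.txt"}],
    "annotationsets": [],
    "annotations": [],
}

with tempfile.TemporaryDirectory() as tmp:
    store_path = write_store(Path(tmp), base_text, store)
    saved_text = (store_path.parent / "verses.txt").read_bytes().decode("utf-8")
    round_trip_ok = saved_text == base_text

print(round_trip_ok)  # prints True
```

Keeping the saved text byte-for-byte identical to the string that was parsed is what keeps the stored offsets valid.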

The Final Output

After running our pipeline, we get a JSON file that looks like this (abridged: the "// ..." lines stand for omitted entries and are not part of the actual file):

{
  "@type": "AnnotationStore",
  "@id": "tibetan_verse_store",
  "resources": [
    {
      "@type": "TextResource",
      "@id": "base",
      "@include": "verses.txt"
    }
  ],
  "annotationsets": [
    {
      "@type": "AnnotationDataSet",
      "@id": "verse_annotation",
      "keys": [
        {
          "@type": "DataKey",
          "@id": "verse_type"
        }
      ],
      "data": [
        {
          "@type": "AnnotationData",
          "@id": "verse_0",
          "key": "verse_type",
          "value": {
            "@type": "String",
            "value": "tibetan"
          }
        },
        {
          "@type": "AnnotationData",
          "@id": "verse_1",
          "key": "verse_type",
          "value": {
            "@type": "String",
            "value": "translation"
          }
        }
        // ... more annotation data ...
      ]
    }
  ],
  "annotations": [
    {
      "@type": "Annotation",
      "@id": "ann_0",
      "target": {
        "@type": "TextSelector",
        "resource": "base",
        "offset": {
          "@type": "Offset",
          "begin": {
            "@type": "BeginAlignedCursor",
            "value": 0
          },
          "end": {
            "@type": "BeginAlignedCursor",
            "value": 48
          }
        }
      },
      "data": [
        {
          "@type": "AnnotationData",
          "@id": "verse_0",
          "set": "verse_annotation"
        }
      ]
    }
    // ... more annotations ...
  ]
}

Epilogue: What We've Learned

In this story, we've seen how to:

Parse Tibetan text into meaningful segments
Add annotations to mark different types of content
Serialize everything into a structured format

The resulting JSON file can be used by other tools to:

Display the text with proper formatting
Extract specific types of content
Perform analysis on the text
Create translations or other derived works
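As an illustration of the "extract specific types of content" use case, the sketch below pulls every span of a given type out of a store. It assumes the layout shown in The Final Output, where each annotation references its data by "@id" and "set", and it assumes the base text has already been loaded as a string. `extract_by_type` and `make_annotation` are hypothetical helpers introduced for this example.

```python
from typing import Any, Dict, List, Tuple


def extract_by_type(store: Dict[str, Any], base_text: str, verse_type: str) -> List[str]:
    """Return the text of every annotation whose verse_type equals verse_type."""
    # Index the dataset entries by (set id, data id) so references can be resolved.
    index: Dict[Tuple[str, str], Tuple[str, str]] = {}
    for dataset in store["annotationsets"]:
        for data in dataset["data"]:
            index[(dataset["@id"], data["@id"])] = (data["key"], data["value"]["value"])

    wanted = ("verse_type", verse_type)
    spans = []
    for annotation in store["annotations"]:
        refs = annotation["data"]
        if any(index.get((ref["set"], ref["@id"])) == wanted for ref in refs):
            offset = annotation["target"]["offset"]
            spans.append(base_text[offset["begin"]["value"]:offset["end"]["value"]])
    return spans


def make_annotation(i: int, begin: int, end: int) -> Dict[str, Any]:
    """Build one annotation in the reference form used by the final output."""
    return {
        "@type": "Annotation",
        "@id": f"ann_{i}",
        "target": {
            "@type": "TextSelector",
            "resource": "base",
            "offset": {
                "@type": "Offset",
                "begin": {"@type": "BeginAlignedCursor", "value": begin},
                "end": {"@type": "BeginAlignedCursor", "value": end},
            },
        },
        "data": [{"@type": "AnnotationData", "@id": f"verse_{i}", "set": "verse_annotation"}],
    }


tibetan = "དེ་ཡི་སྐུ་ནི་མཐོང་བ་མེད། །"
english = "His body cannot be seen."
base_text = tibetan + "\n" + english

store = {
    "annotationsets": [{
        "@id": "verse_annotation",
        "data": [
            {"@id": "verse_0", "key": "verse_type", "value": {"@type": "String", "value": "tibetan"}},
            {"@id": "verse_1", "key": "verse_type", "value": {"@type": "String", "value": "translation"}},
        ],
    }],
    "annotations": [
        make_annotation(0, 0, len(tibetan)),
        make_annotation(1, len(tibetan) + 1, len(base_text)),
    ],
}

print(extract_by_type(store, base_text, "translation"))  # prints ['His body cannot be seen.']
```

The same lookup with "tibetan" returns only the Tibetan lines, which is a convenient starting point for display or analysis tools.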

Remember that this is just one way to process Tibetan text. You can extend this pipeline to handle more complex cases, such as:

Multiple translations
Commentary layers
Cross-references
Metadata about the text
And much more!
