Tokenizers using Byte Pair Encoding Algorithm

Project Title: Hindi BPE Tokenizer

Description

This project implements a custom Byte Pair Encoding (BPE) tokenizer tailored to the Hindi language. Before any merges are applied, a regex-based pre-tokenization step groups Devanagari characters, combining marks, and numerals so that merges do not cross those boundaries. An interactive web interface visualizes how input text is split into tokens.

Key Features

Hindi-Specific Pre-tokenization: Uses specialized regex patterns (bpe.py) to correctly group Devanagari characters, combining marks, and numerals before applying BPE.
Custom BPE Implementation: A pure Python implementation of the BPE algorithm, including training (merges) and inference (encoding/decoding).
Interactive Visualization: A Gradio-based web app (app.py) that color-codes tokens to visually demonstrate the tokenizer's performance on arbitrary input text.
Model Persistence: The tokenizer can be saved to and loaded from a pickle file (hindi_tokenizer.pkl), preserving the vocabulary and merge rules.
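
The pre-tokenization and training steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in bpe.py: the regex, the character-level starting symbols, and the names `PRETOKEN_RE`, `pretokenize`, `merge_pair`, and `train_bpe` are all assumptions made for the example, and it uses the standard `re` module with explicit Unicode ranges rather than whatever patterns the project actually defines.

```python
import re
from collections import Counter

# Assumed pattern (not the project's): a run of digits (ASCII or Devanagari),
# a run of Devanagari letters and combining marks, or any single other
# non-space character such as the danda (U+0964).
PRETOKEN_RE = re.compile(r"[0-9\u0966-\u096F]+|[\u0900-\u0963\u0970-\u097F]+|\S")


def pretokenize(text):
    """Split text into chunks; merges are never learned across chunks."""
    return PRETOKEN_RE.findall(text)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of strings."""
    # Each distinct pre-token starts as a tuple of single characters.
    words = Counter(tuple(tok) for text in corpus for tok in pretokenize(text))
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by pre-token frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every pre-token is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges
```

Calling `train_bpe(["नाम नाम नाम"], 5)` stops after two merges, because the only pre-token collapses into a single symbol. Starting from characters instead of raw bytes is a simplification chosen here for readability.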

File Structure

app.py: The entry point for the web application. It sets up the Gradio interface, handles user input, and renders the color-coded token output.
bpe.py: The core logic file containing the HindiBPETokenizer class. It handles:
- Loading the pre-trained model.
- Pre-tokenization using regex.
- Applying BPE merges to encode text.
- Decoding token IDs back to strings.
hindi_tokenizer.pkl: The serialized file containing the trained tokenizer's vocabulary (id2sym, sym2id) and merge rules.
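
As a rough illustration of how the pieces above fit together, the sketch below encodes a word by applying merge rules in order, decodes IDs back to text, and round-trips the tokenizer state through pickle. The dictionary layout (keys `id2sym`, `sym2id`, `merges`), the toy vocabulary, and the function names are assumptions for the example; the actual contents of hindi_tokenizer.pkl and the methods of HindiBPETokenizer may differ.

```python
import pickle


def encode_word(word, merges, sym2id):
    """Apply merge rules in the order they were learned, then map to IDs."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return [sym2id[s] for s in symbols]


def decode(ids, id2sym):
    """Concatenate the symbols for each ID to recover the text."""
    return "".join(id2sym[i] for i in ids)


# Toy state, invented for this example only.
merges = [("न", "ा"), ("ना", "म")]
vocab = ["न", "ा", "म", "ना", "नाम"]
sym2id = {s: i for i, s in enumerate(vocab)}
id2sym = {i: s for s, i in sym2id.items()}

# Hypothetical layout for the pickled state; round-tripped in memory here.
state = {"id2sym": id2sym, "sym2id": sym2id, "merges": merges}
restored = pickle.loads(pickle.dumps(state))

ids = encode_word("नाम", restored["merges"], restored["sym2id"])
text = decode(ids, restored["id2sym"])
```

Because unpickling can execute arbitrary code, a pickle file should only be loaded when it comes from a trusted source.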

How to Run

Ensure you have the necessary dependencies installed (e.g., gradio, regex).
Run the application:
```
python app.py
```
Open the provided local URL in your browser to interact with the tokenizer.
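
The color-coded output shown in the browser could be produced along these lines. This is a sketch under assumptions, not the code in app.py: `render_tokens_html` and `PALETTE` are hypothetical names, and the example deliberately avoids Gradio so it runs with the standard library alone. In a Gradio app, a string like this would typically be passed to an HTML output component.

```python
import html

# Hypothetical palette; colors cycle when there are more tokens than entries.
PALETTE = ["#ffd6a5", "#caffbf", "#9bf6ff", "#bdb2ff", "#ffc6ff"]


def render_tokens_html(tokens):
    """Wrap each token in a <span> with a background color from PALETTE."""
    spans = []
    for i, tok in enumerate(tokens):
        color = PALETTE[i % len(PALETTE)]
        # Escape the token so characters like < and & display literally.
        spans.append(
            f'<span style="background:{color}">{html.escape(tok)}</span>'
        )
    return "".join(spans)
```

Cycling through a small palette keeps adjacent tokens visually distinct without assigning a unique color to every token ID.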

