feat: add SentencePiece unigram encoding for models without merges #2
Merged
Conversation
- WordPieceTokenizer implementing the Tokenizer interface with greedy longest-prefix subword splitting and ## continuation tokens
- EncodeForBERT method producing input_ids, attention_mask, and token_type_ids for single sentences and sentence pairs, with padding
- Pre-tokenization splitting on whitespace and punctuation boundaries
- Load function in loader.go dispatching to BPE or WordPiece based on model.type in tokenizer.json
- extractSpecialTokens recognizes BERT-style [CLS]/[SEP]/[PAD]/[UNK]
- Comprehensive tests: encode, decode, round-trip, BERT format, padding, sentence pairs, pre-tokenization, loader integration
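The greedy longest-prefix splitting described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the function name, vocabulary type, and the whole-word `[UNK]` fallback are assumptions.

```go
package main

import "fmt"

// wordPieceSplit greedily matches the longest vocabulary prefix at each
// position; pieces after the first carry the "##" continuation marker.
// If no prefix matches, the whole word maps to [UNK], mirroring common
// WordPiece behavior (an assumption about this implementation).
func wordPieceSplit(word string, vocab map[string]bool) []string {
	var pieces []string
	runes := []rune(word)
	start := 0
	for start < len(runes) {
		var match string
		// Try the longest candidate first, shrinking until one is in-vocab.
		for end := len(runes); end > start; end-- {
			sub := string(runes[start:end])
			if start > 0 {
				sub = "##" + sub
			}
			if vocab[sub] {
				match = sub
				start = end
				break
			}
		}
		if match == "" {
			return []string{"[UNK]"} // no prefix matched at all
		}
		pieces = append(pieces, match)
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"un": true, "##aff": true, "##able": true}
	fmt.Println(wordPieceSplit("unaffable", vocab)) // [un ##aff ##able]
}
```

Because matching is greedy per position rather than globally optimal, the split depends only on the vocabulary, which keeps encoding linear-ish in word length and deterministic.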
SentencePiece unigram models (e.g., Mistral 7B GGUF) provide vocabulary scores but no BPE merge table. Without this, encoding fails silently, producing wrong token IDs and garbage output. Add SetScores() to BPETokenizer and a greedy longest-match encoder that selects tokens by length first, then by score. When merges are empty but scores are present, encodeSegment automatically uses this path instead of BPE merging. Also extend the gguf.Metadata interface with GetFloat32Array and extract tokenizer.ggml.scores in ExtractTokenizer so GGUF-loaded tokenizers automatically use unigram encoding when appropriate.
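The greedy longest-match path can be sketched as below. This is a hedged illustration of the idea (longest piece first, score as tie-breaker), not the PR's encodeSegment: the function name, vocab/score layout, and the skip-on-no-match fallback are assumptions.

```go
package main

import "fmt"

// greedyUnigramEncode scans left to right; at each position it considers
// every vocabulary piece that matches, preferring the longest and breaking
// length ties by the higher SentencePiece score (scores are log-probs).
// A real encoder would fall back to UNK or byte tokens on a miss; here we
// simply skip the rune (an assumption for brevity).
func greedyUnigramEncode(text string, vocab map[string]int, scores []float32) []int {
	var ids []int
	runes := []rune(text)
	for start := 0; start < len(runes); {
		bestLen, bestID := 0, -1
		for end := start + 1; end <= len(runes); end++ {
			if id, ok := vocab[string(runes[start:end])]; ok {
				length := end - start
				if length > bestLen ||
					(length == bestLen && bestID >= 0 && scores[id] > scores[bestID]) {
					bestLen, bestID = length, id
				}
			}
		}
		if bestID < 0 {
			start++ // no vocabulary piece matches here
			continue
		}
		ids = append(ids, bestID)
		start += bestLen
	}
	return ids
}

func main() {
	vocab := map[string]int{"He": 0, "Hello": 1, "llo": 2}
	scores := []float32{-1.5, -2.0, -1.0}
	fmt.Println(greedyUnigramEncode("Hello", vocab, scores)) // [1]
}
```

Unlike true unigram Viterbi decoding, this greedy variant needs no dynamic programming, which makes it a cheap drop-in when merges are absent but scores are present.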
Greedy longest-match encoder using token scores. Fixes the Mistral/Llama SentencePiece tokenizer path (no BPE merges). 7 tests.