
feat: add SentencePiece unigram encoding for models without merges#2

Merged
dndungu merged 2 commits into main from feat/sentencepiece-unigram
Mar 26, 2026
Conversation

@dndungu (Contributor) commented Mar 26, 2026

Adds a greedy longest-match encoder that selects tokens using vocabulary scores. This fixes the Mistral/Llama SentencePiece tokenizers, which provide scores but no BPE merge table. Includes 7 tests.

dndungu added 2 commits March 20, 2026 20:06
- WordPieceTokenizer implementing Tokenizer interface with greedy
  longest-prefix subword splitting and ## continuation tokens
- EncodeForBERT method producing input_ids, attention_mask, and
  token_type_ids for single sentences and sentence pairs with padding
- Pre-tokenization splitting on whitespace and punctuation boundaries
- Load function in loader.go dispatching to BPE or WordPiece based
  on model.type in tokenizer.json
- extractSpecialTokens recognizes BERT-style [CLS]/[SEP]/[PAD]/[UNK]
- Comprehensive tests: encode, decode, round-trip, BERT format,
  padding, sentence pairs, pre-tokenization, loader integration
SentencePiece unigram models (e.g., Mistral 7B GGUF) provide vocabulary
scores but no BPE merge table. Without this, encoding fails silently,
producing wrong token IDs and garbage output.

Add SetScores() to BPETokenizer and a greedy longest-match encoder that
selects tokens by length first, then by score. When merges are empty but
scores are present, encodeSegment automatically uses this path instead
of BPE merging.
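A minimal sketch of "longest first, then score" selection might look like the following. The vocabulary representation (a slice of token/score pairs, with the slice index standing in for the token ID) and the byte-skip fallback for unmatched input are assumptions; the PR's actual encoder lives in `encodeSegment`.

```go
package main

import "fmt"

type vocabEntry struct {
	token string
	score float32
}

// greedyEncode scans left to right. At each position it picks the vocab
// token matching the text there that is longest, breaking length ties by
// higher score, and emits the token's index as its ID. Unmatched bytes
// are skipped (a simplification; real tokenizers use byte fallback).
func greedyEncode(text string, vocab []vocabEntry) []int {
	var ids []int
	for pos := 0; pos < len(text); {
		best := -1
		for i, v := range vocab {
			n := len(v.token)
			if pos+n > len(text) || text[pos:pos+n] != v.token {
				continue
			}
			if best == -1 ||
				n > len(vocab[best].token) ||
				(n == len(vocab[best].token) && v.score > vocab[best].score) {
				best = i
			}
		}
		if best == -1 {
			pos++ // no match: skip one byte (simplified)
			continue
		}
		ids = append(ids, best)
		pos += len(vocab[best].token)
	}
	return ids
}

func main() {
	// "▁" is the SentencePiece word-boundary marker (U+2581).
	vocab := []vocabEntry{
		{"▁hel", -2.0}, {"▁hello", -3.0}, {"lo", -1.0}, {"▁", -0.5},
	}
	fmt.Println(greedyEncode("▁hello", vocab)) // longest match wins: [1]
}
```

Note that this greedy strategy is an approximation of true unigram (Viterbi) segmentation, which maximizes the total score over the whole string; greedy longest-match is simpler and matches what the PR description says the encoder does.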

Also extend the gguf.Metadata interface with GetFloat32Array and extract
tokenizer.ggml.scores in ExtractTokenizer so GGUF-loaded tokenizers
automatically use unigram encoding when appropriate.
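The metadata extension could plausibly take a shape like this. The interface and map-backed implementation below are illustrative assumptions, not the repository's `gguf` package; only the `GetFloat32Array` method name and the `tokenizer.ggml.scores` key come from the PR text.

```go
package main

import "fmt"

// Metadata is a stand-in for the gguf.Metadata interface, showing only
// the accessor the PR adds. The (value, ok) shape is an assumption.
type Metadata interface {
	GetFloat32Array(key string) ([]float32, bool)
}

// mapMetadata is a toy implementation backed by a map, for illustration.
type mapMetadata map[string]any

func (m mapMetadata) GetFloat32Array(key string) ([]float32, bool) {
	v, ok := m[key].([]float32)
	return v, ok
}

func main() {
	md := mapMetadata{"tokenizer.ggml.scores": []float32{-1.5, -2.25}}
	// ExtractTokenizer would pull the scores like this and hand them
	// to SetScores(), enabling the unigram encoding path.
	if scores, ok := md.GetFloat32Array("tokenizer.ggml.scores"); ok {
		fmt.Println(len(scores), scores[0])
	}
}
```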
@dndungu dndungu merged commit 59c06d0 into main Mar 26, 2026
1 check passed
@dndungu dndungu deleted the feat/sentencepiece-unigram branch March 26, 2026 16:20
