fix: implement Viterbi SentencePiece encoding (replaces greedy) by dndungu · Pull Request #3 · zerfoo/ztoken

dndungu · 2026-03-26T16:33:41Z

Viterbi DP for globally optimal tokenization. Byte fallback via <0xNN>. Fixes Mistral garbage output. 6 new tests.

The greedy longest-match approach in sentencePieceEncode produced suboptimal tokenization for SentencePiece unigram models (e.g., Mistral 7B). Replace it with Viterbi dynamic programming that finds the segmentation maximizing the sum of log-probability scores.

Also adds:
- Byte fallback encoding/decoding via <0xNN> tokens for chars not in vocab
- decodeSentencePieceBytes for proper round-trip of byte fallback tokens
- Tests: Viterbi vs greedy, byte fallback, sentence round-trip, edge cases

dndungu merged commit 8f43e44 into main Mar 26, 2026
1 check passed

dndungu deleted the fix/viterbi-sentencepiece branch March 26, 2026 16:33

dndungu merged 1 commit into main from fix/viterbi-sentencepiece