
fix: use greedy longest-match for SentencePiece (matches llama.cpp)#6

Merged
dndungu merged 1 commit into main from fix/viterbi-length-bias
Mar 26, 2026
Conversation

@dndungu
Contributor

@dndungu dndungu commented Mar 26, 2026

Replace the Viterbi DP with a greedy longest match. Viterbi picked sub-word splits over whole words because the sum of the sub-token scores exceeded the single whole-token score. Greedy prefers the longest matching piece at each position, matching llama.cpp's behavior.

Replace Viterbi DP with greedy leftmost-longest match in
sentencePieceEncode. This matches llama.cpp's llm_tokenizer_spm
behavior and fixes token splitting where sub-token score sums
beat whole-token scores (e.g., "▁What" splitting into "Wh"+"at").
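The greedy leftmost-longest strategy described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `vocab` set, the function name, and the character-level fallback are all hypothetical (a real SentencePiece tokenizer scores pieces and emits byte-fallback tokens for unknown input).

```python
# Sketch of greedy leftmost-longest tokenization, the approach this PR
# adopts to mirror llama.cpp's llm_tokenizer_spm. The vocab here is a
# hypothetical set of pieces; real vocabs map pieces to ids and scores.
def greedy_encode(text, vocab, max_piece_len=16):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, shrinking until a match.
        end = min(len(text), i + max_piece_len)
        for j in range(end, i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No piece matched: fall back to a single character
            # (a real tokenizer would emit byte-fallback tokens here).
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"▁What", "▁", "Wh", "at", "W", "h", "a", "t"}
print(greedy_encode("▁What", vocab))  # → ['▁What'], not ['▁', 'Wh', 'at']
```

Because the longest piece wins at each position, `"▁What"` is kept whole even when the scores of `"Wh"` and `"at"` would have summed higher under Viterbi.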
@dndungu dndungu merged commit c6ceb7c into main Mar 26, 2026
@dndungu dndungu deleted the fix/viterbi-length-bias branch March 26, 2026 18:32
