
fix: use greedy longest-match for SentencePiece (matches llama.cpp)#6

Merged
dndungu merged 1 commit into main from fix/viterbi-length-bias
Mar 26, 2026
Conversation

@dndungu
Contributor

@dndungu dndungu commented Mar 26, 2026

Replace the Viterbi DP with a greedy longest match. Viterbi picked sub-word splits over whole words because the sum of the sub-token scores exceeded the single whole-token score. Greedy prefers the longest matching piece at each position, matching llama.cpp's behavior.

Replace Viterbi DP with greedy leftmost-longest match in
sentencePieceEncode. This matches llama.cpp's llm_tokenizer_spm
behavior and fixes token splitting where sub-token score sums
beat whole-token scores (e.g., "▁What" splitting into "Wh"+"at").
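The greedy leftmost-longest strategy described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `vocab` set, the function name, and the character-level fallback are all hypothetical (a real SentencePiece tokenizer scores pieces and emits byte-fallback tokens for unknown input).

```python
# Sketch of greedy leftmost-longest tokenization, the approach this PR
# adopts to mirror llama.cpp's llm_tokenizer_spm. The vocab here is a
# hypothetical set of pieces; real vocabs map pieces to ids and scores.
def greedy_encode(text, vocab, max_piece_len=16):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, shrinking until a match.
        end = min(len(text), i + max_piece_len)
        for j in range(end, i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No piece matched: fall back to a single character
            # (a real tokenizer would emit byte-fallback tokens here).
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"▁What", "▁", "Wh", "at", "W", "h", "a", "t"}
print(greedy_encode("▁What", vocab))  # → ['▁What'], not ['▁', 'Wh', 'at']
```

Because the longest piece wins at each position, `"▁What"` is kept whole even when the scores of `"Wh"` and `"at"` would have summed higher under Viterbi.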
@dndungu dndungu merged commit c6ceb7c into main Mar 26, 2026
@dndungu dndungu deleted the fix/viterbi-length-bias branch March 26, 2026 18:32
