fix: enable addLeadingSpace for SentencePiece unigram models by dndungu · Pull Request #5 · zerfoo/ztoken

dndungu · 2026-03-26T18:20:40Z

SetSentencePiece(true) now sets addLeadingSpace=true. This fixes the ▁ prefix not being prepended to the input, which caused word tokens to miss their vocab entries.

SetSentencePiece(true) now sets addLeadingSpace=true as a persistent field on BPETokenizer, matching llama.cpp / SentencePiece default behavior. Previously addLeadingSpace was only a parameter passed through the call chain. Making it a field ensures the first word always gets the ▁ prefix prepended, so tokens like ▁What are found by the Viterbi DP instead of falling back to character-level tokens. Also adds SetAddLeadingSpace() for GGUF models that override the default via tokenizer.ggml.add_space_prefix metadata.

dndungu merged commit 0d1b102 into main Mar 26, 2026

dndungu deleted the fix/viterbi-two-pass branch March 26, 2026 18:20

dndungu merged 1 commit into main from fix/viterbi-two-pass
