fix: enable addLeadingSpace for SentencePiece unigram models#5

Merged
dndungu merged 1 commit into main from fix/viterbi-two-pass on Mar 26, 2026
@dndungu dndungu commented Mar 26, 2026

SetSentencePiece(true) now sets addLeadingSpace=true. This fixes the ▁ prefix not being prepended to the input, which caused word tokens to miss their vocabulary entries.

SetSentencePiece(true) now sets addLeadingSpace=true as a persistent
field on BPETokenizer, matching llama.cpp / SentencePiece default
behavior. Previously addLeadingSpace was only a parameter passed
through the call chain. Making it a field ensures the first word
always gets the ▁ prefix prepended, so tokens like ▁What are found
by the Viterbi DP instead of falling back to character-level tokens.
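
The effect of the flag on the text handed to the Viterbi DP can be
illustrated with a minimal sketch. It is not the code from this
commit: sketchTokenizer and preTokenize are hypothetical names, and
the behavior of SetSentencePiece(false) is an assumption, since the
description above only covers the true case.

```go
package main

import (
	"fmt"
	"strings"
)

// sketchTokenizer is a hypothetical stand-in for BPETokenizer. Only the
// two fields relevant to this change are modelled.
type sketchTokenizer struct {
	sentencePiece   bool
	addLeadingSpace bool // persistent field rather than a call parameter
}

// SetSentencePiece enables SentencePiece mode and, when enabled, turns on
// the leading-space default. Leaving addLeadingSpace untouched when on is
// false is an assumption of this sketch.
func (t *sketchTokenizer) SetSentencePiece(on bool) {
	t.sentencePiece = on
	if on {
		t.addLeadingSpace = true
	}
}

// preTokenize (hypothetical) maps spaces to ▁ and prepends one ▁ when
// addLeadingSpace is set, so the first word can match vocab entries
// such as "▁What".
func (t *sketchTokenizer) preTokenize(text string) string {
	if !t.sentencePiece || text == "" {
		return text
	}
	out := strings.ReplaceAll(text, " ", "▁")
	if t.addLeadingSpace {
		out = "▁" + out
	}
	return out
}

func main() {
	tok := &sketchTokenizer{}
	tok.SetSentencePiece(true)
	fmt.Println(tok.preTokenize("What is this")) // prints "▁What▁is▁this"
}
```

Without the prefix, the first piece would be "What", which a
SentencePiece vocabulary typically does not contain as a word-initial
token, hence the character-level fallback described above.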

Also adds SetAddLeadingSpace() for GGUF models that override the
default via tokenizer.ggml.add_space_prefix metadata.
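
A loader might apply the override roughly as follows. This is a sketch
under assumptions: ggufTokenizer, configureFromMetadata, and the
map[string]any metadata representation are hypothetical; only
SetSentencePiece, SetAddLeadingSpace, and the
tokenizer.ggml.add_space_prefix key come from the description above.

```go
package main

import "fmt"

// ggufTokenizer is a hypothetical stand-in for BPETokenizer.
type ggufTokenizer struct {
	sentencePiece   bool
	addLeadingSpace bool
}

// SetSentencePiece enables SentencePiece mode and its leading-space default.
func (t *ggufTokenizer) SetSentencePiece(on bool) {
	t.sentencePiece = on
	if on {
		t.addLeadingSpace = true
	}
}

// SetAddLeadingSpace overrides the default chosen by SetSentencePiece.
func (t *ggufTokenizer) SetAddLeadingSpace(on bool) {
	t.addLeadingSpace = on
}

// configureFromMetadata (hypothetical) applies the SentencePiece default
// first, then lets the GGUF metadata key override it when it is present
// and holds a bool.
func configureFromMetadata(t *ggufTokenizer, meta map[string]any) {
	t.SetSentencePiece(true)
	if v, ok := meta["tokenizer.ggml.add_space_prefix"].(bool); ok {
		t.SetAddLeadingSpace(v)
	}
}

func main() {
	tok := &ggufTokenizer{}
	configureFromMetadata(tok, map[string]any{
		"tokenizer.ggml.add_space_prefix": false,
	})
	fmt.Println(tok.addLeadingSpace) // prints "false"
}
```

Calling the override after SetSentencePiece matters in this sketch:
in the opposite order, SetSentencePiece(true) would reset the flag to
true and discard the metadata value.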
@dndungu dndungu merged commit 0d1b102 into main Mar 26, 2026
@dndungu dndungu deleted the fix/viterbi-two-pass branch March 26, 2026 18:20