feat: add SentencePiece unigram encoding for models without merges #2
Merged
Conversation
- WordPieceTokenizer implementing the Tokenizer interface with greedy longest-prefix subword splitting and ## continuation tokens
- EncodeForBERT method producing input_ids, attention_mask, and token_type_ids for single sentences and sentence pairs, with padding
- Pre-tokenization splitting on whitespace and punctuation boundaries
- Load function in loader.go dispatching to BPE or WordPiece based on model.type in tokenizer.json
- extractSpecialTokens recognizes BERT-style [CLS]/[SEP]/[PAD]/[UNK]
- Comprehensive tests: encode, decode, round-trip, BERT format, padding, sentence pairs, pre-tokenization, loader integration
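The greedy longest-prefix splitting described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the function name, vocabulary type, and the whole-word `[UNK]` fallback are assumptions.

```go
package main

import "fmt"

// wordPieceSplit greedily matches the longest vocabulary prefix at each
// position; pieces after the first carry the "##" continuation marker.
// If no prefix matches, the whole word maps to [UNK], mirroring common
// WordPiece behavior (an assumption about this implementation).
func wordPieceSplit(word string, vocab map[string]bool) []string {
	var pieces []string
	runes := []rune(word)
	start := 0
	for start < len(runes) {
		var match string
		// Try the longest candidate first, shrinking until one is in-vocab.
		for end := len(runes); end > start; end-- {
			sub := string(runes[start:end])
			if start > 0 {
				sub = "##" + sub
			}
			if vocab[sub] {
				match = sub
				start = end
				break
			}
		}
		if match == "" {
			return []string{"[UNK]"} // no prefix matched at all
		}
		pieces = append(pieces, match)
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"un": true, "##aff": true, "##able": true}
	fmt.Println(wordPieceSplit("unaffable", vocab)) // [un ##aff ##able]
}
```

Because matching is greedy per position rather than globally optimal, the split depends only on the vocabulary, which keeps encoding linear-ish in word length and deterministic.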
SentencePiece unigram models (e.g., Mistral 7B GGUF) provide vocabulary scores but no BPE merge table. Without this, encoding fails silently, producing wrong token IDs and garbage output. Add SetScores() to BPETokenizer and a greedy longest-match encoder that selects tokens by length first, then by score. When merges are empty but scores are present, encodeSegment automatically uses this path instead of BPE merging. Also extend the gguf.Metadata interface with GetFloat32Array and extract tokenizer.ggml.scores in ExtractTokenizer so GGUF-loaded tokenizers automatically use unigram encoding when appropriate.
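The greedy longest-match path can be sketched as below. This is a hedged illustration of the idea (longest piece first, score as tie-breaker), not the PR's encodeSegment: the function name, vocab/score layout, and the skip-on-no-match fallback are assumptions.

```go
package main

import "fmt"

// greedyUnigramEncode scans left to right; at each position it considers
// every vocabulary piece that matches, preferring the longest and breaking
// length ties by the higher SentencePiece score (scores are log-probs).
// A real encoder would fall back to UNK or byte tokens on a miss; here we
// simply skip the rune (an assumption for brevity).
func greedyUnigramEncode(text string, vocab map[string]int, scores []float32) []int {
	var ids []int
	runes := []rune(text)
	for start := 0; start < len(runes); {
		bestLen, bestID := 0, -1
		for end := start + 1; end <= len(runes); end++ {
			if id, ok := vocab[string(runes[start:end])]; ok {
				length := end - start
				if length > bestLen ||
					(length == bestLen && bestID >= 0 && scores[id] > scores[bestID]) {
					bestLen, bestID = length, id
				}
			}
		}
		if bestID < 0 {
			start++ // no vocabulary piece matches here
			continue
		}
		ids = append(ids, bestID)
		start += bestLen
	}
	return ids
}

func main() {
	vocab := map[string]int{"He": 0, "Hello": 1, "llo": 2}
	scores := []float32{-1.5, -2.0, -1.0}
	fmt.Println(greedyUnigramEncode("Hello", vocab, scores)) // [1]
}
```

Unlike true unigram Viterbi decoding, this greedy variant needs no dynamic programming, which makes it a cheap drop-in when merges are absent but scores are present.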
Greedy longest-match encoder using token scores. Fixes the Mistral/Llama SentencePiece tokenizer path (no BPE merges). 7 tests.