Summary
Parakeet TDT v3 transcribes short Polish (and likely other Latin-script Slavic) utterances in Cyrillic. For example, "Wpisz google kropka com" becomes "Впиш гугл к ком.". Longer sentences transcribe correctly. English is unaffected.
Root Cause
The v3 JointDecision CoreML model performs argmax internally and outputs a single token_id (Int32). The shared 8,192-token vocabulary contains 1,161 Cyrillic tokens (~14%) alongside 6,987 Latin tokens. For short audio, the decoder LSTM does not accumulate enough language context, and because Polish is phonetically close to Slavic languages written in Cyrillic, the model picks Cyrillic tokens over Latin ones.
Because argmax happens inside the CoreML model, there is no point in the Swift pipeline where token probabilities are available for filtering or masking by language.
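To illustrate what is missing, here is a minimal sketch of language-aware masking, assuming the joint network exposed per-token logits rather than a single token_id. The function maskedArgmax, the toy logits, and the mask are all hypothetical and exist nowhere in the library. A real mask would presumably also need to leave special tokens such as blank untouched.

```swift
// Hypothetical sketch: none of these names exist in the library.
// Assumes the joint network exposed raw logits instead of a single token_id.

/// Returns the index of the highest-scoring token among those allowed,
/// or nil when the mask excludes every token.
func maskedArgmax(logits: [Float], allowed: [Bool]) -> Int? {
    precondition(logits.count == allowed.count, "mask must cover the vocabulary")
    var bestIndex: Int? = nil
    var bestScore = -Float.infinity
    for (index, score) in logits.enumerated() where allowed[index] {
        if bestIndex == nil || score > bestScore {
            bestIndex = index
            bestScore = score
        }
    }
    return bestIndex
}

// Toy 4-token vocabulary: tokens 1 and 3 stand in for Cyrillic entries.
let toyLogits: [Float] = [0.2, 1.7, 1.5, 0.9]
let latinOnly = [true, false, true, false]

// Without a mask the "Cyrillic" token 1 wins; with the mask, token 2 wins.
let unmasked = maskedArgmax(logits: toyLogits, allowed: [true, true, true, true])
let masked = maskedArgmax(logits: toyLogits, allowed: latinOnly)
```

With the current model this step cannot be performed in Swift at all, because only the winning token_id leaves CoreML.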
AsrManager.transcribe() ignores language
The public API:
public func transcribe(_ samples: [Float], source: AudioSource = .file) async throws -> ASRResult
has no language parameter. Callers (like TypeWhisper's Parakeet plugin) know the user's language but cannot pass it through.
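Until a language can be passed through, a caller that knows the user's language could at least detect the failure after the fact. The following is a rough sketch of such a check, not part of AsrManager: the helpers isCyrillic, cyrillicRatio, and looksMisScripted and the 0.5 threshold are assumptions made for illustration. It only flags a suspicious transcript; it does not repair it.

```swift
// Hypothetical caller-side check; not part of AsrManager.

/// True when the scalar lies in the basic Cyrillic block (U+0400 to U+04FF).
func isCyrillic(_ scalar: Unicode.Scalar) -> Bool {
    (0x0400...0x04FF).contains(scalar.value)
}

/// Fraction of alphabetic scalars in `text` that are Cyrillic; 0 when there are none.
func cyrillicRatio(of text: String) -> Double {
    let letters = text.unicodeScalars.filter { $0.properties.isAlphabetic }
    guard !letters.isEmpty else { return 0 }
    let cyrillic = letters.filter(isCyrillic).count
    return Double(cyrillic) / Double(letters.count)
}

/// Flags a transcript whose script contradicts a Latin-script language hint.
/// The threshold is an arbitrary illustrative value.
func looksMisScripted(_ text: String, threshold: Double = 0.5) -> Bool {
    cyrillicRatio(of: text) > threshold
}
```

A plugin expecting Polish could call looksMisScripted on the transcribed text and, if it returns true, warn the user or retry with more audio context.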
Reproduction
- Use AsrModels.downloadAndLoad(version: .v3) with AsrManager(config: .default)
- Record a short (1-3 second) Polish phrase, e.g., "Wpisz google kropka com"
- Transcribe — output will often be Cyrillic instead of Latin Polish
- Try a longer Polish sentence (10+ seconds) — output will be correct Latin Polish