Short utterances in Latin-script languages transcribed as Cyrillic [Parakeet TDT v3] #512

@tajchert

Summary

Parakeet TDT v3 transcribes short Polish (and likely other Latin-script Slavic) utterances in Cyrillic. For example, "Wpisz google kropka com" becomes "Впиш гугл к ком.". Longer sentences transcribe correctly. English is unaffected.

Root Cause

The v3 JointDecision CoreML model performs argmax internally and outputs a single token_id (Int32). The shared 8,192-token vocabulary contains 1,161 Cyrillic tokens (~14%) alongside 6,987 Latin tokens. With short audio, the decoder LSTM likely does not accumulate enough language context, and because Polish is phonetically close to Slavic languages written in Cyrillic, the model picks Cyrillic tokens over Latin ones.

Because argmax happens inside the CoreML model, there is no point in the Swift pipeline where token probabilities are available for filtering or masking by language.
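To illustrate what is lost, here is a minimal sketch of script-aware masking, assuming the joint network exposed per-token logits instead of a single token_id. Everything in it (`Script`, `script(of:)`, `maskedArgmax`, the four-token vocabulary and its scores) is hypothetical and not part of FluidAudio or the CoreML model; the Unicode ranges are an approximation of "Latin" and "Cyrillic".

```swift
/// Hypothetical script tag for a vocabulary entry (not a FluidAudio type).
enum Script { case latin, cyrillic, other }

/// Classifies a token by the first Latin or Cyrillic scalar it contains.
/// Ranges are approximate: Basic Latin letters plus Latin-1/Extended-A/B,
/// and Cyrillic plus Cyrillic Supplement.
func script(of token: String) -> Script {
    for scalar in token.unicodeScalars {
        switch scalar.value {
        case 0x0400...0x052F:
            return .cyrillic
        case 0x0041...0x005A, 0x0061...0x007A, 0x00C0...0x024F:
            return .latin
        default:
            continue // skip markers such as "▁", digits, punctuation
        }
    }
    return .other
}

/// Returns the index of the highest logit among tokens whose script
/// is not `blocked`, or nil if every token is blocked.
func maskedArgmax(logits: [Float], scripts: [Script], blocked: Script) -> Int? {
    precondition(logits.count == scripts.count, "one script tag per logit")
    var best: Int? = nil
    for i in logits.indices where scripts[i] != blocked {
        if let b = best {
            if logits[i] > logits[b] { best = i }
        } else {
            best = i
        }
    }
    return best
}

// Toy vocabulary and scores, invented for illustration only.
let toyVocab = ["▁w", "▁в", "pisz", "пиш"]
let toyLogits: [Float] = [1.2, 1.5, 0.3, 0.9]
let toyScripts = toyVocab.map { script(of: $0) }

// Plain argmax would pick index 1 ("▁в"); with Cyrillic blocked,
// the best remaining candidate is index 0 ("▁w").
let toyChoice = maskedArgmax(logits: toyLogits, scripts: toyScripts, blocked: .cyrillic)
```

With the current model, none of this can run in Swift, because only the winning token_id leaves CoreML.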

AsrManager.transcribe() ignores language

The public API:

public func transcribe(_ samples: [Float], source: AudioSource = .file) async throws -> ASRResult

has no language parameter. Callers (like TypeWhisper's Parakeet plugin) know the user's language but cannot pass it through.
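Until a language can be passed through, the only option on the caller side seems to be checking the returned text after the fact. A minimal sketch of such a check follows; `cyrillicRatio(of:)` and `looksMisScripted(_:threshold:)` are hypothetical helpers, not FluidAudio API, and the 0.5 threshold is an arbitrary assumption.

```swift
/// Fraction of alphabetic scalars in `text` that fall in the Cyrillic or
/// Cyrillic Supplement blocks. Returns 0 when the text has no letters.
func cyrillicRatio(of text: String) -> Double {
    var letters = 0
    var cyrillic = 0
    for scalar in text.unicodeScalars where scalar.properties.isAlphabetic {
        letters += 1
        if (0x0400...0x052F).contains(scalar.value) {
            cyrillic += 1
        }
    }
    return letters == 0 ? 0 : Double(cyrillic) / Double(letters)
}

/// Flags a transcript as suspicious when the caller expects a Latin-script
/// language but most letters are Cyrillic. The threshold is an assumption.
func looksMisScripted(_ text: String, threshold: Double = 0.5) -> Bool {
    return cyrillicRatio(of: text) > threshold
}

// The two strings from the summary above.
let badTranscript = "Впиш гугл к ком."
let goodTranscript = "Wpisz google kropka com"
let badFlag = looksMisScripted(badTranscript)
let goodFlag = looksMisScripted(goodTranscript)
```

This only detects the problem. It cannot recover the Latin transcription, since the alternative tokens were already discarded inside the model.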

Reproduction

  1. Use AsrModels.downloadAndLoad(version: .v3) with AsrManager(config: .default)
  2. Record a short (1-3 second) Polish phrase, e.g., "Wpisz google kropka com"
  3. Transcribe it: the output will often be Cyrillic instead of Latin-script Polish
  4. Try a longer Polish sentence (10+ seconds): the output will be correct Latin-script Polish
