feat(gladia, soniox): add translation support with input_language/input_text on SpeechData #5111

Open
MSameerAbbas wants to merge 4 commits into livekit:main from MSameerAbbas:feat/soniox-full-feature-support

MSameerAbbas commented Mar 15, 2026

Summary

Adds real-time translation support to the Soniox and Gladia STT plugins by introducing input_language and input_text fields on the core SpeechData dataclass. Also includes a cleanup of the Soniox SpeechStream to align with patterns used by other plugins like Deepgram.

Closes #4943, closes #4402. Supersedes #5148.

Core type change

Two new optional fields on SpeechData in livekit-agents/livekit/agents/stt/stt.py:

  • input_language: LanguageCode | None -- the detected/input language spoken by the user. Populated by STT services that support translation, where language holds the target language and input_language holds the original spoken language.
  • input_text: str | None -- the original transcription in the input language, when translation is active.

Both default to None so existing behavior is completely unchanged.
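As a rough illustration of the shape described above, here is a minimal sketch. SpeechDataSketch is a hypothetical, simplified stand-in and not the real SpeechData class, which carries more fields. Plain str is used in place of LanguageCode to keep the example self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SpeechDataSketch:
    """Hypothetical, simplified stand-in for SpeechData (illustration only)."""

    language: str  # target language when translation is active
    text: str  # primary text (the translation when translation is active)
    input_language: str | None = None  # original spoken language, if known
    input_text: str | None = None  # original transcription, if translating


# Call sites that never pass the new fields keep working unchanged.
plain = SpeechDataSketch(language="en", text="hello")

# A translating plugin would fill in both extra fields.
translated = SpeechDataSketch(
    language="es", text="hola", input_language="en", input_text="hello"
)
```

Because both fields default to None, consumers can check `input_text is not None` to detect that translation was active.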

Soniox: new features + SpeechStream cleanup

New features:

  • Real-time translation (one-way and two-way) via TranslationConfig dataclass with __post_init__ validation
  • Configurable max_endpoint_delay_ms (500-3000ms) for tuning endpoint detection latency
  • models.py with Literal type aliases (SonioxRTModels, SonioxLanguages) for IDE autocomplete -- follows the same pattern as the Google STT plugin
  • Flush sentinel mapped to Soniox's documented FINALIZE_MSG for clean session shutdown. Previously not handled.
  • End-of-stream signal sent after input channel closes for graceful server-side shutdown
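The validation mentioned in the list above could look roughly like the following sketch. The class name TranslationConfigSketch, its field names (type, target_language, language_a, language_b), and the helper check_endpoint_delay are assumptions made for illustration; the PR's actual TranslationConfig may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal


@dataclass
class TranslationConfigSketch:
    # Field names are assumptions for illustration, not the plugin's API.
    type: Literal["one_way", "two_way"]
    target_language: str | None = None  # needed for one_way
    language_a: str | None = None  # needed for two_way
    language_b: str | None = None  # needed for two_way

    def __post_init__(self) -> None:
        # Fail at construction time rather than when the session starts.
        if self.type == "one_way":
            if not self.target_language:
                raise ValueError("one_way translation requires target_language")
        elif self.type == "two_way":
            if not (self.language_a and self.language_b):
                raise ValueError(
                    "two_way translation requires language_a and language_b"
                )
        else:
            raise ValueError(f"unknown translation type: {self.type!r}")


def check_endpoint_delay(ms: int) -> int:
    """Hypothetical helper: enforce the 500-3000 ms range quoted above."""
    if not 500 <= ms <= 3000:
        raise ValueError(f"max_endpoint_delay_ms must be in [500, 3000], got {ms}")
    return ms


two_way = TranslationConfigSketch(type="two_way", language_a="en", language_b="ur")
```

Validating in __post_init__ means a misconfigured session fails when the config object is built, before any WebSocket connection is attempted.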

SpeechStream cleanup (done while adding translation, to simplify the streaming implementation and match other plugins):

  • Consolidated into a single _run() that connects, runs tasks, and cleans up. The base class _main_task() already handles retry logic, so the plugin doesn't need its own retry loop.
  • Reduced task count (4 -> 3): the intermediate audio_queue between _prepare_audio_task and _send_audio_task was consolidated into a single _send_task that reads _input_ch directly.
  • Subtasks receive the WebSocket as a parameter rather than reading self._ws, similar to how the Deepgram plugin passes connection state.
  • Server errors now raise APIConnectionError (5xx) or APIStatusError (4xx) so the base class can decide whether to retry. Unexpected WebSocket closure raises instead of silently returning.
  • END_OF_SPEECH decoupled from FINAL_TRANSCRIPT in flush_endpoint. Previously both were gated on final text being present, so an error arriving mid-speech (after interim tokens but before finalization) skipped END_OF_SPEECH and left downstream consumers stuck in the speaking state. This is a pre-existing bug that is also present on main. It only affects turn_detection="stt" (no VAD).
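A minimal sketch of the single-sender idea from the cleanup list: one task reads the input channel directly, receives the WebSocket as a parameter, maps the flush sentinel to a finalize message, and sends an end-of-stream signal once the channel closes. FakeWebSocket, FlushSentinel, the None close marker, and the exact FINALIZE_MSG and end-of-stream payloads are all assumptions for illustration, not the plugin's code.

```python
import asyncio
import json


class FlushSentinel:
    """Stand-in for the framework's flush marker (assumption)."""


class FakeWebSocket:
    """Records outgoing frames so the sketch runs without a network."""

    def __init__(self) -> None:
        self.sent = []

    async def send_bytes(self, data: bytes) -> None:
        self.sent.append(data)

    async def send_str(self, data: str) -> None:
        self.sent.append(data)


FINALIZE_MSG = json.dumps({"type": "finalize"})  # assumed payload
END_OF_STREAM = ""  # assumed: an empty frame signals end of stream


async def send_task(ws, input_ch: asyncio.Queue) -> None:
    """Single sender: no intermediate queue, ws passed in explicitly."""
    while True:
        item = await input_ch.get()
        if item is None:  # None marks a closed channel in this sketch
            break
        if isinstance(item, FlushSentinel):
            await ws.send_str(FINALIZE_MSG)
        else:
            await ws.send_bytes(item)
    # Tell the server no more audio is coming so it can shut down cleanly.
    await ws.send_str(END_OF_STREAM)


async def demo() -> list:
    ws = FakeWebSocket()
    ch: asyncio.Queue = asyncio.Queue()
    for item in (b"\x00\x01", FlushSentinel(), None):
        ch.put_nowait(item)
    await send_task(ws, ch)
    return ws.sent


frames = asyncio.run(demo())
```

Passing ws as an argument keeps the task free of shared mutable state, which is the motivation the list gives for not reading self._ws.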

Translation design: Dict-keyed accumulators ("original" / "translation") route tokens by translation_status. At output time, _pick_primary selects the translation accumulator if it has content, otherwise falls back to original. _build_speech_data attaches input_text/input_language from the original accumulator when the primary is a translation. When translation is off, all tokens route to "original" and the identity check (primary is not original) skips the input fields -- one code path, no flags, no branching.
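The routing and selection described above can be sketched as follows. The Accumulator class, the token dictionaries, and the helper names are simplified stand-ins chosen for illustration; the assumption here is that tokens carry a translation_status of "translation" for translated text and some other value (or none) for original text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Accumulator:
    """Simplified stand-in for the plugin's token accumulator."""

    parts: list[str] = field(default_factory=list)
    language: str | None = None

    def add(self, text: str, language: str | None) -> None:
        self.parts.append(text)
        if language:
            self.language = language

    @property
    def text(self) -> str:
        return "".join(self.parts)


def new_accumulators() -> dict[str, Accumulator]:
    return {"original": Accumulator(), "translation": Accumulator()}


def route(accs: dict[str, Accumulator], token: dict) -> None:
    # Translated tokens go to "translation"; everything else to "original".
    is_translation = token.get("translation_status") == "translation"
    key = "translation" if is_translation else "original"
    accs[key].add(token["text"], token.get("language"))


def pick_primary(accs: dict[str, Accumulator]) -> Accumulator:
    # Prefer the translation when it has content, else fall back to original.
    translation = accs["translation"]
    return translation if translation.text else accs["original"]


def build_speech_data(accs: dict[str, Accumulator]) -> dict:
    primary = pick_primary(accs)
    original = accs["original"]
    data = {
        "language": primary.language,
        "text": primary.text,
        "input_language": None,
        "input_text": None,
    }
    # Identity check: only attach input fields when primary is a translation.
    if primary is not original:
        data["input_language"] = original.language
        data["input_text"] = original.text
    return data
```

With translation off, every token lands in "original", pick_primary returns that same object, and the identity check leaves both input fields as None.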

Gladia: translation fields

Surgical addition of input_language and input_text to the existing translation handler in _process_gladia_message. Extracts original utterance language and text from the translation message data and attaches them to the SpeechData. No structural changes.
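A hedged sketch of that extraction step, assuming the translation message nests the original utterance under data.utterance and the translated one under data.translated_utterance. Those key names and the helper extract_translation_fields are assumptions for illustration; the real handler lives in _process_gladia_message and may read the payload differently.

```python
def extract_translation_fields(message: dict) -> dict:
    """Pull translated and original text/language from an assumed payload shape."""
    data = message.get("data", {})
    original = data.get("utterance", {})  # assumed key
    translated = data.get("translated_utterance", {})  # assumed key
    return {
        "language": translated.get("language"),
        "text": translated.get("text"),
        # Fields introduced by this PR: the original utterance.
        "input_language": original.get("language"),
        "input_text": original.get("text"),
    }


sample = {
    "type": "translation",
    "data": {
        "utterance": {"language": "en", "text": "hello"},
        "translated_utterance": {"language": "es", "text": "hola"},
    },
}
fields = extract_translation_fields(sample)
```

Using .get throughout means a message that lacks the original utterance yields None for the input fields instead of raising.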

What was NOT changed

  • _TokenAccumulator class -- existing behavior kept as-is; the only addition is a merge classmethod
  • STT class -- kept as-is
  • All STTOptions defaults preserved (model, sample_rate, num_channels, etc.)
  • Context dataclasses (ContextObject, ContextGeneralItem, ContextTranslationTerm) -- unchanged
  • Gladia plugin structure -- no cleanup, only the translation field addition

Files changed

| File | Change |
| --- | --- |
| livekit-agents/livekit/agents/stt/stt.py | Add input_language, input_text to SpeechData |
| livekit-plugins/.../soniox/stt.py | SpeechStream rewrite + translation support |
| livekit-plugins/.../soniox/__init__.py | Export TranslationConfig, SonioxLanguages, SonioxRTModels |
| livekit-plugins/.../soniox/models.py | New file with Literal type aliases |
| livekit-plugins/.../gladia/stt.py | Add input_language/input_text to translation handler |

Test plan

  • Two-way translation (en/ur) -- verified both directions produce correct input_text and input_language
  • One-way translation (to ur) -- verified single target language translation
  • No translation (backward compat) -- verified input_text and input_language are None, identical to previous behavior
  • max_endpoint_delay_ms -- verified API accepts the parameter
  • TranslationConfig validation -- verified __post_init__ catches missing required fields
  • END_OF_SPEECH lifecycle -- verified flush_endpoint emits END_OF_SPEECH independently of final text presence
  • Ruff format and lint -- all checks passed
  • mypy strict -- 0 new errors (1 pre-existing across all STT plugins)
  • Unit test suite (294 passed, 2 skipped, 9 errors from missing LiveKit server -- pre-existing)

Refs: #4943, #4402

devin-ai-integration[bot]

This comment was marked as resolved.

@MSameerAbbas
Contributor Author

Hey @tinalenguyen, I saw this was assigned to you - hope it's helpful! Would love your review.

MSameerAbbas changed the title from "feat(soniox): add real-time translation support and rewrite SpeechStream" to "feat(gladia, soniox): add translation support with input_language/input_text on SpeechData" on Mar 20, 2026
devin-ai-integration[bot]

This comment was marked as resolved.

Rewrite the Soniox STT plugin to support all WebSocket API features and
fix structural issues in the streaming implementation.

New features:
- Real-time translation (one-way and two-way) via TranslationConfig
- Configurable max_endpoint_delay_ms (500-3000ms)
- Typed Literal autocomplete for models, languages, and translation type
- Flush sentinel mapped to FINALIZE_MSG for clean session shutdown

Structural fixes:
- Remove dead reconnect machinery (_reconnect_event was never set)
- Eliminate unnecessary intermediate audio queue (2 tasks -> 1)
- Pass ws as parameter to subtasks instead of mutable self._ws
- Single connection lifecycle in _run(); base class handles retries
- Proper error semantics (5xx -> APIConnectionError, 4xx -> APIStatusError)
- Raise on unexpected WS closure instead of silent hang
- Handle _FlushSentinel (was silently dropped)
- Remove unreachable except clause
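
The error-semantics item above can be sketched as a small mapping function. The two exception classes below are local stand-ins that only borrow the names used in this commit message; they are not the framework's real classes, and error_from_server is a hypothetical helper.

```python
class APIConnectionError(Exception):
    """Stand-in: a failure the base class may treat as retryable."""


class APIStatusError(Exception):
    """Stand-in: a request-level failure carrying the status code."""

    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


def error_from_server(code: int, message: str) -> Exception:
    # 5xx: the server side failed, so surface it as a connection error.
    if code >= 500:
        return APIConnectionError(message)
    # 4xx (and anything else): surface the status for the caller to inspect.
    return APIStatusError(message, status_code=code)


server_side = error_from_server(503, "service unavailable")
client_side = error_from_server(400, "bad request")
```

Returning distinct exception types lets the caller decide whether a retry makes sense without parsing message strings.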

Translation design:
- alternatives[0] = original text (always present)
- alternatives[1] = translated text (when translation is enabled)
- Fully backward-compatible: all consumers read alternatives[0]
- Dict-keyed accumulators with no special cases

Refs: livekit#4943
… translation support

Add input_language and input_text fields to the core SpeechData dataclass
so STT plugins can expose the original spoken text alongside translations.
Update both Soniox and Gladia plugins to populate these fields.

- SpeechData.input_language: the detected language spoken by the user
- SpeechData.input_text: the original transcription before translation
- Soniox: use dict-keyed accumulators with _pick_primary selection
- Gladia: extract original utterance from translation message data
- Replaces the previous alternatives[1] approach with first-class fields
…dpoint

Emit END_OF_SPEECH based on speaking state, not final text presence.
Previously both were inside the same conditional, so if an error or
finished message arrived while speaking but before final tokens
accumulated, END_OF_SPEECH was skipped. This left downstream consumers
in speaking state with no turn detection triggered.

Only affects agents using turn_detection=stt (no VAD). Pre-existing
bug also present on main and livekit#5148.
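
A minimal sketch of the decoupling described in this commit message, using a hypothetical pure function in place of the real flush_endpoint method. Event names are plain strings here for illustration.

```python
from __future__ import annotations


def flush_endpoint(final_text: str, is_speaking: bool) -> tuple[list[str], bool]:
    """Return (events, new_is_speaking) after an endpoint, error, or finish."""
    events: list[str] = []
    # FINAL_TRANSCRIPT still depends on having final text.
    if final_text:
        events.append("FINAL_TRANSCRIPT")
    # END_OF_SPEECH depends only on speaking state, so a mid-speech error
    # with no final tokens still closes the turn.
    if is_speaking:
        events.append("END_OF_SPEECH")
        is_speaking = False
    return events, is_speaking


mid_speech_error = flush_endpoint("", True)
normal_endpoint = flush_endpoint("hello", True)
```

In the first call there is no final text, yet END_OF_SPEECH is still emitted, which is the case the old shared conditional skipped.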
MSameerAbbas force-pushed the feat/soniox-full-feature-support branch from 47983a7 to 27fc775 on March 20, 2026 15:42
Development

Successfully merging this pull request may close these issues.

  • Add Soniox Real-Time Translation Support to livekit-plugins-soniox
  • Feature Request: Add Original Language Detection to Gladia STT Plugin
