feat(gladia, soniox): add translation support with input_language/input_text on SpeechData #5111
Open
MSameerAbbas wants to merge 4 commits into livekit:main
Conversation
Author

Hey @tinalenguyen, I saw this was assigned to you - hope it's helpful! Would love your review.
Rewrite the Soniox STT plugin to support all WebSocket API features and fix structural issues in the streaming implementation.

New features:
- Real-time translation (one-way and two-way) via TranslationConfig
- Configurable max_endpoint_delay_ms (500-3000ms)
- Typed Literal autocomplete for models, languages, and translation type
- Flush sentinel mapped to FINALIZE_MSG for clean session shutdown

Structural fixes:
- Remove dead reconnect machinery (_reconnect_event was never set)
- Eliminate unnecessary intermediate audio queue (2 tasks -> 1)
- Pass ws as parameter to subtasks instead of mutable self._ws
- Single connection lifecycle in _run(); base class handles retries
- Proper error semantics (5xx -> APIConnectionError, 4xx -> APIStatusError)
- Raise on unexpected WS closure instead of silent hang
- Handle _FlushSentinel (was silently dropped)
- Remove unreachable except clause

Translation design:
- alternatives[0] = original text (always present)
- alternatives[1] = translated text (when translation is enabled)
- Fully backward-compatible: all consumers read alternatives[0]
- Dict-keyed accumulators with no special cases

Refs: livekit#4943
… translation support

Add input_language and input_text fields to the core SpeechData dataclass so STT plugins can expose the original spoken text alongside translations. Update both Soniox and Gladia plugins to populate these fields.

- SpeechData.input_language: the detected language spoken by the user
- SpeechData.input_text: the original transcription before translation
- Soniox: use dict-keyed accumulators with _pick_primary selection
- Gladia: extract original utterance from translation message data
- Replaces the previous alternatives[1] approach with first-class fields
…dpoint

Emit END_OF_SPEECH based on speaking state, not final text presence. Previously both were inside the same conditional, so if an error or finished message arrived while speaking but before final tokens accumulated, END_OF_SPEECH was skipped. This left downstream consumers in speaking state with no turn detection triggered. Only affects agents using turn_detection=stt (no VAD). Pre-existing bug also present on main and livekit#5148.
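The fix described in this commit can be sketched as a before/after of one conditional. All names here (`on_endpoint`, `is_speaking`, `final_text`, `emit_event`) are illustrative stand-ins, not the plugin's actual identifiers.

```python
from typing import Callable

def on_endpoint(
    is_speaking: bool,
    final_text: str,
    emit_event: Callable[[str, object], None],
) -> None:
    # Before the fix, both events were gated on final text, so an
    # endpoint reached with no accumulated final tokens skipped
    # END_OF_SPEECH too:
    #
    #   if is_speaking and final_text:
    #       emit_event("FINAL_TRANSCRIPT", final_text)
    #       emit_event("END_OF_SPEECH", None)
    #
    # After: FINAL_TRANSCRIPT depends only on text presence,
    # END_OF_SPEECH only on speaking state.
    if final_text:
        emit_event("FINAL_TRANSCRIPT", final_text)
    if is_speaking:
        emit_event("END_OF_SPEECH", None)
```

With the decoupled checks, a speaker that stops without finalized text still gets its turn closed out.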
Summary
Adds real-time translation support to the Soniox and Gladia STT plugins by introducing `input_language` and `input_text` fields on the core `SpeechData` dataclass. Also includes a cleanup of the Soniox `SpeechStream` to align with patterns used by other plugins like Deepgram.

Closes #4943, closes #4402. Supersedes #5148.
Core type change
Two new optional fields on `SpeechData` in `livekit-agents/livekit/agents/stt/stt.py`:

- `input_language: LanguageCode | None` -- the detected/input language spoken by the user. Populated by STT services that support translation, where `language` holds the target language and `input_language` holds the original spoken language.
- `input_text: str | None` -- the original transcription in the input language, when translation is active.

Both default to `None`, so existing behavior is completely unchanged.

Soniox: new features + SpeechStream cleanup
New features:
- `TranslationConfig` dataclass with `__post_init__` validation
- `max_endpoint_delay_ms` (500-3000ms) for tuning endpoint detection latency
- `models.py` with `Literal` type aliases (`SonioxRTModels`, `SonioxLanguages`) for IDE autocomplete -- follows the same pattern as the Google STT plugin
- Flush sentinel mapped to `FINALIZE_MSG` for clean session shutdown. Previously not handled.

SpeechStream cleanup (while adding translation, simplified the streaming implementation to match other plugins):
- Single `_run()` that connects, runs tasks, and cleans up. The base class `_main_task()` already handles retry logic, so the plugin doesn't need its own retry loop.
- The unnecessary intermediate `audio_queue` between `_prepare_audio_task` and `_send_audio_task` was consolidated into a single `_send_task` that reads `_input_ch` directly.
- `ws` is passed as a parameter to subtasks instead of mutable `self._ws`, similar to how the Deepgram plugin passes connection state.
- Errors are mapped to `APIConnectionError` (5xx) or `APIStatusError` (4xx) so the base class can decide whether to retry. Unexpected WebSocket closure raises instead of silently returning.
- `END_OF_SPEECH` decoupled from `FINAL_TRANSCRIPT` in `flush_endpoint` -- previously both were gated on final text presence, so an error arriving mid-speech (after interim tokens but before finalization) would skip `END_OF_SPEECH`, leaving downstream consumers stuck in speaking state. Pre-existing bug also present on `main`. Only affects `turn_detection="stt"` (no VAD).

Translation design: Dict-keyed accumulators (`"original"` / `"translation"`) route tokens by `translation_status`. At output time, `_pick_primary` selects the translation accumulator if it has content, otherwise falls back to original. `_build_speech_data` attaches `input_text`/`input_language` from the original accumulator when the primary is a translation. When translation is off, all tokens route to `"original"` and the identity check (`primary is not original`) skips the input fields -- one code path, no flags, no branching.
Gladia: translation fields

Surgical addition of `input_language` and `input_text` to the existing translation handler in `_process_gladia_message`. Extracts the original utterance language and text from the translation message data and attaches them to the `SpeechData`. No structural changes.
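A hedged sketch of that extraction: the message keys used here (`utterance`, `translated_utterance`) are assumptions about Gladia's translation payload, not a confirmed schema, and the handler name is illustrative.

```python
def handle_translation_message(msg: dict) -> dict:
    """Illustrative extraction of original + translated utterance
    from a Gladia translation message (schema assumed)."""
    data = msg.get("data", {})
    original = data.get("utterance", {})               # assumed key
    translated = data.get("translated_utterance", {})  # assumed key
    return {
        "text": translated.get("text", ""),
        "language": translated.get("language"),
        # New fields: carry the pre-translation utterance alongside.
        "input_text": original.get("text") or None,
        "input_language": original.get("language"),
    }
```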
What was NOT changed

- `_TokenAccumulator` class -- kept as-is, added a `merge` classmethod
- `STT` class -- kept as-is
- `STTOptions` defaults preserved (model, sample_rate, num_channels, etc.)
- Context types (`ContextObject`, `ContextGeneralItem`, `ContextTranslationTerm`) -- unchanged

Files changed
- `livekit-agents/livekit/agents/stt/stt.py` -- `input_language`, `input_text` to `SpeechData`
- `livekit-plugins/.../soniox/stt.py`
- `livekit-plugins/.../soniox/__init__.py` -- `TranslationConfig`, `SonioxLanguages`, `SonioxRTModels`
- `livekit-plugins/.../soniox/models.py` -- `Literal` type aliases
- `livekit-plugins/.../gladia/stt.py` -- `input_language`/`input_text` to translation handler

Test plan
- Translation enabled -- verified `input_text` and `input_language` are populated
- Translation disabled -- `input_text` and `input_language` are `None`, identical to previous behavior
- `max_endpoint_delay_ms` -- verified API accepts the parameter
- `TranslationConfig` validation -- verified `__post_init__` catches missing required fields
- `END_OF_SPEECH` lifecycle -- verified `flush_endpoint` emits `END_OF_SPEECH` independently of final text presence

Refs: #4943, #4402