Skip to content

Add retry semantics and error recovery for adaptive interruption detection#1149

Merged
toubatbrian merged 5 commits intochenghao/fix/barge-in-fallbackfrom
brian/unify-interruption-transport-error-handling
Mar 19, 2026
Merged

Add retry semantics and error recovery for adaptive interruption detection#1149
toubatbrian merged 5 commits intochenghao/fix/barge-in-fallbackfrom
brian/unify-interruption-transport-error-handling

Conversation

@toubatbrian
Copy link
Contributor

Summary

Ports Python's recoverable/unrecoverable retry semantics for adaptive interruption detection (agents#5142) to the JS SDK, and fixes a JS-specific bug where interruption detection becomes permanently dead after a transient transport error.

Changes

Transport error classification (http_transport.ts, ws_transport.ts)

  • Wrap transport errors as APIError subclasses (APIStatusError, APIConnectionError, APITimeoutError) instead of generic Error — enables the retry loop to classify errors via isAPIError() and .retryable
  • Disable ofetch's internal retry (retry: 0) so retry logic is centralized in the stream consumer
  • Propagate errors via controller.error() instead of swallowing them with logger.error
  • Remove internal retry loop from WS ensureConnection() (retry is now handled at the stream level)
    Retry loop in createInterruptionTask (audio_recognition.ts)
  • Restructure from a single-stream-per-session design to a while loop that creates a fresh InterruptionStreamBase on each retry attempt, matching Python's _main_task pattern
  • Classify caught errors: retryable APIError → emit recoverable error + retry with exponential backoff; non-retryable or retries exhausted → emit unrecoverable error + break (triggers VAD fallback)
  • Catch forwardTask rejections in finally to prevent unhandled rejections from breaking the retry loop
  • Re-inject agent-speech-started sentinel on retry — JS creates a new stream per retry (unlike Python where self._agent_speech_started persists on the instance), so the new stream's agentSpeechStarted closure variable must be re-initialized
    Type and logic alignment (interruption.ts, agent_activity.ts, report.ts)
  • Remove false from InterruptionOptions.mode union type to match Python's Literal["adaptive", "vad"]
  • Simplify resolveInterruptionDetector conditions accordingly
  • Fix allowInterruptions computation in session report

Test plan

  • Build passes (pnpm build --filter agents)
  • Tested transient error scenario: single HTTP 500 → retry → adaptive interruption resumes and successfully detects interruptions
  • Tested persistent error scenario: repeated HTTP 500s → retries exhaust → emits unrecoverable error → falls back to VAD → user can still interrupt via VAD
  • Verified onInterruptionError correctly routes recoverable errors (log only) vs unrecoverable errors (fallback to VAD)
  • Error handling is transport-agnostic: both HTTP and WS transports propagate APIError subclasses through controller.error(), consumed identically by the retry loop

@toubatbrian toubatbrian requested a review from lukasIO March 19, 2026 02:16
@changeset-bot
Copy link

changeset-bot bot commented Mar 19, 2026

⚠️ No Changeset found

Latest commit: 8ded378

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@toubatbrian toubatbrian requested review from a team and chenghao-mou March 19, 2026 02:16
* @defaultValue undefined
*/
mode: 'adaptive' | 'vad' | false | undefined;
mode: 'adaptive' | 'vad' | undefined;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove false to keep aligned with python impl

devin-ai-integration[bot]

This comment was marked as resolved.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f1141df03e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 1023 to 1025
if (numRetries > 0 && this.isAgentSpeaking) {
await stream.pushFrame(InterruptionStreamSentinel.agentSpeechStarted());
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Rehydrate overlap state when recreating interruption stream

On retry, the code only re-injects agent-speech-started into the new interruption stream. If the transport failure happens during an active overlap, the replacement stream never receives overlap-speech-started, so its internal overlapSpeechStarted flag stays false and incoming audio frames are not sent for adaptive inference until a brand-new overlap begins. This causes the current interruption to be missed after a transient error, which undermines the intended error-recovery behavior.

Useful? React with 👍 / 👎.

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Copy link
Contributor

@lukasIO lukasIO left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apart from my other comment about catching abort before the async push, this looks good to me.

Generally not a fan of the nested while loop as it introduces a lot of logic flow through multiple break levels, but we can address this afterwards also.

Copy link
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 7 additional findings in Devin Review.

Open in Devin Review

Comment on lines +141 to +143
outputController?.error(
new APIConnectionError({ message: `WebSocket error: ${err.message}` }),
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Multiple controller.error() calls in ws_transport can crash due to WHATWG Streams assertion

The PR changes the socket.on('error') handler (line 141) and transform catch block (line 381) from logging errors to calling controller.error() / outputController.error(). These join the existing MSG_ERROR handler (ws_transport.ts:277) and timeout check (ws_transport.ts:368) that already call controller.error(). If any two of these fire in sequence (e.g., sendAudioData throws APIConnectionError because ws.readyState !== WebSocket.OPEN, the transform catch calls controller.error(), and then the pending socket.on('error') event fires and calls outputController.error()), the second call attempts to error an already-errored stream. In Node.js's WHATWG Streams implementation, ReadableStreamError asserts stream.state === 'readable', so the second call throws an AssertionError that is unhandled in the event handler, crashing the process.

The old code avoided this by only logging in socket.on('error') and transform catch, so only MSG_ERROR and the timeout check could call controller.error().

Prompt for agents
In agents/src/inference/interruption/ws_transport.ts, guard all calls to outputController.error() and controller.error() against the stream already being in an errored state. One approach: introduce a boolean flag (e.g. let streamErrored = false) scoped alongside outputController at line 123. Before each controller.error() call (lines 141-143, 277-280, 368-370, 381), check if streamErrored is false, and after a successful controller.error() call, set streamErrored = true. This prevents the second controller.error() call from triggering an assertion failure in Node.js's WHATWG Streams implementation. Affected locations:
- Line 141-143: socket.on('error') handler in setupMessageHandler
- Line 277-280: MSG_ERROR case in handleMessage
- Line 368-370: timeout check in transform
- Line 381: catch block in transform
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@toubatbrian toubatbrian merged commit 71af1c4 into chenghao/fix/barge-in-fallback Mar 19, 2026
2 checks passed
@toubatbrian toubatbrian deleted the brian/unify-interruption-transport-error-handling branch March 19, 2026 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants