
fix(cloud-agent-sdk): recover stale WebSocket connections on tab resume #1919

Merged: eshurakov merged 4 commits into main from eshurakov/spotted-hemisphere on Apr 2, 2026

@eshurakov
Contributor

Summary

Fixes half-open WebSocket connections in cloud agent sessions. When a user backgrounds a tab and returns after the TCP socket has silently died, onclose never fires and the session appears frozen. This adds application-level staleness detection and recovery.

Connection layer (base-connection.ts):

  • Listens for visibilitychange, pageshow (BFCache), and online events
  • Sends an application-level ping on tab resume; if no message arrives within 5 seconds, closes the stale socket and reconnects
  • Proactively refreshes the stream ticket before reconnecting (refreshAndConnect), fixing expired JWT failures on reconnect
  • Resets reconnect attempt counter when the user returns to the tab
  • Cleans up all event listeners on destroy()
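
The listener wiring and cleanup described above could look roughly like the following. This is a minimal sketch, not the code in base-connection.ts: `watchResume`, `ListenerTarget`, and the injected `isVisible` and `onResume` callbacks are hypothetical names, chosen so the example runs outside a browser.

```typescript
// Minimal listener-target shape so the sketch also runs outside a browser.
type Listener = () => void;

interface ListenerTarget {
  addEventListener(type: string, listener: Listener): void;
  removeEventListener(type: string, listener: Listener): void;
}

// Hypothetical helper: subscribes to the three resume signals and returns
// a cleanup function that removes every listener it added.
function watchResume(
  doc: ListenerTarget,
  win: ListenerTarget,
  isVisible: () => boolean,
  onResume: Listener,
): () => void {
  // visibilitychange fires on hide and on show; only react when shown.
  const onVisibility: Listener = () => {
    if (isVisible()) onResume();
  };
  // Keep stable references so the same functions can be removed later.
  const onPageShow: Listener = () => onResume();
  const onOnline: Listener = () => onResume();

  doc.addEventListener("visibilitychange", onVisibility);
  win.addEventListener("pageshow", onPageShow);
  win.addEventListener("online", onOnline);

  let removed = false;
  return () => {
    if (removed) return; // safe to call more than once
    removed = true;
    doc.removeEventListener("visibilitychange", onVisibility);
    win.removeEventListener("pageshow", onPageShow);
    win.removeEventListener("online", onOnline);
  };
}
```

In a browser this would be called as `watchResume(document, window, () => document.visibilityState === "visible", onResume)`, with the returned function invoked from the connection's teardown path.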

Transport layer (cloud-agent-transport.ts, cli-live-transport.ts):

  • onReconnected callback refetches the session snapshot and replays it into the sink, avoiding blank screens after reconnection
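
A rough sketch of the refetch-and-replay step follows. The names `SessionSnapshot`, `MessageSink`, and `createOnReconnected` are hypothetical, and both the sink reset and the guard against overlapping reconnects are assumptions rather than behavior confirmed by this PR.

```typescript
// Hypothetical shapes; the real snapshot and sink types live in the SDK.
interface SessionSnapshot {
  messages: string[];
}

interface MessageSink {
  reset(): void;
  push(message: string): void;
}

// Builds an onReconnected handler that refetches the snapshot and replays it.
function createOnReconnected(
  fetchSnapshot: () => Promise<SessionSnapshot>,
  sink: MessageSink,
): () => Promise<void> {
  let generation = 0;
  return async () => {
    const mine = ++generation;
    const snapshot = await fetchSnapshot();
    // Assumption: if another reconnect started while this fetch was in
    // flight, drop this result so an older snapshot cannot overwrite a newer one.
    if (mine !== generation) return;
    // Assumption: clear the sink first so replayed messages are not duplicated.
    sink.reset();
    for (const message of snapshot.messages) {
      sink.push(message);
    }
  };
}
```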

Session routing (session.ts, CloudAgentProvider.tsx):

  • pickTransportFactory now checks resolved.isLive for Cloud Agent sessions — completed sessions route to the read-only historical transport instead of opening a live WebSocket
  • resolveSession queries getWithRuntimeState to determine actual session liveness from execution status, instead of hardcoding isLive: true

Verification

  • npx jest src/lib/cloud-agent-sdk/ --no-coverage — 18 suites, 551 tests, all pass
  • pnpm typecheck — passes, no errors
  • Manual testing via Chrome DevTools MCP (6 tests, all pass):
    • Test 1: Baseline session connect + stream
    • Test 2: Short tab background (7s) — no reconnect, streaming uninterrupted
    • Test 3: Stale connection (ping timeout) — ticket refreshed, reconnect succeeds on first attempt, 0 auth failures
    • Test 6: Completed session — Historical transport selected, 0 [Connection] messages
    • Test 7: Rapid tab switching (10 cycles) — no duplicate connections, no errors
    • Test 9: Reconnect after session stops — ticket refreshed, reconnect succeeds, UI shows completed state

Visual Changes

N/A

Reviewer Notes

  • Duplicate [Connection] messages in dev: All connection log lines appear in pairs due to React StrictMode double-mounting effects. This is expected in development only and does not occur in production.
  • Historical transport rendering gap: When re-opening a completed Cloud Agent session, the historical transport correctly loads (no WebSocket), but the chat message content may not render. This is a pre-existing issue with the historical transport's snapshot replay for Cloud Agent sessions, not introduced by this PR.
  • No server-side changes needed: Cloud Agent DOs already ignore unknown message types (like "ping") gracefully.
  • Risk area: The isLive detection in CloudAgentProvider.tsx handles multiple edge cases (terminal execution, active execution, null execution with/without initiatedAt). The initiatedAt heuristic covers the case where the DO has cleaned up the execution record after completion.

Add application-level ping/pong staleness detection for WebSocket
connections that silently die when tabs are backgrounded. On tab
resume, sends a ping and reconnects if no response within 5s.

- base-connection: visibilitychange/pageshow/online handlers, ping
  timeout, proactive ticket refresh before reconnect
- cloud-agent-transport/cli-live-transport: snapshot refetch on
  reconnect via onReconnected callback
- session.ts: route completed sessions to historical transport
- CloudAgentProvider: determine isLive from DO execution status
@kilo-code-bot
Contributor

kilo-code-bot bot commented Apr 2, 2026

Code Review Summary

Status: 4 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 4 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| src/components/cloud-agent-next/CloudAgentProvider.tsx | 53 | Completed Cloud Agent sessions now route to the live transport instead of historical replay |

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

| File | Line | Issue |
| --- | --- | --- |
| src/lib/cloud-agent-sdk/base-connection.ts | N/A | Ping-based staleness checks never receive a response and force reconnects on quiet healthy sockets |
| src/lib/cloud-agent-sdk/base-connection.ts | N/A | disconnect() leaves visibility and online listeners attached, leaking stale connection instances |
| src/lib/cloud-agent-sdk/base-connection.ts | 157 | Replacement sockets inherit the prior staleness timestamp because lastMessageTime is not reset |

Files Reviewed (8 files)
  • src/components/cloud-agent-next/CloudAgentProvider.tsx - 1 issue
  • src/lib/cloud-agent-sdk/session-manager.test.ts - 0 issues
  • src/lib/cloud-agent-sdk/session-manager.ts - 0 issues
  • src/lib/cloud-agent-sdk/session-phase.test.ts - 0 issues
  • src/lib/cloud-agent-sdk/session-routing.test.ts - 0 issues
  • src/lib/cloud-agent-sdk/session-transport.test.ts - 0 issues
  • src/lib/cloud-agent-sdk/session.ts - 0 issues
  • src/lib/cloud-agent-sdk/types.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 595,751 tokens

…ction

Address review feedback on stale WebSocket recovery:

- Remove ws.send('ping') — server never responds; staleness detection
  now relies on server heartbeats canceling the timeout
- Make staleness timeout configurable (stalenessTimeoutMs) so the
  transport layer that knows the heartbeat interval controls the value
- Increase default from 5s to 30s to exceed server heartbeat intervals
- Track lastMessageTime to skip the check when a recent message proves
  the connection is alive
- Wire heartbeatTimeoutMs through cloud-agent-connection
- disconnect() now removes visibility/pageshow/online listeners,
  fixing a leak when transports disconnect without calling destroy()
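
The heartbeat-driven check described in this commit can be sketched as below. It is an illustration under stated assumptions, not the code in base-connection.ts: `StalenessMonitor`, its method names, and the injected `now` and `schedule` functions are hypothetical, and treating "recent" as "within stalenessTimeoutMs" is a guess. Only `stalenessTimeoutMs`, `lastMessageTime`, and the 30s default come from the commit message.

```typescript
type Cancel = () => void;

interface StalenessDeps {
  now(): number;
  // Returns a function that cancels the scheduled callback.
  schedule(fn: () => void, delayMs: number): Cancel;
  // Called when no message arrived in time; close the socket and reconnect here.
  onStale(): void;
  stalenessTimeoutMs?: number;
}

class StalenessMonitor {
  private readonly deps: StalenessDeps;
  private readonly timeoutMs: number;
  private lastMessageTime: number;
  private cancelPending: Cancel | null = null;

  constructor(deps: StalenessDeps) {
    this.deps = deps;
    this.timeoutMs = deps.stalenessTimeoutMs ?? 30_000; // default from the commit
    this.lastMessageTime = deps.now();
  }

  // Call for every incoming message, heartbeats included. Any message
  // proves the socket is alive, so a pending timeout is cancelled.
  noteMessage(): void {
    this.lastMessageTime = this.deps.now();
    this.clearPending();
  }

  // Assumption: call when a replacement socket opens so it does not
  // inherit the previous socket's timestamp.
  reset(): void {
    this.noteMessage();
  }

  // Call on tab resume.
  checkOnResume(): void {
    if (this.cancelPending !== null) return; // a check is already running
    const idleMs = this.deps.now() - this.lastMessageTime;
    if (idleMs < this.timeoutMs) return; // recent message, skip the check
    this.cancelPending = this.deps.schedule(() => {
      this.cancelPending = null;
      this.deps.onStale();
    }, this.timeoutMs);
  }

  // Call from disconnect() and destroy() so no timer outlives the connection.
  dispose(): void {
    this.clearPending();
  }

  private clearPending(): void {
    if (this.cancelPending !== null) {
      this.cancelPending();
      this.cancelPending = null;
    }
  }
}
```

In a browser, `schedule` could be `(fn, ms) => { const id = setTimeout(fn, ms); return () => clearTimeout(id); }` and `now` could be `Date.now`. Injecting both keeps the timing logic testable without real timers.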
…minated union

Replace the flat { cloudAgentSessionId, isLive } shape with a
discriminated union ('remote' | 'cloud-agent' | 'read-only') so
transport routing is explicit and type-safe. Simplify session
resolution in CloudAgentProvider by removing the runtime-state
liveness check. Add runtime exhaustive check in pickTransportFactory.
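
A discriminated union with a runtime exhaustive check, as described in this commit, might be shaped like the sketch below. The three variant names come from the commit message. The `kind` discriminant, the other field names, `pickTransport`, and the mapping of each variant to a transport are assumptions made for illustration.

```typescript
// Hypothetical shape; only the three variant names are from the commit.
type ResolvedSession =
  | { kind: "remote"; sessionId: string }
  | { kind: "cloud-agent"; cloudAgentSessionId: string }
  | { kind: "read-only"; sessionId: string };

// Stand-in for the real transport factories.
type TransportName = "cli-live" | "cloud-agent-live" | "historical";

function pickTransport(resolved: ResolvedSession): TransportName {
  switch (resolved.kind) {
    case "remote":
      return "cli-live"; // assumed mapping
    case "cloud-agent":
      return "cloud-agent-live"; // assumed mapping
    case "read-only":
      return "historical"; // no WebSocket is opened
    default: {
      // Compile-time check: fails to type-check if a variant is unhandled.
      const unreachable: never = resolved;
      // Runtime check: catches values that bypassed the type system.
      throw new Error(`Unknown session kind: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

Compared with a flat `{ cloudAgentSessionId, isLive }` shape, each variant carries only the fields that make sense for it, so a read-only session cannot be routed to a live transport by a forgotten boolean check.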
@eshurakov eshurakov merged commit a7cb3b7 into main Apr 2, 2026
15 checks passed
@eshurakov eshurakov deleted the eshurakov/spotted-hemisphere branch April 2, 2026 20:39