fix: prevent silent chat death and reduce streaming overhead #1484
Open
wgnrai wants to merge 2 commits into agent0ai:development from
Conversation
Three-factor compound failure causing chat sessions to silently stall:

1. Silent Exception Death (helpers/defer.py): DeferredTask._on_task_done() captures exceptions in the Future, but no code ever reads them. Chat sessions die with zero UI feedback. Fix: log exceptions and notify the AgentContext on task failure.
2. O(n²) Per-Chunk DirtyJson Parsing (agent.py): stream_callback runs extract_json_root_string() + json_parse_dirty() on EVERY streaming chunk, against a string that grows with each token. Fix: only parse when full.rstrip() ends with } or ].
3. Indefinite History Compression Wait: organize_history_wait has await task.result() with no timeout. Fix: wrap with an asyncio.wait_for() 30s timeout.

Amplifier: all chats share one EventLoopThread singleton, making concurrent sessions particularly affected.
- Add a 10s timeout to _run_on_dispatcher_loop() to prevent indefinite blocking when the main event loop is saturated
- Add a fire-and-forget _emit_fire_and_forget() for high-frequency streaming events to avoid queue buildup
- Add a 60s timeout to handler execute_inside() to prevent a single stalled handler from blocking all WebSocket event processing

Related: Issue agent0ai#1485
Author
Friendly bump — these patches are still relevant against v1.9 (just re-tested, applied cleanly). Noticed v1.9 shipped without them and the adjacent code in
Fix: Silent Chat Death, Stream Overhead & Thread Leak Mitigation
Branch: fix/silent-chat-death-and-stream-overhead
Related: Issue #1485
Fork: wgnrai/agent-zero

Overview
Six patches addressing a compound failure in Agent Zero's WebSocket event pipeline that causes silent chat stalls, thread leaks, and degraded performance under sustained multi-session use.
Original Issue
Chats silently stop responding — no error, no timeout, no user feedback. The root cause is a three-factor compound failure (Fixes 1–3), compounded by three additional pipeline stalls discovered during fix validation (Fixes 4–6).
Patches
Fix 1: Silent Exception Swallowing in helpers/defer.py
Severity: Critical — root cause of silent death

DeferredTask._on_task_done() silently catches and discards all exceptions from background tasks. When an agent task fails, the UI never receives any notification — the chat simply stops responding with no indication of failure.

Change: Log the exception and notify the AgentContext so the UI can display an error state.
File: helpers/defer.py
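The change described above can be sketched as follows. This is a minimal illustration, not the actual Agent Zero code: the `on_error` hook is a hypothetical stand-in for the AgentContext notification.

```python
import asyncio
import logging

logger = logging.getLogger("defer")

class DeferredTask:
    """Sketch of a background-task wrapper. `on_error` stands in for
    notifying the AgentContext so the UI can show an error state."""

    def __init__(self, coro, on_error=None):
        self._on_error = on_error
        self._task = asyncio.ensure_future(coro)
        self._task.add_done_callback(self._on_task_done)

    def _on_task_done(self, task: asyncio.Task) -> None:
        if task.cancelled():
            return
        exc = task.exception()  # reading it also marks it as retrieved
        if exc is not None:
            # Pre-fix behavior: the exception sat unread inside the Future
            # and the chat died with zero feedback. Post-fix: log and notify.
            logger.error("DeferredTask failed: %r", exc)
            if self._on_error is not None:
                self._on_error(exc)
```

With this wiring, a failing agent task surfaces through the callback instead of vanishing into an unread Future.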
Fix 2: O(n²) JSON Parsing on Every Stream Chunk in agent.py
Severity: High — causes event loop saturation

_agent_output_callback() calls _try_parse_json() on every streaming chunk. Most chunks are incomplete JSON fragments. The parser attempts full deserialization each time, creating O(n²) behavior as the growing buffer is re-parsed on every token.

Change: Gate the parsing attempt — only try to parse when the buffer ends with } or ], indicating a potentially complete JSON object.
File: agent.py
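The gating idea can be sketched like this. It is an illustration, not the patch itself: `json.loads` stands in for the project's dirty-JSON parser, and the function names are assumptions.

```python
import json

def chunk_may_be_complete(buffer: str) -> bool:
    """O(1) gate: only worth attempting a parse when the accumulated
    buffer could plausibly be a finished JSON object or array."""
    stripped = buffer.rstrip()
    return stripped.endswith("}") or stripped.endswith("]")

def stream_callback(full: str):
    """Per-chunk handler sketch. json.loads stands in for the project's
    json_parse_dirty, which tolerates messier input."""
    if not chunk_may_be_complete(full):
        return None  # skip the expensive parse on mid-stream fragments
    try:
        return json.loads(full)
    except json.JSONDecodeError:
        return None  # ends with } or ] but still not a valid document
```

Most chunks now fail the cheap suffix check and skip parsing entirely, so total work per stream drops from quadratic to roughly linear in the response length.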
Fix 3: Unbounded task.result() Wait in organize_history_wait.py
Severity: High — blocks handler worker indefinitely

The organize_history_wait extension awaits task.result() without any timeout. If the task hangs (which Fix 1 now surfaces), this blocks the single WsHandlers worker thread, preventing all subsequent WebSocket event processing.

Change: Wrap with asyncio.wait_for(..., timeout=30.0).
File: extensions/built_in/organize_history_wait.py
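A sketch of the bounded wait, with an assumed helper name (`wait_with_timeout`) rather than the extension's real structure:

```python
import asyncio

async def wait_with_timeout(task, timeout: float = 30.0):
    """Bounded wait sketch: returns (finished, result). On timeout the
    worker is released instead of blocking forever (30s per the patch)."""
    try:
        return True, await asyncio.wait_for(task, timeout=timeout)
    except asyncio.TimeoutError:
        # Pre-fix: this await had no bound and could pin the single
        # WsHandlers worker thread indefinitely.
        return False, None
```

Note that asyncio.wait_for cancels the awaited task when the timeout fires; if the compression should keep running in the background instead, the task can be wrapped in asyncio.shield() first.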
Fix 4: Dispatcher Loop Bridge Timeout in helpers/ws_manager.py
Severity: High — causes WebSocket disconnects under load

_run_on_dispatcher_loop() creates a Future bridge between the caller's event loop and the main uvicorn event loop. When the main loop is saturated, all emit calls block indefinitely, causing ping_timeout disconnects.

Change: Add asyncio.wait_for(..., timeout=10.0) around the future bridge.
File: helpers/ws_manager.py
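One way to picture the bounded bridge, assuming a cross-thread submission via run_coroutine_threadsafe (the real implementation details are not shown in this PR excerpt):

```python
import asyncio

BRIDGE_TIMEOUT = 10.0  # seconds, per the patch

async def run_on_dispatcher_loop(coro, dispatcher_loop: asyncio.AbstractEventLoop):
    """Sketch of the bounded cross-loop bridge: submit `coro` to the
    dispatcher loop and wait at most BRIDGE_TIMEOUT for its result."""
    # run_coroutine_threadsafe returns a concurrent.futures.Future;
    # wrap_future bridges it back into the caller's event loop.
    cf = asyncio.run_coroutine_threadsafe(coro, dispatcher_loop)
    try:
        return await asyncio.wait_for(asyncio.wrap_future(cf), timeout=BRIDGE_TIMEOUT)
    except asyncio.TimeoutError:
        cf.cancel()  # don't leave the emit queued on a saturated loop
        raise
```

A saturated dispatcher loop now produces a TimeoutError on the emitting side after 10 seconds rather than an indefinite block that eventually trips ping_timeout.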
Fix 5: Fire-and-Forget Emit for Streaming Events in helpers/ws_manager.py
Severity: Medium — prevents queue buildup during streaming

High-frequency streaming events (LLM token-by-token responses) all route through the blocking _run_on_dispatcher_loop bridge. Under concurrent sessions, this causes queue buildup and dispatcher loop starvation.

Change: New _emit_fire_and_forget() method that bypasses the dispatcher bridge for streaming events where best-effort delivery is acceptable.
File: helpers/ws_manager.py
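A minimal sketch of the fire-and-forget path (assumed shape, not the actual method body): schedule the emit and return immediately, logging failures instead of propagating them.

```python
import asyncio
import logging

logger = logging.getLogger("ws")

def emit_fire_and_forget(coro, dispatcher_loop: asyncio.AbstractEventLoop) -> None:
    """Best-effort emit sketch for high-frequency streaming events:
    schedule on the dispatcher loop and return without waiting. Failures
    are logged rather than raised (acceptable for token streams)."""
    cf = asyncio.run_coroutine_threadsafe(coro, dispatcher_loop)

    def _log_failure(fut):
        if fut.cancelled():
            return
        exc = fut.exception()
        if exc is not None:
            logger.warning("dropped streaming event: %r", exc)

    cf.add_done_callback(_log_failure)
```

The producer (the LLM streaming callback) never blocks on delivery, so a slow dispatcher loop degrades to dropped or delayed tokens instead of stalling the whole session.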
Fix 6: Handler Execution Timeout in helpers/ws_manager.py
Severity: High — single stall blocks all events

WsManager uses a single DeferredTask worker that serializes all incoming WebSocket event handlers. If any handler blocks, all subsequent events queue up, causing a complete event processing stall.

Change: Wrap handler execution with asyncio.wait_for(..., timeout=60.0).
File: helpers/ws_manager.py
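The bounded handler execution can be sketched as below; the function name and return convention are assumptions, not the WsManager API.

```python
import asyncio
import logging

logger = logging.getLogger("ws")

async def execute_handler(handler, event, timeout: float = 60.0):
    """Run one WebSocket event handler with an upper bound so a single
    stalled handler cannot wedge the shared worker (60s per the patch)."""
    try:
        await asyncio.wait_for(handler(event), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        logger.error("handler %s timed out after %.0fs",
                     getattr(handler, "__name__", repr(handler)), timeout)
        return False
```

After a timeout the worker moves on to the next queued event, converting a total pipeline stall into a single logged failure.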
Thread Leak (Separate Issue, Related Findings)

During validation, we confirmed a pre-existing thread leak.

Thread Composition (at 333 threads, py-spy snapshot):
- asyncio_N workers — unbounded ThreadPoolExecutor from run_in_executor(None, ...)
- EventLoopThread instances from defer.py

The 6 patches above do not fix the thread leak — they prevent the blocking stalls that mask it. Thread leak mitigation requires separate upstream changes (bounded executor, consolidated watchdog observers, DeferredTask lifecycle management).
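The bounded-executor mitigation mentioned above could look roughly like this; it is a sketch of one upstream option, not part of this PR, and `install_bounded_executor` is a hypothetical helper.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def install_bounded_executor(loop: asyncio.AbstractEventLoop,
                             max_workers: int = 32) -> ThreadPoolExecutor:
    """Cap the default executor so run_in_executor(None, ...) reuses a
    fixed pool instead of growing asyncio_N threads without bound."""
    executor = ThreadPoolExecutor(max_workers=max_workers,
                                  thread_name_prefix="bounded_worker")
    loop.set_default_executor(executor)
    return executor
```

Excess run_in_executor calls then queue inside the pool rather than spawning new OS threads, trading thread growth for bounded latency under burst load.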
Files Changed
- helpers/defer.py
- agent.py
- extensions/built_in/organize_history_wait.py
- helpers/ws_manager.py

Testing
- py-spy dump and /proc/{pid}/task counting

Breaking Changes
None. All changes are additive (timeouts, logging, new methods) with no API surface modifications.