
fix: prevent silent chat death and reduce streaming overhead#1484

Open
wgnrai wants to merge 2 commits into agent0ai:development from wgnrai:fix/silent-chat-death-and-stream-overhead

Conversation

wgnrai commented Apr 10, 2026

Fix: Silent Chat Death, Stream Overhead & Thread Leak Mitigation

Branch: fix/silent-chat-death-and-stream-overhead
Related: Issue #1485
Fork: wgnrai/agent-zero


Overview

Six patches addressing a compound failure in Agent Zero's WebSocket event pipeline that causes silent chat stalls, thread leaks, and degraded performance under sustained multi-session use.

Original Issue

Chats silently stop responding, with no error, no timeout, and no user feedback. The root cause is a three-factor compound failure (Fixes 1–3), amplified by three additional pipeline stalls discovered during fix validation (Fixes 4–6).


Patches

Fix 1: Silent Exception Swallowing in helpers/defer.py

Severity: Critical — root cause of silent death

DeferredTask._on_task_done() silently catches and discards all exceptions from background tasks. When an agent task fails, the UI never receives any notification — the chat simply stops responding with no indication of failure.

Change: Log the exception and notify the AgentContext so the UI can display an error state.

def _on_task_done(self, task: asyncio.Task) -> None:
    try:
        task.result()  # Re-raises if task failed
    except Exception as e:
        logger.error(f"DeferredTask '{self.thread_name}' failed: {e}", exc_info=True)
        # Notify AgentContext so UI shows the error
        if self.context:
            self.context.set_error(str(e))

File: helpers/defer.py
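The done-callback pattern behind this fix can be demonstrated standalone. `AgentContext.set_error` is this PR's API, so the sketch below (illustrative names only) uses a plain list as the error sink to show how `task.result()` inside the callback surfaces a failure that would otherwise vanish:

```python
import asyncio

errors: list[str] = []  # stand-in for AgentContext's error state

def on_task_done(task: asyncio.Task) -> None:
    try:
        task.result()  # re-raises the task's exception, if any
    except Exception as e:
        errors.append(str(e))  # surface instead of silently swallowing

async def failing_agent_step():
    raise RuntimeError("model call failed")

async def main():
    task = asyncio.create_task(failing_agent_step())
    task.add_done_callback(on_task_done)
    try:
        await task
    except RuntimeError:
        pass  # the awaiter sees it too; the callback covers un-awaited tasks
    await asyncio.sleep(0)  # let the scheduled done callback run

asyncio.run(main())
print(errors)  # -> ['model call failed']
```

Without the `except` body, the exception is consumed by `task.result()` and nothing downstream ever learns the task died, which is exactly the silent-death symptom described above.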


Fix 2: O(n²) JSON Parsing on Every Stream Chunk in agent.py

Severity: High — causes event loop saturation

_agent_output_callback() calls _try_parse_json() on every streaming chunk. Most chunks are incomplete JSON fragments. The parser attempts full deserialization each time, creating O(n²) behavior as the growing buffer is re-parsed on every token.

Change: Gate the parsing attempt — only try to parse when the buffer ends with } or ], indicating a potentially complete JSON object.

def _agent_output_callback(self, chunk: str, ...):
    # ... append to buffer ...
    buffer_text = "".join(self._buffer)
    # Only attempt parse when buffer looks like complete JSON
    if buffer_text.rstrip().endswith(("}", "]")):
        self._try_parse_json(buffer_text)

File: agent.py
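The cost difference from gating can be seen with a toy buffer. The function and chunk data below are illustrative, not the PR's code: without the gate, every chunk triggers a parse attempt over the whole (growing) buffer; with it, only chunks that could close a JSON value do:

```python
import json

chunks = ['{"role": "ass', 'istant", "content"', ': "hel', 'lo"}']

def count_parse_attempts(gated: bool) -> int:
    buffer = ""
    attempts = 0
    for chunk in chunks:
        buffer += chunk
        # Without the gate, every chunk costs a full-buffer parse attempt
        if not gated or buffer.rstrip().endswith(("}", "]")):
            attempts += 1
            try:
                json.loads(buffer)
            except json.JSONDecodeError:
                pass  # incomplete fragment; keep buffering
    return attempts

print(count_parse_attempts(gated=False))  # 4: one attempt per chunk
print(count_parse_attempts(gated=True))   # 1: only the closing chunk
```

Since each attempt scans the entire buffer, ungated parsing is O(n²) in total streamed length, while the gated version pays the full-buffer cost only when a parse can plausibly succeed.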


Fix 3: Unbounded task.result() Wait in organize_history_wait.py

Severity: High — blocks handler worker indefinitely

The organize_history_wait extension awaits task.result() without any timeout. If the task hangs (which Fix 1 now surfaces), this blocks the single WsHandlers worker thread, preventing all subsequent WebSocket event processing.

Change: Wrap with asyncio.wait_for(..., timeout=30.0).

try:
    result = await asyncio.wait_for(task.result(), timeout=30.0)
except asyncio.TimeoutError:
    logger.warning(f"Task {task_id} timed out after 30s in organize_history_wait")

File: extensions/built_in/organize_history_wait.py


Fix 4: Dispatcher Loop Bridge Timeout in helpers/ws_manager.py

Severity: High — causes WebSocket disconnects under load

_run_on_dispatcher_loop() creates a Future bridge between the caller's event loop and the main uvicorn event loop. When the main loop is saturated, all emit calls block indefinitely, causing ping_timeout disconnects.

Change: Add asyncio.wait_for(..., timeout=10.0) around the future bridge.

try:
    return await asyncio.wait_for(asyncio.wrap_future(future), timeout=10.0)
except asyncio.TimeoutError:
    future.cancel()
    raise RuntimeError("Dispatcher loop emit timed out after 10s")

File: helpers/ws_manager.py
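The bridge mechanism can be reproduced in isolation: a coroutine is submitted to another thread's loop with `run_coroutine_threadsafe`, and the resulting `concurrent.futures.Future` is awaited through `asyncio.wrap_future` under a timeout. This is a minimal sketch of the pattern, not the `ws_manager` code; the 5-second sleep stands in for a saturated dispatcher loop:

```python
import asyncio
import threading

# A "dispatcher" loop running in its own thread, standing in for the
# main uvicorn event loop
dispatcher_loop = asyncio.new_event_loop()
threading.Thread(target=dispatcher_loop.run_forever, daemon=True).start()

async def slow_emit():
    await asyncio.sleep(5)  # simulates a saturated dispatcher loop
    return "sent"

async def bridged_call(timeout: float):
    # Submit to the other loop; returns a concurrent.futures.Future
    future = asyncio.run_coroutine_threadsafe(slow_emit(), dispatcher_loop)
    try:
        return await asyncio.wait_for(asyncio.wrap_future(future), timeout=timeout)
    except asyncio.TimeoutError:
        future.cancel()  # don't leave the coroutine running on the other loop
        return "timed out"

print(asyncio.run(bridged_call(timeout=0.1)))  # "timed out" instead of hanging
```

Without the `wait_for`, the caller blocks for as long as the dispatcher loop is busy, which is how saturated emits cascade into ping_timeout disconnects.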


Fix 5: Fire-and-Forget Emit for Streaming Events in helpers/ws_manager.py

Severity: Medium — prevents queue buildup during streaming

High-frequency streaming events (LLM token-by-token responses) all route through the blocking _run_on_dispatcher_loop bridge. Under concurrent sessions, this causes queue buildup and dispatcher loop starvation.

Change: New _emit_fire_and_forget() method that bypasses the dispatcher bridge for streaming events where best-effort delivery is acceptable.

async def _emit_fire_and_forget(self, namespace, sid, event_type, data):
    try:
        # AsyncServer.emit is a coroutine: schedule it without awaiting
        # so the caller never blocks; delivery is best-effort
        asyncio.create_task(
            self.socketio.emit(event_type, data, to=sid, namespace=namespace)
        )
    except Exception:
        pass  # Best effort for streaming

File: helpers/ws_manager.py


Fix 6: Handler Execution Timeout in helpers/ws_manager.py

Severity: High — single stall blocks all events

WsManager uses a single DeferredTask worker that serializes all incoming WebSocket event handlers. If any handler blocks, all subsequent events queue up, causing a complete event processing stall.

Change: Wrap handler execution with asyncio.wait_for(..., timeout=60.0).

try:
    value = await asyncio.wait_for(
        self._get_handler_worker().execute_inside(
            handler.process, event_type, payload, sid
        ),
        timeout=60.0,
    )
except asyncio.TimeoutError:
    return _HandlerExecution(
        handler, RuntimeError("Handler execution timed out after 60s"), duration_ms
    )

File: helpers/ws_manager.py


Thread Leak (Separate Issue, Related Findings)

During validation, we confirmed a pre-existing thread leak:

| Metric                        | Baseline | After 60 min       |
|-------------------------------|----------|--------------------|
| Thread count                  | 155      | 460                |
| Growth rate                   |          | ~3.4 threads/min   |
| Time to system limit (~4096)  |          | ~20 hours          |

Thread Composition (at 333 threads, py-spy snapshot)

  • 33 asyncio_N workers — unbounded ThreadPoolExecutor from run_in_executor(None, ...)
  • 14 watchdog observer threads (7 inotify + 7 dispatch)
  • 5 EventLoopThread instances from defer.py
  • ~310 native C-level threads from engineio/uvicorn ASGI transport
  • Misc: tqdm monitors, bridge readers, MainThread

The 6 patches above do not fix the thread leak — they prevent the blocking stalls that mask it. Thread leak mitigation requires separate upstream changes (bounded executor, consolidated watchdog observers, DeferredTask lifecycle management).
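Of those upstream mitigations, the bounded-executor one is the simplest to sketch. `run_in_executor(None, ...)` call sites all share the loop's default executor, so installing an explicitly sized `ThreadPoolExecutor` caps the `asyncio_N` worker population in one place. The pool size and function names below are illustrative, not a proposed patch:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_io(n: int) -> int:
    return n * 2  # stand-in for real blocking work

async def main():
    loop = asyncio.get_running_loop()
    # Cap the default executor so every run_in_executor(None, ...) call
    # site in the codebase shares one small, bounded pool
    loop.set_default_executor(
        ThreadPoolExecutor(max_workers=4, thread_name_prefix="bounded")
    )
    return await asyncio.gather(
        *(loop.run_in_executor(None, blocking_io, i) for i in range(10))
    )

print(asyncio.run(main()))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

The trade-off is that a too-small pool turns thread growth into queueing delay, so the cap needs to be sized against the workload's real blocking-call concurrency.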


Files Changed

| File                                          | Patches              |
|-----------------------------------------------|----------------------|
| helpers/defer.py                              | Fix 1                |
| agent.py                                      | Fix 2                |
| extensions/built_in/organize_history_wait.py  | Fix 3                |
| helpers/ws_manager.py                         | Fix 4, Fix 5, Fix 6  |

Testing

  • Deployed on wgnr.ai production instance for 48+ hours
  • Validated with 2+ concurrent chat sessions over 60+ minutes
  • Thread growth monitored via py-spy dump and /proc/{pid}/task counting
  • All patches applied together — no regressions observed

Breaking Changes

None. All changes are additive (timeouts, logging, new methods) with no API surface modifications.

Commit Messages

Three-factor compound failure causing chat sessions to silently stall:

1. Silent Exception Death (helpers/defer.py):
   DeferredTask._on_task_done() captures exceptions in Future but
   no code reads it. Chat sessions die with zero UI feedback.
   Fix: Log exceptions and notify AgentContext on task failure.

2. O(n²) Per-Chunk DirtyJson Parsing (agent.py):
   stream_callback runs extract_json_root_string() + json_parse_dirty()
   on EVERY streaming chunk with growing string length.
   Fix: Only parse when full.rstrip() ends with } or ].

3. Indefinite History Compression Wait:
   organize_history_wait has await task.result() with no timeout.
   Fix: Wrap with asyncio.wait_for() 30s timeout.

Amplifier: All chats share one EventLoopThread singleton,
making concurrent sessions particularly affected.
- Add 10s timeout to _run_on_dispatcher_loop() to prevent indefinite
  blocking when main event loop is saturated
- Add fire-and-forget _emit_fire_and_forget() for high-frequency
  streaming events to avoid queue buildup
- Add 60s timeout to handler execute_inside() to prevent single stalled
  handler from blocking all WebSocket event processing

Related: Issue agent0ai#1485
wgnrai (Author) commented Apr 13, 2026

Friendly bump — these patches are still relevant against v1.9 (just re-tested, applied cleanly). Noticed v1.9 shipped without them and the adjacent code in ws_manager, defer, and streaming handlers got touched by other commits. Happy to resolve any merge conflicts if needed. The stall/hang issues these address are still impacting production use.
