bug(mcp): tools/list hangs 60s when client sends initialize + notifications/initialized + tools/list in rapid succession #98

@halindrome

Description

Summary

codebase-memory-mcp hangs for 60 seconds before responding to tools/list when an MCP client sends the three standard initialization messages without artificial delays between them. This manifests as a "connecting..." state in Claude Code that resolves only after STORE_IDLE_TIMEOUT_S (60s) elapses.

Root Cause

The MCP event loop in cbm_mcp_server_run (src/mcp/mcp.c) mixes poll() on the raw file descriptor with getline() on a buffered FILE*. These two abstractions operate at different layers of the I/O stack, and the combination creates a correctness hazard:

  1. The client sends three messages back-to-back with no delay between them (all arrive in the kernel receive buffer simultaneously)
  2. poll() fires — data is available
  3. getline() reads initialize and over-reads — libc's FILE* buffer drains the entire kernel buffer, pulling all three messages into userspace
  4. cbm_mcp_server_handle() processes initialize and returns a response
  5. getline() processes notifications/initialized (a notification with no id) — cbm_mcp_server_handle() returns NULL (correct per spec), no response written
  6. The loop calls poll() again for the next message — but the tools/list payload is already in libc's FILE* buffer, not the kernel fd
  7. poll() sees an empty kernel fd and blocks for 60 seconds
  8. tools/list never receives a response within any reasonable timeout
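The over-read in steps 3 through 6 can be demonstrated in isolation. Below is a small Python analogue of the C `poll()`/`getline()` pair: `select()` on the raw pipe fd plays the role of `poll()`, and a buffered reader layered over the same fd plays the role of the `FILE*`:

```python
import os, select

r, w = os.pipe()
# Three messages arrive in one kernel write, as CC 2.1.80 sends them
os.write(w, b"initialize\ninitialized\ntools/list\n")

f = os.fdopen(r, "rb")  # buffered reader layered over the same fd

print(select.select([r], [], [], 0)[0] != [])  # True: kernel fd has data

f.readline()  # returns b"initialize\n" but drains ALL 3 lines into the buffer

# The kernel fd is now empty; a blocking poll/select here would hang even
# though two complete messages sit unread in the userspace buffer.
print(select.select([r], [], [], 0)[0] != [])  # False
print(f.readline())  # b'initialized\n', served from the buffer, not the fd
```

The same sequence in the C server leaves the tools/list payload stranded in libc's buffer while poll() blocks for the full idle timeout.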

The bug was reliably triggered by Claude Code 2.1.80, which sends all three initialization messages as a rapid burst (no inter-message delay). Earlier client versions or clients that insert delays between messages may never observe the bug.

Reproduction:

import subprocess, json, time

binary = "codebase-memory-mcp"
proc = subprocess.Popen([binary], stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

msgs = [
    {"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}},"jsonrpc":"2.0","id":0},
    {"method":"notifications/initialized","jsonrpc":"2.0"},
    {"method":"tools/list","jsonrpc":"2.0","id":1},
]

# Send all three with NO delay — triggers the hang
for m in msgs:
    proc.stdin.write(json.dumps(m) + "\n")
proc.stdin.flush()

start = time.time()
for _ in range(2):  # expect initialize response + tools/list response
    line = proc.stdout.readline()
    print(f"{time.time()-start:.2f}s: {line[:80]}")
proc.terminate()

Expected: both responses arrive within ~1 second.
Observed (before fix): initialize response arrives immediately; tools/list response arrives after ~60 seconds.

The comment at the original poll() call site stated: "MCP is request-response (one line at a time), so mixing poll() on the raw fd with getline() on the buffered FILE* is safe in practice." This assumption does not hold when multiple messages arrive in a single kernel receive event.

Trigger Context: Claude Code 2.1.80

Claude Code 2.1.80 changed its MCP client startup to send the three initialization messages (initialize, notifications/initialized, tools/list) in rapid succession as part of a single write burst. This is legal behavior under the MCP specification — the protocol does not require delays between messages. The server bug was latent before this client change; 2.1.80 made it reliably reproducible.

The three messages CC 2.1.80 sends on startup (captured via spy):

{"method":"initialize","params":{"protocolVersion":"2025-11-25","capabilities":{"roots":{},"elicitation":{"form":{},"url":{}}},"clientInfo":{"name":"claude-code","version":"2.1.80"}},"jsonrpc":"2.0","id":0}
{"method":"notifications/initialized","jsonrpc":"2.0"}
{"method":"tools/list","jsonrpc":"2.0","id":1}

Fix

Replace the single blocking poll() call with a three-phase approach that correctly handles data already buffered in the FILE* layer:

Phase 1: Non-blocking poll(timeout=0) — fast path, catches data already in the kernel fd.

Phase 2: If Phase 1 returns 0 (no kernel data), peek one byte from the FILE* buffer using fgetc(in) + ungetc(). This detects data that a prior getline() over-read pulled into libc's buffer. If data is found, skip the blocking poll and fall through to getline().

Phase 3: Only if both Phase 1 and Phase 2 confirm no data — call blocking poll(STORE_IDLE_TIMEOUT_S * 1000) for idle eviction.

This approach is fully POSIX-portable and does not require making the fd non-blocking (which would complicate getline() error handling for EAGAIN), nor does it rely on GNU-only extensions like __fpending().
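The three phases can be sketched in Python against the same buffered-reader analogue used above (illustrative only; the actual fix is in C, where Phase 2 peeks the opaque FILE* buffer with fgetc() plus ungetc() because libc does not portably expose a count of buffered bytes, while this sketch tracks its userspace buffer explicitly, making the Phase 2 check a plain length test):

```python
import os, select

class LineReader:
    """Python analogue of the three-phase wait (illustrative sketch, not
    the C source). The userspace buffer is tracked explicitly, so the
    Phase 2 check is a plain length test instead of fgetc()/ungetc()."""

    def __init__(self, fd):
        self.fd = fd
        self.buf = b""

    def wait_for_line(self, idle_timeout_s):
        # Phase 1: non-blocking poll. Is data already in the kernel fd?
        if select.select([self.fd], [], [], 0)[0]:
            return True
        # Phase 2: did a prior readline() over-read bytes into the buffer?
        if self.buf:
            return True
        # Phase 3: truly idle. Block up to the idle-eviction timeout.
        return bool(select.select([self.fd], [], [], idle_timeout_s)[0])

    def readline(self):
        # Like getline() on a FILE*: may pull more than one line off the fd.
        while b"\n" not in self.buf:
            chunk = os.read(self.fd, 4096)
            if not chunk:
                break
            self.buf += chunk
        line, _, self.buf = self.buf.partition(b"\n")
        return line

# Demonstration: a three-message burst, one read, then the wait again.
rfd, wfd = os.pipe()
os.write(wfd, b"initialize\ninitialized\ntools/list\n")
reader = LineReader(rfd)
print(reader.readline())         # b'initialize' (over-reads the rest)
print(reader.wait_for_line(60))  # True, immediately, via the Phase 2 check
print(reader.readline())         # b'initialized'
```

In this sketch the Phase 2 check is what keeps the second wait from blocking for the idle timeout, which is exactly the path the buggy single-poll loop was missing.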

The inaccurate comment at the original call site is also corrected to document the actual hazard.

Test Coverage

  • C unit test (tests/test_mcp.c): mcp_server_run_rapid_messages — uses pipe() + alarm(5) to verify all three init messages are processed without hanging
  • Python integration test (scripts/test_mcp_rapid_init.py): sends all three messages simultaneously via proc.communicate(), asserts tools/list response arrives within 5 seconds against the installed binary

Test results: 2043/2043 tests pass. Python integration test passes against built binary and installed binary.

Affected Versions

Triggered reliably by Claude Code ≥ 2.1.80. Latent with earlier client versions, or with any client that inserts inter-message delays.

Labels: bug (Something isn't working), editor/integration (Editor compatibility and CLI integration)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions