Skip to content

fix: make agentcore-worker-loop compatible with OTEL threading instrumentation#405

Open
aidandaly24 wants to merge 2 commits intomainfrom
fix/otel-threading-worker-loop-compat
Open

fix: make agentcore-worker-loop compatible with OTEL threading instrumentation#405
aidandaly24 wants to merge 2 commits intomainfrom
fix/otel-threading-worker-loop-compat

Conversation

@aidandaly24
Copy link
Copy Markdown
Contributor

Description

Fix a runtime crash when using opentelemetry-instrument in the Dockerfile CMD (the documented pattern for ACO observability) with async handlers. The agentcore-worker-loop thread (added in v1.3.3, PR #273) is incompatible with OpenTelemetry's opentelemetry-instrumentation-threading, which wraps Thread.run() to propagate trace context.

Worked on: bedrock-agentcore==1.2.1
Breaks on: bedrock-agentcore >= 1.3.3 (tested through 1.6.0)

Error

Exception in thread agentcore-worker-loop:
  File "opentelemetry/instrumentation/threading/__init__.py", line 152, in __wrap_threading_run
  File "bedrock_agentcore/runtime/app.py", line 545, in _run_worker_loop
    self._worker_loop.run_forever()
RuntimeError: Cannot run the event loop while another loop is running

Motivation

Users following the documented ACO observability pattern (using opentelemetry-instrument as the Dockerfile CMD entrypoint, as configured by the starter toolkit when observability_enabled=true) hit this crash on every async handler invocation. The only workaround was removing opentelemetry-instrument from the CMD, which loses all observability (no traces/spans in CloudWatch GenAI dashboard).

Root Cause

OpenTelemetry's opentelemetry-instrumentation-threading wraps every Thread.run() to propagate trace context from parent to child threads. This wrapper leaks the parent thread's "running event loop" state (tracked via asyncio._get_running_loop(), a per-thread C variable in CPython) into the child thread.

Previously, _ensure_worker_loop created the event loop in the main thread and _run_worker_loop called run_forever() in the worker thread. When OTEL's wrapper propagated the main thread's running-loop state, run_forever() saw a non-None running loop and raised:

RuntimeError: Cannot run the event loop while another loop is running

Note: OTEL_PYTHON_DISABLED_INSTRUMENTATIONS=threading was reported as a workaround that didn't work. We confirmed the SDK does not override this env var — the issue is likely that the var must be set before opentelemetry-instrument bootstraps. Regardless, the SDK should be compatible without requiring users to disable instrumentation.

Changes

Three changes to _run_worker_loop / _ensure_worker_loop in src/bedrock_agentcore/runtime/app.py:

  1. asyncio._set_running_loop(None) at the top of _run_worker_loop — clears any running-loop state leaked from the parent thread by OTEL's threading instrumentation context propagation. This is the critical one-liner that prevents the crash.

  2. Event loop creation moved into the worker threadasyncio.new_event_loop() + asyncio.set_event_loop() now happen inside _run_worker_loop instead of _ensure_worker_loop. This follows the canonical asyncio pattern (create loops in the thread that runs them) and eliminates any cross-thread state leakage.

  3. threading.Event synchronization with loop.call_soon(ready.set) — the parent thread now waits (with a 10s timeout) until the worker loop is actually running before returning the loop reference. The call_soon callback fires after run_forever() sets _running = True, guaranteeing is_running() returns True before any caller uses the loop. This also fixes a pre-existing race condition where the parent could return a loop reference before run_forever() had started.

Test

Added test_worker_loop_compatible_with_otel_threading_instrumentation which simulates the exact OTEL condition by setting asyncio._set_running_loop(leak) in a child thread before calling _run_worker_loop, then verifies the loop starts and can execute coroutines.

Testing

  • All 116 existing tests in test_app.py pass (0 regressions)
  • New unit test reproducing the OTEL conflict passes
  • Manual integration test: built a container with opentelemetry-instrument CMD, real ThreadingInstrumentor active, confirmed async handler invocation, worker loop reuse, and /ping all work with OTEL traces flowing

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…mentation

When opentelemetry-instrumentation-threading is active (e.g. via
opentelemetry-instrument in Dockerfile CMD), Thread.run() is wrapped to
propagate trace context from the parent thread. This leaks the parent's
running event loop state into the worker thread, causing run_forever()
to raise "RuntimeError: Cannot run the event loop while another loop is
running".

Three changes to _run_worker_loop / _ensure_worker_loop:

1. Clear leaked running-loop state with asyncio._set_running_loop(None)
   at the top of the worker thread — this is the critical fix.

2. Create the event loop inside the worker thread (not the parent) to
   follow the canonical asyncio pattern and eliminate cross-thread state
   leakage.

3. Use threading.Event + loop.call_soon(ready.set) so the parent waits
   until the loop is truly running before returning a reference, fixing
   a pre-existing race condition.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 10, 2026

✅ No Breaking Changes Detected

No public API breaking changes found in this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant