fix: make agentcore-worker-loop compatible with OTEL threading instrumentation#405
Open
aidandaly24 wants to merge 2 commits intomainfrom
Open
fix: make agentcore-worker-loop compatible with OTEL threading instrumentation#405aidandaly24 wants to merge 2 commits intomainfrom
aidandaly24 wants to merge 2 commits intomainfrom
Conversation
…mentation When opentelemetry-instrumentation-threading is active (e.g. via opentelemetry-instrument in Dockerfile CMD), Thread.run() is wrapped to propagate trace context from the parent thread. This leaks the parent's running event loop state into the worker thread, causing run_forever() to raise "RuntimeError: Cannot run the event loop while another loop is running". Three changes to _run_worker_loop / _ensure_worker_loop: 1. Clear leaked running-loop state with asyncio._set_running_loop(None) at the top of the worker thread — this is the critical fix. 2. Create the event loop inside the worker thread (not the parent) to follow the canonical asyncio pattern and eliminate cross-thread state leakage. 3. Use threading.Event + loop.call_soon(ready.set) so the parent waits until the loop is truly running before returning a reference, fixing a pre-existing race condition.
Contributor
✅ No Breaking Changes DetectedNo public API breaking changes found in this PR. |
7 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fix a runtime crash when using
opentelemetry-instrumentin the Dockerfile CMD (the documented pattern for ACO observability) with async handlers. Theagentcore-worker-loopthread (added in v1.3.3, PR #273) is incompatible with OpenTelemetry'sopentelemetry-instrumentation-threading, which wrapsThread.run()to propagate trace context.Worked on:
bedrock-agentcore==1.2.1Breaks on:
bedrock-agentcore >= 1.3.3(tested through 1.6.0)Error
Motivation
Users following the documented ACO observability pattern (using
opentelemetry-instrumentas the Dockerfile CMD entrypoint, as configured by the starter toolkit whenobservability_enabled=true) hit this crash on every async handler invocation. The only workaround was removingopentelemetry-instrumentfrom the CMD, which loses all observability (no traces/spans in CloudWatch GenAI dashboard).Root Cause
OpenTelemetry's
opentelemetry-instrumentation-threadingwraps everyThread.run()to propagate trace context from parent to child threads. This wrapper leaks the parent thread's "running event loop" state (tracked viaasyncio._get_running_loop(), a per-thread C variable in CPython) into the child thread.Previously,
_ensure_worker_loopcreated the event loop in the main thread and_run_worker_loopcalledrun_forever()in the worker thread. When OTEL's wrapper propagated the main thread's running-loop state,run_forever()saw a non-None running loop and raised:Note:
OTEL_PYTHON_DISABLED_INSTRUMENTATIONS=threadingwas reported as a workaround that didn't work. We confirmed the SDK does not override this env var — the issue is likely that the var must be set beforeopentelemetry-instrumentbootstraps. Regardless, the SDK should be compatible without requiring users to disable instrumentation.Changes
Three changes to
_run_worker_loop/_ensure_worker_loopinsrc/bedrock_agentcore/runtime/app.py:asyncio._set_running_loop(None)at the top of_run_worker_loop— clears any running-loop state leaked from the parent thread by OTEL's threading instrumentation context propagation. This is the critical one-liner that prevents the crash.Event loop creation moved into the worker thread —
asyncio.new_event_loop()+asyncio.set_event_loop()now happen inside_run_worker_loopinstead of_ensure_worker_loop. This follows the canonical asyncio pattern (create loops in the thread that runs them) and eliminates any cross-thread state leakage.threading.Eventsynchronization withloop.call_soon(ready.set)— the parent thread now waits (with a 10s timeout) until the worker loop is actually running before returning the loop reference. Thecall_sooncallback fires afterrun_forever()sets_running = True, guaranteeingis_running()returnsTruebefore any caller uses the loop. This also fixes a pre-existing race condition where the parent could return a loop reference beforerun_forever()had started.Test
Added
test_worker_loop_compatible_with_otel_threading_instrumentationwhich simulates the exact OTEL condition by settingasyncio._set_running_loop(leak)in a child thread before calling_run_worker_loop, then verifies the loop starts and can execute coroutines.Testing
test_app.pypass (0 regressions)opentelemetry-instrumentCMD, realThreadingInstrumentoractive, confirmed async handler invocation, worker loop reuse, and/pingall work with OTEL traces flowingBy submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.