Add async telemetry functions to prevent SynchronousOnlyOperation errors#877
Add async telemetry functions to prevent SynchronousOnlyOperation errors#877
Conversation
The MCP server runs inside Daphne's async event loop, but the telemetry functions were calling record_event() synchronously, which triggers Installation.objects.get() — a sync ORM call that raises SynchronousOnlyOperation in async contexts. Add async variants (arecord_event, arecord_mcp_tool_call, arecord_mcp_resource_read, arecord_mcp_request) that wrap the sync telemetry through sync_to_async, running the ORM lookup in a thread pool. Update all MCP server call sites to await the async versions. Closes the "Failed to get installation ID: You cannot call this from an async context" warning that fired on every MCP request. https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV
…y call The installation UUID is a singleton that never changes, but _get_installation_id() was hitting the database on every call. Cache the result after the first successful lookup so subsequent telemetry events (including high-frequency MCP requests) skip the ORM entirely. The cache is reset alongside the PostHog client in _reset_posthog_client() for test isolation. https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV
Code Review - PR #877: Add async telemetry functionsSummaryThis PR successfully addresses ✅ Strengths1. Correct Async Pattern
2. Performance OptimizationThe installation ID caching (
3. Complete Implementation
4. DocumentationClear docstrings explain:
🔍 Issues & Recommendations1. Missing Test Coverage for Async Functions (Medium Priority)The new async telemetry functions (
Recommendation: Add async test cases to import asyncio
from django.test import TestCase, override_settings
from config.telemetry import arecord_event, _reset_posthog_client
class AsyncTelemetryTestCase(TestCase):
def setUp(self):
_reset_posthog_client()
# PostHog mock setup...
def test_arecord_event_success(self):
"""Test async event recording delegates to sync version."""
with override_settings(MODE="DEV", TELEMETRY_ENABLED=True):
result = asyncio.run(arecord_event("test_event", {"test": "value"}))
self.assertTrue(result)
# Verify PostHog was called...
def test_arecord_event_caching_works(self):
"""Test installation ID cache works across async calls."""
with override_settings(MODE="DEV", TELEMETRY_ENABLED=True):
with patch('opencontractserver.users.models.Installation.objects.get') as mock_get:
mock_get.return_value = self.mock_installation
# First call should hit DB
asyncio.run(arecord_event("event1"))
self.assertEqual(mock_get.call_count, 1)
# Second call should use cache
asyncio.run(arecord_event("event2"))
self.assertEqual(mock_get.call_count, 1) # No additional DB call2. Thread Safety of Cache Not Documented (Low Priority)The PR description states "caching mechanism is thread-safe for the process lifetime", but Python's GIL provides thread-safety for the simple case here. However:
Recommendation: Add a comment in def _get_installation_id() -> str | None:
"""
Get the installation ID from the Installation model.
The result is cached after the first successful lookup because the
installation UUID is a singleton that never changes for the lifetime
of the process. This avoids a database hit on every telemetry call.
Note: Multiple concurrent first calls may race to populate the cache,
but this is harmless - the Installation singleton ensures the same
value is cached regardless of which call wins.
"""3. CHANGELOG.md Not Updated (Medium Priority)Per CLAUDE.md guidelines, the CHANGELOG should be updated for significant changes. This fix prevents runtime errors in production MCP servers. Recommendation: Add to CHANGELOG.md under ### Fixed
#### MCP Telemetry in Async Context
- **`SynchronousOnlyOperation` in MCP server** (`config/telemetry.py`, `opencontractserver/mcp/telemetry.py`, `opencontractserver/mcp/server.py`): Added async telemetry functions (`arecord_event`, `arecord_mcp_tool_call`, `arecord_mcp_resource_read`, `arecord_mcp_request`) that use `sync_to_async` to safely run Django ORM lookups in a thread pool. Prevents "You cannot call this from an async context" errors on every MCP request.
- **Installation ID caching** (`config/telemetry.py:91-112`): Added process-lifetime cache for installation UUID to eliminate redundant database queries on every telemetry call, particularly beneficial for high-frequency MCP requests.4. Minor: Type Ignore Could Be More Specific (Low Priority)In return _cached_installation_id # type: ignore[return-value]This is suppressing a type error because mypy can't narrow Recommendation: Use a more specific type ignore or cast: from typing import cast
return cast(str | None, _cached_installation_id)🔒 Security Review✅ No security concerns identified
⚡ Performance Impact✅ Positive impact
🧪 Test Coverage Assessment
📋 Pre-Merge Checklist
🎯 VerdictRecommended for merge with minor improvements The core implementation is solid and solves the immediate problem correctly. The missing test coverage and CHANGELOG update should be addressed before merge, but they're not blocking issues if time is constrained. The code follows project conventions and won't cause regressions. Priority: Address test coverage (#1) and CHANGELOG (#3) before merge. Items #2 and #4 can be deferred to follow-up PRs if needed. |
- Add AsyncTelemetryTestCase: tests arecord_event delegates correctly, respects telemetry-disabled setting, and handles missing properties. - Add InstallationIdCacheTestCase: verifies cache returns correct ID, avoids repeated DB hits, resets properly, and works across async calls. - Add thread-safety note to _get_installation_id cache docstring. - Replace type: ignore[return-value] with typing.cast for clarity. - Update CHANGELOG.md with MCP telemetry fix entry. https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV
Code Review for PR #877: Add async telemetry functionsSummaryThis PR successfully addresses the ✅ Strengths
🔍 Observations & Minor Suggestions1. Thread Safety of Cache (Info Only)File: The installation ID cache uses a global variable without explicit locking. While you correctly note that races are harmless (line 100-102), this is only true because:
This is acceptable for this use case, but future developers should be aware this pattern wouldn't work for non-singleton caching. Consider adding this to the docstring: # Note: This cache is safe without locking ONLY because:
# 1. Installation.objects.get() returns the same singleton value
# 2. CPython's GIL makes simple assignment atomic
# Do not use this pattern for caching non-singleton values.2. Error Handling in CacheFile: When
Suggestion: Consider caching the failure too: except Exception as e:
logger.warning(f"Failed to get installation ID: {e}")
_cached_installation_id = None # Cache the failure
return NoneThis prevents repeated DB hits when the Installation record genuinely doesn't exist. 3. DRY Violation in Async FunctionsFiles: The three async MCP telemetry functions ( Suggestion: Consider a helper function to reduce duplication: async def _arecord_mcp_event(
event_type: str,
properties: dict[str, Any],
success: bool = True,
error_type: str | None = None,
) -> bool:
"""Helper for async MCP telemetry with common context handling."""
context = _get_request_context()
properties["success"] = success
properties["transport"] = context.get("transport", "unknown")
if context.get("client_ip_hash"):
properties["client_ip_hash"] = context["client_ip_hash"]
if not success and error_type:
properties["error_type"] = error_type
return await arecord_event(event_type, properties)
async def arecord_mcp_tool_call(
tool_name: str,
success: bool = True,
error_type: str | None = None,
) -> bool:
return await _arecord_mcp_event(
"mcp_tool_call",
{"tool_name": tool_name},
success,
error_type,
)This would reduce code from ~85 lines to ~30-40 lines while maintaining readability. 4. Type Annotation PrecisionFile: _cached_installation_id: str | None | object = _UNSETThe type hint 🔒 Security Review✅ No security concerns identified
⚡ Performance Review✅ Excellent performance improvements
Measured Impact: Reduces DB queries from O(n) to O(1) where n = telemetry calls 🧪 Test Coverage Assessment✅ Comprehensive test coverage Strengths:
Coverage gaps (acceptable for this PR):
📝 Code Style & Conventions✅ Follows project standards
🎯 RecommendationsPriority: Low - Code is production-ready as-is. These are optimization suggestions:
✅ Approval StatusLGTM with minor suggestions This PR:
The suggestions above are optimizations, not blockers. The code is well-written and ready to merge. 📚 References
|
Two root causes: 1. AsyncTelemetryTestCase: Django TestCase wraps each test in an uncommitted transaction visible only to the current DB connection. sync_to_async runs record_event in a different thread with a different connection, so Installation.objects.get() fails. Fix: mock record_event directly since we're testing that arecord_event delegates correctly, not the full pipeline. 2. InstallationIdCacheTestCase: mock path "config.telemetry.Installation .objects.get" was wrong because Installation is imported locally inside _get_installation_id(), not as a module attribute. Fix: use patch.object(Installation.objects, "get", ...) which intercepts the call regardless of import path. For test_cache_works_across_async_calls, pre-populate the cache synchronously before async calls to avoid the cross-thread DB issue. https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV
Pull Request ReviewThis PR successfully addresses the SynchronousOnlyOperation error in MCP telemetry by introducing async variants of telemetry functions and implementing installation ID caching. The implementation is well-architected and follows Django async best practices. Strengths1. Correct Async Pattern
2. Performance OptimizationThe installation ID caching is excellent:
3. Test CoverageThe tests are comprehensive and well-documented:
4. Code Quality
Minor Observations1. Type Cast Usage (config/telemetry.py:107)Using typing.cast is correct here since the type checker cannot infer the narrowing from the sentinel check. The current implementation is clean and appropriate. 2. Cache Thread Safety (config/telemetry.py:100-102)The docstring correctly explains the race condition is harmless. Consider adding a note about why no lock is needed - all racing calls produce identical results, so the overhead would provide no benefit. 3. DRY in Async Functions (opencontractserver/mcp/telemetry.py:225-308)The three async functions have nearly identical implementations, but this is appropriate because:
Security and PerformanceSecurity
Performance
Test Coverage AssessmentThe tests cover:
Missing coverage (acceptable for this PR):
These edge cases are already covered by the sync version tests. Best Practices AlignmentChecking against CLAUDE.md:
Recommendations1. Pre-merge ChecklistRun these commands before merging:
2. Documentation Enhancement (Optional)Consider adding a note to docs/architecture/llms/README.md about telemetry in async agent tools, since this is a pattern other developers might need to follow. 3. Future Optimization (Post-merge)If telemetry becomes a performance bottleneck under high load, consider:
SummaryThis is high-quality work that solves a real production issue. The implementation is:
Verdict: APPROVE - Ready to merge after confirming tests pass. The async telemetry pattern introduced here should be considered the standard approach for any future ASGI/async contexts that need to interact with Django ORM. Review completed following CLAUDE.md guidelines and project conventions. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Summary
This PR adds async versions of telemetry recording functions to prevent
SynchronousOnlyOperationerrors when calling telemetry from ASGI handlers and other async contexts. It also implements caching for the installation ID lookup to avoid repeated database hits.Key Changes
Added
arecord_event()inconfig/telemetry.py: Async wrapper aroundrecord_event()usingsync_to_asyncto safely run Django ORM lookups in a thread pool instead of blocking the event loop.Implemented installation ID caching: The
_get_installation_id()function now caches its result after the first successful lookup, since the installation UUID is a singleton that never changes during the process lifetime. This eliminates redundant database queries on every telemetry call.Added async telemetry functions in
opencontractserver/mcp/telemetry.py:arecord_mcp_tool_call()- async version for tool call telemetryarecord_mcp_resource_read()- async version for resource read telemetryarecord_mcp_request()- async version for request telemetryUpdated
opencontractserver/mcp/server.py: Replaced all synchronous telemetry calls with their async counterparts throughout the ASGI handlers, ensuring proper async/await patterns and preventing event loop blocking.Updated tests in
opencontractserver/mcp/tests/test_mcp.py: Updated mock patches to reference the new async telemetry functions.Implementation Details
_UNSETsentinel object is used to distinguish between "not yet cached" and "cached as None" states for the installation ID._reset_posthog_client()function now also resets the cached installation ID, ensuring clean state during testing.https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV