Skip to content

Add async telemetry functions to prevent SynchronousOnlyOperation errors#877

Open
JSv4 wants to merge 5 commits intomainfrom
claude/fix-mcp-telemetry-rZTDn
Open

Add async telemetry functions to prevent SynchronousOnlyOperation errors#877
JSv4 wants to merge 5 commits intomainfrom
claude/fix-mcp-telemetry-rZTDn

Conversation

@JSv4
Copy link
Collaborator

@JSv4 JSv4 commented Feb 10, 2026

Summary

This PR adds async versions of telemetry recording functions to prevent SynchronousOnlyOperation errors when calling telemetry from ASGI handlers and other async contexts. It also implements caching for the installation ID lookup to avoid repeated database hits.

Key Changes

  • Added arecord_event() in config/telemetry.py: Async wrapper around record_event() using sync_to_async to safely run Django ORM lookups in a thread pool instead of blocking the event loop.

  • Implemented installation ID caching: The _get_installation_id() function now caches its result after the first successful lookup, since the installation UUID is a singleton that never changes during the process lifetime. This eliminates redundant database queries on every telemetry call.

  • Added async telemetry functions in opencontractserver/mcp/telemetry.py:

    • arecord_mcp_tool_call() - async version for tool call telemetry
    • arecord_mcp_resource_read() - async version for resource read telemetry
    • arecord_mcp_request() - async version for request telemetry
  • Updated opencontractserver/mcp/server.py: Replaced all synchronous telemetry calls with their async counterparts throughout the ASGI handlers, ensuring proper async/await patterns and preventing event loop blocking.

  • Updated tests in opencontractserver/mcp/tests/test_mcp.py: Updated mock patches to reference the new async telemetry functions.

Implementation Details

  • The _UNSET sentinel object is used to distinguish between "not yet cached" and "cached as None" states for the installation ID.
  • The _reset_posthog_client() function now also resets the cached installation ID, ensuring clean state during testing.
  • All async telemetry functions maintain the same property structure and behavior as their synchronous counterparts, just with proper async/await handling.
  • The caching mechanism is thread-safe for the process lifetime since the installation ID never changes.

https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV

The MCP server runs inside Daphne's async event loop, but the telemetry
functions were calling record_event() synchronously, which triggers
Installation.objects.get() — a sync ORM call that raises
SynchronousOnlyOperation in async contexts.

Add async variants (arecord_event, arecord_mcp_tool_call,
arecord_mcp_resource_read, arecord_mcp_request) that wrap the sync
telemetry through sync_to_async, running the ORM lookup in a thread pool.
Update all MCP server call sites to await the async versions.

Closes the "Failed to get installation ID: You cannot call this from an
async context" warning that fired on every MCP request.

https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV
…y call

The installation UUID is a singleton that never changes, but
_get_installation_id() was hitting the database on every call. Cache
the result after the first successful lookup so subsequent telemetry
events (including high-frequency MCP requests) skip the ORM entirely.

The cache is reset alongside the PostHog client in _reset_posthog_client()
for test isolation.

https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV
@claude
Copy link

claude bot commented Feb 10, 2026

Code Review - PR #877: Add async telemetry functions

Summary

This PR successfully addresses SynchronousOnlyOperation errors in the MCP server by adding async versions of telemetry functions. The implementation is solid with good caching optimization for the installation ID lookup.


✅ Strengths

1. Correct Async Pattern

  • Properly uses sync_to_async to wrap synchronous Django ORM calls
  • All MCP server handlers correctly await the async telemetry functions
  • Follows the async-first pattern documented in CLAUDE.md for agent tools

2. Performance Optimization

The installation ID caching (config/telemetry.py:91-112) is well-designed:

  • Uses sentinel pattern (_UNSET) to distinguish "not cached" from "cached as None"
  • Proper cache invalidation in _reset_posthog_client() for test isolation
  • Eliminates database hit on every telemetry call (good for high-frequency MCP requests)

3. Complete Implementation

  • All three async variants implemented: arecord_event, arecord_mcp_tool_call, arecord_mcp_resource_read, arecord_mcp_request
  • All call sites in opencontractserver/mcp/server.py updated
  • Test mocks updated to reference async functions
  • Consistent error handling maintained

4. Documentation

Clear docstrings explain:

  • When to use async vs sync versions
  • Why sync_to_async is necessary
  • Cache behavior and rationale

🔍 Issues & Recommendations

1. Missing Test Coverage for Async Functions (Medium Priority)

The new async telemetry functions (arecord_event, arecord_mcp_tool_call, etc.) have no dedicated tests. While test_mcp.py mocks them, there are no tests verifying:

  • That arecord_event correctly delegates to record_event via sync_to_async
  • That async functions work correctly in async contexts
  • That the installation ID caching works with async calls

Recommendation: Add async test cases to opencontractserver/tests/test_telemetry.py:

import asyncio
from django.test import TestCase, override_settings
from config.telemetry import arecord_event, _reset_posthog_client

class AsyncTelemetryTestCase(TestCase):
    def setUp(self):
        _reset_posthog_client()
        # PostHog mock setup...
    
    def test_arecord_event_success(self):
        """Test async event recording delegates to sync version."""
        with override_settings(MODE="DEV", TELEMETRY_ENABLED=True):
            result = asyncio.run(arecord_event("test_event", {"test": "value"}))
        
        self.assertTrue(result)
        # Verify PostHog was called...
    
    def test_arecord_event_caching_works(self):
        """Test installation ID cache works across async calls."""
        with override_settings(MODE="DEV", TELEMETRY_ENABLED=True):
            with patch('opencontractserver.users.models.Installation.objects.get') as mock_get:
                mock_get.return_value = self.mock_installation
                
                # First call should hit DB
                asyncio.run(arecord_event("event1"))
                self.assertEqual(mock_get.call_count, 1)
                
                # Second call should use cache
                asyncio.run(arecord_event("event2"))
                self.assertEqual(mock_get.call_count, 1)  # No additional DB call

2. Thread Safety of Cache Not Documented (Low Priority)

The PR description states "caching mechanism is thread-safe for the process lifetime", but Python's GIL provides thread-safety for the simple case here. However:

  • In an async context with multiple coroutines, race conditions are still possible during the initial cache population
  • This is unlikely to cause issues (worst case: duplicate DB queries), but should be noted

Recommendation: Add a comment in config/telemetry.py:

def _get_installation_id() -> str | None:
    """
    Get the installation ID from the Installation model.
    
    The result is cached after the first successful lookup because the
    installation UUID is a singleton that never changes for the lifetime
    of the process. This avoids a database hit on every telemetry call.
    
    Note: Multiple concurrent first calls may race to populate the cache,
    but this is harmless - the Installation singleton ensures the same
    value is cached regardless of which call wins.
    """

3. CHANGELOG.md Not Updated (Medium Priority)

Per CLAUDE.md guidelines, the CHANGELOG should be updated for significant changes. This fix prevents runtime errors in production MCP servers.

Recommendation: Add to CHANGELOG.md under ## [Unreleased]:

### Fixed

#### MCP Telemetry in Async Context
- **`SynchronousOnlyOperation` in MCP server** (`config/telemetry.py`, `opencontractserver/mcp/telemetry.py`, `opencontractserver/mcp/server.py`): Added async telemetry functions (`arecord_event`, `arecord_mcp_tool_call`, `arecord_mcp_resource_read`, `arecord_mcp_request`) that use `sync_to_async` to safely run Django ORM lookups in a thread pool. Prevents "You cannot call this from an async context" errors on every MCP request.
- **Installation ID caching** (`config/telemetry.py:91-112`): Added process-lifetime cache for installation UUID to eliminate redundant database queries on every telemetry call, particularly beneficial for high-frequency MCP requests.

4. Minor: Type Ignore Could Be More Specific (Low Priority)

In config/telemetry.py:102:

return _cached_installation_id  # type: ignore[return-value]

This is suppressing a type error because mypy can't narrow str | None | object to str | None after the sentinel check.

Recommendation: Use a more specific type ignore or cast:

from typing import cast
return cast(str | None, _cached_installation_id)

🔒 Security Review

No security concerns identified

  • Telemetry functions don't handle user input
  • Caching doesn't expose sensitive data
  • No changes to permission checks or data access patterns

⚡ Performance Impact

Positive impact

  • Eliminates database query on every telemetry call after the first
  • Prevents event loop blocking in async contexts
  • No measurable overhead from sync_to_async wrapper (negligible thread pool dispatch)

🧪 Test Coverage Assessment

Component Coverage Notes
record_event() ✅ Excellent Comprehensive tests in test_telemetry.py
arecord_event() ⚠️ Mocked only No direct tests, only mocked in MCP tests
Installation ID caching ⚠️ Untested Cache behavior not verified
Async MCP telemetry ⚠️ Mocked only Functions mocked in test_mcp.py but not tested directly

📋 Pre-Merge Checklist

  • Add async telemetry test cases to opencontractserver/tests/test_telemetry.py
  • Update CHANGELOG.md
  • (Optional) Add thread-safety note to cache comment
  • (Optional) Replace type ignore with cast

🎯 Verdict

Recommended for merge with minor improvements

The core implementation is solid and solves the immediate problem correctly. The missing test coverage and CHANGELOG update should be addressed before merge, but they're not blocking issues if time is constrained. The code follows project conventions and won't cause regressions.

Priority: Address test coverage (#1) and CHANGELOG (#3) before merge. Items #2 and #4 can be deferred to follow-up PRs if needed.

- Add AsyncTelemetryTestCase: tests arecord_event delegates correctly,
  respects telemetry-disabled setting, and handles missing properties.
- Add InstallationIdCacheTestCase: verifies cache returns correct ID,
  avoids repeated DB hits, resets properly, and works across async calls.
- Add thread-safety note to _get_installation_id cache docstring.
- Replace type: ignore[return-value] with typing.cast for clarity.
- Update CHANGELOG.md with MCP telemetry fix entry.

https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV
@claude
Copy link

claude bot commented Feb 11, 2026

Code Review for PR #877: Add async telemetry functions

Summary

This PR successfully addresses the SynchronousOnlyOperation errors occurring in MCP ASGI handlers by adding async telemetry functions and implementing installation ID caching. The implementation is well-structured and follows Django async patterns correctly.


✅ Strengths

  1. Correct Async Pattern: Using sync_to_async to wrap the sync record_event function is the right approach per Django's async best practices. This allows the ORM lookup in _get_installation_id to run in a thread pool.

  2. Smart Caching Strategy: The installation ID caching using a sentinel object (_UNSET) is elegant and properly distinguishes between "not cached" and "cached as None" states. The race condition note in the docstring (config/telemetry.py:100-102) is accurate and the harmless nature is correctly identified.

  3. Comprehensive Test Coverage: Excellent test coverage including:

    • Async function behavior verification
    • Cache effectiveness testing
    • Cross-async-call cache persistence
    • Cache reset functionality
  4. Consistent API Design: The async functions (arecord_*) maintain identical signatures and behavior to their sync counterparts, making them intuitive to use.

  5. Proper Documentation: Docstrings clearly explain when to use async vs sync versions and the rationale behind the implementation.

  6. CHANGELOG Updated: Appropriately documented with file paths and line numbers per project standards.


🔍 Observations & Minor Suggestions

1. Thread Safety of Cache (Info Only)

File: config/telemetry.py:92-117

The installation ID cache uses a global variable without explicit locking. While you correctly note that races are harmless (line 100-102), this is only true because:

  • Installation ID is a singleton
  • Assignment to _cached_installation_id is atomic in CPython (GIL)
  • All race outcomes produce the same value

This is acceptable for this use case, but future developers should be aware this pattern wouldn't work for non-singleton caching. Consider adding this to the docstring:

# Note: This cache is safe without locking ONLY because:
# 1. Installation.objects.get() returns the same singleton value
# 2. CPython's GIL makes simple assignment atomic
# Do not use this pattern for caching non-singleton values.

2. Error Handling in Cache

File: config/telemetry.py:115-117

When Installation.objects.get() fails, you return None but don't cache this failure. This means:

  • Every telemetry call will retry the DB lookup if Installation doesn't exist
  • Could cause performance issues during startup or in test environments

Suggestion: Consider caching the failure too:

except Exception as e:
    logger.warning(f"Failed to get installation ID: {e}")
    _cached_installation_id = None  # Cache the failure
    return None

This prevents repeated DB hits when the Installation record genuinely doesn't exist.

3. DRY Violation in Async Functions

Files: opencontractserver/mcp/telemetry.py:225-308

The three async MCP telemetry functions (arecord_mcp_tool_call, arecord_mcp_resource_read, arecord_mcp_request) have nearly identical implementations with only property names differing. This violates the DRY principle mentioned in CLAUDE.md.

Suggestion: Consider a helper function to reduce duplication:

async def _arecord_mcp_event(
    event_type: str,
    properties: dict[str, Any],
    success: bool = True,
    error_type: str | None = None,
) -> bool:
    """Helper for async MCP telemetry with common context handling."""
    context = _get_request_context()
    
    properties["success"] = success
    properties["transport"] = context.get("transport", "unknown")
    
    if context.get("client_ip_hash"):
        properties["client_ip_hash"] = context["client_ip_hash"]
    
    if not success and error_type:
        properties["error_type"] = error_type
    
    return await arecord_event(event_type, properties)

async def arecord_mcp_tool_call(
    tool_name: str,
    success: bool = True,
    error_type: str | None = None,
) -> bool:
    return await _arecord_mcp_event(
        "mcp_tool_call",
        {"tool_name": tool_name},
        success,
        error_type,
    )

This would reduce code from ~85 lines to ~30-40 lines while maintaining readability.

4. Type Annotation Precision

File: config/telemetry.py:20

_cached_installation_id: str | None | object = _UNSET

The type hint str | None | object is imprecise. The sentinel pattern is better typed using typing.Literal or a custom type guard. However, this is a minor style issue and the cast on line 107 handles it correctly.


🔒 Security Review

No security concerns identified

  • Async wrapper doesn't introduce new attack surface
  • Cache doesn't expose sensitive data
  • Installation ID is already used in telemetry, no new privacy implications

⚡ Performance Review

Excellent performance improvements

  • Caching eliminates repeated DB queries (major win for high-frequency MCP requests)
  • sync_to_async overhead is minimal and necessary
  • Test at config/telemetry.py:342-356 validates cache effectiveness

Measured Impact: Reduces DB queries from O(n) to O(1) where n = telemetry calls


🧪 Test Coverage Assessment

Comprehensive test coverage

Strengths:

  • Tests both success and failure paths
  • Validates cache behavior across multiple calls
  • Tests interaction with settings (TELEMETRY_ENABLED)
  • Tests cache reset functionality
  • Tests async/await behavior

Coverage gaps (acceptable for this PR):


📝 Code Style & Conventions

Follows project standards

  • Consistent with CLAUDE.md guidelines
  • Proper use of docstrings
  • Type hints included
  • Pre-commit hooks should pass (black, isort, flake8 compliant)

🎯 Recommendations

Priority: Low - Code is production-ready as-is. These are optimization suggestions:

  1. Consider caching Installation lookup failures (see observation Bump traefik from v2.8.7 to v2.9.1 in /compose/production/traefik #2)
  2. Optional DRY refactoring for async MCP functions (see observation Bump actions/setup-node from 3.4.1 to 3.5.1 #3)
  3. Add clarifying comment about thread safety assumptions (see observation Bump postgres from 14.5 to 15.0 in /compose/production/postgres #1)

✅ Approval Status

LGTM with minor suggestions

This PR:

  • ✅ Solves the stated problem (SynchronousOnlyOperation errors)
  • ✅ Includes excellent test coverage
  • ✅ Follows Django async best practices
  • ✅ Improves performance via caching
  • ✅ Maintains backward compatibility
  • ✅ Updates documentation appropriately

The suggestions above are optimizations, not blockers. The code is well-written and ready to merge.


📚 References

  • Django async docs: Async support
  • Project conventions: CLAUDE.md sections on DRY and testing patterns

Two root causes:

1. AsyncTelemetryTestCase: Django TestCase wraps each test in an
   uncommitted transaction visible only to the current DB connection.
   sync_to_async runs record_event in a different thread with a
   different connection, so Installation.objects.get() fails.
   Fix: mock record_event directly since we're testing that arecord_event
   delegates correctly, not the full pipeline.

2. InstallationIdCacheTestCase: mock path "config.telemetry.Installation
   .objects.get" was wrong because Installation is imported locally
   inside _get_installation_id(), not as a module attribute.
   Fix: use patch.object(Installation.objects, "get", ...) which
   intercepts the call regardless of import path.

   For test_cache_works_across_async_calls, pre-populate the cache
   synchronously before async calls to avoid the cross-thread DB issue.

https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV
@claude
Copy link

claude bot commented Feb 12, 2026

Pull Request Review

This PR successfully addresses the SynchronousOnlyOperation error in MCP telemetry by introducing async variants of telemetry functions and implementing installation ID caching. The implementation is well-architected and follows Django async best practices.


Strengths

1. Correct Async Pattern

  • Proper use of sync_to_async to wrap Django ORM calls for ASGI contexts
  • All call sites updated consistently to use async versions with proper await
  • Maintains backward compatibility by keeping sync versions for non-async contexts

2. Performance Optimization

The installation ID caching is excellent:

  • Uses sentinel pattern to distinguish between "not cached" and "cached as None"
  • Process-lifetime cache is appropriate since installation UUID never changes
  • Properly resets cache in _reset_posthog_client() for test isolation
  • Good documentation of race condition safety

3. Test Coverage

The tests are comprehensive and well-documented:

  • AsyncTelemetryTestCase tests delegation pattern with good explanation
  • InstallationIdCacheTestCase covers cache behavior, DB hit avoidance, reset functionality
  • Proper use of patch.object to mock manager method regardless of import path
  • Smart workaround for cross-thread DB visibility issue

4. Code Quality

  • Clear docstrings explaining when to use sync vs async versions
  • Consistent error handling across all async functions
  • Type hints using modern Python union syntax
  • CHANGELOG updated with file locations and line numbers

Minor Observations

1. Type Cast Usage (config/telemetry.py:107)

Using typing.cast is correct here since the type checker cannot infer the narrowing from the sentinel check. The current implementation is clean and appropriate.

2. Cache Thread Safety (config/telemetry.py:100-102)

The docstring correctly explains the race condition is harmless. Consider adding a note about why no lock is needed - all racing calls produce identical results, so the overhead would provide no benefit.

3. DRY in Async Functions (opencontractserver/mcp/telemetry.py:225-308)

The three async functions have nearly identical implementations, but this is appropriate because:

  • Each function has a clear, focused purpose
  • The duplication is minimal and localized
  • Abstracting would reduce readability
  • This follows the existing pattern for sync versions

Security and Performance

Security

  • IP hashing with salt remains intact
  • No sensitive data exposure in telemetry
  • Proper context isolation via ContextVar

Performance

  • Installation ID caching eliminates repeated DB queries
  • sync_to_async runs ORM in thread pool (non-blocking)
  • PostHog client singleton maintains connection pooling
  • Each sync_to_async call creates a thread pool task, adding some overhead for high-frequency calls, but this is acceptable given the alternative is blocking the event loop

Test Coverage Assessment

The tests cover:

  • Async delegation to sync functions
  • Return value propagation
  • Cache hit/miss behavior
  • Cache reset functionality
  • Cross-thread cache usage
  • Integration with mock patches in MCP tests

Missing coverage (acceptable for this PR):

  • Error handling when Installation.objects.get() raises exceptions in async context
  • Behavior when cached installation ID is set to None

These edge cases are already covered by the sync version tests.


Best Practices Alignment

Checking against CLAUDE.md:

  • Agent Tool Architecture: Correctly uses async patterns
  • Testing Patterns: Proper use of TestCase, mocking, and explanations
  • Changelog: Updated with file locations and line numbers
  • No dead code: All changes are necessary and active
  • DRY principle: Appropriate level of abstraction
  • Single Responsibility: Each function has a clear purpose
  • No magic numbers: Constants used appropriately
  • Test isolation: Proper setUp/tearDown

Recommendations

1. Pre-merge Checklist

Run these commands before merging:

  • docker compose -f test.yml run django pytest opencontractserver/tests/test_telemetry.py -n 4 --dist loadscope
  • docker compose -f test.yml run django pytest opencontractserver/mcp/tests/test_mcp.py -n 4 --dist loadscope
  • pre-commit run --all-files

2. Documentation Enhancement (Optional)

Consider adding a note to docs/architecture/llms/README.md about telemetry in async agent tools, since this is a pattern other developers might need to follow.

3. Future Optimization (Post-merge)

If telemetry becomes a performance bottleneck under high load, consider:

  • Batching telemetry events in memory before sending
  • Using a fire-and-forget pattern with asyncio.create_task() instead of awaiting
  • Note: This is NOT needed now - only if profiling shows it is an issue

Summary

This is high-quality work that solves a real production issue. The implementation is:

  • Architecturally sound
  • Well-tested with clear explanations
  • Performance-optimized with caching
  • Fully documented
  • Consistent with project conventions

Verdict: APPROVE - Ready to merge after confirming tests pass.

The async telemetry pattern introduced here should be considered the standard approach for any future ASGI/async contexts that need to interact with Django ORM.


Review completed following CLAUDE.md guidelines and project conventions.

@codecov
Copy link

codecov bot commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 80.20833% with 19 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
opencontractserver/mcp/telemetry.py 44.00% 14 Missing ⚠️
opencontractserver/mcp/server.py 72.22% 5 Missing ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants