Add async telemetry functions to prevent SynchronousOnlyOperation errors by JSv4 · Pull Request #877 · Open-Source-Legal/OpenContracts

JSv4 · 2026-02-10T14:29:34Z

Summary

This PR adds async versions of telemetry recording functions to prevent SynchronousOnlyOperation errors when calling telemetry from ASGI handlers and other async contexts. It also implements caching for the installation ID lookup to avoid repeated database hits.

Key Changes

Added arecord_event() in config/telemetry.py: Async wrapper around record_event() using sync_to_async to safely run Django ORM lookups in a thread pool instead of blocking the event loop.
Implemented installation ID caching: The _get_installation_id() function now caches its result after the first successful lookup, since the installation UUID is a singleton that never changes during the process lifetime. This eliminates redundant database queries on every telemetry call.
Added async telemetry functions in opencontractserver/mcp/telemetry.py:
- arecord_mcp_tool_call() - async version for tool call telemetry
- arecord_mcp_resource_read() - async version for resource read telemetry
- arecord_mcp_request() - async version for request telemetry
Updated opencontractserver/mcp/server.py: Replaced all synchronous telemetry calls with their async counterparts throughout the ASGI handlers, ensuring proper async/await patterns and preventing event loop blocking.
Updated tests in opencontractserver/mcp/tests/test_mcp.py: Updated mock patches to reference the new async telemetry functions.

Implementation Details

The _UNSET sentinel object is used to distinguish between "not yet cached" and "cached as None" states for the installation ID.
The _reset_posthog_client() function now also resets the cached installation ID, ensuring clean state during testing.
All async telemetry functions maintain the same property structure and behavior as their synchronous counterparts, just with proper async/await handling.
The caching mechanism is thread-safe for the process lifetime since the installation ID never changes.

https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV

The MCP server runs inside Daphne's async event loop, but the telemetry functions were calling record_event() synchronously, which triggers Installation.objects.get() — a sync ORM call that raises SynchronousOnlyOperation in async contexts. Add async variants (arecord_event, arecord_mcp_tool_call, arecord_mcp_resource_read, arecord_mcp_request) that wrap the sync telemetry through sync_to_async, running the ORM lookup in a thread pool. Update all MCP server call sites to await the async versions. Closes the "Failed to get installation ID: You cannot call this from an async context" warning that fired on every MCP request. https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV

…y call The installation UUID is a singleton that never changes, but _get_installation_id() was hitting the database on every call. Cache the result after the first successful lookup so subsequent telemetry events (including high-frequency MCP requests) skip the ORM entirely. The cache is reset alongside the PostHog client in _reset_posthog_client() for test isolation. https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV

claude · 2026-02-10T14:31:46Z

Code Review - PR #877: Add async telemetry functions

Summary

This PR successfully addresses SynchronousOnlyOperation errors in the MCP server by adding async versions of telemetry functions. The implementation is solid with good caching optimization for the installation ID lookup.

✅ Strengths

1. Correct Async Pattern

Properly uses sync_to_async to wrap synchronous Django ORM calls
All MCP server handlers correctly await the async telemetry functions
Follows the async-first pattern documented in CLAUDE.md for agent tools

2. Performance Optimization

The installation ID caching (config/telemetry.py:91-112) is well-designed:

Uses sentinel pattern (_UNSET) to distinguish "not cached" from "cached as None"
Proper cache invalidation in _reset_posthog_client() for test isolation
Eliminates database hit on every telemetry call (good for high-frequency MCP requests)

3. Complete Implementation

All three async variants implemented: arecord_event, arecord_mcp_tool_call, arecord_mcp_resource_read, arecord_mcp_request
All call sites in opencontractserver/mcp/server.py updated
Test mocks updated to reference async functions
Consistent error handling maintained

4. Documentation

Clear docstrings explain:

When to use async vs sync versions
Why sync_to_async is necessary
Cache behavior and rationale

🔍 Issues & Recommendations

1. Missing Test Coverage for Async Functions (Medium Priority)

The new async telemetry functions (arecord_event, arecord_mcp_tool_call, etc.) have no dedicated tests. While test_mcp.py mocks them, there are no tests verifying:

That arecord_event correctly delegates to record_event via sync_to_async
That async functions work correctly in async contexts
That the installation ID caching works with async calls

Recommendation: Add async test cases to opencontractserver/tests/test_telemetry.py:

import asyncio
from django.test import TestCase, override_settings
from config.telemetry import arecord_event, _reset_posthog_client

class AsyncTelemetryTestCase(TestCase):
    def setUp(self):
        _reset_posthog_client()
        # PostHog mock setup...
    
    def test_arecord_event_success(self):
        """Test async event recording delegates to sync version."""
        with override_settings(MODE="DEV", TELEMETRY_ENABLED=True):
            result = asyncio.run(arecord_event("test_event", {"test": "value"}))
        
        self.assertTrue(result)
        # Verify PostHog was called...
    
    def test_arecord_event_caching_works(self):
        """Test installation ID cache works across async calls."""
        with override_settings(MODE="DEV", TELEMETRY_ENABLED=True):
            with patch('opencontractserver.users.models.Installation.objects.get') as mock_get:
                mock_get.return_value = self.mock_installation
                
                # First call should hit DB
                asyncio.run(arecord_event("event1"))
                self.assertEqual(mock_get.call_count, 1)
                
                # Second call should use cache
                asyncio.run(arecord_event("event2"))
                self.assertEqual(mock_get.call_count, 1)  # No additional DB call

2. Thread Safety of Cache Not Documented (Low Priority)

The PR description states "caching mechanism is thread-safe for the process lifetime", but Python's GIL provides thread-safety for the simple case here. However:

In an async context with multiple coroutines, race conditions are still possible during the initial cache population
This is unlikely to cause issues (worst case: duplicate DB queries), but should be noted

Recommendation: Add a comment in config/telemetry.py:

def _get_installation_id() -> str | None:
    """
    Get the installation ID from the Installation model.
    
    The result is cached after the first successful lookup because the
    installation UUID is a singleton that never changes for the lifetime
    of the process. This avoids a database hit on every telemetry call.
    
    Note: Multiple concurrent first calls may race to populate the cache,
    but this is harmless - the Installation singleton ensures the same
    value is cached regardless of which call wins.
    """

3. CHANGELOG.md Not Updated (Medium Priority)

Per CLAUDE.md guidelines, the CHANGELOG should be updated for significant changes. This fix prevents runtime errors in production MCP servers.

Recommendation: Add to CHANGELOG.md under ## [Unreleased]:

### Fixed

#### MCP Telemetry in Async Context
- **`SynchronousOnlyOperation` in MCP server** (`config/telemetry.py`, `opencontractserver/mcp/telemetry.py`, `opencontractserver/mcp/server.py`): Added async telemetry functions (`arecord_event`, `arecord_mcp_tool_call`, `arecord_mcp_resource_read`, `arecord_mcp_request`) that use `sync_to_async` to safely run Django ORM lookups in a thread pool. Prevents "You cannot call this from an async context" errors on every MCP request.
- **Installation ID caching** (`config/telemetry.py:91-112`): Added process-lifetime cache for installation UUID to eliminate redundant database queries on every telemetry call, particularly beneficial for high-frequency MCP requests.

4. Minor: Type Ignore Could Be More Specific (Low Priority)

In config/telemetry.py:102:

return _cached_installation_id  # type: ignore[return-value]

This is suppressing a type error because mypy can't narrow str | None | object to str | None after the sentinel check.

Recommendation: Use a more specific type ignore or cast:

from typing import cast
return cast(str | None, _cached_installation_id)

🔒 Security Review

✅ No security concerns identified

Telemetry functions don't handle user input
Caching doesn't expose sensitive data
No changes to permission checks or data access patterns

⚡ Performance Impact

✅ Positive impact

Eliminates database query on every telemetry call after the first
Prevents event loop blocking in async contexts
No measurable overhead from sync_to_async wrapper (negligible thread pool dispatch)

🧪 Test Coverage Assessment

Component	Coverage	Notes
`record_event()`	✅ Excellent	Comprehensive tests in `test_telemetry.py`
`arecord_event()`	⚠️ Mocked only	No direct tests, only mocked in MCP tests
Installation ID caching	⚠️ Untested	Cache behavior not verified
Async MCP telemetry	⚠️ Mocked only	Functions mocked in `test_mcp.py` but not tested directly

📋 Pre-Merge Checklist

Add async telemetry test cases to opencontractserver/tests/test_telemetry.py
Update CHANGELOG.md
(Optional) Add thread-safety note to cache comment
(Optional) Replace type ignore with cast

🎯 Verdict

Recommended for merge with minor improvements

The core implementation is solid and solves the immediate problem correctly. The missing test coverage and CHANGELOG update should be addressed before merge, but they're not blocking issues if time is constrained. The code follows project conventions and won't cause regressions.

Priority: Address test coverage (#1) and CHANGELOG (#3) before merge. Items #2 and #4 can be deferred to follow-up PRs if needed.

- Add AsyncTelemetryTestCase: tests arecord_event delegates correctly, respects telemetry-disabled setting, and handles missing properties. - Add InstallationIdCacheTestCase: verifies cache returns correct ID, avoids repeated DB hits, resets properly, and works across async calls. - Add thread-safety note to _get_installation_id cache docstring. - Replace type: ignore[return-value] with typing.cast for clarity. - Update CHANGELOG.md with MCP telemetry fix entry. https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV

claude · 2026-02-11T05:52:23Z

Code Review for PR #877: Add async telemetry functions

Summary

This PR successfully addresses the SynchronousOnlyOperation errors occurring in MCP ASGI handlers by adding async telemetry functions and implementing installation ID caching. The implementation is well-structured and follows Django async patterns correctly.

✅ Strengths

Correct Async Pattern: Using sync_to_async to wrap the sync record_event function is the right approach per Django's async best practices. This allows the ORM lookup in _get_installation_id to run in a thread pool.
Smart Caching Strategy: The installation ID caching using a sentinel object (_UNSET) is elegant and properly distinguishes between "not cached" and "cached as None" states. The race condition note in the docstring (config/telemetry.py:100-102) is accurate and the harmless nature is correctly identified.
Comprehensive Test Coverage: Excellent test coverage including:
- Async function behavior verification
- Cache effectiveness testing
- Cross-async-call cache persistence
- Cache reset functionality
Consistent API Design: The async functions (arecord_*) maintain identical signatures and behavior to their sync counterparts, making them intuitive to use.
Proper Documentation: Docstrings clearly explain when to use async vs sync versions and the rationale behind the implementation.
CHANGELOG Updated: Appropriately documented with file paths and line numbers per project standards.

🔍 Observations & Minor Suggestions

1. Thread Safety of Cache (Info Only)

File: config/telemetry.py:92-117

The installation ID cache uses a global variable without explicit locking. While you correctly note that races are harmless (line 100-102), this is only true because:

Installation ID is a singleton
Assignment to _cached_installation_id is atomic in CPython (GIL)
All race outcomes produce the same value

This is acceptable for this use case, but future developers should be aware this pattern wouldn't work for non-singleton caching. Consider adding this to the docstring:

# Note: This cache is safe without locking ONLY because:
# 1. Installation.objects.get() returns the same singleton value
# 2. CPython's GIL makes simple assignment atomic
# Do not use this pattern for caching non-singleton values.

2. Error Handling in Cache

File: config/telemetry.py:115-117

When Installation.objects.get() fails, you return None but don't cache this failure. This means:

Every telemetry call will retry the DB lookup if Installation doesn't exist
Could cause performance issues during startup or in test environments

Suggestion: Consider caching the failure too:

except Exception as e:
    logger.warning(f"Failed to get installation ID: {e}")
    _cached_installation_id = None  # Cache the failure
    return None

This prevents repeated DB hits when the Installation record genuinely doesn't exist.

3. DRY Violation in Async Functions

Files: opencontractserver/mcp/telemetry.py:225-308

The three async MCP telemetry functions (arecord_mcp_tool_call, arecord_mcp_resource_read, arecord_mcp_request) have nearly identical implementations with only property names differing. This violates the DRY principle mentioned in CLAUDE.md.

Suggestion: Consider a helper function to reduce duplication:

async def _arecord_mcp_event(
    event_type: str,
    properties: dict[str, Any],
    success: bool = True,
    error_type: str | None = None,
) -> bool:
    """Helper for async MCP telemetry with common context handling."""
    context = _get_request_context()
    
    properties["success"] = success
    properties["transport"] = context.get("transport", "unknown")
    
    if context.get("client_ip_hash"):
        properties["client_ip_hash"] = context["client_ip_hash"]
    
    if not success and error_type:
        properties["error_type"] = error_type
    
    return await arecord_event(event_type, properties)

async def arecord_mcp_tool_call(
    tool_name: str,
    success: bool = True,
    error_type: str | None = None,
) -> bool:
    return await _arecord_mcp_event(
        "mcp_tool_call",
        {"tool_name": tool_name},
        success,
        error_type,
    )

This would reduce code from ~85 lines to ~30-40 lines while maintaining readability.

4. Type Annotation Precision

File: config/telemetry.py:20

_cached_installation_id: str | None | object = _UNSET

The type hint str | None | object is imprecise. The sentinel pattern is better typed using typing.Literal or a custom type guard. However, this is a minor style issue and the cast on line 107 handles it correctly.

🔒 Security Review

✅ No security concerns identified

Async wrapper doesn't introduce new attack surface
Cache doesn't expose sensitive data
Installation ID is already used in telemetry, no new privacy implications

⚡ Performance Review

✅ Excellent performance improvements

Caching eliminates repeated DB queries (major win for high-frequency MCP requests)
sync_to_async overhead is minimal and necessary
Test at config/telemetry.py:342-356 validates cache effectiveness

Measured Impact: Reduces DB queries from O(n) to O(1) where n = telemetry calls

🧪 Test Coverage Assessment

✅ Comprehensive test coverage

Strengths:

Tests both success and failure paths
Validates cache behavior across multiple calls
Tests interaction with settings (TELEMETRY_ENABLED)
Tests cache reset functionality
Tests async/await behavior

Coverage gaps (acceptable for this PR):

No tests for arecord_mcp_tool_call, arecord_mcp_resource_read, arecord_mcp_request directly (though they're tested indirectly via test_mcp.py patches)
No test for the error caching suggestion in observation Bump traefik from v2.8.7 to v2.9.1 in /compose/production/traefik #2

📝 Code Style & Conventions

✅ Follows project standards

Consistent with CLAUDE.md guidelines
Proper use of docstrings
Type hints included
Pre-commit hooks should pass (black, isort, flake8 compliant)

🎯 Recommendations

Priority: Low - Code is production-ready as-is. These are optimization suggestions:

Consider caching Installation lookup failures (see observation Bump traefik from v2.8.7 to v2.9.1 in /compose/production/traefik #2)
Optional DRY refactoring for async MCP functions (see observation Bump actions/setup-node from 3.4.1 to 3.5.1 #3)
Add clarifying comment about thread safety assumptions (see observation Bump postgres from 14.5 to 15.0 in /compose/production/postgres #1)

✅ Approval Status

LGTM with minor suggestions

This PR:

✅ Solves the stated problem (SynchronousOnlyOperation errors)
✅ Includes excellent test coverage
✅ Follows Django async best practices
✅ Improves performance via caching
✅ Maintains backward compatibility
✅ Updates documentation appropriately

The suggestions above are optimizations, not blockers. The code is well-written and ready to merge.

📚 References

Django async docs: Async support
Project conventions: CLAUDE.md sections on DRY and testing patterns

…try-rZTDn

Two root causes: 1. AsyncTelemetryTestCase: Django TestCase wraps each test in an uncommitted transaction visible only to the current DB connection. sync_to_async runs record_event in a different thread with a different connection, so Installation.objects.get() fails. Fix: mock record_event directly since we're testing that arecord_event delegates correctly, not the full pipeline. 2. InstallationIdCacheTestCase: mock path "config.telemetry.Installation .objects.get" was wrong because Installation is imported locally inside _get_installation_id(), not as a module attribute. Fix: use patch.object(Installation.objects, "get", ...) which intercepts the call regardless of import path. For test_cache_works_across_async_calls, pre-populate the cache synchronously before async calls to avoid the cross-thread DB issue. https://claude.ai/code/session_014BGMhoou6nVtKnDmu8NZVV

claude · 2026-02-12T19:29:20Z

Pull Request Review

This PR successfully addresses the SynchronousOnlyOperation error in MCP telemetry by introducing async variants of telemetry functions and implementing installation ID caching. The implementation is well-architected and follows Django async best practices.

Strengths

1. Correct Async Pattern

Proper use of sync_to_async to wrap Django ORM calls for ASGI contexts
All call sites updated consistently to use async versions with proper await
Maintains backward compatibility by keeping sync versions for non-async contexts

2. Performance Optimization

The installation ID caching is excellent:

Uses sentinel pattern to distinguish between "not cached" and "cached as None"
Process-lifetime cache is appropriate since installation UUID never changes
Properly resets cache in _reset_posthog_client() for test isolation
Good documentation of race condition safety

3. Test Coverage

The tests are comprehensive and well-documented:

AsyncTelemetryTestCase tests delegation pattern with good explanation
InstallationIdCacheTestCase covers cache behavior, DB hit avoidance, reset functionality
Proper use of patch.object to mock manager method regardless of import path
Smart workaround for cross-thread DB visibility issue

4. Code Quality

Clear docstrings explaining when to use sync vs async versions
Consistent error handling across all async functions
Type hints using modern Python union syntax
CHANGELOG updated with file locations and line numbers

Minor Observations

1. Type Cast Usage (config/telemetry.py:107)

Using typing.cast is correct here since the type checker cannot infer the narrowing from the sentinel check. The current implementation is clean and appropriate.

2. Cache Thread Safety (config/telemetry.py:100-102)

The docstring correctly explains the race condition is harmless. Consider adding a note about why no lock is needed - all racing calls produce identical results, so the overhead would provide no benefit.

3. DRY in Async Functions (opencontractserver/mcp/telemetry.py:225-308)

The three async functions have nearly identical implementations, but this is appropriate because:

Each function has a clear, focused purpose
The duplication is minimal and localized
Abstracting would reduce readability
This follows the existing pattern for sync versions

Security and Performance

Security

IP hashing with salt remains intact
No sensitive data exposure in telemetry
Proper context isolation via ContextVar

Performance

Installation ID caching eliminates repeated DB queries
sync_to_async runs ORM in thread pool (non-blocking)
PostHog client singleton maintains connection pooling
Each sync_to_async call creates a thread pool task, adding some overhead for high-frequency calls, but this is acceptable given the alternative is blocking the event loop

Test Coverage Assessment

The tests cover:

Async delegation to sync functions
Return value propagation
Cache hit/miss behavior
Cache reset functionality
Cross-thread cache usage
Integration with mock patches in MCP tests

Missing coverage (acceptable for this PR):

Error handling when Installation.objects.get() raises exceptions in async context
Behavior when cached installation ID is set to None

These edge cases are already covered by the sync version tests.

Best Practices Alignment

Checking against CLAUDE.md:

Agent Tool Architecture: Correctly uses async patterns
Testing Patterns: Proper use of TestCase, mocking, and explanations
Changelog: Updated with file locations and line numbers
No dead code: All changes are necessary and active
DRY principle: Appropriate level of abstraction
Single Responsibility: Each function has a clear purpose
No magic numbers: Constants used appropriately
Test isolation: Proper setUp/tearDown

Recommendations

1. Pre-merge Checklist

Run these commands before merging:

docker compose -f test.yml run django pytest opencontractserver/tests/test_telemetry.py -n 4 --dist loadscope
docker compose -f test.yml run django pytest opencontractserver/mcp/tests/test_mcp.py -n 4 --dist loadscope
pre-commit run --all-files

2. Documentation Enhancement (Optional)

Consider adding a note to docs/architecture/llms/README.md about telemetry in async agent tools, since this is a pattern other developers might need to follow.

3. Future Optimization (Post-merge)

If telemetry becomes a performance bottleneck under high load, consider:

Batching telemetry events in memory before sending
Using a fire-and-forget pattern with asyncio.create_task() instead of awaiting
Note: This is NOT needed now - only if profiling shows it is an issue

Summary

This is high-quality work that solves a real production issue. The implementation is:

Architecturally sound
Well-tested with clear explanations
Performance-optimized with caching
Fully documented
Consistent with project conventions

Verdict: APPROVE - Ready to merge after confirming tests pass.

The async telemetry pattern introduced here should be considered the standard approach for any future ASGI/async contexts that need to interact with Django ORM.

Review completed following CLAUDE.md guidelines and project conventions.

codecov · 2026-02-12T20:20:00Z

Codecov Report

❌ Patch coverage is 80.20833% with 19 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
opencontractserver/mcp/telemetry.py	44.00%	14 Missing ⚠️
opencontractserver/mcp/server.py	72.22%	5 Missing ⚠️

📢 Thoughts on this report? Let us know!

claude added 2 commits February 10, 2026 14:15

claude added 2 commits February 12, 2026 19:21

Merge remote-tracking branch 'origin/main' into claude/fix-mcp-teleme…

e4e53c1

…try-rZTDn

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add async telemetry functions to prevent SynchronousOnlyOperation errors#877

Add async telemetry functions to prevent SynchronousOnlyOperation errors#877
JSv4 wants to merge 5 commits intomainfrom
claude/fix-mcp-telemetry-rZTDn

JSv4 commented Feb 10, 2026

Uh oh!

claude bot commented Feb 10, 2026

Uh oh!

claude bot commented Feb 11, 2026

Uh oh!

claude bot commented Feb 12, 2026

Uh oh!

codecov bot commented Feb 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

JSv4 commented Feb 10, 2026

Summary

Key Changes

Implementation Details

Uh oh!

claude bot commented Feb 10, 2026

Code Review - PR #877: Add async telemetry functions

Summary

✅ Strengths

1. Correct Async Pattern

2. Performance Optimization

3. Complete Implementation

4. Documentation

🔍 Issues & Recommendations

1. Missing Test Coverage for Async Functions (Medium Priority)

2. Thread Safety of Cache Not Documented (Low Priority)

3. CHANGELOG.md Not Updated (Medium Priority)

4. Minor: Type Ignore Could Be More Specific (Low Priority)

🔒 Security Review

⚡ Performance Impact

🧪 Test Coverage Assessment

📋 Pre-Merge Checklist

🎯 Verdict

Uh oh!

claude bot commented Feb 11, 2026

Code Review for PR #877: Add async telemetry functions

Summary

✅ Strengths

🔍 Observations & Minor Suggestions

1. Thread Safety of Cache (Info Only)

2. Error Handling in Cache

3. DRY Violation in Async Functions

4. Type Annotation Precision

🔒 Security Review

⚡ Performance Review

🧪 Test Coverage Assessment

📝 Code Style & Conventions

🎯 Recommendations

✅ Approval Status

📚 References

Uh oh!

claude bot commented Feb 12, 2026

Pull Request Review

Strengths

1. Correct Async Pattern

2. Performance Optimization

3. Test Coverage

4. Code Quality

Minor Observations

1. Type Cast Usage (config/telemetry.py:107)

2. Cache Thread Safety (config/telemetry.py:100-102)

3. DRY in Async Functions (opencontractserver/mcp/telemetry.py:225-308)

Security and Performance

Security

Performance

Test Coverage Assessment

Best Practices Alignment

Recommendations

1. Pre-merge Checklist

2. Documentation Enhancement (Optional)

3. Future Optimization (Post-merge)

Summary

Uh oh!

codecov bot commented Feb 12, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants