add cf-o11y worker #48

Open
iscekic wants to merge 33 commits into main from add-model-o11y

Conversation

iscekic (Contributor) commented Feb 5, 2026

No description provided.

Add O11Y_SERVICE_URL environment variable for agent observability ingest service with empty string fallback
Implement POST /ingest/api-metrics endpoint using Hono framework to collect observability metrics from API clients. The endpoint validates incoming metrics data including client info, model usage, timing data, token counts, and error states using Zod schema validation.

Add dependencies hono and zod for routing and validation. Enable smart placement in Cloudflare Workers configuration for optimal performance. Include comprehensive tests for valid payloads and validation error handling.
Extract inline Zod validation code from the API metrics endpoint into a
reusable zodJsonValidator utility function. This improves code
maintainability and enables consistent validation patterns across
multiple endpoints.

- Add @hono/zod-validator dependency
- Create validation.ts utility with zodJsonValidator function
- Simplify /ingest/api-metrics endpoint by using the new validator
- Standardize error response format with ErrorResponse type
Integrate observability metrics collection into the OpenRouter API proxy to track model usage patterns. Creates a new server-side metrics emitter that sends provider, requested model, and resolved model information to the observability service.

The metrics are emitted asynchronously using Next.js after() to avoid impacting request latency. Includes best-effort error handling to ensure metrics failures never affect the primary request flow.
Remove clientName from API metrics parameters and derive it server-side from clientSecret using a new mapping function. This simplifies the client API by reducing required parameters and centralizes client identification logic.

- Add client-secrets.ts with getClientNameFromSecret() mapping function
- Update validation schema to verify clientSecret and transform to include clientName
- Remove clientName parameter from ApiMetricsParams type
- Update tests to use mapped clientSecret instead of explicit clientName
Add toolsAvailable field to API metrics to track which tools are available in requests. Implement getToolsAvailable helper to extract and format tool names from OpenAI tool definitions, supporting both function and custom tool types.

Refactor URL initialization to use IIFE pattern for better error handling and null safety.
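The tool-name extraction described for getToolsAvailable could look roughly like the sketch below. The local `ToolDefinition` types are simplified stand-ins for the OpenAI tool definitions, and the `"<type>:<name>"` output format is an assumption; the PR does not state how names are formatted.

```typescript
// Simplified local stand-ins for OpenAI tool definitions (not the SDK's own types).
type FunctionTool = { type: "function"; function: { name: string } };
type CustomTool = { type: "custom"; custom: { name: string } };
type ToolDefinition = FunctionTool | CustomTool;

// Returns one label per tool. The "<type>:<name>" format is assumed for illustration.
function getToolsAvailable(tools: ToolDefinition[] | undefined): string[] {
  if (!tools) return [];
  return tools.map((tool) =>
    tool.type === "function"
      ? `function:${tool.function.name}`
      : `custom:${tool.custom.name}`,
  );
}
```

Narrowing on the `type` discriminant keeps access to `tool.function.name` and `tool.custom.name` type-safe without casts.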
Add getToolsUsed function to extract tool usage from assistant messages in the conversation history. This complements the existing toolsAvailable tracking by capturing which tools were actually invoked during the request.

The function parses tool_calls from assistant messages and categorizes them as function, custom, or unknown types with their respective names.
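A hedged sketch of how tool_calls might be categorized as function, custom, or unknown. The `ToolUse` shape and the `"unknown"` placeholder name are assumptions, and messages are treated as untyped JSON rather than SDK types.

```typescript
type ToolUse = { kind: "function" | "custom" | "unknown"; name: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Classifies one tool call; anything that does not match a known shape is "unknown".
function classifyToolCall(call: unknown): ToolUse {
  if (isRecord(call)) {
    const fn = call.function;
    if (call.type === "function" && isRecord(fn) && typeof fn.name === "string") {
      return { kind: "function", name: fn.name };
    }
    const custom = call.custom;
    if (call.type === "custom" && isRecord(custom) && typeof custom.name === "string") {
      return { kind: "custom", name: custom.name };
    }
  }
  return { kind: "unknown", name: "unknown" };
}

// Collects tool calls from assistant messages only; other roles are skipped.
function getToolsUsed(messages: unknown[]): ToolUse[] {
  const used: ToolUse[] = [];
  for (const message of messages) {
    if (!isRecord(message) || message.role !== "assistant") continue;
    const calls = message.tool_calls;
    if (!Array.isArray(calls)) continue;
    for (const call of calls) used.push(classifyToolCall(call));
  }
  return used;
}
```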
Move emitApiMetrics call to after provider response to capture time-to-first-byte (TTFB) measurement. This enables monitoring of model response latency for observability purposes.
Add new emitApiMetricsForResponse function that drains the response body
to measure full upstream response time. This provides more accurate timing
data by capturing the complete request lifecycle including TTFB and total
duration. The original emitApiMetrics function is preserved for backward
compatibility.
Add a 60-second timeout to drainResponseBody to prevent background work
from running indefinitely on long-lived SSE connections. The function now
tracks elapsed time and uses Promise.race to enforce the timeout, properly
canceling the reader when the limit is reached.
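The timeout behavior could be sketched as follows. `BodyReader` is a minimal local interface modeled on a stream reader, and `drainWithTimeout` and `DrainResult` are hypothetical names; the PR's drainResponseBody may be structured differently. The timeout is a parameter here so the sketch is easy to exercise (the PR uses 60 seconds).

```typescript
// Minimal reader interface, modeled on a stream reader (an assumption for this sketch).
interface BodyReader {
  read(): Promise<{ done: boolean; value?: Uint8Array }>;
  cancel(reason?: unknown): Promise<void>;
}

type DrainResult = { completed: boolean; bytes: number };

// Reads until the body ends or the deadline passes, whichever comes first.
async function drainWithTimeout(reader: BodyReader, timeoutMs: number): Promise<DrainResult> {
  const deadline = Date.now() + timeoutMs;
  let bytes = 0;
  while (true) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) {
      await reader.cancel("drain timeout");
      return { completed: false, bytes };
    }
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<"timeout">((resolve) => {
      timer = setTimeout(() => resolve("timeout"), remaining);
    });
    // Whichever settles first wins: the next chunk or the deadline.
    const result = await Promise.race([reader.read(), timeout]);
    clearTimeout(timer);
    if (result === "timeout") {
      await reader.cancel("drain timeout");
      return { completed: false, bytes };
    }
    if (result.done) return { completed: true, bytes };
    bytes += result.value?.byteLength ?? 0;
  }
}
```

Clearing the timer after each race avoids leaving a pending timeout behind once the body finishes early.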
Add statusCode field to API metrics tracking to enable monitoring of response status distributions and error rates. The status code is captured from the response object and included in the metrics parameters.
…icit field

Remove the explicit 'success' boolean field from API metrics schema and instead infer success/failure from the HTTP status code. Error messages are now required when statusCode >= 400, making the API more intuitive and reducing redundancy.
The errorMessage field and its validation logic have been removed from the API metrics schema. Error context can be derived from the statusCode field, eliminating the need for explicit error messages in metrics collection.
Add ApiMetricsTokens type to track input, output, cache write, cache hit, and total tokens. Implement getTokensFromCompletionUsage helper function to extract token metrics from OpenAI CompletionUsage objects. Extend ApiMetricsParams to include optional tokens field for comprehensive API usage monitoring.
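A rough sketch of the token extraction. `UsageLike` mirrors the well-known fields of OpenAI's CompletionUsage (`prompt_tokens`, `completion_tokens`, `total_tokens`, `prompt_tokens_details.cached_tokens`). The `cache_write_tokens` field and the camelCase field names on `ApiMetricsTokens` are assumptions, since the PR does not list them.

```typescript
// Local stand-in for CompletionUsage; cache_write_tokens is an assumed, non-standard field.
type UsageLike = {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number; cache_write_tokens?: number };
};

// Field names here are hypothetical.
type ApiMetricsTokens = {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  cacheHitTokens?: number;
  cacheWriteTokens?: number;
};

function getTokensFromCompletionUsage(usage: UsageLike | undefined): ApiMetricsTokens | undefined {
  if (!usage) return undefined;
  return {
    inputTokens: usage.prompt_tokens,
    outputTokens: usage.completion_tokens,
    totalTokens: usage.total_tokens,
    cacheHitTokens: usage.prompt_tokens_details?.cached_tokens,
    cacheWriteTokens: usage.prompt_tokens_details?.cache_write_tokens,
  };
}
```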
Add kiloUserId, organizationId, isStreaming, userByok, and mode fields to API metrics schema to enable better tracking and analysis of API usage patterns. Update all metric emission points to include the new contextual information.
Remove hardcoded TODO placeholder and implement proper client secret configuration for API metrics. The secret is now loaded from O11Y_KILO_GATEWAY_CLIENT_SECRET environment variable and automatically included in metrics requests.
Implement dynamic secret retrieval from Cloudflare's Secrets Store
binding to authenticate API clients. The authentication logic has been
moved from schema-level validation into the route handler to support
asynchronous secret fetching operations.

Configuration updates include wrangler.jsonc binding setup for the
O11Y_KILO_GATEWAY_CLIENT_SECRET resource. Test suite enhancements
provide mock secret bindings and verify proper rejection of
unauthorized requests with 403 status codes.
Implement PostHog event capture to track API usage metrics. The /ingest/api-metrics endpoint now forwards validated metrics to PostHog for analytics and monitoring.

- Add captureApiMetrics function to send events to PostHog
- Configure PostHog API key and host via environment variables
- Exclude clientSecret from captured properties for security
- Set $process_person_profile based on isAnonymous flag
- Update tests to include PostHog configuration
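One way the capture payload might be assembled is sketched below. The `api_key` / `event` / `distinct_id` / `properties` body follows PostHog's capture API as commonly documented. The event name `o11y_api_metrics`, the trimmed `ApiMetricsParams` shape, and the use of `kiloUserId` as the distinct ID are assumptions for illustration.

```typescript
// Trimmed, hypothetical version of the metrics params.
type ApiMetricsParams = {
  clientSecret: string;
  kiloUserId: string;
  isAnonymous: boolean;
  provider: string;
  resolvedModel: string;
  statusCode: number;
};

type CaptureBody = {
  api_key: string;
  event: string;
  distinct_id: string;
  properties: Record<string, unknown>;
};

function buildCaptureBody(apiKey: string, params: ApiMetricsParams): CaptureBody {
  const properties: Record<string, unknown> = {
    ...params,
    // Anonymous traffic should not create or update person profiles.
    $process_person_profile: !params.isAnonymous,
  };
  // The secret must never leave the worker as an event property.
  delete properties.clientSecret;
  return {
    api_key: apiKey,
    event: "o11y_api_metrics", // assumed event name
    distinct_id: params.kiloUserId,
    properties,
  };
}
```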
Remove intermediate ctx variable assignments and waitOnExecutionContext calls
in favor of directly passing createExecutionContext() to worker.fetch().
This reduces test boilerplate while maintaining the same test behavior.
Add new reusable workflow for deploying o11y service to Cloudflare Workers with environment selection (dev/prod). Integrate o11y deployment into production workflow with automatic triggering when cloudflare-o11y directory changes are detected.
Add optional ipAddress field to API metrics schema and pass it through
to PostHog analytics. This enables PostHog to resolve geographic
location from the user's actual IP address rather than the Cloudflare
worker's IP address, providing more accurate location analytics.

Changes include:
- Add ipAddress field to ApiMetricsParamsSchema with IPv4/IPv6 validation
- Extract and forward IP address in PostHog capture request
- Thread ipAddress parameter through OpenRouter API route
- Update ApiMetricsParams type definition
@iscekic iscekic self-assigned this Feb 5, 2026
…detection

Add comprehensive alerting system for LLM API observability:

- Implement multi-window burn rate alerting following Google SRE Workbook
  approach with 3 long/short window pairs (5m/1m and 30m/3m page; 360m/30m ticket)
- Add Analytics Engine integration for time-series metrics storage with
  weighted sampling support for error rates and latency percentiles
- Implement KV-based alert deduplication with severity-aware suppression
  (pages suppress tickets for same dimension)
- Add Slack notification delivery with separate webhooks for pages/tickets
- Integrate recommended models API endpoint to determine page-eligible models
- Configure cron trigger for per-minute alert evaluation
- Add comprehensive test coverage for dedup logic and SLO configuration

The system tracks error rates (99.9% SLO) and latency p50/p90 thresholds,
firing alerts only when both long and short windows exceed burn rate
thresholds to reduce false positives.
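A minimal sketch of the both-windows rule for the error-rate SLO, assuming burn rate is the observed error rate divided by the error budget (1 minus the SLO target). The window pairs come from the commit message. The thresholds 14.4, 6, and 1 are borrowed from the SRE Workbook's example values and are placeholders here, since the PR does not state its thresholds. `evaluateBurn` and `errorRateFor` are hypothetical names.

```typescript
type Severity = "page" | "ticket";
type BurnWindow = { longMinutes: number; shortMinutes: number; threshold: number; severity: Severity };

const SLO_TARGET = 0.999; // 99.9% success SLO from the commit message

// Thresholds are placeholders, not values from this PR.
const WINDOWS: BurnWindow[] = [
  { longMinutes: 5, shortMinutes: 1, threshold: 14.4, severity: "page" },
  { longMinutes: 30, shortMinutes: 3, threshold: 6, severity: "page" },
  { longMinutes: 360, shortMinutes: 30, threshold: 1, severity: "ticket" },
];

// How many times faster than "exactly on budget" errors are being spent.
function burnRate(errorRate: number, slo: number = SLO_TARGET): number {
  return errorRate / (1 - slo);
}

// Returns the first (most severe) window pair whose long AND short windows both burn too fast.
function evaluateBurn(errorRateFor: (minutes: number) => number): Severity | null {
  for (const w of WINDOWS) {
    const longBurn = burnRate(errorRateFor(w.longMinutes));
    const shortBurn = burnRate(errorRateFor(w.shortMinutes));
    if (longBurn >= w.threshold && shortBurn >= w.threshold) return w.severity;
  }
  return null;
}
```

Requiring the short window as well means an incident that has already recovered stops alerting quickly, even while the long window still shows elevated errors.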
…variable

- Replace toUInt64 type casting with IF expressions for better readability in error rate and slow request queries
- Change query format from JSONEachRow to JSON for consistency
- Rename O11Y_APP_BASE_URL to O11Y_API_BASE_URL for clarity
- Update all references across configuration, tests, and type definitions
Configure custom domain routing for o11y.kiloapps.io in the Cloudflare Worker configuration to enable direct access to the observability service through the custom domain.
Add clientName parameter to alert deduplication functions to ensure
alerts are tracked separately per client. This prevents alerts for
the same provider:model combination from being incorrectly suppressed
across different clients.

- Update alertKey() to include clientName in key generation
- Add clientName parameter to shouldSuppress() and recordAlertFired()
- Update all call sites in evaluate.ts to pass client_name
- Add test coverage for client-specific alert suppression
- Remove null return from effectiveSeverity() as it always returns a severity
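The per-client deduplication could be sketched as below, with an in-memory `Map` standing in for the KV namespace. The function names come from the commit message, but the signatures, the key layout, and the 15-minute cooldown are assumptions.

```typescript
type Severity = "page" | "ticket";
type FiredAlert = { severity: Severity; firedAtMs: number };

const COOLDOWN_MS = 15 * 60 * 1000; // hypothetical suppression window

// Including clientName keeps alerts for the same provider:model separate per client.
function alertKey(clientName: string, provider: string, model: string, severity: Severity): string {
  return `alert:${severity}:${clientName}:${provider}:${model}`;
}

function shouldSuppress(
  store: Map<string, FiredAlert>,
  clientName: string,
  provider: string,
  model: string,
  severity: Severity,
  nowMs: number,
): boolean {
  const firedRecently = (s: Severity): boolean => {
    const previous = store.get(alertKey(clientName, provider, model, s));
    return previous !== undefined && nowMs - previous.firedAtMs < COOLDOWN_MS;
  };
  if (firedRecently(severity)) return true;
  // A recent page suppresses a ticket for the same dimension, never the reverse.
  return severity === "ticket" && firedRecently("page");
}

function recordAlertFired(
  store: Map<string, FiredAlert>,
  clientName: string,
  provider: string,
  model: string,
  severity: Severity,
  nowMs: number,
): void {
  store.set(alertKey(clientName, provider, model, severity), { severity, firedAtMs: nowMs });
}
```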
@iscekic iscekic requested a review from Copilot February 6, 2026 01:04
@iscekic iscekic marked this pull request as ready for review February 6, 2026 01:04

Copilot AI left a comment

Pull request overview

This pull request adds a new Cloudflare Worker for observability (o11y) that implements API metrics ingestion and SLO-based alerting for the Kilo AI platform.

Changes:

  • Adds a new Cloudflare Worker that ingests API metrics from the Kilo gateway
  • Implements multi-window burn-rate alerting based on Google SRE Workbook practices for error rates and latency
  • Integrates with PostHog for analytics and Slack for alert notifications
  • Adds a new API endpoint /api/recommended-models to expose recommended models for alert filtering

Reviewed changes

Copilot reviewed 29 out of 32 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/lib/o11y/api-metrics.server.ts | Helper functions for extracting API metrics from OpenAI-compatible responses |
| src/lib/config.server.ts | Adds configuration variables for o11y service |
| src/app/api/recommended-models/route.ts | New API endpoint exposing recommended model list |
| src/app/api/openrouter/[...path]/route.ts | Integrates API metrics emission into the gateway proxy |
| cloudflare-o11y/* | Complete o11y worker implementation with alerting, querying, and notification logic |
| .github/workflows/* | CI/CD workflows for deploying the o11y worker |
| pnpm-workspace.yaml | Adds cloudflare-o11y to workspace |

kiloconnect bot (Contributor) commented Feb 6, 2026

Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 2 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| src/lib/o11y/api-metrics.server.ts | 145 | after() drains a cloned response body after the response lifecycle, which can buffer large/streaming bodies |
| src/app/api/openrouter/[...path]/route.ts | 386 | resolvedModel recorded without normalization, potentially fragmenting metrics and breaking recommended-model matching |

Files Reviewed (5 files)
  • .github/workflows/deploy-o11y.yml - 0 issues
  • .github/workflows/deploy-production.yml - 0 issues
  • src/app/api/openrouter/[...path]/route.ts - 1 issue
  • src/lib/config.server.ts - 0 issues
  • src/lib/o11y/api-metrics.server.ts - 1 issue

- Store alert timestamps in ISO 8601 format instead of Unix epoch for better readability
- Update authentication error message to be more generic and security-conscious
Wrap JSON.parse in try-catch to prevent crashes when KV cache contains
invalid JSON data. Falls through to network fetch on parse errors.
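The fall-through behavior could be sketched like this. `parseCached` and `loadWithCache` are hypothetical names, and the raw cached string is passed in directly rather than read from a KV binding.

```typescript
// Returns the parsed value, or undefined when the cache is empty or holds invalid JSON.
function parseCached<T>(raw: string | null): T | undefined {
  if (raw === null) return undefined;
  try {
    return JSON.parse(raw) as T;
  } catch {
    // Corrupt cache entry: treat it as a miss instead of crashing.
    return undefined;
  }
}

// Uses the cached value when it parses; otherwise falls through to the network fetch.
async function loadWithCache<T>(raw: string | null, fetchFresh: () => Promise<T>): Promise<T> {
  const cached = parseCached<T>(raw);
  return cached !== undefined ? cached : fetchFresh();
}
```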
Update worker configuration types generated by wrangler with workerd@1.20260128.0. Adds readonly exports property to ExecutionContext and DurableObjectState interfaces, and removes unnecessary eslint-disable comments from AIGateway type definitions.
- Add type guards to safely access tool.function.name and tool.custom.name properties
- Move toolsAvailable and toolsUsed extraction after tool repair logic to ensure accurate metrics
- Add fetch-depth: 2 to checkout steps in deploy workflow for proper path filtering