add cf-o11y worker by iscekic · Pull Request #48 · Kilo-Org/cloud

iscekic · 2026-02-05T21:47:24Z

No description provided.

Add O11Y_SERVICE_URL environment variable for agent observability ingest service with empty string fallback

Implement POST /ingest/api-metrics endpoint using Hono framework to collect observability metrics from API clients. The endpoint validates incoming metrics data including client info, model usage, timing data, token counts, and error states using Zod schema validation. Add dependencies hono and zod for routing and validation. Enable smart placement in Cloudflare Workers configuration for optimal performance. Include comprehensive tests for valid payloads and validation error handling.

Extract inline Zod validation code from the API metrics endpoint into a reusable zodJsonValidator utility function. This improves code maintainability and enables consistent validation patterns across multiple endpoints.

- Add @hono/zod-validator dependency
- Create validation.ts utility with zodJsonValidator function
- Simplify /ingest/api-metrics endpoint by using the new validator
- Standardize error response format with ErrorResponse type

Integrate observability metrics collection into the OpenRouter API proxy to track model usage patterns. Creates a new server-side metrics emitter that sends provider, requested model, and resolved model information to the observability service. The metrics are emitted asynchronously using Next.js after() to avoid impacting request latency. Includes best-effort error handling to ensure metrics failures never affect the primary request flow.

Remove clientName from API metrics parameters and derive it server-side from clientSecret using a new mapping function. This simplifies the client API by reducing required parameters and centralizes client identification logic.

- Add client-secrets.ts with getClientNameFromSecret() mapping function
- Update validation schema to verify clientSecret and transform to include clientName
- Remove clientName parameter from ApiMetricsParams type
- Update tests to use mapped clientSecret instead of explicit clientName

Add toolsAvailable field to API metrics to track which tools are available in requests. Implement getToolsAvailable helper to extract and format tool names from OpenAI tool definitions, supporting both function and custom tool types. Refactor URL initialization to use IIFE pattern for better error handling and null safety.
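For readers who have not seen the diff, the tool-name extraction described above could look roughly like the sketch below. The `ToolDefinitionLike` shape is a simplified stand-in for the OpenAI tool definition types, and the `"<type>:<name>"` output format is an assumption, not something the commit message confirms.

```typescript
// Simplified stand-in for OpenAI tool definitions (assumption: only the
// fields needed here are modelled).
type ToolDefinitionLike =
  | { type: "function"; function: { name: string } }
  | { type: "custom"; custom: { name: string } };

// Returns one "<type>:<name>" entry per tool. The string format is
// hypothetical; the PR may format names differently.
function getToolsAvailable(tools: ToolDefinitionLike[] | undefined): string[] {
  if (!tools) return [];
  return tools.map((tool) =>
    tool.type === "function"
      ? `function:${tool.function.name}`
      : `custom:${tool.custom.name}`
  );
}
```

A request with no `tools` array yields an empty list, so the metric field can always be sent.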

Add getToolsUsed function to extract tool usage from assistant messages in the conversation history. This complements the existing toolsAvailable tracking by capturing which tools were actually invoked during the request. The function parses tool_calls from assistant messages and categorizes them as function, custom, or unknown types with their respective names.
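A minimal sketch of the same idea for invoked tools, assuming simplified message and tool-call shapes (`ChatMessageLike`, `ToolCallLike`) and the same hypothetical `"<type>:<name>"` format. Calls that match neither known type are recorded as `"unknown"`.

```typescript
// Simplified shapes (assumptions); real chat messages carry more fields.
type ToolCallLike = {
  type?: string;
  function?: { name?: string };
  custom?: { name?: string };
};
type ChatMessageLike = { role: string; tool_calls?: ToolCallLike[] };

// Walks the conversation history and collects tool calls made by the assistant.
function getToolsUsed(messages: ChatMessageLike[]): string[] {
  const used: string[] = [];
  for (const message of messages) {
    if (message.role !== "assistant" || !message.tool_calls) continue;
    for (const call of message.tool_calls) {
      if (call.type === "function" && call.function?.name) {
        used.push(`function:${call.function.name}`);
      } else if (call.type === "custom" && call.custom?.name) {
        used.push(`custom:${call.custom.name}`);
      } else {
        // Unrecognized or malformed call: keep a marker instead of dropping it.
        used.push("unknown");
      }
    }
  }
  return used;
}
```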

Move emitApiMetrics call to after provider response to capture time-to-first-byte (TTFB) measurement. This enables monitoring of model response latency for observability purposes.

Add new emitApiMetricsForResponse function that drains the response body to measure full upstream response time. This provides more accurate timing data by capturing the complete request lifecycle including TTFB and total duration. The original emitApiMetrics function is preserved for backward compatibility.

Add a 60-second timeout to drainResponseBody to prevent background work from running indefinitely on long-lived SSE connections. The function now tracks elapsed time and uses Promise.race to enforce the timeout, properly canceling the reader when the limit is reached.
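The timeout pattern described here can be sketched as follows. This is not the PR's drainResponseBody; `drainWithTimeout`, `ChunkReaderLike`, and `DrainOutcome` are hypothetical names, and the reader interface only models the two methods of a stream reader that the loop needs.

```typescript
// The subset of a stream reader used here (assumption: a real
// ReadableStreamDefaultReader satisfies this shape structurally).
interface ChunkReaderLike {
  read(): Promise<{ done: boolean }>;
  cancel(): Promise<void>;
}

type DrainOutcome = { completed: boolean; elapsedMs: number };

// Reads until the stream ends or the timeout fires, whichever comes first.
async function drainWithTimeout(
  reader: ChunkReaderLike,
  timeoutMs = 60_000
): Promise<DrainOutcome> {
  const startedAt = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  try {
    while (true) {
      // Each read races against the single shared timeout promise.
      const next = await Promise.race([reader.read(), timedOut]);
      if (next === "timeout") {
        await reader.cancel(); // stop background work on long-lived streams
        return { completed: false, elapsedMs: Date.now() - startedAt };
      }
      if (next.done) {
        return { completed: true, elapsedMs: Date.now() - startedAt };
      }
    }
  } finally {
    clearTimeout(timer); // avoid keeping the runtime alive after an early finish
  }
}
```

One timeout promise covers the whole drain rather than each chunk, so a slow but steady SSE stream is still cut off at the limit.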

Add statusCode field to API metrics tracking to enable monitoring of response status distributions and error rates. The status code is captured from the response object and included in the metrics parameters.

…icit field

Remove the explicit 'success' boolean field from API metrics schema and instead infer success/failure from the HTTP status code. Error messages are now required when statusCode >= 400, making the API more intuitive and reducing redundancy.

The errorMessage field and its validation logic have been removed from the API metrics schema. Error context can be derived from the statusCode field, eliminating the need for explicit error messages in metrics collection.

Add ApiMetricsTokens type to track input, output, cache write, cache hit, and total tokens. Implement getTokensFromCompletionUsage helper function to extract token metrics from OpenAI CompletionUsage objects. Extend ApiMetricsParams to include optional tokens field for comprehensive API usage monitoring.
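A rough sketch of the extraction step, under stated assumptions: `UsageLike` is a trimmed-down stand-in for the OpenAI CompletionUsage type, the `TokenCounts` field names are guesses at what ApiMetricsTokens contains, and `cache_write_tokens` is a hypothetical field used only to show where a cache-write count would be read from.

```typescript
// Trimmed-down usage shape. cache_write_tokens is hypothetical and included
// for illustration only.
type UsageLike = {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number; cache_write_tokens?: number };
};

// Assumed field names for the five counters listed in the commit message.
type TokenCounts = {
  inputTokens: number;
  outputTokens: number;
  cacheWriteTokens: number;
  cacheHitTokens: number;
  totalTokens: number;
};

// Returns undefined when the provider sent no usage block, so the metrics
// field can stay optional.
function tokensFromUsage(usage: UsageLike | null | undefined): TokenCounts | undefined {
  if (!usage) return undefined;
  return {
    inputTokens: usage.prompt_tokens,
    outputTokens: usage.completion_tokens,
    cacheWriteTokens: usage.prompt_tokens_details?.cache_write_tokens ?? 0,
    cacheHitTokens: usage.prompt_tokens_details?.cached_tokens ?? 0,
    totalTokens: usage.total_tokens,
  };
}
```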

Add kiloUserId, organizationId, isStreaming, userByok, and mode fields to API metrics schema to enable better tracking and analysis of API usage patterns. Update all metric emission points to include the new contextual information.

Remove hardcoded TODO placeholder and implement proper client secret configuration for API metrics. The secret is now loaded from O11Y_KILO_GATEWAY_CLIENT_SECRET environment variable and automatically included in metrics requests.

…unction

Implement dynamic secret retrieval from Cloudflare's Secrets Store binding to authenticate API clients. The authentication logic has been moved from schema-level validation into the route handler to support asynchronous secret fetching operations. Configuration updates include wrangler.jsonc binding setup for the O11Y_KILO_GATEWAY_CLIENT_SECRET resource. Test suite enhancements provide mock secret bindings and verify proper rejection of unauthorized requests with 403 status codes.

Implement PostHog event capture to track API usage metrics. The /ingest/api-metrics endpoint now forwards validated metrics to PostHog for analytics and monitoring.

- Add captureApiMetrics function to send events to PostHog
- Configure PostHog API key and host via environment variables
- Exclude clientSecret from captured properties for security
- Set $process_person_profile based on isAnonymous flag
- Update tests to include PostHog configuration
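The property handling in the list above might be sketched like this. `toPostHogProperties` and `IngestedMetricsLike` are hypothetical names, and mapping an anonymous request to `$process_person_profile: false` is an assumption about the direction of the flag, not a detail the commit message states.

```typescript
// Validated metrics as received by the ingest endpoint (assumed shape).
type IngestedMetricsLike = {
  clientSecret: string;
  isAnonymous: boolean;
  [key: string]: unknown;
};

// Builds the event properties: drops the secret and marks anonymous traffic
// so that no person profile is processed for it.
function toPostHogProperties(metrics: IngestedMetricsLike): Record<string, unknown> {
  const { clientSecret: _omitted, isAnonymous, ...rest } = metrics;
  return {
    ...rest,
    isAnonymous,
    $process_person_profile: !isAnonymous, // assumption: anonymous => false
  };
}
```

Removing the secret by destructuring keeps the exclusion in one place, so new metric fields flow through without needing an allow-list update.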

Remove intermediate ctx variable assignments and waitOnExecutionContext calls in favor of directly passing createExecutionContext() to worker.fetch(). This reduces test boilerplate while maintaining the same test behavior.

Add new reusable workflow for deploying o11y service to Cloudflare Workers with environment selection (dev/prod). Integrate o11y deployment into production workflow with automatic triggering when cloudflare-o11y directory changes are detected.

Add optional ipAddress field to API metrics schema and pass it through to PostHog analytics. This enables PostHog to resolve geographic location from the user's actual IP address rather than the Cloudflare worker's IP address, providing more accurate location analytics.

Changes include:

- Add ipAddress field to ApiMetricsParamsSchema with IPv4/IPv6 validation
- Extract and forward IP address in PostHog capture request
- Thread ipAddress parameter through OpenRouter API route
- Update ApiMetricsParams type definition

…detection

Add comprehensive alerting system for LLM API observability:

- Implement multi-window burn rate alerting following Google SRE Workbook approach with 3 severity windows (5m/1m, 30m/3m page; 360m/30m ticket)
- Add Analytics Engine integration for time-series metrics storage with weighted sampling support for error rates and latency percentiles
- Implement KV-based alert deduplication with severity-aware suppression (pages suppress tickets for same dimension)
- Add Slack notification delivery with separate webhooks for pages/tickets
- Integrate recommended models API endpoint to determine page-eligible models
- Configure cron trigger for per-minute alert evaluation
- Add comprehensive test coverage for dedup logic and SLO configuration

The system tracks error rates (99.9% SLO) and latency p50/p90 thresholds, firing alerts only when both long and short windows exceed burn rate thresholds to reduce false positives.
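To make the firing rule concrete: burn rate is the observed error rate divided by the error budget (1 minus the SLO target), and an alert fires only when both the long and the short window exceed the threshold. The sketch below uses the window sizes and the 99.9% target from the commit message. The threshold values are placeholders borrowed from the SRE Workbook's examples and are not taken from this PR, and all names are hypothetical.

```typescript
type BurnWindowConfig = {
  longMinutes: number;
  shortMinutes: number;
  threshold: number; // placeholder values, not from the PR
  severity: "page" | "ticket";
};

const SLO_TARGET = 0.999; // 99.9% error-rate SLO, as stated in the commit message

const BURN_WINDOWS: BurnWindowConfig[] = [
  { longMinutes: 5, shortMinutes: 1, threshold: 14.4, severity: "page" },
  { longMinutes: 30, shortMinutes: 3, threshold: 6, severity: "page" },
  { longMinutes: 360, shortMinutes: 30, threshold: 1, severity: "ticket" },
];

// How many times faster than "exactly on budget" errors are being consumed.
function burnRate(errorRate: number, sloTarget: number): number {
  return errorRate / (1 - sloTarget);
}

// Fires only when BOTH windows are above the threshold: the long window shows
// the problem is significant, the short window shows it is still happening.
function shouldFire(
  window: BurnWindowConfig,
  longErrorRate: number,
  shortErrorRate: number,
  sloTarget: number = SLO_TARGET
): boolean {
  return (
    burnRate(longErrorRate, sloTarget) > window.threshold &&
    burnRate(shortErrorRate, sloTarget) > window.threshold
  );
}
```

With a 99.9% target, a sustained 2% error rate is a burn rate of about 20, which would clear the placeholder page threshold in both windows. If the short window has already recovered, nothing fires.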

…variable

- Replace toUInt64 type casting with IF expressions for better readability in error rate and slow request queries
- Change query format from JSONEachRow to JSON for consistency
- Rename O11Y_APP_BASE_URL to O11Y_API_BASE_URL for clarity
- Update all references across configuration, tests, and type definitions

Configure custom domain routing for o11y.kiloapps.io in the Cloudflare Worker configuration to enable direct access to the observability service through the custom domain.

Add clientName parameter to alert deduplication functions to ensure alerts are tracked separately per client. This prevents alerts for the same provider:model combination from being incorrectly suppressed across different clients.

- Update alertKey() to include clientName in key generation
- Add clientName parameter to shouldSuppress() and recordAlertFired()
- Update all call sites in evaluate.ts to pass client_name
- Add test coverage for client-specific alert suppression
- Remove null return from effectiveSeverity() as it always returns a severity
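A sketch of how per-client keys and severity-aware suppression could fit together. It swaps the KV namespace for an in-memory `Map` so it runs standalone. The key format, the cooldown parameter, the millisecond timestamps, and the function signatures are all assumptions and differ from the PR's actual dedup.ts.

```typescript
type AlertSeverityLike = "page" | "ticket";
type AlertDimensionLike = { clientName: string; provider: string; model: string };

// Including clientName keeps the same provider:model pair separate per client.
function alertKey(severity: AlertSeverityLike, dim: AlertDimensionLike): string {
  return `alert:${severity}:${dim.clientName}:${dim.provider}:${dim.model}`;
}

// Stand-in for a KV write; stores the time the alert last fired.
function recordAlertFired(
  store: Map<string, number>,
  severity: AlertSeverityLike,
  dim: AlertDimensionLike,
  nowMs: number
): void {
  store.set(alertKey(severity, dim), nowMs);
}

// A ticket is suppressed by a recent ticket OR a recent page for the same
// dimension; a page is suppressed only by a recent page.
function shouldSuppress(
  store: Map<string, number>,
  severity: AlertSeverityLike,
  dim: AlertDimensionLike,
  nowMs: number,
  cooldownMs: number
): boolean {
  const severitiesToCheck: AlertSeverityLike[] =
    severity === "ticket" ? ["ticket", "page"] : ["page"];
  return severitiesToCheck.some((s) => {
    const lastFiredMs = store.get(alertKey(s, dim));
    return lastFiredMs !== undefined && nowMs - lastFiredMs < cooldownMs;
  });
}
```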

Copilot

Pull request overview

This pull request adds a new Cloudflare Worker for observability (o11y) that implements API metrics ingestion and SLO-based alerting for the Kilo AI platform.

Changes:

Adds a new Cloudflare Worker that ingests API metrics from the Kilo gateway
Implements multi-window burn-rate alerting based on Google SRE Workbook practices for error rates and latency
Integrates with PostHog for analytics and Slack for alert notifications
Adds a new API endpoint /api/recommended-models to expose recommended models for alert filtering

Reviewed changes

Copilot reviewed 29 out of 32 changed files in this pull request and generated 6 comments.

File	Description
src/lib/o11y/api-metrics.server.ts	Helper functions for extracting API metrics from OpenAI-compatible responses
src/lib/config.server.ts	Adds configuration variables for o11y service
src/app/api/recommended-models/route.ts	New API endpoint exposing recommended model list
src/app/api/openrouter/[...path]/route.ts	Integrates API metrics emission into the gateway proxy
cloudflare-o11y/*	Complete o11y worker implementation with alerting, querying, and notification logic
.github/workflows/*	CI/CD workflows for deploying the o11y worker
pnpm-workspace.yaml	Adds cloudflare-o11y to workspace

kiloconnect · 2026-02-06T01:10:54Z

Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

Overview

Severity	Count
CRITICAL	0
WARNING	2
SUGGESTION	0

WARNING

File	Line	Issue
`src/lib/o11y/api-metrics.server.ts`	145	`after()` drains a cloned response body after the response lifecycle, which can buffer large/streaming bodies
`src/app/api/openrouter/[...path]/route.ts`	386	`resolvedModel` recorded without normalization, potentially fragmenting metrics and breaking recommended-model matching

Files Reviewed (5 files)

.github/workflows/deploy-o11y.yml - 0 issues
.github/workflows/deploy-production.yml - 0 issues
src/app/api/openrouter/[...path]/route.ts - 1 issue
src/lib/config.server.ts - 0 issues
src/lib/o11y/api-metrics.server.ts - 1 issue

- Store alert timestamps in ISO 8601 format instead of Unix epoch for better readability
- Update authentication error message to be more generic and security-conscious

Wrap JSON.parse in try-catch to prevent crashes when KV cache contains invalid JSON data. Falls through to network fetch on parse errors.
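The fall-through behaviour can be sketched as below, assuming the cached value is a JSON array of model ID strings. `parseCachedModels`, `loadRecommendedModels`, and the `fetchFresh` callback are hypothetical names, and the real cache shape may differ.

```typescript
// Returns the cached list, or null when the entry is missing, corrupted, or
// not the expected shape.
function parseCachedModels(raw: string | null): string[] | null {
  if (raw === null) return null;
  try {
    const parsed: unknown = JSON.parse(raw);
    const isStringArray =
      Array.isArray(parsed) && parsed.every((m) => typeof m === "string");
    return isStringArray ? (parsed as string[]) : null;
  } catch {
    // Invalid JSON in the cache must not crash the alert evaluation.
    return null;
  }
}

// Uses the cache when it is usable, otherwise falls through to the network.
async function loadRecommendedModels(
  cachedRaw: string | null,
  fetchFresh: () => Promise<string[]>
): Promise<string[]> {
  return parseCachedModels(cachedRaw) ?? (await fetchFresh());
}
```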

Update worker configuration types generated by wrangler with workerd@1.20260128.0. Adds readonly exports property to ExecutionContext and DurableObjectState interfaces, and removes unnecessary eslint-disable comments from AIGateway type definitions.

- Add type guards to safely access tool.function.name and tool.custom.name properties
- Move toolsAvailable and toolsUsed extraction after tool repair logic to ensure accurate metrics
- Add fetch-depth: 2 to checkout steps in deploy workflow for proper path filtering

iscekic added 23 commits February 5, 2026 16:13

add cf-o11y worker

7e0b465

feat(config): add observability service URL configuration

e43c56e

feat(o11y): add TTFB tracking to API metrics

d849d5e

feat(o11y): track HTTP status codes in API metrics

0c4ff19

style(o11y): add blank lines for improved readability in validation f…

140ace0

…unction

test(o11y): simplify test setup by inlining execution context

a7351c9

iscekic self-assigned this Feb 5, 2026

iscekic added 5 commits February 6, 2026 01:02

feat(o11y): add custom domain route configuration

db86f70

chore(o11y): update compatibility date to 2026-02-01

36d9dd0

iscekic requested a review from Copilot February 6, 2026 01:04

iscekic marked this pull request as ready for review February 6, 2026 01:04

Copilot started reviewing on behalf of iscekic February 6, 2026 01:04

Copilot AI reviewed Feb 6, 2026

kiloconnect bot reviewed Feb 6, 2026

cloudflare-o11y/src/o11y-analytics.ts

cloudflare-o11y/src/posthog.ts

cloudflare-o11y/src/alerting/recommended-models.ts Outdated

iscekic added 2 commits February 6, 2026 02:12

refactor(o11y): improve alert timestamp format and error message clarity

db8b3f2

fix(o11y): add error handling for corrupted cache in recommended models

7a1cbae

kiloconnect bot reviewed Feb 6, 2026

src/lib/o11y/api-metrics.server.ts

src/app/api/openrouter/[...path]/route.ts Outdated

iscekic added 2 commits February 6, 2026 02:34

fix(o11y): normalize resolved model to lowercase for consistency

f9ce09c

iscekic requested review from jrf0110, markijbema and pandemicsyn February 6, 2026 01:37

iscekic wants to merge 33 commits into main from add-model-o11y
