diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc index 4380ada99..9f381d3b6 100644 --- a/modules/ROOT/nav.adoc +++ b/modules/ROOT/nav.adoc @@ -42,6 +42,10 @@ **** xref:ai-agents:mcp/local/overview.adoc[Overview] **** xref:ai-agents:mcp/local/quickstart.adoc[Quickstart] **** xref:ai-agents:mcp/local/configuration.adoc[Configure] +** xref:ai-agents:observability/index.adoc[Transcripts] +*** xref:ai-agents:observability/concepts.adoc[Concepts] +*** xref:ai-agents:observability/transcripts.adoc[View Transcripts] +*** xref:ai-agents:observability/ingest-custom-traces.adoc[Ingest Traces from Custom Agents] * xref:develop:connect/about.adoc[Redpanda Connect] ** xref:develop:connect/connect-quickstart.adoc[Quickstart] diff --git a/modules/ai-agents/pages/observability/concepts.adoc b/modules/ai-agents/pages/observability/concepts.adoc new file mode 100644 index 000000000..aa777b29a --- /dev/null +++ b/modules/ai-agents/pages/observability/concepts.adoc @@ -0,0 +1,340 @@ += Transcripts and AI Observability +:description: Understand how Redpanda captures end-to-end execution transcripts on an immutable distributed log for agent governance and observability. +:page-topic-type: concepts +:personas: agent_developer, platform_admin, data_engineer +:learning-objective-1: Explain how transcripts and spans capture execution flow +:learning-objective-2: Interpret transcript structure for debugging and monitoring +:learning-objective-3: Distinguish between transcripts and audit logs + +Redpanda automatically captures glossterm:transcript[,transcripts] for AI agents, MCP servers, and AI Gateway operations. A transcript is the end-to-end execution record of an agentic behavior. It may span multiple agents, tools, and models and last from minutes to days. Redpanda's immutable distributed log stores every transcript, providing a correct record with no gaps. Transcripts form the keystone of Redpanda's governance for agents. 
+ +After reading this page, you will be able to: + +* [ ] {learning-objective-1} +* [ ] {learning-objective-2} +* [ ] {learning-objective-3} + +== What are transcripts + +A transcript records the complete execution of an agentic behavior from start to finish. It captures every step — across multiple agents, tools, models, and services — in a single, traceable record. The AI Gateway and every glossterm:AI agent[,agent] and glossterm:MCP server[] in your Agentic Data Plane (ADP) automatically emit OpenTelemetry traces to a glossterm:topic[] called `redpanda.otel_traces`. Redpanda's immutable distributed log stores these traces. + +Transcripts capture: + +* Tool invocations and results +* Agent reasoning steps +* Data processing operations +* External API calls +* Error conditions +* Performance metrics + +With 100% sampling, every operation is captured with no gaps. The underlying storage uses a distributed log built on Raft consensus (with TLA+ proven correctness), giving transcripts a trustworthy, immutable record for governance, debugging, and performance analysis. + +== Traces and spans + +glossterm:OpenTelemetry[] traces provide a complete picture of how a request flows through your system: + +* A _trace_ represents the entire lifecycle of a request (for example, a tool invocation from start to finish). +* A _span_ represents a single unit of work within that trace (such as a data processing operation or an external API call). +* A trace contains one or more spans organized hierarchically, showing how operations relate to each other. + +== Agent transcript hierarchy + +Agent executions create a hierarchy of spans that reflect how agents process requests. Understanding this hierarchy helps you interpret agent behavior and identify where issues occur. 
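The parent-child relationships that make up this hierarchy can be reconstructed from raw span records using only `spanId` and `parentSpanId`. The following sketch uses hypothetical, simplified span records (real spans carry full hex identifiers and timing fields):

[,python]
----
from collections import defaultdict

# Hypothetical span records, for illustration only.
spans = [
    {"spanId": "a1", "parentSpanId": None, "name": "ai-agent"},
    {"spanId": "b2", "parentSpanId": "a1", "name": "agent"},
    {"spanId": "c3", "parentSpanId": "b2", "name": "invoke_agent: customer-support-agent"},
    {"spanId": "d4", "parentSpanId": "c3", "name": "openai: chat gpt-5.2"},
]

# Index spans by their parent so the tree can be walked top-down.
children = defaultdict(list)
for span in spans:
    children[span["parentSpanId"]].append(span)

def tree_lines(parent_id=None, depth=0):
    """Return indented span names, parents before children."""
    lines = []
    for span in children[parent_id]:
        lines.append("  " * depth + span["name"])
        lines.extend(tree_lines(span["spanId"], depth + 1))
    return lines

print("\n".join(tree_lines()))
----

The printed tree matches the execution-flow diagrams shown later on this page, with each child indented under its parent span.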
+ +=== Agent span types + +Agent transcripts contain these span types: + +[cols="2,3,3", options="header"] +|=== +| Span Type | Description | Use To + +| `ai-agent` +| Top-level span representing the entire agent invocation from start to finish. Includes all processing time, from receiving the request through executing the reasoning loop, calling tools, and returning the final response. +| Measure total request duration and identify slow agent invocations. + +| `agent` +| Internal agent processing that represents reasoning and decision-making. Shows time spent in the glossterm:large language model (LLM)[,LLM] reasoning loop, including context processing, tool selection, and response generation. Multiple `agent` spans may appear when the agent iterates through its reasoning loop. +| Track reasoning time and identify iteration patterns. + +| `invoke_agent` +| Agent and sub-agent invocation in multi-agent architectures, following the https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/[OpenTelemetry agent invocation semantic conventions^]. Represents one agent calling another via the glossterm:Agent2Agent (A2A) protocol[,A2A protocol]. +| Trace calls between root agents and sub-agents, measure cross-agent latency, and identify which sub-agent was invoked. + +| `openai`, `anthropic`, or other LLM providers +| LLM provider API call showing calls to the language model. The span name matches the provider, and attributes typically include the model name (like `gpt-5.2` or `claude-sonnet-4-5`). +| Identify which model was called, measure LLM response time, and debug LLM API errors. + +| `rpcn-mcp` +| MCP tool invocation representing calls to Remote MCP servers. Shows tool execution time, including network latency and tool processing. Child spans with `instrumentationScope.name` set to `redpanda-connect` represent internal Redpanda Connect processing. +| Measure tool execution time and identify slow MCP tool calls. 
+|=== + +=== Typical agent execution flow + +A simple agent request creates this hierarchy: + +---- +ai-agent (6.65 seconds) +├── agent (6.41 seconds) +│ ├── invoke_agent: customer-support-agent (6.39 seconds) +│ │ └── openai: chat gpt-5.2 (6.2 seconds) +---- + +This hierarchy shows that the LLM API call (6.2 seconds) accounts for most of the total agent invocation time (6.65 seconds), revealing the bottleneck in this execution flow. + +== MCP server transcript hierarchy + +MCP server tool invocations produce a different span hierarchy focused on tool execution and internal processing. This structure reveals performance bottlenecks and helps debug tool-specific issues. + +=== MCP server span types + +MCP server transcripts contain these span types: + +[cols="2,3,3", options="header"] +|=== +| Span Type | Description | Use To + +| `mcp-{server-id}` +| Top-level span representing the entire MCP server invocation. The server ID uniquely identifies the MCP server instance. This span encompasses all tool execution from request receipt to response completion. +| Measure total MCP server response time and identify slow tool invocations. + +| `service` +| Internal service processing span that appears at multiple levels in the hierarchy. Represents Redpanda Connect service operations including routing, processing, and component execution. +| Track internal processing overhead and identify where time is spent in the service layer. + +| Tool name (e.g., `get_order_status`, `get_customer_history`) +| The specific MCP tool being invoked. This span name matches the tool name defined in the MCP server configuration. +| Identify which tool was called and measure tool-specific execution time. + +| `processors` +| Processor pipeline execution span showing the collection of processors that process the tool's data. Appears as a child of the tool invocation span. +| Measure total processor pipeline execution time. 
| Processor name (e.g., `mapping`, `http`, `branch`)
| Individual processor execution span representing a single Redpanda Connect processor. The span name matches the processor type.
| Identify slow processors and debug processing logic.
|===

=== Typical MCP server execution flow

An MCP tool invocation creates this hierarchy:

----
mcp-d5mnvn251oos73 (4.07 seconds)
├── service > get_order_status (4.00 seconds)
│   └── service > processors (43 microseconds)
│       └── service > mapping (18 microseconds)
----

This shows:

1. Total MCP server invocation: 4.07 seconds
2. Tool execution (get_order_status): 4.00 seconds
3. Processor pipeline: 43 microseconds
4. Mapping processor: 18 microseconds (data transformation)

The majority of time (4 seconds) is spent in tool execution, while internal processing (mapping) takes only microseconds. This indicates that the tool itself (likely making external API calls or database queries) is the bottleneck, not Redpanda Connect's internal processing.

== Transcript layers and scope

Transcripts contain multiple layers of instrumentation, from HTTP transport through application logic to external service calls. The `scope.name` field in each span identifies which instrumentation layer created that span.

=== Instrumentation layers

A complete agent transcript includes these layers:

[cols="2,2,4", options="header"]
|===
| Layer | Scope Name | Purpose

| HTTP Server
| `go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp`
| HTTP transport layer receiving requests. Shows request/response sizes, status codes, client addresses, and network details.

| AI SDK (Agent)
| `github.com/redpanda-data/ai-sdk-go/plugins/otel`
| Agent application logic. Shows agent invocations, LLM calls, tool executions, conversation IDs, token usage, and model details. Includes `gen_ai.*` semantic convention attributes.
+ +| HTTP Client +| `go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp` +| Outbound HTTP calls from agent to MCP servers. Shows target URLs, request methods, and response codes. + +| MCP Server +| `rpcn-mcp` +| MCP server tool execution. Shows tool name, input parameters, result size, and execution time. Appears as a separate `service.name` in resource attributes. + +| Redpanda Connect +| `redpanda-connect` +| Internal Redpanda Connect component execution within MCP tools. Shows pipeline and individual component spans. +|=== + +=== How layers connect + +Layers connect through parent-child relationships in a single transcript: + +---- +ai-agent-http-server (HTTP Server layer) +└── invoke_agent customer-support-agent (AI SDK layer) + ├── chat gpt-5-nano (AI SDK layer, LLM call 1) + ├── execute_tool get_order_status (AI SDK layer) + │ └── HTTP POST (HTTP Client layer) + │ └── get_order_status (MCP Server layer, different service) + │ └── processors (Redpanda Connect layer) + └── chat gpt-5-nano (AI SDK layer, LLM call 2) +---- + +The request flow demonstrates: + +1. HTTP request arrives at agent +2. Agent invokes sub-agent +3. Agent makes first LLM call to decide what to do +4. Agent executes tool, making HTTP call to MCP server +5. MCP server processes tool through its pipeline +6. Agent makes second LLM call with tool results +7. Response returns through HTTP layer + +=== Cross-service transcripts + +When agents call MCP tools, the transcript spans multiple services. Each service has a different `service.name` in the resource attributes: + +* Agent spans: `"service.name": "ai-agent"` +* MCP server spans: `"service.name": "mcp-{server-id}"` + +Both use the same `traceId`, allowing you to follow a request across service boundaries. 
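For example, once you have decoded span records from both services into dictionaries, grouping by `traceId` stitches the cross-service view back together. This sketch uses hypothetical records with the trace ID from the example later on this page:

[,python]
----
from collections import defaultdict

# Hypothetical decoded span records from two services in one trace.
spans = [
    {"traceId": "71cad555b35602fbb35f035d6114db54",
     "service.name": "ai-agent", "name": "execute_tool get_order_status"},
    {"traceId": "71cad555b35602fbb35f035d6114db54",
     "service.name": "mcp-d5mnvn251oos73", "name": "get_order_status"},
]

def services_by_trace(spans):
    """Map each traceId to the sorted set of services it touched."""
    grouped = defaultdict(set)
    for span in spans:
        grouped[span["traceId"]].add(span["service.name"])
    return {trace_id: sorted(names) for trace_id, names in grouped.items()}

print(services_by_trace(spans))
----

A trace that lists both an `ai-agent` and an `mcp-{server-id}` service crossed a service boundary during execution.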
+ +=== Key attributes by layer + +Different layers expose different attributes: + +HTTP Server/Client layer (following https://opentelemetry.io/docs/specs/semconv/http/http-spans/[OpenTelemetry semantic conventions for HTTP^]): + +- `http.request.method`, `http.response.status_code` +- `server.address`, `url.path`, `url.full` +- `network.peer.address`, `network.peer.port` +- `http.request.body.size`, `http.response.body.size` + +AI SDK layer (following https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/[OpenTelemetry semantic conventions for generative AI^]): + +- `gen_ai.operation.name`: Operation type (`invoke_agent`, `chat`, `execute_tool`) +- `gen_ai.conversation.id`: Links spans to the same conversation session. A conversation may include multiple agent invocations (one per user request). Each invocation creates a separate trace that shares the same conversation ID. +- `gen_ai.agent.name`: Sub-agent name for multi-agent systems +- `gen_ai.provider.name`, `gen_ai.request.model`: LLM provider and model +- `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`: Token consumption +- `gen_ai.tool.name`, `gen_ai.tool.call.arguments`: Tool execution details +- `gen_ai.input.messages`, `gen_ai.output.messages`: Full LLM conversation context + +MCP Server layer: + +- Tool-specific attributes like `order_id`, `customer_id` +- `result_prefix`, `result_length`: Tool result metadata + +Redpanda Connect layer: + +- Component-specific attributes from your tool configuration + +The `scope.name` field identifies which instrumentation layer created each span. + +== Understand the transcript structure + +Each span captures a unit of work. 
Here's what a typical MCP tool invocation looks like: + +[,json] +---- +{ + "traceId": "71cad555b35602fbb35f035d6114db54", + "spanId": "43ad6bc31a826afd", + "name": "http_processor", + "attributes": [ + {"key": "city_name", "value": {"stringValue": "london"}}, + {"key": "result_length", "value": {"intValue": "198"}} + ], + "startTimeUnixNano": "1765198415253280028", + "endTimeUnixNano": "1765198424660663434", + "instrumentationScope": {"name": "rpcn-mcp"}, + "status": {"code": 0, "message": ""} +} +---- + +* `traceId` links all spans in the same request across services +* `spanId` uniquely identifies this span +* `name` identifies the operation or tool +* `instrumentationScope.name` identifies which layer created the span (`rpcn-mcp` for MCP tools, `redpanda-connect` for internal processing) +* `attributes` contain operation-specific metadata +* `status.code` indicates success (0) or error (2) + +=== Parent-child relationships + +Transcripts show how operations relate. A tool invocation (parent) may trigger internal operations (children): + +[,json] +---- +{ + "traceId": "71cad555b35602fbb35f035d6114db54", + "spanId": "ed45544a7d7b08d4", + "parentSpanId": "43ad6bc31a826afd", + "name": "http", + "instrumentationScope": {"name": "redpanda-connect"}, + "status": {"code": 0, "message": ""} +} +---- + +The `parentSpanId` links this child span to the parent tool invocation. Both share the same `traceId` so you can reconstruct the complete operation. + +== Error events in transcripts + +When something goes wrong, transcripts capture error details: + +[,json] +---- +{ + "traceId": "71cad555b35602fbb35f035d6114db54", + "spanId": "ba332199f3af6d7f", + "parentSpanId": "43ad6bc31a826afd", + "name": "http_request", + "events": [ + { + "name": "event", + "timeUnixNano": "1765198420254169629", + "attributes": [{"key": "error", "value": {"stringValue": "type"}}] + } + ], + "status": {"code": 0, "message": ""} +} +---- + +The `events` array captures what happened and when. 
Use `timeUnixNano` to see exactly when the error occurred within the operation. + +[[opentelemetry-traces-topic]] +== How Redpanda stores trace data + +The `redpanda.otel_traces` topic stores OpenTelemetry spans using Redpanda's glossterm:Schema Registry[] wire format, with a custom Protobuf schema named `redpanda.otel_traces-value` that follows the https://opentelemetry.io/docs/specs/otel/protocol/[OpenTelemetry Protocol (OTLP)^] specification. Spans include attributes following OpenTelemetry https://opentelemetry.io/docs/specs/semconv/gen-ai/[semantic conventions for generative AI^], such as `gen_ai.operation.name` and `gen_ai.conversation.id`. The schema is automatically registered in the Schema Registry with the topic, so Kafka clients can consume and deserialize trace data correctly. + +Redpanda manages both the `redpanda.otel_traces` topic and its schema automatically. If you delete either the topic or the schema, they are recreated automatically. However, deleting the topic permanently deletes all trace data, and the topic comes back empty. Do not produce your own data to this topic. It is reserved for OpenTelemetry traces. + +=== Topic configuration and lifecycle + +The `redpanda.otel_traces` topic has a predefined retention policy. Configuration changes to this topic are not supported. If you modify settings, Redpanda reverts them to the default values. + +The topic persists in your cluster even after all agents and MCP servers are deleted, allowing you to retain historical trace data for analysis. + +Transcripts may contain sensitive information from your tool inputs and outputs. Consider implementing appropriate glossterm:ACL[access control lists (ACLs)] for the `redpanda.otel_traces` topic, and review the data in transcripts before sharing or exporting to external systems. + +== Transcripts compared to audit logs + +Transcripts and audit logs serve different but complementary purposes. 
+ +Transcripts provide: + +* A complete, immutable record of every execution step, stored on Redpanda's distributed log with no gaps +* Hierarchical view of request flow through your system (parent-child span relationships) +* Detailed timing information for performance analysis +* Ability to reconstruct execution paths and identify bottlenecks + +Transcripts are optimized for execution-level observability and governance. For user-level accountability tracking ("who initiated what"), use the session and task topics for agents, which provide records of agent conversations and task execution. + +== Next steps + +* xref:ai-agents:observability/transcripts.adoc[] +* xref:ai-agents:agents/monitor-agents.adoc[] +* xref:ai-agents:mcp/remote/monitor-mcp-servers.adoc[] \ No newline at end of file diff --git a/modules/ai-agents/pages/observability/index.adoc b/modules/ai-agents/pages/observability/index.adoc new file mode 100644 index 000000000..d54b6e359 --- /dev/null +++ b/modules/ai-agents/pages/observability/index.adoc @@ -0,0 +1,5 @@ += Transcripts +:page-layout: index +:description: Govern agentic AI with complete execution transcripts built on Redpanda's immutable distributed log. + +{description} diff --git a/modules/ai-agents/pages/observability/ingest-custom-traces.adoc b/modules/ai-agents/pages/observability/ingest-custom-traces.adoc new file mode 100644 index 000000000..c9eeef879 --- /dev/null +++ b/modules/ai-agents/pages/observability/ingest-custom-traces.adoc @@ -0,0 +1,618 @@ += Ingest OpenTelemetry Traces from Custom Agents +:description: Configure a Redpanda Connect pipeline to ingest OpenTelemetry traces from custom agents into Redpanda's immutable log for unified governance and observability. 
:page-topic-type: how-to
:learning-objective-1: Configure a Redpanda Connect pipeline to receive OpenTelemetry traces from custom agents via HTTP and publish them to `redpanda.otel_traces`
:learning-objective-2: Validate trace data format and compatibility with existing MCP server traces
:learning-objective-3: Secure the ingestion endpoint using authentication mechanisms

When you build custom agents or instrument applications outside of Remote MCP servers and declarative agents, you can send OpenTelemetry (OTEL) traces to Redpanda for centralized observability. Deploy a Redpanda Connect pipeline as an HTTP ingestion endpoint to collect and publish traces to the `redpanda.otel_traces` topic.

After reading this page, you will be able to:

* [ ] {learning-objective-1}
* [ ] {learning-objective-2}
* [ ] {learning-objective-3}

== Prerequisites

* A BYOC cluster
* Ability to manage secrets in Redpanda Cloud
* The latest version of xref:manage:rpk/rpk-install.adoc[`rpk`] installed
* A custom agent or application instrumented with the OpenTelemetry SDK
* Basic understanding of the https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/[OpenTelemetry span format^] and https://opentelemetry.io/docs/specs/otlp/[OpenTelemetry Protocol (OTLP)^]

== Quickstart for LangChain users

If you're using LangChain with OpenTelemetry tracing, you can send traces to Redpanda's `redpanda.otel_traces` glossterm:topic[] to view them in the Transcripts view.

. Configure LangChain's OpenTelemetry integration by following the https://docs.langchain.com/langsmith/trace-with-opentelemetry[LangChain documentation^].

. Deploy a Redpanda Connect pipeline using the `otlp_http` input to receive OTLP traces over HTTP. Create the pipeline in the *Connect* page of your cluster, or see the <<configure-the-ingestion-pipeline,Configure the ingestion pipeline>> section below for a sample configuration.

.
Configure your OTEL exporter to send traces to your Redpanda Connect pipeline using environment variables:
+
[,bash]
----
# Configure LangChain OTEL integration
export LANGSMITH_OTEL_ENABLED=true
export LANGSMITH_TRACING=true

# Send traces to Redpanda Connect pipeline (use your pipeline URL)
export OTEL_EXPORTER_OTLP_ENDPOINT="https://<pipeline-id>.pipelines.<cluster-id>.clusters.rdpa.co"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"
----

By default, traces are sent to both LangSmith and your Redpanda Connect pipeline. If you want to send traces only to Redpanda (not LangSmith), set:

[,bash]
----
export LANGSMITH_OTEL_ONLY="true"
----

Your LangChain application will send traces to the `redpanda.otel_traces` topic, making them visible in the Transcripts view in your cluster alongside Remote MCP server and declarative agent traces.

For non-LangChain applications or custom instrumentation, continue with the sections below.

== About custom trace ingestion

Custom agents are applications with OpenTelemetry instrumentation that operate independently of Redpanda's Remote MCP servers or declarative agents (such as LangChain, CrewAI, or manually instrumented applications).

When these agents send traces to `redpanda.otel_traces`, you gain unified observability alongside Remote MCP server and declarative agent traces. See xref:ai-agents:observability/concepts.adoc#cross-service-transcripts[Cross-service transcripts] for details on how traces correlate across services.

=== Trace format requirements

Custom agents must emit traces in OTLP format. The xref:develop:connect/components/inputs/otlp_http.adoc[`otlp_http`] input accepts both OTLP Protobuf (`application/x-protobuf`) and JSON (`application/json`) payloads. For gRPC transport, use the xref:develop:connect/components/inputs/otlp_grpc.adoc[`otlp_grpc`] input.
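For reference, a minimal OTLP/JSON request body wraps spans in the `resourceSpans` and `scopeSpans` envelope defined by the OTLP specification. The field values here are illustrative, reusing the example identifiers from the concepts page:

[,json]
----
{
  "resourceSpans": [
    {
      "resource": {
        "attributes": [
          {"key": "service.name", "value": {"stringValue": "my-custom-agent"}}
        ]
      },
      "scopeSpans": [
        {
          "scope": {"name": "my-instrumentation"},
          "spans": [
            {
              "traceId": "71cad555b35602fbb35f035d6114db54",
              "spanId": "43ad6bc31a826afd",
              "name": "invoke_agent my-assistant",
              "startTimeUnixNano": "1765198415253280028",
              "endTimeUnixNano": "1765198424660663434",
              "status": {"code": 0}
            }
          ]
        }
      ]
    }
  ]
}
----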
+ +Each trace must follow the OTLP specification with these required fields: + +[cols="1,3", options="header"] +|=== +| Field | Description + +| `traceId` +| Hex-encoded unique identifier for the entire trace + +| `spanId` +| Hex-encoded unique identifier for this span + +| `name` +| Descriptive operation name + +| `startTimeUnixNano` and `endTimeUnixNano` +| Timing information in nanoseconds + +| `instrumentationScope` +| Identifies the library that created the span + +| `status` +| Operation status with code (0 = OK, 2 = ERROR) +|=== + +Optional but recommended fields: + +- `parentSpanId` for hierarchical traces +- `attributes` for contextual information + +For complete trace structure details, see xref:ai-agents:observability/concepts.adoc#understand-the-transcript-structure[Understand the transcript structure]. + +== Configure the ingestion pipeline + +Create a Redpanda Connect pipeline that receives OTLP traces and publishes them to the `redpanda.otel_traces` topic. Choose HTTP or gRPC transport based on your agent's requirements. + +=== Create the pipeline configuration + +Create a pipeline configuration file that defines the OTLP ingestion endpoint. + +[tabs] +==== +HTTP:: ++ +-- +The `otlp_http` input component: + +* Exposes an OpenTelemetry Collector HTTP receiver +* Accepts traces at the standard `/v1/traces` endpoint +* Converts incoming OTLP data into individual Redpanda OTEL v1 Protobuf messages + +The following example shows a minimal pipeline configuration. Redpanda Cloud automatically injects authentication handling, so you don't need to configure `auth_token` in the input. 
[,yaml]
----
input:
  otlp_http: {}

output:
  redpanda:
    seed_brokers:
      - "${PRIVATE_REDPANDA_BROKERS}"
    tls:
      enabled: ${PRIVATE_REDPANDA_TLS_ENABLED}
    sasl:
      - mechanism: "REDPANDA_CLOUD_SERVICE_ACCOUNT"
    topic: "redpanda.otel_traces"
----
--

gRPC::
+
--
The `otlp_grpc` input component:

* Exposes an OpenTelemetry Collector gRPC receiver
* Accepts traces via the OTLP gRPC protocol
* Converts incoming OTLP data into individual Redpanda OTEL v1 Protobuf messages

The following example shows a minimal pipeline configuration. Redpanda Cloud automatically injects authentication handling.

[,yaml]
----
input:
  otlp_grpc: {}

output:
  redpanda:
    seed_brokers:
      - "${PRIVATE_REDPANDA_BROKERS}"
    tls:
      enabled: ${PRIVATE_REDPANDA_TLS_ENABLED}
    sasl:
      - mechanism: "REDPANDA_CLOUD_SERVICE_ACCOUNT"
    topic: "redpanda.otel_traces"
----

NOTE: Clients must include the authentication token in gRPC metadata as `authorization: Bearer <token>`.
--
====

The OTLP input automatically handles format conversion, so no processors are needed for basic trace ingestion. Each span becomes a separate message in the `redpanda.otel_traces` topic.

=== Deploy the pipeline in Redpanda Cloud

. In the *Connect* page of your Redpanda Cloud cluster, click *Create Pipeline*.
. For the input, select the *otlp_http* (or *otlp_grpc*) component.
. Skip to *Add a topic* and select `redpanda.otel_traces` from the list of existing topics. Leave the default advanced settings.
. In the *Add permissions* step, create a service account with write access to the `redpanda.otel_traces` topic.
. In the *Create pipeline* step, enter a name for your pipeline and paste the configuration. Redpanda Cloud automatically handles authentication for incoming requests.

== Send traces from your custom agent

Configure your custom agent to send OpenTelemetry traces to the pipeline endpoint.
After deploying the pipeline, you can find its URL in the Redpanda Cloud UI on the pipeline details page.

[cols="1,3", options="header"]
|===
| Transport | URL Format

| HTTP
| `+https://<pipeline-id>.pipelines.<cluster-id>.clusters.rdpa.co/v1/traces+`

| gRPC
| `+<pipeline-id>.pipelines.<cluster-id>.clusters.rdpa.co:443+`
|===

=== Authenticate to the pipeline

The OTLP pipeline uses the same authentication mechanism as the Redpanda Cloud API. Obtain an access token using your service account credentials as described in xref:redpanda-cloud:security:cloud-authentication.adoc#authenticate-to-the-cloud-api[Authenticate to the Cloud API].

Include the token in your requests:

* HTTP: Set the `Authorization` header to `Bearer <token>`
* gRPC: Set the `authorization` metadata field to `Bearer <token>`

=== Configure your OTEL exporter

Install the OpenTelemetry SDK for your language and configure the OTLP exporter to target your Redpanda Connect pipeline endpoint.

The exporter configuration requires:

* Endpoint: Your pipeline's URL (the SDK adds `/v1/traces` automatically for HTTP)
* Headers: Authorization header with your bearer token
* Protocol: HTTP to match the `otlp_http` input (or gRPC for `otlp_grpc`)

[tabs]
======
HTTP::
+
--
.View Python example
[%collapsible]
====
[,python]
----
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource

# Configure resource attributes to identify your agent
resource = Resource(attributes={
    "service.name": "my-custom-agent",
    "service.version": "1.0.0"
})

# Configure the OTLP HTTP exporter
exporter = OTLPSpanExporter(
    endpoint="https://<pipeline-id>.pipelines.<cluster-id>.clusters.rdpa.co/v1/traces",
    headers={"Authorization": "Bearer YOUR_TOKEN"}
)

# Set up tracing with batch processing
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(exporter)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Use the tracer with GenAI semantic conventions
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span(
    "invoke_agent my-assistant",
    kind=trace.SpanKind.INTERNAL
) as span:
    # Set GenAI semantic convention attributes
    span.set_attribute("gen_ai.operation.name", "invoke_agent")
    span.set_attribute("gen_ai.agent.name", "my-assistant")
    span.set_attribute("gen_ai.provider.name", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4")

    # Your agent logic here
    result = process_request()

    # Set token usage if available
    span.set_attribute("gen_ai.usage.input_tokens", 150)
    span.set_attribute("gen_ai.usage.output_tokens", 75)
----
====

.View Node.js example
[%collapsible]
====
[,javascript]
----
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { Resource } = require('@opentelemetry/resources');
const { trace, SpanKind } = require('@opentelemetry/api');

// Configure resource
const resource = new Resource({
  'service.name': 'my-custom-agent',
  'service.version': '1.0.0'
});

// Configure OTLP HTTP exporter
const exporter = new OTLPTraceExporter({
  url: 'https://<pipeline-id>.pipelines.<cluster-id>.clusters.rdpa.co/v1/traces',
  headers: {
    'Authorization': 'Bearer YOUR_TOKEN'
  }
});

// Set up provider
const provider = new NodeTracerProvider({ resource });
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Use the tracer with GenAI semantic conventions
const tracer = trace.getTracer('my-agent');
const span = tracer.startSpan('invoke_agent my-assistant', {
  kind: SpanKind.INTERNAL
});

// Set GenAI semantic convention attributes
span.setAttribute('gen_ai.operation.name',
'invoke_agent');
span.setAttribute('gen_ai.agent.name', 'my-assistant');
span.setAttribute('gen_ai.provider.name', 'openai');
span.setAttribute('gen_ai.request.model', 'gpt-4');

// Your agent logic
processRequest().then(result => {
  // Set token usage if available
  span.setAttribute('gen_ai.usage.input_tokens', 150);
  span.setAttribute('gen_ai.usage.output_tokens', 75);
  span.end();
});
----
====

.View Go example
[%collapsible]
====
[,go]
----
package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
    "go.opentelemetry.io/otel/trace"
)

func main() {
    ctx := context.Background()

    // Configure OTLP HTTP exporter
    exporter, err := otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint("<pipeline-id>.pipelines.<cluster-id>.clusters.rdpa.co"),
        otlptracehttp.WithHeaders(map[string]string{
            "Authorization": "Bearer YOUR_TOKEN",
        }),
    )
    if err != nil {
        log.Fatalf("Failed to create exporter: %v", err)
    }

    // Configure resource
    res, _ := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName("my-custom-agent"),
            semconv.ServiceVersion("1.0.0"),
        ),
    )

    // Set up tracer provider
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
    )
    defer tp.Shutdown(ctx)
    otel.SetTracerProvider(tp)

    tracer := tp.Tracer("my-agent")

    // Create span with GenAI semantic conventions
    _, span := tracer.Start(ctx, "invoke_agent my-assistant",
        trace.WithSpanKind(trace.SpanKindInternal),
    )
    span.SetAttributes(
        attribute.String("gen_ai.operation.name", "invoke_agent"),
        attribute.String("gen_ai.agent.name", "my-assistant"),
        attribute.String("gen_ai.provider.name", "openai"),
        attribute.String("gen_ai.request.model", "gpt-4"),
        attribute.Int("gen_ai.usage.input_tokens", 150),
        attribute.Int("gen_ai.usage.output_tokens", 75),
    )
    span.End()

    tp.ForceFlush(ctx)
}
----
====
--

gRPC::
+
--
.View Python example
[%collapsible]
====
[,python]
----
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource

resource = Resource(attributes={
    "service.name": "my-custom-agent",
    "service.version": "1.0.0"
})

# gRPC endpoint without https:// prefix
exporter = OTLPSpanExporter(
    endpoint="<pipeline-id>.pipelines.<cluster-id>.clusters.rdpa.co:443",
    headers={"authorization": "Bearer YOUR_TOKEN"}
)

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
----
====

.View Node.js example
[%collapsible]
====
[,javascript]
----
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { Resource } = require('@opentelemetry/resources');

const resource = new Resource({
  'service.name': 'my-custom-agent',
  'service.version': '1.0.0'
});

// gRPC exporter with TLS
const exporter = new OTLPTraceExporter({
  url: 'https://<pipeline-id>.pipelines.<cluster-id>.clusters.rdpa.co:443',
  headers: {
    'authorization': 'Bearer YOUR_TOKEN'
  }
});

const provider = new NodeTracerProvider({ resource });
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
----
====

.View Go example
[%collapsible]
====
[,go]
----
package main

import (
    "context"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"
)

func
createGRPCExporter(ctx context.Context) (*otlptracegrpc.Exporter, error) { + return otlptracegrpc.New(ctx, + otlptracegrpc.WithEndpoint(".pipelines..clusters.rdpa.co:443"), + otlptracegrpc.WithDialOption(grpc.WithTransportCredentials(credentials.NewTLS(nil))), + otlptracegrpc.WithHeaders(map[string]string{ + "authorization": "Bearer YOUR_TOKEN", + }), + ) +} +---- +==== +-- +====== + +TIP: Use environment variables for the endpoint URL and authentication token to keep credentials out of your code. + +=== Use recommended semantic conventions + +The Transcripts view recognizes https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/[OpenTelemetry semantic conventions for GenAI operations^]. Following these conventions ensures your traces display correctly with proper attribution, token usage, and operation identification. + +==== Required attributes for agent operations + +Following the OpenTelemetry semantic conventions, agent spans should include these attributes: + +* Operation identification: +** `gen_ai.operation.name` - Set to `"invoke_agent"` for agent execution spans +** `gen_ai.agent.name` - Human-readable name of your agent (displayed in Transcripts view) +* LLM provider details: +** `gen_ai.provider.name` - LLM provider identifier (e.g., `"openai"`, `"anthropic"`, `"gcp.vertex_ai"`) +** `gen_ai.request.model` - Model name (e.g., `"gpt-4"`, `"claude-sonnet-4"`) +* Token usage (for cost tracking): +** `gen_ai.usage.input_tokens` - Number of input tokens consumed +** `gen_ai.usage.output_tokens` - Number of output tokens generated +* Session correlation: +** `gen_ai.conversation.id` - Identifier linking related agent invocations in the same conversation + +==== Required attributes for proper display + +Set these attributes on your spans for proper display and filtering in the Transcripts view: + +[cols="2,3", options="header"] +|=== +| Attribute | Purpose + +| `gen_ai.operation.name` +| Set to `"invoke_agent"` for agent execution spans + +| 
`gen_ai.agent.name`
+| Human-readable name displayed in the Transcripts view
+
+| `gen_ai.provider.name`
+| LLM provider (for example, `"openai"`, `"anthropic"`)
+
+| `gen_ai.request.model`
+| Model name (for example, `"gpt-4"`, `"claude-sonnet-4"`)
+
+| `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens`
+| Token counts for cost tracking
+
+| `gen_ai.conversation.id`
+| Links related agent invocations in the same conversation
+|===
+
+See the code examples earlier in this page for how to set these attributes in Python, Node.js, or Go.
+
+=== Validate trace format
+
+Before deploying to production, verify your traces match the expected format:
+
+. Run your agent locally and enable debug logging in your OpenTelemetry SDK to inspect outgoing spans.
+. Verify required fields are present:
+ * `traceId`, `spanId`, `name`
+ * `startTimeUnixNano`, `endTimeUnixNano`
+ * `instrumentationScope` with a `name` field
+ * `status` with a `code` field (`0` for unset, `1` for OK, `2` for error)
+. Check that `service.name` is set in the resource attributes to identify your agent in the Transcripts view.
+. Verify GenAI semantic convention attributes if you want proper display in the Transcripts view:
+ * `gen_ai.operation.name` set to `"invoke_agent"` for agent spans
+ * `gen_ai.agent.name` for agent identification
+ * Token usage attributes if tracking costs
+
+== Verify trace ingestion
+
+After deploying your pipeline and configuring your custom agent, verify traces are flowing correctly.
+
+=== Consume traces from the topic
+
+Check that traces are being published to the `redpanda.otel_traces` topic:
+
+[,bash]
+----
+rpk topic consume redpanda.otel_traces --offset end -n 10
+----
+
+You can also view the `redpanda.otel_traces` topic in the *Topics* page of the Redpanda Cloud UI.
+
+Look for spans with your custom `instrumentationScope.name` to identify traces from your agent.
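To spot format problems quickly, you can also lint exported spans against the validation checklist earlier on this page before sending them. The following Python sketch checks one OTLP/JSON span; `validate_span` is a hypothetical local helper (not part of Redpanda tooling or the OpenTelemetry SDK), and it assumes the camelCase field names of the OTLP protobuf-to-JSON mapping.

```python
# Minimal linter for one OTLP/JSON span, following the validation checklist
# on this page. `validate_span` is a hypothetical local helper; field names
# assume the protobuf-to-JSON mapping used by OTLP (camelCase keys).

REQUIRED_FIELDS = ("traceId", "spanId", "name", "startTimeUnixNano", "endTimeUnixNano")

def validate_span(span: dict, resource_attrs: dict, scope: dict) -> list:
    """Return a list of problems found in one exported span (empty if OK)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in span:
            problems.append(f"missing required field: {field}")
    if not scope.get("name"):
        problems.append("instrumentationScope.name is not set")
    if span.get("status", {}).get("code") not in (0, 1, 2):
        problems.append("status.code must be 0 (unset), 1 (OK), or 2 (error)")
    if "service.name" not in resource_attrs:
        problems.append("resource is missing service.name")
    # GenAI semantic conventions are optional, but enable proper display:
    attr_keys = {attr.get("key") for attr in span.get("attributes", [])}
    if "gen_ai.operation.name" not in attr_keys:
        problems.append("consider setting gen_ai.operation.name on agent spans")
    return problems
```

Run a check like this against spans captured from your SDK's debug or console exporter before pointing the exporter at the ingestion endpoint.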
+
+=== View traces in Transcripts
+
+After your custom agent sends traces through the pipeline, they appear in your cluster's *Agentic AI > Transcripts* view alongside traces from Remote MCP servers, declarative agents, and AI Gateway.
+
+==== Identify custom agent transcripts
+
+Custom agent transcripts are identified by the `service.name` resource attribute, which differs from Redpanda's built-in services (`ai-agent` for declarative agents, `mcp-{server-id}` for MCP servers). See xref:ai-agents:observability/concepts.adoc#cross-service-transcripts[Cross-service transcripts] to understand how the `service.name` attribute identifies transcript sources.
+
+Custom agent transcripts appear with:
+
+* **Service name** in the service filter dropdown (from your `service.name` resource attribute)
+* **Agent name** in span details (from the `gen_ai.agent.name` attribute)
+* **Operation names** like `"invoke_agent my-assistant"` indicating agent executions
+
+For detailed instructions on filtering, searching, and navigating transcripts in the UI, see xref:ai-agents:observability/transcripts.adoc[View Transcripts].
+
+==== Token usage tracking
+
+If your spans include the recommended token usage attributes (`gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`), they appear in the summary panel's token usage section. This enables cost tracking alongside Remote MCP server and declarative agent transcripts.
+
+== Troubleshooting
+
+If traces from your custom agent do not appear in the Transcripts view, use these diagnostic steps to identify and resolve common ingestion issues.
+
+=== Pipeline not receiving requests
+
+If your custom agent cannot reach the ingestion endpoint:
+
+. Verify the endpoint URL format:
+ * HTTP: `\https://.pipelines..clusters.rdpa.co/v1/traces`
+ * gRPC: `.pipelines..clusters.rdpa.co:443` (no `https://` prefix for gRPC clients)
+. Check network connectivity and firewall rules.
+. 
Ensure authentication tokens are valid and properly formatted in the `Authorization: Bearer ` header (HTTP) or the `authorization` metadata field (gRPC).
+. Verify that the `Content-Type` header matches your data format (`application/x-protobuf` or `application/json`).
+. Review pipeline logs for connection errors or authentication failures.
+
+=== Traces not appearing in topic
+
+If requests succeed but traces do not appear in `redpanda.otel_traces`:
+
+. Check the pipeline output configuration.
+. Verify topic permissions.
+. Validate that the trace format matches the OTLP specification.
+
+== Limitations
+
+* The `otlp_http` and `otlp_grpc` inputs accept only traces, logs, and metrics, not profiles.
+* Only traces are published to the `redpanda.otel_traces` topic.
+* Requests that exceed rate limits receive an HTTP 429 response (HTTP) or a `ResourceExhausted` status (gRPC).
+
+== Next steps
+
+* xref:ai-agents:observability/transcripts.adoc[]
+* xref:ai-agents:agents/monitor-agents.adoc[Observability for declarative agents]
+* xref:develop:connect/components/inputs/otlp_http.adoc[OTLP HTTP input reference] - Complete configuration options for the `otlp_http` component
+* xref:develop:connect/components/inputs/otlp_grpc.adoc[OTLP gRPC input reference] - Alternative gRPC-based trace ingestion
\ No newline at end of file
diff --git a/modules/ai-agents/pages/observability/transcripts.adoc b/modules/ai-agents/pages/observability/transcripts.adoc
new file mode 100644
index 000000000..94cf5bcb4
--- /dev/null
+++ b/modules/ai-agents/pages/observability/transcripts.adoc
@@ -0,0 +1,130 @@
+= View Transcripts
+:description: Filter and navigate the Transcripts interface to investigate end-to-end agent execution records stored on Redpanda's immutable log.
+:page-topic-type: how-to +:personas: agent_developer, platform_admin +:learning-objective-1: Filter transcripts to find specific execution traces +:learning-objective-2: Use the timeline interactively to navigate to specific time periods +:learning-objective-3: Navigate between detail views to inspect span information at different levels + +Use the Transcripts view to investigate end-to-end execution records for agents, MCP servers, and AI Gateway. Each transcript captures the complete lifecycle of an agentic behavior on Redpanda's immutable distributed log. Filter by operation type, inspect span details, and trace issues across your agentic systems. + +For conceptual background on spans and trace structure, see xref:ai-agents:observability/concepts.adoc[]. + +After reading this page, you will be able to: + +* [ ] {learning-objective-1} +* [ ] {learning-objective-2} +* [ ] {learning-objective-3} + +== Prerequisites + +* xref:ai-agents:agents/create-agent.adoc[Running agent] or xref:ai-agents:mcp/remote/quickstart.adoc[MCP server] with at least one execution +* Access to the Transcripts view (requires appropriate permissions to read the `redpanda.otel_traces` glossterm:topic[]) + +== Navigate the Transcripts interface + +=== Filter transcripts + +Use filters to narrow down transcripts and quickly locate specific executions. When you use any of the filters, the transcript table updates to show only matching results. 
+
+The Transcripts view provides several quick-filter buttons:
+
+* *Service*: Isolate operations from a particular component in your agentic data plane (agents, MCP servers, or AI Gateway)
+* *LLM Calls*: Inspect large language model (LLM) invocations, including chat completions and embeddings
+* *Tool Calls*: View tool executions by agents
+* *Agent Spans*: Inspect agent invocation and reasoning
+* *Errors Only*: Filter for failed operations or errors
+* *Slow (>5s)*: Isolate operations that took longer than five seconds, useful for performance investigation
+
+You can combine multiple filters to narrow results further. For example, use *Tool Calls* and *Errors Only* together to investigate failed tool executions.
+
+Toggle *Full traces* on to see the complete execution context, in grayed-out text, for the filtered transcripts in the table.
+
+==== Filter by attribute
+
+Click the *Attribute* button to query exact matches on specific span metadata, such as:
+
+* Agent names
+* LLM model names, for example, `gemini-3-flash-preview`
+* Tool names
+* Span and trace IDs
+
+You can add multiple attribute filters to refine results.
+
+=== Use the interactive timeline
+
+Use the timeline visualization to spot when errors began or patterns changed, and to jump directly to transcripts from a specific time window when investigating issues that occurred at known times.
+
+Click any bar in the timeline to zoom into transcripts from that time period. The transcript table automatically scrolls to show operations from the time bucket in view.
+
+[NOTE]
+====
+When viewing time ranges with many transcripts (hundreds or thousands), the table displays a subset of the data to maintain performance and usability. The timeline bar indicates the actual time range of data currently loaded into view, which may be narrower than your selected time range.
+
+Refer to the timeline header to check the exact range and count of visible transcripts, for example, "Showing 100 of 299 transcripts from 13:17 to 15:16".
+====
+
+== Inspect span details
+
+The transcript table shows:
+
+* **Time**: When the glossterm:span[] started (sortable)
+* **Span**: Span type and name with hierarchical tree structure
+* **Duration**: Total time or relative duration shown as visual bars
+
+To view nested operations, expand any parent span. To learn more about span hierarchies and cross-service traces, see xref:ai-agents:observability/concepts.adoc[].
+
+Click any span to view details in the panel:
+
+* **Summary tab**: High-level overview with token usage, operation counts, and conversation history.
+* **Attributes tab**: Structured metadata for debugging (see xref:ai-agents:observability/concepts.adoc#key-attributes-by-layer[standard attributes by layer]).
+* **Raw data tab**: Complete glossterm:OpenTelemetry[] span in JSON format. You can also view raw transcript data in the `redpanda.otel_traces` topic.
+
+[NOTE]
+====
+Rows labeled "awaiting root — waiting for parent span" indicate incomplete glossterm:trace[,traces]. This occurs when child spans arrive before parent spans due to network latency or service failures. Consistent "awaiting root" entries suggest instrumentation issues.
+====
+
+== Common investigation tasks
+
+The following patterns demonstrate how to use the Transcripts view for understanding and troubleshooting your agentic systems.
+
+=== Debug errors
+
+. Use *Errors Only* to filter for failed operations, or use the timeline to zoom in on the period when errors began.
+. Expand error spans to examine the failure context.
+. Check preceding tool call arguments and LLM responses for the root cause.
+
+=== Investigate performance issues
+
+. Use the *Slow (>5s)* filter to identify operations with high latency.
+. Expand slow spans to identify bottlenecks in the execution tree.
+. 
Compare duration bars across similar operations to spot anomalies.
+
+=== Analyze tool usage
+
+. Apply the *Tool Calls* filter and optionally use the *Attribute* filter to focus on a specific tool.
+. Review tool execution frequency in the timeline.
+. Click individual tool call spans to inspect arguments and responses.
+.. Check the *Description* field to understand tool invocation context.
+.. Use the *Arguments* field to verify correct parameter passing.
+
+=== Monitor LLM interactions
+
+. Click *LLM Calls* to focus on model invocations and optionally filter by model name and provider using the *Attribute* filter.
+. Review token usage patterns across different time periods.
+. Examine conversation history to understand model behavior.
+. Spot unexpected model calls or token consumption spikes.
+
+=== Trace multi-service operations
+
+. Locate the parent agent or gateway span in the transcript table.
+. Use the *Attribute* filter to follow the trace ID through agent and MCP server boundaries.
+. Expand the transcript tree to reveal child spans across services.
+. Review durations to understand where latency occurs in distributed calls.
+
+== Next steps
+
+* xref:ai-agents:agents/monitor-agents.adoc[]
+* xref:ai-agents:mcp/remote/monitor-mcp-servers.adoc[]
+* xref:ai-agents:agents/troubleshooting.adoc[]
\ No newline at end of file