Implement on_agent_error_callback and support AGENT_ERROR event in BigQuery Analytics Plugin #4863

@evekhm

Description

Currently, when an agent run throws an unhandled exception or the system crashes mid-invocation (e.g. timeout, SIGTERM), the BigQueryAgentAnalyticsPlugin leaves only dangling INVOCATION_STARTING and AGENT_STARTING events in the database. No AGENT_ERROR or INVOCATION_ERROR event is emitted or supported.

Because the COMPLETED event carries the status and latency of a call, losing it means a crashed call appears implicitly successful, or as a dangling thread, in basic analytics. Crashed calls also never emit latency metadata, so they are excluded from average latency calculations, which skews executive dashboards to look faster and more reliable than the agent system actually is.

This feature matters for building comprehensive observability pipelines. Without native execution error tracking, we are forced to reconstruct crash statuses with time-boundary SQL logic. That workaround lets severe failures masquerade as non-events (they are never counted as errors) and breaks accurate system latency reporting.

Describe the Solution You'd Like

  1. Introduce on_agent_error_callback(agent, error) and on_run_error_callback(invocation_context, error) at the framework lifecycle level to catch and broadcast agent-level and end-to-end invocation crashes, similar to how on_tool_error_callback and on_model_error_callback currently operate.
  2. Update the BigQueryAgentAnalyticsPlugin (at https://github.com/google/adk-python/tree/main/src/google/adk/plugins/bigquery_agent_analytics_plugin.py) to properly ingest and log AGENT_ERROR and INVOCATION_ERROR events mapped from these new callbacks.

Describe Alternatives You've Considered

Currently, developers must implement custom SQL logic on top of the BigQuery tables to manually flag dangling events. For example, they join STARTING events against COMPLETED events and check whether the time difference exceeds a hardcoded threshold (e.g., > 10 minutes) before classifying the call as a generic timeout error.
This is an imprecise workaround: it loses the original Python exception stack trace entirely, requires arbitrary time constraints, and doesn't address the fact that the baseline ADK framework swallowed a fatal error.
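
To make the limitation concrete, here is a minimal Python sketch of the same time-boundary heuristic applied to in-memory rows instead of SQL. The row fields (`invocation_id`, `event_type`, `timestamp`), the `INVOCATION_COMPLETED` event name, and the `flag_dangling` helper are assumptions for illustration and are not taken from the plugin's actual schema.

```python
from datetime import datetime, timedelta

# Hypothetical event rows; field names are illustrative, not the plugin's schema.
T0 = datetime(2025, 1, 1, 12, 0, 0)
EVENTS = [
    {"invocation_id": "a", "event_type": "INVOCATION_STARTING", "timestamp": T0},
    {"invocation_id": "a", "event_type": "INVOCATION_COMPLETED",
     "timestamp": T0 + timedelta(seconds=5)},
    # "b" crashed an hour ago: only the STARTING event was ever written.
    {"invocation_id": "b", "event_type": "INVOCATION_STARTING", "timestamp": T0},
    # "c" crashed five minutes ago: also no COMPLETED event.
    {"invocation_id": "c", "event_type": "INVOCATION_STARTING",
     "timestamp": T0 + timedelta(minutes=55)},
]
NOW = T0 + timedelta(hours=1)


def flag_dangling(events, now, threshold=timedelta(minutes=10)):
    """Return ids that started, never completed, and are older than threshold."""
    started = {}
    completed = set()
    for event in events:
        if event["event_type"] == "INVOCATION_STARTING":
            started[event["invocation_id"]] = event["timestamp"]
        elif event["event_type"] == "INVOCATION_COMPLETED":
            completed.add(event["invocation_id"])
    return sorted(
        inv_id
        for inv_id, start in started.items()
        if inv_id not in completed and now - start > threshold
    )


print(flag_dangling(EVENTS, NOW))
```

With these sample rows only "b" is flagged. Invocation "c" has also crashed, but it stays invisible until the threshold elapses, and neither row records why the call failed.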

Proposed API / Implementation

Add on_agent_error_callback and on_run_error_callback interfaces to BasePlugin. Invoke on_run_error_callback inside the base exception handlers of the Runner.run_async()/InvocationContext flow, and invoke on_agent_error_callback for individual sub-agent LlmAgent.run_async() failures.
Then inside BigQueryAgentAnalyticsPlugin, add "AGENT_ERROR" and "INVOCATION_ERROR" to _EVENT_TYPES and map the incoming error trace directly to BigQuery's error_message column.
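
The proposal above could look roughly like the following self-contained sketch. It does not import ADK: `SketchBasePlugin`, `SketchAnalyticsPlugin`, and `run_with_error_hooks` are hypothetical stand-ins for BasePlugin, BigQueryAgentAnalyticsPlugin, and the Runner's exception handling, and the keyword-only async signatures are an assumption. Rows are collected in a list rather than written to BigQuery.

```python
import asyncio
import traceback
from typing import Any


class SketchBasePlugin:
    """Hypothetical stand-in for BasePlugin; the proposed hooks default to no-ops."""

    async def on_agent_error_callback(self, *, agent: Any, error: Exception) -> None:
        return None

    async def on_run_error_callback(
        self, *, invocation_context: Any, error: Exception
    ) -> None:
        return None


class SketchAnalyticsPlugin(SketchBasePlugin):
    """Hypothetical stand-in for the analytics plugin; stores rows in memory."""

    def __init__(self) -> None:
        self.rows: list[dict[str, str]] = []

    def _log_error(self, event_type: str, error: Exception) -> None:
        # Keep the full stack trace so it can populate an error_message column.
        trace = "".join(
            traceback.format_exception(type(error), error, error.__traceback__)
        )
        self.rows.append({"event_type": event_type, "error_message": trace})

    async def on_agent_error_callback(self, *, agent: Any, error: Exception) -> None:
        self._log_error("AGENT_ERROR", error)

    async def on_run_error_callback(
        self, *, invocation_context: Any, error: Exception
    ) -> None:
        self._log_error("INVOCATION_ERROR", error)


async def run_with_error_hooks(plugins, invocation_context, agent, run_agent):
    """Hypothetical runner wrapper: notify plugins at both levels, then re-raise."""
    try:
        try:
            return await run_agent(agent)
        except Exception as error:
            for plugin in plugins:
                await plugin.on_agent_error_callback(agent=agent, error=error)
            raise
    except Exception as error:
        for plugin in plugins:
            await plugin.on_run_error_callback(
                invocation_context=invocation_context, error=error
            )
        raise


async def _failing_agent(agent):
    # Simulated agent failure used only by the demo below.
    raise TimeoutError("model call timed out")


def demo() -> list[dict[str, str]]:
    plugin = SketchAnalyticsPlugin()
    try:
        asyncio.run(
            run_with_error_hooks(
                [plugin], {"invocation_id": "inv-1"}, "root_agent", _failing_agent
            )
        )
    except TimeoutError:
        pass  # The error still propagates to the caller after being logged.
    return plugin.rows
```

Calling `demo()` yields one AGENT_ERROR row followed by one INVOCATION_ERROR row, each carrying the formatted traceback. Note that `except Exception` in this sketch does not see task cancellation or a process killed by a signal, so those cases would still need separate handling.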

Metadata

Labels

bq [Component]: This issue is related to Big Query integration

needs review [Status]: The PR/issue is awaiting review from the maintainer