Add tool failure evaluator for IntermediateStep and ATIF formats #1816
ericevans-nv wants to merge 2 commits into NVIDIA:develop from
Conversation
…r and ATIF converter Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>
…eat/tool-error-observability
Walkthrough

This PR introduces comprehensive tool error handling and failure evaluation across the framework. It adds error extraction in the ATIF converter (batch and streaming paths), creates a new tool failure evaluator plugin to measure tool call success rates, integrates tool error callbacks into the LangChain handler, and provides extensive test coverage.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Agent
    participant Converter
    participant ErrorParser
    participant ATIF
    Agent->>Converter: Tool output with status="error"
    Converter->>ErrorParser: Extract error metadata
    ErrorParser->>ErrorParser: Parse error_type from<br/>"<identifier>: <message>"
    ErrorParser-->>Converter: {error, error_type,<br/>error_message, status}
    Converter->>ATIF: Attach error metadata<br/>to step.extra["tool_errors"]
```
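The error_type split shown in the diagram above can be sketched as follows. `parse_tool_error` is a hypothetical stand-in for the converter's `_extract_tool_error` helper, whose exact signature and edge-case handling may differ:

```python
def parse_tool_error(content: str) -> dict[str, str]:
    """Split '<identifier>: <message>' into error_type / error_message.

    Hypothetical sketch of the parsing step in the diagram; the real
    helper (_extract_tool_error) may handle more edge cases.
    """
    error_type, sep, error_message = content.partition(": ")
    if not sep or not error_type.isidentifier():
        # No "<identifier>: " prefix: keep the whole string as the message.
        error_type, error_message = "", content
    return {
        "error": content,
        "error_type": error_type,
        "error_message": error_message,
        "status": "error",
    }
```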
```mermaid
sequenceDiagram
    participant Input as Input<br/>(Legacy/ATIF)
    participant Evaluator
    participant ErrorDetector
    participant Aggregator
    participant Output as Output<br/>(ToolFailureReasoning)
    Input->>Evaluator: Trajectory with tool calls
    Evaluator->>ErrorDetector: For each tool call,<br/>detect failure via:<br/>1. step.extra["tool_errors"]<br/>2. Parsed observation dict<br/>3. Exception pattern heuristic
    ErrorDetector-->>Evaluator: Error status &<br/>error message
    Evaluator->>Aggregator: Aggregate per-tool<br/>calls and failures
    Aggregator->>Aggregator: Calculate success_rate =<br/>(total - failed) / total
    Aggregator-->>Output: ToolFailureReasoning<br/>with metrics & summaries
```
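The detection cascade and the success-rate formula in the second diagram can be sketched roughly as below. This is a simplified illustration, not the evaluator's actual API; strategy 1 (`step.extra["tool_errors"]`) needs the full step object and is omitted:

```python
import ast
import re


def detect_tool_error(observation: str) -> bool:
    """Sketch of detection strategies 2 and 3 from the diagram."""
    # Strategy 2: a serialized ToolMessage-style dict, e.g. "{'status': 'error', ...}"
    try:
        parsed = ast.literal_eval(observation)
        if isinstance(parsed, dict) and parsed.get("status") == "error":
            return True
    except (ValueError, SyntaxError):
        pass
    # Strategy 3: exception-pattern heuristic such as "XyzError: ..."
    return bool(re.match(r"^[A-Z]\w*(Error|Exception)\b\s*:", observation))


def success_rate(total: int, failed: int) -> float:
    # score = (total - failed) / total, unrounded (per the PR description);
    # the zero-call fallback of 1.0 is an assumption, not confirmed behavior.
    return 1.0 if total == 0 else (total - failed) / total
```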
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 7
🧹 Nitpick comments (1)
packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py (1)
20-20: Use a dict or stub in the test instead of importing `ToolMessage` to reduce test coupling.

`atif_converter.py` only accesses the `status` and `content` attributes, so a simple dict will exercise the same code path. Note that `langchain-core` is not explicitly declared in `packages/nvidia_nat_core/pyproject.toml` despite being imported throughout the source code (via transitive dependencies). The test should avoid hardwiring this dependency; use a minimal test object instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py` at line 20, Replace the direct import of ToolMessage in the test with a minimal test double (either a plain dict or a small stub object) that exposes the two attributes the production code uses: status and content; update the test in test_atif_converter.py to construct and pass this dict/stub into the atif_converter call (the same function/class under test in atif_converter.py) instead of a ToolMessage instance so the test no longer depends on langchain_core via ToolMessage.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/nvidia_nat_core/src/nat/utils/atif_converter.py`:
- Around line 104-106: Remove the literal "TODO" from the docstring for
_extract_tool_error; instead rephrase the note to a neutral statement such as
"Return a model instead of a plain dict once ATIF spec adds error support" or
move that implementation note to an external tracker, so the docstring no longer
contains TODO/FIXME placeholder text and continues to describe the current
behavior of _extract_tool_error.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`:
- Around line 101-103: The loop that currently skips agent steps with empty
step.tool_calls (in the evaluator iterating "for step in steps") must be changed
to also inspect step.extra["tool_errors"] and treat each orphaned tool_error as
a failed tool call; specifically, where the code checks "if not step.tool_calls
or step.source != 'agent': continue", remove or adjust the early continue to
allow processing when step.extra.get('tool_errors') exists, count each entry in
step.extra["tool_errors"] toward the failure total, and add the same
regression/recording logic you use for normal tool_calls (e.g., increment
failure counters and mark the step as regressed) so unmatched tool_errors are
not ignored.
- Around line 105-123: The current logic pairs observations and tool_errors by
list position and tool name (step.tool_calls / observations index and
tool_error_entry["tool"]), which misattributes failures when the same tool is
called multiple times or results arrive out-of-order; change to correlate at the
call level by matching atif_tool_call.tool_call_id (or similar call id on the
tool call object) against ObservationResult.source_call_id for observations and
against tool_error_entry["tool_call_id"] (or add/expect that key) for structured
errors, build a lookup map from source_call_id->ObservationResult to get each
call's observation regardless of position, and when matching a tool_errors entry
consume it exactly once (e.g., remove or mark it as used) so entries aren’t
reused across multiple atif_tool_call instances; update handling around
observation_content, is_error, matching_extra, failed_tool_calls, and
per_tool_summary to use these per-call matches.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py`:
- Around line 39-42: The Field "attempts" in models.py is documented as "Ordered
list of every call to this tool" but ToolFailureEvaluator builds attempts = [a
for a in attempts if a.error is not None], so update the mismatch: either modify
ToolFailureEvaluator (the logic that filters attempts) to include successful
calls as well, or change the Field description for attempts (and any related
docstrings) to state explicitly that it contains only failed attempts; make this
change where attempts: list[_ToolCall] = Field(...) is declared and adjust
tests/docs referencing ToolFailureEvaluator accordingly so schema and
implementation match.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py`:
- Around line 30-34: The max_concurrency Pydantic field currently allows
zero/negative values and must be validated at the config boundary: update the
Field definition for max_concurrency in register.py (the max_concurrency: int =
Field(...)) to enforce a positive bound (e.g. add gt=0) so invalid CLI/user
configs are rejected immediately while keeping the default 8; no runtime
scheduling changes needed—just add the Pydantic constraint to the existing Field
(or add an equivalent `@validator` on max_concurrency) to enforce >0.
In `@packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py`:
- Around line 316-318: Replace the temporary list comprehension in _get_tool_end
with a generator passed to next to avoid creating an intermediate list and
satisfy Ruff: inside function _get_tool_end(stats:
list[IntermediateStepPayload]) use next(s for s in stats if s.event_type ==
IntermediateStepType.TOOL_END) (or next((s for s in stats if s.event_type ==
IntermediateStepType.TOOL_END), None) if you want to return None when missing)
so you still return the first IntermediateStepPayload matching
IntermediateStepType.TOOL_END without building a list.
---
Nitpick comments:
In `@packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py`:
- Line 20: Replace the direct import of ToolMessage in the test with a minimal
test double (either a plain dict or a small stub object) that exposes the two
attributes the production code uses: status and content; update the test in
test_atif_converter.py to construct and pass this dict/stub into the
atif_converter call (the same function/class under test in atif_converter.py)
instead of a ToolMessage instance so the test no longer depends on
langchain_core via ToolMessage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bdb0bc8d-0b5f-4330-a44a-4660a278e89e
📒 Files selected for processing (10)
- packages/nvidia_nat_core/src/nat/utils/atif_converter.py
- packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/register.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/__init__.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py
- packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py
- packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
- packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py
```python
def _extract_tool_error(output: Any) -> dict[str, str] | None:
    """Extract error metadata from a tool output for ``step.extra["tool_errors"]``."""
    # TODO: return a model instead of a plain dict once ATIF spec adds error support
```
Remove the TODO from this helper comment.
Vale rejects TODO in source comments, so this will keep tripping docs/comment lint until it is reworded or tracked externally.
♻️ Suggested edit

```diff
-    # TODO: return a model instead of a plain dict once ATIF spec adds error support
+    # Keep this as a plain dict until the ATIF spec grows typed error fields.
```

As per coding guidelines: "Verify that documentation and comments are clear and comprehensive. Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like 'lorem ipsum'."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/nvidia_nat_core/src/nat/utils/atif_converter.py` around lines 104 -
106, Remove the literal "TODO" from the docstring for _extract_tool_error;
instead rephrase the note to a neutral statement such as "Return a model instead
of a plain dict once ATIF spec adds error support" or move that implementation
note to an external tracker, so the docstring no longer contains TODO/FIXME
placeholder text and continues to describe the current behavior of
_extract_tool_error.
```python
tool_error: dict[str, str] | None = _extract_tool_error(raw_output)

if tool_error is not None:
    tool_error["tool"] = tool_name
extra: dict[str, Any] | None = ({"tool_errors": [tool_error]} if tool_error else None)

if pending is not None:
    pending.tool_calls.append(tc)
    pending.observations.append(obs)
    if tool_error:
        pending.extra.setdefault("tool_errors", []).append(tool_error)
    pending.tool_ancestry.append(_atif_ancestry_from_ist(ist))
else:
    extra = _atif_step_extra_model_from_ist(ist).model_dump(exclude_none=True)
    if tool_error:
        extra.setdefault("tool_errors", []).append(tool_error)
```
tool_errors needs a per-call identifier, not just the tool name.
Cross-file, packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py currently matches step.extra["tool_errors"] by tool. If a single agent step contains two calls to the same function and only one fails, both iterations will hit the same metadata entry and both calls will be counted as failures. Please persist the generated call_id/source_call_id with each error record here and have the evaluator match on that stable ID first. A regression test with same-name tools and only one failure would pin this down.
🧩 Partial converter-side fix

```diff
-if tool_error is not None:
-    tool_error["tool"] = tool_name
+if tool_error is not None:
+    tool_error["tool"] = tool_name
+    tool_error["tool_call_id"] = call_id
```

A companion change is still needed in `packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py` so the ATIF lane consumes `tool_call_id` instead of keying solely on `tool`.
Also applies to: 555-568
```python
for step in steps:
    if not step.tool_calls or step.source != "agent":
        continue
```
Do not drop orphan tool_errors.
Line 102 skips any agent step whose tool_calls array is empty, but this PR now carries orphan tool failures through step.extra["tool_errors"]. Those failures are invisible here, so a sample can return a perfect score even though it already contains recorded tool errors. Please count unmatched tool_errors entries as failed calls and add a regression for a step that has extra["tool_errors"] but no tool_calls.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`
around lines 101 - 103, The loop that currently skips agent steps with empty
step.tool_calls (in the evaluator iterating "for step in steps") must be changed
to also inspect step.extra["tool_errors"] and treat each orphaned tool_error as
a failed tool call; specifically, where the code checks "if not step.tool_calls
or step.source != 'agent': continue", remove or adjust the early continue to
allow processing when step.extra.get('tool_errors') exists, count each entry in
step.extra["tool_errors"] toward the failure total, and add the same
regression/recording logic you use for normal tool_calls (e.g., increment
failure counters and mark the step as regressed) so unmatched tool_errors are
not ignored.
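The requested behavior can be sketched as follows. This is a simplified illustration using plain dicts for steps (the real evaluator operates on ATIF step objects, and the one-error-per-failed-call assumption in the matched branch is mine, not the code's):

```python
def count_calls_and_failures(steps: list[dict]) -> tuple[int, int]:
    """Count tool calls and failures without dropping orphan tool_errors."""
    total = failed = 0
    for step in steps:
        if step.get("source") != "agent":
            continue
        tool_calls = step.get("tool_calls") or []
        tool_errors = (step.get("extra") or {}).get("tool_errors", [])
        if not tool_calls:
            # Orphan errors: each recorded entry counts as a failed call
            # instead of being skipped by an early `continue`.
            total += len(tool_errors)
            failed += len(tool_errors)
            continue
        total += len(tool_calls)
        # Simplification: assume one tool_errors entry per failed call;
        # the real evaluator does per-call error detection here.
        failed += len(tool_errors)
    return total, failed
```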
```python
observations: list[ObservationResult] = (step.observation.results if step.observation else [])

for index, atif_tool_call in enumerate(step.tool_calls):
    observation_content: str = ""
    if index < len(observations) and observations[index].content:
        raw_content: str | list[Any] | None = observations[index].content
        observation_content = raw_content if isinstance(raw_content, str) else str(raw_content)

    is_error: bool = False
    error_content: str = ""

    # Check step.extra["tool_errors"] for structured error metadata
    extra_errors: list[dict[str, Any]] = (step.extra or {}).get("tool_errors", [])
    matching_extra: dict[str, Any] | None = None
    for tool_error_entry in extra_errors:
        if tool_error_entry.get("tool") == atif_tool_call.function_name:
            is_error = True
            matching_extra = tool_error_entry
            break
```
Correlate ATIF failures per call, not per tool name.
Line 107 still pairs observations by position, and Lines 119-123 reuse the first tool_errors entry whose tool matches function_name. If one step invokes the same tool twice, or results arrive out of order, the same failure can be applied to multiple calls and both failed_tool_calls and per_tool_summary become wrong. Please switch this to call-level correlation using tool_call_id / ObservationResult.source_call_id from packages/nvidia_nat_core/src/nat/data_models/atif/observation_result.py lines 27-46, and ensure each structured error entry is consumed at most once.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`
around lines 105 - 123, The current logic pairs observations and tool_errors by
list position and tool name (step.tool_calls / observations index and
tool_error_entry["tool"]), which misattributes failures when the same tool is
called multiple times or results arrive out-of-order; change to correlate at the
call level by matching atif_tool_call.tool_call_id (or similar call id on the
tool call object) against ObservationResult.source_call_id for observations and
against tool_error_entry["tool_call_id"] (or add/expect that key) for structured
errors, build a lookup map from source_call_id->ObservationResult to get each
call's observation regardless of position, and when matching a tool_errors entry
consume it exactly once (e.g., remove or mark it as used) so entries aren’t
reused across multiple atif_tool_call instances; update handling around
observation_content, is_error, matching_extra, failed_tool_calls, and
per_tool_summary to use these per-call matches.
```python
attempts: list[_ToolCall] = Field(
    default_factory=list,
    description="Ordered list of every call to this tool.",
)
```
attempts is documented as full history, but the evaluator only emits failed calls.
ToolFailureEvaluator currently builds attempts=[a for a in attempts if a.error is not None], so these descriptions overstate what downstream consumers will actually receive. Either include successful attempts as well, or tighten the schema/docs to say this list only contains failed attempts.
📝 Suggested schema doc update

```diff
 attempts: list[_ToolCall] = Field(
     default_factory=list,
-    description="Ordered list of every call to this tool.",
+    description="Ordered list of failed calls to this tool.",
 )
@@
 per_tool_summary: list[ToolSummary] = Field(
     default_factory=list,
-    description="Per-tool health summary with attempt history.",
+    description="Per-tool health summary with failed attempt history.",
 )
```

As per coding guidelines: "Keep documentation in sync with code; the documentation pipeline will fail on Sphinx errors or broken links."
Also applies to: 52-55
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py`
around lines 39 - 42, The Field "attempts" in models.py is documented as
"Ordered list of every call to this tool" but ToolFailureEvaluator builds
attempts = [a for a in attempts if a.error is not None], so update the mismatch:
either modify ToolFailureEvaluator (the logic that filters attempts) to include
successful calls as well, or change the Field description for attempts (and any
related docstrings) to state explicitly that it contains only failed attempts;
make this change where attempts: list[_ToolCall] = Field(...) is declared and
adjust tests/docs referencing ToolFailureEvaluator accordingly so schema and
implementation match.
```python
max_concurrency: int = Field(default=8, description="Max concurrency for evaluation.")
enable_atif_evaluator: bool = Field(
    default=False,
    description="Enable ATIF-native evaluation lane alongside the legacy lane.",
)
```
Validate max_concurrency at the config boundary.
This is user-supplied eval/CLI input, but 0 and negative values are accepted today and only fail later when the evaluator is scheduled. Add a positive bound on the Pydantic field so bad configs are rejected up front.
🛡️ Suggested change

```diff
-max_concurrency: int = Field(default=8, description="Max concurrency for evaluation.")
+max_concurrency: int = Field(default=8, ge=1, description="Max concurrency for evaluation.")
```

As per coding guidelines: "Validate and sanitise all user input, especially in web or CLI interfaces."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py`
around lines 30 - 34, The max_concurrency Pydantic field currently allows
zero/negative values and must be validated at the config boundary: update the
Field definition for max_concurrency in register.py (the max_concurrency: int =
Field(...)) to enforce a positive bound (e.g. add gt=0) so invalid CLI/user
configs are rejected immediately while keeping the default 8; no runtime
scheduling changes needed—just add the Pydantic constraint to the existing Field
(or add an equivalent `@validator` on max_concurrency) to enforce >0.
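In Pydantic the fix is the one-liner above (`Field(default=8, ge=1, ...)` or `gt=0`). The same fail-fast behavior, sketched dependency-free with a dataclass so it runs without Pydantic installed:

```python
from dataclasses import dataclass


@dataclass
class ToolFailureEvaluatorConfig:
    """Dependency-free sketch; the real config is a Pydantic model where
    Field(default=8, ge=1, ...) enforces this bound declaratively."""
    max_concurrency: int = 8
    enable_atif_evaluator: bool = False

    def __post_init__(self) -> None:
        # Reject bad configs at construction time, not when the evaluator
        # is scheduled.
        if self.max_concurrency < 1:
            raise ValueError("max_concurrency must be >= 1")
```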
Description

Summary

Adds a `tool_failure` evaluator to the `nat eval` pipeline that computes per-item tool call success rates and surfaces structured error details through both the legacy IntermediateStep and ATIF evaluation lanes. This required changes across three packages to capture, propagate, and consume tool error signals.

Changes

1. LangChain Callback Handler (`nvidia_nat_langchain`)

- `on_tool_error` callback: New handler on `LangchainProfilerHandler` that catches tool exceptions and emits a `TOOL_END` intermediate step with a `ToolMessage` (`status="error"`) output. Previously, tool errors were silently dropped and never reached the evaluation pipeline.
- `_run_id_to_tool_name` mapping so `on_tool_error` can resolve the tool name (which is only available during `on_tool_start`).
- `_extract_tools_schema`: Removed the redundant Anthropic-format fallback parser that was duplicating error handling; non-OpenAI schemas now log at `debug` level and skip gracefully.

2. ATIF Converter (`nvidia_nat_core`)

- `_extract_tool_error` helper: New function that inspects tool outputs for `status="error"` and extracts structured error metadata (`error`, `error_type`, `error_message`, `status`).
- `step.extra["tool_errors"]` propagation: Both `IntermediateStepToATIFConverter` (batch) and `ATIFStreamConverter` (streaming) now inject `tool_errors` into the ATIF step's `extra` dict when a tool error is detected. This works for both pending (merged into an existing agent step) and orphan (standalone step) tool calls.

3. Tool Failure Evaluator (`nvidia_nat_eval`)

New `tool_failure` evaluator plugin with the following files:

- `evaluator.py` — `ToolFailureEvaluator` implementing both `BaseEvaluator` (legacy lane) and `AtifBaseEvaluator` (ATIF lane). Each lane processes its respective trajectory format, detects errors, and produces a self-contained `EvalOutputItem` per dataset item.
- `models.py` — Pydantic models for the reasoning output: `ToolFailureReasoning` (overall stats), `ToolSummary` (per-tool breakdown), `_ToolCall` (individual call with input/output/error).
- `register.py` — Plugin registration with `ToolFailureEvaluatorConfig` supporting `max_concurrency` and `enable_atif_evaluator` options.
- `__init__.py` — Public API exports.

Error detection strategies:

- `step.extra["tool_errors"]` — structured metadata from the ATIF converter
- `ToolMessage` dict parsing — `{'status': 'error', 'content': '...'}` via `ast.literal_eval`
- Exception pattern heuristic — `"XyzError: ..."` in observation content

Output shape:

- `per_tool_summary` only includes tools that had at least one failure
- `ToolSummary.attempts` only includes failed `_ToolCall` entries
- `failed_tools` lists the names of all tools with failures
- `score` is `(total - failed) / total`, unrounded

Tests

- `test_tool_failure_evaluator.py` — 13 unit tests across two test classes validating model population for both evaluation lanes, including error detection via each ATIF strategy, edge cases (empty trajectory, `None` data, missing tool name, non-error strings), and mixed success/failure filtering.
- `test_atif_converter.py` — New tests for `_extract_tool_error` and `tool_errors` propagation through both batch and streaming converters.
- `test_langchain_callback_handler.py` — New tests for the `on_tool_error` callback verifying the emitted intermediate step structure.

By Submitting this PR I confirm:
Summary by CodeRabbit
Release Notes
New Features
Tests