Add tool failure evaluator for IntermediateStep and ATIF formats #1816
ericevans-nv wants to merge 2 commits into NVIDIA:develop from
Conversation
…r and ATIF converter Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>
…eat/tool-error-observability
Walkthrough

This PR introduces comprehensive tool error handling and failure evaluation across the framework. It adds error extraction in the ATIF converter (batch and streaming paths), creates a new tool failure evaluator plugin to measure tool call success rates, integrates tool error callbacks into the LangChain handler, and provides extensive test coverage.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Agent
    participant Converter
    participant ErrorParser
    participant ATIF
    Agent->>Converter: Tool output with status="error"
    Converter->>ErrorParser: Extract error metadata
    ErrorParser->>ErrorParser: Parse error_type from<br/>"<identifier>: <message>"
    ErrorParser-->>Converter: {error, error_type,<br/>error_message, status}
    Converter->>ATIF: Attach error metadata<br/>to step.extra["tool_errors"]
```
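The error_type split shown in the diagram above can be sketched as follows. `parse_tool_error` is a hypothetical stand-in for the converter's `_extract_tool_error` helper, whose exact signature and edge-case handling may differ:

```python
def parse_tool_error(content: str) -> dict[str, str]:
    """Split '<identifier>: <message>' into error_type / error_message.

    Hypothetical sketch of the parsing step in the diagram; the real
    helper (_extract_tool_error) may handle more edge cases.
    """
    error_type, sep, error_message = content.partition(": ")
    if not sep or not error_type.isidentifier():
        # No "<identifier>: " prefix: keep the whole string as the message.
        error_type, error_message = "", content
    return {
        "error": content,
        "error_type": error_type,
        "error_message": error_message,
        "status": "error",
    }
```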
```mermaid
sequenceDiagram
    participant Input as Input<br/>(Legacy/ATIF)
    participant Evaluator
    participant ErrorDetector
    participant Aggregator
    participant Output as Output<br/>(ToolFailureReasoning)
    Input->>Evaluator: Trajectory with tool calls
    Evaluator->>ErrorDetector: For each tool call,<br/>detect failure via:<br/>1. step.extra["tool_errors"]<br/>2. Parsed observation dict<br/>3. Exception pattern heuristic
    ErrorDetector-->>Evaluator: Error status &<br/>error message
    Evaluator->>Aggregator: Aggregate per-tool<br/>calls and failures
    Aggregator->>Aggregator: Calculate success_rate =<br/>(total - failed) / total
    Aggregator-->>Output: ToolFailureReasoning<br/>with metrics & summaries
```
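The detection cascade and the success-rate formula in the second diagram can be sketched roughly as below. This is a simplified illustration, not the evaluator's actual API; strategy 1 (`step.extra["tool_errors"]`) needs the full step object and is omitted:

```python
import ast
import re


def detect_tool_error(observation: str) -> bool:
    """Sketch of detection strategies 2 and 3 from the diagram."""
    # Strategy 2: a serialized ToolMessage-style dict, e.g. "{'status': 'error', ...}"
    try:
        parsed = ast.literal_eval(observation)
        if isinstance(parsed, dict) and parsed.get("status") == "error":
            return True
    except (ValueError, SyntaxError):
        pass
    # Strategy 3: exception-pattern heuristic such as "XyzError: ..."
    return bool(re.match(r"^[A-Z]\w*(Error|Exception)\b\s*:", observation))


def success_rate(total: int, failed: int) -> float:
    # score = (total - failed) / total, unrounded (per the PR description);
    # the zero-call fallback of 1.0 is an assumption, not confirmed behavior.
    return 1.0 if total == 0 else (total - failed) / total
```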
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 7
🧹 Nitpick comments (1)
packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py (1)
20-20: Use a dict or stub in the test instead of importing `ToolMessage` to reduce test coupling.

`atif_converter.py` only accesses the `status` and `content` attributes, so a simple dict will exercise the same code path. Note that `langchain-core` is not explicitly declared in `packages/nvidia_nat_core/pyproject.toml` despite being imported throughout the source code (via transitive dependencies). The test should avoid hardwiring this dependency; use a minimal test object instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py` at line 20, Replace the direct import of ToolMessage in the test with a minimal test double (either a plain dict or a small stub object) that exposes the two attributes the production code uses: status and content; update the test in test_atif_converter.py to construct and pass this dict/stub into the atif_converter call (the same function/class under test in atif_converter.py) instead of a ToolMessage instance so the test no longer depends on langchain_core via ToolMessage.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/nvidia_nat_core/src/nat/utils/atif_converter.py`:
- Around line 104-106: Remove the literal "TODO" from the docstring for
_extract_tool_error; instead rephrase the note to a neutral statement such as
"Return a model instead of a plain dict once ATIF spec adds error support" or
move that implementation note to an external tracker, so the docstring no longer
contains TODO/FIXME placeholder text and continues to describe the current
behavior of _extract_tool_error.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`:
- Around line 101-103: The loop that currently skips agent steps with empty
step.tool_calls (in the evaluator iterating "for step in steps") must be changed
to also inspect step.extra["tool_errors"] and treat each orphaned tool_error as
a failed tool call; specifically, where the code checks "if not step.tool_calls
or step.source != 'agent': continue", remove or adjust the early continue to
allow processing when step.extra.get('tool_errors') exists, count each entry in
step.extra["tool_errors"] toward the failure total, and add the same
regression/recording logic you use for normal tool_calls (e.g., increment
failure counters and mark the step as regressed) so unmatched tool_errors are
not ignored.
- Around line 105-123: The current logic pairs observations and tool_errors by
list position and tool name (step.tool_calls / observations index and
tool_error_entry["tool"]), which misattributes failures when the same tool is
called multiple times or results arrive out-of-order; change to correlate at the
call level by matching atif_tool_call.tool_call_id (or similar call id on the
tool call object) against ObservationResult.source_call_id for observations and
against tool_error_entry["tool_call_id"] (or add/expect that key) for structured
errors, build a lookup map from source_call_id->ObservationResult to get each
call's observation regardless of position, and when matching a tool_errors entry
consume it exactly once (e.g., remove or mark it as used) so entries aren’t
reused across multiple atif_tool_call instances; update handling around
observation_content, is_error, matching_extra, failed_tool_calls, and
per_tool_summary to use these per-call matches.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py`:
- Around line 39-42: The Field "attempts" in models.py is documented as "Ordered
list of every call to this tool" but ToolFailureEvaluator builds attempts = [a
for a in attempts if a.error is not None], so update the mismatch: either modify
ToolFailureEvaluator (the logic that filters attempts) to include successful
calls as well, or change the Field description for attempts (and any related
docstrings) to state explicitly that it contains only failed attempts; make this
change where attempts: list[_ToolCall] = Field(...) is declared and adjust
tests/docs referencing ToolFailureEvaluator accordingly so schema and
implementation match.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py`:
- Around line 30-34: The max_concurrency Pydantic field currently allows
zero/negative values and must be validated at the config boundary: update the
Field definition for max_concurrency in register.py (the max_concurrency: int =
Field(...)) to enforce a positive bound (e.g. add gt=0) so invalid CLI/user
configs are rejected immediately while keeping the default 8; no runtime
scheduling changes needed—just add the Pydantic constraint to the existing Field
(or add an equivalent `@validator` on max_concurrency) to enforce >0.
In `@packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py`:
- Around line 316-318: Replace the temporary list comprehension in _get_tool_end
with a generator passed to next to avoid creating an intermediate list and
satisfy Ruff: inside function _get_tool_end(stats:
list[IntermediateStepPayload]) use next(s for s in stats if s.event_type ==
IntermediateStepType.TOOL_END) (or next((s for s in stats if s.event_type ==
IntermediateStepType.TOOL_END), None) if you want to return None when missing)
so you still return the first IntermediateStepPayload matching
IntermediateStepType.TOOL_END without building a list.
---
Nitpick comments:
In `@packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py`:
- Line 20: Replace the direct import of ToolMessage in the test with a minimal
test double (either a plain dict or a small stub object) that exposes the two
attributes the production code uses: status and content; update the test in
test_atif_converter.py to construct and pass this dict/stub into the
atif_converter call (the same function/class under test in atif_converter.py)
instead of a ToolMessage instance so the test no longer depends on
langchain_core via ToolMessage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bdb0bc8d-0b5f-4330-a44a-4660a278e89e
📒 Files selected for processing (10)
- packages/nvidia_nat_core/src/nat/utils/atif_converter.py
- packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/register.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/__init__.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
- packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py
- packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py
- packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
- packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py
```python
def _extract_tool_error(output: Any) -> dict[str, str] | None:
    """Extract error metadata from a tool output for ``step.extra["tool_errors"]``."""
    # TODO: return a model instead of a plain dict once ATIF spec adds error support
```
Remove the TODO from this helper comment.
Vale rejects TODO in source comments, so this will keep tripping docs/comment lint until it is reworded or tracked externally.
♻️ Suggested edit

```diff
-    # TODO: return a model instead of a plain dict once ATIF spec adds error support
+    # Keep this as a plain dict until the ATIF spec grows typed error fields.
```

As per coding guidelines: "Verify that documentation and comments are clear and comprehensive. Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like 'lorem ipsum'."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/nvidia_nat_core/src/nat/utils/atif_converter.py` around lines 104 -
106, Remove the literal "TODO" from the docstring for _extract_tool_error;
instead rephrase the note to a neutral statement such as "Return a model instead
of a plain dict once ATIF spec adds error support" or move that implementation
note to an external tracker, so the docstring no longer contains TODO/FIXME
placeholder text and continues to describe the current behavior of
_extract_tool_error.
```python
tool_error: dict[str, str] | None = _extract_tool_error(raw_output)

if tool_error is not None:
    tool_error["tool"] = tool_name
extra: dict[str, Any] | None = ({"tool_errors": [tool_error]} if tool_error else None)

if pending is not None:
    pending.tool_calls.append(tc)
    pending.observations.append(obs)
    if tool_error:
        pending.extra.setdefault("tool_errors", []).append(tool_error)
    pending.tool_ancestry.append(_atif_ancestry_from_ist(ist))
else:
    extra = _atif_step_extra_model_from_ist(ist).model_dump(exclude_none=True)
    if tool_error:
        extra.setdefault("tool_errors", []).append(tool_error)
```
tool_errors needs a per-call identifier, not just the tool name.
Cross-file, packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py currently matches step.extra["tool_errors"] by tool. If a single agent step contains two calls to the same function and only one fails, both iterations will hit the same metadata entry and both calls will be counted as failures. Please persist the generated call_id/source_call_id with each error record here and have the evaluator match on that stable ID first. A regression test with same-name tools and only one failure would pin this down.
🧩 Partial converter-side fix

```diff
-if tool_error is not None:
-    tool_error["tool"] = tool_name
+if tool_error is not None:
+    tool_error["tool"] = tool_name
+    tool_error["tool_call_id"] = call_id
```

A companion change is still needed in `packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py` so the ATIF lane consumes `tool_call_id` instead of keying solely on `tool`.
Also applies to: 555-568
```python
for step in steps:
    if not step.tool_calls or step.source != "agent":
        continue
```
Do not drop orphan tool_errors.
Line 102 skips any agent step whose tool_calls array is empty, but this PR now carries orphan tool failures through step.extra["tool_errors"]. Those failures are invisible here, so a sample can return a perfect score even though it already contains recorded tool errors. Please count unmatched tool_errors entries as failed calls and add a regression for a step that has extra["tool_errors"] but no tool_calls.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`
around lines 101 - 103, The loop that currently skips agent steps with empty
step.tool_calls (in the evaluator iterating "for step in steps") must be changed
to also inspect step.extra["tool_errors"] and treat each orphaned tool_error as
a failed tool call; specifically, where the code checks "if not step.tool_calls
or step.source != 'agent': continue", remove or adjust the early continue to
allow processing when step.extra.get('tool_errors') exists, count each entry in
step.extra["tool_errors"] toward the failure total, and add the same
regression/recording logic you use for normal tool_calls (e.g., increment
failure counters and mark the step as regressed) so unmatched tool_errors are
not ignored.
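The requested behavior can be sketched as follows. This is a simplified illustration using plain dicts for steps (the real evaluator operates on ATIF step objects, and the one-error-per-failed-call assumption in the matched branch is mine, not the code's):

```python
def count_calls_and_failures(steps: list[dict]) -> tuple[int, int]:
    """Count tool calls and failures without dropping orphan tool_errors."""
    total = failed = 0
    for step in steps:
        if step.get("source") != "agent":
            continue
        tool_calls = step.get("tool_calls") or []
        tool_errors = (step.get("extra") or {}).get("tool_errors", [])
        if not tool_calls:
            # Orphan errors: each recorded entry counts as a failed call
            # instead of being skipped by an early `continue`.
            total += len(tool_errors)
            failed += len(tool_errors)
            continue
        total += len(tool_calls)
        # Simplification: assume one tool_errors entry per failed call;
        # the real evaluator does per-call error detection here.
        failed += len(tool_errors)
    return total, failed
```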
```python
observations: list[ObservationResult] = (step.observation.results if step.observation else [])

for index, atif_tool_call in enumerate(step.tool_calls):
    observation_content: str = ""
    if index < len(observations) and observations[index].content:
        raw_content: str | list[Any] | None = observations[index].content
        observation_content = raw_content if isinstance(raw_content, str) else str(raw_content)

    is_error: bool = False
    error_content: str = ""

    # Check step.extra["tool_errors"] for structured error metadata
    extra_errors: list[dict[str, Any]] = (step.extra or {}).get("tool_errors", [])
    matching_extra: dict[str, Any] | None = None
    for tool_error_entry in extra_errors:
        if tool_error_entry.get("tool") == atif_tool_call.function_name:
            is_error = True
            matching_extra = tool_error_entry
            break
```
Correlate ATIF failures per call, not per tool name.
Line 107 still pairs observations by position, and Lines 119-123 reuse the first tool_errors entry whose tool matches function_name. If one step invokes the same tool twice, or results arrive out of order, the same failure can be applied to multiple calls and both failed_tool_calls and per_tool_summary become wrong. Please switch this to call-level correlation using tool_call_id / ObservationResult.source_call_id from packages/nvidia_nat_core/src/nat/data_models/atif/observation_result.py lines 27-46, and ensure each structured error entry is consumed at most once.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py`
around lines 105 - 123, The current logic pairs observations and tool_errors by
list position and tool name (step.tool_calls / observations index and
tool_error_entry["tool"]), which misattributes failures when the same tool is
called multiple times or results arrive out-of-order; change to correlate at the
call level by matching atif_tool_call.tool_call_id (or similar call id on the
tool call object) against ObservationResult.source_call_id for observations and
against tool_error_entry["tool_call_id"] (or add/expect that key) for structured
errors, build a lookup map from source_call_id->ObservationResult to get each
call's observation regardless of position, and when matching a tool_errors entry
consume it exactly once (e.g., remove or mark it as used) so entries aren’t
reused across multiple atif_tool_call instances; update handling around
observation_content, is_error, matching_extra, failed_tool_calls, and
per_tool_summary to use these per-call matches.
```python
attempts: list[_ToolCall] = Field(
    default_factory=list,
    description="Ordered list of every call to this tool.",
)
```
attempts is documented as full history, but the evaluator only emits failed calls.
ToolFailureEvaluator currently builds attempts=[a for a in attempts if a.error is not None], so these descriptions overstate what downstream consumers will actually receive. Either include successful attempts as well, or tighten the schema/docs to say this list only contains failed attempts.
📝 Suggested schema doc update

```diff
 attempts: list[_ToolCall] = Field(
     default_factory=list,
-    description="Ordered list of every call to this tool.",
+    description="Ordered list of failed calls to this tool.",
 )
@@
 per_tool_summary: list[ToolSummary] = Field(
     default_factory=list,
-    description="Per-tool health summary with attempt history.",
+    description="Per-tool health summary with failed attempt history.",
 )
```

As per coding guidelines: "Keep documentation in sync with code; the documentation pipeline will fail on Sphinx errors or broken links."
Also applies to: 52-55
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py`
around lines 39 - 42, The Field "attempts" in models.py is documented as
"Ordered list of every call to this tool" but ToolFailureEvaluator builds
attempts = [a for a in attempts if a.error is not None], so update the mismatch:
either modify ToolFailureEvaluator (the logic that filters attempts) to include
successful calls as well, or change the Field description for attempts (and any
related docstrings) to state explicitly that it contains only failed attempts;
make this change where attempts: list[_ToolCall] = Field(...) is declared and
adjust tests/docs referencing ToolFailureEvaluator accordingly so schema and
implementation match.
```python
max_concurrency: int = Field(default=8, description="Max concurrency for evaluation.")
enable_atif_evaluator: bool = Field(
    default=False,
    description="Enable ATIF-native evaluation lane alongside the legacy lane.",
)
```
Validate max_concurrency at the config boundary.
This is user-supplied eval/CLI input, but 0 and negative values are accepted today and only fail later when the evaluator is scheduled. Add a positive bound on the Pydantic field so bad configs are rejected up front.
🛡️ Suggested change

```diff
-max_concurrency: int = Field(default=8, description="Max concurrency for evaluation.")
+max_concurrency: int = Field(default=8, ge=1, description="Max concurrency for evaluation.")
```

As per coding guidelines: "Validate and sanitise all user input, especially in web or CLI interfaces."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py`
around lines 30 - 34, The max_concurrency Pydantic field currently allows
zero/negative values and must be validated at the config boundary: update the
Field definition for max_concurrency in register.py (the max_concurrency: int =
Field(...)) to enforce a positive bound (e.g. add gt=0) so invalid CLI/user
configs are rejected immediately while keeping the default 8; no runtime
scheduling changes needed—just add the Pydantic constraint to the existing Field
(or add an equivalent `@validator` on max_concurrency) to enforce >0.
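In Pydantic the fix is the one-liner above (`Field(default=8, ge=1, ...)` or `gt=0`). The same fail-fast behavior, sketched dependency-free with a dataclass so it runs without Pydantic installed:

```python
from dataclasses import dataclass


@dataclass
class ToolFailureEvaluatorConfig:
    """Dependency-free sketch; the real config is a Pydantic model where
    Field(default=8, ge=1, ...) enforces this bound declaratively."""
    max_concurrency: int = 8
    enable_atif_evaluator: bool = False

    def __post_init__(self) -> None:
        # Reject bad configs at construction time, not when the evaluator
        # is scheduled.
        if self.max_concurrency < 1:
            raise ValueError("max_concurrency must be >= 1")
```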
Description

Summary

Adds a `tool_failure` evaluator to the `nat eval` pipeline that computes per-item tool call success rates and surfaces structured error details through both the legacy IntermediateStep and ATIF evaluation lanes. This required changes across three packages to capture, propagate, and consume tool error signals.

Changes

1. LangChain Callback Handler (`nvidia_nat_langchain`)

- `on_tool_error` callback: New handler on `LangchainProfilerHandler` that catches tool exceptions and emits a `TOOL_END` intermediate step with a `ToolMessage` (`status="error"`) output. Previously, tool errors were silently dropped and never reached the evaluation pipeline.
- `_run_id_to_tool_name` mapping so `on_tool_error` can resolve the tool name (which is only available during `on_tool_start`).
- `_extract_tools_schema`: Removed the redundant Anthropic-format fallback parser that was duplicating error handling; non-OpenAI schemas now log at `debug` level and skip gracefully.

2. ATIF Converter (`nvidia_nat_core`)

- `_extract_tool_error` helper: New function that inspects tool outputs for `status="error"` and extracts structured error metadata (`error`, `error_type`, `error_message`, `status`).
- `step.extra["tool_errors"]` propagation: Both `IntermediateStepToATIFConverter` (batch) and `ATIFStreamConverter` (streaming) now inject `tool_errors` into the ATIF step's `extra` dict when a tool error is detected. This works for both pending (merged into an existing agent step) and orphan (standalone step) tool calls.

3. Tool Failure Evaluator (`nvidia_nat_eval`)

New `tool_failure` evaluator plugin with the following files:

- `evaluator.py` — `ToolFailureEvaluator` implementing both `BaseEvaluator` (legacy lane) and `AtifBaseEvaluator` (ATIF lane). Each lane processes its respective trajectory format, detects errors, and produces a self-contained `EvalOutputItem` per dataset item.
- `models.py` — Pydantic models for the reasoning output: `ToolFailureReasoning` (overall stats), `ToolSummary` (per-tool breakdown), `_ToolCall` (individual call with input/output/error).
- `register.py` — Plugin registration with `ToolFailureEvaluatorConfig` supporting `max_concurrency` and `enable_atif_evaluator` options.
- `__init__.py` — Public API exports.

Error detection strategies:

- `step.extra["tool_errors"]` — structured metadata from the ATIF converter
- `ToolMessage` dict parsing — `{'status': 'error', 'content': '...'}` via `ast.literal_eval`
- Exception pattern heuristic — `"XyzError: ..."` in observation content

Output shape:

- `per_tool_summary` only includes tools that had at least one failure
- `ToolSummary.attempts` only includes failed `_ToolCall` entries
- `failed_tools` lists the names of all tools with failures
- `score` is `(total - failed) / total`, unrounded

Tests

- `test_tool_failure_evaluator.py` — 13 unit tests across two test classes validating model population for both evaluation lanes, including error detection via each ATIF strategy, edge cases (empty trajectory, `None` data, missing tool name, non-error strings), and mixed success/failure filtering.
- `test_atif_converter.py` — New tests for `_extract_tool_error` and `tool_errors` propagation through both batch and streaming converters.
- `test_langchain_callback_handler.py` — New tests for the `on_tool_error` callback verifying the emitted intermediate step structure.

By Submitting this PR I confirm:
Summary by CodeRabbit
Release Notes
New Features
Tests