
Add tool failure evaluator for IntermediateStep and ATIF formats#1816

Open
ericevans-nv wants to merge 2 commits into NVIDIA:develop from ericevans-nv:feat/tool-error-observability

Conversation

@ericevans-nv
Contributor

@ericevans-nv ericevans-nv commented Mar 20, 2026

Description

Summary

Adds a tool_failure evaluator to the nat eval pipeline that computes per-item tool call success rates and surfaces structured error details through both the legacy IntermediateStep and ATIF evaluation lanes. This required changes across three packages to capture, propagate, and consume tool error signals.

Changes

1. LangChain Callback Handler (nvidia_nat_langchain)

  • on_tool_error callback: New handler on LangchainProfilerHandler that catches tool exceptions and emits a TOOL_END intermediate step with a ToolMessage(status="error") output. Previously, tool errors were silently dropped and never reached the evaluation pipeline.
  • Tool name tracking: Added _run_id_to_tool_name mapping so on_tool_error can resolve the tool name (which is only available during on_tool_start).
  • Simplified _extract_tools_schema: Removed the redundant Anthropic-format fallback parser that was duplicating error handling; non-OpenAI schemas now log at debug level and skip gracefully.
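The run-id bookkeeping described above can be sketched in plain Python. This is an illustrative stand-in, not the shipped handler: the real class subclasses LangChain's async callback handler and emits an IntermediateStepPayload, while here plain dicts stand in for those types.

```python
from typing import Any
from uuid import UUID, uuid4


class ToolErrorRecorder:
    """Illustrative stand-in for the callback handler's tool-error path."""

    def __init__(self) -> None:
        self._run_id_to_tool_name: dict[UUID, str] = {}
        self.steps: list[dict[str, Any]] = []

    def on_tool_start(self, serialized: dict[str, Any], run_id: UUID) -> None:
        # The tool name is only known at start time, so remember it per run id.
        self._run_id_to_tool_name[run_id] = serialized.get("name", "unknown_tool")

    def on_tool_error(self, error: BaseException, run_id: UUID) -> None:
        # Resolve the name recorded at on_tool_start and emit a TOOL_END step
        # whose output mimics ToolMessage(status="error").
        tool_name = self._run_id_to_tool_name.pop(run_id, "unknown_tool")
        self.steps.append({
            "event_type": "TOOL_END",
            "name": tool_name,
            "output": {"status": "error", "content": str(error)},
        })


recorder = ToolErrorRecorder()
rid = uuid4()
recorder.on_tool_start({"name": "search"}, rid)
recorder.on_tool_error(ValueError("upstream timed out"), rid)
```

Without the `_run_id_to_tool_name` map, the error callback would have no way to name the failing tool, since LangChain only passes the tool name to `on_tool_start`.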

2. ATIF Converter (nvidia_nat_core)

  • _extract_tool_error helper: New function that inspects tool outputs for status="error" and extracts structured error metadata (error, error_type, error_message, status).
  • step.extra["tool_errors"] propagation: Both IntermediateStepToATIFConverter (batch) and ATIFStreamConverter (streaming) now inject tool_errors into the ATIF step's extra dict when a tool error is detected. This works for both pending (merged into an existing agent step) and orphan (standalone step) tool calls.

3. Tool Failure Evaluator (nvidia_nat_eval)

  • New tool_failure evaluator plugin with the following files:
    • evaluator.py — ToolFailureEvaluator implementing both BaseEvaluator (legacy lane) and AtifBaseEvaluator (ATIF lane). Each lane processes its respective trajectory format, detects errors, and produces a self-contained EvalOutputItem per dataset item.
    • models.py — Pydantic models for the reasoning output: ToolFailureReasoning (overall stats), ToolSummary (per-tool breakdown), _ToolCall (individual call with input/output/error).
    • register.py — Plugin registration with ToolFailureEvaluatorConfig supporting max_concurrency and enable_atif_evaluator options.
    • __init__.py — Public API exports.
  • ATIF lane error detection uses three strategies in priority order:
    1. step.extra["tool_errors"] — structured metadata from the ATIF converter
    2. Stringified ToolMessage dict parsing — {'status': 'error', 'content': '...'} via ast.literal_eval
    3. Raw error pattern matching — "XyzError: ..." in observation content
  • Reasoning output:
    • per_tool_summary only includes tools that had at least one failure
    • ToolSummary.attempts only includes failed _ToolCall entries
    • failed_tools lists the names of all tools with failures
    • score is (total - failed) / total, unrounded

Tests

  • test_tool_failure_evaluator.py — 13 unit tests across two test classes validating model population for both evaluation lanes, including error detection via each ATIF strategy, edge cases (empty trajectory, None data, missing tool name, non-error strings), and mixed success/failure filtering.
  • test_atif_converter.py — New tests for _extract_tool_error and tool_errors propagation through both batch and streaming converters.
  • test_langchain_callback_handler.py — New tests for on_tool_error callback verifying the emitted intermediate step structure.

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

Release Notes

  • New Features

    • Tool failure evaluator now computes tool call success rates and generates per-tool failure analysis with structured error detection
    • Trajectories now include enhanced error metadata capturing tool failures through multiple detection signals and error patterns
  • Tests

    • Comprehensive test coverage added for tool error handling and failure evaluation across trajectory formats

…r and ATIF converter

Signed-off-by: Eric Evans <194135482+ericevans-nv@users.noreply.github.com>
@ericevans-nv ericevans-nv self-assigned this Mar 20, 2026
@ericevans-nv ericevans-nv requested a review from a team as a code owner March 20, 2026 23:41
@ericevans-nv ericevans-nv added feature request New feature or request non-breaking Non-breaking change labels Mar 20, 2026
@coderabbitai

coderabbitai bot commented Mar 20, 2026

Walkthrough

This PR introduces comprehensive tool error handling and failure evaluation across the framework. It adds error extraction in the ATIF converter (batch and streaming paths), creates a new tool failure evaluator plugin to measure tool call success rates, integrates tool error callbacks into the LangChain handler, and provides extensive test coverage.

Changes

Cohort / File(s) Summary
ATIF Converter Error Handling
packages/nvidia_nat_core/src/nat/utils/atif_converter.py
Added _extract_tool_error() helper to detect and parse tool output errors with status "error". Updated batch and streaming converters to preserve raw tool output and attach extracted error metadata to step extra["tool_errors"] list when an error is detected.
ATIF Converter Tests
packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
Added error_trajectory fixture and TestToolErrorATIFConversion suite validating ATIF conversion of tool errors, including error type/message parsing, consistent behavior between batch and streaming paths, and correct handling when no errors are present.
Tool Failure Evaluator Plugin
packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/__init__.py, evaluator.py, models.py, register.py
New plugin module implementing ToolFailureEvaluator that computes tool-call success rates by detecting and classifying tool failures. Includes Pydantic models for reasoning output (ToolFailureReasoning, ToolSummary, _ToolCall) and registration logic supporting both legacy and ATIF trajectory formats with multiple error-detection paths.
Tool Failure Evaluator Registration
packages/nvidia_nat_eval/src/nat/plugins/eval/register.py
Added import of register_tool_failure_evaluator to enable plugin registration alongside existing dataset loaders.
Tool Failure Evaluator Tests
packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py
Comprehensive test suite covering empty trajectories, all-failure scenarios, mixed success/failure cases, and multiple error-detection paths. Tests validate correct population of failure metrics, per-tool summaries, and filtering of attempts to only include failed calls.
LangChain Callback Handler
packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
Added on_tool_error callback to record tool failures as IntermediateStepPayload with TOOL_END event type. Constructs ToolMessage with status="error" and tracks tool names via _run_id_to_tool_name state. Simplified tool schema handling by removing non-OpenAI format re-mapping.
LangChain Callback Handler Tests
packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py
Added pytest tests for on_tool_error callback, validating emission of IntermediateStepPayload with correct TOOL_END event type, ToolMessage status/content, and tool-name fallback behavior.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Agent
    participant Converter
    participant ErrorParser
    participant ATIF

    Agent->>Converter: Tool output with status="error"
    Converter->>ErrorParser: Extract error metadata
    ErrorParser->>ErrorParser: Parse error_type from<br/>"<identifier>: <message>"
    ErrorParser-->>Converter: {error, error_type,<br/>error_message, status}
    Converter->>ATIF: Attach error metadata<br/>to step.extra["tool_errors"]
```
```mermaid
sequenceDiagram
    participant Input as Input<br/>(Legacy/ATIF)
    participant Evaluator
    participant ErrorDetector
    participant Aggregator
    participant Output as Output<br/>(ToolFailureReasoning)

    Input->>Evaluator: Trajectory with tool calls
    Evaluator->>ErrorDetector: For each tool call,<br/>detect failure via:<br/>1. step.extra["tool_errors"]<br/>2. Parsed observation dict<br/>3. Exception pattern heuristic
    ErrorDetector-->>Evaluator: Error status &<br/>error message
    Evaluator->>Aggregator: Aggregate per-tool<br/>calls and failures
    Aggregator->>Aggregator: Calculate success_rate =<br/>(total - failed) / total
    Aggregator-->>Output: ToolFailureReasoning<br/>with metrics & summaries
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 3 of 3 passed
  • Description check: Passed (check skipped because CodeRabbit's high-level summary is enabled)
  • Title check: Passed (the title accurately and concisely summarizes the main change: adding a tool failure evaluator supporting both IntermediateStep and ATIF formats)
  • Docstring coverage: Passed (coverage is 85.45%, above the required 80.00% threshold)



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (1)
packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py (1)

20-20: Use a dict or stub in the test instead of importing ToolMessage to reduce test coupling.

atif_converter.py only accesses status and content attributes, so a simple dict will exercise the same code path. Note that langchain-core is not explicitly declared in packages/nvidia_nat_core/pyproject.toml despite being imported throughout the source code (via transitive dependencies). The test should avoid hardwiring this dependency; use a minimal test object instead.
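A sketch of the suggested decoupling. `FakeToolMessage` is a hypothetical name; the stand-in only needs the two attributes the converter actually reads:

```python
from dataclasses import dataclass


@dataclass
class FakeToolMessage:
    """Minimal test double exposing only what the converter reads."""
    content: str
    status: str = "error"


# Any code that does getattr(output, "status") / getattr(output, "content")
# behaves the same as with a real langchain_core ToolMessage.
msg = FakeToolMessage(content="TimeoutError: upstream timed out")
```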

Additional actionable comment:

In packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py, lines 316-318: replace the temporary list comprehension in _get_tool_end with a generator expression passed to next, i.e. next(s for s in stats if s.event_type == IntermediateStepType.TOOL_END), or next((...), None) to return None when no match exists. This still returns the first matching IntermediateStepPayload without building an intermediate list, and satisfies Ruff.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bdb0bc8d-0b5f-4330-a44a-4660a278e89e

📥 Commits

Reviewing files that changed from the base of the PR and between 7629460 and 13dc6bc.

📒 Files selected for processing (10)
  • packages/nvidia_nat_core/src/nat/utils/atif_converter.py
  • packages/nvidia_nat_core/tests/nat/utils/test_atif_converter.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/register.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/__init__.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/models.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/register.py
  • packages/nvidia_nat_eval/tests/eval/evaluator/test_tool_failure_evaluator.py
  • packages/nvidia_nat_langchain/src/nat/plugins/langchain/callback_handler.py
  • packages/nvidia_nat_langchain/tests/test_langchain_callback_handler.py

Comment on lines +104 to +106
def _extract_tool_error(output: Any) -> dict[str, str] | None:
    """Extract error metadata from a tool output for ``step.extra["tool_errors"]``."""
    # TODO: return a model instead of a plain dict once ATIF spec adds error support

⚠️ Potential issue | 🟡 Minor

Remove the TODO from this helper comment.

Vale rejects TODO in source comments, so this will keep tripping docs/comment lint until it is reworded or tracked externally.

♻️ Suggested edit
-    # TODO: return a model instead of a plain dict once ATIF spec adds error support
+    # Keep this as a plain dict until the ATIF spec grows typed error fields.

As per coding guidelines "Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like 'lorem ipsum'."


Comment on lines +369 to +384
tool_error: dict[str, str] | None = _extract_tool_error(raw_output)

if tool_error is not None:
    tool_error["tool"] = tool_name
extra: dict[str, Any] | None = ({"tool_errors": [tool_error]} if tool_error else None)

if pending is not None:
    pending.tool_calls.append(tc)
    pending.observations.append(obs)
    if tool_error:
        pending.extra.setdefault("tool_errors", []).append(tool_error)
    pending.tool_ancestry.append(_atif_ancestry_from_ist(ist))
else:
    extra = _atif_step_extra_model_from_ist(ist).model_dump(exclude_none=True)
    if tool_error:
        extra.setdefault("tool_errors", []).append(tool_error)

⚠️ Potential issue | 🟠 Major

tool_errors needs a per-call identifier, not just the tool name.

Cross-file, packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py currently matches step.extra["tool_errors"] by tool. If a single agent step contains two calls to the same function and only one fails, both iterations will hit the same metadata entry and both calls will be counted as failures. Please persist the generated call_id/source_call_id with each error record here and have the evaluator match on that stable ID first. A regression test with same-name tools and only one failure would pin this down.

🧩 Partial converter-side fix
-                if tool_error is not None:
-                    tool_error["tool"] = tool_name
+                if tool_error is not None:
+                    tool_error["tool"] = tool_name
+                    tool_error["tool_call_id"] = call_id

A companion change is still needed in packages/nvidia_nat_eval/src/nat/plugins/eval/tool_failure_evaluator/evaluator.py so the ATIF lane consumes tool_call_id instead of keying solely on tool.

Also applies to: 555-568

Comment on lines +101 to +103
for step in steps:
    if not step.tool_calls or step.source != "agent":
        continue

⚠️ Potential issue | 🟠 Major

Do not drop orphan tool_errors.

Line 102 skips any agent step whose tool_calls array is empty, but this PR now carries orphan tool failures through step.extra["tool_errors"]. Those failures are invisible here, so a sample can return a perfect score even though it already contains recorded tool errors. Please count unmatched tool_errors entries as failed calls and add a regression for a step that has extra["tool_errors"] but no tool_calls.
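A sketch of how orphan errors could be folded into the totals. The step shapes and the `count_calls` helper are hypothetical, not the evaluator's actual code:

```python
from typing import Any


def count_calls(steps: list[dict[str, Any]]) -> tuple[int, int]:
    """Return (total_calls, failed_calls), counting orphan tool_errors
    entries that have no matching tool_calls as failed calls."""
    total = failed = 0
    for step in steps:
        if step.get("source") != "agent":
            continue
        tool_calls = step.get("tool_calls") or []
        tool_errors = (step.get("extra") or {}).get("tool_errors", [])
        if tool_calls:
            total += len(tool_calls)
            failed += sum(1 for tc in tool_calls if tc.get("failed"))
        elif tool_errors:
            # Orphan errors: no recorded calls, but failures were captured.
            total += len(tool_errors)
            failed += len(tool_errors)
    return total, failed


steps = [
    {"source": "agent", "tool_calls": [{"failed": True}, {"failed": False}], "extra": {}},
    {"source": "agent", "tool_calls": [], "extra": {"tool_errors": [{"tool": "search"}]}},
    {"source": "llm", "tool_calls": [{"failed": True}]},  # non-agent step is skipped
]
```

With the early `continue` in place, the second step above would contribute nothing and the score would overstate success.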


Comment on lines +105 to +123
observations: list[ObservationResult] = (step.observation.results if step.observation else [])

for index, atif_tool_call in enumerate(step.tool_calls):
    observation_content: str = ""
    if index < len(observations) and observations[index].content:
        raw_content: str | list[Any] | None = observations[index].content
        observation_content = raw_content if isinstance(raw_content, str) else str(raw_content)

    is_error: bool = False
    error_content: str = ""

    # Check step.extra["tool_errors"] for structured error metadata
    extra_errors: list[dict[str, Any]] = (step.extra or {}).get("tool_errors", [])
    matching_extra: dict[str, Any] | None = None
    for tool_error_entry in extra_errors:
        if tool_error_entry.get("tool") == atif_tool_call.function_name:
            is_error = True
            matching_extra = tool_error_entry
            break

⚠️ Potential issue | 🟠 Major

Correlate ATIF failures per call, not per tool name.

Line 107 still pairs observations by position, and Lines 119-123 reuse the first tool_errors entry whose tool matches function_name. If one step invokes the same tool twice, or results arrive out of order, the same failure can be applied to multiple calls and both failed_tool_calls and per_tool_summary become wrong. Please switch this to call-level correlation using tool_call_id / ObservationResult.source_call_id from packages/nvidia_nat_core/src/nat/data_models/atif/observation_result.py lines 27-46, and ensure each structured error entry is consumed at most once.


Comment on lines +39 to +42
    attempts: list[_ToolCall] = Field(
        default_factory=list,
        description="Ordered list of every call to this tool.",
    )

⚠️ Potential issue | 🟡 Minor

attempts is documented as full history, but the evaluator only emits failed calls.

ToolFailureEvaluator currently builds attempts=[a for a in attempts if a.error is not None], so these descriptions overstate what downstream consumers will actually receive. Either include successful attempts as well, or tighten the schema/docs to say this list only contains failed attempts.

📝 Suggested schema doc update
     attempts: list[_ToolCall] = Field(
         default_factory=list,
-        description="Ordered list of every call to this tool.",
+        description="Ordered list of failed calls to this tool.",
     )
@@
     per_tool_summary: list[ToolSummary] = Field(
         default_factory=list,
-        description="Per-tool health summary with attempt history.",
+        description="Per-tool health summary with failed attempt history.",
     )

As per coding guidelines "Keep documentation in sync with code; the documentation pipeline will fail on Sphinx errors or broken links."

Also applies to: 52-55


Comment on lines +30 to +34
    max_concurrency: int = Field(default=8, description="Max concurrency for evaluation.")
    enable_atif_evaluator: bool = Field(
        default=False,
        description="Enable ATIF-native evaluation lane alongside the legacy lane.",
    )

⚠️ Potential issue | 🟡 Minor

Validate max_concurrency at the config boundary.

This is user-supplied eval/CLI input, but 0 and negative values are accepted today and only fail later when the evaluator is scheduled. Add a positive bound on the Pydantic field so bad configs are rejected up front.

🛡️ Suggested change
-    max_concurrency: int = Field(default=8, description="Max concurrency for evaluation.")
+    max_concurrency: int = Field(default=8, ge=1, description="Max concurrency for evaluation.")

As per coding guidelines "Validate and sanitise all user input, especially in web or CLI interfaces."

