
feature: nvidia-nat-benchmarks plugin #1771

Open
bbednarski9 wants to merge 10 commits into NVIDIA:develop from bbednarski9:bb/nat-benchmark-subpackage

Conversation


@bbednarski9 bbednarski9 commented Mar 8, 2026

Description

This PR adds a new NAT "benchmarks" plugin (installed with `nvidia-nat-benchmarks[tooltalk,bfcl,byob,agent-leaderboard]`). The plugin extends the optional `nvidia-nat-eval` plugin: it manages integrations with NeMo Evaluator and HuggingFace benchmarks for academic/public agentic performance benchmarks. Functionality is organized along the following three axes of agentic benchmarking:

  • Axis 1: Schema Injection: How Tool Definitions Reach the Model
  • Axis 2: Execution Model: What Happens When the Model Calls a Tool
  • Axis 3: Turn Structure: How Many LLM Calls per Evaluation Item

Summary

  • Bridges four tool-calling benchmarks (ToolTalk, BFCL, Agent Leaderboard v2, BYOB) into NAT's `nat eval` pipeline with full trajectory capture, token usage, and latency breakdowns
  • Provides example packages under examples/benchmarks/ with READMEs, configs, and installation instructions for each benchmark
  • Adds shuffle and shuffle_seed fields to EvalGeneralConfig for randomizing evaluation order
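
To make the new fields concrete, here is a hypothetical eval config fragment — the `shuffle`/`shuffle_seed` field names come from this PR, while the surrounding keys are illustrative:

```yaml
eval:
  general:
    shuffle: true       # randomize evaluation item order
    shuffle_seed: 42    # optional: fix the seed for a reproducible order
```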

Motivation

NAT users need to evaluate their agent workflows against standard function-calling benchmarks, but NeMo Evaluator's endpoint-centric integration model treats NAT as a black-box LLM — discarding the multi-step workflow traces, tool call sequences, and intermediate-step observability that make NAT evaluations valuable. This package extracts the portable parts from each benchmark (datasets and scoring functions) and runs them inside NAT's own evaluation pipeline.

What's included

New package: packages/nvidia_nat_benchmarks/

Four benchmark integrations, each providing a dataset adapter, one or more workflows, and an evaluator:

| Benchmark | Workflows | Evaluator | Metrics |
| --- | --- | --- | --- |
| ToolTalk | Multi-turn Native FC + ToolExecutor simulation | `tooltalk_evaluator` | recall, action_precision, bad_action_rate, success, soft_success |
| BFCL | AST prompting, Native FC, ReAct with ToolIntentBuffer | `bfcl_evaluator` | per-item accuracy via `ast_checker` |
| Agent Leaderboard v2 | Multi-turn Native FC + stub tools from dataset | `agent_leaderboard_tsq` | Tool Selection Quality F1 (precision, recall over tool name sets) |
| BYOB | Any NAT workflow | `byob_evaluator` | Configurable via NeMo Evaluator's scorer library (exact_match, f1_token, bleu, rouge, etc.) |
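
For intuition, the Tool Selection Quality metric can be sketched as set-based F1 over tool names. This is an illustrative sketch, not the actual `agent_leaderboard_tsq` implementation — the function name and edge-case handling are assumptions:

```python
def tool_selection_f1(expected: set[str], predicted: set[str]) -> float:
    """Set-based F1 over tool names (illustrative sketch only)."""
    if not expected and not predicted:
        return 1.0  # nothing expected, nothing called: treat as perfect
    true_positives = len(expected & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the agent called one correct tool and one spurious tool
score = tool_selection_f1({"get_balance", "transfer"}, {"get_balance", "weather"})
# → 0.5 (precision 0.5, recall 0.5)
```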

New examples: examples/benchmarks/{tooltalk,bfcl,agent_leaderboard,byob}/

Core NAT changes:

  • `EvalGeneralConfig`: Added `shuffle: bool` and `shuffle_seed: int | None` fields
  • `DatasetHandler`: Accepts `shuffle`/`shuffle_seed` and applies `df.sample(frac=1)` after dedup
  • `evaluate.py`: Passes shuffle config to `DatasetHandler`
  • `evaluate.md`: Documents the shuffle feature

Dependencies

```text
nvidia-nat-benchmarks
├── [tooltalk]           → nvidia-tooltalk
├── [bfcl]               → tree-sitter~=0.21.0 (nvidia-bfcl installed separately with --no-deps)
├── [byob]               → nemo-evaluator
├── [agent-leaderboard]  → datasets~=4.4
└── [all]                → all of the above
```

Closes

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Added Agent Leaderboard v2, BFCL, BYOB, and ToolTalk benchmark evaluation workflows with runnable configs and commands.
    • Optional evaluation-item shuffling with reproducible seed control (shuffle & shuffle_seed).
    • Two-pass "resume where you left off" workflow and example commands to resume interrupted runs.
  • Documentation

    • Comprehensive READMEs and example configs for each benchmark detailing setup, usage, metrics, and result interpretation.

Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
@bbednarski9 bbednarski9 requested a review from a team as a code owner March 8, 2026 23:13
@bbednarski9 bbednarski9 added feature request New feature or request non-breaking Non-breaking change labels Mar 8, 2026

coderabbitai bot commented Mar 8, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds four benchmark suites (Agent Leaderboard v2, BFCL, BYOB, ToolTalk) with dataset loaders, workflows, evaluators, configs, docs, and tests; and adds optional evaluation-item shuffling (with reproducible seed) to the eval pipeline.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Documentation**<br>`docs/source/improve-workflows/evaluate.md`, `examples/benchmarks/agent_leaderboard/README.md`, `examples/benchmarks/bfcl/README.md`, `examples/benchmarks/byob/README.md`, `examples/benchmarks/tooltalk/README.md` | Added/expanded docs: shuffle evaluation order (`eval.general.shuffle`, `shuffle_seed`), resume workflow guidance, and comprehensive benchmark usage guides. |
| **Agent Leaderboard**<br>`examples/benchmarks/agent_leaderboard/pyproject.toml`, `examples/benchmarks/agent_leaderboard/src/.../configs/eval_*.yml`, `packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/{__init__.py,config.py,dataset.py,evaluator.py,workflow.py}`, `packages/nvidia_nat_benchmarks/tests/agent_leaderboard/*` | New multi-domain benchmark: dataset loader (HuggingFace + local), AgentLeaderboard configs/workflow, TSQ evaluator (tool-selection F1), registrations, and comprehensive unit tests. |
| **BFCL**<br>`examples/benchmarks/bfcl/pyproject.toml`, `examples/benchmarks/bfcl/src/.../configs/eval_*.yml`, `packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/{__init__.py,config.py,dataset.py,evaluator.py,tool_intent_stubs.py,workflow_ast.py,workflow_fc.py,workflow_react.py}`, `packages/nvidia_nat_benchmarks/tests/bfcl/*` | Added BFCL v3 benchmark suite with AST/FC/ReAct workflows, dataset loader, AST-based evaluator, tool-intent stubs, helpers, and extensive tests (unit, integration, regression). |
| **BYOB**<br>`examples/benchmarks/byob/pyproject.toml`, `examples/benchmarks/byob/src/.../configs/eval_*.yml`, `packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/{__init__.py,config.py,dataset.py,evaluator.py}`, `packages/nvidia_nat_benchmarks/tests/byob/*` | Bring-Your-Own-Benchmark integration: config, dataset loader, in-process evaluator adapter that invokes BYOB scorers, and tests + sample benchmark fixture. |
| **ToolTalk**<br>`examples/benchmarks/tooltalk/pyproject.toml`, `examples/benchmarks/tooltalk/src/.../configs/eval_*.yml`, `packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/tooltalk/{__init__.py,config.py,dataset.py,evaluator.py,workflow.py}`, `packages/nvidia_nat_benchmarks/tests/tooltalk/*` | ToolTalk benchmark: dataset loader, workflow that binds tools and replays conversations, evaluator computing recall/precision/success metrics, API exposure modes, and integration/regression tests. |
| **Common benchmark utilities**<br>`packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/{register.py,common/{bfcl_helpers.py,eval_helpers.py,tool_intent_stubs.py}}`, `packages/nvidia_nat_benchmarks/tests/common/*` | Shared helpers: BFCL schema conversion, generic evaluator loop (`run_evaluator_loop`), tool-intent stub system (scenario isolation, stub creation), and tests validating helpers and config enums. |
| **Eval pipeline: shuffling**<br>`packages/nvidia_nat_core/src/nat/data_models/evaluate_config.py`, `packages/nvidia_nat_eval/src/nat/plugins/eval/dataset_handler/dataset_handler.py`, `packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py` | Added `EvalGeneralConfig.shuffle` and `shuffle_seed`; DatasetHandler accepts shuffle and seed and shuffles DataFrame when enabled; evaluation runtime passes config flags into DatasetHandler. |
| **Project & packaging**<br>`packages/nvidia_nat_benchmarks/pyproject.toml`, `pyproject.toml`, `.pre-commit-config.yaml`, `ci/vale/styles/config/vocabularies/nat/accept.txt` | Added nvidia-nat-benchmarks package manifest, added benchmark example entries to root pyproject, uv-lock bench hook, and accepted vocabulary terms. |
| **Tests & CI scaffolding**<br>`packages/nvidia_nat_benchmarks/tests/**/*`, `packages/nvidia_nat_core/tests/nat/builder/test_per_user_function_group_eval_isolation.py` | Comprehensive test additions across benchmarks and a new per-user function-group eval isolation test. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as Eval Runner
    participant Dataset as DatasetHandler
    participant Workflow as Workflow (LLM)
    participant Tools as Tool Stubs / DB
    participant Evaluator as Evaluator
    participant Results as Output Files

    Runner->>Dataset: load dataset (shuffle?, seed)
    Dataset-->>Runner: dataframe rows
    Runner->>Workflow: run per-item input
    Workflow->>Tools: bind tool schemas / invoke stubs
    Tools-->>Workflow: canned responses / DB lookups
    Workflow-->>Runner: workflow_output (predictions)
    Runner->>Evaluator: evaluate(workflow_output, ground_truth)
    Evaluator-->>Runner: per-item scores + reasoning
    Runner->>Results: write workflow_output.json,<br>evaluator output.json
```

Note: rectangles use default colors; components represent key actors in evaluation flow.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py (1)

677-684: ⚠️ Potential issue | 🟠 Major

Record shuffle and shuffle_seed in config_metadata.json.

These flags now change dataset order, but write_configuration() still omits them from the saved run metadata. Two evaluations with different item orderings will therefore look identical after the fact, which makes benchmark results much harder to reproduce or compare.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py` around
lines 677 - 684, The run metadata currently omits the dataset ordering options;
update the code that builds/writes config_metadata.json (the
write_configuration() path that saves run metadata) to include the eval flags
from eval_config.general.shuffle and eval_config.general.shuffle_seed (store
them as keys "shuffle" and "shuffle_seed" or similar) so the DatasetHandler
parameters (see the DatasetHandler(...) instantiation using shuffle and
shuffle_seed) are persisted; ensure these values are read from
self.eval_config.general when assembling the metadata object before writing to
config_metadata.json.
🟡 Minor comments (18)
packages/nvidia_nat_core/tests/nat/builder/test_per_user_function_group_eval_isolation.py-175-176 (1)

175-176: ⚠️ Potential issue | 🟡 Minor

Make SessionManager.shutdown() exception-safe.

Cleanup only runs on the happy path right now. If any session() or ainvoke() call raises before the explicit shutdown, the cleanup task and per-user builders can survive into later tests. A try/finally or yielding fixture would make teardown deterministic.

As per coding guidelines, "Any frequently repeated code should be extracted into pytest fixtures."

Also applies to: 194-194, 213-214, 226-226, 246-247, 271-271

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_core/tests/nat/builder/test_per_user_function_group_eval_isolation.py`
around lines 175 - 176, Tests leak per-user builders when exceptions occur
because cleanup only runs on the happy path; make teardown deterministic by
either (A) making SessionManager.shutdown() exception-safe (catch and log any
exceptions and ensure all cleanup steps still run inside
SessionManager.shutdown) — update the shutdown implementation to wrap each
cleanup step (e.g., closing per-user builders and cancelling cleanup tasks) in
try/except so one failing step won't abort the rest — or (B) refactor the
repeated setup/teardown into a pytest fixture that yields the
SessionManager/WorkflowBuilder and performs shutdown in a finally block after
yield; use the existing symbols SessionManager.create, SessionManager.shutdown,
session(), ainvoke(), and WorkflowBuilder.from_config to locate the code to
change.
packages/nvidia_nat_core/tests/nat/builder/test_per_user_function_group_eval_isolation.py-218-224 (1)

218-224: ⚠️ Potential issue | 🟡 Minor

Cross a session boundary in the "same user_id" case.

This only proves state is shared within one session_manager.session(...) block. If SessionManager rebuilt the per-user builder on every session() enter, this test would still pass. Re-open the same user_id in a second session and assert both count and instance_id carry over.

🔁 Suggested assertion shape
             async with session_manager.session(user_id=user_id) as session:
                 fns = await session.workflow.function_groups["counter"].get_accessible_functions()
                 increment = fns["counter__increment"]
-                get_count = fns["counter__get_count"]
-
-                await increment.ainvoke(IncrementInput(n=5))
-                result = await get_count.ainvoke(IncrementInput(n=0))
+                first = await increment.ainvoke(IncrementInput(n=5))
+
+            async with session_manager.session(user_id=user_id) as session:
+                fns = await session.workflow.function_groups["counter"].get_accessible_functions()
+                get_count = fns["counter__get_count"]
+                result = await get_count.ainvoke(IncrementInput(n=0))
 
             await session_manager.shutdown()
 
         # Within the same session, state accumulates
         assert result.count == 5, f"Expected count 5, got {result.count}"
+        assert result.instance_id == first.instance_id
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_core/tests/nat/builder/test_per_user_function_group_eval_isolation.py`
around lines 218 - 224, The test only verifies state within a single session
context; update it to reopen a new session for the same user_id using
session_manager.session(user_id=user_id) a second time and re-resolve the
counter function group (function_groups["counter"]) to fetch increment and
get_count again, then invoke get_count (and/or increment if desired) with
IncrementInput and assert that both the numeric count and the instance
identifier (e.g., instance_id returned by the counter) are the same as the
values observed in the first session, proving state and instance_id persist
across separate session() entries for the same user_id.
examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_banking.yml-17-17 (1)

17-17: ⚠️ Potential issue | 🟡 Minor

Incorrect path in usage comment.

The usage comment references examples/benchmarks/agent_leaderboard/configs/eval_banking.yml, but this file is located at examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_banking.yml.

Proposed fix
-# Usage: nat eval --config_file examples/benchmarks/agent_leaderboard/configs/eval_banking.yml
+# Usage: nat eval --config_file examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_banking.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_banking.yml`
at line 17, The usage comment currently points to the wrong config path; update
the comment line that reads "Usage: nat eval --config_file
examples/benchmarks/agent_leaderboard/configs/eval_banking.yml" to the correct
relative path
"examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_banking.yml"
so the usage example matches the file's actual location.
packages/nvidia_nat_benchmarks/src/nat/meta/pypi.md-21-21 (1)

21-21: ⚠️ Potential issue | 🟡 Minor

Incomplete Markdown link syntax.

The text [`NeMo Agent Toolkit Benchmarks`] uses reference-style link syntax but lacks a corresponding link definition. This will render as plain text with brackets rather than a hyperlink.

Consider either:

  • Converting to an inline link: [NeMo Agent Toolkit Benchmarks](URL_HERE)
  • Or removing the brackets if no link is intended: NeMo Agent Toolkit Benchmarks
Proposed fix (assuming no link intended)
-This is a subpackage for [`NeMo Agent Toolkit Benchmarks`], which integrates key agentic workflow evaluations from [NeMo Evaluator](https://github.com/NVIDIA-NeMo/Evaluator/tree/main) and [HuggingFace datasets](https://huggingface.co/datasets) into the Toolkit for rapid benchmark evaluation.
+This is a subpackage for NeMo Agent Toolkit Benchmarks, which integrates key agentic workflow evaluations from [NeMo Evaluator](https://github.com/NVIDIA-NeMo/Evaluator/tree/main) and [HuggingFace datasets](https://huggingface.co/datasets) into the Toolkit for rapid benchmark evaluation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_benchmarks/src/nat/meta/pypi.md` at line 21, The Markdown
uses a reference-style link token [`NeMo Agent Toolkit Benchmarks`] without a
matching link definition; update the text in the nat/meta/pypi.md line
containing that token by either converting it to an inline link (e.g., replace
[`NeMo Agent Toolkit Benchmarks`] with an inline form and add the target URL) or
remove the square-bracket markers to render plain text (replace [`NeMo Agent
Toolkit Benchmarks`] with NeMo Agent Toolkit Benchmarks); ensure the replacement
preserves surrounding punctuation and spacing.
examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/configs/eval_easy.yml-17-17 (1)

17-17: ⚠️ Potential issue | 🟡 Minor

Incorrect path in usage comment.

The usage comment references examples/benchmarks/tooltalk/configs/eval_easy.yml, but this file is located at examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/configs/eval_easy.yml.

Proposed fix
-# Usage: nat eval --config_file examples/benchmarks/tooltalk/configs/eval_easy.yml
+# Usage: nat eval --config_file examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/configs/eval_easy.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/configs/eval_easy.yml`
at line 17, The usage comment at the top of eval_easy.yml contains an incorrect
repository path; update the comment string to reference this file's actual
repository-relative location by inserting the missing src directory segment (so
the usage comment matches where eval_easy.yml actually lives) — edit the comment
line in eval_easy.yml that starts with "Usage:" to the corrected path.
examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/eval_react_simple.yml-18-18 (1)

18-18: ⚠️ Potential issue | 🟡 Minor

Fix the usage path in the comment.

The usage comment references examples/benchmarks/bfcl/configs/ but this file is located at examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/. Update the path to match the actual file location.

📝 Proposed fix
-# Usage: nat eval --config_file examples/benchmarks/bfcl/configs/eval_react_simple.yml
+# Usage: nat eval --config_file examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/eval_react_simple.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/eval_react_simple.yml`
at line 18, Update the usage comment at the top of eval_react_simple.yml to
point to the actual file location: replace the incorrect path segment
"examples/benchmarks/bfcl/configs/" with the correct
"examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/" so the usage line in
eval_react_simple.yml accurately reflects the file's real path.
examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_all_domains.yml-17-17 (1)

17-17: ⚠️ Potential issue | 🟡 Minor

Fix the usage path in the comment.

The usage comment references examples/benchmarks/agent_leaderboard/configs/ but this file is located at examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/. Update the path to match the actual file location.

📝 Proposed fix
-# Usage: nat eval --config_file examples/benchmarks/agent_leaderboard/configs/eval_all_domains.yml
+# Usage: nat eval --config_file examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_all_domains.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_all_domains.yml`
at line 17, The usage comment at the top of eval_all_domains.yml points to the
wrong directory; open eval_all_domains.yml and update the "# Usage: nat eval
..." comment to reference the file's actual location under the
src/nat_benchmark_agent_leaderboard/configs directory (so the path includes
src/nat_benchmark_agent_leaderboard/configs/eval_all_domains.yml) so the example
command shows the correct config file path.
examples/benchmarks/agent_leaderboard/README.md-72-85 (1)

72-85: ⚠️ Potential issue | 🟡 Minor

Add language specifiers to fenced code blocks.

Several code blocks are missing language specifiers, which impacts rendering and accessibility. The static analysis tool flagged lines 82 and 130.

📝 Proposed fixes

For the output block at line 73-79, add text or console:

 **Expected output:**
-```
+```text
 INFO - Loading agent leaderboard v2 dataset from Hugging Face...

For the blank line comment block around line 130:

-```
+```bash
 unset AGENT_LEADERBOARD_LIMIT

Also applies to: 127-131

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/agent_leaderboard/README.md` around lines 72 - 85, Update
the README fenced code blocks to include language specifiers: tag the expected
output block (the multiline INFO output) with a plain text specifier such as
"text" or "console", and tag the shell snippets that set/unset environment
variables (the export AGENT_LEADERBOARD_DATA line and the unset
AGENT_LEADERBOARD_LIMIT line) with "bash" so they render correctly; locate the
blocks by searching for the INFO output and the AGENT_LEADERBOARD_DATA /
AGENT_LEADERBOARD_LIMIT examples in README.md and add the appropriate language
labels to the opening triple-backticks.
examples/benchmarks/agent_leaderboard/README.md-22-22 (1)

22-22: ⚠️ Potential issue | 🟡 Minor

Spell out "NeMo Agent Toolkit" instead of "NAT".

Per documentation guidelines, avoid using "NAT" as an acronym in Markdown files. Spell it out as "NeMo Agent Toolkit" unless referring to code identifiers in backticks.

📝 Proposed fix
-Evaluate NAT agent workflows against the [Galileo Agent Leaderboard v2](https://huggingface.co/datasets/galileo-ai/agent-leaderboard-v2) benchmark. This benchmark tests whether an agent can select the correct tools for real-world use cases across multiple domains.
+Evaluate NeMo Agent Toolkit agent workflows against the [Galileo Agent Leaderboard v2](https://huggingface.co/datasets/galileo-ai/agent-leaderboard-v2) benchmark. This benchmark tests whether an agent can select the correct tools for real-world use cases across multiple domains.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/agent_leaderboard/README.md` at line 22, Replace the
acronym "NAT" with the full name "NeMo Agent Toolkit" in the README sentence
that currently reads "Evaluate NAT agent workflows..." (leave any backticked
code identifiers unchanged); update the sentence to "Evaluate NeMo Agent Toolkit
agent workflows against the Galileo Agent Leaderboard v2 benchmark" so the
acronym is spelled out per documentation guidelines.
examples/benchmarks/agent_leaderboard/README.md-138-145 (1)

138-145: ⚠️ Potential issue | 🟡 Minor

Documentation style issues.

  1. Line 138: Uses "NAT's eval config" — avoid possessive "'s" with inanimate objects and spell out "NeMo Agent Toolkit"
  2. Line 145: Uses "scenario's user goals" — rephrase to avoid possessive with inanimate object
📝 Proposed fixes
-This example uses the **Tool Selection Quality (TSQ)** evaluator (`_type: agent_leaderboard_tsq` in the eval config). It compares the tool calls the agent made (captured by the workflow via `ToolIntentBuffer`) against the expected tool calls derived from the scenario's user goals.
+This example uses the **Tool Selection Quality (TSQ)** evaluator (`_type: agent_leaderboard_tsq` in the eval config). It compares the tool calls the agent made (captured by the workflow via `ToolIntentBuffer`) against the expected tool calls derived from the user goals of each scenario.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/agent_leaderboard/README.md` around lines 138 - 145,
Replace possessive phrasing: change "NAT's eval config" to "NeMo Agent Toolkit
eval config" (reference: Tool Selection Quality (TSQ) evaluator `_type:
agent_leaderboard_tsq`) and rephrase "scenario's user goals" to "user goals in
the scenario" (reference: how the workflow captures tool calls via
`ToolIntentBuffer`) so the README avoids possessive "'s" with inanimate objects.
packages/nvidia_nat_benchmarks/configs/agent_leaderboard_eval.yaml-1-11 (1)

1-11: ⚠️ Potential issue | 🟡 Minor

Missing Apache 2.0 license header.

This YAML config file is missing the SPDX license header that's present in other config files (e.g., eval_ast_simple.yml). Per coding guidelines, all code should include the Apache 2.0 header.

📝 Proposed fix - add at the beginning of the file
+# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # Agent Leaderboard v2 evaluation config for NAT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_benchmarks/configs/agent_leaderboard_eval.yaml` around
lines 1 - 11, Add the Apache-2.0 SPDX license header to the top of the YAML
config file (agent_leaderboard_eval.yaml) to match other configs like
eval_ast_simple.yml; insert a single-line SPDX identifier (e.g.,
"SPDX-License-Identifier: Apache-2.0") followed by the standard Apache 2.0
copyright block used across the repo so the file has the same license header
format as other config files.
packages/nvidia_nat_benchmarks/tests/agent_leaderboard/test_evaluator.py-24-43 (1)

24-43: ⚠️ Potential issue | 🟡 Minor

Incorrect return type annotation.

The function signature declares tuple[EvalInputItem, list] as the return type, but the function only returns item (an EvalInputItem). Additionally, type hints are missing on the parameters.

🐛 Proposed fix
-def _make_entry(expected_tools: list[str], predicted_tools: list[str]) -> tuple[EvalInputItem, list]:
+def _make_entry(expected_tools: list[str], predicted_tools: list[str]) -> EvalInputItem:
     """Create a test item with expected and predicted tool calls."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_benchmarks/tests/agent_leaderboard/test_evaluator.py`
around lines 24 - 43, The _make_entry function's signature incorrectly declares
a return type of tuple[EvalInputItem, list] and lacks parameter type hints;
update the signature of _make_entry(expected_tools: list[str], predicted_tools:
list[str]) -> EvalInputItem (or use Sequence[str] if preferred) to reflect that
it returns a single EvalInputItem, and ensure the parameter annotations are
added so expected_tools and predicted_tools are typed as lists of strings; no
other changes to the function body are required.
examples/benchmarks/byob/README.md-137-149 (1)

137-149: ⚠️ Potential issue | 🟡 Minor

Add language specifiers to fenced code blocks.

Static analysis flags several code blocks without language specifiers (MD040). For expected output blocks, use text or console.

📝 Example fix for expected output block
 **Expected output:**
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_exact_match.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/byob/README.md` around lines 137 - 149, The fenced
"Expected output" code blocks lack language specifiers (MD040); update the
triple-backtick fences in the README's "Expected output" section to include a
language specifier such as text or console (e.g., ```text) so static analysis no
longer flags the code blocks; search for the "Expected output:" block and
replace the opening backticks with a specifier for each expected-output fenced
block.
examples/benchmarks/byob/README.md-22-32 (1)

22-32: ⚠️ Potential issue | 🟡 Minor

Replace "NAT" with "NeMo Agent Toolkit" per coding guidelines.

The coding guidelines specify: "Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit." Multiple occurrences on these lines use "NAT" without backticks.

📝 Example fixes
-Run [NeMo Evaluator BYOB](https://docs.nvidia.com/nemo/evaluator/latest/) benchmarks directly on NAT agent workflows — without re-implementing the dataset loader or scorer logic.
+Run [NeMo Evaluator BYOB](https://docs.nvidia.com/nemo/evaluator/latest/) benchmarks directly on NeMo Agent Toolkit workflows — without re-implementing the dataset loader or scorer logic.
-**This integration reuses the benchmark definition, dataset loading, and scorer functions from NeMo Evaluator, but replaces the model-calling step with NAT's workflow execution.** This means:
+**This integration reuses the benchmark definition, dataset loading, and scorer functions from NeMo Evaluator, but replaces the model-calling step with NeMo Agent Toolkit workflow execution.** This means:
-- **NAT handles the agent execution** — instead of calling a model endpoint, NAT runs its own workflow
+- **NeMo Agent Toolkit handles the agent execution** — instead of calling a model endpoint, NeMo Agent Toolkit runs its own workflow
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/byob/README.md` around lines 22 - 32, Replace all
occurrences of the acronym "NAT" in the README paragraph with the full term
"NeMo Agent Toolkit" (no backticks), e.g. change "NAT agent workflows", "NAT
handles the agent execution", and "NAT handles all LLM calls upstream" to "NeMo
Agent Toolkit agent workflows", "NeMo Agent Toolkit handles the agent
execution", and "NeMo Agent Toolkit handles all LLM calls upstream"
respectively; ensure there are no leftover "NAT" usages or inline code
formatting for these phrases.
packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/dataset.py-165-171 (1)

165-171: ⚠️ Potential issue | 🟡 Minor

Type mismatch: config.domains is list[AgentLeaderboardDomain], not list[str].

The lambda passes domains=config.domains directly, but load_agent_leaderboard_dataset expects domains: list[str] | None. While StrEnum values may work due to string inheritance, explicit conversion would be safer and consistent with the parser() method in AgentLeaderboardDatasetConfig (line 67 of config.py) which does [str(d) for d in self.domains].

🔧 Proposed fix
     yield DatasetLoaderInfo(
         config=config,
-        load_fn=lambda fp, **kw: load_agent_leaderboard_dataset(fp, domains=config.domains, limit=config.limit, **kw),
+        load_fn=lambda fp, **kw: load_agent_leaderboard_dataset(
+            fp, domains=[str(d) for d in config.domains], limit=config.limit, **kw
+        ),
         description="Galileo Agent Leaderboard v2 dataset loader",
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/dataset.py`
around lines 165 - 171, register_agent_leaderboard_dataset_loader passes
config.domains (type list[AgentLeaderboardDomain]) directly to
load_agent_leaderboard_dataset which expects list[str] | None; change the lambda
to convert domains to strings (e.g., use [str(d) for d in config.domains] or
None when config.domains is falsy) so the call to load_agent_leaderboard_dataset
receives list[str] consistent with AgentLeaderboardDatasetConfig.parser()
behavior.
packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/evaluator.py-40-46 (1)

40-46: ⚠️ Potential issue | 🟡 Minor

Unused target_field parameter - potential missing functionality.

The target_field parameter is passed from bench.target_field but never used in the function body. This may indicate intended functionality that wasn't implemented (e.g., extracting the target from a specific field in metadata).

🔧 If target_field should specify where to extract target from metadata
     # Get target value
-    target = item.expected_output_obj
+    target = metadata.get(target_field, item.expected_output_obj) if target_field else item.expected_output_obj
     if isinstance(target, str):

Alternatively, if target_field is intentionally unused, remove it from the signature and the partial() call in byob_evaluator_function.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/evaluator.py`
around lines 40 - 46, The parameter target_field on _evaluate_single is unused;
either implement extracting the target from the provided EvalInputItem using
that field or remove the parameter and its use in
byob_evaluator_function.partial(). If the intent is to read the ground-truth
from item.metadata, update _evaluate_single (the function named
_evaluate_single) to pull target = item.metadata.get(target_field) (or handle
nested keys as your metadata model requires), use that value when calling
scorer_fn/when populating EvalOutputItem, and validate/match types; otherwise
remove target_field from _evaluate_single's signature and remove the
corresponding argument passed in the partial() call inside
byob_evaluator_function so signatures align.
packages/nvidia_nat_benchmarks/tests/bfcl/test_tool_intent_stubs.py-200-200 (1)

200-200: ⚠️ Potential issue | 🟡 Minor

Remove the unused desc unpack target.

This leaves a new RUF059 warning behind. Rename it to _/_desc or stop unpacking the description.

🧹 Minimal fix
-        stub_fn, _, desc = create_tool_stub_function(schema, buf, canned_response="OK")
+        stub_fn, _, _ = create_tool_stub_function(schema, buf, canned_response="OK")

As per coding guidelines, "Use ruff (via 'ruff check --fix') for linting with configuration in pyproject.toml; fix warnings unless explicitly ignored in pyproject.toml."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_benchmarks/tests/bfcl/test_tool_intent_stubs.py` at line
200, The test unpacks three values from create_tool_stub_function into stub_fn,
_, desc but never uses desc causing a RUF059 warning; modify the unpacking in
the test (where create_tool_stub_function(schema, buf, canned_response="OK") is
called) to either drop the description entirely or rename it to a throwaway
variable (e.g., replace desc with _ or _desc) so the unused variable is
suppressed; ensure you only change the unpack target and do not alter
create_tool_stub_function itself.
packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py-173-177 (1)

173-177: ⚠️ Potential issue | 🟡 Minor

Add proper type hints to match the actual return value.

The return annotation tuple[callable, BaseModel | None, str] is incorrect on three counts:

  1. callable is the runtime builtin, not a typing construct; should be Callable from collections.abc
  2. The function returns the PermissiveToolInput class itself, not an instance; should be type[PermissiveToolInput] not BaseModel | None
  3. The first element is an async function, which should be expressed as Callable[[object], Awaitable[str]] to capture the awaitable nature

Line 218 returns tool_stub_fn, PermissiveToolInput, tool_description, where tool_stub_fn is defined at line 198 as async def tool_stub_fn(input_params: object) -> str.

Suggested fix
+from collections.abc import Awaitable, Callable
@@
-) -> tuple[callable, BaseModel | None, str]:
+) -> tuple[Callable[[object], Awaitable[str]], type[PermissiveToolInput], str]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`
around lines 173 - 177, The return type of create_tool_stub_function is wrong;
update its annotation to reflect it returns an async callable (tool_stub_fn),
the class PermissiveToolInput (the type, not an instance), and a str. Change the
signature to return tuple[Callable[[object], Awaitable[str]],
type[PermissiveToolInput], str] (import Callable from collections.abc and
Awaitable from typing if needed) and ensure references to ToolIntentBuffer,
tool_stub_fn and PermissiveToolInput remain unchanged.
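For reference, the corrected annotation pattern (async callable, the class object itself, and a string) can be exercised in isolation. This is a stand-in sketch only — `make_stub` and the plain `PermissiveToolInput` class below are illustrative, not the module's real pydantic-based code:

```python
import asyncio
from collections.abc import Awaitable, Callable


class PermissiveToolInput:  # stand-in for the real pydantic model
    pass


def make_stub(
    description: str,
) -> tuple[Callable[[object], Awaitable[str]], type[PermissiveToolInput], str]:
    """Return (async stub function, the input *class* itself, description)."""

    async def tool_stub_fn(input_params: object) -> str:
        return f"stubbed: {input_params!r}"

    # The class object is returned, not an instance — hence
    # type[PermissiveToolInput] rather than PermissiveToolInput | None.
    return tool_stub_fn, PermissiveToolInput, description


# Usage: the tuple type-checks and the first element is awaitable.
stub_fn, input_cls, description = make_stub("demo")
assert asyncio.run(stub_fn({"x": 1})) == "stubbed: {'x': 1}"
```

Mypy/pyright accept `Callable[[object], Awaitable[str]]` for an `async def` taking one positional argument, which the bare `callable` builtin cannot express.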

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c0dd3aaa-6d1c-46ae-9f1c-7e26ccea2398

📥 Commits

Reviewing files that changed from the base of the PR and between 0aeb76d and 72c78b9.

📒 Files selected for processing (96)
  • docs/source/improve-workflows/evaluate.md
  • examples/benchmarks/agent_leaderboard/README.md
  • examples/benchmarks/agent_leaderboard/configs
  • examples/benchmarks/agent_leaderboard/pyproject.toml
  • examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/__init__.py
  • examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_all_domains.yml
  • examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_banking.yml
  • examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/register.py
  • examples/benchmarks/bfcl/README.md
  • examples/benchmarks/bfcl/configs
  • examples/benchmarks/bfcl/pyproject.toml
  • examples/benchmarks/bfcl/src/nat_benchmark_bfcl/__init__.py
  • examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/eval_ast_simple.yml
  • examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/eval_fc_simple.yml
  • examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/eval_react_simple.yml
  • examples/benchmarks/bfcl/src/nat_benchmark_bfcl/register.py
  • examples/benchmarks/byob/README.md
  • examples/benchmarks/byob/configs
  • examples/benchmarks/byob/pyproject.toml
  • examples/benchmarks/byob/src/nat_benchmark_byob/__init__.py
  • examples/benchmarks/byob/src/nat_benchmark_byob/configs/eval_exact_match.yml
  • examples/benchmarks/byob/src/nat_benchmark_byob/register.py
  • examples/benchmarks/tooltalk/README.md
  • examples/benchmarks/tooltalk/configs
  • examples/benchmarks/tooltalk/pyproject.toml
  • examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/__init__.py
  • examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/configs/eval_easy.yml
  • examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/register.py
  • packages/nvidia_nat_benchmarks/configs/agent_leaderboard_eval.yaml
  • packages/nvidia_nat_benchmarks/configs/bfcl_ast_eval.yaml
  • packages/nvidia_nat_benchmarks/configs/tooltalk_eval.yaml
  • packages/nvidia_nat_benchmarks/pyproject.toml
  • packages/nvidia_nat_benchmarks/src/nat/meta/pypi.md
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/__init__.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/__init__.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/config.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/dataset.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/evaluator.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/workflow.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/__init__.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/config.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/dataset.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/evaluator.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/tool_intent_stubs.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/workflow_ast.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/workflow_fc.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/workflow_react.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/__init__.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/config.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/dataset.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/evaluator.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/__init__.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/bfcl_helpers.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/eval_helpers.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/register.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/tooltalk/__init__.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/tooltalk/config.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/tooltalk/dataset.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/tooltalk/evaluator.py
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/tooltalk/workflow.py
  • packages/nvidia_nat_benchmarks/tests/__init__.py
  • packages/nvidia_nat_benchmarks/tests/agent_leaderboard/__init__.py
  • packages/nvidia_nat_benchmarks/tests/agent_leaderboard/conftest.py
  • packages/nvidia_nat_benchmarks/tests/agent_leaderboard/test_dataset.py
  • packages/nvidia_nat_benchmarks/tests/agent_leaderboard/test_evaluator.py
  • packages/nvidia_nat_benchmarks/tests/agent_leaderboard/test_evaluator_comprehensive.py
  • packages/nvidia_nat_benchmarks/tests/bfcl/__init__.py
  • packages/nvidia_nat_benchmarks/tests/bfcl/conftest.py
  • packages/nvidia_nat_benchmarks/tests/bfcl/test_dataset.py
  • packages/nvidia_nat_benchmarks/tests/bfcl/test_evaluator.py
  • packages/nvidia_nat_benchmarks/tests/bfcl/test_evaluator_comprehensive.py
  • packages/nvidia_nat_benchmarks/tests/bfcl/test_integration.py
  • packages/nvidia_nat_benchmarks/tests/bfcl/test_regression.py
  • packages/nvidia_nat_benchmarks/tests/bfcl/test_tool_intent_stubs.py
  • packages/nvidia_nat_benchmarks/tests/byob/__init__.py
  • packages/nvidia_nat_benchmarks/tests/byob/conftest.py
  • packages/nvidia_nat_benchmarks/tests/byob/fixtures/sample_benchmark.py
  • packages/nvidia_nat_benchmarks/tests/byob/test_evaluator.py
  • packages/nvidia_nat_benchmarks/tests/byob/test_evaluator_comprehensive.py
  • packages/nvidia_nat_benchmarks/tests/byob/test_integration.py
  • packages/nvidia_nat_benchmarks/tests/common/__init__.py
  • packages/nvidia_nat_benchmarks/tests/common/test_bfcl_helpers.py
  • packages/nvidia_nat_benchmarks/tests/common/test_config_enums.py
  • packages/nvidia_nat_benchmarks/tests/common/test_eval_helpers.py
  • packages/nvidia_nat_benchmarks/tests/tooltalk/__init__.py
  • packages/nvidia_nat_benchmarks/tests/tooltalk/conftest.py
  • packages/nvidia_nat_benchmarks/tests/tooltalk/test_dataset.py
  • packages/nvidia_nat_benchmarks/tests/tooltalk/test_evaluator.py
  • packages/nvidia_nat_benchmarks/tests/tooltalk/test_evaluator_comprehensive.py
  • packages/nvidia_nat_benchmarks/tests/tooltalk/test_integration.py
  • packages/nvidia_nat_benchmarks/tests/tooltalk/test_regression.py
  • packages/nvidia_nat_core/src/nat/data_models/evaluate_config.py
  • packages/nvidia_nat_core/tests/nat/builder/test_per_user_function_group_eval_isolation.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/dataset_handler/dataset_handler.py
  • packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py

Comment on lines +155 to +170
@field_validator('input_params', mode='before')
@classmethod
def parse_string_to_dict(cls, v: Any) -> dict[str, Any]:
    """Convert JSON string to dict if needed."""
    if isinstance(v, str):
        try:
            normalized = v.replace("'", '"')
            return json.loads(normalized)
        except json.JSONDecodeError:
            logger.warning("Failed to parse input_params string as JSON: %s", v[:100])
            return {}
    elif isinstance(v, dict):
        return v
    else:
        logger.warning("Unexpected input_params type: %s", type(v))
        return {}

⚠️ Potential issue | 🟠 Major

Blind quote normalization corrupts some valid tool inputs, then drops them.

v.replace("'", '"') turns a valid payload like {"name": "O'Reilly"} into invalid JSON. The failure path then logs the raw tool input and collapses it to {}, which both leaks user-supplied data and loses the stripped-string fallback behavior that the existing ReAct path relies on. Please try json.loads() on the original string first, then use a safer fallback parser or preserve the stripped raw string instead of zeroing the params.

Based on learnings, the existing ReAct path intentionally preserves stripped tool_input when structured parsing fails, and as per coding guidelines, "Validate and sanitise all user input, especially in web or CLI interfaces."

Also applies to: 198-207

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`
around lines 155 - 170, The parse_string_to_dict field validator is currently
replacing single quotes before attempting JSON decode which corrupts valid
inputs like {"name":"O'Reilly"} and then returns {} while logging the raw input;
change it to first try json.loads on the original string, and only if that fails
attempt a safer fallback (e.g., ast.literal_eval or a controlled
quote-normalization) or preserve the stripped raw string in a non-sensitive
container (e.g., return {"_raw": v.strip()}) so consumers (the ReAct path) can
still access the fallback tool_input; also avoid logging raw user input—log a
truncated/sanitized preview instead. Apply the same fix pattern to the similar
validator block around the other occurrence noted (the block at the other
location referenced in the comment).
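Taken together, the json-first ordering and the raw-string fallback suggested above can be sketched as a standalone helper. This is a minimal illustration of the pattern, not the package's actual code; `parse_tool_input` and the `_raw` wrapper key are assumed names:

```python
import ast
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def parse_tool_input(v: str) -> dict[str, Any]:
    """Parse a tool-input string without corrupting embedded quotes."""
    # 1. Strict JSON on the *original* string: {"name": "O'Reilly"} stays valid.
    try:
        parsed = json.loads(v)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    # 2. Python-style dicts such as {'a': 1} parse safely via the AST.
    try:
        parsed = ast.literal_eval(v)
        if isinstance(parsed, dict):
            return parsed
    except (ValueError, SyntaxError):
        pass
    # 3. Fallback: preserve the stripped raw input so downstream consumers
    #    (e.g. a ReAct-style path) can still see it; log only a short preview.
    logger.warning("Unparseable input_params (first 50 chars): %r", v[:50])
    return {"_raw": v.strip()}
```

Note that the blind `replace("'", '"')` step is gone entirely: strict JSON handles apostrophes inside double-quoted strings, and `ast.literal_eval` handles single-quoted Python dicts without string surgery.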

Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py (2)

155-175: ⚠️ Potential issue | 🟠 Major

Don't collapse unparsed tool input to {}.

This now handles JSON and Python-style dict strings, but the failure path still returns {}. That drops the original tool input and records an empty intent instead of the fallback value the ReAct path relies on. Return a sanitized raw wrapper instead of zeroing it out.

Based on learnings, the existing ReAct path intentionally preserves stripped tool_input when structured parsing fails as fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`
around lines 155 - 175, The parse_string_to_dict field_validator currently
returns an empty dict on parse failure which discards the original tool input;
change the failure path in parse_string_to_dict (the `@field_validator` for
'input_params') to return a sanitized raw-wrapper containing the original
(stripped) string instead of {} so the ReAct fallback can see the raw tool_input
(e.g., return a dict like a single-key wrapper with the stripped original string
and keep the existing logger warning). Ensure you only modify the failure branch
after both json.loads and ast.literal_eval attempts.

84-118: ⚠️ Potential issue | 🟠 Major

Keep ToolIntentBuffer state per scenario.

record() writes the global registry per scenario, but self.intents is still one shared list. Reusing one buffer across evaluation items makes get_intents() leak intents between scenarios, and clear() wipes the local history for all of them. Store the local view per scenario too, or derive get_intents() / clear() from the current-scenario registry.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`
around lines 84 - 118, ToolIntentBuffer currently uses a single shared list
self.intents causing cross-scenario leaks; change the internal storage to be
per-scenario (e.g. self.intents_by_scenario: dict[str, list[dict[str, Any]]]) in
__init__, update record(tool_name, parameters) to use get_current_scenario_id()
to append the intent into both self.intents_by_scenario[current_scenario] and
_GLOBAL_INTENT_REGISTRY[current_scenario] (creating empty lists if missing),
make get_intents() return a copy of
self.intents_by_scenario[get_current_scenario_id()] (or [] if none), and make
clear() only clear the list for get_current_scenario_id() in both
self.intents_by_scenario and _GLOBAL_INTENT_REGISTRY and log that scenario; keep
references to the same symbol names (__init__, record, get_intents, clear,
_GLOBAL_INTENT_REGISTRY) so the changes are localized.
🧹 Nitpick comments (20)
examples/benchmarks/tooltalk/README.md (3)

85-88: Add language specifier to fenced code block.

Expected output blocks should have a language specifier like text.

🔧 Suggested fix
 **Expected output:**
-```
+```text
 export TOOLTALK_DATABASE_DIR=/path/to/.venv/lib/python3.11/site-packages/tooltalk/data/databases
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/tooltalk/README.md` around lines 85 - 88, The fenced code
block showing the environment variable exports (TOOLTALK_DATABASE_DIR and
TOOLTALK_DATASET_DIR) lacks a language specifier; add a language tag (e.g.,
text) to the opening triple-backtick so the block becomes a labeled text block
and renders correctly in README.md—update the block containing the two lines
with TOOLTALK_DATABASE_DIR and TOOLTALK_DATASET_DIR to start with ```text.

107-126: Add language specifier to fenced code block.

🔧 Suggested fix
 **Expected output (during run):**
-```
+```text
 INFO - Starting evaluation run with config file: examples/benchmarks/tooltalk/configs/eval_easy.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/tooltalk/README.md` around lines 107 - 126, The fenced
code block showing the "Expected output (during run):" should include a language
specifier to enable proper syntax highlighting; update the opening fence for
that block from ``` to ```text (or another appropriate language) in the README
example block so the lines starting with "INFO - Starting evaluation run..." and
the subsequent output are marked as plain text.

182-188: Add language specifier to fenced code block.

🔧 Suggested fix
 **Example output:**
-```
+```text
 Average success rate: 0.345
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/tooltalk/README.md` around lines 182 - 188, The fenced
code block showing benchmark output uses a bare triple-backtick; update the
opening fence to include a language specifier (e.g., change the opening "```" to
"```text") so syntax highlighters treat the block as plain text; locate the
fenced block containing "Average success rate: 0.345" and modify only its
opening fence.
examples/benchmarks/byob/README.md (6)

138-149: Add language specifier to fenced code block.

Expected output blocks should have a language specifier like text.

🔧 Suggested fix
 **Expected output:**
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_exact_match.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/byob/README.md` around lines 138 - 149, The fenced code
block in the README example lacks a language specifier; update the
triple-backtick fence in the BYOB evaluation output block to include "text"
(i.e., change ``` to ```text) so syntax highlighting/rendering is correct;
locate the output block in examples/benchmarks/byob/README.md (the block that
begins with "INFO - Starting evaluation run...") and add the language tag to the
opening fence.

39-46: Spell out "NeMo Agent Toolkit" instead of "NAT" acronym.

Per coding guidelines, "NAT" should not be used as an acronym in documentation. Replace "NAT" with "NeMo Agent Toolkit" where it refers to the toolkit (not package names in backticks).

📝 Suggested fix
 ├───────────────┬─────────────────────┤
-│ NeMo Evaluator│  NAT Integration    │
+│ NeMo Evaluator│  NeMo Agent Toolkit │
 │ (standalone)  │  (this example)     │
 │               │                     │
 │ load_dataset()│  load_dataset()     │  ← Same function, reused
-│ render_prompt │  NAT workflow       │  ← NAT replaces model calling
+│ render_prompt │  Toolkit workflow   │  ← Toolkit replaces model calling
 │ call_model()  │  (agents, tools)    │
 │ scorer_fn()   │  scorer_fn()        │  ← Same scorer, reused
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/byob/README.md` around lines 39 - 46, Replace occurrences
of the acronym "NAT" that refer to the toolkit in the README text with the full
phrase "NeMo Agent Toolkit" (e.g., change "NAT Integration", "NAT workflow", and
"NAT replaces model calling" to use "NeMo Agent Toolkit") while leaving any
package names or code identifiers in backticks untouched; update the table cells
and surrounding descriptive text accordingly to maintain sentence grammar and
capitalization.

55-55: Avoid possessive 's with inanimate objects.

Per documentation guidelines, rephrase "benchmark's extra config" to avoid possessive form with inanimate objects.

Note: Line 55 itself appears fine ("benchmark definition"). The actual occurrence is at line 293: "Benchmark's `extra` config" → "The `extra` config of the benchmark".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/byob/README.md` at line 55, Replace the possessive phrase
"Benchmark's `extra` config" with a non-possessive form such as "The `extra`
config of the benchmark" (search for the exact string "Benchmark's `extra`
config" in the README and update it) so inanimate object wording follows the
documentation guideline.

170-172: Avoid possessive 's with inanimate objects.

Rephrase "scorer's output dict" to "the output dict of the scorer" or "the scorer output dict".

📝 Suggested fix
-The `score_field` parameter controls which key from the scorer's output dict becomes the item's primary score. For `exact_match` and `contains`, this is `correct` (boolean → 1.0 or 0.0). For `f1_token`, you'd set `score_field: f1` to get the F1 float. The `average_score` in the output is the mean of all items' primary scores.
+The `score_field` parameter controls which key from the scorer output dict becomes the primary score of the item. For `exact_match` and `contains`, this is `correct` (boolean → 1.0 or 0.0). For `f1_token`, you'd set `score_field: f1` to get the F1 float. The `average_score` in the output is the mean of primary scores across all items.
-
-The metrics available in each item's `reasoning` dict depend entirely on what the scorer function returns — the evaluator passes through the full scorer output without modification.
+
+The metrics available in the `reasoning` dict of each item depend entirely on what the scorer function returns — the evaluator passes through the full scorer output without modification.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/byob/README.md` around lines 170 - 172, Update the
wording to avoid using a possessive with an inanimate object: replace "scorer's
output dict" with "the output dict of the scorer" (or "the scorer output dict")
in the README text that explains score_field and reasoning, so the sentence
reads e.g. "The metrics available in each item's `reasoning` dict depend
entirely on the output dict of the scorer — the evaluator passes through the
full scorer output without modification."

293-293: Avoid possessive 's with inanimate objects.

Rephrase "Benchmark's extra config" to "The extra config of the benchmark".

📝 Suggested fix
-| `config` | `Dict[str, Any]` | Benchmark's `extra` config |
+| `config` | `Dict[str, Any]` | The `extra` config of the benchmark |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/byob/README.md` at line 293, Change the possessive phrase
"Benchmark's `extra` config" to use non-possessive wording: replace that cell
value (the table row for `config` / `Dict[str, Any]`) so it reads "The `extra`
config of the benchmark" instead of "Benchmark's `extra` config".

189-194: Add language specifier to fenced code block.

🔧 Suggested fix
 **Example output:**
-```
+```text
 Average score: 0.667
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/byob/README.md` around lines 189 - 194, The fenced code
block showing the benchmark output (the block starting with "Average score:
0.667" and the lines "0: score=1.0 correct=True" etc.) is missing a language
specifier; update the opening fence to include a language (for example use
"text" or "console" like ```text) so markdown renderers apply correct formatting
and avoid syntax highlighting issues.
pyproject.toml (2)

266-270: Same alphabetical ordering issue in uv.sources.

For consistency, the uv.sources section should also be alphabetically sorted, placing nat_automated_description_generation before the nat_benchmark_* entries.

🔧 Suggested ordering fix
 nat_autogen_demo = { path = "examples/frameworks/nat_autogen_demo", editable = true }
+nat_automated_description_generation = { path = "examples/custom_functions/automated_description_generation", editable = true }
 nat_benchmark_agent_leaderboard = { path = "examples/benchmarks/agent_leaderboard", editable = true }
 nat_benchmark_bfcl = { path = "examples/benchmarks/bfcl", editable = true }
 nat_benchmark_byob = { path = "examples/benchmarks/byob", editable = true }
 nat_benchmark_tooltalk = { path = "examples/benchmarks/tooltalk", editable = true }
-nat_automated_description_generation = { path = "examples/custom_functions/automated_description_generation", editable = true }
 nat_currency_agent_a2a = { path = "examples/A2A/currency_agent_a2a", editable = true }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 266 - 270, The entries in uv.sources are not
alphabetically ordered: move the nat_automated_description_generation entry so
it appears before the nat_benchmark_* entries. Specifically, within the
uv.sources block reorder so nat_automated_description_generation appears above
nat_benchmark_agent_leaderboard, nat_benchmark_bfcl, nat_benchmark_byob, and
nat_benchmark_tooltalk to maintain consistent alphabetical sorting.

126-130: Examples list is not sorted alphabetically.

The benchmark entries are placed between nat_autogen_demo and nat_automated_description_generation, but alphabetically nat_automated_description_generation should come before any nat_benchmark_* entries since "automated" < "benchmark".

🔧 Suggested ordering fix
   "nat_autogen_demo",
+  "nat_automated_description_generation",
   "nat_benchmark_agent_leaderboard",
   "nat_benchmark_bfcl",
   "nat_benchmark_byob",
   "nat_benchmark_tooltalk",
-  "nat_automated_description_generation",
   "nat_currency_agent_a2a",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 126 - 130, The examples list in pyproject.toml
is not alphabetically ordered; move "nat_automated_description_generation" so it
appears before the "nat_benchmark_*" entries (e.g.,
nat_benchmark_agent_leaderboard, nat_benchmark_bfcl, nat_benchmark_byob,
nat_benchmark_tooltalk) so the sequence is alphabetically sorted; update the
examples array to place nat_automated_description_generation immediately before
the nat_benchmark_* items.
examples/benchmarks/agent_leaderboard/README.md (4)

138-138: Avoid possessive 's with inanimate objects.

Per documentation guidelines, avoid using possessive 's with inanimate objects. Consider rewording "scenario's user goals" to "user goals from the scenario" or "user goals of the scenario".

📝 Suggested fix
-This example uses the **Tool Selection Quality (TSQ)** evaluator (`_type: agent_leaderboard_tsq` in the eval config). It compares the tool calls the agent made (captured by the workflow via `ToolIntentBuffer`) against the expected tool calls derived from the scenario's user goals.
+This example uses the **Tool Selection Quality (TSQ)** evaluator (`_type: agent_leaderboard_tsq` in the eval config). It compares the tool calls the agent made (captured by the workflow via `ToolIntentBuffer`) against the expected tool calls derived from the user goals of the scenario.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/agent_leaderboard/README.md` at line 138, Reword the
phrase "scenario's user goals" to avoid possessive 's with an inanimate object:
update the README line that describes the Tool Selection Quality (TSQ) evaluator
(`_type: agent_leaderboard_tsq`) and the ToolIntentBuffer comparison to use
"user goals from the scenario" or "user goals of the scenario" so the sentence
reads e.g. "...against the expected tool calls derived from the user goals from
the scenario."

191-203: Add language specifier to fenced code block.

This example output block should have a language specifier.

🔧 Suggested fix
-```
+```text
 Average TSQ F1: 0.650
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/agent_leaderboard/README.md` around lines 191 - 203,
Update the fenced code block that starts with the example output containing
"Average TSQ F1: 0.650" to include a language specifier (use "text") so the
block begins with ```text instead of ```; locate the fenced block around the TSQ
output (the lines listing "Average TSQ F1", "Total scenarios", and the scenario
entries) and change the opening fence to ```text to satisfy the documentation
lint rule.

73-79: Add language specifier to fenced code block.

Static analysis flagged this code block as missing a language specifier. For expected terminal output, use text or plaintext.

🔧 Suggested fix
-```
+```text
 INFO - Loading agent leaderboard v2 dataset from Hugging Face...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/agent_leaderboard/README.md` around lines 73 - 79, The
fenced code block showing the example terminal output (the block beginning with
"INFO - Loading agent leaderboard v2 dataset from Hugging Face...") is missing a
language specifier; update the triple-backtick fence to include a language tag
such as text or plaintext (e.g., change ``` to ```text) so the block is
correctly marked as plain text in README.md.

112-123: Add language specifier to fenced code block.

This expected output block should have a language specifier.

🔧 Suggested fix
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_banking.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/agent_leaderboard/README.md` around lines 112 - 123, The
fenced code block in the README example output is missing a language specifier;
update the opening fence for that block (the triple backticks before "INFO -
Starting evaluation run...") to include a language tag such as "text" so
Markdown renders it correctly (e.g., change ``` to ```text for the code block
that shows the INFO/LOG output and evaluation summary).
examples/benchmarks/bfcl/README.md (5)

213-225: Add language specifier to fenced code block.

🔧 Suggested fix
 **Expected output:**
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_react_simple.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/bfcl/README.md` around lines 213 - 225, The fenced code
block showing the "Expected output" lacks a language specifier; update the
opening triple-backtick for that block to include a language tag (e.g., change
``` to ```text) so the log lines like "INFO - Starting evaluation run with
config file: .../eval_react_simple.yml" and the rest of the sample output render
as plain text; locate the fenced block containing the example output and add the
specifier to its opening ``` marker.

184-196: Add language specifier to fenced code block.

🔧 Suggested fix
 **Expected output:**
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_fc_simple.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/bfcl/README.md` around lines 184 - 196, The fenced code
block under the "Expected output:" section in README.md is missing a language
specifier; update the triple-backtick fence that precedes the log sample (the
block showing "INFO - Starting evaluation run...") to use a language tag (for
example use ```text) so syntax highlighters render it correctly while leaving
the block contents unchanged.

278-287: Add language specifier to fenced code block.

🔧 Suggested fix
 **Example output:**
-```
+```text
 Accuracy: 0.850
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/bfcl/README.md` around lines 278 - 287, The fenced code
block showing the example output lacks a language specifier; update the block
delimiter for the "Example output" snippet (the triple-backtick surrounding the
Accuracy/ PASS/ FAIL lines) to include a language tag (e.g., replace ``` with
```text) so syntax highlighters treat it as plain text while preserving the
existing content.

157-168: Add language specifier to fenced code block.

🔧 Suggested fix
 **Expected output:**
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_ast_simple.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/bfcl/README.md` around lines 157 - 168, Update the fenced
code block in examples/benchmarks/bfcl/README.md that contains the evaluation
log so it includes a language specifier (use "text") on the opening fence;
replace the opening ``` with ```text for the block that starts with "INFO -
Starting evaluation run with config file: .../eval_ast_simple.yml" so the block
is properly labeled and rendered.

130-133: Add language specifier to fenced code block.

Expected output blocks should have a language specifier like text.

🔧 Suggested fix
 **Expected output:**
-```
+```text
 Dataset dir:  /path/to/.venv/lib/python3.11/site-packages/bfcl/data
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/bfcl/README.md` around lines 130 - 133, The fenced code
block showing "Dataset dir:  /path/to/.venv/..." and "Answers dir: 
/path/to/.venv/..." in the README lacks a language specifier; update that fenced
block (the triple-backtick block containing those two lines) to use a language
tag (e.g., add "text" after the opening ```), so the block becomes ```text ...
``` to ensure proper formatting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`:
- Around line 121-143: The functions get_global_intents and clear_global_intents
currently use the literal default "current" and should instead accept
scenario_id: str | None = None and resolve None by calling
get_current_scenario_id(); update both functions (get_global_intents and
clear_global_intents) to: change the signature to accept Optional scenario_id,
call get_current_scenario_id() when scenario_id is None, then use the resolved
id when reading/writing _GLOBAL_INTENT_REGISTRY and logging so the active
scenario bucket is used rather than the literal "current".
- Around line 183-187: The public API return annotation on
create_tool_stub_function is wrong: change its signature to return
tuple[Callable[[object], Coroutine[Any, Any, str]], type[PermissiveToolInput] |
None, str] (the first element is an async call signature, the second is the
PermissiveToolInput class type, not an instance). Import Callable from
collections.abc and Coroutine and Any from typing (or typing.Any) and ensure
PermissiveToolInput is referenced in the annotation (use a forward reference or
real symbol) so pyright validates the function create_tool_stub_function
correctly.

---

Duplicate comments:
In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`:
- Around line 155-175: The parse_string_to_dict field_validator currently
returns an empty dict on parse failure which discards the original tool input;
change the failure path in parse_string_to_dict (the `@field_validator` for
'input_params') to return a sanitized raw-wrapper containing the original
(stripped) string instead of {} so the ReAct fallback can see the raw tool_input
(e.g., return a dict like a single-key wrapper with the stripped original string
and keep the existing logger warning). Ensure you only modify the failure branch
after both json.loads and ast.literal_eval attempts.
- Around line 84-118: ToolIntentBuffer currently uses a single shared list
self.intents causing cross-scenario leaks; change the internal storage to be
per-scenario (e.g. self.intents_by_scenario: dict[str, list[dict[str, Any]]]) in
__init__, update record(tool_name, parameters) to use get_current_scenario_id()
to append the intent into both self.intents_by_scenario[current_scenario] and
_GLOBAL_INTENT_REGISTRY[current_scenario] (creating empty lists if missing),
make get_intents() return a copy of
self.intents_by_scenario[get_current_scenario_id()] (or [] if none), and make
clear() only clear the list for get_current_scenario_id() in both
self.intents_by_scenario and _GLOBAL_INTENT_REGISTRY and log that scenario; keep
references to the same symbol names (__init__, record, get_intents, clear,
_GLOBAL_INTENT_REGISTRY) so the changes are localized.

---

Nitpick comments:
In `@examples/benchmarks/agent_leaderboard/README.md`:
- Line 138: Reword the phrase "scenario's user goals" to avoid possessive 's
with an inanimate object: update the README line that describes the Tool
Selection Quality (TSQ) evaluator (`_type: agent_leaderboard_tsq`) and the
ToolIntentBuffer comparison to use "user goals from the scenario" or "user goals
of the scenario" so the sentence reads e.g. "...against the expected tool calls
derived from the user goals from the scenario."
- Around line 191-203: Update the fenced code block that starts with the example
output containing "Average TSQ F1: 0.650" to include a language specifier (use
"text") so the block begins with ```text instead of ```; locate the fenced block
around the TSQ output (the lines listing "Average TSQ F1", "Total scenarios",
and the scenario entries) and change the opening fence to ```text to satisfy the
documentation lint rule.
- Around line 73-79: The fenced code block showing the example terminal output
(the block beginning with "INFO - Loading agent leaderboard v2 dataset from
Hugging Face...") is missing a language specifier; update the triple-backtick
fence to include a language tag such as text or plaintext (e.g., change ``` to
```text) so the block is correctly marked as plain text in README.md.
- Around line 112-123: The fenced code block in the README example output is
missing a language specifier; update the opening fence for that block (the
triple backticks before "INFO - Starting evaluation run...") to include a
language tag such as "text" so Markdown renders it correctly (e.g., change ```
to ```text for the code block that shows the INFO/LOG output and evaluation
summary).

In `@examples/benchmarks/bfcl/README.md`:
- Around line 213-225: The fenced code block showing the "Expected output" lacks
a language specifier; update the opening triple-backtick for that block to
include a language tag (e.g., change ``` to ```text) so the log lines like "INFO
- Starting evaluation run with config file: .../eval_react_simple.yml" and the
rest of the sample output render as plain text; locate the fenced block
containing the example output and add the specifier to its opening ``` marker.
- Around line 184-196: The fenced code block under the "Expected output:"
section in README.md is missing a language specifier; update the triple-backtick
fence that precedes the log sample (the block showing "INFO - Starting
evaluation run...") to use a language tag (for example use ```text) so syntax
highlighters render it correctly while leaving the block contents unchanged.
- Around line 278-287: The fenced code block showing the example output lacks a
language specifier; update the block delimiter for the "Example output" snippet
(the triple-backtick surrounding the Accuracy/ PASS/ FAIL lines) to include a
language tag (e.g., replace ``` with ```text) so syntax highlighters treat it as
plain text while preserving the existing content.
- Around line 157-168: Update the fenced code block in
examples/benchmarks/bfcl/README.md that contains the evaluation log so it
includes a language specifier (use "text") on the opening fence; replace the
opening ``` with ```text for the block that starts with "INFO - Starting
evaluation run with config file: .../eval_ast_simple.yml" so the block is
properly labeled and rendered.
- Around line 130-133: The fenced code block showing "Dataset dir: 
/path/to/.venv/..." and "Answers dir:  /path/to/.venv/..." in the README lacks a
language specifier; update that fenced block (the triple-backtick block
containing those two lines) to use a language tag (e.g., add "text" after the
opening ```), so the block becomes ```text ... ``` to ensure proper formatting.

In `@examples/benchmarks/byob/README.md`:
- Around line 138-149: The fenced code block in the README example lacks a
language specifier; update the triple-backtick fence in the BYOB evaluation
output block to include "text" (i.e., change ``` to ```text) so syntax
highlighting/rendering is correct; locate the output block in
examples/benchmarks/byob/README.md (the block that begins with "INFO - Starting
evaluation run...") and add the language tag to the opening fence.
- Around line 39-46: Replace occurrences of the acronym "NAT" that refer to the
toolkit in the README text with the full phrase "NeMo Agent Toolkit" (e.g.,
change "NAT Integration", "NAT workflow", and "NAT replaces model calling" to
use "NeMo Agent Toolkit") while leaving any package names or code identifiers in
backticks untouched; update the table cells and surrounding descriptive text
accordingly to maintain sentence grammar and capitalization.
- Line 55: Replace the possessive phrase "Benchmark's `extra` config" with a
non-possessive form such as "The `extra` config of the benchmark" (search for
the exact string "Benchmark's `extra` config" in the README and update it) so
inanimate object wording follows the documentation guideline.
- Around line 170-172: Update the wording to avoid using a possessive with an
inanimate object: replace "scorer's output dict" with "the output dict of the
scorer" (or "the scorer output dict") in the README text that explains
score_field and reasoning, so the sentence reads e.g. "The metrics available in
each item's `reasoning` dict depend entirely on the output dict of the scorer —
the evaluator passes through the full scorer output without modification."
- Line 293: Change the possessive phrase "Benchmark's `extra` config" to use
non-possessive wording: replace that cell value (the table row for `config` /
`Dict[str, Any]`) so it reads "The `extra` config of the benchmark" instead of
"Benchmark's `extra` config".
- Around line 189-194: The fenced code block showing the benchmark output (the
block starting with "Average score: 0.667" and the lines "0: score=1.0
correct=True" etc.) is missing a language specifier; update the opening fence to
include a language (for example use "text" or "console" like ```text) so
markdown renderers apply correct formatting and avoid syntax highlighting
issues.

In `@examples/benchmarks/tooltalk/README.md`:
- Around line 85-88: The fenced code block showing the environment variable
exports (TOOLTALK_DATABASE_DIR and TOOLTALK_DATASET_DIR) lacks a language
specifier; add a language tag (e.g., text) to the opening triple-backtick so the
block becomes a labeled text block and renders correctly in README.md—update the
block containing the two lines with TOOLTALK_DATABASE_DIR and
TOOLTALK_DATASET_DIR to start with ```text.
- Around line 107-126: The fenced code block showing the "Expected output
(during run):" should include a language specifier to enable proper syntax
highlighting; update the opening fence for that block from ``` to ```text (or
another appropriate language) in the README example block so the lines starting
with "INFO - Starting evaluation run..." and the subsequent output are marked as
plain text.
- Around line 182-188: The fenced code block showing benchmark output uses a
bare triple-backtick; update the opening fence to include a language specifier
(e.g., change the opening "```" to "```text") so syntax highlighters treat the
block as plain text; locate the fenced block containing "Average success rate:
0.345" and modify only its opening fence.

In `@pyproject.toml`:
- Around line 266-270: The entries in uv.sources are not alphabetically ordered:
move the nat_automated_description_generation entry so it appears before the
nat_benchmark_* entries. Specifically, within the uv.sources block reorder so
nat_automated_description_generation appears above
nat_benchmark_agent_leaderboard, nat_benchmark_bfcl, nat_benchmark_byob, and
nat_benchmark_tooltalk to maintain consistent alphabetical sorting.
- Around line 126-130: The examples list in pyproject.toml is not alphabetically
ordered; move "nat_automated_description_generation" so it appears before the
"nat_benchmark_*" entries (e.g., nat_benchmark_agent_leaderboard,
nat_benchmark_bfcl, nat_benchmark_byob, nat_benchmark_tooltalk) so the sequence
is alphabetically sorted; update the examples array to place
nat_automated_description_generation immediately before the nat_benchmark_*
items.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9bf982d9-d5b1-43ba-944f-838d6fa58d02

📥 Commits

Reviewing files that changed from the base of the PR and between 72c78b9 and ab4c303.

📒 Files selected for processing (11)
  • ci/vale/styles/config/vocabularies/nat/accept.txt
  • examples/benchmarks/agent_leaderboard/README.md
  • examples/benchmarks/bfcl/README.md
  • examples/benchmarks/byob/README.md
  • examples/benchmarks/tooltalk/README.md
  • packages/nvidia_nat_benchmarks/configs/agent_leaderboard_eval.yaml
  • packages/nvidia_nat_benchmarks/configs/bfcl_ast_eval.yaml
  • packages/nvidia_nat_benchmarks/configs/tooltalk_eval.yaml
  • packages/nvidia_nat_benchmarks/pyproject.toml
  • packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (4)
  • packages/nvidia_nat_benchmarks/configs/tooltalk_eval.yaml
  • packages/nvidia_nat_benchmarks/configs/bfcl_ast_eval.yaml
  • packages/nvidia_nat_benchmarks/configs/agent_leaderboard_eval.yaml
  • packages/nvidia_nat_benchmarks/pyproject.toml

Comment on lines +121 to +143
def get_global_intents(scenario_id: str = "current") -> list[dict[str, Any]]:
    """Retrieve tool intents from the global registry.

    This allows evaluators to access intents without needing builder access.

    Args:
        scenario_id: Identifier for the scenario

    Returns:
        List of tool intents
    """
    return _GLOBAL_INTENT_REGISTRY.get(scenario_id, []).copy()


def clear_global_intents(scenario_id: str = "current") -> None:
    """Clear intents from global registry.

    Args:
        scenario_id: Identifier for the scenario to clear
    """
    if scenario_id in _GLOBAL_INTENT_REGISTRY:
        _GLOBAL_INTENT_REGISTRY[scenario_id] = []
        logger.debug("Cleared global intents for scenario %s", scenario_id)

⚠️ Potential issue | 🟡 Minor

Resolve omitted scenario_id through get_current_scenario_id().

The default "current" is used as a literal registry key here. After set_current_scenario_id("foo"), calling get_global_intents() or clear_global_intents() with no argument still targets "current" instead of the active scenario, which can silently read or clear the wrong bucket. Prefer scenario_id: str | None = None and resolve None via get_current_scenario_id().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`
around lines 121 - 143, The functions get_global_intents and
clear_global_intents currently use the literal default "current" and should
instead accept scenario_id: str | None = None and resolve None by calling
get_current_scenario_id(); update both functions (get_global_intents and
clear_global_intents) to: change the signature to accept Optional scenario_id,
call get_current_scenario_id() when scenario_id is None, then use the resolved
id when reading/writing _GLOBAL_INTENT_REGISTRY and logging so the active
scenario bucket is used rather than the literal "current".

Comment on lines +183 to +187
def create_tool_stub_function(
    tool_schema: dict[str, Any],
    intent_buffer: ToolIntentBuffer,
    canned_response: str | None = None,
) -> tuple[callable, BaseModel | None, str]:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py | head -250

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 11039


🏁 Script executed:

rg "class PermissiveToolInput" packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py -A 5

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 348


🏁 Script executed:

rg "def create_tool_stub_function" packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py -A 30

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 1449


🏁 Script executed:

sed -n '220,235p' packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 696


Fix the return type annotation for the public API.

The annotation tuple[callable, BaseModel | None, str] is invalid. callable (lowercase) is not a valid type; the first element is an async function requiring Callable[[object], Coroutine[Any, Any, str]]; and the second element returns the class PermissiveToolInput itself, not an instance, requiring type[PermissiveToolInput] | None. Update the annotation to:

-> tuple[Callable[[object], Coroutine[Any, Any, str]], type[PermissiveToolInput] | None, str]:

This requires importing Callable and Coroutine from collections.abc and typing respectively. Per coding guidelines, all public APIs must have correct Python 3.11+ type hints and pass pyright validation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`
around lines 183 - 187, The public API return annotation on
create_tool_stub_function is wrong: change its signature to return
tuple[Callable[[object], Coroutine[Any, Any, str]], type[PermissiveToolInput] |
None, str] (the first element is an async call signature, the second is the
PermissiveToolInput class type, not an instance). Import Callable from
collections.abc and Coroutine and Any from typing (or typing.Any) and ensure
PermissiveToolInput is referenced in the annotation (use a forward reference or
real symbol) so pyright validates the function create_tool_stub_function
correctly.

Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
@bbednarski9 bbednarski9 requested a review from a team as a code owner March 9, 2026 03:55
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
examples/benchmarks/byob/pyproject.toml (2)

27-32: Publish the example README in package metadata.

This example already has a README in the PR. Adding readme = "README.md" makes the installed package more self-describing on package indexes and in metadata views.

Suggested change
 [project]
 name = "nat_benchmark_byob"
 dynamic = ["version", "dependencies"]
 requires-python = ">=3.11,<3.14"
 description = "BYOB (Bring Your Own Benchmark) evaluation example for NeMo Agent Toolkit"
+readme = "README.md"
 classifiers = ["Programming Language :: Python"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/byob/pyproject.toml` around lines 27 - 32, Add the README
to the package metadata by adding the readme field under the [project] table in
pyproject.toml (where name, dynamic, requires-python, description, classifiers
are defined); set readme = "README.md" so the package includes and exposes the
example README in its installed metadata and on package indexes.

34-38: Remove unused test extra from nvidia-nat dependency.

Line 36 includes nvidia-nat[test] but this example has no Python source code and performs no test-only imports at runtime. Dropping the test extra would reduce unnecessary dependency bloat.

Note: Other examples in the repository consistently include this extra; this change would deviate from that pattern.

Suggested change
 [tool.setuptools_dynamic_dependencies]
 dependencies = [
-  "nvidia-nat[eval,langchain,test] == {version}",
+  "nvidia-nat[eval,langchain] == {version}",
   "nvidia-nat-benchmarks[byob] == {version}",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/byob/pyproject.toml` around lines 34 - 38, Update the
dependencies entry under [tool.setuptools_dynamic_dependencies] to remove the
unused "test" extra from the "nvidia-nat" spec: replace
"nvidia-nat[eval,langchain,test] == {version}" with "nvidia-nat[eval,langchain]
== {version}" so the dependencies list no longer includes the unnecessary test
extra.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@examples/benchmarks/byob/pyproject.toml`:
- Around line 27-32: Add the README to the package metadata by adding the readme
field under the [project] table in pyproject.toml (where name, dynamic,
requires-python, description, classifiers are defined); set readme = "README.md"
so the package includes and exposes the example README in its installed metadata
and on package indexes.
- Around line 34-38: Update the dependencies entry under
[tool.setuptools_dynamic_dependencies] to remove the unused "test" extra from
the "nvidia-nat" spec: replace "nvidia-nat[eval,langchain,test] == {version}"
with "nvidia-nat[eval,langchain] == {version}" so the dependencies list no
longer includes the unnecessary test extra.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8f1639a4-974c-4c3b-9aaa-3ac690dc3def

📥 Commits

Reviewing files that changed from the base of the PR and between 385376c and 7be163e.

⛔ Files ignored due to path filters (2)
  • packages/nvidia_nat_benchmarks/uv.lock is excluded by !**/*.lock
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • examples/benchmarks/agent_leaderboard/pyproject.toml
  • examples/benchmarks/bfcl/pyproject.toml
  • examples/benchmarks/byob/pyproject.toml
  • examples/benchmarks/tooltalk/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/benchmarks/tooltalk/pyproject.toml
  • examples/benchmarks/bfcl/pyproject.toml

Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
examples/benchmarks/agent_leaderboard/README.md (1)

84-95: Add language specifiers to fenced code blocks.

The output blocks are missing language specifiers. Add text or console to improve rendering and meet markdown best practices.

📝 Proposed fix

For lines 84-95:

 **Expected output:**
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_banking.yml

For lines 163-175:

 **Example output:**
-```
+```text
 Average TSQ F1: 0.650

Also applies to: 163-175

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmarks/agent_leaderboard/README.md` around lines 84 - 95, The
fenced code blocks showing console output (e.g., the block starting with "INFO -
Starting evaluation run with config file: .../eval_banking.yml" and the block
containing "Average TSQ F1: 0.650") should include a language specifier for
proper rendering; update the opening triple backticks for those blocks (and any
other similar output blocks, including the block around lines 163-175) from ```
to ```text (or ```console) so they are explicitly marked as plain text/console
output.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@examples/benchmarks/agent_leaderboard/README.md`:
- Around line 84-95: The fenced code blocks showing console output (e.g., the
block starting with "INFO - Starting evaluation run with config file:
.../eval_banking.yml" and the block containing "Average TSQ F1: 0.650") should
include a language specifier for proper rendering; update the opening triple
backticks for those blocks (and any other similar output blocks, including the
block around lines 163-175) from ``` to ```text (or ```console) so they are
explicitly marked as plain text/console output.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7780df8e-8217-44f9-a80e-21fc8b6651c6

📥 Commits

Reviewing files that changed from the base of the PR and between 7be163e and b3973bf.

📒 Files selected for processing (2)
  • examples/benchmarks/agent_leaderboard/README.md
  • examples/benchmarks/bfcl/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/benchmarks/bfcl/pyproject.toml

@bbednarski9 bbednarski9 added the DO NOT MERGE PR should not be merged; see PR for details label Mar 9, 2026

Labels

DO NOT MERGE PR should not be merged; see PR for details feature request New feature or request non-breaking Non-breaking change
