feature: nvidia-nat-benchmarks plugin #1771
bbednarski9 wants to merge 10 commits into NVIDIA:develop from
Conversation
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough
Adds four benchmark suites (Agent Leaderboard v2, BFCL, BYOB, ToolTalk) with dataset loaders, workflows, evaluators, configs, docs, and tests; and adds optional evaluation-item shuffling (with a reproducible seed) to the eval pipeline.
Sequence Diagram(s): a Mermaid sequence diagram (omitted here) shows the key actors in the evaluation flow; rectangles use default colors.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 6
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py (1)
677-684: ⚠️ Potential issue | 🟠 Major
Record `shuffle` and `shuffle_seed` in `config_metadata.json`.
These flags now change dataset order, but `write_configuration()` still omits them from the saved run metadata. Two evaluations with different item orderings will therefore look identical after the fact, which makes benchmark results much harder to reproduce or compare.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py` around lines 677 - 684, The run metadata currently omits the dataset ordering options; update the code that builds/writes config_metadata.json (the write_configuration() path that saves run metadata) to include the eval flags from eval_config.general.shuffle and eval_config.general.shuffle_seed (store them as keys "shuffle" and "shuffle_seed" or similar) so the DatasetHandler parameters (see the DatasetHandler(...) instantiation using shuffle and shuffle_seed) are persisted; ensure these values are read from self.eval_config.general when assembling the metadata object before writing to config_metadata.json.
🟡 Minor comments (18)
packages/nvidia_nat_core/tests/nat/builder/test_per_user_function_group_eval_isolation.py-175-176 (1)
175-176: ⚠️ Potential issue | 🟡 Minor
Make `SessionManager.shutdown()` exception-safe.
Cleanup only runs on the happy path right now. If any `session()` or `ainvoke()` call raises before the explicit shutdown, the cleanup task and per-user builders can survive into later tests. A `try/finally` or yielding fixture would make teardown deterministic.
As per coding guidelines, "Any frequently repeated code should be extracted into pytest fixtures."
Also applies to: 194-194, 213-214, 226-226, 246-247, 271-271
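A minimal, self-contained sketch of the yielding-fixture pattern this comment recommends; `FakeSessionManager` is a stand-in for the real `SessionManager`, which is assumed rather than shown.

```python
import asyncio
import contextlib


class FakeSessionManager:
    """Stand-in for SessionManager; only tracks whether shutdown ran."""

    def __init__(self) -> None:
        self.shut_down = False

    async def shutdown(self) -> None:
        self.shut_down = True


@contextlib.asynccontextmanager
async def session_manager_fixture():
    # The same shape works as a pytest fixture: yield inside try,
    # shutdown inside finally, so teardown runs even if the test raises.
    manager = FakeSessionManager()
    try:
        yield manager
    finally:
        await manager.shutdown()


async def main() -> bool:
    with contextlib.suppress(RuntimeError):
        async with session_manager_fixture() as manager:
            raise RuntimeError("test body failed")
    # shutdown still ran despite the exception
    return manager.shut_down


print(asyncio.run(main()))
```

Because the `finally` block owns the teardown, a failing `session()` or `ainvoke()` call can no longer leak per-user builders into later tests.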
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/tests/nat/builder/test_per_user_function_group_eval_isolation.py` around lines 175 - 176, Tests leak per-user builders when exceptions occur because cleanup only runs on the happy path; make teardown deterministic by either (A) making SessionManager.shutdown() exception-safe (catch and log any exceptions and ensure all cleanup steps still run inside SessionManager.shutdown) — update the shutdown implementation to wrap each cleanup step (e.g., closing per-user builders and cancelling cleanup tasks) in try/except so one failing step won't abort the rest — or (B) refactor the repeated setup/teardown into a pytest fixture that yields the SessionManager/WorkflowBuilder and performs shutdown in a finally block after yield; use the existing symbols SessionManager.create, SessionManager.shutdown, session(), ainvoke(), and WorkflowBuilder.from_config to locate the code to change.
packages/nvidia_nat_core/tests/nat/builder/test_per_user_function_group_eval_isolation.py-218-224 (1)
218-224: ⚠️ Potential issue | 🟡 Minor
Cross a session boundary in the "same user_id" case.
This only proves state is shared within one `session_manager.session(...)` block. If `SessionManager` rebuilt the per-user builder on every `session()` enter, this test would still pass. Re-open the same `user_id` in a second session and assert both `count` and `instance_id` carry over.
🔁 Suggested assertion shape
```diff
 async with session_manager.session(user_id=user_id) as session:
     fns = await session.workflow.function_groups["counter"].get_accessible_functions()
     increment = fns["counter__increment"]
-    get_count = fns["counter__get_count"]
-
-    await increment.ainvoke(IncrementInput(n=5))
-    result = await get_count.ainvoke(IncrementInput(n=0))
+    first = await increment.ainvoke(IncrementInput(n=5))
+
+async with session_manager.session(user_id=user_id) as session:
+    fns = await session.workflow.function_groups["counter"].get_accessible_functions()
+    get_count = fns["counter__get_count"]
+    result = await get_count.ainvoke(IncrementInput(n=0))

 await session_manager.shutdown()

 # Within the same session, state accumulates
 assert result.count == 5, f"Expected count 5, got {result.count}"
+assert result.instance_id == first.instance_id
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_core/tests/nat/builder/test_per_user_function_group_eval_isolation.py` around lines 218 - 224, The test only verifies state within a single session context; update it to reopen a new session for the same user_id using session_manager.session(user_id=user_id) a second time and re-resolve the counter function group (function_groups["counter"]) to fetch increment and get_count again, then invoke get_count (and/or increment if desired) with IncrementInput and assert that both the numeric count and the instance identifier (e.g., instance_id returned by the counter) are the same as the values observed in the first session, proving state and instance_id persist across separate session() entries for the same user_id.
examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_banking.yml-17-17 (1)
17-17: ⚠️ Potential issue | 🟡 Minor
Incorrect path in usage comment.
The usage comment references `examples/benchmarks/agent_leaderboard/configs/eval_banking.yml`, but this file is located at `examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_banking.yml`.
Proposed fix
```diff
-# Usage: nat eval --config_file examples/benchmarks/agent_leaderboard/configs/eval_banking.yml
+# Usage: nat eval --config_file examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_banking.yml
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_banking.yml` at line 17, The usage comment currently points to the wrong config path; update the comment line that reads "Usage: nat eval --config_file examples/benchmarks/agent_leaderboard/configs/eval_banking.yml" to the correct relative path "examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_banking.yml" so the usage example matches the file's actual location.
packages/nvidia_nat_benchmarks/src/nat/meta/pypi.md-21-21 (1)
21-21: ⚠️ Potential issue | 🟡 Minor
Incomplete Markdown link syntax.
The text ``[`NeMo Agent Toolkit Benchmarks`]`` uses reference-style link syntax but lacks a corresponding link definition. This will render as plain text with brackets rather than a hyperlink.
Consider either:
- Converting to an inline link: `[NeMo Agent Toolkit Benchmarks](URL_HERE)`
- Or removing the brackets if no link is intended: NeMo Agent Toolkit Benchmarks
Proposed fix (assuming no link intended)
```diff
-This is a subpackage for [`NeMo Agent Toolkit Benchmarks`], which integrates key agentic workflow evaluations from [NeMo Evaluator](https://github.com/NVIDIA-NeMo/Evaluator/tree/main) and [HuggingFace datasets](https://huggingface.co/datasets) into the Toolkit for rapid benchmark evaluation.
+This is a subpackage for NeMo Agent Toolkit Benchmarks, which integrates key agentic workflow evaluations from [NeMo Evaluator](https://github.com/NVIDIA-NeMo/Evaluator/tree/main) and [HuggingFace datasets](https://huggingface.co/datasets) into the Toolkit for rapid benchmark evaluation.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_benchmarks/src/nat/meta/pypi.md` at line 21, The Markdown uses a reference-style link token [`NeMo Agent Toolkit Benchmarks`] without a matching link definition; update the text in the nat/meta/pypi.md line containing that token by either converting it to an inline link (e.g., replace [`NeMo Agent Toolkit Benchmarks`] with an inline form and add the target URL) or remove the square-bracket markers to render plain text (replace [`NeMo Agent Toolkit Benchmarks`] with NeMo Agent Toolkit Benchmarks); ensure the replacement preserves surrounding punctuation and spacing.
examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/configs/eval_easy.yml-17-17 (1)
17-17: ⚠️ Potential issue | 🟡 Minor
Incorrect path in usage comment.
The usage comment references `examples/benchmarks/tooltalk/configs/eval_easy.yml`, but this file is located at `examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/configs/eval_easy.yml`.
Proposed fix
```diff
-# Usage: nat eval --config_file examples/benchmarks/tooltalk/configs/eval_easy.yml
+# Usage: nat eval --config_file examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/configs/eval_easy.yml
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/configs/eval_easy.yml` at line 17, The usage comment at the top of eval_easy.yml contains an incorrect repository path; update the comment string to reference this file's actual repository-relative location by inserting the missing src directory segment (so the usage comment matches where eval_easy.yml actually lives) — edit the comment line in eval_easy.yml that starts with "Usage:" to the corrected path.
examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/eval_react_simple.yml-18-18 (1)
18-18: ⚠️ Potential issue | 🟡 Minor
Fix the usage path in the comment.
The usage comment references `examples/benchmarks/bfcl/configs/` but this file is located at `examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/`. Update the path to match the actual file location.
📝 Proposed fix
```diff
-# Usage: nat eval --config_file examples/benchmarks/bfcl/configs/eval_react_simple.yml
+# Usage: nat eval --config_file examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/eval_react_simple.yml
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/eval_react_simple.yml` at line 18, Update the usage comment at the top of eval_react_simple.yml to point to the actual file location: replace the incorrect path segment "examples/benchmarks/bfcl/configs/" with the correct "examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/" so the usage line in eval_react_simple.yml accurately reflects the file's real path.
examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_all_domains.yml-17-17 (1)
17-17: ⚠️ Potential issue | 🟡 Minor
Fix the usage path in the comment.
The usage comment references `examples/benchmarks/agent_leaderboard/configs/` but this file is located at `examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/`. Update the path to match the actual file location.
📝 Proposed fix
```diff
-# Usage: nat eval --config_file examples/benchmarks/agent_leaderboard/configs/eval_all_domains.yml
+# Usage: nat eval --config_file examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_all_domains.yml
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_all_domains.yml` at line 17, The usage comment at the top of eval_all_domains.yml points to the wrong directory; open eval_all_domains.yml and update the "# Usage: nat eval ..." comment to reference the file's actual location under the src/nat_benchmark_agent_leaderboard/configs directory (so the path includes src/nat_benchmark_agent_leaderboard/configs/eval_all_domains.yml) so the example command shows the correct config file path.
examples/benchmarks/agent_leaderboard/README.md-72-85 (1)
72-85: ⚠️ Potential issue | 🟡 Minor
Add language specifiers to fenced code blocks.
Several code blocks are missing language specifiers, which impacts rendering and accessibility. The static analysis tool flagged lines 82 and 130.
📝 Proposed fixes
For the output block at lines 73-79, add `text` or `console`:
````diff
 **Expected output:**
-```
+```text
 INFO - Loading agent leaderboard v2 dataset from Hugging Face...
````
For the code block around line 130:
````diff
-```
+```bash
 unset AGENT_LEADERBOARD_LIMIT
````
Also applies to: 127-131
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/agent_leaderboard/README.md` around lines 72 - 85, Update the README fenced code blocks to include language specifiers: tag the expected output block (the multiline INFO output) with a plain text specifier such as "text" or "console", and tag the shell snippets that set/unset environment variables (the export AGENT_LEADERBOARD_DATA line and the unset AGENT_LEADERBOARD_LIMIT line) with "bash" so they render correctly; locate the blocks by searching for the INFO output and the AGENT_LEADERBOARD_DATA / AGENT_LEADERBOARD_LIMIT examples in README.md and add the appropriate language labels to the opening triple-backticks.
examples/benchmarks/agent_leaderboard/README.md-22-22 (1)
22-22: ⚠️ Potential issue | 🟡 Minor
Spell out "NeMo Agent Toolkit" instead of "NAT".
Per documentation guidelines, avoid using "NAT" as an acronym in Markdown files. Spell it out as "NeMo Agent Toolkit" unless referring to code identifiers in backticks.
📝 Proposed fix
```diff
-Evaluate NAT agent workflows against the [Galileo Agent Leaderboard v2](https://huggingface.co/datasets/galileo-ai/agent-leaderboard-v2) benchmark. This benchmark tests whether an agent can select the correct tools for real-world use cases across multiple domains.
+Evaluate NeMo Agent Toolkit agent workflows against the [Galileo Agent Leaderboard v2](https://huggingface.co/datasets/galileo-ai/agent-leaderboard-v2) benchmark. This benchmark tests whether an agent can select the correct tools for real-world use cases across multiple domains.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/agent_leaderboard/README.md` at line 22, Replace the acronym "NAT" with the full name "NeMo Agent Toolkit" in the README sentence that currently reads "Evaluate NAT agent workflows..." (leave any backticked code identifiers unchanged); update the sentence to "Evaluate NeMo Agent Toolkit agent workflows against the Galileo Agent Leaderboard v2 benchmark" so the acronym is spelled out per documentation guidelines.
examples/benchmarks/agent_leaderboard/README.md-138-145 (1)
138-145: ⚠️ Potential issue | 🟡 Minor
Documentation style issues.
- Line 138: Uses "NAT's eval config" — avoid possessive "'s" with inanimate objects and spell out "NeMo Agent Toolkit"
- Line 145: Uses "scenario's user goals" — rephrase to avoid possessive with inanimate object
📝 Proposed fixes
```diff
-This example uses the **Tool Selection Quality (TSQ)** evaluator (`_type: agent_leaderboard_tsq` in the eval config). It compares the tool calls the agent made (captured by the workflow via `ToolIntentBuffer`) against the expected tool calls derived from the scenario's user goals.
+This example uses the **Tool Selection Quality (TSQ)** evaluator (`_type: agent_leaderboard_tsq` in the eval config). It compares the tool calls the agent made (captured by the workflow via `ToolIntentBuffer`) against the expected tool calls derived from the user goals of each scenario.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/agent_leaderboard/README.md` around lines 138 - 145, Replace possessive phrasing: change "NAT's eval config" to "NeMo Agent Toolkit eval config" (reference: Tool Selection Quality (TSQ) evaluator `_type: agent_leaderboard_tsq`) and rephrase "scenario's user goals" to "user goals in the scenario" (reference: how the workflow captures tool calls via `ToolIntentBuffer`) so the README avoids possessive "'s" with inanimate objects.
packages/nvidia_nat_benchmarks/configs/agent_leaderboard_eval.yaml-1-11 (1)
1-11: ⚠️ Potential issue | 🟡 Minor
Missing Apache 2.0 license header.
This YAML config file is missing the SPDX license header that's present in other config files (e.g., `eval_ast_simple.yml`). Per coding guidelines, all code should include the Apache 2.0 header.
📝 Proposed fix - add at the beginning of the file
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # Agent Leaderboard v2 evaluation config for NAT
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_benchmarks/configs/agent_leaderboard_eval.yaml` around lines 1 - 11, Add the Apache-2.0 SPDX license header to the top of the YAML config file (agent_leaderboard_eval.yaml) to match other configs like eval_ast_simple.yml; insert a single-line SPDX identifier (e.g., "SPDX-License-Identifier: Apache-2.0") followed by the standard Apache 2.0 copyright block used across the repo so the file has the same license header format as other config files.
packages/nvidia_nat_benchmarks/tests/agent_leaderboard/test_evaluator.py-24-43 (1)
24-43: ⚠️ Potential issue | 🟡 Minor
Incorrect return type annotation.
The function signature declares `tuple[EvalInputItem, list]` as the return type, but the function only returns `item` (an `EvalInputItem`). Additionally, type hints are missing on the parameters.
🐛 Proposed fix
```diff
-def _make_entry(expected_tools: list[str], predicted_tools: list[str]) -> tuple[EvalInputItem, list]:
+def _make_entry(expected_tools: list[str], predicted_tools: list[str]) -> EvalInputItem:
     """Create a test item with expected and predicted tool calls."""
```
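The corrected signature can be exercised standalone; `EvalInputItem` below is a tiny dataclass stand-in for the real model, so the shapes are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class EvalInputItem:
    """Stand-in for the toolkit's EvalInputItem model."""
    expected_tools: list[str] = field(default_factory=list)
    predicted_tools: list[str] = field(default_factory=list)


def _make_entry(expected_tools: list[str], predicted_tools: list[str]) -> EvalInputItem:
    """Create a test item with expected and predicted tool calls."""
    # Returns a single item, so the annotation is EvalInputItem,
    # not tuple[EvalInputItem, list].
    return EvalInputItem(expected_tools=list(expected_tools),
                         predicted_tools=list(predicted_tools))


item = _make_entry(["search"], ["search", "calculator"])
print(type(item).__name__, item.expected_tools, item.predicted_tools)
```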
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_benchmarks/tests/agent_leaderboard/test_evaluator.py` around lines 24 - 43, The _make_entry function's signature incorrectly declares a return type of tuple[EvalInputItem, list] and lacks parameter type hints; update the signature of _make_entry(expected_tools: list[str], predicted_tools: list[str]) -> EvalInputItem (or use Sequence[str] if preferred) to reflect that it returns a single EvalInputItem, and ensure the parameter annotations are added so expected_tools and predicted_tools are typed as lists of strings; no other changes to the function body are required.
examples/benchmarks/byob/README.md-137-149 (1)
137-149: ⚠️ Potential issue | 🟡 Minor
Add language specifiers to fenced code blocks.
Static analysis flags several code blocks without language specifiers (MD040). For expected output blocks, use `text` or `console`.
📝 Example fix for expected output block
````diff
 **Expected output:**
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_exact_match.yml
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/byob/README.md` around lines 137 - 149, The fenced "Expected output" code blocks lack language specifiers (MD040); update the triple-backtick fences in the README's "Expected output" section to include a language specifier such as text or console (e.g., ```text) so static analysis no longer flags the code blocks; search for the "Expected output:" block and replace the opening backticks with a specifier for each expected-output fenced block.
examples/benchmarks/byob/README.md-22-32 (1)
22-32: ⚠️ Potential issue | 🟡 Minor
Replace "NAT" with "NeMo Agent Toolkit" per coding guidelines.
The coding guidelines specify: "Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit." Multiple occurrences on these lines use "NAT" without backticks.
📝 Example fixes
```diff
-Run [NeMo Evaluator BYOB](https://docs.nvidia.com/nemo/evaluator/latest/) benchmarks directly on NAT agent workflows — without re-implementing the dataset loader or scorer logic.
+Run [NeMo Evaluator BYOB](https://docs.nvidia.com/nemo/evaluator/latest/) benchmarks directly on NeMo Agent Toolkit workflows — without re-implementing the dataset loader or scorer logic.
```
```diff
-**This integration reuses the benchmark definition, dataset loading, and scorer functions from NeMo Evaluator, but replaces the model-calling step with NAT's workflow execution.** This means:
+**This integration reuses the benchmark definition, dataset loading, and scorer functions from NeMo Evaluator, but replaces the model-calling step with NeMo Agent Toolkit workflow execution.** This means:
```
```diff
-- **NAT handles the agent execution** — instead of calling a model endpoint, NAT runs its own workflow
+- **NeMo Agent Toolkit handles the agent execution** — instead of calling a model endpoint, NeMo Agent Toolkit runs its own workflow
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/byob/README.md` around lines 22 - 32, Replace all occurrences of the acronym "NAT" in the README paragraph with the full term "NeMo Agent Toolkit" (no backticks), e.g. change "NAT agent workflows", "NAT handles the agent execution", and "NAT handles all LLM calls upstream" to "NeMo Agent Toolkit agent workflows", "NeMo Agent Toolkit handles the agent execution", and "NeMo Agent Toolkit handles all LLM calls upstream" respectively; ensure there are no leftover "NAT" usages or inline code formatting for these phrases.
packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/dataset.py-165-171 (1)
165-171: ⚠️ Potential issue | 🟡 Minor
Type mismatch: `config.domains` is `list[AgentLeaderboardDomain]`, not `list[str]`.
The lambda passes `domains=config.domains` directly, but `load_agent_leaderboard_dataset` expects `domains: list[str] | None`. While `StrEnum` values may work due to string inheritance, explicit conversion would be safer and consistent with the `parser()` method in `AgentLeaderboardDatasetConfig` (line 67 of config.py), which does `[str(d) for d in self.domains]`.
🔧 Proposed fix
```diff
 yield DatasetLoaderInfo(
     config=config,
-    load_fn=lambda fp, **kw: load_agent_leaderboard_dataset(fp, domains=config.domains, limit=config.limit, **kw),
+    load_fn=lambda fp, **kw: load_agent_leaderboard_dataset(
+        fp, domains=[str(d) for d in config.domains], limit=config.limit, **kw
+    ),
     description="Galileo Agent Leaderboard v2 dataset loader",
 )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/dataset.py` around lines 165 - 171, register_agent_leaderboard_dataset_loader passes config.domains (type list[AgentLeaderboardDomain]) directly to load_agent_leaderboard_dataset which expects list[str] | None; change the lambda to convert domains to strings (e.g., use [str(d) for d in config.domains] or None when config.domains is falsy) so the call to load_agent_leaderboard_dataset receives list[str] consistent with AgentLeaderboardDatasetConfig.parser() behavior.
packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/evaluator.py-40-46 (1)
40-46: ⚠️ Potential issue | 🟡 Minor
Unused `target_field` parameter - potential missing functionality.
The `target_field` parameter is passed from `bench.target_field` but never used in the function body. This may indicate intended functionality that wasn't implemented (e.g., extracting the target from a specific field in metadata).
🔧 If target_field should specify where to extract target from metadata
```diff
 # Get target value
-target = item.expected_output_obj
+target = metadata.get(target_field, item.expected_output_obj) if target_field else item.expected_output_obj
 if isinstance(target, str):
```
Alternatively, if `target_field` is intentionally unused, remove it from the signature and the `partial()` call in `byob_evaluator_function`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/evaluator.py` around lines 40 - 46, The parameter target_field on _evaluate_single is unused; either implement extracting the target from the provided EvalInputItem using that field or remove the parameter and its use in byob_evaluator_function.partial(). If the intent is to read the ground-truth from item.metadata, update _evaluate_single (the function named _evaluate_single) to pull target = item.metadata.get(target_field) (or handle nested keys as your metadata model requires), use that value when calling scorer_fn/when populating EvalOutputItem, and validate/match types; otherwise remove target_field from _evaluate_single's signature and remove the corresponding argument passed in the partial() call inside byob_evaluator_function so signatures align.
packages/nvidia_nat_benchmarks/tests/bfcl/test_tool_intent_stubs.py-200-200 (1)
200-200: ⚠️ Potential issue | 🟡 Minor
Remove the unused `desc` unpack target.
This leaves a new RUF059 warning behind. Rename it to `_`/`_desc` or stop unpacking the description.
🧹 Minimal fix
```diff
-    stub_fn, _, desc = create_tool_stub_function(schema, buf, canned_response="OK")
+    stub_fn, _, _ = create_tool_stub_function(schema, buf, canned_response="OK")
```
As per coding guidelines, "Use ruff (via 'ruff check --fix') for linting with configuration in pyproject.toml; fix warnings unless explicitly ignored in pyproject.toml."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_benchmarks/tests/bfcl/test_tool_intent_stubs.py` at line 200, The test unpacks three values from create_tool_stub_function into stub_fn, _, desc but never uses desc causing a RUF059 warning; modify the unpacking in the test (where create_tool_stub_function(schema, buf, canned_response="OK") is called) to either drop the description entirely or rename it to a throwaway variable (e.g., replace desc with _ or _desc) so the unused variable is suppressed; ensure you only change the unpack target and do not alter create_tool_stub_function itself.
packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py-173-177 (1)
173-177: ⚠️ Potential issue | 🟡 Minor
Add proper type hints to match the actual return value.
The return annotation `tuple[callable, BaseModel | None, str]` is incorrect on three counts:
- `callable` is the runtime builtin, not a typing construct; should be `Callable` from `collections.abc`
- The function returns the `PermissiveToolInput` class itself, not an instance; should be `type[PermissiveToolInput]`, not `BaseModel | None`
- The first element is an async function, which should be expressed as `Callable[[object], Awaitable[str]]` to capture the awaitable nature
Line 218 returns `tool_stub_fn, PermissiveToolInput, tool_description`, where `tool_stub_fn` is defined at line 198 as `async def tool_stub_fn(input_params: object) -> str`.
Suggested fix
```diff
+from collections.abc import Awaitable, Callable
@@
-) -> tuple[callable, BaseModel | None, str]:
+) -> tuple[Callable[[object], Awaitable[str]], type[PermissiveToolInput], str]:
```
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py` around lines 173 - 177, The return type of create_tool_stub_function is wrong; update its annotation to reflect it returns an async callable (tool_stub_fn), the class PermissiveToolInput (the type, not an instance), and a str. Change the signature to return tuple[Callable[[object], Awaitable[str]], type[PermissiveToolInput], str] (import Callable from collections.abc and Awaitable from typing if needed) and ensure references to ToolIntentBuffer, tool_stub_fn and PermissiveToolInput remain unchanged.
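The suggested annotation can be checked with a runnable sketch; `PermissiveToolInput` and `create_tool_stub_function` below are simplified stand-ins for the real symbols, kept only to show the shape of the tuple.

```python
import asyncio
from collections.abc import Awaitable, Callable


class PermissiveToolInput:
    """Stand-in for the pydantic input model in the real module."""


def create_tool_stub_function(
    description: str,
) -> tuple[Callable[[object], Awaitable[str]], type[PermissiveToolInput], str]:
    # First element: an async function, so calling it yields an Awaitable[str].
    # Second element: the input *class* itself, hence type[PermissiveToolInput].
    async def tool_stub_fn(input_params: object) -> str:
        return f"stub received {input_params!r}"

    return tool_stub_fn, PermissiveToolInput, description


stub_fn, input_cls, desc = create_tool_stub_function("demo tool")
print(asyncio.run(stub_fn({"x": 1})), input_cls.__name__, desc)
```

With this shape, a type checker can verify both that `await stub_fn(...)` produces a `str` and that callers instantiate `input_cls` rather than treating it as an instance.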
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c0dd3aaa-6d1c-46ae-9f1c-7e26ccea2398
📒 Files selected for processing (96)
- docs/source/improve-workflows/evaluate.md
- examples/benchmarks/agent_leaderboard/README.md
- examples/benchmarks/agent_leaderboard/configs
- examples/benchmarks/agent_leaderboard/pyproject.toml
- examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/__init__.py
- examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_all_domains.yml
- examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/configs/eval_banking.yml
- examples/benchmarks/agent_leaderboard/src/nat_benchmark_agent_leaderboard/register.py
- examples/benchmarks/bfcl/README.md
- examples/benchmarks/bfcl/configs
- examples/benchmarks/bfcl/pyproject.toml
- examples/benchmarks/bfcl/src/nat_benchmark_bfcl/__init__.py
- examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/eval_ast_simple.yml
- examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/eval_fc_simple.yml
- examples/benchmarks/bfcl/src/nat_benchmark_bfcl/configs/eval_react_simple.yml
- examples/benchmarks/bfcl/src/nat_benchmark_bfcl/register.py
- examples/benchmarks/byob/README.md
- examples/benchmarks/byob/configs
- examples/benchmarks/byob/pyproject.toml
- examples/benchmarks/byob/src/nat_benchmark_byob/__init__.py
- examples/benchmarks/byob/src/nat_benchmark_byob/configs/eval_exact_match.yml
- examples/benchmarks/byob/src/nat_benchmark_byob/register.py
- examples/benchmarks/tooltalk/README.md
- examples/benchmarks/tooltalk/configs
- examples/benchmarks/tooltalk/pyproject.toml
- examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/__init__.py
- examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/configs/eval_easy.yml
- examples/benchmarks/tooltalk/src/nat_benchmark_tooltalk/register.py
- packages/nvidia_nat_benchmarks/configs/agent_leaderboard_eval.yaml
- packages/nvidia_nat_benchmarks/configs/bfcl_ast_eval.yaml
- packages/nvidia_nat_benchmarks/configs/tooltalk_eval.yaml
- packages/nvidia_nat_benchmarks/pyproject.toml
- packages/nvidia_nat_benchmarks/src/nat/meta/pypi.md
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/__init__.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/__init__.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/config.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/dataset.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/evaluator.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/agent_leaderboard/workflow.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/__init__.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/config.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/dataset.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/evaluator.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/tool_intent_stubs.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/workflow_ast.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/workflow_fc.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/bfcl/workflow_react.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/__init__.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/config.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/dataset.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/byob/evaluator.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/__init__.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/bfcl_helpers.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/eval_helpers.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/register.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/tooltalk/__init__.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/tooltalk/config.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/tooltalk/dataset.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/tooltalk/evaluator.py
- packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/tooltalk/workflow.py
- packages/nvidia_nat_benchmarks/tests/__init__.py
- packages/nvidia_nat_benchmarks/tests/agent_leaderboard/__init__.py
- packages/nvidia_nat_benchmarks/tests/agent_leaderboard/conftest.py
- packages/nvidia_nat_benchmarks/tests/agent_leaderboard/test_dataset.py
- packages/nvidia_nat_benchmarks/tests/agent_leaderboard/test_evaluator.py
- packages/nvidia_nat_benchmarks/tests/agent_leaderboard/test_evaluator_comprehensive.py
- packages/nvidia_nat_benchmarks/tests/bfcl/__init__.py
- packages/nvidia_nat_benchmarks/tests/bfcl/conftest.py
- packages/nvidia_nat_benchmarks/tests/bfcl/test_dataset.py
- packages/nvidia_nat_benchmarks/tests/bfcl/test_evaluator.py
- packages/nvidia_nat_benchmarks/tests/bfcl/test_evaluator_comprehensive.py
- packages/nvidia_nat_benchmarks/tests/bfcl/test_integration.py
- packages/nvidia_nat_benchmarks/tests/bfcl/test_regression.py
- packages/nvidia_nat_benchmarks/tests/bfcl/test_tool_intent_stubs.py
- packages/nvidia_nat_benchmarks/tests/byob/__init__.py
- packages/nvidia_nat_benchmarks/tests/byob/conftest.py
- packages/nvidia_nat_benchmarks/tests/byob/fixtures/sample_benchmark.py
- packages/nvidia_nat_benchmarks/tests/byob/test_evaluator.py
- packages/nvidia_nat_benchmarks/tests/byob/test_evaluator_comprehensive.py
- packages/nvidia_nat_benchmarks/tests/byob/test_integration.py
- packages/nvidia_nat_benchmarks/tests/common/__init__.py
- packages/nvidia_nat_benchmarks/tests/common/test_bfcl_helpers.py
- packages/nvidia_nat_benchmarks/tests/common/test_config_enums.py
- packages/nvidia_nat_benchmarks/tests/common/test_eval_helpers.py
- packages/nvidia_nat_benchmarks/tests/tooltalk/__init__.py
- packages/nvidia_nat_benchmarks/tests/tooltalk/conftest.py
- packages/nvidia_nat_benchmarks/tests/tooltalk/test_dataset.py
- packages/nvidia_nat_benchmarks/tests/tooltalk/test_evaluator.py
- packages/nvidia_nat_benchmarks/tests/tooltalk/test_evaluator_comprehensive.py
- packages/nvidia_nat_benchmarks/tests/tooltalk/test_integration.py
- packages/nvidia_nat_
benchmarks/tests/tooltalk/test_regression.pypackages/nvidia_nat_core/src/nat/data_models/evaluate_config.pypackages/nvidia_nat_core/tests/nat/builder/test_per_user_function_group_eval_isolation.pypackages/nvidia_nat_eval/src/nat/plugins/eval/dataset_handler/dataset_handler.pypackages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py
packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py
```python
@field_validator('input_params', mode='before')
@classmethod
def parse_string_to_dict(cls, v: Any) -> dict[str, Any]:
    """Convert JSON string to dict if needed."""
    if isinstance(v, str):
        try:
            normalized = v.replace("'", '"')
            return json.loads(normalized)
        except json.JSONDecodeError:
            logger.warning("Failed to parse input_params string as JSON: %s", v[:100])
            return {}
    elif isinstance(v, dict):
        return v
    else:
        logger.warning("Unexpected input_params type: %s", type(v))
        return {}
```
Blind quote normalization corrupts some valid tool inputs, then drops them.
`v.replace("'", '"')` turns a valid payload like `{"name": "O'Reilly"}` into invalid JSON. The failure path then logs the raw tool input and collapses it to `{}`, which both leaks user-supplied data and loses the stripped-string fallback behavior that the existing ReAct path relies on. Please try `json.loads()` on the original string first, then use a safer fallback parser or preserve the stripped raw string instead of zeroing out the params.

Based on learnings, the existing ReAct path intentionally preserves stripped `tool_input` when structured parsing fails, and as per coding guidelines, "Validate and sanitise all user input, especially in web or CLI interfaces."
Also applies to: 198-207
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`
around lines 155 - 170, The parse_string_to_dict field validator is currently
replacing single quotes before attempting JSON decode which corrupts valid
inputs like {"name":"O'Reilly"} and then returns {} while logging the raw input;
change it to first try json.loads on the original string, and only if that fails
attempt a safer fallback (e.g., ast.literal_eval or a controlled
quote-normalization) or preserve the stripped raw string in a non-sensitive
container (e.g., return {"_raw": v.strip()}) so consumers (the ReAct path) can
still access the fallback tool_input; also avoid logging raw user input—log a
truncated/sanitized preview instead. Apply the same fix pattern to the similar
validator block around the other occurrence noted (the block at the other
location referenced in the comment).
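The parse order the prompt describes can be sketched as follows. This is a hypothetical standalone sketch, not the plugin's actual validator: the function name mirrors the one under review, but the `{"_raw": ...}` wrapper key and the logging details are illustrative assumptions.

```python
import ast
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def parse_string_to_dict(v: Any) -> dict[str, Any]:
    """Sketch: strict JSON first, Python-literal fallback second,
    and preserve the raw string instead of dropping it."""
    if isinstance(v, dict):
        return v
    if not isinstance(v, str):
        logger.warning("Unexpected input_params type: %s", type(v))
        return {}
    # 1. Strict JSON on the original string -- never mutate it first,
    #    so payloads like {"name": "O'Reilly"} survive intact.
    try:
        result = json.loads(v)
        if isinstance(result, dict):
            return result
    except json.JSONDecodeError:
        pass
    # 2. Python-literal fallback handles single-quoted dict strings
    #    like "{'a': 1}" without corrupting embedded apostrophes.
    try:
        result = ast.literal_eval(v)
        if isinstance(result, dict):
            return result
    except (ValueError, SyntaxError):
        pass
    # 3. Preserve the stripped raw input so the downstream ReAct
    #    fallback still sees it; log only a length, not the content.
    logger.warning("Could not parse input_params; preserving raw (%d chars)", len(v))
    return {"_raw": v.strip()}
```

Trying `json.loads()` before any quote rewriting is the key point: normalization only runs (implicitly, via `ast.literal_eval`) when the strict parse has already failed.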
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
Actionable comments posted: 2
♻️ Duplicate comments (2)
packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py (2)
**155-175**: ⚠️ Potential issue | 🟠 Major — Don't collapse unparsed tool input to `{}`.

This now handles JSON and Python-style dict strings, but the failure path still returns `{}`. That drops the original tool input and records an empty intent instead of the fallback value the ReAct path relies on. Return a sanitized raw wrapper instead of zeroing it out. Based on learnings, the existing ReAct path intentionally preserves stripped `tool_input` when structured parsing fails as fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py` around lines 155 - 175, The parse_string_to_dict field_validator currently returns an empty dict on parse failure which discards the original tool input; change the failure path in parse_string_to_dict (the `@field_validator` for 'input_params') to return a sanitized raw-wrapper containing the original (stripped) string instead of {} so the ReAct fallback can see the raw tool_input (e.g., return a dict like a single-key wrapper with the stripped original string and keep the existing logger warning). Ensure you only modify the failure branch after both json.loads and ast.literal_eval attempts.
**84-118**: ⚠️ Potential issue | 🟠 Major — Keep `ToolIntentBuffer` state per scenario.

`record()` writes the global registry per scenario, but `self.intents` is still one shared list. Reusing one buffer across evaluation items makes `get_intents()` leak intents between scenarios, and `clear()` wipes the local history for all of them. Store the local view per scenario too, or derive `get_intents()`/`clear()` from the current-scenario registry.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py` around lines 84 - 118, ToolIntentBuffer currently uses a single shared list self.intents causing cross-scenario leaks; change the internal storage to be per-scenario (e.g. self.intents_by_scenario: dict[str, list[dict[str, Any]]]) in __init__, update record(tool_name, parameters) to use get_current_scenario_id() to append the intent into both self.intents_by_scenario[current_scenario] and _GLOBAL_INTENT_REGISTRY[current_scenario] (creating empty lists if missing), make get_intents() return a copy of self.intents_by_scenario[get_current_scenario_id()] (or [] if none), and make clear() only clear the list for get_current_scenario_id() in both self.intents_by_scenario and _GLOBAL_INTENT_REGISTRY and log that scenario; keep references to the same symbol names (__init__, record, get_intents, clear, _GLOBAL_INTENT_REGISTRY) so the changes are localized.
🧹 Nitpick comments (20)
examples/benchmarks/tooltalk/README.md (3)
**85-88**: Add a language specifier to the fenced code block. Expected output blocks should have a language specifier such as `text`.

🔧 Suggested fix

```diff
 **Expected output:**
-```
+```text
 export TOOLTALK_DATABASE_DIR=/path/to/.venv/lib/python3.11/site-packages/tooltalk/data/databases
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/tooltalk/README.md` around lines 85 - 88, The fenced code block showing the environment variable exports (TOOLTALK_DATABASE_DIR and TOOLTALK_DATASET_DIR) lacks a language specifier; add a language tag (e.g., text) to the opening triple-backtick so the block becomes a labeled text block and renders correctly in README.md—update the block containing the two lines with TOOLTALK_DATABASE_DIR and TOOLTALK_DATASET_DIR to start with ```text.
**107-126**: Add a language specifier to the fenced code block.

🔧 Suggested fix

```diff
 **Expected output (during run):**
-```
+```text
 INFO - Starting evaluation run with config file: examples/benchmarks/tooltalk/configs/eval_easy.yml
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/tooltalk/README.md` around lines 107 - 126, The fenced code block showing the "Expected output (during run):" should include a language specifier to enable proper syntax highlighting; update the opening fence for that block from ``` to ```text (or another appropriate language) in the README example block so the lines starting with "INFO - Starting evaluation run..." and the subsequent output are marked as plain text.
**182-188**: Add a language specifier to the fenced code block.

🔧 Suggested fix

```diff
 **Example output:**
-```
+```text
 Average success rate: 0.345
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/tooltalk/README.md` around lines 182 - 188, The fenced code block showing benchmark output uses a bare triple-backtick; update the opening fence to include a language specifier (e.g., change the opening "```" to "```text") so syntax highlighters treat the block as plain text; locate the fenced block containing "Average success rate: 0.345" and modify only its opening fence.

examples/benchmarks/byob/README.md (6)
**138-149**: Add a language specifier to the fenced code block. Expected output blocks should have a language specifier such as `text`.

🔧 Suggested fix

```diff
 **Expected output:**
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_exact_match.yml
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/byob/README.md` around lines 138 - 149, The fenced code block in the README example lacks a language specifier; update the triple-backtick fence in the BYOB evaluation output block to include "text" (i.e., change ``` to ```text) so syntax highlighting/rendering is correct; locate the output block in examples/benchmarks/byob/README.md (the block that begins with "INFO - Starting evaluation run...") and add the language tag to the opening fence.
**39-46**: Spell out "NeMo Agent Toolkit" instead of the "NAT" acronym. Per coding guidelines, "NAT" should not be used as an acronym in documentation. Replace "NAT" with "NeMo Agent Toolkit" where it refers to the toolkit (not package names in backticks).
📝 Suggested fix
```diff
 ├───────────────┬─────────────────────┤
-│ NeMo Evaluator│ NAT Integration     │
+│ NeMo Evaluator│ NeMo Agent Toolkit  │
 │ (standalone)  │ (this example)      │
 │               │                     │
 │ load_dataset()│ load_dataset()      │ ← Same function, reused
-│ render_prompt │ NAT workflow        │ ← NAT replaces model calling
+│ render_prompt │ Toolkit workflow    │ ← Toolkit replaces model calling
 │ call_model()  │ (agents, tools)     │
 │ scorer_fn()   │ scorer_fn()         │ ← Same scorer, reused
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/byob/README.md` around lines 39 - 46, Replace occurrences of the acronym "NAT" that refer to the toolkit in the README text with the full phrase "NeMo Agent Toolkit" (e.g., change "NAT Integration", "NAT workflow", and "NAT replaces model calling" to use "NeMo Agent Toolkit") while leaving any package names or code identifiers in backticks untouched; update the table cells and surrounding descriptive text accordingly to maintain sentence grammar and capitalization.
**55**: Avoid possessive 's with inanimate objects. Per documentation guidelines, rephrase "benchmark's `extra` config" to avoid the possessive form with inanimate objects.

📝 Suggested fix

Note: Line 55 appears fine ("benchmark definition"). The issue is at line 293: "Benchmark's `extra` config" → "The `extra` config of the benchmark".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/byob/README.md` at line 55, Replace the possessive phrase "Benchmark's `extra` config" with a non-possessive form such as "The `extra` config of the benchmark" (search for the exact string "Benchmark's `extra` config" in the README and update it) so inanimate object wording follows the documentation guideline.
**170-172**: Avoid possessive 's with inanimate objects. Rephrase "scorer's output dict" to "the output dict of the scorer" or "the scorer output dict".

📝 Suggested fix

```diff
-The `score_field` parameter controls which key from the scorer's output dict becomes the item's primary score. For `exact_match` and `contains`, this is `correct` (boolean → 1.0 or 0.0). For `f1_token`, you'd set `score_field: f1` to get the F1 float. The `average_score` in the output is the mean of all items' primary scores.
+The `score_field` parameter controls which key from the scorer output dict becomes the primary score of the item. For `exact_match` and `contains`, this is `correct` (boolean → 1.0 or 0.0). For `f1_token`, you'd set `score_field: f1` to get the F1 float. The `average_score` in the output is the mean of primary scores across all items.

-The metrics available in each item's `reasoning` dict depend entirely on what the scorer function returns — the evaluator passes through the full scorer output without modification.
+The metrics available in the `reasoning` dict of each item depend entirely on what the scorer function returns — the evaluator passes through the full scorer output without modification.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/byob/README.md` around lines 170 - 172, Update the wording to avoid using a possessive with an inanimate object: replace "scorer's output dict" with "the output dict of the scorer" (or "the scorer output dict") in the README text that explains score_field and reasoning, so the sentence reads e.g. "The metrics available in each item's `reasoning` dict depend entirely on the output dict of the scorer — the evaluator passes through the full scorer output without modification."
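Assuming the scorer interface this comment describes (a scorer returning a metrics dict, with `score_field` selecting the primary score), the selection and averaging logic can be sketched as below; the function names here are hypothetical, not the BYOB evaluator's actual API.

```python
from typing import Any


def exact_match_scorer(prediction: str, target: str) -> dict[str, Any]:
    # Hypothetical scorer: returns a metrics dict, as the review describes.
    correct = prediction.strip() == target.strip()
    return {"correct": correct}


def primary_score(scorer_out: dict[str, Any], score_field: str) -> float:
    # Booleans coerce to 1.0 / 0.0; floats (e.g. an "f1" key) pass through.
    return float(scorer_out[score_field])


def average_score(items: list[dict[str, Any]], score_field: str) -> float:
    # Mean of each item's primary score across the run.
    return sum(primary_score(i, score_field) for i in items) / len(items)
```

For `exact_match`, `score_field` would be `correct`; for an F1-style scorer, setting `score_field: f1` would pick the float metric instead.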
**293**: Avoid possessive 's with inanimate objects. Rephrase "Benchmark's `extra` config" to "The `extra` config of the benchmark".

📝 Suggested fix

```diff
-| `config` | `Dict[str, Any]` | Benchmark's `extra` config |
+| `config` | `Dict[str, Any]` | The `extra` config of the benchmark |
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/byob/README.md` at line 293, Change the possessive phrase "Benchmark's `extra` config" to use non-possessive wording: replace that cell value (the table row for `config` / `Dict[str, Any]`) so it reads "The `extra` config of the benchmark" instead of "Benchmark's `extra` config".
**189-194**: Add a language specifier to the fenced code block.

🔧 Suggested fix

```diff
 **Example output:**
-```
+```text
 Average score: 0.667
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/byob/README.md` around lines 189 - 194, The fenced code block showing the benchmark output (the block starting with "Average score: 0.667" and the lines "0: score=1.0 correct=True" etc.) is missing a language specifier; update the opening fence to include a language (for example use "text" or "console" like ```text) so markdown renderers apply correct formatting and avoid syntax highlighting issues.

pyproject.toml (2)
**266-270**: Same alphabetical ordering issue in `uv.sources`. For consistency, the `uv.sources` section should also be alphabetically sorted, placing `nat_automated_description_generation` before the `nat_benchmark_*` entries.

🔧 Suggested ordering fix

```diff
 nat_autogen_demo = { path = "examples/frameworks/nat_autogen_demo", editable = true }
+nat_automated_description_generation = { path = "examples/custom_functions/automated_description_generation", editable = true }
 nat_benchmark_agent_leaderboard = { path = "examples/benchmarks/agent_leaderboard", editable = true }
 nat_benchmark_bfcl = { path = "examples/benchmarks/bfcl", editable = true }
 nat_benchmark_byob = { path = "examples/benchmarks/byob", editable = true }
 nat_benchmark_tooltalk = { path = "examples/benchmarks/tooltalk", editable = true }
-nat_automated_description_generation = { path = "examples/custom_functions/automated_description_generation", editable = true }
 nat_currency_agent_a2a = { path = "examples/A2A/currency_agent_a2a", editable = true }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 266 - 270, The entries in uv.sources are not alphabetically ordered: move the nat_automated_description_generation entry so it appears before the nat_benchmark_* entries. Specifically, within the uv.sources block reorder so nat_automated_description_generation appears above nat_benchmark_agent_leaderboard, nat_benchmark_bfcl, nat_benchmark_byob, and nat_benchmark_tooltalk to maintain consistent alphabetical sorting.
**126-130**: Examples list is not sorted alphabetically. The benchmark entries are placed between `nat_autogen_demo` and `nat_automated_description_generation`, but alphabetically `nat_automated_description_generation` should come before any `nat_benchmark_*` entries since "automated" < "benchmark".

🔧 Suggested ordering fix

```diff
 "nat_autogen_demo",
+"nat_automated_description_generation",
 "nat_benchmark_agent_leaderboard",
 "nat_benchmark_bfcl",
 "nat_benchmark_byob",
 "nat_benchmark_tooltalk",
-"nat_automated_description_generation",
 "nat_currency_agent_a2a",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 126 - 130, The examples list in pyproject.toml is not alphabetically ordered; move "nat_automated_description_generation" so it appears before the "nat_benchmark_*" entries (e.g., nat_benchmark_agent_leaderboard, nat_benchmark_bfcl, nat_benchmark_byob, nat_benchmark_tooltalk) so the sequence is alphabetically sorted; update the examples array to place nat_automated_description_generation immediately before the nat_benchmark_* items.

examples/benchmarks/agent_leaderboard/README.md (4)
**138**: Avoid possessive 's with inanimate objects. Per documentation guidelines, consider rewording "scenario's user goals" to "user goals from the scenario" or "user goals of the scenario".

📝 Suggested fix

```diff
-This example uses the **Tool Selection Quality (TSQ)** evaluator (`_type: agent_leaderboard_tsq` in the eval config). It compares the tool calls the agent made (captured by the workflow via `ToolIntentBuffer`) against the expected tool calls derived from the scenario's user goals.
+This example uses the **Tool Selection Quality (TSQ)** evaluator (`_type: agent_leaderboard_tsq` in the eval config). It compares the tool calls the agent made (captured by the workflow via `ToolIntentBuffer`) against the expected tool calls derived from the user goals of the scenario.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/agent_leaderboard/README.md` at line 138, Reword the phrase "scenario's user goals" to avoid possessive 's with an inanimate object: update the README line that describes the Tool Selection Quality (TSQ) evaluator (`_type: agent_leaderboard_tsq`) and the ToolIntentBuffer comparison to use "user goals from the scenario" or "user goals of the scenario" so the sentence reads e.g. "...against the expected tool calls derived from the user goals from the scenario."
**191-203**: Add a language specifier to the fenced code block. This example output block should have a language specifier.

🔧 Suggested fix

```diff
-```
+```text
 Average TSQ F1: 0.650
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/agent_leaderboard/README.md` around lines 191 - 203, Update the fenced code block that starts with the example output containing "Average TSQ F1: 0.650" to include a language specifier (use "text") so the block begins with ```text instead of ```; locate the fenced block around the TSQ output (the lines listing "Average TSQ F1", "Total scenarios", and the scenario entries) and change the opening fence to ```text to satisfy the documentation lint rule.
**73-79**: Add a language specifier to the fenced code block. Static analysis flagged this code block as missing a language specifier. For expected terminal output, use `text` or `plaintext`.

🔧 Suggested fix

```diff
-```
+```text
 INFO - Loading agent leaderboard v2 dataset from Hugging Face...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/agent_leaderboard/README.md` around lines 73 - 79, The fenced code block showing the example terminal output (the block beginning with "INFO - Loading agent leaderboard v2 dataset from Hugging Face...") is missing a language specifier; update the triple-backtick fence to include a language tag such as text or plaintext (e.g., change ``` to ```text) so the block is correctly marked as plain text in README.md.
**112-123**: Add a language specifier to the fenced code block. This expected output block should have a language specifier.

🔧 Suggested fix

```diff
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_banking.yml
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/agent_leaderboard/README.md` around lines 112 - 123, The fenced code block in the README example output is missing a language specifier; update the opening fence for that block (the triple backticks before "INFO - Starting evaluation run...") to include a language tag such as "text" so Markdown renders it correctly (e.g., change ``` to ```text for the code block that shows the INFO/LOG output and evaluation summary).

examples/benchmarks/bfcl/README.md (5)
**213-225**: Add a language specifier to the fenced code block.

🔧 Suggested fix

```diff
 **Expected output:**
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_react_simple.yml
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/bfcl/README.md` around lines 213 - 225, The fenced code block showing the "Expected output" lacks a language specifier; update the opening triple-backtick for that block to include a language tag (e.g., change ``` to ```text) so the log lines like "INFO - Starting evaluation run with config file: .../eval_react_simple.yml" and the rest of the sample output render as plain text; locate the fenced block containing the example output and add the specifier to its opening ``` marker.
**184-196**: Add a language specifier to the fenced code block.

🔧 Suggested fix

```diff
 **Expected output:**
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_fc_simple.yml
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/bfcl/README.md` around lines 184 - 196, The fenced code block under the "Expected output:" section in README.md is missing a language specifier; update the triple-backtick fence that precedes the log sample (the block showing "INFO - Starting evaluation run...") to use a language tag (for example use ```text) so syntax highlighters render it correctly while leaving the block contents unchanged.
**278-287**: Add a language specifier to the fenced code block.

🔧 Suggested fix

```diff
 **Example output:**
-```
+```text
 Accuracy: 0.850
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/bfcl/README.md` around lines 278 - 287, The fenced code block showing the example output lacks a language specifier; update the block delimiter for the "Example output" snippet (the triple-backtick surrounding the Accuracy/ PASS/ FAIL lines) to include a language tag (e.g., replace ``` with ```text) so syntax highlighters treat it as plain text while preserving the existing content.
**157-168**: Add a language specifier to the fenced code block.

🔧 Suggested fix

```diff
 **Expected output:**
-```
+```text
 INFO - Starting evaluation run with config file: .../eval_ast_simple.yml
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/bfcl/README.md` around lines 157 - 168, Update the fenced code block in examples/benchmarks/bfcl/README.md that contains the evaluation log so it includes a language specifier (use "text") on the opening fence; replace the opening ``` with ```text for the block that starts with "INFO - Starting evaluation run with config file: .../eval_ast_simple.yml" so the block is properly labeled and rendered.
**130-133**: Add a language specifier to the fenced code block. Expected output blocks should have a language specifier such as `text`.

🔧 Suggested fix

```diff
 **Expected output:**
-```
+```text
 Dataset dir: /path/to/.venv/lib/python3.11/site-packages/bfcl/data
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/bfcl/README.md` around lines 130 - 133, The fenced code block showing "Dataset dir: /path/to/.venv/..." and "Answers dir: /path/to/.venv/..." in the README lacks a language specifier; update that fenced block (the triple-backtick block containing those two lines) to use a language tag (e.g., add "text" after the opening ```), so the block becomes ```text ... ``` to ensure proper formatting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`:
- Around line 121-143: The functions get_global_intents and clear_global_intents
currently use the literal default "current" and should instead accept
scenario_id: str | None = None and resolve None by calling
get_current_scenario_id(); update both functions (get_global_intents and
clear_global_intents) to: change the signature to accept Optional scenario_id,
call get_current_scenario_id() when scenario_id is None, then use the resolved
id when reading/writing _GLOBAL_INTENT_REGISTRY and logging so the active
scenario bucket is used rather than the literal "current".
- Around line 183-187: The public API return annotation on
create_tool_stub_function is wrong: change its signature to return
tuple[Callable[[object], Coroutine[Any, Any, str]], type[PermissiveToolInput] |
None, str] (the first element is an async call signature, the second is the
PermissiveToolInput class type, not an instance). Import Callable from
collections.abc and Coroutine and Any from typing (or typing.Any) and ensure
PermissiveToolInput is referenced in the annotation (use a forward reference or
real symbol) so pyright validates the function create_tool_stub_function
correctly.
---
Duplicate comments:
In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`:
- Around line 155-175: The parse_string_to_dict field_validator currently
returns an empty dict on parse failure which discards the original tool input;
change the failure path in parse_string_to_dict (the `@field_validator` for
'input_params') to return a sanitized raw-wrapper containing the original
(stripped) string instead of {} so the ReAct fallback can see the raw tool_input
(e.g., return a dict like a single-key wrapper with the stripped original string
and keep the existing logger warning). Ensure you only modify the failure branch
after both json.loads and ast.literal_eval attempts.
- Around line 84-118: ToolIntentBuffer currently uses a single shared list
self.intents causing cross-scenario leaks; change the internal storage to be
per-scenario (e.g. self.intents_by_scenario: dict[str, list[dict[str, Any]]]) in
__init__, update record(tool_name, parameters) to use get_current_scenario_id()
to append the intent into both self.intents_by_scenario[current_scenario] and
_GLOBAL_INTENT_REGISTRY[current_scenario] (creating empty lists if missing),
make get_intents() return a copy of
self.intents_by_scenario[get_current_scenario_id()] (or [] if none), and make
clear() only clear the list for get_current_scenario_id() in both
self.intents_by_scenario and _GLOBAL_INTENT_REGISTRY and log that scenario; keep
references to the same symbol names (__init__, record, get_intents, clear,
_GLOBAL_INTENT_REGISTRY) so the changes are localized.
---
Nitpick comments:
In `@examples/benchmarks/agent_leaderboard/README.md`:
- Line 138: Reword the phrase "scenario's user goals" to avoid possessive 's
with an inanimate object: update the README line that describes the Tool
Selection Quality (TSQ) evaluator (`_type: agent_leaderboard_tsq`) and the
ToolIntentBuffer comparison to use "user goals from the scenario" or "user goals
of the scenario" so the sentence reads e.g. "...against the expected tool calls
derived from the user goals from the scenario."
- Around line 191-203: Update the fenced code block that starts with the example
output containing "Average TSQ F1: 0.650" to include a language specifier (use
"text") so the block begins with ```text instead of ```; locate the fenced block
around the TSQ output (the lines listing "Average TSQ F1", "Total scenarios",
and the scenario entries) and change the opening fence to ```text to satisfy the
documentation lint rule.
- Around line 73-79: The fenced code block showing the example terminal output
(the block beginning with "INFO - Loading agent leaderboard v2 dataset from
Hugging Face...") is missing a language specifier; update the triple-backtick
fence to include a language tag such as text or plaintext (e.g., change ``` to
```text) so the block is correctly marked as plain text in README.md.
- Around line 112-123: The fenced code block in the README example output is
missing a language specifier; update the opening fence for that block (the
triple backticks before "INFO - Starting evaluation run...") to include a
language tag such as "text" so Markdown renders it correctly (e.g., change ```
to ```text for the code block that shows the INFO/LOG output and evaluation
summary).
In `@examples/benchmarks/bfcl/README.md`:
- Around line 213-225: The fenced code block showing the "Expected output" lacks
a language specifier; update the opening triple-backtick for that block to
include a language tag (e.g., change ``` to ```text) so the log lines like "INFO
- Starting evaluation run with config file: .../eval_react_simple.yml" and the
rest of the sample output render as plain text; locate the fenced block
containing the example output and add the specifier to its opening ``` marker.
- Around line 184-196: The fenced code block under the "Expected output:"
section in README.md is missing a language specifier; update the triple-backtick
fence that precedes the log sample (the block showing "INFO - Starting
evaluation run...") to use a language tag (for example use ```text) so syntax
highlighters render it correctly while leaving the block contents unchanged.
- Around line 278-287: The fenced code block showing the example output lacks a
language specifier; update the block delimiter for the "Example output" snippet
(the triple-backtick surrounding the Accuracy/ PASS/ FAIL lines) to include a
language tag (e.g., replace ``` with ```text) so syntax highlighters treat it as
plain text while preserving the existing content.
- Around line 157-168: Update the fenced code block in
examples/benchmarks/bfcl/README.md that contains the evaluation log so it
includes a language specifier (use "text") on the opening fence; replace the
opening ``` with ```text for the block that starts with "INFO - Starting
evaluation run with config file: .../eval_ast_simple.yml" so the block is
properly labeled and rendered.
- Around line 130-133: The fenced code block showing "Dataset dir:
/path/to/.venv/..." and "Answers dir: /path/to/.venv/..." in the README lacks a
language specifier; update that fenced block (the triple-backtick block
containing those two lines) to use a language tag (e.g., add "text" after the
opening ```), so the block becomes ```text ... ``` to ensure proper formatting.
In `@examples/benchmarks/byob/README.md`:
- Around line 138-149: The fenced code block in the README example lacks a
language specifier; update the triple-backtick fence in the BYOB evaluation
output block to include "text" (i.e., change ``` to ```text) so syntax
highlighting/rendering is correct; locate the output block in
examples/benchmarks/byob/README.md (the block that begins with "INFO - Starting
evaluation run...") and add the language tag to the opening fence.
- Around line 39-46: Replace occurrences of the acronym "NAT" that refer to the
toolkit in the README text with the full phrase "NeMo Agent Toolkit" (e.g.,
change "NAT Integration", "NAT workflow", and "NAT replaces model calling" to
use "NeMo Agent Toolkit") while leaving any package names or code identifiers in
backticks untouched; update the table cells and surrounding descriptive text
accordingly to maintain sentence grammar and capitalization.
- Line 55: Replace the possessive phrase "Benchmark's `extra` config" with a
non-possessive form such as "The `extra` config of the benchmark" (search for
the exact string "Benchmark's `extra` config" in the README and update it) so
inanimate object wording follows the documentation guideline.
- Around line 170-172: Update the wording to avoid using a possessive with an
inanimate object: replace "scorer's output dict" with "the output dict of the
scorer" (or "the scorer output dict") in the README text that explains
score_field and reasoning, so the sentence reads e.g. "The metrics available in
each item's `reasoning` dict depend entirely on the output dict of the scorer —
the evaluator passes through the full scorer output without modification."
- Line 293: Change the possessive phrase "Benchmark's `extra` config" to use
non-possessive wording: replace that cell value (the table row for `config` /
`Dict[str, Any]`) so it reads "The `extra` config of the benchmark" instead of
"Benchmark's `extra` config".
- Around line 189-194: The fenced code block showing the benchmark output (the
block starting with "Average score: 0.667" and the lines "0: score=1.0
correct=True" etc.) is missing a language specifier; update the opening fence to
include a language (for example use "text" or "console" like ```text) so
markdown renderers apply correct formatting and avoid syntax highlighting
issues.
In `@examples/benchmarks/tooltalk/README.md`:
- Around line 85-88: The fenced code block showing the environment variable
exports (TOOLTALK_DATABASE_DIR and TOOLTALK_DATASET_DIR) lacks a language
specifier; add a language tag (e.g., text) to the opening triple-backtick so the
block becomes a labeled text block and renders correctly in README.md—update the
block containing the two lines with TOOLTALK_DATABASE_DIR and
TOOLTALK_DATASET_DIR to start with ```text.
- Around line 107-126: The fenced code block showing the "Expected output
(during run):" should include a language specifier to enable proper syntax
highlighting; update the opening fence for that block from ``` to ```text (or
another appropriate language) in the README example block so the lines starting
with "INFO - Starting evaluation run..." and the subsequent output are marked as
plain text.
- Around line 182-188: The fenced code block showing benchmark output uses a
bare triple-backtick; update the opening fence to include a language specifier
(e.g., change the opening "```" to "```text") so syntax highlighters treat the
block as plain text; locate the fenced block containing "Average success rate:
0.345" and modify only its opening fence.
In `@pyproject.toml`:
- Around line 266-270: The entries in uv.sources are not alphabetically ordered:
move the nat_automated_description_generation entry so it appears before the
nat_benchmark_* entries. Specifically, within the uv.sources block reorder so
nat_automated_description_generation appears above
nat_benchmark_agent_leaderboard, nat_benchmark_bfcl, nat_benchmark_byob, and
nat_benchmark_tooltalk to maintain consistent alphabetical sorting.
- Around line 126-130: The examples list in pyproject.toml is not alphabetically
ordered; move "nat_automated_description_generation" so it appears before the
"nat_benchmark_*" entries (e.g., nat_benchmark_agent_leaderboard,
nat_benchmark_bfcl, nat_benchmark_byob, nat_benchmark_tooltalk) so the sequence
is alphabetically sorted; update the examples array to place
nat_automated_description_generation immediately before the nat_benchmark_*
items.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9bf982d9-d5b1-43ba-944f-838d6fa58d02
📒 Files selected for processing (11)
- `ci/vale/styles/config/vocabularies/nat/accept.txt`
- `examples/benchmarks/agent_leaderboard/README.md`
- `examples/benchmarks/bfcl/README.md`
- `examples/benchmarks/byob/README.md`
- `examples/benchmarks/tooltalk/README.md`
- `packages/nvidia_nat_benchmarks/configs/agent_leaderboard_eval.yaml`
- `packages/nvidia_nat_benchmarks/configs/bfcl_ast_eval.yaml`
- `packages/nvidia_nat_benchmarks/configs/tooltalk_eval.yaml`
- `packages/nvidia_nat_benchmarks/pyproject.toml`
- `packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`
- `pyproject.toml`
🚧 Files skipped from review as they are similar to previous changes (4)
- packages/nvidia_nat_benchmarks/configs/tooltalk_eval.yaml
- packages/nvidia_nat_benchmarks/configs/bfcl_ast_eval.yaml
- packages/nvidia_nat_benchmarks/configs/agent_leaderboard_eval.yaml
- packages/nvidia_nat_benchmarks/pyproject.toml
```python
def get_global_intents(scenario_id: str = "current") -> list[dict[str, Any]]:
    """Retrieve tool intents from the global registry.

    This allows evaluators to access intents without needing builder access.

    Args:
        scenario_id: Identifier for the scenario

    Returns:
        List of tool intents
    """
    return _GLOBAL_INTENT_REGISTRY.get(scenario_id, []).copy()


def clear_global_intents(scenario_id: str = "current") -> None:
    """Clear intents from global registry.

    Args:
        scenario_id: Identifier for the scenario to clear
    """
    if scenario_id in _GLOBAL_INTENT_REGISTRY:
        _GLOBAL_INTENT_REGISTRY[scenario_id] = []
        logger.debug("Cleared global intents for scenario %s", scenario_id)
```
Resolve an omitted `scenario_id` through `get_current_scenario_id()`.
The default `"current"` is used as a literal registry key here. After `set_current_scenario_id("foo")`, calling `get_global_intents()` or `clear_global_intents()` with no argument still targets `"current"` instead of the active scenario, which can silently read or clear the wrong bucket. Prefer `scenario_id: str | None = None` and resolve `None` via `get_current_scenario_id()`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`
around lines 121 - 143, The functions get_global_intents and
clear_global_intents currently use the literal default "current" and should
instead accept scenario_id: str | None = None and resolve None by calling
get_current_scenario_id(); update both functions (get_global_intents and
clear_global_intents) to: change the signature to accept Optional scenario_id,
call get_current_scenario_id() when scenario_id is None, then use the resolved
id when reading/writing _GLOBAL_INTENT_REGISTRY and logging so the active
scenario bucket is used rather than the literal "current".
```python
def create_tool_stub_function(
    tool_schema: dict[str, Any],
    intent_buffer: ToolIntentBuffer,
    canned_response: str | None = None,
) -> tuple[callable, BaseModel | None, str]:
```
🧩 Analysis chain

🏁 Script executed:

```shell
cat -n packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py | head -250
```

Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 11039

🏁 Script executed:

```shell
rg "class PermissiveToolInput" packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py -A 5
```

Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 348

🏁 Script executed:

```shell
rg "def create_tool_stub_function" packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py -A 30
```

Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 1449

🏁 Script executed:

```shell
sed -n '220,235p' packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py
```

Repository: NVIDIA/NeMo-Agent-Toolkit
Length of output: 696
Fix the return type annotation for the public API.

The annotation `tuple[callable, BaseModel | None, str]` is invalid. Lowercase `callable` is not a valid type; the first element is an async function requiring `Callable[[object], Coroutine[Any, Any, str]]`; and the second element is the class `PermissiveToolInput` itself, not an instance, requiring `type[PermissiveToolInput] | None`. Update the annotation to:

```python
-> tuple[Callable[[object], Coroutine[Any, Any, str]], type[PermissiveToolInput] | None, str]:
```

This requires importing `Callable` and `Coroutine` from `collections.abc` and `typing` respectively. Per coding guidelines, all public APIs must have correct Python 3.11+ type hints and pass pyright validation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/nvidia_nat_benchmarks/src/nat/plugins/benchmarks/common/tool_intent_stubs.py`
around lines 183 - 187, The public API return annotation on
create_tool_stub_function is wrong: change its signature to return
tuple[Callable[[object], Coroutine[Any, Any, str]], type[PermissiveToolInput] |
None, str] (the first element is an async call signature, the second is the
PermissiveToolInput class type, not an instance). Import Callable from
collections.abc and Coroutine and Any from typing (or typing.Any) and ensure
PermissiveToolInput is referenced in the annotation (use a forward reference or
real symbol) so pyright validates the function create_tool_stub_function
correctly.
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
🧹 Nitpick comments (2)
examples/benchmarks/byob/pyproject.toml (2)
27-32: Publish the example README in package metadata.

This example already has a README in the PR. Adding `readme = "README.md"` makes the installed package more self-describing on package indexes and in metadata views.

Suggested change:

```diff
 [project]
 name = "nat_benchmark_byob"
 dynamic = ["version", "dependencies"]
 requires-python = ">=3.11,<3.14"
 description = "BYOB (Bring Your Own Benchmark) evaluation example for NeMo Agent Toolkit"
+readme = "README.md"
 classifiers = ["Programming Language :: Python"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/byob/pyproject.toml` around lines 27 - 32, Add the README to the package metadata by adding the readme field under the [project] table in pyproject.toml (where name, dynamic, requires-python, description, classifiers are defined); set readme = "README.md" so the package includes and exposes the example README in its installed metadata and on package indexes.
34-38: Remove unused `test` extra from `nvidia-nat` dependency.

Line 36 includes `nvidia-nat[test]`, but this example has no Python source code and performs no test-only imports at runtime. Dropping the `test` extra would reduce unnecessary dependency bloat.

Note: Other examples in the repository consistently include this extra; this change would deviate from that pattern.

Suggested change:

```diff
 [tool.setuptools_dynamic_dependencies]
 dependencies = [
-    "nvidia-nat[eval,langchain,test] == {version}",
+    "nvidia-nat[eval,langchain] == {version}",
     "nvidia-nat-benchmarks[byob] == {version}",
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/byob/pyproject.toml` around lines 34 - 38, Update the dependencies entry under [tool.setuptools_dynamic_dependencies] to remove the unused "test" extra from the "nvidia-nat" spec: replace "nvidia-nat[eval,langchain,test] == {version}" with "nvidia-nat[eval,langchain] == {version}" so the dependencies list no longer includes the unnecessary test extra.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/benchmarks/byob/pyproject.toml`:
- Around line 27-32: Add the README to the package metadata by adding the readme
field under the [project] table in pyproject.toml (where name, dynamic,
requires-python, description, classifiers are defined); set readme = "README.md"
so the package includes and exposes the example README in its installed metadata
and on package indexes.
- Around line 34-38: Update the dependencies entry under
[tool.setuptools_dynamic_dependencies] to remove the unused "test" extra from
the "nvidia-nat" spec: replace "nvidia-nat[eval,langchain,test] == {version}"
with "nvidia-nat[eval,langchain] == {version}" so the dependencies list no
longer includes the unnecessary test extra.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8f1639a4-974c-4c3b-9aaa-3ac690dc3def
⛔ Files ignored due to path filters (2)
- `packages/nvidia_nat_benchmarks/uv.lock` is excluded by `!**/*.lock`
- `uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- `examples/benchmarks/agent_leaderboard/pyproject.toml`
- `examples/benchmarks/bfcl/pyproject.toml`
- `examples/benchmarks/byob/pyproject.toml`
- `examples/benchmarks/tooltalk/pyproject.toml`
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/benchmarks/tooltalk/pyproject.toml
- examples/benchmarks/bfcl/pyproject.toml
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
🧹 Nitpick comments (1)
examples/benchmarks/agent_leaderboard/README.md (1)
84-95: Add language specifiers to fenced code blocks.

The output blocks are missing language specifiers. Add `text` or `console` to improve rendering and meet markdown best practices.

📝 Proposed fix

For lines 84-95:

````diff
 **Expected output:**

-```
+```text
 INFO - Starting evaluation run with config file: .../eval_banking.yml
````

For lines 163-175:

````diff
 **Example output:**

-```
+```text
 Average TSQ F1: 0.650
````

Also applies to: 163-175
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmarks/agent_leaderboard/README.md` around lines 84 - 95, The fenced code blocks showing console output (e.g., the block starting with "INFO - Starting evaluation run with config file: .../eval_banking.yml" and the block containing "Average TSQ F1: 0.650") should include a language specifier for proper rendering; update the opening triple backticks for those blocks (and any other similar output blocks, including the block around lines 163-175) from ``` to ```text (or ```console) so they are explicitly marked as plain text/console output.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/benchmarks/agent_leaderboard/README.md`:
- Around line 84-95: The fenced code blocks showing console output (e.g., the
block starting with "INFO - Starting evaluation run with config file:
.../eval_banking.yml" and the block containing "Average TSQ F1: 0.650") should
include a language specifier for proper rendering; update the opening triple
backticks for those blocks (and any other similar output blocks, including the
block around lines 163-175) from ``` to ```text (or ```console) so they are
explicitly marked as plain text/console output.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7780df8e-8217-44f9-a80e-21fc8b6651c6
📒 Files selected for processing (2)
- `examples/benchmarks/agent_leaderboard/README.md`
- `examples/benchmarks/bfcl/pyproject.toml`
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/benchmarks/bfcl/pyproject.toml
Description
This PR creates a new NAT "benchmarks" plugin (installed with `nvidia-nat-benchmarks[tooltalk,bfcl,byob,agent-leaderboard]`). This plugin is an extension of the optional `nvidia-nat-eval` plugin: it manages integrations with NeMo Evaluator and Hugging Face benchmarks for academic/public agentic performance benchmarks. Functionality is defined by the following three axes of agentic benchmarks:

Summary

- `nat eval` pipeline integration with full trajectory capture, token usage, and latency breakdowns
- `examples/benchmarks/` with READMEs, configs, and installation instructions for each benchmark
- `shuffle` and `shuffle_seed` fields added to `EvalGeneralConfig` for randomizing evaluation order
NAT users need to evaluate their agent workflows against standard function-calling benchmarks, but NeMo Evaluator's endpoint-centric integration model treats NAT as a black-box LLM — discarding the multi-step workflow traces, tool call sequences, and intermediate-step observability that make NAT evaluations valuable. This package extracts the portable parts from each benchmark (datasets and scoring functions) and runs them inside NAT's own evaluation pipeline.
What's included
New package: `packages/nvidia_nat_benchmarks/`

Four benchmark integrations, each providing a dataset adapter, one or more workflows, and an evaluator:

- ToolTalk: `ToolExecutor` simulation with `tooltalk_evaluator`
- BFCL: `ToolIntentBuffer` with `bfcl_evaluator` and `ast_checker`
- Agent Leaderboard v2: `agent_leaderboard_tsq`
- BYOB: `byob_evaluator`

New examples: `examples/benchmarks/{tooltalk,bfcl,agent_leaderboard,byob}/`

Core NAT changes:

- `EvalGeneralConfig`: Added `shuffle: bool` and `shuffle_seed: int | None` fields
- `DatasetHandler`: Accepts `shuffle`/`shuffle_seed` and applies `df.sample(frac=1)` after dedup
- `evaluate.py`: Passes shuffle config to `DatasetHandler`
- `evaluate.md`: Documents the shuffle feature
Closes
By Submitting this PR I confirm:
Summary by CodeRabbit
New Features
Documentation