Merged
44 commits
632445e
Add nested trajectory evaluation example for simple calculator
AnuradhaKaruppiah Mar 18, 2026
1b84943
Add traj print script options to view the traj for different purposes
AnuradhaKaruppiah Mar 18, 2026
e4cab40
Add an e2e test to maintain nested traj integrity
AnuradhaKaruppiah Mar 18, 2026
e1ace03
Atif enhancements proposal
AnuradhaKaruppiah Mar 18, 2026
6daa35b
Update models to populate rich path
AnuradhaKaruppiah Mar 18, 2026
1d75872
Update printing script to use new paths
AnuradhaKaruppiah Mar 19, 2026
6c9e2db
Move to a flat tool structure when translating from IntermediateStep …
AnuradhaKaruppiah Mar 19, 2026
d08d492
Add example with deeper nesting
AnuradhaKaruppiah Mar 19, 2026
10aaded
Make paths optional to keep the schema tight
AnuradhaKaruppiah Mar 19, 2026
2c417b4
Misc fixes
AnuradhaKaruppiah Mar 19, 2026
e7b8fcd
Update proposal with references
AnuradhaKaruppiah Mar 19, 2026
ad3b166
Split the extra into InvocationInfo and Ancestry
AnuradhaKaruppiah Mar 20, 2026
92fb756
Update docs with the ancestry, tool invocation split
AnuradhaKaruppiah Mar 30, 2026
589297f
Update the ATIF path proposal
AnuradhaKaruppiah Mar 30, 2026
e570ab3
Add phoenix to the eval example to enable tracing
AnuradhaKaruppiah Mar 30, 2026
669cf81
Add phoenix as a dep to the simple calc eval example
AnuradhaKaruppiah Mar 30, 2026
ae7e6eb
Update readme to include instructions for installing phoenix
AnuradhaKaruppiah Mar 30, 2026
3eee670
Merge remote-tracking branch 'upstream/develop' into ak-eval-nested-calc
AnuradhaKaruppiah Mar 30, 2026
41740ad
Drop dependency on nvidia-nat-phoenix
AnuradhaKaruppiah Mar 30, 2026
acf3120
Update lock files
AnuradhaKaruppiah Mar 30, 2026
2970ef8
Fix deps
AnuradhaKaruppiah Mar 30, 2026
bc9a32e
Make timestamps optional so they are not fabricated
AnuradhaKaruppiah Mar 30, 2026
a8a3de8
Misc cleanup
AnuradhaKaruppiah Mar 30, 2026
7b442e1
Add dedup logic in trajectory evaluator
AnuradhaKaruppiah Mar 30, 2026
6c2156e
Add an option for validating the ATIF trajectory
AnuradhaKaruppiah Mar 30, 2026
5e8287f
Suppress synthetic calls
AnuradhaKaruppiah Mar 30, 2026
25e5135
Enable tracing for tunable rag eval config
AnuradhaKaruppiah Mar 30, 2026
ce3e36b
Expand producer rules
AnuradhaKaruppiah Mar 31, 2026
6c8ec28
Update print script to find trajectories within a list, dict or payload
AnuradhaKaruppiah Mar 31, 2026
b6db81f
Remove dependency on NAT models
AnuradhaKaruppiah Apr 1, 2026
6581fa7
Limit the dataset entries processed for quick ref
AnuradhaKaruppiah Apr 1, 2026
f42bbfa
Update reference artifacts
AnuradhaKaruppiah Apr 1, 2026
ec4ecaa
Fix failing tests
AnuradhaKaruppiah Apr 1, 2026
4427f06
Misc cleanup
AnuradhaKaruppiah Apr 1, 2026
d48f410
Merge remote-tracking branch 'upstream/develop' into ak-eval-nested-calc
AnuradhaKaruppiah Apr 1, 2026
2719a46
Fix vale warnings
AnuradhaKaruppiah Apr 1, 2026
f4ed546
Fix CI failures
AnuradhaKaruppiah Apr 1, 2026
ce16e7b
Simplify dedup logic used for setting up trajectory
AnuradhaKaruppiah Apr 1, 2026
5ad060e
Update atif converter to handle orphan FUNCTION_END similar to TOOL_END
AnuradhaKaruppiah Apr 1, 2026
d03ed69
Additional unit tests
AnuradhaKaruppiah Apr 1, 2026
d0890ae
Additional tests for traj eval
AnuradhaKaruppiah Apr 1, 2026
702677c
Address style issues
AnuradhaKaruppiah Apr 1, 2026
0c50787
Restore uv lock
AnuradhaKaruppiah Apr 1, 2026
3b2d91e
Restore simple calculator eval lock
AnuradhaKaruppiah Apr 1, 2026
Original file line number Diff line number Diff line change
@@ -40,6 +40,22 @@ This example demonstrates how to evaluate and profile AI agent performance using

1. **Agent toolkit**: Ensure you have the Agent toolkit installed. If you have not already done so, follow the instructions in the [Install Guide](../../../docs/source/get-started/installation.md#install-from-source) to create the development environment and install NeMo Agent Toolkit.
2. **Base workflow**: This example builds upon the Getting Started [Simple Calculator](../../getting_started/simple_calculator/) example. Make sure you are familiar with the example before proceeding.
3. **Phoenix tracing backend**: Start Phoenix before running trajectory-based configurations in this example.

```bash
phoenix serve
```

If your environment does not include the `phoenix` CLI, install it with:

```bash
uv pip install arize-phoenix
```

You can run Phoenix in a virtual environment separate from the one used for
NVIDIA NeMo Agent Toolkit evaluation runs. Doing so is often preferable because
it avoids dependency and version conflicts between the Phoenix packages and the
toolkit and evaluator dependencies.
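
Before launching a trajectory-based configuration, it can help to confirm that something is listening on the Phoenix port. The following is a minimal sketch using only the Python standard library; the helper names `parse_host_port` and `endpoint_reachable` are hypothetical, and the check only verifies that a TCP connection can be opened, not that the listener is actually Phoenix.

```python
import socket
from urllib.parse import urlparse


def parse_host_port(url: str) -> tuple[str, int]:
    """Extract host and port from a URL, falling back to the scheme's default port."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return host, port


def endpoint_reachable(url: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    try:
        with socket.create_connection(parse_host_port(url), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return False
```

For the configurations in this example, the URL to check would be the tracing endpoint `http://localhost:6006/v1/traces`, for instance `endpoint_reachable("http://localhost:6006/v1/traces")`.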

## Installation

@@ -77,3 +93,47 @@ The evaluation generates comprehensive metrics including:
- **Question-by-Question Analysis**: Detailed breakdown of individual responses
- **Performance Metrics**: Overall quality assessments
- **Error Analysis**: Identification of common failure patterns

### Running Nested Trajectory Evaluation

Evaluate a workflow that performs a nested tool call (`power_of_two` -> `calculator__multiply`) and inspect how it appears in the ATIF trajectory output:

```bash
nat eval --config_file examples/evaluation_and_profiling/simple_calculator_eval/configs/config-nested-trajectory-eval.yml
```

This command:
- Uses `examples/evaluation_and_profiling/simple_calculator_eval/data/simple_calculator_power_of_two.json`
- Runs the built-in `trajectory` evaluator
- Writes workflow trajectories to `.tmp/nat/examples/simple_calculator/nested-eval/workflow_output_atif.json`

To inspect the call hierarchy from the generated ATIF file:

```bash
python packages/nvidia_nat_eval/scripts/print_atif_function_tree.py \
.tmp/nat/examples/simple_calculator/nested-eval/workflow_output_atif.json \
--view ancestry \
--item-id 1
```
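
If you want to examine the generated file programmatically rather than through the print script, a sketch like the one below may be a starting point. It assumes that each trajectory is a JSON object holding a list under a `steps` key and that trajectories may be nested anywhere inside lists or dictionaries; both are assumptions about the ATIF layout that this example does not confirm. The function names are hypothetical and are not part of the toolkit.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def find_trajectories(node: Any) -> Iterator[dict]:
    """Yield every dict that looks like a trajectory (has a list under "steps")."""
    if isinstance(node, dict):
        if isinstance(node.get("steps"), list):
            yield node
            return  # do not descend into the trajectory's own steps
        for value in node.values():
            yield from find_trajectories(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_trajectories(item)


def step_counts(path: str) -> list[int]:
    """Load a JSON file and return the number of steps in each trajectory found."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return [len(trajectory["steps"]) for trajectory in find_trajectories(data)]
```

Calling `step_counts` on the `workflow_output_atif.json` path above would return one step count per trajectory found, under the stated assumptions.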

### Running Branching Nested Trajectory Evaluation

Evaluate a workflow where one top-level tool (`power_branch`) fans out to two internal tools (`square_via_multiply` and `cube_via_multiply_chain`) and each branch calls `calculator__multiply`.

```bash
nat eval --config_file examples/evaluation_and_profiling/simple_calculator_eval/configs/config-branching-nested-trajectory-eval.yml
```

This command:
- Uses `examples/evaluation_and_profiling/simple_calculator_eval/data/simple_calculator_power_branch.json`
- Runs the built-in `trajectory` evaluator
- Writes trajectories to `.tmp/nat/examples/simple_calculator/branching-nested-eval/workflow_output_atif.json`

To inspect one input item:

```bash
python packages/nvidia_nat_eval/scripts/print_atif_function_tree.py \
.tmp/nat/examples/simple_calculator/branching-nested-eval/workflow_output_atif.json \
--view ancestry \
--item-id 1
```
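
To make the branching call shape concrete, here is a small self-contained sketch that mimics it and records the ancestry path of each call. The tool names come from this example, but the function bodies and the `CallRecorder` class are hypothetical stand-ins; the actual tool implementations and the way the toolkit records ancestry are not shown here.

```python
from contextlib import contextmanager


class CallRecorder:
    """Records the ancestry path of every traced call, in call order."""

    def __init__(self) -> None:
        self._stack: list[str] = []
        self.paths: list[tuple[str, ...]] = []

    @contextmanager
    def call(self, name: str):
        self._stack.append(name)
        self.paths.append(tuple(self._stack))
        try:
            yield
        finally:
            self._stack.pop()


def run_power_branch(x: float, rec: CallRecorder) -> dict[str, float]:
    """Assumed behavior: fan out to a square branch and a cube branch."""

    def multiply(a: float, b: float) -> float:
        with rec.call("calculator__multiply"):
            return a * b

    def square(v: float) -> float:
        with rec.call("square_via_multiply"):
            return multiply(v, v)

    def cube(v: float) -> float:
        with rec.call("cube_via_multiply_chain"):
            # Two chained multiplications: (v * v) * v.
            return multiply(multiply(v, v), v)

    with rec.call("power_branch"):
        return {"square": square(x), "cube": cube(x)}
```

In this sketch, each entry in `rec.paths` is the chain of callers ending at the call itself, so both branches show `calculator__multiply` as a leaf under a different parent.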
Original file line number Diff line number Diff line change
@@ -34,10 +34,13 @@ classifiers = ["Programming Language :: Python"]

[tool.setuptools_dynamic_dependencies]
dependencies = [
- "nvidia-nat[eval,langchain,profiler,test] == {version}",
+ "nvidia-nat[eval,langchain,phoenix,profiler,test] == {version}",
"nat_simple_calculator",
]

[tool.uv.sources]
nvidia-nat = { path = "../../..", editable = true }
nat_simple_calculator = { path = "../../getting_started/simple_calculator", editable = true }

[project.entry-points."nat.components"]
nat_simple_calculator_eval = "nat_simple_calculator_eval.register"
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Branching nested trajectory evaluation example:
# react_agent -> power_branch -> {square_via_multiply, cube_via_multiply_chain}
# and both internal branches call calculator__multiply.

general:
  telemetry:
    tracing:
      phoenix:
        _type: phoenix
        endpoint: http://localhost:6006/v1/traces
        project: simple_calculator_branching_nested_eval

function_groups:
  calculator:
    _type: calculator

functions:
  square_via_multiply:
    _type: square_via_multiply
    multiply_fn: calculator__multiply
  cube_via_multiply_chain:
    _type: cube_via_multiply_chain
    multiply_fn: calculator__multiply
  power_branch:
    _type: power_branch
    square_fn: square_via_multiply
    cube_fn: cube_via_multiply_chain

llms:
  nim_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    temperature: 0.0
    max_tokens: 1024
    chat_template_kwargs:
      enable_thinking: false
  eval_llm:
    _type: nim
    model_name: mistralai/mixtral-8x22b-instruct-v0.1
    temperature: 0.0
    max_tokens: 1024

workflow:
  _type: react_agent
  tool_names: [power_branch]
  llm_name: nim_llm
  verbose: true
  parse_agent_response_max_retries: 3

eval:
  general:
    max_concurrency: 1
    output:
      dir: .tmp/nat/examples/simple_calculator/branching-nested-eval
      write_atif_workflow_output: true
      cleanup: true
    dataset:
      _type: json
      file_path: examples/evaluation_and_profiling/simple_calculator_eval/data/simple_calculator_power_branch.json
      filter:
        allowlist:
          field:
            id: [1]

  evaluators:
    trajectory_eval:
      _type: trajectory
      enable_atif_evaluator: true
      llm_name: eval_llm
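
The `filter` block in the configuration above limits the run to the dataset entry whose `id` is `1`. The sketch below illustrates that selection, assuming an allowlist keeps only the entries whose field value appears in the listed values. It is an illustration of the semantics, not the toolkit's implementation; the `apply_allowlist` helper and the sample entries are hypothetical.

```python
from typing import Any


def apply_allowlist(
    rows: list[dict[str, Any]],
    allowlist: dict[str, list[Any]],
) -> list[dict[str, Any]]:
    """Keep rows whose value for every allowlisted field is among the allowed values."""
    return [
        row
        for row in rows
        if all(row.get(field) in allowed for field, allowed in allowlist.items())
    ]


# Hypothetical dataset entries; the real file's contents are not shown in this diff.
rows = [
    {"id": 1, "question": "What is 3 squared and 3 cubed?"},
    {"id": 2, "question": "What is 5 squared and 5 cubed?"},
]

# Mirrors `filter.allowlist.field.id: [1]` from the configuration.
selected = apply_allowlist(rows, {"id": [1]})
```

An empty allowlist leaves the dataset unchanged in this sketch, because no field constraint is applied.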
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Nested trajectory evaluation example:
# react_agent -> power_of_two -> calculator__multiply
#
# This configuration writes ATIF workflow output so you can inspect how nested
# tool calls are represented in trajectory steps.

general:
  telemetry:
    tracing:
      phoenix:
        _type: phoenix
        endpoint: http://localhost:6006/v1/traces
        project: simple_calculator_nested_eval

function_groups:
  calculator:
    _type: calculator

functions:
  power_of_two:
    _type: power_of_two
    multiply_fn: calculator__multiply

llms:
  nim_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    temperature: 0.0
    max_tokens: 1024
    chat_template_kwargs:
      enable_thinking: false
  eval_llm:
    _type: nim
    model_name: mistralai/mixtral-8x22b-instruct-v0.1
    temperature: 0.0
    max_tokens: 1024

workflow:
  _type: react_agent
  tool_names: [power_of_two]
  llm_name: nim_llm
  verbose: true
  parse_agent_response_max_retries: 3

eval:
  general:
    max_concurrency: 1
    output:
      dir: .tmp/nat/examples/simple_calculator/nested-eval
      write_atif_workflow_output: true
      cleanup: true
    dataset:
      _type: json
      file_path: examples/evaluation_and_profiling/simple_calculator_eval/data/simple_calculator_power_of_two.json
      filter:
        allowlist:
          field:
            id: [1]

  evaluators:
    trajectory_eval:
      _type: trajectory
      enable_atif_evaluator: true
      llm_name: eval_llm
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Trajectory evaluation example:
# react_agent -> {calculator, current_datetime}
#
# This configuration writes ATIF workflow output so you can inspect
# trajectory structure with standard calculator and datetime tool calls.

general:
  telemetry:
    tracing:
      phoenix:
        _type: phoenix
        endpoint: http://localhost:6006/v1/traces
        project: simple_calculator_eval

function_groups:
  calculator:
    _type: calculator

functions:
  current_datetime:
    _type: current_datetime

llms:
  nim_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    temperature: 0.0
    max_tokens: 1024
    chat_template_kwargs:
      enable_thinking: false
  eval_llm:
    _type: nim
    model_name: mistralai/mixtral-8x22b-instruct-v0.1
    temperature: 0.0
    max_tokens: 1024

workflow:
  _type: react_agent
  tool_names: [calculator, current_datetime]
  llm_name: nim_llm
  verbose: true
  parse_agent_response_max_retries: 3

eval:
  general:
    max_concurrency: 1
    output:
      dir: .tmp/nat/examples/simple_calculator/trajectory-eval
      write_atif_workflow_output: true
      cleanup: true
    dataset:
      _type: json
      file_path: examples/getting_started/simple_calculator/data/simple_calculator.json
      filter:
        allowlist:
          field:
            id: [1]

  evaluators:
    trajectory_eval:
      _type: trajectory
      enable_atif_evaluator: true
      llm_name: eval_llm
Original file line number Diff line number Diff line change
@@ -13,6 +13,14 @@
# See the License for the specific language governing permissions and
# limitations under the License.

general:
  telemetry:
    tracing:
      phoenix:
        _type: phoenix
        endpoint: http://localhost:6006/v1/traces
        project: simple_calculator_tunable_rag_eval_atif

function_groups:
  calculator:
    _type: calculator
@@ -27,6 +35,8 @@ llms:
    model_name: nvidia/nemotron-3-nano-30b-a3b
    temperature: 0.0
    max_tokens: 1024
    chat_template_kwargs:
      enable_thinking: false
  eval_llm:
    _type: nim
    model_name: mistralai/mixtral-8x22b-instruct-v0.1
Git LFS file not shown
Git LFS file not shown
Git LFS file not shown
Git LFS file not shown
Git LFS file not shown