feat: add correctness evaluator, trace-based and reference-based by ybdarrenwang · Pull Request #185 · strands-agents/evals

ybdarrenwang · 2026-04-02T16:48:50Z

Description

Add correctness evaluator, supporting both trace-based and reference-based evaluation

Related Issues

#95

Documentation PR

strands-agents/docs#714

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

I ran hatch run prepare

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

poshinchen

🤔 I'm a bit confused about expected_assertion, and the expected_output now.
What is the difference? Can't we always use one or another in trace-based evaluators?

ybdarrenwang · 2026-04-10T17:17:42Z

🤔 I'm a bit confused about expected_assertion, and the expected_output now. What is the difference? Can't we always use one or another in trace-based evaluators?

After discussing this offline, we confirmed that it shouldn't make a difference, and we'll use expected_assertions for both:

GoalSuccessRateEvaluator with assertion
CorrectnessEvaluator with reference
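
To make the decision above concrete, here is a minimal sketch of how one shared field could serve both evaluators. It is not the strands_evals implementation: `EvalCase`, `build_judge_prompt`, the evaluator keys, and the prompt wording are all hypothetical names made up for illustration, and the field name simply follows the spelling used in the reviewer's question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalCase:
    """Hypothetical stand-in for an evaluation case (not the strands_evals type)."""

    user_input: str
    actual_output: str
    # Assumption: one optional field carries either an assertion or a reference.
    expected_assertion: Optional[str] = None


def build_judge_prompt(case: EvalCase, evaluator: str) -> str:
    """Assemble an illustrative judge prompt for the given evaluator kind."""
    if evaluator not in ("goal_success_rate", "correctness"):
        raise ValueError(f"unknown evaluator: {evaluator}")

    lines = [
        f"User input: {case.user_input}",
        f"Agent output: {case.actual_output}",
    ]
    if case.expected_assertion is None:
        # Trace-based mode: the judge sees only what the agent produced.
        lines.append("No reference provided; judge from the trace alone.")
    elif evaluator == "goal_success_rate":
        # The shared field is read as an assertion the outcome must satisfy.
        lines.append(f"Assertion to satisfy: {case.expected_assertion}")
    else:
        # The same field is read as a reference answer to compare against.
        lines.append(f"Reference answer: {case.expected_assertion}")
    return "\n".join(lines)


case = EvalCase("What is 2 + 2?", "4", expected_assertion="The answer is 4")
print(build_judge_prompt(case, "correctness"))
```

In this sketch the only difference between the two evaluators is how the shared field is labeled for the judge, which is one way to read "it shouldn't make a difference"; leaving the field unset falls back to trace-only judging.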

ybdarrenwang temporarily deployed to manual-approval April 2, 2026 16:49 — with GitHub Actions Inactive

ybdarrenwang mentioned this pull request Apr 2, 2026

docs: add correctness, goal success rate, coherence evaluator examples strands-agents/docs#714

Merged

4 tasks

poshinchen reviewed Apr 7, 2026

Comment thread src/strands_evals/evaluators/prompt_templates/correctness/correctness_v0.py

poshinchen reviewed Apr 7, 2026

Comment thread src/strands_evals/evaluators/coherence_evaluator.py

poshinchen reviewed Apr 7, 2026

Comment thread src/strands_evals/evaluators/correctness_evaluator.py

poshinchen reviewed Apr 9, 2026

Comment thread src/strands_evals/evaluators/correctness_evaluator.py Outdated

poshinchen reviewed Apr 9, 2026

feat: add correctness evaluator, trace-based and reference-based

a54fb25

Co-authored-by: Kang Zhou <kangzhou1991@gmail.com> Co-authored-by: Subramanian Chidambaram <subbu10123@gmail.com>

ybdarrenwang force-pushed the feature/gt-rc branch from 9addb56 to a54fb25 Compare April 10, 2026 17:16

ybdarrenwang temporarily deployed to manual-approval April 10, 2026 17:16 — with GitHub Actions Inactive

poshinchen approved these changes Apr 10, 2026

poshinchen merged commit d40a2b3 into strands-agents:main Apr 10, 2026
13 checks passed

poshinchen merged 1 commit into strands-agents:main from ybdarrenwang:feature/gt-rc
