
feat: introduce evaluation feature #518

Open
notgitika wants to merge 20 commits into aws:main from notgitika:feat/eval-support

Conversation


@notgitika notgitika commented Mar 9, 2026

Description

Adds evaluation support to the CLI as first-class resource primitives, enabling users to create custom evaluators, configure online (continuous) evaluation, and run on-demand eval against agent sessions.

What's implemented

CLI commands

  • agentcore add evaluator — create custom evaluators (SESSION, TRACE, or TOOL_CALL level) with configurable rating scales and instructions
  • agentcore add online-eval — create online eval configs that continuously evaluate agent sessions with configurable sampling rate
  • agentcore remove evaluator / agentcore remove online-eval — remove from project config
  • agentcore run eval — run on-demand evaluation against recent agent sessions (supports project-local evaluators, Builtin.* IDs, and ARN-based evaluators)
    • can also run directly against an agent via --agent-arn
  • agentcore eval history — view past on-demand eval run results
  • agentcore pause online-eval / agentcore resume online-eval — toggle online eval execution (supports both project config names and direct ARN mode)
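
The sampling rate for online eval is described elsewhere in this PR as a percentage. A minimal sketch of a per-session sampling decision under that assumption — the function name and injectable random source are hypothetical, not taken from the PR:

```typescript
// Hypothetical sketch — not the PR's actual code. Assumes samplingRate is a
// percentage in [0, 100], as the online eval config describes it.
function shouldSample(samplingRate: number, rand: () => number = Math.random): boolean {
  if (samplingRate <= 0) return false;  // 0% — never evaluate
  if (samplingRate >= 100) return true; // 100% — evaluate every session
  return rand() * 100 < samplingRate;   // e.g. rate 25 samples ~1 in 4 sessions
}
```

Injecting `rand` keeps the decision deterministic in tests while defaulting to `Math.random` in production.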

TUI screens

  • Add Evaluator wizard — model selection, evaluation level, instructions with placeholder validation, rating scale (numerical presets or custom)
  • Add Online Eval wizard — agent selection, evaluator selection (custom + builtin), sampling rate
  • Eval Hub screen — central navigation for eval features
  • Run Eval wizard — agent/evaluator/lookback selection with live progress
  • Online Eval Dashboard — view deployed online eval configs with status
  • Resource graph updated to show evaluators and online eval configs

Schema & deploy

  • Zod schemas for Evaluator and OnlineEvaluationConfig in agentcore.json
  • Deployed state tracking for evaluators and online eval configs
  • Status command enriched with evaluator level, type, status, and online eval sampling info
  • Deploy preflight and CDK test template updated
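
The rating-scale schema allows either a numerical preset or a custom scale, but not both. The PR implements this with Zod; below is a dependency-free sketch of the same XOR rule, where the field names (`numericalPreset`, `customScale`) are hypothetical, not the actual schema:

```typescript
// Hypothetical shapes — the real schema is a Zod definition in agentcore.json's
// project schema. Field names here are illustrative.
interface EvaluatorConfig {
  name: string;
  level: "SESSION" | "TRACE" | "TOOL_CALL";
  numericalPreset?: { min: number; max: number };
  customScale?: { label: string; value: number }[];
}

// XOR rule: exactly one of the two rating-scale shapes must be present.
function validateRatingScale(cfg: EvaluatorConfig): { success: boolean; error?: string } {
  const hasPreset = cfg.numericalPreset !== undefined;
  const hasCustom = cfg.customScale !== undefined;
  if (hasPreset === hasCustom) {
    return { success: false, error: "Provide exactly one of numericalPreset or customScale" };
  }
  return { success: true };
}
```

In Zod this would typically be expressed as a `.refine()` on the object schema; the plain function above just makes the rule explicit.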

One Callout

  • EnableOnCreate not supported by CFN — the CfnOnlineEvaluationConfig resource doesn't support EnableOnCreate yet. Online eval configs deploy as DISABLED; users must run agentcore resume online-eval to enable. Code is ready to uncomment when CFN adds support.

What's missing / follow-up

  • Documentation updates (README, specialized evals doc, etc.)
  • UI/UX bugs

Related Issue

Closes #517

Documentation PR

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Other (please describe):

Testing

How have you tested the change?

End-to-end tested full lifecycle: create → add agent → add evaluator (SESSION + TRACE) → add online-eval → deploy → invoke multiple times → run eval (scores returned) → eval history → resume/pause/resume → remove online-eval → remove evaluator → redeploy (stack updated correctly)

  • I ran npm run test:unit and npm run test:integ
  • I ran npm run typecheck
  • I ran npm run lint
  • If I modified src/assets/, I ran npm run test:update-snapshots and committed the updated snapshots

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.

Screenshots: (five images attached)

@notgitika notgitika requested a review from a team March 9, 2026 15:56
@github-actions github-actions bot added the size/xl PR size: XL label Mar 9, 2026
@notgitika
Contributor Author

E2E failures are because the vended cdk.test.ts includes the evaluators and onlineEvalConfigs fields, but the published @aws/agentcore-l3-cdk-constructs package doesn't have eval support yet; this will resolve once the CDK package is published.


@notgitika notgitika changed the base branch from main to feature/introduce-eval-support March 12, 2026 00:13

@notgitika notgitika changed the title feat: add evals control plane operations feat: introduce evaluation feature Mar 13, 2026

@notgitika
Contributor Author

/strands review

@github-actions
Contributor

Review Summary ✅

I've completed a comprehensive review of PR #518. This is excellent work that properly extends the CLI's primitive architecture with evaluation support. All concerns from previous reviews have been addressed, and the implementation is production-ready.


✅ Strengths

1. Architecture & Design Excellence

  • Perfect primitive pattern adherence: Both EvaluatorPrimitive and OnlineEvalConfigPrimitive properly extend BasePrimitive with all required methods implemented
  • Proper registration: Correctly registered in registry.ts (lines 17-18) and integrated into all CLI surfaces
  • Cross-reference validation: Evaluator removal is correctly blocked when referenced by online eval configs (EvaluatorPrimitive.ts:46-54)
  • Schema validation: Excellent Zod schemas with XOR validation for rating scales and cross-project validation (agentcore-project.ts:200-225)
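
The cross-reference guard noted above (removal blocked while an online eval config references the evaluator) can be sketched as follows; the config shapes and function name are hypothetical — the real check lives in EvaluatorPrimitive.ts:

```typescript
// Hypothetical project-config shapes, illustrating the cross-reference guard.
interface ProjectConfig {
  evaluators: { name: string }[];
  onlineEvalConfigs: { name: string; evaluatorNames: string[] }[];
}

// Removal is blocked while any online eval config still references the evaluator.
function canRemoveEvaluator(
  project: ProjectConfig,
  evaluatorName: string,
): { success: boolean; error?: string } {
  const referencedBy = project.onlineEvalConfigs
    .filter((c) => c.evaluatorNames.includes(evaluatorName))
    .map((c) => c.name);
  if (referencedBy.length > 0) {
    return { success: false, error: `Evaluator is referenced by: ${referencedBy.join(", ")}` };
  }
  return { success: true };
}
```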

2. Comprehensive Test Coverage ✅

The blocking concern from earlier reviews has been fully resolved:

  • 1,415 total lines of eval-specific tests
  • EvaluatorPrimitive.test.ts (233 lines) - tests add/remove/cross-reference blocking
  • OnlineEvalConfigPrimitive.test.ts (242 lines) - tests validation and lifecycle
  • run-eval.test.ts (940 lines) - comprehensive coverage of ARN mode, session filtering, batching
  • agentcore-control.test.ts, agentcore-evaluate.test.ts - AWS SDK wrapper tests
  • ✅ Integration test updates for preflight, status, and deployment flows

3. Documentation ✅

AGENTS.md has been properly updated:

  • Lines 27-32: New commands documented (run eval, eval history, pause/resume online-eval)
  • Lines 71-72: New primitives documented (EvaluatorPrimitive, OnlineEvalConfigPrimitive)
  • Primitive architecture section updated to include evaluators and online-eval

4. Security & Robustness

  • ✅ Query sanitization to prevent injection (run-eval.ts:362)
  • ✅ ARN parsing validation before use
  • ✅ Input validation throughout (name regex, sampling rate bounds)
  • ✅ Consistent error handling with { success: boolean, error?: string } pattern
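
"ARN parsing validation before use" could look like the sketch below, returning the review's `{ success-or-null }` style. The `bedrock-agentcore` service namespace and the helper name are assumptions, not the PR's code:

```typescript
// Hypothetical sketch: validate an evaluator ARN before use. Follows the
// standard arn:partition:service:region:account-id:resource layout; the
// "bedrock-agentcore" service namespace is an assumption.
function parseEvaluatorArn(
  arn: string,
): { region: string; accountId: string; resourceId: string } | null {
  const parts = arn.split(":");
  if (parts.length < 6 || parts[0] !== "arn") return null;
  const [, , service, region, accountId, ...rest] = parts;
  if (service !== "bedrock-agentcore" || !region || !accountId) return null;
  // resource IDs may themselves contain ":", so rejoin the remainder
  return { region, accountId, resourceId: rest.join(":") };
}
```

Returning `null` (rather than throwing) lets callers fall back to treating the input as a project-local evaluator name.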

5. Implementation Quality

  • Sampling rate clarity: Well-documented as percentage with inline comment (line 144 in OnlineEvalConfigPrimitive.ts)
  • Status integration: Properly shows deployment state with live API enrichment (status/action.ts:155-173)
  • Edge case handling: Tool call spans require gen_ai.tool.name attribute to avoid false positives
  • CloudWatch integration: Reasonable 60-second query timeout with polling loop
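
The polling-with-timeout pattern described above can be sketched generically; the helper and its injected query function are illustrative, not the PR's implementation:

```typescript
// Hypothetical sketch of poll-until-complete with an overall deadline,
// mirroring the review's description of a 60-second CloudWatch query timeout.
// The query function is injected so the sketch stays self-contained; it
// resolves null while the query is still running.
async function pollUntilComplete<T>(
  query: () => Promise<T | null>,
  timeoutMs = 60_000,
  intervalMs = 1_000,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = await query();
    if (result !== null) return result;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Query did not complete within ${timeoutMs} ms`);
}
```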

💡 Minor Observations (Non-blocking)

1. Default Model Cross-Region ID

File: src/cli/tui/screens/evaluator/types.ts:34

export const DEFAULT_MODEL = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0';

Observation: The us. prefix denotes a cross-region inference profile ID, which routes requests across US regions rather than pinning to a single one. This is intentional and correct. ✅

2. EnableOnCreate CFN Limitation

PR Description notes:

The CfnOnlineEvaluationConfig resource doesn't support EnableOnCreate yet. Online eval configs deploy as DISABLED; users must run agentcore resume online-eval to enable.

Observation: This is properly documented, and the code is ready to uncomment when CFN adds support. This is the right approach. ✅

3. Batch Size Constraint

The evaluate API accepts max 10 targetSpanIds per call. The implementation properly batches requests. This is likely an API constraint, not a code issue. ✅
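
The batching described here reduces to a chunking helper applied to the span IDs before issuing requests; a sketch (names illustrative):

```typescript
// Hypothetical sketch: split targetSpanIds into batches no larger than the
// API's per-call limit (10, per the review) before issuing evaluate calls.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// e.g. 23 span IDs → three batches of sizes 10, 10, and 3
```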


🎯 Verdict

Status: APPROVED - Ready to Merge

This PR demonstrates:

  • Excellent understanding of the codebase architecture
  • Proper implementation of the primitive pattern
  • Comprehensive test coverage (1400+ lines of tests)
  • Complete documentation
  • Strong security practices
  • Clean integration with existing features

All blocking concerns from previous reviews have been resolved:

  1. ✅ Tests added for both primitives
  2. ✅ Cross-reference validation tested
  3. ✅ Documentation updated in AGENTS.md
  4. ✅ Sampling rate units clarified
  5. ✅ Deployed state tracking implemented

No changes required before merge. This is production-ready code. 🚀


Testing Verification

The PR author confirms end-to-end testing:

"End-to-end tested full lifecycle: create → add agent → add evaluator (SESSION + TRACE) → add online-eval → deploy → invoke multiple times → run eval (scores returned) → eval history → resume/pause/resume → remove online-eval → remove evaluator → redeploy (stack updated correctly)"

Combined with the comprehensive unit and integration test suite, this gives very high confidence in the implementation.

@notgitika notgitika changed the base branch from feature/introduce-eval-support to main March 18, 2026 15:40
…lineEvalOptions

- Update scaffolded aws-cdk-lib from 2.239.0 to 2.243.0 so CfnEvaluator
  is available in aws-cdk-lib/aws-bedrockagentcore
- Remove description field from UpdateOnlineEvalOptions since description
  updates should be managed via CDK, not API calls


Development

Successfully merging this pull request may close these issues.

feat: add evals control plane operations
