feat: add support for line-by-line evaluation by AAgnihotry · Pull Request #1481 · UiPath/uipath-python

AAgnihotry · 2026-03-23T17:17:17Z

Summary

Adds line-by-line evaluation capability to both new evaluators (version-based) and legacy evaluators (category/type-based), enabling per-line evaluation of multi-line outputs with partial credit scoring.

Problem

Current evaluators provide binary pass/fail results for multi-line outputs. If any part of the output is incorrect, the entire evaluation fails with a score of 0.0, even if most lines are correct.

Solution

This PR introduces line-by-line evaluation that:

Splits outputs by a configurable delimiter (default: \n)
Evaluates each line independently
Returns both per-line scores and an aggregated score (average)
Provides partial credit based on the percentage of correct lines
Works with both new evaluators and legacy evaluators
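The steps above can be sketched roughly as follows. This is a minimal illustration, not the code added in this PR: `evaluate_line_by_line` and `exact_match` are hypothetical names, and giving no credit to missing or extra lines is an assumption about how mismatched line counts might be handled.

```python
from typing import Callable


def exact_match(actual: str, expected: str) -> float:
    """Stand-in per-line scorer: 1.0 on an exact match, else 0.0."""
    return 1.0 if actual == expected else 0.0


def evaluate_line_by_line(
    actual: str,
    expected: str,
    score_line: Callable[[str, str], float] = exact_match,
    delimiter: str = "\n",
) -> tuple[list[float], float]:
    """Return (per-line scores, average score) for two delimited outputs."""
    actual_lines = actual.split(delimiter)
    expected_lines = expected.split(delimiter)
    total = max(len(actual_lines), len(expected_lines))
    scores: list[float] = []
    for i in range(total):
        if i < len(actual_lines) and i < len(expected_lines):
            scores.append(score_line(actual_lines[i], expected_lines[i]))
        else:
            # Assumption: a missing or extra line earns no credit.
            scores.append(0.0)
    return scores, sum(scores) / total


per_line, overall = evaluate_line_by_line(
    "apple\nbanana\ncherry", "apple\nbanana\ngrape"
)
print(per_line, round(overall, 2))  # [1.0, 1.0, 0.0] 0.67
```

Passing the scorer in as a callable mirrors the idea that any output evaluator can be applied per line; the real evaluators may be wired together differently.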

Key Changes

New Evaluators (Version-based)

output_evaluator.py: Added lineByLineEvaluator and lineDelimiter config options to OutputEvaluatorConfig
output_evaluator.py: Implemented _evaluate_line_by_line() method in BaseOutputEvaluator
test_evaluator_methods.py: Added 4 comprehensive tests for line-by-line evaluation

Legacy Evaluators (Category/Type-based)

base_legacy_evaluator.py: Added lineByLineEvaluation and lineDelimiter fields to BaseLegacyEvaluator
base_legacy_evaluator.py: Implemented _evaluate_line_by_line() method for legacy evaluators
base_legacy_evaluator.py: Added helper methods: _split_into_lines() and _get_actual_output()
test_legacy_exact_match_evaluator.py: Added 5 comprehensive tests for legacy line-by-line evaluation

Runtime & Bug Fixes

runtime.py: Fixed aggregation logic to skip line-by-line sub-results (e.g., "ExactMatch (Line 1)")
output_evaluator.py: Fixed bug where targetOutputKey wasn't properly applied to individual lines
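To illustrate the aggregation fix, here is a hedged sketch of skipping per-line sub-results when averaging. The PR text only says that names such as "ExactMatch (Line 1)" are skipped; matching them with a regular expression on a " (Line N)" suffix, and the `aggregate_scores` helper itself, are assumptions made for this example.

```python
import re
from statistics import mean

# Assumption: sub-results are recognisable by a " (Line N)" suffix on the name.
LINE_SUFFIX = re.compile(r" \(Line \d+\)$")


def aggregate_scores(results: dict[str, list[float]]) -> dict[str, float]:
    """Average scores per evaluator, ignoring per-line sub-results."""
    return {
        name: mean(scores)
        for name, scores in results.items()
        if not LINE_SUFFIX.search(name)
    }


results = {
    "ExactMatch": [1.0, 0.5],
    "ExactMatch (Line 1)": [1.0],
    "ExactMatch (Line 2)": [0.0],
}
averages = aggregate_scores(results)
print(averages)  # {'ExactMatch': 0.75}
```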

Sample Project

samples/line_by_line_test/: Added complete sample agent demonstrating the feature with:
- Simple agent that outputs one item per line
- 3 evaluators: new line-by-line, regular, and legacy line-by-line
- 3 test cases showing partial credit scoring
- Full documentation in README.md

Benefits

✅ Partial credit: 2/3 correct lines = 0.67 score instead of 0.0
✅ Granular feedback: See which specific lines passed/failed
✅ Flexible: Works with any output evaluator (ExactMatch, Contains, JsonSimilarity, LLMJudge)
✅ Universal: Works with both new and legacy evaluator formats
✅ Backwards compatible: Disabled by default, opt-in via config

Example Usage

New Evaluators (Low-Code JSON config):

{
  "version": "1.0",
  "evaluatorTypeId": "uipath-exact-match",
  "evaluatorConfig": {
    "name": "LineByLineExactMatch",
    "lineByLineEvaluator": true,
    "lineDelimiter": "\n"
  }
}

Legacy Evaluators (Low-Code JSON config):

{
  "category": 0,
  "type": 1,
  "name": "LegacyLineByLineExactMatch",
  "targetOutputKey": "result",
  "lineByLineEvaluation": true,
  "lineDelimiter": "\n"
}

Coded Evaluator:

@evaluator
def line_by_line_evaluator() -> ExactMatchEvaluator:
    return ExactMatchEvaluator(
        evaluatorConfig={
            "lineByLineEvaluator": True,
            "lineDelimiter": "\n"
        }
    )

Test Results

Sample agent output showing all three evaluators:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  Evaluation               ┃  LineByLineExactMatch ┃  RegularExactMatch  ┃  LegacyLineByLineExactMatch ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│  Test all lines match     │                  1.0  │                1.0  │                         1.0 │
│  Test when one line       │                  0.7  │                0.0  │                         0.7 │  ← Partial credit!
│  doesn't match            │                       │                     │                             │
│  Test with single item    │                  1.0  │                1.0  │                         1.0 │
├───────────────────────────┼───────────────────────┼─────────────────────┼─────────────────────────────┤
│  Average                  │                  0.9  │                0.7  │                         0.9 │
└───────────────────────────┴───────────────────────┴─────────────────────┴─────────────────────────────┘
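The figures in the table can be checked with a little arithmetic. The sketch below assumes the middle test case has 2 of 3 lines correct (as in the Benefits example) and that the table rounds to one decimal place; neither detail is stated next to the table itself.

```python
from statistics import mean

# Assumption: the partially matching test case has 2 of 3 lines correct.
line_by_line = [1.0, 2 / 3, 1.0]
regular = [1.0, 0.0, 1.0]  # all-or-nothing scoring for the same three tests

print(round(line_by_line[1], 1))      # 0.7
print(round(mean(line_by_line), 1))   # 0.9
print(round(mean(regular), 1))        # 0.7
```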

Testing

✅ All existing tests pass (1632 tests)
✅ Added 4 new line-by-line tests for new evaluators
✅ Added 5 new line-by-line tests for legacy evaluators
✅ Sample agent demonstrates full functionality with all 3 evaluator types
✅ Type checking passes (mypy)
✅ Linting passes (ruff)
✅ Build succeeds

Bugs Fixed

Runtime aggregation bug: Line sub-results were causing validation errors during score aggregation
targetOutputKey bug: Individual lines weren't wrapped in proper structure when targetOutputKey != "*"
Type checking bug: Added proper type assertions for LineByLineEvaluationDetails
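As a rough picture of the targetOutputKey fix, the sketch below wraps each line before it is handed to a key-based evaluator. The helper names and the `{key: line}` shape are assumptions for illustration; the PR only states that lines were not wrapped in the proper structure when targetOutputKey != "*".

```python
from typing import Any


def wrap_line(line: str, target_output_key: str) -> Any:
    """Wrap one line so an evaluator that reads a specific key can find it."""
    if target_output_key == "*":
        # Assumption: "*" selects the whole output, so the bare line is passed through.
        return line
    return {target_output_key: line}


def wrap_lines(output: str, target_output_key: str, delimiter: str = "\n") -> list[Any]:
    """Split the output and wrap every resulting line."""
    return [wrap_line(line, target_output_key) for line in output.split(delimiter)]


wrapped = wrap_lines("apple\nbanana", "result")
print(wrapped)  # [{'result': 'apple'}, {'result': 'banana'}]
```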

🤖 Generated with Claude Code

Development Packages

uipath

[project]
dependencies = [
  # Exact version:
  "uipath==2.10.30.dev1014815583",

  # Any version from PR
  "uipath>=2.10.30.dev1014810000,<2.10.30.dev1014820000"
]

[[tool.uv.index]]
name = "testpypi"
url = "https://test.pypi.org/simple/"
publish-url = "https://test.pypi.org/legacy/"
explicit = true

[tool.uv.sources]
uipath = { index = "testpypi" }

This commit adds line-by-line evaluation capability to output evaluators, allowing them to evaluate multi-line outputs on a per-line basis and provide granular feedback with partial credit scoring.

Key changes:
- Added lineByLineEvaluator config flag to OutputEvaluatorConfig
- Added lineDelimiter config to customize split behavior (default: "\n")
- Implemented _evaluate_line_by_line() method in BaseOutputEvaluator
- Fixed runtime aggregation to handle line-by-line sub-results
- Fixed targetOutputKey wrapping for individual line evaluations
- Added sample agent demonstrating the feature (samples/line_by_line_test)

Benefits:
- Provides partial credit (e.g., 2/3 lines correct = 0.67 score)
- More granular feedback with per-line results
- Useful for evaluating structured multi-line outputs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add lineByLineEvaluation and lineDelimiter fields to BaseLegacyEvaluator
- Implement _evaluate_line_by_line() method for legacy evaluators
- Add helper methods: _split_into_lines() and _get_actual_output()
- Add 5 comprehensive tests for legacy line-by-line evaluation
- Add legacy evaluator to line_by_line_test sample for validation
- Update sample README to document both new and legacy evaluators

This feature enables legacy evaluators (category/type based) to support line-by-line evaluation with partial credit scoring, matching the functionality already available in new evaluators (version/evaluatorTypeId based).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Import LineByLineEvaluationDetails in test file
- Add isinstance() checks to help mypy understand result.details type
- Fixes mypy union-attr errors in legacy evaluator tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Change pyproject.toml to use local editable install instead of TestPyPI
- Fix legacy evaluator JSON to use integer enum values (category: 0, type: 1)
- Verified: All 3 evaluators (new line-by-line, regular, legacy line-by-line) work correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

AAgnihotry · 2026-03-24T20:35:26Z

Chibionos

Need to rearrange the functions; they are piled up in the core evaluator files.

packages/uipath/src/uipath/eval/evaluators/output_evaluator.py

packages/uipath/src/uipath/eval/runtime/runtime.py

packages/uipath/src/uipath/eval/evaluators/base_legacy_evaluator.py

packages/uipath/src/uipath/eval/evaluators/output_evaluator.py

Chibionos · 2026-03-24T21:10:22Z

...uipath/samples/line_by_line_test/evaluations/evaluators/legacy-line-by-line-exact-match.json

+  "id": "LegacyLineByLineExactMatch",
+  "category": 0,
+  "type": 1,
+  "name": "LegacyLineByLineExactMatch",


This should not break medline when they turn on URT for them.
We should also make sure we track this on the sprint board.

packages/uipath/samples/line_by_line_test/evaluations/evaluators/line-by-line-exact-match.json

Added isinstance() checks to help mypy understand that result.details is specifically a LineByLineEvaluationDetails object, not just the generic str | BaseModel | None type. This fixes mypy errors when accessing total_lines_actual and total_lines_expected attributes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
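The narrowing described in this commit message can be shown with stand-in types. Everything below is hypothetical: `BaseDetails`, `LineByLineDetails`, and `Result` are simplified dataclasses invented for the example, not the library's models; only the attribute names total_lines_actual and total_lines_expected come from the commit message.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class BaseDetails:
    """Hypothetical stand-in for the generic details base type."""


@dataclass
class LineByLineDetails(BaseDetails):
    """Hypothetical stand-in for LineByLineEvaluationDetails."""

    total_lines_actual: int
    total_lines_expected: int


@dataclass
class Result:
    score: float
    details: Union[str, BaseDetails, None] = None


def line_counts(result: Result) -> tuple[int, int]:
    # Without this check a type checker sees str | BaseDetails | None and
    # rejects the attribute access below (mypy reports union-attr).
    assert isinstance(result.details, LineByLineDetails)
    return result.details.total_lines_actual, result.details.total_lines_expected


counts = line_counts(Result(score=2 / 3, details=LineByLineDetails(3, 3)))
print(counts)  # (3, 3)
```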

github-actions bot added the test:uipath-langchain (Triggers tests in the uipath-langchain-python repository) and test:uipath-llamaindex (Triggers tests in the uipath-llamaindex-python repository) labels Mar 23, 2026

AAgnihotry force-pushed the feat/line-by-line-evaluation branch from 88191a0 to 8750e34 Compare March 23, 2026 17:18

AAgnihotry added the build:dev (Create a dev build from the pr) label Mar 24, 2026

AAgnihotry and others added 2 commits March 24, 2026 11:30

chore: bump version to 2.10.27

730ccc2

AAgnihotry force-pushed the feat/line-by-line-evaluation branch from 5ba8d3d to 730ccc2 Compare March 24, 2026 18:32

fix: add support for running line by line evaluations

a43c257

AAgnihotry force-pushed the feat/line-by-line-evaluation branch from afee629 to a43c257 Compare March 24, 2026 18:58

AAgnihotry and others added 3 commits March 24, 2026 13:02

AAgnihotry added build:dev (Create a dev build from the pr) and removed build:dev (Create a dev build from the pr) labels Mar 24, 2026

Chibionos requested changes Mar 24, 2026

packages/uipath/src/uipath/eval/evaluators/output_evaluator.py (outdated)

packages/uipath/src/uipath/eval/runtime/runtime.py (outdated)

packages/uipath/src/uipath/eval/evaluators/base_legacy_evaluator.py

Chibionos approved these changes Mar 24, 2026


AAgnihotry and others added 2 commits March 24, 2026 17:08

fix: move some functions to utility and dedup the legacy and coded

9516bea

AAgnihotry force-pushed the feat/line-by-line-evaluation branch from 077915e to c2c050e Compare March 25, 2026 00:34

fix: update readme.md

4ff6f39

AAgnihotry merged commit 10bc126 into main Mar 25, 2026
127 checks passed

AAgnihotry deleted the feat/line-by-line-evaluation branch March 25, 2026 00:41

AAgnihotry merged 9 commits into main from feat/line-by-line-evaluation
