feat: add support for line-by-line evaluation #1481

Merged
AAgnihotry merged 9 commits into main from feat/line-by-line-evaluation on Mar 25, 2026

@AAgnihotry (Contributor) commented Mar 23, 2026

Summary

Adds line-by-line evaluation capability to both new evaluators (version-based) and legacy evaluators (category/type-based), enabling per-line evaluation of multi-line outputs with partial credit scoring.

Problem

Current evaluators provide binary pass/fail results for multi-line outputs. If any part of the output is incorrect, the entire evaluation fails with a score of 0.0, even if most lines are correct.

Solution

This PR introduces line-by-line evaluation that:

  • Splits outputs by a configurable delimiter (default: \n)
  • Evaluates each line independently
  • Returns both per-line scores and an aggregated score (average)
  • Provides partial credit based on the percentage of correct lines
  • Works with both new evaluators and legacy evaluators
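The steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's _evaluate_line_by_line() implementation: the function names are hypothetical, and scoring a missing line as 0.0 when the line counts differ is an assumption.

```python
from typing import Callable, List, Tuple


def score_line_by_line(
    actual: str,
    expected: str,
    evaluate_line: Callable[[str, str], float],
    delimiter: str = "\n",
) -> Tuple[List[float], float]:
    """Return (per-line scores, average score) for two delimited strings."""
    actual_lines = actual.split(delimiter)
    expected_lines = expected.split(delimiter)

    # Assumption: a line that is missing on either side scores 0.0.
    total = max(len(actual_lines), len(expected_lines))
    scores: List[float] = []
    for i in range(total):
        if i < len(actual_lines) and i < len(expected_lines):
            scores.append(evaluate_line(actual_lines[i], expected_lines[i]))
        else:
            scores.append(0.0)

    # The aggregated score is the plain average of the per-line scores.
    return scores, sum(scores) / total


def exact_match(actual_line: str, expected_line: str) -> float:
    """Hypothetical per-line check standing in for an ExactMatch evaluator."""
    return 1.0 if actual_line == expected_line else 0.0


scores, average = score_line_by_line(
    "apple\nbanana\ncherry", "apple\nbanana\ngrape", exact_match
)
print(scores, round(average, 2))  # [1.0, 1.0, 0.0] 0.67
```

Passing the per-line check in as a callable mirrors the idea that any output evaluator can be reused for each line.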

Key Changes

New Evaluators (Version-based)

  • output_evaluator.py: Added lineByLineEvaluator and lineDelimiter config options to OutputEvaluatorConfig
  • output_evaluator.py: Implemented _evaluate_line_by_line() method in BaseOutputEvaluator
  • test_evaluator_methods.py: Added 4 comprehensive tests for line-by-line evaluation

Legacy Evaluators (Category/Type-based)

  • base_legacy_evaluator.py: Added lineByLineEvaluation and lineDelimiter fields to BaseLegacyEvaluator
  • base_legacy_evaluator.py: Implemented _evaluate_line_by_line() method for legacy evaluators
  • base_legacy_evaluator.py: Added helper methods: _split_into_lines() and _get_actual_output()
  • test_legacy_exact_match_evaluator.py: Added 5 comprehensive tests for legacy line-by-line evaluation

Runtime & Bug Fixes

  • runtime.py: Fixed aggregation logic to skip line-by-line sub-results (e.g., "ExactMatch (Line 1)")
  • output_evaluator.py: Fixed bug where targetOutputKey wasn't properly applied to individual lines
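As a rough sketch of the aggregation fix, the filter below drops per-line sub-results before averaging. The naming pattern is inferred only from the "ExactMatch (Line 1)" example, and the function and data shapes are hypothetical; the actual logic in runtime.py may identify sub-results differently.

```python
import re
from typing import Dict

# Assumed naming pattern, based on the "ExactMatch (Line 1)" example.
LINE_SUB_RESULT = re.compile(r" \(Line \d+\)$")


def drop_line_sub_results(results: Dict[str, float]) -> Dict[str, float]:
    """Keep only top-level evaluator scores, skipping per-line sub-results."""
    return {
        name: score
        for name, score in results.items()
        if not LINE_SUB_RESULT.search(name)
    }


results = {
    "ExactMatch": 2 / 3,
    "ExactMatch (Line 1)": 1.0,
    "ExactMatch (Line 2)": 1.0,
    "ExactMatch (Line 3)": 0.0,
}
top_level = drop_line_sub_results(results)
print(sorted(top_level))  # ['ExactMatch']
```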

Sample Project

  • samples/line_by_line_test/: Added complete sample agent demonstrating the feature with:
    • Simple agent that outputs one item per line
    • 3 evaluators: new line-by-line, regular, and legacy line-by-line
    • 3 test cases showing partial credit scoring
    • Full documentation in README.md

Benefits

  • Partial credit: 2/3 correct lines = 0.67 score instead of 0.0
  • Granular feedback: See which specific lines passed/failed
  • Flexible: Works with any output evaluator (ExactMatch, Contains, JsonSimilarity, LLMJudge)
  • Universal: Works with both new and legacy evaluator formats
  • Backwards compatible: Disabled by default, opt-in via config

Example Usage

New Evaluators (Low-Code JSON config):

{
  "version": "1.0",
  "evaluatorTypeId": "uipath-exact-match",
  "evaluatorConfig": {
    "name": "LineByLineExactMatch",
    "lineByLineEvaluator": true,
    "lineDelimiter": "\n"
  }
}

Legacy Evaluators (Low-Code JSON config):

{
  "category": 0,
  "type": 1,
  "name": "LegacyLineByLineExactMatch",
  "targetOutputKey": "result",
  "lineByLineEvaluation": true,
  "lineDelimiter": "\n"
}

Coded Evaluator:

@evaluator
def line_by_line_evaluator() -> ExactMatchEvaluator:
    return ExactMatchEvaluator(
        evaluatorConfig={
            "lineByLineEvaluator": True,
            "lineDelimiter": "\n"
        }
    )
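To illustrate the opt-in behaviour of the new-evaluator config, here is a minimal sketch of reading the two options with their defaults (disabled, "\n" delimiter). LineByLineOptions and read_options are hypothetical stand-ins; the PR itself adds these fields to OutputEvaluatorConfig.

```python
import json
from dataclasses import dataclass


@dataclass
class LineByLineOptions:
    """Hypothetical holder for the two line-by-line settings."""
    enabled: bool = False    # disabled by default, so existing configs are unaffected
    delimiter: str = "\n"    # default delimiter


def read_options(config_json: str) -> LineByLineOptions:
    """Read line-by-line settings from a new-style evaluator config."""
    config = json.loads(config_json).get("evaluatorConfig", {})
    return LineByLineOptions(
        enabled=config.get("lineByLineEvaluator", False),
        delimiter=config.get("lineDelimiter", "\n"),
    )


opted_in = json.dumps(
    {"evaluatorConfig": {"lineByLineEvaluator": True, "lineDelimiter": ";"}}
)
untouched = json.dumps({"evaluatorConfig": {"name": "RegularExactMatch"}})

print(read_options(opted_in))
print(read_options(untouched))
```

A config that never mentions the new keys yields the defaults, which is what makes the feature backwards compatible.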

Test Results

Sample agent output showing all three evaluators:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  Evaluation               ┃  LineByLineExactMatch ┃  RegularExactMatch  ┃  LegacyLineByLineExactMatch ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│  Test all lines match     │                  1.0  │                1.0  │                         1.0 │
│  Test when one line       │                  0.7  │                0.0  │                         0.7 │  ← Partial credit!
│  doesn't match            │                       │                     │                             │
│  Test with single item    │                  1.0  │                1.0  │                         1.0 │
├───────────────────────────┼───────────────────────┼─────────────────────┼─────────────────────────────┤
│  Average                  │                  0.9  │                0.7  │                         0.9 │
└───────────────────────────┴───────────────────────┴─────────────────────┴─────────────────────────────┘
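The Average row can be checked by hand. The sketch below assumes the partial-credit test had 2 of 3 lines correct (the table shows the rounded value 0.7, which is consistent with 2/3 but does not confirm it).

```python
from typing import List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Per-test scores, top to bottom as in the table (2/3 is an assumption).
line_by_line = [1.0, 2 / 3, 1.0]
regular = [1.0, 0.0, 1.0]

print(round(mean(line_by_line), 1))  # 0.9
print(round(mean(regular), 1))       # 0.7
```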

Testing

  • ✅ All existing tests pass (1632 tests)
  • ✅ Added 4 new line-by-line tests for new evaluators
  • ✅ Added 5 new line-by-line tests for legacy evaluators
  • ✅ Sample agent demonstrates full functionality with all 3 evaluator types
  • ✅ Type checking passes (mypy)
  • ✅ Linting passes (ruff)
  • ✅ Build succeeds

Bugs Fixed

  1. Runtime aggregation bug: Line sub-results were causing validation errors during score aggregation
  2. targetOutputKey bug: Individual lines weren't wrapped in the proper structure when targetOutputKey != "*"
  3. Type checking bug: mypy could not narrow result.details to LineByLineEvaluationDetails; added isinstance() checks
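The targetOutputKey fix (item 2) can be pictured with the sketch below. It assumes "wrapped in the proper structure" means placing each line under the target key in a dict; the PR does not show the exact shape, and wrap_lines is a hypothetical helper.

```python
from typing import Any, Dict, List, Union


def wrap_lines(
    lines: List[str], target_output_key: str
) -> List[Union[str, Dict[str, Any]]]:
    """Wrap each line so a per-line evaluator can look it up by key."""
    if target_output_key == "*":
        # "*" targets the whole output, so lines are passed through unchanged.
        return list(lines)
    # Assumed shape: one single-key dict per line.
    return [{target_output_key: line} for line in lines]


print(wrap_lines(["apple", "banana"], "result"))
print(wrap_lines(["apple", "banana"], "*"))
```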

🤖 Generated with Claude Code

Development Packages

uipath

[project]
dependencies = [
  # Exact version:
  "uipath==2.10.30.dev1014815583",

  # Any version from PR
  "uipath>=2.10.30.dev1014810000,<2.10.30.dev1014820000"
]

[[tool.uv.index]]
name = "testpypi"
url = "https://test.pypi.org/simple/"
publish-url = "https://test.pypi.org/legacy/"
explicit = true

[tool.uv.sources]
uipath = { index = "testpypi" }

@github-actions bot added the test:uipath-langchain (Triggers tests in the uipath-langchain-python repository) and test:uipath-llamaindex (Triggers tests in the uipath-llamaindex-python repository) labels on Mar 23, 2026
@AAgnihotry force-pushed the feat/line-by-line-evaluation branch from 88191a0 to 8750e34 on March 23, 2026 17:18
@AAgnihotry added the build:dev (Create a dev build from the pr) label on Mar 24, 2026
AAgnihotry and others added 2 commits March 24, 2026 11:30
This commit adds line-by-line evaluation capability to output evaluators,
allowing them to evaluate multi-line outputs on a per-line basis and provide
granular feedback with partial credit scoring.

Key changes:
- Added lineByLineEvaluator config flag to OutputEvaluatorConfig
- Added lineDelimiter config to customize split behavior (default: "\n")
- Implemented _evaluate_line_by_line() method in BaseOutputEvaluator
- Fixed runtime aggregation to handle line-by-line sub-results
- Fixed targetOutputKey wrapping for individual line evaluations
- Added sample agent demonstrating the feature (samples/line_by_line_test)

Benefits:
- Provides partial credit (e.g., 2/3 lines correct = 0.67 score)
- More granular feedback with per-line results
- Useful for evaluating structured multi-line outputs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@AAgnihotry force-pushed the feat/line-by-line-evaluation branch from 5ba8d3d to 730ccc2 on March 24, 2026 18:32
@AAgnihotry force-pushed the feat/line-by-line-evaluation branch from afee629 to a43c257 on March 24, 2026 18:58
AAgnihotry and others added 3 commits March 24, 2026 13:02
- Add lineByLineEvaluation and lineDelimiter fields to BaseLegacyEvaluator
- Implement _evaluate_line_by_line() method for legacy evaluators
- Add helper methods: _split_into_lines() and _get_actual_output()
- Add 5 comprehensive tests for legacy line-by-line evaluation
- Add legacy evaluator to line_by_line_test sample for validation
- Update sample README to document both new and legacy evaluators

This feature enables legacy evaluators (category/type based) to support
line-by-line evaluation with partial credit scoring, matching the
functionality already available in new evaluators (version/evaluatorTypeId based).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Import LineByLineEvaluationDetails in test file
- Add isinstance() checks to help mypy understand result.details type
- Fixes mypy union-attr errors in legacy evaluator tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change pyproject.toml to use local editable install instead of TestPyPI
- Fix legacy evaluator JSON to use integer enum values (category: 0, type: 1)
- Verified: All 3 evaluators (new line-by-line, regular, legacy line-by-line) work correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@AAgnihotry removed and re-added the build:dev (Create a dev build from the pr) label on Mar 24, 2026
@AAgnihotry (Contributor, Author) commented:

[two images attached]

@Chibionos (Contributor) left a comment:

The functions need to be rearranged; they are piled up in the core evaluator files.

"id": "LegacyLineByLineExactMatch",
"category": 0,
"type": 1,
"name": "LegacyLineByLineExactMatch",

This should not break medline when URT is turned on for them.
We should also make sure we track this on the sprint board.

AAgnihotry and others added 2 commits March 24, 2026 17:08
Added isinstance() checks to help mypy understand that result.details
is specifically a LineByLineEvaluationDetails object, not just the
generic str | BaseModel | None type.

This fixes mypy errors when accessing total_lines_actual and
total_lines_expected attributes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
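
The isinstance() narrowing described in this commit can be sketched as follows. The classes are plain stand-ins (the real details type is described as str | BaseModel | None, with LineByLineEvaluationDetails as one model); only the attribute names total_lines_actual and total_lines_expected come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class LineByLineDetails:
    """Stand-in for LineByLineEvaluationDetails."""
    total_lines_actual: int
    total_lines_expected: int


@dataclass
class Result:
    """Stand-in for an evaluation result with a union-typed details field."""
    score: float
    details: Union[str, LineByLineDetails, None] = None


def line_counts(result: Result) -> Optional[Tuple[int, int]]:
    details = result.details
    # Without this check, mypy reports union-attr errors on the accesses
    # below, because str and None have no total_lines_* attributes.
    if isinstance(details, LineByLineDetails):
        return details.total_lines_actual, details.total_lines_expected
    return None


print(line_counts(Result(score=0.67, details=LineByLineDetails(3, 3))))  # (3, 3)
print(line_counts(Result(score=1.0, details="plain text")))              # None
```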
@AAgnihotry force-pushed the feat/line-by-line-evaluation branch from 077915e to c2c050e on March 25, 2026 00:34
@AAgnihotry merged commit 10bc126 into main on Mar 25, 2026
127 checks passed
@AAgnihotry deleted the feat/line-by-line-evaluation branch on March 25, 2026 00:41
Labels

  • build:dev: Create a dev build from the pr
  • test:uipath-langchain: Triggers tests in the uipath-langchain-python repository
  • test:uipath-llamaindex: Triggers tests in the uipath-llamaindex-python repository

2 participants