Summary
Expand the evaluation harness so compiler changes for multi-document output and metadata enrichment are measured before they become the default path.
Why this work is needed
The repo already has an evaluation harness, but it does not yet capture the new compiler behaviors planned under this epic.
Scope
- Add evaluation cases for multi-document outputs.
- Add evaluation cases for topic and metadata enrichment.
- Keep harness results comparable across compiler versions so that changes can be evaluated side by side.
- Preserve compatibility with existing evaluation workflows.
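To make the scope concrete, here is a minimal sketch of what a multi-document and metadata-enrichment case could look like. Everything in it is an assumption for illustration: the `EvalCase` fields, the `check_case` helper, and the shape of a compiled document (a dict with a `"metadata"` dict that may hold a `"topics"` list) are hypothetical and not taken from the existing harness.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One hypothetical evaluation case; all field names are illustrative."""
    case_id: str
    source_text: str
    expected_doc_count: int = 1  # a value above 1 marks a multi-document case
    expected_topics: set = field(default_factory=set)
    required_metadata_keys: set = field(default_factory=set)


def check_case(case, compiled_docs):
    """Return failure messages for one case; an empty list means it passed.

    Assumes each compiled document is a dict with a "metadata" dict that
    may contain a "topics" list. That shape is an assumption of this sketch.
    """
    failures = []
    if len(compiled_docs) != case.expected_doc_count:
        failures.append(
            f"{case.case_id}: expected {case.expected_doc_count} documents, "
            f"got {len(compiled_docs)}"
        )
    found_topics = set()
    for index, doc in enumerate(compiled_docs):
        metadata = doc.get("metadata", {})
        missing_keys = case.required_metadata_keys - set(metadata)
        if missing_keys:
            failures.append(
                f"{case.case_id}: document {index} is missing metadata keys "
                f"{sorted(missing_keys)}"
            )
        found_topics.update(metadata.get("topics", []))
    missing_topics = case.expected_topics - found_topics
    if missing_topics:
        failures.append(
            f"{case.case_id}: missing topics {sorted(missing_topics)}"
        )
    return failures
```

Because `expected_doc_count` defaults to 1 and the topic and metadata sets default to empty, existing single-document cases could be expressed in the same structure without extra fields, which is one possible way to keep current workflows compatible.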
Out of scope
- Hybrid retrieval evaluation.
- Frontend UX testing.
Acceptance criteria
- The evaluation harness covers representative multi-document cases.
- Metadata-enrichment behavior is represented in the dataset or checks.
- Compiler changes in this stream can be regression-tested.
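The regression-testing criterion could be checked along the following lines. This is a sketch under stated assumptions, not the harness's actual interface: it assumes each run produces a per-case score where higher is better, and the `find_regressions` name and `tolerance` parameter are hypothetical.

```python
from typing import Dict, Optional, Tuple


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Compare per-case scores from two harness runs.

    Returns a mapping of case id to (baseline score, candidate score) for
    every case whose candidate score dropped by more than `tolerance`.
    A case missing from the candidate run is reported with a score of None.
    """
    regressions = {}
    for case_id, base_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None:
            # A case that disappeared is treated as a regression.
            regressions[case_id] = (base_score, None)
        elif new_score < base_score - tolerance:
            regressions[case_id] = (base_score, new_score)
    return regressions
```

Cases that exist only in the candidate run are ignored here, so adding new multi-document or metadata cases would not by itself flag a regression against an older baseline.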
Dependencies