---
title: "Evaluation Integrations"
description: "Bridge external evaluation frameworks like DeepEval, RAGAS, MLflow, and pytest into the observability stack"
---

import { Aside } from '@astrojs/starlight/components';

The `Benchmark` class bridges any evaluation framework into the observability stack. Run your evaluations with your preferred tool, then upload the results as OTel spans so everything is queryable in one place.

<Aside type="note">
Third-party framework APIs change frequently. The examples below show the integration pattern — adapt imports and method calls to your installed version. The SDK side (`Benchmark`, `b.log()`, `evaluate()`, `EvalScore`) is stable.
</Aside>

---

## DeepEval
[DeepEval](https://github.com/confident-ai/deepeval) provides LLM-as-judge metrics like faithfulness, answer relevancy, and hallucination detection.

```bash
pip install opensearch-genai-observability-sdk-py deepeval
```

```python
from opensearch_genai_observability_sdk_py import register, Benchmark
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

register(service_name="deepeval-eval")

# Define your test cases
test_cases = [
    # ... test case dicts (collapsed in the diff view); each supplies the
    # "input", "output", "expected", and retrieved-context values used below
]

relevancy = AnswerRelevancyMetric(model="gpt-4o")
faithfulness = FaithfulnessMetric(model="gpt-4o")

with Benchmark("deepeval_run", metadata={"framework": "deepeval"}) as b:
    for case in test_cases:
        tc = LLMTestCase(
            input=case["input"],
            actual_output=case["output"],
            expected_output=case["expected"],
            retrieval_context=case.get("context", []),  # FaithfulnessMetric needs context
        )
        relevancy.measure(tc)
        faithfulness.measure(tc)
        b.log(
            input=case["input"],
            output=case["output"],
            expected=case["expected"],
            scores={
                "answer_relevancy": relevancy.score,
                "faithfulness": faithfulness.score,
            },
        )
```

<Aside type="tip">
DeepEval also has a `deepeval.evaluate()` function that runs all metrics at once. You can use either approach — the key is extracting the numeric scores and passing them to `b.log()`.
</Aside>
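Whichever approach you take, a small helper keeps the score extraction uniform. This sketch assumes only that each measured metric object exposes a numeric `.score` attribute (as DeepEval metrics do after `measure()`); the helper itself is framework-agnostic:

```python
def extract_scores(metrics: dict) -> dict:
    """Collect numeric .score attributes from measured metric objects,
    skipping anything that failed or returned a non-numeric score."""
    out = {}
    for name, metric in metrics.items():
        score = getattr(metric, "score", None)
        if isinstance(score, (int, float)):
            out[name] = float(score)
    return out

# Inside the loop above:
#   b.log(..., scores=extract_scores({"answer_relevancy": relevancy,
#                                     "faithfulness": faithfulness}))
```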

---

## RAGAS

[RAGAS](https://docs.ragas.io/) evaluates RAG pipelines with metrics like context precision, faithfulness, and answer correctness.

```bash
pip install opensearch-genai-observability-sdk-py ragas datasets
```

```python
from opensearch_genai_observability_sdk_py import register, Benchmark
from ragas import evaluate as ragas_evaluate
from ragas.metrics import faithfulness, answer_correctness, context_precision
from datasets import Dataset

register(service_name="ragas-eval")

# Prepare your RAG evaluation dataset
data = {
    # ... "question", "answer", "contexts", and "ground_truth" lists
    # (collapsed in the diff view)
}
dataset = Dataset.from_dict(data)

ragas_result = ragas_evaluate(
    dataset,
    metrics=[faithfulness, answer_correctness, context_precision],
)

# Upload results to OpenSearch via Benchmark
df = ragas_result.to_pandas()
with Benchmark("ragas_eval", metadata={"framework": "ragas"}) as b:
    for i, row in df.iterrows():
        b.log(
            input=data["question"][i],
            output=data["answer"][i],
            expected=data["ground_truth"][i],
            scores={col: row[col] for col in df.columns if col not in data},
            case_name=data["question"][i][:50],
        )
```

<Aside type="note">
RAGAS v0.2+ changed its dataset format and metric API. If you're on v0.2+, check the [RAGAS migration guide](https://docs.ragas.io/) for the updated `evaluate()` signature. The SDK-side `Benchmark.log()` call stays the same — only the RAGAS imports and `evaluate()` call change.
</Aside>
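Because metric column names differ between RAGAS versions, a defensive way to build the `scores` dict is to keep only numeric, non-input columns. A sketch (with a pandas row, `row.to_dict()` gives the plain dict this expects; `input_columns` is whatever your dataset dict contains):

```python
def metric_scores(row: dict, input_columns: set) -> dict:
    """Keep numeric metric columns from a result row, dropping the
    dataset columns (question, answer, contexts, ground_truth, ...)."""
    return {
        name: float(value)
        for name, value in row.items()
        if name not in input_columns and isinstance(value, (int, float))
    }
```

This way the `b.log()` call keeps working even when a RAGAS upgrade renames or adds metric columns.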

---

## MLflow

[MLflow](https://mlflow.org/) tracks ML experiments. Export MLflow evaluation results into the observability stack:

```bash
pip install opensearch-genai-observability-sdk-py mlflow
```

```python
from opensearch_genai_observability_sdk_py import register, Experiment
from opensearch_genai_observability_sdk_py import register, Benchmark
import mlflow
import pandas as pd

register(service_name="mlflow-experiment")
register(service_name="mlflow-eval")

# Run MLflow evaluation
eval_data = [
{"inputs": {"question": "What is OpenSearch?"}, "ground_truth": "search engine"},
{"inputs": {"question": "What is OTEL?"}, "ground_truth": "observability framework"},
]
# Prepare evaluation data as a DataFrame
eval_df = pd.DataFrame([
{"inputs": "What is OpenSearch?", "ground_truth": "search engine"},
{"inputs": "What is OTEL?", "ground_truth": "observability framework"},
])

mlflow_result = mlflow.evaluate(
model="openai:/gpt-4o",
data=eval_data,
model_type="question-answering",
)
# Run MLflow evaluation
with mlflow.start_run():
mlflow_result = mlflow.evaluate(
model="openai:/gpt-4o",
data=eval_df,
targets="ground_truth",
model_type="question-answering",
)

# Upload to observability stack
with Experiment("mlflow_eval", metadata={"framework": "mlflow"}) as exp:
for _, row in mlflow_result.tables["eval_results_table"].iterrows():
exp.log(
input=row["inputs"],
output=row["outputs"],
results_df = mlflow_result.tables["eval_results_table"]
with Benchmark("mlflow_eval", metadata={"framework": "mlflow"}) as b:
for _, row in results_df.iterrows():
b.log(
input=row.get("inputs", ""),
output=row.get("outputs", ""),
expected=row.get("ground_truth", ""),
scores={
col: row[col]
for col in mlflow_result.metrics
if col in row and row[col] is not None
k: v for k, v in mlflow_result.metrics.items()
if isinstance(v, (int, float))
},
)
```

<Aside type="note">
MLflow's `evaluate()` API varies across versions. The `model_type` parameter was deprecated in MLflow 2.12+ in favor of `evaluators`. Check [MLflow evaluate docs](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html) for your version.
</Aside>
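Note that the example above attaches the run-level aggregate metrics to every case. If your MLflow version writes per-row metric columns into `eval_results_table` (many versions suffix them with `/score` — treat that naming convention as an assumption to verify against your table), you can log per-case values instead:

```python
def per_row_scores(row: dict) -> dict:
    """Pull per-row metric columns ending in '/score' (an assumed MLflow
    column naming convention) into a dict suitable for b.log(scores=...)."""
    return {
        name.removesuffix("/score"): float(value)
        for name, value in row.items()
        if name.endswith("/score") and isinstance(value, (int, float))
    }
```

Inside the upload loop, `scores=per_row_scores(row.to_dict())` then replaces the aggregate dict.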

---

## pytest

Use `evaluate()` directly in your test suite for CI/CD integration:

```bash
pip install opensearch-genai-observability-sdk-py pytest
```

```python
# conftest.py — initialize tracing once for all tests
import pytest

from opensearch_genai_observability_sdk_py import register

@pytest.fixture(scope="session", autouse=True)
def _init_tracing():
    register(service_name="pytest-eval")
```

```python
# test_agent.py
from opensearch_genai_observability_sdk_py import evaluate, EvalScore

def accuracy_scorer(input, output, expected) -> EvalScore:
    is_correct = expected.lower() in output.lower()
    return EvalScore(
        name="accuracy",
        value=1.0 if is_correct else 0.0,
        label="pass" if is_correct else "fail",
    )

def response_length_scorer(input, output, expected) -> EvalScore:
    return EvalScore(name="response_length", value=float(len(output)))

def my_agent(input: str) -> str:
    # Replace with your agent logic
    return "OpenSearch is a search engine; OTEL is short for OpenTelemetry."

def test_agent_quality():
    result = evaluate(
        name="agent_quality",
        task=my_agent,
        data=[
            {"input": "What is OpenSearch?", "expected": "search"},
            {"input": "What is OTEL?", "expected": "opentelemetry"},
        ],
        scores=[accuracy_scorer, response_length_scorer],
    )
    avg_accuracy = result.summary.scores["accuracy"].avg
    assert avg_accuracy >= 0.8, f"Accuracy dropped to {avg_accuracy}"
```

Run with `pytest test_agent.py` - results are recorded as OTel benchmark spans and available in OpenSearch Dashboards.

---

## Related links

- [Evaluation & Scoring](/docs/ai-observability/evaluation/) - core `score()`, `evaluate()`, `Benchmark` API
- [Python SDK reference](/docs/send-data/ai-agents/python/) - full SDK documentation
- [Agent Health - Experiments](/docs/agent-health/evaluations/experiments/) - UI and CLI-based experiment workflows
---
The Python SDK provides three evaluation capabilities that all emit data through the OTLP pipeline:

- **`score()`** - attach quality scores to individual traces or spans
- **`evaluate()`** - run an agent against a dataset with automated scorer functions
- **`Benchmark`** - upload pre-computed results from any evaluation framework

All evaluation data lands in the same OpenSearch index as your traces, so you can query scores alongside agent spans.
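For example, once benchmark spans are indexed you can query them by suite name. This is a sketch only — the index pattern, the `test_case` span name, and the attribute path `attributes.test.suite.name` are assumptions about how your OTLP pipeline maps span attributes, so verify them against your cluster's mapping:

```python
def benchmark_case_query(suite_name: str, size: int = 100) -> dict:
    """Build an OpenSearch query body for the test_case spans of one
    benchmark suite. Field paths are assumptions - adjust to your mapping."""
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"name": "test_case"}},
                    {"term": {"attributes.test.suite.name": suite_name}},
                ]
            }
        },
    }

# With an opensearch-py client (index pattern is illustrative):
# client.search(index="otel-v1-apm-span-*", body=benchmark_case_query("deepeval_run"))
```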


## `evaluate()` - run experiments

Executes a task function against each item in a dataset, runs scorer functions, and records everything as OTel benchmark spans.

```python
from opensearch_genai_observability_sdk_py import register, observe, evaluate, Op, EvalScore

# ... full example collapsed in the diff view ...

print(result.summary)
```

| Parameter | Type | Description |
|---|---|---|
| `name` | `str` | Benchmark name (`test.suite.name`). Stable across runs. |
| `task` | `Callable` | Function that takes input and returns output. Use `@observe` for full tracing. |
| `data` | `list[dict]` | Test cases: `"input"` (required), `"expected"`, `"case_id"`, `"case_name"` (optional). |
| `scores` | `list[Callable]` | Scorer functions - each receives `(input, output, expected)`. |
| `metadata` | `dict` | Attached to the root benchmark span. Reserved keys (`test.*`, `gen_ai.operation.name`) are filtered with a warning. |
| `record_io` | `bool` | Record input/output/expected as span attributes. Default `False`. |

### Scorer functions

```mermaid
flowchart TD
A["test_suite_run - benchmark root"] --> B["test_case - case 1"]
A --> C["test_case - case 2"]
B --> D["invoke_agent my_agent"]
B --> E["evaluation result events"]
D --> F["execute_tool ..."]
```

Agent traces from the task become children of `test_case` spans - full waterfall from benchmark to individual LLM calls.

### Result types

```python
result = evaluate(...)
result.summary # BenchmarkSummary
result.summary.scores # dict[str, ScoreSummary] - avg, min, max, count per metric
result.cases # list[TestCaseResult] - per-case input, output, scores, status
```

`BenchmarkResult` contains:
- `summary: BenchmarkSummary` — benchmark name, run ID, total cases, error count, duration, and per-metric `ScoreSummary` (avg, min, max, count)
- `cases: list[TestCaseResult]` — per-case case_id, input, output, expected, scores dict, error, status, scorer_errors
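These result objects make it easy to surface regressions in CI. A small helper, duck-typed against the fields listed above, might look like:

```python
def failing_cases(result, metric: str, threshold: float) -> list:
    """Return (case_id, score) pairs for cases that errored or scored
    below threshold on the given metric - handy for CI failure reports."""
    failures = []
    for case in result.cases:
        score = case.scores.get(metric)
        if case.status == "fail" or (score is not None and score < threshold):
            failures.append((case.case_id, score))
    return failures
```

For example, `failing_cases(result, "accuracy", 0.8)` after `evaluate()` gives the cases to print in an assertion message.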

---

## `Benchmark` - upload pre-computed results

Use `Benchmark` when you already have evaluation results from another framework (RAGAS, DeepEval, pytest, custom) and want to upload them as OTel spans.

```python
from opensearch_genai_observability_sdk_py import register, Benchmark

register(service_name="eval-upload")

with Benchmark("ragas_eval_v2", metadata={"framework": "ragas"}) as b:
    b.log(
        input="What is OpenSearch?",
        output="OpenSearch is an open-source search engine.",
        expected="search and analytics engine",
        scores={"faithfulness": 0.92, "relevance": 0.88},
        case_name="opensearch_definition",
    )
    b.log(
        input="How does RAG work?",
        output="RAG retrieves documents then generates answers.",
        scores={"faithfulness": 0.95, "relevance": 0.91},
    )
# summary printed on close
```

You can also use `Benchmark` without a context manager:

```python
b = Benchmark(name="my-eval")
b.log(input="q1", output="a1", scores={"accuracy": 1.0})
summary = b.close()
```

### Constructor parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | | Benchmark name (`test.suite.name`). Stable across runs. |
| `metadata` | `dict` | `None` | Attached to the root span. Reserved keys (`test.*`, `gen_ai.operation.name`) are filtered with a warning. |
| `record_io` | `bool` | `False` | Record input/output/expected as span attributes. |

### `log()` parameters

| Parameter | Type | Description |
|---|---|---|
| `output` | any | Agent output. |
| `expected` | any | Ground truth. |
| `scores` | `dict[str, float]` | Pre-computed scores. |
| `metadata` | `dict` | Per-case metadata. Reserved keys are filtered. |
| `error` | `str` | Error message (sets status to `"fail"`). |
| `case_id` | `str` | Explicit ID. Defaults to SHA256 hash of input. |
| `case_name` | `str` | Human-readable name (`test.case.name`). |
| `trace_id` | `str` | Creates OTel span link to an agent trace. Must be provided with `span_id`. |
| `span_id` | `str` | Span-level linking. Must be provided with `trace_id`. |
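To link a logged case back to a live agent trace, you need the active span's IDs as hex strings. The integers come from `trace.get_current_span().get_span_context()` in the OpenTelemetry API; the lowercase-hex string format below follows OTel convention, but confirm it is what `b.log()` expects in your SDK version:

```python
def otel_ids(trace_id: int, span_id: int) -> tuple:
    """Render OpenTelemetry integer IDs as lowercase hex strings:
    32 chars for the trace ID, 16 for the span ID."""
    return format(trace_id, "032x"), format(span_id, "016x")

# ctx = trace.get_current_span().get_span_context()
# trace_id, span_id = otel_ids(ctx.trace_id, ctx.span_id)
# b.log(..., trace_id=trace_id, span_id=span_id)
```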

---

- [Evaluation Integrations](/docs/ai-observability/evaluation-integrations/) - use DeepEval, RAGAS, MLflow, pytest with the observability stack
- [Python SDK reference](/docs/send-data/ai-agents/python/) - `register`, `observe`, `enrich` documentation
- [Agent Tracing UI](/docs/ai-observability/agent-tracing/) - explore traces in OpenSearch Dashboards
---
| Capability | API | Description |
|---|---|---|
| **Auto-instrument LLMs** | `register(auto_instrument=True)` | OpenAI, Anthropic, Bedrock, LangChain, and 20+ libraries traced automatically |
| **Score traces** | `score()` | Attaches evaluation scores to traces through the OTLP pipeline |
| **Run experiments** | `evaluate()` | Runs a task against a dataset with scorer functions, records everything as OTel spans |
| **Upload results** | `Benchmark` | Uploads pre-computed eval results from RAGAS, DeepEval, pytest, or custom frameworks |
| **Query traces** | `OpenSearchTraceRetriever` | Retrieves stored traces from OpenSearch for evaluation pipelines |
| **AWS production** | `AWSSigV4OTLPExporter` | SigV4-signed exports to OpenSearch Ingestion or OpenSearch Service |

## What's next

- [Python SDK reference](/docs/send-data/ai-agents/python/) - full API documentation for `register`, `observe`, `enrich`, and AWS auth
- [Evaluation & Scoring](/docs/ai-observability/evaluation/) - `score()`, `evaluate()`, `Benchmark`, and `OpenSearchTraceRetriever` in depth
- [Agent Tracing UI](/docs/ai-observability/agent-tracing/) - explore traces, graphs, and timelines in OpenSearch Dashboards
- [Agent Health](/docs/agent-health/) - evaluate agents with Golden Path comparison, LLM judges, and batch experiments