---
title: "Evaluation Integrations"
description: "Bridge external evaluation frameworks like DeepEval, RAGAS, MLflow, and pytest into the observability stack"
---

import { Aside } from '@astrojs/starlight/components';

The `Benchmark` class bridges any evaluation framework into the observability stack. Run your evaluations with your preferred tool, then upload the results as OTel spans so everything is queryable in one place.

<Aside type="note">
Third-party framework APIs change frequently. The examples below show the integration pattern — adapt imports and method calls to your installed version. The SDK side (`Benchmark`, `b.log()`, `evaluate()`, `EvalScore`) is stable.
</Aside>

---

## DeepEval
[DeepEval](https://github.com/confident-ai/deepeval) provides LLM-as-judge metrics like faithfulness, answer relevancy, and hallucination detection.

```bash
pip install opensearch-genai-observability-sdk-py deepeval
```

```python
from opensearch_genai_observability_sdk_py import register, Benchmark
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

register(service_name="deepeval-eval")

# Define your test cases
test_cases = [
    # ... test case dicts (collapsed in the diff view); each supplies the
    # "input", "output", "expected", and retrieved-context values used below
]

relevancy = AnswerRelevancyMetric(model="gpt-4o")
faithfulness = FaithfulnessMetric(model="gpt-4o")

with Benchmark("deepeval_run", metadata={"framework": "deepeval"}) as b:
    for case in test_cases:
        tc = LLMTestCase(
            input=case["input"],
            actual_output=case["output"],
            expected_output=case["expected"],
            retrieval_context=case.get("context", []),  # FaithfulnessMetric needs context
        )
        relevancy.measure(tc)
        faithfulness.measure(tc)
        b.log(
            input=case["input"],
            output=case["output"],
            expected=case["expected"],
            scores={
                "answer_relevancy": relevancy.score,
                "faithfulness": faithfulness.score,
            },
        )
```

<Aside type="tip">
DeepEval also has a `deepeval.evaluate()` function that runs all metrics at once. You can use either approach — the key is extracting the numeric scores and passing them to `b.log()`.
</Aside>
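Whichever approach you take, a small helper keeps the score extraction uniform. This sketch assumes only that each measured metric object exposes a numeric `.score` attribute (as DeepEval metrics do after `measure()`); the helper itself is framework-agnostic:

```python
def extract_scores(metrics: dict) -> dict:
    """Collect numeric .score attributes from measured metric objects,
    skipping anything that failed or returned a non-numeric score."""
    out = {}
    for name, metric in metrics.items():
        score = getattr(metric, "score", None)
        if isinstance(score, (int, float)):
            out[name] = float(score)
    return out

# Inside the loop above:
#   b.log(..., scores=extract_scores({"answer_relevancy": relevancy,
#                                     "faithfulness": faithfulness}))
```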

---

## RAGAS

[RAGAS](https://docs.ragas.io/) evaluates RAG pipelines with metrics like context precision, faithfulness, and answer correctness.

```bash
pip install opensearch-genai-observability-sdk-py ragas datasets
```

```python
from opensearch_genai_observability_sdk_py import register, Benchmark
from ragas import evaluate as ragas_evaluate
from ragas.metrics import faithfulness, answer_correctness, context_precision
from datasets import Dataset

register(service_name="ragas-eval")

# Prepare your RAG evaluation dataset
data = {
    # ... "question", "answer", "contexts", and "ground_truth" lists
    # (collapsed in the diff view)
}
dataset = Dataset.from_dict(data)

ragas_result = ragas_evaluate(
    dataset,
    metrics=[faithfulness, answer_correctness, context_precision],
)

# Upload results to OpenSearch via Benchmark
df = ragas_result.to_pandas()
with Benchmark("ragas_eval", metadata={"framework": "ragas"}) as b:
    for i, row in df.iterrows():
        b.log(
            input=data["question"][i],
            output=data["answer"][i],
            expected=data["ground_truth"][i],
            scores={col: row[col] for col in df.columns if col not in data},
            case_name=data["question"][i][:50],
        )
```

<Aside type="note">
RAGAS v0.2+ changed its dataset format and metric API. If you're on v0.2+, check the [RAGAS migration guide](https://docs.ragas.io/) for the updated `evaluate()` signature. The SDK-side `Benchmark.log()` call stays the same — only the RAGAS imports and `evaluate()` call change.
</Aside>
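Because metric column names differ between RAGAS versions, a defensive way to build the `scores` dict is to keep only numeric, non-input columns. A sketch (with a pandas row, `row.to_dict()` gives the plain dict this expects; `input_columns` is whatever your dataset dict contains):

```python
def metric_scores(row: dict, input_columns: set) -> dict:
    """Keep numeric metric columns from a result row, dropping the
    dataset columns (question, answer, contexts, ground_truth, ...)."""
    return {
        name: float(value)
        for name, value in row.items()
        if name not in input_columns and isinstance(value, (int, float))
    }
```

This way the `b.log()` call keeps working even when a RAGAS upgrade renames or adds metric columns.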

---

## MLflow

[MLflow](https://mlflow.org/) tracks ML experiments. Export MLflow evaluation results into the observability stack:

```bash
pip install opensearch-genai-observability-sdk-py mlflow
```

```python
from opensearch_genai_observability_sdk_py import register, Experiment
from opensearch_genai_observability_sdk_py import register, Benchmark
import mlflow
import pandas as pd

register(service_name="mlflow-experiment")
register(service_name="mlflow-eval")

# Run MLflow evaluation
eval_data = [
{"inputs": {"question": "What is OpenSearch?"}, "ground_truth": "search engine"},
{"inputs": {"question": "What is OTEL?"}, "ground_truth": "observability framework"},
]
# Prepare evaluation data as a DataFrame
eval_df = pd.DataFrame([
{"inputs": "What is OpenSearch?", "ground_truth": "search engine"},
{"inputs": "What is OTEL?", "ground_truth": "observability framework"},
])

mlflow_result = mlflow.evaluate(
model="openai:/gpt-4o",
data=eval_data,
model_type="question-answering",
)
# Run MLflow evaluation
with mlflow.start_run():
mlflow_result = mlflow.evaluate(
model="openai:/gpt-4o",
data=eval_df,
targets="ground_truth",
model_type="question-answering",
)

# Upload to observability stack
with Experiment("mlflow_eval", metadata={"framework": "mlflow"}) as exp:
for _, row in mlflow_result.tables["eval_results_table"].iterrows():
exp.log(
input=row["inputs"],
output=row["outputs"],
results_df = mlflow_result.tables["eval_results_table"]
with Benchmark("mlflow_eval", metadata={"framework": "mlflow"}) as b:
for _, row in results_df.iterrows():
b.log(
input=row.get("inputs", ""),
output=row.get("outputs", ""),
expected=row.get("ground_truth", ""),
scores={
col: row[col]
for col in mlflow_result.metrics
if col in row and row[col] is not None
k: v for k, v in mlflow_result.metrics.items()
if isinstance(v, (int, float))
},
)
```

<Aside type="note">
MLflow's `evaluate()` API varies across versions. The `model_type` parameter was deprecated in MLflow 2.12+ in favor of `evaluators`. Check [MLflow evaluate docs](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html) for your version.
</Aside>
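Note that the example above attaches the run-level aggregate metrics to every case. If your MLflow version writes per-row metric columns into `eval_results_table` (many versions suffix them with `/score` — treat that naming convention as an assumption to verify against your table), you can log per-case values instead:

```python
def per_row_scores(row: dict) -> dict:
    """Pull per-row metric columns ending in '/score' (an assumed MLflow
    column naming convention) into a dict suitable for b.log(scores=...)."""
    return {
        name.removesuffix("/score"): float(value)
        for name, value in row.items()
        if name.endswith("/score") and isinstance(value, (int, float))
    }
```

Inside the upload loop, `scores=per_row_scores(row.to_dict())` then replaces the aggregate dict.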

---

## pytest

Use `evaluate()` directly in your test suite for CI/CD integration:

```bash
pip install opensearch-genai-observability-sdk-py pytest
```

```python
# conftest.py — initialize tracing once for all tests
import pytest

from opensearch_genai_observability_sdk_py import register

@pytest.fixture(scope="session", autouse=True)
def _init_tracing():
    register(service_name="pytest-eval")
```

```python
# test_agent.py
from opensearch_genai_observability_sdk_py import evaluate, EvalScore

def accuracy_scorer(input, output, expected) -> EvalScore:
    is_correct = expected.lower() in output.lower()
    return EvalScore(
        name="accuracy",
        value=1.0 if is_correct else 0.0,
        label="pass" if is_correct else "fail",
    )

def response_length_scorer(input, output, expected) -> EvalScore:
    return EvalScore(name="response_length", value=float(len(output)))

def my_agent(input: str) -> str:
    # Replace with your agent logic
    return "OpenSearch is a search engine; OTEL is short for OpenTelemetry."

def test_agent_quality():
    result = evaluate(
        name="agent_quality",
        task=my_agent,
        data=[
            {"input": "What is OpenSearch?", "expected": "search"},
            {"input": "What is OTEL?", "expected": "opentelemetry"},
        ],
        scores=[accuracy_scorer, response_length_scorer],
    )
    avg_accuracy = result.summary.scores["accuracy"].avg
    assert avg_accuracy >= 0.8, f"Accuracy dropped to {avg_accuracy}"
```

Run with `pytest test_agent.py` - results are recorded as OTel benchmark spans and available in OpenSearch Dashboards.

---

## Related links

- [Evaluation & Scoring](/docs/ai-observability/evaluation/) - core `score()`, `evaluate()`, `Benchmark` API
- [Python SDK reference](/docs/send-data/ai-agents/python/) - full SDK documentation
- [Agent Health - Experiments](/docs/agent-health/evaluations/experiments/) - UI and CLI-based experiment workflows
---
The Python SDK provides three evaluation capabilities that all emit data through the OTLP pipeline:

- **`score()`** - attach quality scores to individual traces or spans
- **`evaluate()`** - run an agent against a dataset with automated scorer functions
- **`Benchmark`** - upload pre-computed results from any evaluation framework

All evaluation data lands in the same OpenSearch index as your traces, so you can query scores alongside agent spans.
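For example, once benchmark spans are indexed you can query them by suite name. This is a sketch only — the index pattern, the `test_case` span name, and the attribute path `attributes.test.suite.name` are assumptions about how your OTLP pipeline maps span attributes, so verify them against your cluster's mapping:

```python
def benchmark_case_query(suite_name: str, size: int = 100) -> dict:
    """Build an OpenSearch query body for the test_case spans of one
    benchmark suite. Field paths are assumptions - adjust to your mapping."""
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"name": "test_case"}},
                    {"term": {"attributes.test.suite.name": suite_name}},
                ]
            }
        },
    }

# With an opensearch-py client (index pattern is illustrative):
# client.search(index="otel-v1-apm-span-*", body=benchmark_case_query("deepeval_run"))
```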


## `evaluate()` - run experiments

Executes a task function against each item in a dataset, runs scorer functions, and records everything as OTel benchmark spans.

```python
from opensearch_genai_observability_sdk_py import register, observe, evaluate, Op, EvalScore

# ... full example collapsed in the diff view ...

print(result.summary)
```

| Parameter | Type | Description |
|---|---|---|
| `name` | `str` | Benchmark name (`test.suite.name`). Stable across runs. |
| `task` | `Callable` | Function that takes input and returns output. Use `@observe` for full tracing. |
| `data` | `list[dict]` | Test cases: `"input"` (required), `"expected"`, `"case_id"`, `"case_name"` (optional). |
| `scores` | `list[Callable]` | Scorer functions - each receives `(input, output, expected)`. |
| `metadata` | `dict` | Attached to the root benchmark span. Reserved keys (`test.*`, `gen_ai.operation.name`) are filtered with a warning. |
| `record_io` | `bool` | Record input/output/expected as span attributes. Default `False`. |

### Scorer functions

```mermaid
flowchart TD
A["test_suite_run - benchmark root"] --> B["test_case - case 1"]
A --> C["test_case - case 2"]
B --> D["invoke_agent my_agent"]
B --> E["evaluation result events"]
D --> F["execute_tool ..."]
```

Agent traces from the task become children of `test_case` spans - full waterfall from benchmark to individual LLM calls.

### Result types

```python
result = evaluate(...)
result.summary # BenchmarkSummary
result.summary.scores # dict[str, ScoreSummary] - avg, min, max, count per metric
result.cases # list[TestCaseResult] - per-case input, output, scores, status
```

`BenchmarkResult` contains:
- `summary: BenchmarkSummary` — benchmark name, run ID, total cases, error count, duration, and per-metric `ScoreSummary` (avg, min, max, count)
- `cases: list[TestCaseResult]` — per-case case_id, input, output, expected, scores dict, error, status, scorer_errors
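These result objects make it easy to surface regressions in CI. A small helper, duck-typed against the fields listed above, might look like:

```python
def failing_cases(result, metric: str, threshold: float) -> list:
    """Return (case_id, score) pairs for cases that errored or scored
    below threshold on the given metric - handy for CI failure reports."""
    failures = []
    for case in result.cases:
        score = case.scores.get(metric)
        if case.status == "fail" or (score is not None and score < threshold):
            failures.append((case.case_id, score))
    return failures
```

For example, `failing_cases(result, "accuracy", 0.8)` after `evaluate()` gives the cases to print in an assertion message.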

---

## `Benchmark` - upload pre-computed results

Use `Benchmark` when you already have evaluation results from another framework (RAGAS, DeepEval, pytest, custom) and want to upload them as OTel spans.

```python
from opensearch_genai_observability_sdk_py import register, Benchmark

register(service_name="eval-upload")

with Benchmark("ragas_eval_v2", metadata={"framework": "ragas"}) as b:
    b.log(
        input="What is OpenSearch?",
        output="OpenSearch is an open-source search engine.",
        expected="search and analytics engine",
        scores={"faithfulness": 0.92, "relevance": 0.88},
        case_name="opensearch_definition",
    )
    b.log(
        input="How does RAG work?",
        output="RAG retrieves documents then generates answers.",
        scores={"faithfulness": 0.95, "relevance": 0.91},
    )
# summary printed on close
```

You can also use `Benchmark` without a context manager:

```python
b = Benchmark(name="my-eval")
b.log(input="q1", output="a1", scores={"accuracy": 1.0})
summary = b.close()
```

### Constructor parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | | Benchmark name (`test.suite.name`). Stable across runs. |
| `metadata` | `dict` | `None` | Attached to the root span. Reserved keys (`test.*`, `gen_ai.operation.name`) are filtered with a warning. |
| `record_io` | `bool` | `False` | Record input/output/expected as span attributes. |

### `log()` parameters

| Parameter | Type | Description |
|---|---|---|
| `output` | any | Agent output. |
| `expected` | any | Ground truth. |
| `scores` | `dict[str, float]` | Pre-computed scores. |
| `metadata` | `dict` | Per-case metadata. Reserved keys are filtered. |
| `error` | `str` | Error message (sets status to `"fail"`). |
| `case_id` | `str` | Explicit ID. Defaults to SHA256 hash of input. |
| `case_name` | `str` | Human-readable name (`test.case.name`). |
| `trace_id` | `str` | Creates OTel span link to an agent trace. Must be provided with `span_id`. |
| `span_id` | `str` | Span-level linking. Must be provided with `trace_id`. |
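To link a logged case back to a live agent trace, you need the active span's IDs as hex strings. The integers come from `trace.get_current_span().get_span_context()` in the OpenTelemetry API; the lowercase-hex string format below follows OTel convention, but confirm it is what `b.log()` expects in your SDK version:

```python
def otel_ids(trace_id: int, span_id: int) -> tuple:
    """Render OpenTelemetry integer IDs as lowercase hex strings:
    32 chars for the trace ID, 16 for the span ID."""
    return format(trace_id, "032x"), format(span_id, "016x")

# ctx = trace.get_current_span().get_span_context()
# trace_id, span_id = otel_ids(ctx.trace_id, ctx.span_id)
# b.log(..., trace_id=trace_id, span_id=span_id)
```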

---

- [Evaluation Integrations](/docs/ai-observability/evaluation-integrations/) - use DeepEval, RAGAS, MLflow, pytest with the observability stack
- [Python SDK reference](/docs/send-data/ai-agents/python/) - `register`, `observe`, `enrich` documentation
- [Agent Tracing UI](/docs/ai-observability/agent-tracing/) - explore traces in OpenSearch Dashboards
---
| Capability | API | Description |
|---|---|---|
| **Auto-instrument LLMs** | `register(auto_instrument=True)` | OpenAI, Anthropic, Bedrock, LangChain, and 20+ libraries traced automatically |
| **Score traces** | `score()` | Attaches evaluation scores to traces through the OTLP pipeline |
| **Run experiments** | `evaluate()` | Runs a task against a dataset with scorer functions, records everything as OTel spans |
| **Upload results** | `Benchmark` | Uploads pre-computed eval results from RAGAS, DeepEval, pytest, or custom frameworks |
| **Query traces** | `OpenSearchTraceRetriever` | Retrieves stored traces from OpenSearch for evaluation pipelines |
| **AWS production** | `AWSSigV4OTLPExporter` | SigV4-signed exports to OpenSearch Ingestion or OpenSearch Service |

## What's next

- [Python SDK reference](/docs/send-data/ai-agents/python/) - full API documentation for `register`, `observe`, `enrich`, and AWS auth
- [Evaluation & Scoring](/docs/ai-observability/evaluation/) - `score()`, `evaluate()`, `Benchmark`, and `OpenSearchTraceRetriever` in depth
- [Agent Tracing UI](/docs/ai-observability/agent-tracing/) - explore traces, graphs, and timelines in OpenSearch Dashboards
- [Agent Health](/docs/agent-health/) - evaluate agents with Golden Path comparison, LLM judges, and batch experiments