Create and run LLM benchmarks.
Just the library:

```shell
pip install flow-benchmark-tools==1.5.0
```

Library + example benchmarks (see below):

```shell
pip install "flow-benchmark-tools[examples]==1.5.0"
```

To create and run a benchmark:

- Create an agent by inheriting `BenchmarkAgent` and implementing the `run_benchmark_case` method.
- Create a `Benchmark` by compiling a list of `BenchmarkCase`s. These can be read from a JSONL file.
- Associate agent and benchmark in a `BenchmarkRun`.
- Use a `BenchmarkRunner` to run your `BenchmarkRun`.
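The four steps above can be sketched as follows. This is a self-contained illustration of the pattern only: the class names mirror the library's, but the implementations here are simplified stand-ins, not the actual `flow_benchmark_tools` API.

```python
# Stand-in sketch of the BenchmarkAgent / Benchmark / BenchmarkRun /
# BenchmarkRunner pattern. Not the library's real API.
from dataclasses import dataclass, field


@dataclass
class BenchmarkCase:
    input: str
    extra: dict = field(default_factory=dict)


@dataclass
class BenchmarkCaseResponse:
    output: str


class BenchmarkAgent:
    def before(self, case):  # optional per-case setup hook
        pass

    def run_benchmark_case(self, case):
        raise NotImplementedError

    def after(self, case):  # optional per-case teardown hook
        pass


class EchoAgent(BenchmarkAgent):
    # Step 1: inherit BenchmarkAgent, implement run_benchmark_case.
    def run_benchmark_case(self, case):
        return BenchmarkCaseResponse(output=f"answer to: {case.input}")


@dataclass
class Benchmark:
    cases: list  # Step 2: a Benchmark compiles a list of BenchmarkCases


@dataclass
class BenchmarkRun:
    agent: BenchmarkAgent  # Step 3: associate agent and benchmark
    benchmark: Benchmark


class BenchmarkRunner:
    # Step 4: the runner drives the agent over every case.
    def run(self, benchmark_run):
        responses = []
        for case in benchmark_run.benchmark.cases:
            benchmark_run.agent.before(case)
            responses.append(benchmark_run.agent.run_benchmark_case(case))
            benchmark_run.agent.after(case)
        return responses


# Wire the pieces together and run.
run = BenchmarkRun(
    agent=EchoAgent(),
    benchmark=Benchmark(cases=[BenchmarkCase(input="What is RAG?")]),
)
for response in BenchmarkRunner().run(run):
    print(response.output)
```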
Two end-to-end benchmark examples are provided in the examples folder: a LangChain RAG application and an OpenAI Assistant agent.
To run the LangChain RAG benchmark:

```shell
python src/examples/langchain_rag_agent.py
```

To run the OpenAI Assistant benchmark:

```shell
python src/examples/openai_assistant_agent.py
```

The RAG benchmark cases are defined in `data/rag_benchmark.jsonl`.
The two examples follow the typical usage pattern of the library:
- define an agent by implementing the `BenchmarkAgent` interface and overriding the `run_benchmark_case` method (you can also override the `before` and `after` methods, if needed),
- create a set of benchmark cases, typically as a JSONL file such as `data/rag_benchmark.jsonl`,
- use a `BenchmarkRunner` to run the benchmark.
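A benchmark-case JSONL file holds one JSON object per line, each parsing to one case. The sketch below shows the mechanics; the field names used here (`input`, `extra`) are illustrative assumptions, not the library's documented schema.

```python
# Sketch: benchmark cases stored as JSONL, one JSON object per line.
# Field names ("input", "extra") are assumed for illustration.
import json

jsonl_text = """\
{"input": "What does RAG stand for?", "extra": {}}
{"input": "Summarize the indexed document.", "extra": {"criteria": "Answer is concise."}}
"""

# Each non-empty line parses to one benchmark case.
cases = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
print(len(cases))         # 2
print(cases[1]["extra"])  # {'criteria': 'Answer is concise.'}
```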
A criteria-based use of the library is also provided in the examples folder: an application that assesses the quality of pre-computed LLM outputs against the criteria defined in each benchmark case.
To run the Criteria benchmark:

```shell
python src/examples/criteria_evaluation_agent.py
```

The criteria benchmark cases are defined in `data/criteria_benchmark.jsonl`.
This example follows a different application of the library:
- define an agent implementing the `BenchmarkAgent` interface. In this application, each case already has the output we want to evaluate, so we override the `run_benchmark_case` method to simply repackage each `BenchmarkCase` as a `BenchmarkCaseResponse`,
- create a set of quality benchmark cases, typically as a JSONL file such as `data/criteria_benchmark.jsonl`. In this application, each case's `extra` dictionary includes a `criteria` string,
- use a custom `CriteriaBenchmarkRunner` which overrides the `_execute_benchmark_case` method to run the benchmark using an evaluator that inherits from `CriteriaEvaluator`.
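The criteria flow above can be sketched as follows. Again this is a self-contained stand-in, not the library's API: the toy evaluator checks keywords, whereas a real `CriteriaEvaluator` subclass would typically use an LLM judge.

```python
# Stand-in sketch of the criteria-evaluation pattern. Names mirror the
# library (CriteriaEvaluator, _execute_benchmark_case) but the
# implementations are simplified illustrations.

class CriteriaEvaluator:
    def evaluate(self, output: str, criteria: str) -> bool:
        raise NotImplementedError


class KeywordCriteriaEvaluator(CriteriaEvaluator):
    # Toy evaluator: pass if every word of the criteria string appears in
    # the output. A real evaluator would call an LLM judge instead.
    def evaluate(self, output, criteria):
        return all(word.lower() in output.lower() for word in criteria.split())


class PrecomputedOutputAgent:
    # Each case already carries the output to evaluate, so the agent just
    # repackages the case as a response.
    def run_benchmark_case(self, case: dict) -> dict:
        return {"output": case["output"], "extra": case["extra"]}


class CriteriaBenchmarkRunner:
    def __init__(self, agent, evaluator):
        self.agent = agent
        self.evaluator = evaluator

    def _execute_benchmark_case(self, case):
        response = self.agent.run_benchmark_case(case)
        # The "criteria" string lives in each case's "extra" dictionary.
        return self.evaluator.evaluate(response["output"],
                                       case["extra"]["criteria"])

    def run(self, cases):
        return [self._execute_benchmark_case(c) for c in cases]


sample_cases = [
    {"output": "Paris is the capital of France.", "extra": {"criteria": "Paris France"}},
    {"output": "I do not know.", "extra": {"criteria": "Paris France"}},
]
runner = CriteriaBenchmarkRunner(PrecomputedOutputAgent(),
                                 KeywordCriteriaEvaluator())
print(runner.run(sample_cases))  # [True, False]
```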
Matt Whalley, Machine Learning Engineer
Please file a GitHub issue describing the desired behaviour and what actually happened.