
Benchmark Support for A/B Tests #900

Open
hildebrandmw wants to merge 3 commits into main from mhildebr/checking

Conversation

@hildebrandmw
Contributor

This adds support for native A/B testing in diskann-benchmark-runner with an example implementation added to diskann-benchmark-simd. The benefits of having this infrastructure at the Rust level are:

  • Code actually performing the before and after checks can run on the concrete types used in the Rust code. They do not need to dig through layers of JSON (and the corresponding maintenance headache associated with changing data structures). The compiler checks that the logic is at least structurally sound.
  • Shared infrastructure for running comparison tests locally during development and for CI.
  • Stricter checks on the correctness and well-formedness of data that all users can take advantage of.

Concept

The idea is to use one input.json file to generate two output JSON files, before.json and after.json, for different builds or configurations of the library. The input.json is then accompanied by a tolerances.json, which contains runtime thresholds for values of interest to help accommodate runtime variability. A regression check takes all such files and performs the following steps:

  1. Correctly parse tolerances.json and input.json.
  2. Unambiguously match entries in tolerances.json to entries in input.json. I went with matching semantics rather than requiring a one-to-one correspondence to make it easier to have a single tolerance entry work as a blanket entry for multiple benchmark runs.
  3. Take the resolved tolerance/input pairs and match them with a regression-checkable benchmark.
  4. Once everything has been correctly set up, run each regression check, which:
    • Deserializes entries in before/after.json into the benchmark's output type, or errors gracefully if this cannot be done due to an incorrect environment.
    • Runs the before/after check.
    • Aggregates and displays errors/failures/successes with an exit code that indicates success or failure.
  5. An invocation of the checking functionality returns a successful exit code only if all of the above steps succeed without error and all run checks pass.

The matching semantics work like this. Each tolerance entry looks like the following:

{
  "input": {
    "type": "type-tag",
    "content": {}
  },
  "tolerance": {
    "type": "tolerance-type-tag",
    "content": "defined-per-tolerance"
  }
}

The content field of "input" need not be fully deserializable to its corresponding input type. Instead, we use the raw JSON of the entry and match it as a "subset" against the raw JSON of input.json using the following rules:

  • Nulls are always equal.
  • Primitive types are equal if their values compare equal.
  • An array x is a subset of an array y if x.len() <= y.len() and each entry i in x is a subset of its corresponding entry in y (i.e., we match prefixes).
  • An object x is a subset of an object y if each key in x is a key in y and each value associated with a key in x is a subset of the value of the same entry in y.
  • Heterogeneous matches are not supported. A bool cannot be a subset of an integer. This breaks the match instantly.

This means that an empty "content" field will match any struct and thus act as a blanket entry for every input with the same type tag. Or, the "content" field can be refined to be more specific as needed.
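The subset rules above can be sketched as follows. This is a minimal illustration using a toy JSON value type; the actual implementation in internal/regression.rs presumably operates on raw deserialized JSON values and may differ in detail:

```rust
// Toy JSON representation for illustration only.
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

// Returns true if `x` matches `y` as a "subset" under the rules above.
fn is_subset(x: &Json, y: &Json) -> bool {
    match (x, y) {
        // Nulls are always equal; primitives compare by value.
        (Json::Null, Json::Null) => true,
        (Json::Bool(a), Json::Bool(b)) => a == b,
        (Json::Num(a), Json::Num(b)) => a == b,
        (Json::Str(a), Json::Str(b)) => a == b,
        // Arrays match as prefixes.
        (Json::Arr(a), Json::Arr(b)) => {
            a.len() <= b.len() && a.iter().zip(b).all(|(xi, yi)| is_subset(xi, yi))
        }
        // Every key in `x` must exist in `y` with a matching value.
        (Json::Obj(a), Json::Obj(b)) => a
            .iter()
            .all(|(k, v)| b.iter().any(|(bk, bv)| bk == k && is_subset(v, bv))),
        // Heterogeneous matches break instantly.
        _ => false,
    }
}
```

Note that the empty object is trivially a subset of any object, which is exactly what makes blanket tolerance entries work.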

Since a single tolerance entry may match multiple inputs, we ensure that the matching is unambiguous using the following rules:

  • Each tolerance entry must be matched with at least one entry from "input.json".
  • Every entry in "input.json" must have exactly one match from "tolerances.json".

If there is any ambiguity, the app stops with an error.
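In pseudocode terms, the ambiguity check boils down to something like this (a hypothetical sketch for exposition only; the real implementation reports much richer diagnostics):

```rust
/// Hypothetical sketch: `matches[t][i]` is true when tolerance entry `t`
/// matches input entry `i` under the subset rules.
fn validate_matching(matches: &[Vec<bool>], num_inputs: usize) -> Result<(), String> {
    // Each tolerance entry must match at least one input.
    for (t, row) in matches.iter().enumerate() {
        if !row.iter().any(|&m| m) {
            return Err(format!("tolerance entry {t} matches no input"));
        }
    }
    // Each input must be matched by exactly one tolerance entry.
    for i in 0..num_inputs {
        let count = matches.iter().filter(|row| row[i]).count();
        if count != 1 {
            return Err(format!("input entry {i} has {count} matching tolerances"));
        }
    }
    Ok(())
}
```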

CLI Changes

This adds a new check subcommand to the runner CLI with the following options:

  • check skeleton: Print a skeleton tolerance JSON file.
  • check tolerances [NAME]: List tolerance kinds, or describe one by name. This is similar to inputs [NAME].
  • check verify --tolerances <FILE> --input-file <FILE>: Validate a tolerance file against an input file. This runs up to step 3 of the checklist above and serves as a pre-flight check before any CI jobs are run. Setup errors that can be caught early will be, saving CI time.
  • check run --tolerances <FILE> --input-file <FILE> --before <FILE> --after <FILE> [--output-file <FILE>]: Run regression checks.

Benchmark Registration Changes

Regression checks are opt-in. Benchmarks that wish to opt in implement benchmark::Regression and its single check method. All logic for the before and after comparison lives in the check method. Such benchmarks also need to use registry::Benchmarks::register_regression to be correctly tracked as regression-compatible. No independent registration of the Tolerance associated type is needed.

That is it.

Note that check should not print anything out to stdout and instead communicate success/failure solely through its return type to avoid spamming the output.
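To make the contract concrete, here is a toy version of what an opt-in might look like. This is purely illustrative: the real trait lives in benchmark.rs and its exact signature differs, so every name and type below is a guess for exposition, not the actual API.

```rust
// Toy stand-ins for illustration; not the real diskann-benchmark-runner types.
enum PassFail {
    Pass,
    Fail(String),
}

trait Regression {
    type Output;
    type Tolerance;

    /// All before/after comparison logic lives here. Must not print to
    /// stdout; success/failure is communicated solely via the return value.
    fn check(tol: &Self::Tolerance, before: &Self::Output, after: &Self::Output) -> PassFail;
}

struct LatencyBench;

impl Regression for LatencyBench {
    type Output = f64; // e.g. minimum latency in nanoseconds
    type Tolerance = f64; // max allowed relative increase, e.g. 0.05 = 5 %

    fn check(tol: &f64, before: &f64, after: &f64) -> PassFail {
        if *before <= 0.0 {
            return PassFail::Fail("expected \"before\" to be greater than zero".into());
        }
        let change = (after - before) / before;
        if change <= *tol {
            PassFail::Pass
        } else {
            PassFail::Fail(format!("latency regressed by {:.1} %", change * 100.0))
        }
    }
}
```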

Example

You can see this in action in diskann-benchmark-simd. Run

# Generate the "before" file
cargo run --package diskann-benchmark-simd --release -- \
   run \
   --input-file ./diskann-benchmark-simd/examples/simd-scalar.json \
   --output-file before.json

# Generate the "after" file
cargo run --package diskann-benchmark-simd --release -- \
   run \
   --input-file ./diskann-benchmark-simd/examples/simd-scalar.json \
   --output-file after.json

# Run the before/after check
cargo run --package diskann-benchmark-simd --release -- \
  check run \
  --tolerances ./diskann-benchmark-simd/examples/tolerance.json \
  --input-file ./diskann-benchmark-simd/examples/simd-scalar.json \
  --before before.json \
  --after after.json

Depending on the noise in your system, you will see something like the following (note that the ci profile should be used for more reliable measurements):

Check 1 of 4 ("simd-op-f32xf32-scalar") FAILED:
     Distance,   Dim,   Min Before (ns),   Min After (ns),   Change (%),                                      Remark
====================================================================================================================
   squared_l2,   100,             0.000,            0.000,      invalid,   expected "before" to be greater than zero
   squared_l2,   128,             0.000,            0.000,      invalid,   expected "before" to be greater than zero
   squared_l2,   160,             0.000,            0.000,      invalid,   expected "before" to be greater than zero
   squared_l2,   384,            40.000,           40.000,      0.000 %,
   squared_l2,   768,            80.000,           80.000,      0.000 %,
inner_product,   100,             0.000,            0.000,      invalid,   expected "before" to be greater than zero
inner_product,   128,             0.000,            0.000,      invalid,   expected "before" to be greater than zero
inner_product,   160,             0.000,            0.000,      invalid,   expected "before" to be greater than zero
inner_product,   384,            40.000,            0.000,   -100.000 %,
inner_product,   768,           120.000,           40.000,    -66.667 %,
       cosine,   100,           120.000,           40.000,    -66.667 %,
       cosine,   128,            80.000,            0.000,   -100.000 %,
       cosine,   160,            40.000,           40.000,      0.000 %,
       cosine,   384,           120.000,           80.000,    -33.333 %,
       cosine,   768,           280.000,          160.000,    -42.857 %,

Here, this particular implementation has decided that a regression cannot be meaningfully detected if an execution time was rounded to zero.

Suggested Reviewing Order

diskann-benchmark-runner

  • benchmark.rs: This contains the new Regression trait and the internal type-erasure machinery that is used inside the registry to mark a benchmark as regression-capable.
  • registry.rs: The new register_regression API as well as methods for retrieving and interrogating regression-capable benchmarks.
  • internal/regression.rs: This is the meat and potatoes of this PR. This contains all the logic to perform the outlined steps and has pretty good module-level documentation that outlines the approach taken.
  • app.rs: The new Check subcommand and routing; this is pretty straightforward save for the changes to tests (see below).
  • utils/num.rs: I added two new opinionated utilities to help with writing checks:
    • relative_change: Compute the relative change between two values, handling corner cases.
    • NonNegativeFinite: Useful for serde-compatible assertions that tolerance entries have the obvious properties the name implies.
  • jobs.rs: The changes here mainly make it easier to interact with the JSON patterns used by the input files and to handle those files more uniformly.
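For intuition, a relative_change-style helper with the zero-handling visible in the example output might look like this (a sketch only; the actual utility in utils/num.rs may differ):

```rust
// Sketch only: relative change of `after` vs. `before`, refusing to produce
// a ratio when `before` is zero, negative, or non-finite (matching the
// "expected \"before\" to be greater than zero" remarks in the example output).
fn relative_change(before: f64, after: f64) -> Option<f64> {
    if !before.is_finite() || !after.is_finite() || before <= 0.0 {
        None
    } else {
        Some((after - before) / before)
    }
}
```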

Changes to the testing infrastructure:

  • The simple test benchmarks/inputs have been factored into their own submodules and given regression checks.
  • The UX tests in app.rs have been upgraded to handle tolerance checks as well. This involves having more magically escaped patterns for input/output files and supporting multi-line stdin.txt files to allow regression tests to set up their respective environment.

Much of the logic in internal/regression.rs is tested via the UX tests because (1) setting up a proper environment for this functionality is challenging and (2) the UX tests provide much better visual information on the expected behavior. A lot of new UX tests have been added.

diskann-benchmark-simd:
The changes here are largely meant to work as an example of adding regression tests. Since diskann-benchmark-simd is not a production critical crate, feel free to just skim this or ignore entirely.

Future Ideas

There are parts of this that are not perfect:

  • Error diagnostics when matching fails could be better.
  • This doesn't directly support situations where you change runtime values in the input file to generate the before and after comparison. It will work if the changes are sufficiently minimal to not influence benchmark matching. I decided not to include this in this PR to keep the API changes small (the internal changes in this PR are already huge).
  • This doesn't natively support long-term regression monitoring where we have lots of before files and build a statistical picture of trends. That opens a whole can of worms like schema stability, machine stability (if comparing run times) etc. that are way beyond the scope of what's needed to get basic regression tests running.
  • A --verbose flag could be added to print out all diagnostics rather than using the opinionated triage of errors, then failures, then successes.

Disclaimer: This PR (though not the PR description) was written with the help of AI but has been heavily edited and is something that I wouldn't be annoyed at reviewing outside of it being +3500 lines (I am sorry).

Contributor

Copilot AI left a comment


Pull request overview

This PR adds first-class A/B regression checking to diskann-benchmark-runner by introducing a regression-capable benchmark API (tolerance inputs + before/after comparison) and a new check CLI surface, with an example adoption in diskann-benchmark-simd.

Changes:

  • Introduce benchmark::Regression + registry plumbing to register and discover regression-capable benchmarks and their tolerance input types.
  • Add tolerance-file parsing, subset-based matching, and check verify / check run execution + reporting pipeline.
  • Extend the UX test harness and add many new golden tests covering success/failure/error paths; add SIMD example tolerances + regression implementation.

Reviewed changes

Copilot reviewed 103 out of 143 changed files in this pull request and generated 6 comments.

File Description
diskann-benchmark-simd/src/lib.rs Adds SIMD regression tolerance type and Regression implementation for kernels; updates run result schema.
diskann-benchmark-simd/src/bin.rs Adds integration coverage for check verify on the SIMD example.
diskann-benchmark-simd/examples/tolerance.json Example tolerance file for SIMD regression checks.
diskann-benchmark-runner/src/lib.rs Exposes benchmark publicly and adds internal module for shared helpers.
diskann-benchmark-runner/src/benchmark.rs Adds Regression trait, PassFail, and type-erased regression support in the registry layer.
diskann-benchmark-runner/src/registry.rs Adds register_regression and tolerance discovery/mapping to regression-capable benchmarks.
diskann-benchmark-runner/src/internal/mod.rs Adds shared load_from_disk helper and internal module organization.
diskann-benchmark-runner/src/internal/regression.rs Implements tolerance parsing, subset matching, check job creation, execution, JSON output, and reporting.
diskann-benchmark-runner/src/app.rs Adds check subcommand (skeleton, tolerances, verify, run) and upgrades UX test harness for multi-step scenarios.
diskann-benchmark-runner/src/jobs.rs Refactors input loading/parsing and exposes raw job list/partial parsing for regression pipeline.
diskann-benchmark-runner/src/result.rs Adds RawResult loader for reading previously saved benchmark outputs.
diskann-benchmark-runner/src/checker.rs Tightens Checker::any tagging expectation (with clippy annotation).
diskann-benchmark-runner/src/input.rs Adds Wrapper::<T>::INSTANCE const for tolerance type-erasure usage.
diskann-benchmark-runner/src/ux.rs Adds scrub_path helper for deterministic UX test output.
diskann-benchmark-runner/src/utils/mod.rs Exports new num utilities module.
diskann-benchmark-runner/src/utils/num.rs Adds relative_change and NonNegativeFinite for regression/tolerance validation and comparisons.
diskann-benchmark-runner/src/utils/percentiles.rs Adds minimum percentile value to output structure (and marks struct #[non_exhaustive]).
diskann-benchmark-runner/src/utils/fmt.rs Adds clippy expectation annotation for panic-based bounds checks.
diskann-benchmark-runner/src/test/mod.rs Reorganizes test benchmark registration and marks regression-capable test benchmarks.
diskann-benchmark-runner/src/test/dim.rs Adds regression checks to dim benchmarks and introduces a non-regression “simple” benchmark.
diskann-benchmark-runner/src/test/typed.rs Adds regression checks to typed benchmarks and introduces tolerance input used by typed regression tests.
diskann-benchmark-runner/Cargo.toml Switches to explicit clippy lint configuration.
diskann-benchmark-runner/.clippy.toml Allows unwrap/expect/panic in tests under clippy.
diskann-benchmark-runner/tests/regression/check-skeleton-0/stdin.txt Adds UX test for check skeleton.
diskann-benchmark-runner/tests/regression/check-skeleton-0/stdout.txt Golden output for check skeleton.
diskann-benchmark-runner/tests/regression/check-skeleton-0/README.md Documents check skeleton UX test scenario.
diskann-benchmark-runner/tests/regression/check-tolerances-0/stdin.txt Adds UX test for listing tolerance kinds.
diskann-benchmark-runner/tests/regression/check-tolerances-0/stdout.txt Golden output for listing tolerance kinds.
diskann-benchmark-runner/tests/regression/check-tolerances-0/README.md Documents tolerance listing UX test.
diskann-benchmark-runner/tests/regression/check-tolerances-1/stdin.txt Adds UX test for describing a specific tolerance kind.
diskann-benchmark-runner/tests/regression/check-tolerances-1/stdout.txt Golden output for tolerance kind description/skeleton.
diskann-benchmark-runner/tests/regression/check-tolerances-1/README.md Documents tolerance kind description UX test.
diskann-benchmark-runner/tests/regression/check-tolerances-2/stdin.txt Adds UX test for requesting an unknown tolerance kind.
diskann-benchmark-runner/tests/regression/check-tolerances-2/stdout.txt Golden output for unknown tolerance kind.
diskann-benchmark-runner/tests/regression/check-tolerances-2/README.md Documents unknown tolerance kind behavior.
diskann-benchmark-runner/tests/regression/check-verify-0/stdin.txt Adds UX test for successful check verify.
diskann-benchmark-runner/tests/regression/check-verify-0/tolerances.json Test tolerance file for successful verification.
diskann-benchmark-runner/tests/regression/check-verify-0/input.json Test input file used for successful verification.
diskann-benchmark-runner/tests/regression/check-verify-0/README.md Documents successful verification behavior (no stdout).
diskann-benchmark-runner/tests/regression/check-verify-1/stdin.txt Adds UX test for unknown tolerance tag error.
diskann-benchmark-runner/tests/regression/check-verify-1/stdout.txt Golden output for unknown tolerance tag error.
diskann-benchmark-runner/tests/regression/check-verify-1/tolerances.json Test tolerance file with unknown tolerance tag.
diskann-benchmark-runner/tests/regression/check-verify-1/input.json Input file used by unknown tolerance tag test.
diskann-benchmark-runner/tests/regression/check-verify-1/README.md Documents unknown tolerance tag error scenario.
diskann-benchmark-runner/tests/regression/check-verify-2/stdin.txt Adds UX test for tolerance/input match but no regression benchmark dispatch.
diskann-benchmark-runner/tests/regression/check-verify-2/stdout.txt Golden output for “no matching regression benchmark” in verify.
diskann-benchmark-runner/tests/regression/check-verify-2/tolerances.json Tolerance file used by the dispatch-failure verify test.
diskann-benchmark-runner/tests/regression/check-verify-2/input.json Input file used by the dispatch-failure verify test.
diskann-benchmark-runner/tests/regression/check-verify-2/README.md Documents dispatch-failure verify scenario.
diskann-benchmark-runner/tests/regression/check-verify-3/stdin.txt Adds UX test covering ambiguous/orphaned/uncovered tolerance matching problems.
diskann-benchmark-runner/tests/regression/check-verify-3/stdout.txt Golden output for matching failure diagnostics.
diskann-benchmark-runner/tests/regression/check-verify-3/tolerances.json Tolerance file constructed to trigger matching errors.
diskann-benchmark-runner/tests/regression/check-verify-3/input.json Input file constructed to trigger matching errors.
diskann-benchmark-runner/tests/regression/check-verify-4/stdin.txt Adds UX test for incompatible tolerance/input tag pairing.
diskann-benchmark-runner/tests/regression/check-verify-4/stdout.txt Golden output for incompatible tag pairing error.
diskann-benchmark-runner/tests/regression/check-verify-4/tolerances.json Tolerance file with mismatched input tag type.
diskann-benchmark-runner/tests/regression/check-verify-4/input.json Input file used by incompatible tag pairing test.
diskann-benchmark-runner/tests/regression/check-verify-4/README.md Documents incompatible tag pairing test (currently needs alignment with actual scenario).
diskann-benchmark-runner/tests/regression/check-run-pass-0/stdin.txt Adds UX test for successful check run and checks.json generation.
diskann-benchmark-runner/tests/regression/check-run-pass-0/stdout.txt Golden output for successful regression checks.
diskann-benchmark-runner/tests/regression/check-run-pass-0/input.json Input file for passing run scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/tolerances.json Tolerance file for passing run scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/output.json Golden output.json produced during setup in passing run scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/checks.json Golden checks.json produced by check run in pass scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/README.md Documents pass scenario coverage.
diskann-benchmark-runner/tests/regression/check-run-fail-0/stdin.txt Adds UX test where regression check fails (but checks.json still written).
diskann-benchmark-runner/tests/regression/check-run-fail-0/stdout.txt Golden output for failing regression checks.
diskann-benchmark-runner/tests/regression/check-run-fail-0/input.json Input file for failing run scenario.
diskann-benchmark-runner/tests/regression/check-run-fail-0/tolerances.json Tolerance file that triggers a failure.
diskann-benchmark-runner/tests/regression/check-run-fail-0/output.json Golden output.json for failing run scenario setup.
diskann-benchmark-runner/tests/regression/check-run-fail-0/checks.json Golden checks.json for failing run scenario.
diskann-benchmark-runner/tests/regression/check-run-fail-0/README.md Documents failing regression run scenario.
diskann-benchmark-runner/tests/regression/check-run-error-0/stdin.txt Adds UX test where check execution errors are surfaced and recorded.
diskann-benchmark-runner/tests/regression/check-run-error-0/stdout.txt Golden output for erroring regression checks.
diskann-benchmark-runner/tests/regression/check-run-error-0/input.json Input file for error scenario.
diskann-benchmark-runner/tests/regression/check-run-error-0/tolerances.json Tolerance file that triggers check errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/output.json Golden output.json for error scenario setup.
diskann-benchmark-runner/tests/regression/check-run-error-0/checks.json Golden checks.json showing error entries.
diskann-benchmark-runner/tests/regression/check-run-error-0/README.md Documents error triage behavior and checks.json writing.
diskann-benchmark-runner/tests/regression/check-run-error-1/stdin.txt Adds UX test for before/after length mismatch detection.
diskann-benchmark-runner/tests/regression/check-run-error-1/stdout.txt Golden output for length mismatch error.
diskann-benchmark-runner/tests/regression/check-run-error-1/input.json Setup input file for generating output.json with 2 entries.
diskann-benchmark-runner/tests/regression/check-run-error-1/regression_input.json Regression input file with 1 job used to trigger length mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-1/tolerances.json Tolerance file for length mismatch scenario.
diskann-benchmark-runner/tests/regression/check-run-error-1/output.json Golden output.json with 2 entries for mismatch scenario.
diskann-benchmark-runner/tests/regression/check-run-error-1/README.md Documents length mismatch scenario.
diskann-benchmark-runner/tests/regression/check-run-error-2/stdin.txt Adds UX test for input drift causing “no matching regression benchmark” during run.
diskann-benchmark-runner/tests/regression/check-run-error-2/stdout.txt Golden output for drift/no-match error.
diskann-benchmark-runner/tests/regression/check-run-error-2/input.json Setup input file for generating output.json.
diskann-benchmark-runner/tests/regression/check-run-error-2/regression_input.json Drifted input file used for check run.
diskann-benchmark-runner/tests/regression/check-run-error-2/tolerances.json Tolerance file for drift scenario.
diskann-benchmark-runner/tests/regression/check-run-error-2/output.json Golden output.json from setup run.
diskann-benchmark-runner/tests/regression/check-run-error-2/README.md Documents drift scenario expectations.
diskann-benchmark-runner/tests/regression/check-run-error-3/stdin.txt Adds UX test for before/after schema drift (deserialization error) handling.
diskann-benchmark-runner/tests/regression/check-run-error-3/stdout.txt Golden output for schema drift error.
diskann-benchmark-runner/tests/regression/check-run-error-3/input.json Setup input file used to generate integer results.
diskann-benchmark-runner/tests/regression/check-run-error-3/regression_input.json Regression input expecting string results used to trigger schema mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-3/tolerances.json Tolerance file for schema drift scenario.
diskann-benchmark-runner/tests/regression/check-run-error-3/output.json Golden output.json containing integer results.
diskann-benchmark-runner/tests/regression/check-run-error-3/checks.json Golden checks.json containing structured error output.
diskann-benchmark-runner/tests/regression/check-run-error-3/README.md Documents schema drift scenario and expectations.
diskann-benchmark-runner/tests/benchmark/test-0/stdin.txt Adds baseline UX test for skeleton.
diskann-benchmark-runner/tests/benchmark/test-0/stdout.txt Golden output for skeleton.
diskann-benchmark-runner/tests/benchmark/test-0/README.md Documents skeleton UX test.
diskann-benchmark-runner/tests/benchmark/test-1/stdin.txt Adds baseline UX test for inputs.
diskann-benchmark-runner/tests/benchmark/test-1/stdout.txt Golden output for inputs list.
diskann-benchmark-runner/tests/benchmark/test-1/README.md Documents inputs list UX test.
diskann-benchmark-runner/tests/benchmark/test-2/stdin.txt Adds baseline UX test for inputs <NAME>.
diskann-benchmark-runner/tests/benchmark/test-2/stdout.txt Golden output for inputs test-input-dim.
diskann-benchmark-runner/tests/benchmark/test-2/README.md Documents inputs <NAME> behavior.
diskann-benchmark-runner/tests/benchmark/test-3/stdin.txt Adds baseline UX test for inputs test-input-types.
diskann-benchmark-runner/tests/benchmark/test-3/stdout.txt Golden output for inputs test-input-types.
diskann-benchmark-runner/tests/benchmark/test-3/README.md Documents typed input example output.
diskann-benchmark-runner/tests/benchmark/test-4/stdin.txt Adds baseline UX test for benchmarks.
diskann-benchmark-runner/tests/benchmark/test-4/stdout.txt Golden output for benchmark listing (now includes simple-bench).
diskann-benchmark-runner/tests/benchmark/test-4/README.md Documents benchmarks listing behavior.
diskann-benchmark-runner/tests/benchmark/test-success-0/stdin.txt Adds baseline “successful run generates output.json” UX test.
diskann-benchmark-runner/tests/benchmark/test-success-0/stdout.txt Golden stdout for successful run.
diskann-benchmark-runner/tests/benchmark/test-success-0/input.json Input file for successful run.
diskann-benchmark-runner/tests/benchmark/test-success-0/output.json Golden output.json for successful run.
diskann-benchmark-runner/tests/benchmark/test-success-0/README.md Documents successful run behavior.
diskann-benchmark-runner/tests/benchmark/test-success-1/stdin.txt Adds baseline --dry-run UX test.
diskann-benchmark-runner/tests/benchmark/test-success-1/stdout.txt Golden stdout for dry-run.
diskann-benchmark-runner/tests/benchmark/test-success-1/input.json Input file for dry-run.
diskann-benchmark-runner/tests/benchmark/test-success-1/README.md Documents dry-run behavior.
diskann-benchmark-runner/tests/benchmark/test-overload-0/stdin.txt Adds UX test for MatchScore overload resolution.
diskann-benchmark-runner/tests/benchmark/test-overload-0/stdout.txt Golden stdout for overload resolution.
diskann-benchmark-runner/tests/benchmark/test-overload-0/input.json Input file for overload resolution test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/output.json Golden output.json for overload resolution test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/README.md Documents overload resolution expectations.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdin.txt Adds UX test for mismatch diagnostics.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdout.txt Golden mismatch diagnostics output.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/input.json Input file that triggers mismatch diagnostics.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/README.md Documents mismatch diagnostics output.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdin.txt Adds UX test for ExactTypeBench mismatch reason paths.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdout.txt Golden output for ExactTypeBench mismatch diagnostics.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/input.json Input file for ExactTypeBench mismatch diagnostics.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/README.md Documents ExactTypeBench mismatch reasons.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdin.txt Adds UX test for input deserialization error reporting.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdout.txt Golden output for deserialization errors.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/input.json Input file containing an invalid datatype variant.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/README.md Documents deserialization error reporting.


Comment on lines +91 to +102
#[test]
fn check_verify() {
    let input_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("examples")
        .join("simd-scalar.json");
    let tolerance_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("examples")
        .join("tolerance.json");

    let stdout = run_check_test(&input_path, &tolerance_path);
    println!("stdout = {}", stdout);
}

Copilot AI Apr 3, 2026


check_verify test doesn't assert anything about the verify output (it only prints it). This can let regressions slip through while still passing and also adds noisy stdout to cargo test output. Consider asserting the expected behavior (e.g., verify exits successfully and prints nothing / prints a stable message), and remove the println! unless the test fails.

Copilot uses AI. Check for mistakes.
Comment on lines +741 to +744
impl RunResult {
    fn computations_per_latency(&self) -> usize {
        self.run.num_points.get() * self.run.loops_per_measurement.get()
    }

Copilot AI Apr 3, 2026


computations_per_latency multiplies two usize values from user-controlled input (num_points and loops_per_measurement). This can overflow silently (wrap) in release builds and skew the computed per-point timings and regression checks. Use checked_mul (and return an error / fail the check) to avoid incorrect results on large inputs.

Comment on lines +431 to +435
/// * $INPUT_FILE - Resolves to `input.json` in the same directory as the `stdin.txt` file.
/// * $OUTPUT_FILE - Resolves to `output.json` in a temporary directory.
/// * $TOLERANCES_FILE - Resolves to `tolerances.json` in the test directory.
/// * $REGRESSION_INPUT_FILE - Resolves to `regression_input.json` test directory.
/// * $CHECK_OUTPUT_FILE - Resolves to `checks.json` in a temporary directory.

Copilot AI Apr 3, 2026


The UX test docs list $INPUT_FILE, $OUTPUT_FILE, $TOLERANCES_FILE, etc., but the test harness actually recognizes $INPUT, $OUTPUT, $TOLERANCES, $REGRESSION_INPUT, and $CHECK_OUTPUT. This mismatch makes the documentation misleading; update the bullet list to match the implemented tokens (or rename the tokens for consistency).

Suggested change
/// * $INPUT_FILE - Resolves to `input.json` in the same directory as the `stdin.txt` file.
/// * $OUTPUT_FILE - Resolves to `output.json` in a temporary directory.
/// * $TOLERANCES_FILE - Resolves to `tolerances.json` in the test directory.
/// * $REGRESSION_INPUT_FILE - Resolves to `regression_input.json` test directory.
/// * $CHECK_OUTPUT_FILE - Resolves to `checks.json` in a temporary directory.
/// * $INPUT - Resolves to `input.json` in the same directory as the `stdin.txt` file.
/// * $OUTPUT - Resolves to `output.json` in a temporary directory.
/// * $TOLERANCES - Resolves to `tolerances.json` in the test directory.
/// * $REGRESSION_INPUT - Resolves to `regression_input.json` in the test directory.
/// * $CHECK_OUTPUT - Resolves to `checks.json` in a temporary directory.

Comment on lines +119 to +120
/// To reach this point, we at least the structure of the input JSON to be correct and
/// parseable. However, we have not yet mapped the raw JSON of any of the registered inputs.

Copilot AI Apr 3, 2026


Grammar in this doc comment is broken and makes the intent unclear. It currently reads "we at least the structure..."; consider changing it to something like "we require at least the structure of the input JSON to be correct and parseable".

Suggested change
/// To reach this point, we at least the structure of the input JSON to be correct and
/// parseable. However, we have not yet mapped the raw JSON of any of the registered inputs.
/// To reach this point, we require at least the structure of the input JSON to be correct
/// and parseable. However, we have not yet mapped the raw JSON of any of the registered
/// inputs.

Comment on lines +1 to +2
Here, we test that if a valid "tolerance" is linked with a valid "input" in "tolerance.json"
but there is not registered benchmark where these two tags are linked, we get a reasonable error.

Copilot AI Apr 3, 2026


This README describes the scenario as "no registered benchmark where these two tags are linked" and refers to "tolerance.json", but the test data here is about an incompatible input/tolerance tag pair and the file is tolerances.json. Updating this description will make the intent of the regression test clearer and prevent confusion when debugging failures.
