
Benchmark Support for A/B Tests #900

Open
hildebrandmw wants to merge 3 commits into main from mhildebr/checking

Conversation

@hildebrandmw
Contributor

This adds support for native A/B testing in diskann-benchmark-runner with an example implementation added to diskann-benchmark-simd. The benefits of having this infrastructure at the Rust level are:

  • Code actually performing the before and after checks can run on the concrete types used in the Rust code. They do not need to dig through layers of JSON (and the corresponding maintenance headache associated with changing data structures). The compiler checks that the logic is at least structurally sound.
  • Shared infrastructure for running comparison tests locally during development and for CI.
  • Stricter checks on the correctness and well-formedness of data that all users can take advantage of.

Concept

The idea is to use one input.json file to generate two output JSON files, before.json and after.json, for different builds or configurations of the library. The input.json is then accompanied by a tolerances.json, which contains runtime thresholds for values of interest to help accommodate runtime variability. A regression check takes all such files and performs the following steps:

  1. Correctly parse tolerances.json and input.json.
  2. Unambiguously match entries in tolerances.json to entries in input.json. I went with matching semantics rather than requiring a one-to-one correspondence to make it easier to have a single tolerance entry work as a blanket entry for multiple benchmark runs.
  3. Take the resolved tolerance/input pairs and match them with a regression-checkable benchmark.
  4. Once everything has been correctly set up, run each regression check, which:
    • Deserializes entries in before/after.json into the benchmark's output type, or errors gracefully if this cannot be done due to an incorrect environment.
    • Runs the before/after check.
    • Aggregates and displays errors/failures/successes with an exit code that indicates success or failure.
  5. An invocation of the checking functionality returns a successful exit code only if all of the above steps succeed without error and all run checks pass.

The matching semantics work like this. Each tolerance entry looks like the following:

{
  "input": {
    "type": "type-tag",
    "content": {}
  },
  "tolerance": {
    "type": "tolerance-type-tag",
    "content": "defined-per-tolerance"
  }
}

The content field of "input" need not be fully deserializable to its corresponding input type. Instead, we use the raw JSON of the entry and match it as a "subset" against the raw JSON of input.json using the following rules:

  • Nulls are always equal.
  • Primitive types are equal if their values compare equal.
  • An array x is a subset of an array y if x.len() <= y.len() and each entry i in x is a subset of its corresponding entry in y (i.e., we match prefixes).
  • An object x is a subset of an object y if each key in x is a key in y and each value associated with a key in x is a subset of the value of the same entry in y.
  • Heterogeneous matches are not supported. A bool cannot be a subset of an integer. This breaks the match instantly.

This means that an empty "content" field will match any struct and thus act as a blanket entry for every input with the same type tag. Or, the "content" field can be refined to be more specific as needed.
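The subset rules above can be sketched as follows. This is a minimal illustration using a toy JSON value type; the actual implementation in internal/regression.rs presumably operates on raw deserialized JSON values and may differ in detail:

```rust
// Toy JSON representation for illustration only.
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

// Returns true if `x` matches `y` as a "subset" under the rules above.
fn is_subset(x: &Json, y: &Json) -> bool {
    match (x, y) {
        // Nulls are always equal; primitives compare by value.
        (Json::Null, Json::Null) => true,
        (Json::Bool(a), Json::Bool(b)) => a == b,
        (Json::Num(a), Json::Num(b)) => a == b,
        (Json::Str(a), Json::Str(b)) => a == b,
        // Arrays match as prefixes.
        (Json::Arr(a), Json::Arr(b)) => {
            a.len() <= b.len() && a.iter().zip(b).all(|(xi, yi)| is_subset(xi, yi))
        }
        // Every key in `x` must exist in `y` with a matching value.
        (Json::Obj(a), Json::Obj(b)) => a
            .iter()
            .all(|(k, v)| b.iter().any(|(bk, bv)| bk == k && is_subset(v, bv))),
        // Heterogeneous matches break instantly.
        _ => false,
    }
}
```

Note that the empty object is trivially a subset of any object, which is exactly what makes blanket tolerance entries work.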

Since a single tolerance entry may match multiple inputs, we ensure that the matching is unambiguous using the following rules:

  • Each tolerance entry must be matched with at least one entry from "input.json".
  • Every entry in "input.json" must have exactly one match from "tolerances.json".

If there is any ambiguity, the app stops with an error.
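In pseudocode terms, the ambiguity check boils down to something like this (a hypothetical sketch for exposition only; the real implementation reports much richer diagnostics):

```rust
/// Hypothetical sketch: `matches[t][i]` is true when tolerance entry `t`
/// matches input entry `i` under the subset rules.
fn validate_matching(matches: &[Vec<bool>], num_inputs: usize) -> Result<(), String> {
    // Each tolerance entry must match at least one input.
    for (t, row) in matches.iter().enumerate() {
        if !row.iter().any(|&m| m) {
            return Err(format!("tolerance entry {t} matches no input"));
        }
    }
    // Each input must be matched by exactly one tolerance entry.
    for i in 0..num_inputs {
        let count = matches.iter().filter(|row| row[i]).count();
        if count != 1 {
            return Err(format!("input entry {i} has {count} matching tolerances"));
        }
    }
    Ok(())
}
```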

CLI Changes

This adds a new check subcommand to the runner CLI with the following options:

  • check skeleton: Print a skeleton tolerance JSON file.
  • check tolerances [NAME]: List tolerance kinds, or describe one by name. This is similar to inputs [NAME].
  • check verify --tolerances <FILE> --input-file <FILE>: Validate a tolerance file against an input file. This runs up to step 3 of the checklist above and serves as a pre-flight check before any CI jobs are run. Setup errors that can be caught early will be, saving CI time.
  • check run --tolerances <FILE> --input-file <FILE> --before <FILE> --after <FILE> [--output-file <FILE>]: Run regression checks.

Benchmark Registration Changes

Regression checks are opt-in. Benchmarks that wish to opt in implement benchmark::Regression and its single check method. All logic for the before and after comparison lives in the check method. Such benchmarks also need to use registry::Benchmarks::register_regression to be correctly tracked as regression-compatible. No independent registration of the Tolerance associated type is needed.

That is it.

Note that check should not print anything out to stdout and instead communicate success/failure solely through its return type to avoid spamming the output.
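To make the contract concrete, here is a toy version of what an opt-in might look like. This is purely illustrative: the real trait lives in benchmark.rs and its exact signature differs, so every name and type below is a guess for exposition, not the actual API.

```rust
// Toy stand-ins for illustration; not the real diskann-benchmark-runner types.
enum PassFail {
    Pass,
    Fail(String),
}

trait Regression {
    type Output;
    type Tolerance;

    /// All before/after comparison logic lives here. Must not print to
    /// stdout; success/failure is communicated solely via the return value.
    fn check(tol: &Self::Tolerance, before: &Self::Output, after: &Self::Output) -> PassFail;
}

struct LatencyBench;

impl Regression for LatencyBench {
    type Output = f64; // e.g. minimum latency in nanoseconds
    type Tolerance = f64; // max allowed relative increase, e.g. 0.05 = 5 %

    fn check(tol: &f64, before: &f64, after: &f64) -> PassFail {
        if *before <= 0.0 {
            return PassFail::Fail("expected \"before\" to be greater than zero".into());
        }
        let change = (after - before) / before;
        if change <= *tol {
            PassFail::Pass
        } else {
            PassFail::Fail(format!("latency regressed by {:.1} %", change * 100.0))
        }
    }
}
```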

Example

You can see this in action in diskann-benchmark-simd. Run

# Generate the "before" file
cargo run --package diskann-benchmark-simd --release -- \
   run \
   --input-file ./diskann-benchmark-simd/examples/simd-scalar.json \
   --output-file before.json

# Generate the "after" file
cargo run --package diskann-benchmark-simd --release -- \
   run \
   --input-file ./diskann-benchmark-simd/examples/simd-scalar.json \
   --output-file after.json

# Run the before/after check
cargo run --package diskann-benchmark-simd --release -- \
  check run \
  --tolerances ./diskann-benchmark-simd/examples/tolerance.json \
  --input-file ./diskann-benchmark-simd/examples/simd-scalar.json \
  --before before.json \
  --after after.json

Depending on the noise in your system, you will see something like the following (note that the ci profile should be used for more reliable measurements):

Check 1 of 4 ("simd-op-f32xf32-scalar") FAILED:
     Distance,   Dim,   Min Before (ns),   Min After (ns),   Change (%),                                      Remark
====================================================================================================================
   squared_l2,   100,             0.000,            0.000,      invalid,   expected "before" to be greater than zero
   squared_l2,   128,             0.000,            0.000,      invalid,   expected "before" to be greater than zero
   squared_l2,   160,             0.000,            0.000,      invalid,   expected "before" to be greater than zero
   squared_l2,   384,            40.000,           40.000,      0.000 %,
   squared_l2,   768,            80.000,           80.000,      0.000 %,
inner_product,   100,             0.000,            0.000,      invalid,   expected "before" to be greater than zero
inner_product,   128,             0.000,            0.000,      invalid,   expected "before" to be greater than zero
inner_product,   160,             0.000,            0.000,      invalid,   expected "before" to be greater than zero
inner_product,   384,            40.000,            0.000,   -100.000 %,
inner_product,   768,           120.000,           40.000,    -66.667 %,
       cosine,   100,           120.000,           40.000,    -66.667 %,
       cosine,   128,            80.000,            0.000,   -100.000 %,
       cosine,   160,            40.000,           40.000,      0.000 %,
       cosine,   384,           120.000,           80.000,    -33.333 %,
       cosine,   768,           280.000,          160.000,    -42.857 %,

Here, this particular implementation has decided that a regression cannot be meaningfully detected if an execution time was rounded to zero.

Suggested Reviewing Order

diskann-benchmark-runner

  • benchmark.rs: This contains the new Regression trait and the internal type-erasure machinery that is used inside the registry to mark a benchmark as regression-capable.
  • registry.rs: The new register_regression API as well as methods for retrieving and interrogating regression-capable benchmarks.
  • internal/regression.rs: This is the meat and potatoes of this PR. This contains all the logic to perform the outlined steps and has pretty good module-level documentation that outlines the approach taken.
  • app.rs: The new Check subcommand and routing; this is pretty straightforward save for the changes to tests (see below).
  • utils/num.rs: I added two new opinionated utilities to help with writing checks:
    • relative_change: Compute the relative change between two values, handling corner cases.
    • NonNegativeFinite: Useful for serde-compatible assertions that tolerance entries have the obvious properties the name implies.
  • jobs.rs: The changes here mainly make it easier to interact with the JSON patterns used by the input files and to handle those files more uniformly.
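For intuition, a relative_change-style helper with the zero-handling visible in the example output might look like this (a sketch only; the actual utility in utils/num.rs may differ):

```rust
// Sketch only: relative change of `after` vs. `before`, refusing to produce
// a ratio when `before` is zero, negative, or non-finite (matching the
// "expected \"before\" to be greater than zero" remarks in the example output).
fn relative_change(before: f64, after: f64) -> Option<f64> {
    if !before.is_finite() || !after.is_finite() || before <= 0.0 {
        None
    } else {
        Some((after - before) / before)
    }
}
```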

Changes to the testing infrastructure:

  • The simple test benchmarks/inputs have been factored into their own submodules and given regression checks.
  • The UX tests in app.rs have been upgraded to handle tolerance checks as well. This involves having more magically escaped patterns for input/output files and supporting multi-line stdin.txt files to allow regression tests to set up their respective environment.

Much of the logic in internal/regression.rs is tested via the UX tests because (1) setting up a proper environment for this functionality is challenging and (2) the UX tests provide much better visual information on the expected behavior. A lot of new UX tests have been added.

diskann-benchmark-simd:
The changes here are largely meant to work as an example of adding regression tests. Since diskann-benchmark-simd is not a production critical crate, feel free to just skim this or ignore entirely.

Future Ideas

There are parts of this that are not perfect:

  • Error diagnostics when matching fails could be better.
  • This doesn't directly support situations where you change runtime values in the input file to generate the before and after comparison. It will work if the changes are sufficiently minimal to not influence benchmark matching. I decided not to include this in this PR to keep the API changes small (the internal changes in this PR are already huge).
  • This doesn't natively support long-term regression monitoring where we have lots of before files and build a statistical picture of trends. That opens a whole can of worms like schema stability, machine stability (if comparing run times) etc. that are way beyond the scope of what's needed to get basic regression tests running.
  • A --verbose flag could be added to print out all diagnostics rather than using the opinionated triage of errors, then failures, then successes.

Disclaimer: This PR (though not the PR description) was written with the help of AI but has been heavily edited and is something that I wouldn't be annoyed at reviewing outside of it being +3500 lines (I am sorry).

Contributor

Copilot AI left a comment


Pull request overview

This PR adds first-class A/B regression checking to diskann-benchmark-runner by introducing a regression-capable benchmark API (tolerance inputs + before/after comparison) and a new check CLI surface, with an example adoption in diskann-benchmark-simd.

Changes:

  • Introduce benchmark::Regression + registry plumbing to register and discover regression-capable benchmarks and their tolerance input types.
  • Add tolerance-file parsing, subset-based matching, and check verify / check run execution + reporting pipeline.
  • Extend the UX test harness and add many new golden tests covering success/failure/error paths; add SIMD example tolerances + regression implementation.

Reviewed changes

Copilot reviewed 103 out of 143 changed files in this pull request and generated 6 comments.

File Description
diskann-benchmark-simd/src/lib.rs Adds SIMD regression tolerance type and Regression implementation for kernels; updates run result schema.
diskann-benchmark-simd/src/bin.rs Adds integration coverage for check verify on the SIMD example.
diskann-benchmark-simd/examples/tolerance.json Example tolerance file for SIMD regression checks.
diskann-benchmark-runner/src/lib.rs Exposes benchmark publicly and adds internal module for shared helpers.
diskann-benchmark-runner/src/benchmark.rs Adds Regression trait, PassFail, and type-erased regression support in the registry layer.
diskann-benchmark-runner/src/registry.rs Adds register_regression and tolerance discovery/mapping to regression-capable benchmarks.
diskann-benchmark-runner/src/internal/mod.rs Adds shared load_from_disk helper and internal module organization.
diskann-benchmark-runner/src/internal/regression.rs Implements tolerance parsing, subset matching, check job creation, execution, JSON output, and reporting.
diskann-benchmark-runner/src/app.rs Adds check subcommand (skeleton, tolerances, verify, run) and upgrades UX test harness for multi-step scenarios.
diskann-benchmark-runner/src/jobs.rs Refactors input loading/parsing and exposes raw job list/partial parsing for regression pipeline.
diskann-benchmark-runner/src/result.rs Adds RawResult loader for reading previously saved benchmark outputs.
diskann-benchmark-runner/src/checker.rs Tightens Checker::any tagging expectation (with clippy annotation).
diskann-benchmark-runner/src/input.rs Adds Wrapper::<T>::INSTANCE const for tolerance type-erasure usage.
diskann-benchmark-runner/src/ux.rs Adds scrub_path helper for deterministic UX test output.
diskann-benchmark-runner/src/utils/mod.rs Exports new num utilities module.
diskann-benchmark-runner/src/utils/num.rs Adds relative_change and NonNegativeFinite for regression/tolerance validation and comparisons.
diskann-benchmark-runner/src/utils/percentiles.rs Adds minimum percentile value to output structure (and marks struct #[non_exhaustive]).
diskann-benchmark-runner/src/utils/fmt.rs Adds clippy expectation annotation for panic-based bounds checks.
diskann-benchmark-runner/src/test/mod.rs Reorganizes test benchmark registration and marks regression-capable test benchmarks.
diskann-benchmark-runner/src/test/dim.rs Adds regression checks to dim benchmarks and introduces a non-regression “simple” benchmark.
diskann-benchmark-runner/src/test/typed.rs Adds regression checks to typed benchmarks and introduces tolerance input used by typed regression tests.
diskann-benchmark-runner/Cargo.toml Switches to explicit clippy lint configuration.
diskann-benchmark-runner/.clippy.toml Allows unwrap/expect/panic in tests under clippy.
diskann-benchmark-runner/tests/regression/check-skeleton-0/stdin.txt Adds UX test for check skeleton.
diskann-benchmark-runner/tests/regression/check-skeleton-0/stdout.txt Golden output for check skeleton.
diskann-benchmark-runner/tests/regression/check-skeleton-0/README.md Documents check skeleton UX test scenario.
diskann-benchmark-runner/tests/regression/check-tolerances-0/stdin.txt Adds UX test for listing tolerance kinds.
diskann-benchmark-runner/tests/regression/check-tolerances-0/stdout.txt Golden output for listing tolerance kinds.
diskann-benchmark-runner/tests/regression/check-tolerances-0/README.md Documents tolerance listing UX test.
diskann-benchmark-runner/tests/regression/check-tolerances-1/stdin.txt Adds UX test for describing a specific tolerance kind.
diskann-benchmark-runner/tests/regression/check-tolerances-1/stdout.txt Golden output for tolerance kind description/skeleton.
diskann-benchmark-runner/tests/regression/check-tolerances-1/README.md Documents tolerance kind description UX test.
diskann-benchmark-runner/tests/regression/check-tolerances-2/stdin.txt Adds UX test for requesting an unknown tolerance kind.
diskann-benchmark-runner/tests/regression/check-tolerances-2/stdout.txt Golden output for unknown tolerance kind.
diskann-benchmark-runner/tests/regression/check-tolerances-2/README.md Documents unknown tolerance kind behavior.
diskann-benchmark-runner/tests/regression/check-verify-0/stdin.txt Adds UX test for successful check verify.
diskann-benchmark-runner/tests/regression/check-verify-0/tolerances.json Test tolerance file for successful verification.
diskann-benchmark-runner/tests/regression/check-verify-0/input.json Test input file used for successful verification.
diskann-benchmark-runner/tests/regression/check-verify-0/README.md Documents successful verification behavior (no stdout).
diskann-benchmark-runner/tests/regression/check-verify-1/stdin.txt Adds UX test for unknown tolerance tag error.
diskann-benchmark-runner/tests/regression/check-verify-1/stdout.txt Golden output for unknown tolerance tag error.
diskann-benchmark-runner/tests/regression/check-verify-1/tolerances.json Test tolerance file with unknown tolerance tag.
diskann-benchmark-runner/tests/regression/check-verify-1/input.json Input file used by unknown tolerance tag test.
diskann-benchmark-runner/tests/regression/check-verify-1/README.md Documents unknown tolerance tag error scenario.
diskann-benchmark-runner/tests/regression/check-verify-2/stdin.txt Adds UX test for tolerance/input match but no regression benchmark dispatch.
diskann-benchmark-runner/tests/regression/check-verify-2/stdout.txt Golden output for “no matching regression benchmark” in verify.
diskann-benchmark-runner/tests/regression/check-verify-2/tolerances.json Tolerance file used by the dispatch-failure verify test.
diskann-benchmark-runner/tests/regression/check-verify-2/input.json Input file used by the dispatch-failure verify test.
diskann-benchmark-runner/tests/regression/check-verify-2/README.md Documents dispatch-failure verify scenario.
diskann-benchmark-runner/tests/regression/check-verify-3/stdin.txt Adds UX test covering ambiguous/orphaned/uncovered tolerance matching problems.
diskann-benchmark-runner/tests/regression/check-verify-3/stdout.txt Golden output for matching failure diagnostics.
diskann-benchmark-runner/tests/regression/check-verify-3/tolerances.json Tolerance file constructed to trigger matching errors.
diskann-benchmark-runner/tests/regression/check-verify-3/input.json Input file constructed to trigger matching errors.
diskann-benchmark-runner/tests/regression/check-verify-4/stdin.txt Adds UX test for incompatible tolerance/input tag pairing.
diskann-benchmark-runner/tests/regression/check-verify-4/stdout.txt Golden output for incompatible tag pairing error.
diskann-benchmark-runner/tests/regression/check-verify-4/tolerances.json Tolerance file with mismatched input tag type.
diskann-benchmark-runner/tests/regression/check-verify-4/input.json Input file used by incompatible tag pairing test.
diskann-benchmark-runner/tests/regression/check-verify-4/README.md Documents incompatible tag pairing test (currently needs alignment with actual scenario).
diskann-benchmark-runner/tests/regression/check-run-pass-0/stdin.txt Adds UX test for successful check run and checks.json generation.
diskann-benchmark-runner/tests/regression/check-run-pass-0/stdout.txt Golden output for successful regression checks.
diskann-benchmark-runner/tests/regression/check-run-pass-0/input.json Input file for passing run scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/tolerances.json Tolerance file for passing run scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/output.json Golden output.json produced during setup in passing run scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/checks.json Golden checks.json produced by check run in pass scenario.
diskann-benchmark-runner/tests/regression/check-run-pass-0/README.md Documents pass scenario coverage.
diskann-benchmark-runner/tests/regression/check-run-fail-0/stdin.txt Adds UX test where regression check fails (but checks.json still written).
diskann-benchmark-runner/tests/regression/check-run-fail-0/stdout.txt Golden output for failing regression checks.
diskann-benchmark-runner/tests/regression/check-run-fail-0/input.json Input file for failing run scenario.
diskann-benchmark-runner/tests/regression/check-run-fail-0/tolerances.json Tolerance file that triggers a failure.
diskann-benchmark-runner/tests/regression/check-run-fail-0/output.json Golden output.json for failing run scenario setup.
diskann-benchmark-runner/tests/regression/check-run-fail-0/checks.json Golden checks.json for failing run scenario.
diskann-benchmark-runner/tests/regression/check-run-fail-0/README.md Documents failing regression run scenario.
diskann-benchmark-runner/tests/regression/check-run-error-0/stdin.txt Adds UX test where check execution errors are surfaced and recorded.
diskann-benchmark-runner/tests/regression/check-run-error-0/stdout.txt Golden output for erroring regression checks.
diskann-benchmark-runner/tests/regression/check-run-error-0/input.json Input file for error scenario.
diskann-benchmark-runner/tests/regression/check-run-error-0/tolerances.json Tolerance file that triggers check errors.
diskann-benchmark-runner/tests/regression/check-run-error-0/output.json Golden output.json for error scenario setup.
diskann-benchmark-runner/tests/regression/check-run-error-0/checks.json Golden checks.json showing error entries.
diskann-benchmark-runner/tests/regression/check-run-error-0/README.md Documents error triage behavior and checks.json writing.
diskann-benchmark-runner/tests/regression/check-run-error-1/stdin.txt Adds UX test for before/after length mismatch detection.
diskann-benchmark-runner/tests/regression/check-run-error-1/stdout.txt Golden output for length mismatch error.
diskann-benchmark-runner/tests/regression/check-run-error-1/input.json Setup input file for generating output.json with 2 entries.
diskann-benchmark-runner/tests/regression/check-run-error-1/regression_input.json Regression input file with 1 job used to trigger length mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-1/tolerances.json Tolerance file for length mismatch scenario.
diskann-benchmark-runner/tests/regression/check-run-error-1/output.json Golden output.json with 2 entries for mismatch scenario.
diskann-benchmark-runner/tests/regression/check-run-error-1/README.md Documents length mismatch scenario.
diskann-benchmark-runner/tests/regression/check-run-error-2/stdin.txt Adds UX test for input drift causing “no matching regression benchmark” during run.
diskann-benchmark-runner/tests/regression/check-run-error-2/stdout.txt Golden output for drift/no-match error.
diskann-benchmark-runner/tests/regression/check-run-error-2/input.json Setup input file for generating output.json.
diskann-benchmark-runner/tests/regression/check-run-error-2/regression_input.json Drifted input file used for check run.
diskann-benchmark-runner/tests/regression/check-run-error-2/tolerances.json Tolerance file for drift scenario.
diskann-benchmark-runner/tests/regression/check-run-error-2/output.json Golden output.json from setup run.
diskann-benchmark-runner/tests/regression/check-run-error-2/README.md Documents drift scenario expectations.
diskann-benchmark-runner/tests/regression/check-run-error-3/stdin.txt Adds UX test for before/after schema drift (deserialization error) handling.
diskann-benchmark-runner/tests/regression/check-run-error-3/stdout.txt Golden output for schema drift error.
diskann-benchmark-runner/tests/regression/check-run-error-3/input.json Setup input file used to generate integer results.
diskann-benchmark-runner/tests/regression/check-run-error-3/regression_input.json Regression input expecting string results used to trigger schema mismatch.
diskann-benchmark-runner/tests/regression/check-run-error-3/tolerances.json Tolerance file for schema drift scenario.
diskann-benchmark-runner/tests/regression/check-run-error-3/output.json Golden output.json containing integer results.
diskann-benchmark-runner/tests/regression/check-run-error-3/checks.json Golden checks.json containing structured error output.
diskann-benchmark-runner/tests/regression/check-run-error-3/README.md Documents schema drift scenario and expectations.
diskann-benchmark-runner/tests/benchmark/test-0/stdin.txt Adds baseline UX test for skeleton.
diskann-benchmark-runner/tests/benchmark/test-0/stdout.txt Golden output for skeleton.
diskann-benchmark-runner/tests/benchmark/test-0/README.md Documents skeleton UX test.
diskann-benchmark-runner/tests/benchmark/test-1/stdin.txt Adds baseline UX test for inputs.
diskann-benchmark-runner/tests/benchmark/test-1/stdout.txt Golden output for inputs list.
diskann-benchmark-runner/tests/benchmark/test-1/README.md Documents inputs list UX test.
diskann-benchmark-runner/tests/benchmark/test-2/stdin.txt Adds baseline UX test for inputs <NAME>.
diskann-benchmark-runner/tests/benchmark/test-2/stdout.txt Golden output for inputs test-input-dim.
diskann-benchmark-runner/tests/benchmark/test-2/README.md Documents inputs <NAME> behavior.
diskann-benchmark-runner/tests/benchmark/test-3/stdin.txt Adds baseline UX test for inputs test-input-types.
diskann-benchmark-runner/tests/benchmark/test-3/stdout.txt Golden output for inputs test-input-types.
diskann-benchmark-runner/tests/benchmark/test-3/README.md Documents typed input example output.
diskann-benchmark-runner/tests/benchmark/test-4/stdin.txt Adds baseline UX test for benchmarks.
diskann-benchmark-runner/tests/benchmark/test-4/stdout.txt Golden output for benchmark listing (now includes simple-bench).
diskann-benchmark-runner/tests/benchmark/test-4/README.md Documents benchmarks listing behavior.
diskann-benchmark-runner/tests/benchmark/test-success-0/stdin.txt Adds baseline “successful run generates output.json” UX test.
diskann-benchmark-runner/tests/benchmark/test-success-0/stdout.txt Golden stdout for successful run.
diskann-benchmark-runner/tests/benchmark/test-success-0/input.json Input file for successful run.
diskann-benchmark-runner/tests/benchmark/test-success-0/output.json Golden output.json for successful run.
diskann-benchmark-runner/tests/benchmark/test-success-0/README.md Documents successful run behavior.
diskann-benchmark-runner/tests/benchmark/test-success-1/stdin.txt Adds baseline --dry-run UX test.
diskann-benchmark-runner/tests/benchmark/test-success-1/stdout.txt Golden stdout for dry-run.
diskann-benchmark-runner/tests/benchmark/test-success-1/input.json Input file for dry-run.
diskann-benchmark-runner/tests/benchmark/test-success-1/README.md Documents dry-run behavior.
diskann-benchmark-runner/tests/benchmark/test-overload-0/stdin.txt Adds UX test for MatchScore overload resolution.
diskann-benchmark-runner/tests/benchmark/test-overload-0/stdout.txt Golden stdout for overload resolution.
diskann-benchmark-runner/tests/benchmark/test-overload-0/input.json Input file for overload resolution test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/output.json Golden output.json for overload resolution test.
diskann-benchmark-runner/tests/benchmark/test-overload-0/README.md Documents overload resolution expectations.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdin.txt Adds UX test for mismatch diagnostics.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/stdout.txt Golden mismatch diagnostics output.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/input.json Input file that triggers mismatch diagnostics.
diskann-benchmark-runner/tests/benchmark/test-mismatch-0/README.md Documents mismatch diagnostics output.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdin.txt Adds UX test for ExactTypeBench mismatch reason paths.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/stdout.txt Golden output for ExactTypeBench mismatch diagnostics.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/input.json Input file for ExactTypeBench mismatch diagnostics.
diskann-benchmark-runner/tests/benchmark/test-mismatch-1/README.md Documents ExactTypeBench mismatch reasons.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdin.txt Adds UX test for input deserialization error reporting.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/stdout.txt Golden output for deserialization errors.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/input.json Input file containing an invalid datatype variant.
diskann-benchmark-runner/tests/benchmark/test-deserialization-error-0/README.md Documents deserialization error reporting.


Comment on lines +91 to +102
#[test]
fn check_verify() {
    let input_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("examples")
        .join("simd-scalar.json");
    let tolerance_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("examples")
        .join("tolerance.json");

    let stdout = run_check_test(&input_path, &tolerance_path);
    println!("stdout = {}", stdout);
}

Copilot AI Apr 3, 2026


check_verify test doesn't assert anything about the verify output (it only prints it). This can let regressions slip through while still passing and also adds noisy stdout to cargo test output. Consider asserting the expected behavior (e.g., verify exits successfully and prints nothing / prints a stable message), and remove the println! unless the test fails.

Copilot uses AI. Check for mistakes.
Comment on lines +741 to +744
impl RunResult {
    fn computations_per_latency(&self) -> usize {
        self.run.num_points.get() * self.run.loops_per_measurement.get()
    }

Copilot AI Apr 3, 2026


computations_per_latency multiplies two usize values from user-controlled input (num_points and loops_per_measurement). This can overflow silently (wrap) in release builds and skew the computed per-point timings and regression checks. Use checked_mul (and return an error / fail the check) to avoid incorrect results on large inputs.

Comment on lines +431 to +435
/// * $INPUT_FILE - Resolves to `input.json` in the same directory as the `stdin.txt` file.
/// * $OUTPUT_FILE - Resolves to `output.json` in a temporary directory.
/// * $TOLERANCES_FILE - Resolves to `tolerances.json` in the test directory.
/// * $REGRESSION_INPUT_FILE - Resolves to `regression_input.json` test directory.
/// * $CHECK_OUTPUT_FILE - Resolves to `checks.json` in a temporary directory.

Copilot AI Apr 3, 2026


The UX test docs list $INPUT_FILE, $OUTPUT_FILE, $TOLERANCES_FILE, etc., but the test harness actually recognizes $INPUT, $OUTPUT, $TOLERANCES, $REGRESSION_INPUT, and $CHECK_OUTPUT. This mismatch makes the documentation misleading; update the bullet list to match the implemented tokens (or rename the tokens for consistency).

Suggested change
/// * $INPUT_FILE - Resolves to `input.json` in the same directory as the `stdin.txt` file.
/// * $OUTPUT_FILE - Resolves to `output.json` in a temporary directory.
/// * $TOLERANCES_FILE - Resolves to `tolerances.json` in the test directory.
/// * $REGRESSION_INPUT_FILE - Resolves to `regression_input.json` test directory.
/// * $CHECK_OUTPUT_FILE - Resolves to `checks.json` in a temporary directory.
/// * $INPUT - Resolves to `input.json` in the same directory as the `stdin.txt` file.
/// * $OUTPUT - Resolves to `output.json` in a temporary directory.
/// * $TOLERANCES - Resolves to `tolerances.json` in the test directory.
/// * $REGRESSION_INPUT - Resolves to `regression_input.json` in the test directory.
/// * $CHECK_OUTPUT - Resolves to `checks.json` in a temporary directory.

Comment on lines +119 to +120
/// To reach this point, we at least the structure of the input JSON to be correct and
/// parseable. However, we have not yet mapped the raw JSON of any of the registered inputs.

Copilot AI Apr 3, 2026


Grammar in this doc comment is broken and makes the intent unclear. It currently reads "we at least the structure..."; consider changing it to something like "we require at least the structure of the input JSON to be correct and parseable".

Suggested change
/// To reach this point, we at least the structure of the input JSON to be correct and
/// parseable. However, we have not yet mapped the raw JSON of any of the registered inputs.
/// To reach this point, we require at least the structure of the input JSON to be correct
/// and parseable. However, we have not yet mapped the raw JSON of any of the registered
/// inputs.

Comment on lines +1 to +2
Here, we test that if a valid "tolerance" is linked with a valid "input" in "tolerance.json"
but there is not registered benchmark where these two tags are linked, we get a reasonable error.

Copilot AI Apr 3, 2026


This README describes the scenario as "no registered benchmark where these two tags are linked" and refers to "tolerance.json", but the test data here is about an incompatible input/tolerance tag pair and the file is tolerances.json. Updating this description will make the intent of the regression test clearer and prevent confusion when debugging failures.
