feat: ground truth optimization path #122

Open
andrewklatzke wants to merge 1 commit into aklatzke/AIC-1794/optimize-method-from-ld from aklatzke/AIC-1795/optimize-method-ground-truth-path

Conversation

Contributor

@andrewklatzke andrewklatzke commented Apr 3, 2026

Requirements

  • I have added test coverage for new or changed functionality
  • I have followed the repository's pull request submission guidelines
  • I have validated my changes against all supported platform versions

Describe the solution you've provided

Implements the "ground truth" path for the SDK optimizations. The existing path optimizes through a form of "chaos testing": inputs are randomly selected, and the output is then judged solely against the acceptance statements and judges.

This path requires the user to pass additional data (ground_truth_responses), and each call's result is further compared against the corresponding expected response. In the original mode we iterate until we reach a passing result, but in this one we iterate through all N responses to collect a set of pass/fail metrics and use those instead. The history of pass/fail/scores/rationale from that set of samples is then passed to the LLM for the optimization. Once a new variation is generated, it runs through all N samples again to ensure they all pass.
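As a rough sketch of that loop (`run_sample`, `generate_variation`, and `SampleResult` are hypothetical stand-ins for illustration, not the SDK's actual API):

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    passed: bool
    score: float
    rationale: str


def optimize_with_ground_truth(samples, run_sample, generate_variation, max_attempts=5):
    """Run every sample each attempt; succeed only when the whole batch passes."""
    variation = None
    history = []
    for _ in range(max_attempts):
        # Evaluate ALL N samples, not just until the first passing result
        results = [run_sample(sample, variation) for sample in samples]
        history.append(results)
        if all(r.passed for r in results):
            return variation, history
        # Hand the full pass/fail/score/rationale history to the LLM
        variation = generate_variation(history)
    return None, history
```

The key difference from the original mode is that a variation only wins if it passes every sample, which guards against overfitting to a single item.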

Describe alternatives you've considered

We discussed doing this in other ways, such as:

  • Sampling the ground truth options and only running a subset until passing (does not ensure that it passes for all entries, leads to possible overfitting to a single item)
  • Only confirming that it passes on some subset of the items -- has the same issue as above

Additional context

Implementation when pulling from a config looks like:

    options = OptimizationFromConfigOptions(
        project_key="default",
        context_choices=[context_builder("user-123")],
        handle_agent_call=handle_agent_call,
        handle_judge_call=handle_judge_call,
        base_url="https://ld-stg.launchdarkly.com/"
    )

    result = await client.optimize_from_config("ground-truth-optimization", options)

Result from a simple optimization:

(screenshot: result from a simple optimization, Apr 3, 2026)

Note

Medium Risk
Adds a new multi-sample optimization loop and changes judge prompting to incorporate optional ground-truth expected responses, which can affect optimization outcomes and run behavior. Also introduces stricter validation (missing instructions/model fallback) that may change failure modes for existing consumers.

Overview
Adds a ground-truth optimization mode that evaluates an agent against an ordered list of samples each attempt and only succeeds when all samples pass, generating a new variation and re-running the full batch until max_attempts.

Introduces GroundTruthSample / GroundTruthOptimizationOptions, exports them publicly, and updates optimize_from_config to auto-detect groundTruthResponses and dispatch to the new ground-truth run (zipping groundTruthResponses + userInputOptions + variableChoices, with length validation).
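The zip-with-length-validation step might look roughly like this (function and field names are assumptions based on the summary above, not the SDK's actual code):

```python
def build_samples(ground_truth_responses, user_input_options, variable_choices):
    """Zip the three parallel lists into per-sample tuples, erroring on mismatch."""
    if not (len(ground_truth_responses) == len(user_input_options) == len(variable_choices)):
        raise ValueError(
            "groundTruthResponses, userInputOptions, and variableChoices "
            "must all have the same length"
        )
    return list(zip(ground_truth_responses, user_input_options, variable_choices))
```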

Updates judge evaluation to accept an optional expected_response and inject it into both config-judge templates and acceptance-judge user messages, and tightens runtime validation by erroring on missing agent instructions plus seeding a default model from model_choices when the flag has none. Extensive new tests cover the new dataclasses, batch loop behavior/callbacks, config dispatch, and expected-response injection.
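A minimal sketch of injecting the optional expected response into the acceptance-judge user message (the message layout here is illustrative, not the SDK's actual template):

```python
def build_judge_message(acceptance_criteria, agent_output, expected_response=None):
    """Build the judge's user message, appending ground truth only when provided."""
    content = (
        f"Acceptance criteria:\n{acceptance_criteria}\n\n"
        f"Agent output:\n{agent_output}\n"
    )
    if expected_response is not None:
        # Optional ground-truth section; omitted entirely in the original mode
        content += f"\nExpected response (ground truth):\n{expected_response}\n"
    return {"role": "user", "content": content}
```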

Reviewed by Cursor Bugbot for commit 44c8c59. Bugbot is set up for automated code reviews on this repo. Configure here.

@andrewklatzke andrewklatzke requested a review from jsonbailey April 3, 2026 22:13
@andrewklatzke andrewklatzke requested a review from a team as a code owner April 3, 2026 22:13

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.



            config, options, api_client, optimization_id, run_id
        )
        if isinstance(optimization_options, GroundTruthOptimizationOptions):
            return await self._run_ground_truth_optimization(agent_config, optimization_options)

optimize_from_config silently returns incompatible types

Medium Severity

optimize_from_config now returns either a single OptimizationContext or a List[OptimizationContext] depending on whether the remote config contains ground truth responses. The docstring still says it returns "OptimizationContext from the final iteration". Existing callers accessing attributes like result.completion_response will get an AttributeError if the remote config is updated to include ground truth, since the return silently changes to a list. The caller has no way to predict the return type before calling since the config is fetched internally.
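Until the return type is unified, a caller could defensively normalize the union before accessing attributes; this is a hypothetical guard, not part of the SDK (`OptimizationContext` is stubbed here for illustration):

```python
from typing import List, Union


class OptimizationContext:
    """Stand-in stub for the SDK's OptimizationContext."""
    def __init__(self, completion_response):
        self.completion_response = completion_response


def final_context(
    result: Union[OptimizationContext, List[OptimizationContext]],
) -> OptimizationContext:
    """Narrow the union before touching attributes like completion_response."""
    if isinstance(result, list):
        # Ground-truth path: one context per iteration; take the final one
        return result[-1]
    return result
```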



    if len(self.model_choices) < 1:
        raise ValueError("model_choices must have at least 1 model")
    if len(self.ground_truth_responses) < 1:
        raise ValueError("ground_truth_responses must have at least 1 sample")

Missing judge_model validation in ground truth options

Low Severity

GroundTruthOptimizationOptions.__post_init__ is missing the judge_model is None validation that OptimizationOptions.__post_init__ has. While judge_model is typed as str, Python doesn't enforce this at runtime. Passing None would not be caught until the bridge OptimizationOptions is constructed internally, producing a confusing error referencing the wrong class.
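The fix is likely a one-line eager check in `__post_init__`, sketched here against an assumed shape of the dataclass (the field list is illustrative, not the SDK's full definition):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GroundTruthOptimizationOptionsSketch:
    judge_model: Optional[str]
    model_choices: List[str] = field(default_factory=list)
    ground_truth_responses: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Mirror OptimizationOptions: fail here, not later inside the bridge class
        if self.judge_model is None:
            raise ValueError("judge_model is required")
        if len(self.model_choices) < 1:
            raise ValueError("model_choices must have at least 1 model")
        if len(self.ground_truth_responses) < 1:
            raise ValueError("ground_truth_responses must have at least 1 sample")
```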


