feat: ground truth optimization path #122
andrewklatzke wants to merge 1 commit into aklatzke/AIC-1794/optimize-method-from-ld
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 44c8c59.
            config, options, api_client, optimization_id, run_id
        )
    if isinstance(optimization_options, GroundTruthOptimizationOptions):
        return await self._run_ground_truth_optimization(agent_config, optimization_options)
optimize_from_config silently returns incompatible types
Medium Severity
optimize_from_config now returns either a single OptimizationContext or a List[OptimizationContext] depending on whether the remote config contains ground truth responses. The docstring still says it returns "OptimizationContext from the final iteration". Existing callers accessing attributes like result.completion_response will get an AttributeError if the remote config is updated to include ground truth, since the return silently changes to a list. The caller has no way to predict the return type before calling since the config is fetched internally.
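Until the return type is split or annotated, a caller could guard against the mixed return described in this comment by normalizing the result to a list. The following is a minimal sketch, not SDK code: OptimizationContext is a stand-in dataclass carrying only the completion_response attribute named in the comment, and as_context_list is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class OptimizationContext:
    # Stand-in for the SDK class; only completion_response is named in the review.
    completion_response: str


def as_context_list(
    result: Union[OptimizationContext, List[OptimizationContext]],
) -> List[OptimizationContext]:
    """Hypothetical helper: always return a list, whichever path ran."""
    if isinstance(result, list):
        return result
    return [result]


# Simulated results from the single-context path and the ground-truth path.
single = OptimizationContext(completion_response="ok")
batch = [OptimizationContext("a"), OptimizationContext("b")]

for result in (single, batch):
    for ctx in as_context_list(result):
        # Attribute access is now safe for both return shapes.
        print(ctx.completion_response)
```

This only works around the ambiguity on the caller side; it does not address the stale docstring or the lack of a declared Union return type.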
    if len(self.model_choices) < 1:
        raise ValueError("model_choices must have at least 1 model")
    if len(self.ground_truth_responses) < 1:
        raise ValueError("ground_truth_responses must have at least 1 sample")
Missing judge_model validation in ground truth options
Low Severity
GroundTruthOptimizationOptions.__post_init__ is missing the judge_model is None validation that OptimizationOptions.__post_init__ has. While judge_model is typed as str, Python doesn't enforce this at runtime. Passing None would not be caught until the bridge OptimizationOptions is constructed internally, producing a confusing error referencing the wrong class.
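The missing check could look roughly like the sketch below. GroundTruthOptionsSketch is a reduced, hypothetical stand-in for GroundTruthOptimizationOptions (the real class has more fields), and the wording of the judge_model error message is an assumption; the other two checks are copied from the diff above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GroundTruthOptionsSketch:
    # Hypothetical, reduced stand-in for GroundTruthOptimizationOptions.
    judge_model: str
    model_choices: List[str] = field(default_factory=list)
    ground_truth_responses: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Type hints are not enforced at runtime, so check for None explicitly.
        # The message text here is an assumption, not the SDK's wording.
        if self.judge_model is None:
            raise ValueError("judge_model must be provided")
        if len(self.model_choices) < 1:
            raise ValueError("model_choices must have at least 1 model")
        if len(self.ground_truth_responses) < 1:
            raise ValueError("ground_truth_responses must have at least 1 sample")


try:
    GroundTruthOptionsSketch(
        judge_model=None,  # type: ignore[arg-type]
        model_choices=["model-a"],
        ground_truth_responses=["expected answer"],
    )
except ValueError as exc:
    # The error now names the class the caller actually constructed.
    print(f"rejected early: {exc}")
```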


Requirements
Describe the solution you've provided
Implements the "ground truth" path for SDK optimizations. The existing path optimizes through a form of "chaos testing": inputs are selected at random, and each output is judged only against the acceptance statements and judges.
This path requires the user to pass additional data (ground_truth_responses), and the result of each call is also compared with the expected response. In the original mode we iterate until we reach a passing result; in this mode we run all N samples to collect a set of pass/fail metrics and use those instead. The history of pass/fail results, scores, and rationales from that set of samples is then passed to the LLM for the optimization. Once a new variation is generated, it runs through all N samples again to ensure they all pass.
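The loop described above can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions, not the PR's implementation: Sample, SampleResult, run_ground_truth_loop, and the three callables are hypothetical names, and the toy agent, judge, and variation generator stand in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    user_input: str
    expected_response: str


@dataclass
class SampleResult:
    passed: bool
    rationale: str


def run_ground_truth_loop(
    instructions: str,
    samples: List[Sample],
    run_agent: Callable[[str, str], str],
    judge: Callable[[str, str], SampleResult],
    propose_variation: Callable[[str, List[SampleResult]], str],
    max_attempts: int,
) -> Tuple[bool, str, List[SampleResult]]:
    results: List[SampleResult] = []
    for attempt in range(max_attempts):
        # Run every sample each attempt; never stop at the first pass or failure.
        results = [
            judge(run_agent(instructions, s.user_input), s.expected_response)
            for s in samples
        ]
        if all(r.passed for r in results):
            return True, instructions, results
        if attempt < max_attempts - 1:
            # Feed the whole batch history to the variation generator.
            instructions = propose_variation(instructions, results)
    return False, instructions, results


# Toy stand-ins for the agent, the judge, and the optimizing LLM.
def toy_agent(instructions: str, user_input: str) -> str:
    return "4" if "concise" in instructions else "The answer is 4"


def toy_judge(actual: str, expected: str) -> SampleResult:
    ok = actual == expected
    return SampleResult(ok, "match" if ok else f"expected {expected!r}, got {actual!r}")


def toy_variation(instructions: str, history: List[SampleResult]) -> str:
    return instructions + " Be concise."


ok, final_instructions, final_results = run_ground_truth_loop(
    "Answer the question.",
    [Sample("2+2?", "4"), Sample("1+3?", "4")],
    toy_agent,
    toy_judge,
    toy_variation,
    max_attempts=3,
)
print(ok, final_instructions)
```

The key difference from the original mode is visible in the list comprehension: success is decided on the full batch, so a variation that fixes one sample but regresses another is rejected.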
Describe alternatives you've considered
We discussed doing this in other ways, such as:
Additional context
Implementation when pulling from a config looks like:
Result from a simple optimization:
Note
Medium Risk
Adds a new multi-sample optimization loop and changes judge prompting to incorporate optional ground-truth expected responses, which can affect optimization outcomes and run behavior. Also introduces stricter validation (missing instructions/model fallback) that may change failure modes for existing consumers.
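To illustrate why the judge-prompt change can shift outcomes, here is a minimal sketch of injecting an optional expected response into an acceptance-judge user message. The function name build_judge_user_message and the message wording are assumptions for illustration only; the PR's actual templates are not shown on this page.

```python
from typing import Optional


def build_judge_user_message(
    response: str,
    acceptance_statement: str,
    expected_response: Optional[str] = None,
) -> str:
    """Hypothetical builder for an acceptance-judge user message."""
    parts = [
        f"Acceptance statement: {acceptance_statement}",
        f"Agent response: {response}",
    ]
    # Only the ground-truth path supplies an expected response, so the
    # original path produces exactly the same message as before.
    if expected_response is not None:
        parts.append(f"Expected response: {expected_response}")
    return "\n".join(parts)


without_truth = build_judge_user_message("Paris", "Names the capital of France")
with_truth = build_judge_user_message(
    "Paris", "Names the capital of France", expected_response="Paris"
)
print(without_truth)
print(with_truth)
```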
Overview
Adds a ground-truth optimization mode that evaluates an agent against an ordered list of samples each attempt and only succeeds when all samples pass, generating a new variation and re-running the full batch until max_attempts.
Introduces GroundTruthSample/GroundTruthOptimizationOptions, exports them publicly, and updates optimize_from_config to auto-detect groundTruthResponses and dispatch to the new ground-truth run (zipping groundTruthResponses + userInputOptions + variableChoices, with length validation).
Updates judge evaluation to accept an optional expected_response and inject it into both config-judge templates and acceptance-judge user messages, and tightens runtime validation by erroring on missing agent instructions plus seeding a default model from model_choices when the flag has none. Extensive new tests cover the new dataclasses, batch loop behavior/callbacks, config dispatch, and expected-response injection.
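The zipping step with length validation mentioned in the overview could look roughly like this. It is a sketch under assumptions: the three config keys come from the overview, but build_samples, the output field names, and the error message are hypothetical.

```python
from typing import Any, Dict, List


def build_samples(config: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: pair up the three parallel config lists by index."""
    responses = config["groundTruthResponses"]
    inputs = config["userInputOptions"]
    variables = config["variableChoices"]

    # zip() silently truncates to the shortest list, so validate lengths first.
    if not (len(responses) == len(inputs) == len(variables)):
        raise ValueError(
            "groundTruthResponses, userInputOptions and variableChoices "
            f"must have the same length, got {len(responses)}, "
            f"{len(inputs)} and {len(variables)}"
        )

    return [
        {"expected_response": r, "user_input": u, "variables": v}
        for r, u, v in zip(responses, inputs, variables)
    ]


samples = build_samples(
    {
        "groundTruthResponses": ["4", "Paris"],
        "userInputOptions": ["2+2?", "Capital of France?"],
        "variableChoices": [{"tone": "terse"}, {"tone": "terse"}],
    }
)
print(samples)
```

Validating before zipping matters because a mismatched config would otherwise drop samples without any error.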