abstract client + support for interleaved thinking#860
Conversation
mikasenghaas left a comment
nice! i think the main open questions from my side are:
- where and how does client resolution happen? right now, because we have oai completion, oai chat completion, anthropic completion, and anthropic message, we basically choose a client based on "which client type was passed? -> oai or anthropic" and "which message type does the env have? -> str or dict". this fails once we introduce a client that does not support e.g. str prompts (see the sketch below this comment)
- also, i would personally be in favor of having the env dictate the message type. it is now the client's concern to map generic inputs -> native inputs and native outputs -> generic outputs. we could think about black- or whitelisting specific clients from the registry for an env (e.g. a tool env cannot use the oai completions client because it doesn't support tools), so we could show standard errors instead of api errors. but beyond that i don't think the env should care about this
- still not fully decided on which types to use (custom vs. extend oai). the former is more elegant and future-proof, but the latter is more backwards-compatible

also, i think we should not release this with the next release. there is already enough for people to adapt to, and if we were to go through with this as-is i would expect a lot of breaking changes (just from our custom type defs). we should probably test the typical user migration we expect on research-envs before releasing
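An illustrative sketch of the two-axis dispatch described in the first bullet; all names here are stand-ins for discussion, not the actual `resolve_client` implementation:

```python
from typing import Literal


class FakeAsyncOpenAI: ...      # stand-in for openai.AsyncOpenAI
class FakeAsyncAnthropic: ...   # stand-in for anthropic.AsyncAnthropic


def pick_client(raw: object, message_type: Literal["chat", "completion"]) -> str:
    # axis 1: which SDK client type was passed?
    if isinstance(raw, FakeAsyncAnthropic):
        # Anthropic has no str-prompt (completion) API surface, so a
        # "completion" env has no valid client here -- the failure mode above.
        return "AnthropicMessagesClient"
    # axis 2: which message type does the env use?
    if message_type == "completion":
        return "OAICompletionsClient"
    return "OAIChatCompletionsClient"
```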
```python
ClientType = Literal["openai", "anthropic"]
MessageType = Literal["chat", "completion"]
```
i know this is still from me lol -- but i'm not sure whether this distinction makes sense under the new client abstraction. what we have now:
a client which maps generic message types -> native message types (could be provider-specific, e.g. oai chat vs. anthropic message, or "message"-specific, e.g. oai completion (str) vs. oai chat (dict))
thus, having the client and message type distinct like this doesn't really make sense to me anymore. the main difficulty imo is that the message type is currently owned by the env (e.g. an env may be defined as a "completion" env), which (currently) implies that the completions api is used to get responses.
under the new client abstraction i would argue that this should not be allowed: an env defines rollout interactions in terms of generic types and does not dictate which client is used to get the response
```python
# ──────────────────────────────────────────────────────────────────────
# Shared message types (provider-agnostic)
# ──────────────────────────────────────────────────────────────────────
class TextMessage(CustomBaseModel):
```
ah, so we still define our own types? i thought we wanted to keep oai types and extend where necessary for now? if we do this, we should imo keep the naming as-is and just define a bunch of type aliases to oai types, so that if we decide to move away from them we don't break imports for people
that said, if @willccbb is fine with new types, i am also fine with it. imo we have to define those sooner or later anyway, so we might as well break it now. but i would expect quite a few edge cases with this for some time that we just have to accept and fix bit by bit
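A sketch of the alias approach suggested above: keep OpenAI types behind stable verifiers names so a later switch to custom types doesn't break imports. These exact aliases are illustrative, not the PR's code:

```python
from openai.types.chat import ChatCompletion, ChatCompletionMessageParam

ChatMessage = ChatCompletionMessageParam  # alias today, swappable later
ModelResponse = ChatCompletion            # downstream code imports only these names
```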
```python
prompt: Messages
completion: Messages
response: ModelResponse
prompt: Messages | str
```
the consistent thing would be to deprecate these, because the conversion from our generic message type to the str type that e.g. oai completions expects now lives in the client, so the core vf logic should not have to deal with it. this would break the (small?) set of completion envs
```python
    OAIChatCompletionsClient,
    OAIChatCompletionsTokenClient,
    OAICompletionsClient,
)
```
Missing documentation updates for new client and types
Low Severity
This PR introduces significant user-facing changes — a new Client abstraction with four implementations, replacement of oai_tools with tool_defs, new provider-agnostic message/response types, and an interleaved_thinking parameter — but includes no updates to any files under docs/. Per the project documentation rule, changes to core user-facing functionality must be accompanied by corresponding documentation updates.
Triggered by project rule: BugBot Instructions
```python
    return {k: v for k, v in sampling_args.items() if v is not None}
```

```python
sampling_args = normalize_sampling_args(sampling_args)
state = cast(State, kwargs.pop("state"))
```
Token client state kwarg required but not defensively handled
Low Severity
OAIChatCompletionsTokenClient.get_native_response unconditionally does kwargs.pop("state") which raises a raw KeyError if state is not provided. While the normal flow through Client.get_response always passes state, nothing in the type signature or contract enforces this. A missing state yields an unhelpful KeyError instead of a descriptive error.
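A sketch of one defensive variant: fail with a descriptive error instead of a raw KeyError when the contract is violated. The helper name is hypothetical:

```python
from typing import Any


def pop_required_state(kwargs: dict[str, Any]) -> Any:
    # Surface a descriptive error rather than a bare KeyError when `state`
    # was not threaded through by the framework.
    if "state" not in kwargs:
        raise ValueError(
            "get_native_response requires a 'state' kwarg; it is normally "
            "supplied by Client.get_response"
        )
    return kwargs.pop("state")
```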
```python
# Wrap raw client in vf.Client adapter
state["client"] = resolve_client(
    client, self.message_type, self.interleaved_rollouts
)
```
New resolve_client wrapping creates fresh wrapper each rollout
Low Severity
init_state calls resolve_client(client, ...) on every rollout, allocating a new Client wrapper object per rollout even though the underlying AsyncOpenAI/AsyncAnthropic client is the same. For high-throughput evaluation with thousands of rollouts, this creates unnecessary object churn. The wrapper could be cached or created once per generate call.
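One possible mitigation (a sketch, not the PR's code): memoize adapters per underlying SDK client so repeated rollouts reuse one wrapper. This assumes the PR's `resolve_client` factory is importable; the cache wrapper itself is hypothetical, and entries are held weakly so cached wrappers don't keep dead clients alive:

```python
import weakref

_wrappers: "weakref.WeakKeyDictionary[object, dict]" = weakref.WeakKeyDictionary()


def resolve_client_cached(client, message_type, interleaved_rollouts):
    # One sub-dict per SDK client, keyed by the resolution parameters, since
    # the same client can legitimately map to different adapters.
    per_client = _wrappers.setdefault(client, {})
    key = (message_type, interleaved_rollouts)
    if key not in per_client:
        per_client[key] = resolve_client(client, message_type, interleaved_rollouts)
    return per_client[key]
```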
```python
    messages=prompt,
    **normalize_sampling_args(sampling_args),
)
return response
```
OAI chat client silently swallows extra kwargs
Low Severity
OAIChatCompletionsClient.get_native_response accepts **kwargs (which includes state passed by the framework) but silently ignores them, while AnthropicMessagesClient.get_native_response explicitly pops state before forwarding remaining kwargs to the API. This inconsistency is fragile — anyone subclassing OAIChatCompletionsClient who adds **kwargs passthrough to the API call would inadvertently send state to the provider.
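A sketch of the consistent variant (approximate method signature, assumed here): pop framework-internal kwargs before forwarding the rest, mirroring AnthropicMessagesClient, so a subclass that forwards `**kwargs` can never leak `state` to the provider:

```python
async def get_native_response(self, prompt, sampling_args, **kwargs):
    # `state` is framework-internal; strip it before anything reaches the API.
    kwargs.pop("state", None)
    return await self.client.chat.completions.create(
        messages=prompt,
        **normalize_sampling_args(sampling_args),
        **kwargs,
    )
```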
```python
]
assert result[0]["completion"] == [
    {"role": "assistant", "content": "Final DONE"},
]
```
Completion serialization flattens messages despite new test expecting lists
Medium Severity
The new test test_states_to_outputs_completion_keeps_messages expects completion-mode prompt/completion to remain as message lists, but state_to_output in save_utils.py still flattens them to strings via _flatten_messages_content when message_type == "completion" and "input" in state. Since make_state sets both message_type="completion" (via kwargs) and input (explicitly), should_flatten_completion_output evaluates to True, causing the output to be a concatenated string — not the list the assertions expect.
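A reconstructed sketch of the condition Bugbot describes; the real helper lives in save_utils.py and may differ:

```python
# Reconstructed sketch, not the actual save_utils.py code:
def should_flatten_completion_output(state: dict) -> bool:
    return state.get("message_type") == "completion" and "input" in state

# make_state in the new test sets both fields, so this returns True and the
# completion is flattened to a string, contradicting the list assertions.
```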
```python
# Fall back to OpenAI client for duck-typed clients (e.g., mocks, proxies)
if message_type == "completion":
    return OAICompletionsClient(cast(AsyncOpenAI, client))
return OAIChatCompletionsClient(cast(AsyncOpenAI, client))
```
New client wrapper created on every resolve_client call
Low Severity
resolve_client allocates a new Client wrapper on every invocation even when the underlying SDK client is unchanged. In get_model_response, when a raw AsyncOpenAI/AsyncAnthropic is passed (non-None), a fresh wrapper is instantiated per call. Combined with init_state also calling resolve_client, multi-turn rollouts that pass explicit clients create redundant wrapper objects per turn.
```python
# model/client
"endpoint_id",
"model",
"api_client_type",
```
EvalConfig.interleaved_thinking never applied to environment
Medium Severity
EvalConfig has an interleaved_thinking field (defaults to True) and Environment now has a set_interleaved_thinking() method, but run_evaluation in eval_utils.py never calls vf_env.set_interleaved_thinking(config.interleaved_thinking). The config value is silently ignored and the environment always uses its constructor default. Setting interleaved_thinking: false in an eval config has no effect.
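A sketch of the missing wiring in run_evaluation (eval_utils.py); the surrounding environment-construction code is assumed, while the method and config field come from the PR:

```python
# After the environment is constructed in run_evaluation:
vf_env.set_interleaved_thinking(config.interleaved_thinking)
```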
```python
    if self.interleaved_thinking
    else None,
    thinking_blocks=thinking_blocks or None,
    tool_calls=tool_calls or None,
```
Anthropic thinking_blocks kept when interleaved_thinking disabled
Low Severity
When interleaved_thinking is False, from_native_response correctly strips reasoning_content but still preserves thinking_blocks on the ResponseMessage. These blocks then round-trip through parse_response_message into future prompts via to_native_prompt's extract_thinking_blocks. If the intent of disabling interleaved thinking is to omit all thinking data from downstream processing, thinking_blocks leaks through.
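A sketch of the symmetric fix: gate both reasoning fields on the flag so no thinking data survives when interleaving is disabled. The local variable names are assumed from the diff context:

```python
message = ResponseMessage(
    role="assistant",
    content=content,
    # Gate BOTH fields, not just reasoning_content, so disabled interleaving
    # strips all thinking data from downstream prompts.
    reasoning_content=reasoning_content if self.interleaved_thinking else None,
    thinking_blocks=(thinking_blocks or None) if self.interleaved_thinking else None,
    tool_calls=tool_calls or None,
)
```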
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
# Fall back to OpenAI client for duck-typed clients (e.g., mocks, proxies)
if message_type == "completion":
    return OAICompletionsClient(cast(AsyncOpenAI, client))
return OAIChatCompletionsClient(cast(AsyncOpenAI, client))
```
Duck-typed clients silently lose interleaved rollouts
Medium Severity
The duck-typed fallback in resolve_client ignores interleaved_rollouts and always creates a regular OAIChatCompletionsClient. When the client object isn't an explicit AsyncOpenAI instance (e.g., subclass, mock, or proxy), the OAIChatCompletionsTokenClient path is never reached, silently disabling interleaved rollouts without any warning.
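A sketch of one fix: honor `interleaved_rollouts` in the duck-typed fallback as well. The client classes are from the PR; the added branch is the suggested change, not existing code:

```python
if message_type == "completion":
    return OAICompletionsClient(cast(AsyncOpenAI, client))
if interleaved_rollouts:
    # Route duck-typed clients to the token-capable adapter too, instead of
    # silently disabling interleaved rollouts.
    return OAIChatCompletionsTokenClient(cast(AsyncOpenAI, client))
return OAIChatCompletionsClient(cast(AsyncOpenAI, client))
```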
merged in #897


Description
Based off of PR #788, which added provider-agnostic types for the message/client/tool/response flow.
General questions:
- Some envs type prompt/completion as `str` (like continuation-quality); `RolloutOutput` still types these as messages, so Pydantic warns. We could widen these to `Messages | str` (should we do this?).

Other notes
- `tool_defs` is canonical: tools are provider-agnostic `Tool` objects, and `oai_tools` is accepted only at boundaries, normalized once at init.
- `ChatMessage`, `ChatMessages`, `ModelResponse` remain as transition shims.
- `MultiTurnEnv`

Also fixed
- Endpoint resolution now preserves `client_type`, so `-e configs/endpoints.py -m haiku` resolves to Anthropic correctly.

Client Interface
`vf.Client` is the adapter layer that wraps a native SDK client and standardizes everything into a `vf.Response`.

Each client implementation defines four core methods:

- `to_native_prompt`: convert `vf.Messages` into the provider-native prompt format.
- `get_native_response`: execute the provider-native API call.
- `raise_from_native_response` (optional): map/raise provider-specific errors (for example, overlong prompt behavior).
- `from_native_response`: convert provider-native output into a unified `vf.Response`.

We intentionally moved to custom unified types (instead of continuing to normalize to OpenAI-only types) because some provider features do not map cleanly to OAI schemas. The current type system is provider-agnostic and supports multimodal/reasoning/tool patterns across clients.
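A minimal sketch of the adapter shape described above; the four method names come from the PR text, while the signatures and the `get_response` orchestration are illustrative assumptions:

```python
from abc import ABC, abstractmethod
from typing import Any


class Client(ABC):
    @abstractmethod
    def to_native_prompt(self, prompt: Any) -> Any:
        """Convert vf.Messages into the provider-native prompt format."""

    @abstractmethod
    async def get_native_response(self, native_prompt: Any, **kwargs: Any) -> Any:
        """Execute the provider-native API call."""

    def raise_from_native_response(self, native: Any) -> None:
        """Optional hook: map/raise provider-specific errors (e.g. overlong prompts)."""

    @abstractmethod
    def from_native_response(self, native: Any) -> Any:
        """Convert provider-native output into a unified vf.Response."""

    async def get_response(self, prompt: Any, **kwargs: Any) -> Any:
        # Assumed orchestration: convert in, call the provider, map errors,
        # convert out.
        native = await self.get_native_response(self.to_native_prompt(prompt), **kwargs)
        self.raise_from_native_response(native)
        return self.from_native_response(native)
```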
Implemented Clients
- `OAICompletionsClient`
- `OAIChatCompletionsClient`
- `OAIChatCompletionsTokenClient`
- `AnthropicMessagesClient`

(Anthropic raw completions are not supported in the current implementation; I think they are deprecated.)
This architecture is meant to be extensible to additional providers/API surfaces (including future OAI Responses-style adapters).
Backward Compatibility
- `tool_defs` is now canonical internally; legacy `oai_tools` is accepted at boundaries and normalized at init.
- We keep aliases (`ChatMessage`, `ChatMessages`, `ModelResponse`) as migration shims for non-migrated experimental/integration paths.
- `CustomBaseModel` keeps dict-like access to avoid breaking existing env code that still uses mapping-style message access (see the sketch below).
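An assumed mechanism for that dict-like access; the real `CustomBaseModel` may differ:

```python
from pydantic import BaseModel


class CustomBaseModel(BaseModel):
    def __getitem__(self, key: str):
        return getattr(self, key)           # msg["content"] keeps working

    def get(self, key: str, default=None):
        return getattr(self, key, default)  # msg.get("tool_calls") too
```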
Validation Runs

Tested:
- `uv run vf-eval continuation-quality -n1 -r1 -d -v`
- `uv run vf-eval gsm8k -n1 -r1 -d -v`
- `uv run vf-eval gsm8k -n1 -r1 -d -v -m haiku -e configs/endpoints.py`
- `uv run vf-eval alphabet-sort -n1 -r1 -d -v -m haiku -e configs/endpoints.py`
- `uv run vf-eval wiki-search -n1 -r1 -d -v -e configs/endpoints.py`
- `uv run vf-eval math-python -n1 -r1 -v -d -m deepseek-reasoner -e configs/endpoints.py`

Not Tested Yet:
- `uv run vf-eval alphabet-sort -n1 -r1 -v -d -m Qwen/Qwen3-4B-Instruct-2507 -b http://localhost:8000/v1`
- `uv run vf-eval alphabet-sort -n1 -r1 -v -d -m Qwen/Qwen3-4B-Instruct-2507 -b http://localhost:8000/v1 -x '{"interleaved_rollouts": true}'`

Type of Change
Testing
- `uv run pytest` locally.

Checklist
Additional Notes
Note
High Risk
Touches core request/response plumbing (client selection, tool handling, and rollout state shape) and introduces a new provider adapter, so regressions could affect all environments and model backends despite added test coverage.
Overview
This PR replaces direct SDK calls inside `Environment.get_model_response` with a provider-agnostic `vf.Client` adapter that converts `Messages`/`Tool` definitions to native provider requests and normalizes outputs into a unified `vf.Response` (including usage, tool calls, and optional reasoning content). It adds first-class Anthropic support via `AnthropicMessagesClient`, plus an OpenAI token-capable client (`OAIChatCompletionsTokenClient`) to support interleaved rollouts/thinking.

Tools are migrated from legacy `oai_tools` (OpenAI schema) to canonical `tool_defs` (`vf.Tool`), with strict validation and boundary normalization (including intercepted agent tool schemas and MCP tool conversion). Prompt rendering in OpenEnv environments now normalizes raw messages into typed `Messages`, and completion-mode env/tests are updated to treat stored prompts/completions as message lists.

Endpoint configuration is expanded and normalized to preserve an explicit provider `type`/`api_client_type` (e.g., Anthropic + DeepSeek variants), `pyproject.toml` adds the `anthropic` dependency, and extensive new tests cover client auth/overlong-prompt behavior, multimodal content parts, interception serialization, and endpoint client-type resolution.

Written by Cursor Bugbot for commit 476fb12. This will update automatically on new commits.