
abstract client + support for interleaved thinking #860

Closed
eligotts wants to merge 52 commits into main from eli/anthropic-client

Conversation

@eligotts (Contributor) commented Feb 8, 2026

Description

Based on PR #788, which added provider-agnostic types for the message/client/tool/response flow.

General questions:

  • Right now both client implementations are rather large because they handle mixed types (new custom types and legacy OAI types). How much backwards compatibility should these support?
  • There's also a significant amount of backwards-compat bloat in environments.py due to oai_tools vs. tool_defs. How do we want to handle tools, and what format should they be in?
  • In multiturn_env.py, we now need an explicit check for the 'completion' type in get_prompt_messages and render_completion because parse_response_message is now chat-shaped. This is kind of ugly, so I'm wondering about the best route forward to support completion-type environments.
  • More generally: how should this migration handle mixed message types, and how much support should we maintain for OAI-based types?

Other notes

  1. Completion warning is expected
  • Completion envs intentionally store prompt/completion as str (like continuation-quality).
  • RolloutOutput still types these as messages, so Pydantic warns.
  • Behavior is correct; the follow-up is to widen the type to Messages | str (should we do this?).
  2. tool_defs is canonical
  • The internal runtime now uses only tool_defs (provider-agnostic Tool).
  • Legacy oai_tools is accepted only at boundaries and normalized once at init (see the sketch after this list).
  • This preserves old env inputs without dual-state bugs.
  3. Legacy aliases kept temporarily
  • ChatMessage, ChatMessages, ModelResponse remain as transition shims.
  • Core runtime is migrated, but some experimental/integration modules still reference the aliases.
  • We'll remove the aliases after those paths finish migration.
  4. Explicit completion path in MultiTurnEnv
  • Old helpers previously hid the chat/completion branching.
  • After moving to unified response parsing, explicit completion logic is required.
  • Added completion-specific assembly to keep multi-turn completion envs correct.
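
For concreteness, here is a rough sketch of the boundary normalization described in point 2. The Tool fields and the normalize_tool helper are illustrative assumptions, not this PR's actual implementation:

```python
from pydantic import BaseModel


class Tool(BaseModel):
    """Stand-in for the provider-agnostic Tool model; real fields may differ."""
    name: str
    description: str = ""
    parameters: dict = {}


def normalize_tool(oai_tool: dict) -> Tool:
    # Accept a legacy OpenAI-schema tool dict once, at the boundary,
    # so the runtime only ever sees tool_defs afterwards.
    fn = oai_tool.get("function", oai_tool)
    return Tool(
        name=fn["name"],
        description=fn.get("description", ""),
        parameters=fn.get("parameters", {}),
    )


# At env init (single normalization point, so no dual-state bugs):
legacy = {"type": "function", "function": {"name": "search", "parameters": {}}}
tool_defs = [normalize_tool(legacy)]
```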

Also fixed

  • Endpoint normalization now preserves client_type, so -e configs/endpoints.py -m haiku resolves to Anthropic correctly.

Client Interface

vf.Client is the adapter layer that wraps a native SDK client and standardizes everything into a vf.Response.

Each client implementation defines four core methods:

  1. to_native_prompt
    Convert vf.Messages into provider-native prompt format.
  2. get_native_response
    Execute the provider-native API call.
  3. raise_from_native_response (optional)
    Map/raise provider-specific errors (for example overlong prompt behavior).
  4. from_native_response
    Convert provider-native output into unified vf.Response.
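
As a contract sketch (the four method names come from this PR; the signatures, the native_client attribute, and the use of Any are assumptions):

```python
from abc import ABC, abstractmethod
from typing import Any


class ClientSketch(ABC):
    """Illustrative adapter contract; not the actual vf.Client definition."""

    def __init__(self, native_client: Any):
        self.native_client = native_client  # the wrapped native SDK client

    @abstractmethod
    def to_native_prompt(self, messages: Any) -> Any:
        """Convert vf.Messages into the provider-native prompt format."""

    @abstractmethod
    async def get_native_response(self, prompt: Any, **kwargs: Any) -> Any:
        """Execute the provider-native API call."""

    def raise_from_native_response(self, native_response: Any) -> None:
        """Optional hook: map/raise provider-specific errors."""

    @abstractmethod
    def from_native_response(self, native_response: Any) -> Any:
        """Convert provider-native output into a unified vf.Response."""
```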

We intentionally moved to custom unified types (instead of continuing to normalize to OpenAI-only types) because some provider features do not map cleanly to OAI schemas. The current type system is provider-agnostic and supports multimodal/reasoning/tool patterns across clients.

Implemented Clients

  • OAICompletionsClient
  • OAIChatCompletionsClient
  • OAIChatCompletionsTokenClient
  • AnthropicMessagesClient

(Anthropic raw completions are not supported in the current implementation; I believe that API has been deprecated.)

This architecture is meant to be extensible to additional providers/API surfaces (including future OAI Responses-style adapters).


Backward Compatibility

  • Public environment APIs remain backward compatible.
  • tool_defs is now canonical internally; legacy oai_tools is accepted at boundaries and normalized at init.
  • We keep temporary legacy aliases (ChatMessage, ChatMessages, ModelResponse) as migration shims for non-migrated experimental/integration paths.
  • CustomBaseModel keeps dict-like access to avoid breaking existing env code that still uses mapping-style message access (sketched below).
  • Completion-mode environments intentionally keep raw string prompt/completion behavior for compatibility (known non-fatal Pydantic serializer warning; follow-up planned to widen output typing).
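
The dict-like access could look roughly like this (a minimal sketch assuming a Pydantic base; not the actual CustomBaseModel):

```python
from pydantic import BaseModel


class CustomBaseModel(BaseModel):
    """Sketch: a Pydantic model that also supports mapping-style access,
    so legacy env code like msg["content"] keeps working."""

    def __getitem__(self, key: str):
        try:
            return getattr(self, key)
        except AttributeError as exc:
            raise KeyError(key) from exc

    def get(self, key: str, default=None):
        return getattr(self, key, default)


class TextMessage(CustomBaseModel):
    role: str
    content: str


msg = TextMessage(role="assistant", content="hi")
assert msg["content"] == msg.content == msg.get("content")
```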

Validation Runs

Tested

  • uv run vf-eval continuation-quality -n1 -r1 -d -v
  • uv run vf-eval gsm8k -n1 -r1 -d -v
  • uv run vf-eval gsm8k -n1 -r1 -d -v -m haiku -e configs/endpoints.py
  • uv run vf-eval alphabet-sort -n1 -r1 -d -v -m haiku -e configs/endpoints.py
  • uv run vf-eval wiki-search -n1 -r1 -d -v -e configs/endpoints.py
  • uv run vf-eval math-python -n1 -r1 -v -d -m deepseek-reasoner -e configs/endpoints.py

Not Tested Yet

  • Multi-turn + OAI chat completions tokens (PRIME-RL vLLM server)
    uv run vf-eval alphabet-sort -n1 -r1 -v -d -m Qwen/Qwen3-4B-Instruct-2507 -b http://localhost:8000/v1
  • Multi-turn + OAI chat completions tokens + interleaved rollouts (PRIME-RL vLLM server)
    uv run vf-eval alphabet-sort -n1 -r1 -v -d -m Qwen/Qwen3-4B-Instruct-2507 -b http://localhost:8000/v1 -x '{"interleaved_rollouts": true}'

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

High Risk
Touches core request/response plumbing (client selection, tool handling, and rollout state shape) and introduces a new provider adapter, so regressions could affect all environments and model backends despite added test coverage.

Overview
This PR replaces direct SDK calls inside Environment.get_model_response with a provider-agnostic vf.Client adapter that converts Messages/Tool definitions to native provider requests and normalizes outputs into a unified vf.Response (including usage, tool calls, and optional reasoning content). It adds first-class Anthropic support via AnthropicMessagesClient, plus an OpenAI token-capable client (OAIChatCompletionsTokenClient) to support interleaved rollouts/thinking.

Tools are migrated from legacy oai_tools (OpenAI schema) to canonical tool_defs (vf.Tool), with strict validation and boundary normalization (including intercepted agent tool schemas and MCP tool conversion). Prompt rendering in OpenEnv environments now normalizes raw messages into typed Messages, and completion-mode env/tests are updated to treat stored prompts/completions as message lists.

Endpoint configuration is expanded and normalized to preserve an explicit provider type/api_client_type (e.g., Anthropic + DeepSeek variants), pyproject.toml adds the anthropic dependency, and extensive new tests cover client auth/overlong-prompt behavior, multimodal content parts, interception serialization, and endpoint client-type resolution.

Written by Cursor Bugbot for commit 476fb12.

@eligotts changed the title from "Eli/anthropic client" to "abstract client + support for interleaved thinking" on Feb 8, 2026
Comment thread verifiers/envs/experimental/rlm_env.py Outdated
@mikasenghaas (Member) left a comment:

nice! i think the main open questions from my side are:

  • where and how does client resolution happen? right now, because we have oai completion, oai chat completion, anthropic completion, and anthropic message, we basically choose a client based on "which client type was passed? -> oai or anthropic" and "which message type does the env have? -> str or dict". this fails once we introduce a client that does not support e.g. str prompts
  • also, i would personally be in favor of having the env dictate the message type. it is now the client's concern to map generic inputs -> native inputs and native outputs -> generic outputs. we could think about black- or whitelisting specific clients from the registry for an env (e.g. a tool env cannot use the oai completions client because it doesn't support tools), so we could show standard errors instead of api errors. but beyond that i don't think the env should care about this
  • still not fully decided on which types to use (custom vs. extending oai). the former is more elegant and future-proof, but the latter is more backwards compatible

also, i think we should not ship this with the next release. there are already enough things for people to adapt to, and if we went through with this as-is i would expect a lot of breaking changes (just from our custom type defs). we should probably test the typical user migration we expect on research-envs before releasing

Comment thread verifiers/types.py
Comment on lines +50 to 51
ClientType = Literal["openai", "anthropic"]
MessageType = Literal["chat", "completion"]
Member

i know this is still from me lol, but i'm not sure whether this distinction makes sense under the new client abstraction. what we have now:

a client maps generic message types -> native message types (which could be provider-specific, e.g. oai chat vs. anthropic message, or "message"-specific, e.g. oai completion (str) vs. oai chat (dict))

thus, keeping the client and message type distinct like this doesn't really make sense to me anymore. the main difficulty imo is that the message type is currently owned by the env (e.g. an env may be defined as a "completion" env), which (currently) implies that the completions api is used to get responses.

under the new client abstraction i would probably argue that this should not be allowed. an env defines rollout interactions in terms of generic types and does not dictate which client is used to get the response

Comment thread verifiers/types.py Outdated
Comment thread verifiers/types.py
# ──────────────────────────────────────────────────────────────────────
# Shared message types (provider-agnostic)
# ──────────────────────────────────────────────────────────────────────
class TextMessage(CustomBaseModel):
Member

ah, so we still define our own types? i thought we wanted to keep the oai types and extend them where necessary for now? if we do this, we should imo keep the naming as-is and just define a bunch of type aliases to the oai types, so that if we decide to move away from them we don't break imports for people

that said, if @willccbb is fine with new types, i am also fine with it. imo we have to define those sooner or later anyway, so we might as well break things now. but i would expect quite a few edge cases for some time that we just have to accept and fix bit by bit

Comment thread verifiers/types.py Outdated
Comment thread verifiers/types.py Outdated
prompt: Messages
completion: Messages
response: ModelResponse
prompt: Messages | str
Member

the consistent thing would be to deprecate those, because the conversion from our generic message type to the str type that e.g. oai completions expects now lives in the client, so the core vf logic should not have to deal with this. this would break the (small?) set of completion envs

Comment thread verifiers/types.py Outdated
Comment thread verifiers/utils/interception_utils.py
Comment thread verifiers/clients/openai/openai_clients.py
Comment thread verifiers/clients/anthropic/anthropic_clients.py
Comment thread verifiers/envs/environment.py
Comment thread verifiers/clients/openai/openai_clients.py
Comment thread verifiers/clients/anthropic/anthropic_clients.py
Comment thread verifiers/__init__.py
OAIChatCompletionsClient,
OAIChatCompletionsTokenClient,
OAICompletionsClient,
)

Missing documentation updates for new client and types

Low Severity

This PR introduces significant user-facing changes — a new Client abstraction with four implementations, replacement of oai_tools with tool_defs, new provider-agnostic message/response types, and an interleaved_thinking parameter — but includes no updates to any files under docs/. Per the project documentation rule, changes to core user-facing functionality must be accompanied by corresponding documentation updates.


Comment thread verifiers/clients/openai/openai_clients.py
Comment thread verifiers/clients/anthropic/anthropic_clients.py
Comment thread verifiers/clients/anthropic/anthropic_clients.py
return {k: v for k, v in sampling_args.items() if v is not None}

sampling_args = normalize_sampling_args(sampling_args)
state = cast(State, kwargs.pop("state"))

Token client state kwarg required but not defensively handled

Low Severity

OAIChatCompletionsTokenClient.get_native_response unconditionally does kwargs.pop("state") which raises a raw KeyError if state is not provided. While the normal flow through Client.get_response always passes state, nothing in the type signature or contract enforces this. A missing state yields an unhelpful KeyError instead of a descriptive error.
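
One possible defensive guard (a sketch; the error type and message are judgment calls, not this PR's code):

```python
from typing import Any


def pop_required_state(kwargs: dict[str, Any]) -> Any:
    """Replace a bare kwargs.pop("state") with a descriptive failure."""
    state = kwargs.pop("state", None)
    if state is None:
        raise ValueError(
            "get_native_response requires a 'state' kwarg; it is normally "
            "injected by Client.get_response."
        )
    return state
```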


# Wrap raw client in vf.Client adapter
state["client"] = resolve_client(
client, self.message_type, self.interleaved_rollouts
)

New resolve_client wrapping creates fresh wrapper each rollout

Low Severity

init_state calls resolve_client(client, ...) on every rollout, allocating a new Client wrapper object per rollout even though the underlying AsyncOpenAI/AsyncAnthropic client is the same. For high-throughput evaluation with thousands of rollouts, this creates unnecessary object churn. The wrapper could be cached or created once per generate call.
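
One way to avoid the churn would be to memoize wrappers per underlying SDK client. A sketch, assuming resolve_client is the factory added in this PR; only the caching layer here is hypothetical:

```python
from typing import Any
from weakref import WeakKeyDictionary

# Adapters keyed by the underlying SDK client object; entries disappear
# automatically when the SDK client is garbage collected.
_wrappers: "WeakKeyDictionary[Any, dict]" = WeakKeyDictionary()


def resolve_client_cached(client: Any, message_type: str, interleaved: bool) -> Any:
    per_client = _wrappers.setdefault(client, {})
    key = (message_type, interleaved)
    if key not in per_client:
        # resolve_client is the factory from this PR.
        per_client[key] = resolve_client(client, message_type, interleaved)
    return per_client[key]
```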


Comment thread verifiers/clients/openai/openai_clients.py
eligotts and others added 2 commits February 10, 2026 15:32
Co-authored-by: will brown <willccbb@users.noreply.github.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
messages=prompt,
**normalize_sampling_args(sampling_args),
)
return response

OAI chat client silently swallows extra kwargs

Low Severity

OAIChatCompletionsClient.get_native_response accepts **kwargs (which includes state passed by the framework) but silently ignores them, while AnthropicMessagesClient.get_native_response explicitly pops state before forwarding remaining kwargs to the API. This inconsistency is fragile — anyone subclassing OAIChatCompletionsClient who adds **kwargs passthrough to the API call would inadvertently send state to the provider.
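
For consistency, the OAI chat client could consume state explicitly, mirroring the Anthropic client. A sketch of the method body; self.native_client and the inlined normalization are assumptions:

```python
async def get_native_response(self, prompt, sampling_args, **kwargs):
    # Consume the framework-injected kwarg so a subclass that forwards
    # **kwargs to the API can never accidentally send 'state' upstream.
    kwargs.pop("state", None)
    return await self.native_client.chat.completions.create(
        messages=prompt,
        **{k: v for k, v in sampling_args.items() if v is not None},
    )
```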


Comment thread tests/test_save_utils.py
]
assert result[0]["completion"] == [
{"role": "assistant", "content": "Final DONE"},
]

Completion serialization flattens messages despite new test expecting lists

Medium Severity

The new test test_states_to_outputs_completion_keeps_messages expects completion-mode prompt/completion to remain as message lists, but state_to_output in save_utils.py still flattens them to strings via _flatten_messages_content when message_type == "completion" and "input" in state. Since make_state sets both message_type="completion" (via kwargs) and input (explicitly), should_flatten_completion_output evaluates to True, causing the output to be a concatenated string — not the list the assertions expect.


# Fall back to OpenAI client for duck-typed clients (e.g., mocks, proxies)
if message_type == "completion":
return OAICompletionsClient(cast(AsyncOpenAI, client))
return OAIChatCompletionsClient(cast(AsyncOpenAI, client))

New client wrapper created on every resolve_client call

Low Severity

resolve_client allocates a new Client wrapper on every invocation even when the underlying SDK client is unchanged. In get_model_response, when a raw AsyncOpenAI/AsyncAnthropic is passed (non-None), a fresh wrapper is instantiated per call. Combined with init_state also calling resolve_client, multi-turn rollouts that pass explicit clients create redundant wrapper objects per turn.


# model/client
"endpoint_id",
"model",
"api_client_type",

EvalConfig.interleaved_thinking never applied to environment

Medium Severity

EvalConfig has an interleaved_thinking field (defaults to True) and Environment now has a set_interleaved_thinking() method, but run_evaluation in eval_utils.py never calls vf_env.set_interleaved_thinking(config.interleaved_thinking). The config value is silently ignored and the environment always uses its constructor default. Setting interleaved_thinking: false in an eval config has no effect.
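
The wiring fix looks like a one-liner; both identifiers below come from this PR, though the exact insertion point is an assumption:

```python
# In run_evaluation (eval_utils.py), after constructing the environment:
vf_env.set_interleaved_thinking(config.interleaved_thinking)
```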


if self.interleaved_thinking
else None,
thinking_blocks=thinking_blocks or None,
tool_calls=tool_calls or None,

Anthropic thinking_blocks kept when interleaved_thinking disabled

Low Severity

When interleaved_thinking is False, from_native_response correctly strips reasoning_content but still preserves thinking_blocks on the ResponseMessage. These blocks then round-trip through parse_response_message into future prompts via to_native_prompt's extract_thinking_blocks. If the intent of disabling interleaved thinking is to omit all thinking data from downstream processing, thinking_blocks leaks through.
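
If the intent is to strip all thinking data, the constructor arguments quoted above might instead gate both fields on the flag (a sketch, not the PR's actual fix; the variable names follow the quoted diff):

```python
reasoning_content=reasoning_content if self.interleaved_thinking else None,
thinking_blocks=(thinking_blocks or None) if self.interleaved_thinking else None,
tool_calls=tool_calls or None,
```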


@cursor (Bot) left a comment:

Cursor Bugbot has reviewed your changes and found 1 potential issue.


# Fall back to OpenAI client for duck-typed clients (e.g., mocks, proxies)
if message_type == "completion":
return OAICompletionsClient(cast(AsyncOpenAI, client))
return OAIChatCompletionsClient(cast(AsyncOpenAI, client))

Duck-typed clients silently lose interleaved rollouts

Medium Severity

The duck-typed fallback in resolve_client ignores interleaved_rollouts and always creates a regular OAIChatCompletionsClient. When the client object isn't an explicit AsyncOpenAI instance (e.g., subclass, mock, or proxy), the OAIChatCompletionsTokenClient path is never reached, silently disabling interleaved rollouts without any warning.
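
A sketch of a fallback that respects the flag (the identifiers come from the quoted snippet; the exact fix is an assumption):

```python
# Fall back to OpenAI clients for duck-typed clients (e.g., mocks, proxies),
# but still honor interleaved_rollouts so it is not silently disabled.
if message_type == "completion":
    return OAICompletionsClient(cast(AsyncOpenAI, client))
if interleaved_rollouts:
    return OAIChatCompletionsTokenClient(cast(AsyncOpenAI, client))
return OAIChatCompletionsClient(cast(AsyncOpenAI, client))
```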


@mikasenghaas (Member) commented:

merged in #897

