abstract client + support for interleaved thinking#860
Conversation
mikasenghaas left a comment
nice! i think the main open questions from my side are:
- where and how does client resolution happen? right now, because we have oai completion, oai chat completion, anthropic completion, and anthropic message, we basically choose a client based on "which client type was passed? -> oai or anthropic" and "which message type does the env have? -> str or dict". this fails once we introduce a client that does not support e.g. str prompts (see the sketch below this comment)
- also, i would personally be in favor of having the env dictate the message type. it is now the client's concern to map generic inputs -> native inputs and native outputs -> generic outputs. we could think about black- or whitelisting specific clients from the registry for an env (e.g. a tool env cannot use the oai completions client because it doesn't support tools), so we could show standard errors instead of api errors. but beyond that i don't think the env should care about this
- still not fully decided on which types to use (custom vs. extend oai). the former is more elegant and future-proof, but the latter is more backwards-compatible

also, i think we should not release this with the next release. there is already enough for people to adapt to, and if we were to go through with this as-is i would expect a lot of breaking changes (just from our custom type defs). we should probably test the typical user migration we expect on research-envs before releasing
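An illustrative sketch of the two-axis dispatch described in the first bullet; all names here are stand-ins for discussion, not the actual `resolve_client` implementation:

```python
from typing import Literal


class FakeAsyncOpenAI: ...      # stand-in for openai.AsyncOpenAI
class FakeAsyncAnthropic: ...   # stand-in for anthropic.AsyncAnthropic


def pick_client(raw: object, message_type: Literal["chat", "completion"]) -> str:
    # axis 1: which SDK client type was passed?
    if isinstance(raw, FakeAsyncAnthropic):
        # Anthropic has no str-prompt (completion) API surface, so a
        # "completion" env has no valid client here -- the failure mode above.
        return "AnthropicMessagesClient"
    # axis 2: which message type does the env use?
    if message_type == "completion":
        return "OAICompletionsClient"
    return "OAIChatCompletionsClient"
```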
```python
ClientType = Literal["openai", "anthropic"]
MessageType = Literal["chat", "completion"]
```
i know this is still from me lol -- but i'm not sure whether this distinction makes sense under the new client abstraction. what we have now:
a client which maps generic message types -> native message types (could be provider-specific, e.g. oai chat vs. anthropic message, or "message"-specific, e.g. oai completion (str) vs. oai chat (dict))
thus, having the client and message type distinct like this doesn't really make sense to me anymore. the main difficulty imo is that the message type is currently owned by the env (e.g. an env may be defined as a "completion" env), which (currently) implies that the completions api is used to get responses.
under the new client abstraction i would argue that this should not be allowed: an env defines rollout interactions in terms of generic types and does not dictate which client is used to get the response
```python
# ──────────────────────────────────────────────────────────────────────
# Shared message types (provider-agnostic)
# ──────────────────────────────────────────────────────────────────────
class TextMessage(CustomBaseModel):
```
ah, so we still define our own types? i thought we wanted to keep oai types and extend where necessary for now? if we do this, we should imo keep the naming as-is and just define a bunch of type aliases to oai types, so that if we decide to move away from them we don't break imports for people
that said, if @willccbb is fine with new types, i am also fine with it. imo we have to define those sooner or later anyway, so we might as well break it now. but i would expect quite a few edge cases with this for some time that we just have to accept and fix bit by bit
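A sketch of the alias approach suggested above: keep OpenAI types behind stable verifiers names so a later switch to custom types doesn't break imports. These exact aliases are illustrative, not the PR's code:

```python
from openai.types.chat import ChatCompletion, ChatCompletionMessageParam

ChatMessage = ChatCompletionMessageParam  # alias today, swappable later
ModelResponse = ChatCompletion            # downstream code imports only these names
```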
```python
prompt: Messages
completion: Messages
response: ModelResponse
prompt: Messages | str
```
the consistent thing would be to deprecate these, because the conversion from our generic message type to the str type that e.g. oai completions expects now lives in the client, so the core vf logic should not have to deal with it. this would break the (small?) set of completion envs
```python
    OAIChatCompletionsClient,
    OAIChatCompletionsTokenClient,
    OAICompletionsClient,
)
```
Missing documentation updates for new client and types
Low Severity
This PR introduces significant user-facing changes — a new Client abstraction with four implementations, replacement of oai_tools with tool_defs, new provider-agnostic message/response types, and an interleaved_thinking parameter — but includes no updates to any files under docs/. Per the project documentation rule, changes to core user-facing functionality must be accompanied by corresponding documentation updates.
Triggered by project rule: BugBot Instructions
```python
    return {k: v for k, v in sampling_args.items() if v is not None}
```

```python
sampling_args = normalize_sampling_args(sampling_args)
state = cast(State, kwargs.pop("state"))
```
Token client state kwarg required but not defensively handled
Low Severity
OAIChatCompletionsTokenClient.get_native_response unconditionally does kwargs.pop("state") which raises a raw KeyError if state is not provided. While the normal flow through Client.get_response always passes state, nothing in the type signature or contract enforces this. A missing state yields an unhelpful KeyError instead of a descriptive error.
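A sketch of one defensive variant: fail with a descriptive error instead of a raw KeyError when the contract is violated. The helper name is hypothetical:

```python
from typing import Any


def pop_required_state(kwargs: dict[str, Any]) -> Any:
    # Surface a descriptive error rather than a bare KeyError when `state`
    # was not threaded through by the framework.
    if "state" not in kwargs:
        raise ValueError(
            "get_native_response requires a 'state' kwarg; it is normally "
            "supplied by Client.get_response"
        )
    return kwargs.pop("state")
```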
```python
# Wrap raw client in vf.Client adapter
state["client"] = resolve_client(
    client, self.message_type, self.interleaved_rollouts
)
```
New resolve_client wrapping creates fresh wrapper each rollout
Low Severity
init_state calls resolve_client(client, ...) on every rollout, allocating a new Client wrapper object per rollout even though the underlying AsyncOpenAI/AsyncAnthropic client is the same. For high-throughput evaluation with thousands of rollouts, this creates unnecessary object churn. The wrapper could be cached or created once per generate call.
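One possible mitigation (a sketch, not the PR's code): memoize adapters per underlying SDK client so repeated rollouts reuse one wrapper. This assumes the PR's `resolve_client` factory is importable; the cache wrapper itself is hypothetical, and entries are held weakly so cached wrappers don't keep dead clients alive:

```python
import weakref

_wrappers: "weakref.WeakKeyDictionary[object, dict]" = weakref.WeakKeyDictionary()


def resolve_client_cached(client, message_type, interleaved_rollouts):
    # One sub-dict per SDK client, keyed by the resolution parameters, since
    # the same client can legitimately map to different adapters.
    per_client = _wrappers.setdefault(client, {})
    key = (message_type, interleaved_rollouts)
    if key not in per_client:
        per_client[key] = resolve_client(client, message_type, interleaved_rollouts)
    return per_client[key]
```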
```python
    messages=prompt,
    **normalize_sampling_args(sampling_args),
)
return response
```
OAI chat client silently swallows extra kwargs
Low Severity
OAIChatCompletionsClient.get_native_response accepts **kwargs (which includes state passed by the framework) but silently ignores them, while AnthropicMessagesClient.get_native_response explicitly pops state before forwarding remaining kwargs to the API. This inconsistency is fragile — anyone subclassing OAIChatCompletionsClient who adds **kwargs passthrough to the API call would inadvertently send state to the provider.
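A sketch of the consistent variant (approximate method signature, assumed here): pop framework-internal kwargs before forwarding the rest, mirroring AnthropicMessagesClient, so a subclass that forwards `**kwargs` can never leak `state` to the provider:

```python
async def get_native_response(self, prompt, sampling_args, **kwargs):
    # `state` is framework-internal; strip it before anything reaches the API.
    kwargs.pop("state", None)
    return await self.client.chat.completions.create(
        messages=prompt,
        **normalize_sampling_args(sampling_args),
        **kwargs,
    )
```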
```python
]
assert result[0]["completion"] == [
    {"role": "assistant", "content": "Final DONE"},
]
```
Completion serialization flattens messages despite new test expecting lists
Medium Severity
The new test test_states_to_outputs_completion_keeps_messages expects completion-mode prompt/completion to remain as message lists, but state_to_output in save_utils.py still flattens them to strings via _flatten_messages_content when message_type == "completion" and "input" in state. Since make_state sets both message_type="completion" (via kwargs) and input (explicitly), should_flatten_completion_output evaluates to True, causing the output to be a concatenated string — not the list the assertions expect.
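A reconstructed sketch of the condition Bugbot describes; the real helper lives in save_utils.py and may differ:

```python
# Reconstructed sketch, not the actual save_utils.py code:
def should_flatten_completion_output(state: dict) -> bool:
    return state.get("message_type") == "completion" and "input" in state

# make_state in the new test sets both fields, so this returns True and the
# completion is flattened to a string, contradicting the list assertions.
```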
```python
# Fall back to OpenAI client for duck-typed clients (e.g., mocks, proxies)
if message_type == "completion":
    return OAICompletionsClient(cast(AsyncOpenAI, client))
return OAIChatCompletionsClient(cast(AsyncOpenAI, client))
```
New client wrapper created on every resolve_client call
Low Severity
resolve_client allocates a new Client wrapper on every invocation even when the underlying SDK client is unchanged. In get_model_response, when a raw AsyncOpenAI/AsyncAnthropic is passed (non-None), a fresh wrapper is instantiated per call. Combined with init_state also calling resolve_client, multi-turn rollouts that pass explicit clients create redundant wrapper objects per turn.
```python
# model/client
"endpoint_id",
"model",
"api_client_type",
```
EvalConfig.interleaved_thinking never applied to environment
Medium Severity
EvalConfig has an interleaved_thinking field (defaults to True) and Environment now has a set_interleaved_thinking() method, but run_evaluation in eval_utils.py never calls vf_env.set_interleaved_thinking(config.interleaved_thinking). The config value is silently ignored and the environment always uses its constructor default. Setting interleaved_thinking: false in an eval config has no effect.
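A sketch of the missing wiring in run_evaluation (eval_utils.py); the surrounding environment-construction code is assumed, while the method and config field come from the PR:

```python
# After the environment is constructed in run_evaluation:
vf_env.set_interleaved_thinking(config.interleaved_thinking)
```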
```python
    if self.interleaved_thinking
    else None,
    thinking_blocks=thinking_blocks or None,
    tool_calls=tool_calls or None,
```
Anthropic thinking_blocks kept when interleaved_thinking disabled
Low Severity
When interleaved_thinking is False, from_native_response correctly strips reasoning_content but still preserves thinking_blocks on the ResponseMessage. These blocks then round-trip through parse_response_message into future prompts via to_native_prompt's extract_thinking_blocks. If the intent of disabling interleaved thinking is to omit all thinking data from downstream processing, thinking_blocks leaks through.
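A sketch of the symmetric fix: gate both reasoning fields on the flag so no thinking data survives when interleaving is disabled. The local variable names are assumed from the diff context:

```python
message = ResponseMessage(
    role="assistant",
    content=content,
    # Gate BOTH fields, not just reasoning_content, so disabled interleaving
    # strips all thinking data from downstream prompts.
    reasoning_content=reasoning_content if self.interleaved_thinking else None,
    thinking_blocks=(thinking_blocks or None) if self.interleaved_thinking else None,
    tool_calls=tool_calls or None,
)
```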
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
# Fall back to OpenAI client for duck-typed clients (e.g., mocks, proxies)
if message_type == "completion":
    return OAICompletionsClient(cast(AsyncOpenAI, client))
return OAIChatCompletionsClient(cast(AsyncOpenAI, client))
```
Duck-typed clients silently lose interleaved rollouts
Medium Severity
The duck-typed fallback in resolve_client ignores interleaved_rollouts and always creates a regular OAIChatCompletionsClient. When the client object isn't an explicit AsyncOpenAI instance (e.g., subclass, mock, or proxy), the OAIChatCompletionsTokenClient path is never reached, silently disabling interleaved rollouts without any warning.
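A sketch of one fix: honor `interleaved_rollouts` in the duck-typed fallback as well. The client classes are from the PR; the added branch is the suggested change, not existing code:

```python
if message_type == "completion":
    return OAICompletionsClient(cast(AsyncOpenAI, client))
if interleaved_rollouts:
    # Route duck-typed clients to the token-capable adapter too, instead of
    # silently disabling interleaved rollouts.
    return OAIChatCompletionsTokenClient(cast(AsyncOpenAI, client))
return OAIChatCompletionsClient(cast(AsyncOpenAI, client))
```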
merged in #897


Description
Based off of PR #788, which added provider-agnostic types for the message/client/tool/response flow.
General questions:
- Some envs type prompt/completion as `str` (like continuation-quality); `RolloutOutput` still types these as messages, so Pydantic warns. We could widen these to `Messages | str` (should we do this?).

Other notes
- `tool_defs` is canonical: tools are provider-agnostic `Tool` objects, and `oai_tools` is accepted only at boundaries, normalized once at init.
- `ChatMessage`, `ChatMessages`, `ModelResponse` remain as transition shims.
- `MultiTurnEnv`

Also fixed
- Endpoint resolution now preserves `client_type`, so `-e configs/endpoints.py -m haiku` resolves to Anthropic correctly.

Client Interface
`vf.Client` is the adapter layer that wraps a native SDK client and standardizes everything into a `vf.Response`.

Each client implementation defines four core methods:

- `to_native_prompt`: convert `vf.Messages` into the provider-native prompt format.
- `get_native_response`: execute the provider-native API call.
- `raise_from_native_response` (optional): map/raise provider-specific errors (for example, overlong prompt behavior).
- `from_native_response`: convert provider-native output into a unified `vf.Response`.

We intentionally moved to custom unified types (instead of continuing to normalize to OpenAI-only types) because some provider features do not map cleanly to OAI schemas. The current type system is provider-agnostic and supports multimodal/reasoning/tool patterns across clients.
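A minimal sketch of the adapter shape described above; the four method names come from the PR text, while the signatures and the `get_response` orchestration are illustrative assumptions:

```python
from abc import ABC, abstractmethod
from typing import Any


class Client(ABC):
    @abstractmethod
    def to_native_prompt(self, prompt: Any) -> Any:
        """Convert vf.Messages into the provider-native prompt format."""

    @abstractmethod
    async def get_native_response(self, native_prompt: Any, **kwargs: Any) -> Any:
        """Execute the provider-native API call."""

    def raise_from_native_response(self, native: Any) -> None:
        """Optional hook: map/raise provider-specific errors (e.g. overlong prompts)."""

    @abstractmethod
    def from_native_response(self, native: Any) -> Any:
        """Convert provider-native output into a unified vf.Response."""

    async def get_response(self, prompt: Any, **kwargs: Any) -> Any:
        # Assumed orchestration: convert in, call the provider, map errors,
        # convert out.
        native = await self.get_native_response(self.to_native_prompt(prompt), **kwargs)
        self.raise_from_native_response(native)
        return self.from_native_response(native)
```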
Implemented Clients
- `OAICompletionsClient`
- `OAIChatCompletionsClient`
- `OAIChatCompletionsTokenClient`
- `AnthropicMessagesClient`

(Anthropic raw completions are not supported in the current implementation; I think they are deprecated.)
This architecture is meant to be extensible to additional providers/API surfaces (including future OAI Responses-style adapters).
Backward Compatibility
- `tool_defs` is now canonical internally; legacy `oai_tools` is accepted at boundaries and normalized at init.
- We keep aliases (`ChatMessage`, `ChatMessages`, `ModelResponse`) as migration shims for non-migrated experimental/integration paths.
- `CustomBaseModel` keeps dict-like access to avoid breaking existing env code that still uses mapping-style message access (see the sketch below).
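An assumed mechanism for that dict-like access; the real `CustomBaseModel` may differ:

```python
from pydantic import BaseModel


class CustomBaseModel(BaseModel):
    def __getitem__(self, key: str):
        return getattr(self, key)           # msg["content"] keeps working

    def get(self, key: str, default=None):
        return getattr(self, key, default)  # msg.get("tool_calls") too
```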
Validation Runs

Tested:
- `uv run vf-eval continuation-quality -n1 -r1 -d -v`
- `uv run vf-eval gsm8k -n1 -r1 -d -v`
- `uv run vf-eval gsm8k -n1 -r1 -d -v -m haiku -e configs/endpoints.py`
- `uv run vf-eval alphabet-sort -n1 -r1 -d -v -m haiku -e configs/endpoints.py`
- `uv run vf-eval wiki-search -n1 -r1 -d -v -e configs/endpoints.py`
- `uv run vf-eval math-python -n1 -r1 -v -d -m deepseek-reasoner -e configs/endpoints.py`

Not Tested Yet:
- `uv run vf-eval alphabet-sort -n1 -r1 -v -d -m Qwen/Qwen3-4B-Instruct-2507 -b http://localhost:8000/v1`
- `uv run vf-eval alphabet-sort -n1 -r1 -v -d -m Qwen/Qwen3-4B-Instruct-2507 -b http://localhost:8000/v1 -x '{"interleaved_rollouts": true}'`

Type of Change
Testing
- `uv run pytest` locally.

Checklist
Additional Notes
Note
High Risk
Touches core request/response plumbing (client selection, tool handling, and rollout state shape) and introduces a new provider adapter, so regressions could affect all environments and model backends despite added test coverage.
Overview
This PR replaces direct SDK calls inside `Environment.get_model_response` with a provider-agnostic `vf.Client` adapter that converts `Messages`/`Tool` definitions to native provider requests and normalizes outputs into a unified `vf.Response` (including usage, tool calls, and optional reasoning content). It adds first-class Anthropic support via `AnthropicMessagesClient`, plus an OpenAI token-capable client (`OAIChatCompletionsTokenClient`) to support interleaved rollouts/thinking.

Tools are migrated from legacy `oai_tools` (OpenAI schema) to canonical `tool_defs` (`vf.Tool`), with strict validation and boundary normalization (including intercepted agent tool schemas and MCP tool conversion). Prompt rendering in OpenEnv environments now normalizes raw messages into typed `Messages`, and completion-mode env/tests are updated to treat stored prompts/completions as message lists.

Endpoint configuration is expanded and normalized to preserve an explicit provider `type`/`api_client_type` (e.g., Anthropic + DeepSeek variants), `pyproject.toml` adds the `anthropic` dependency, and extensive new tests cover client auth/overlong-prompt behavior, multimodal content parts, interception serialization, and endpoint client-type resolution.

Written by Cursor Bugbot for commit 476fb12. This will update automatically on new commits.