-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Summary
- Context: The
OpenAiProviderRoutingServicemanages provider selection and health for OpenAI-compatible LLM calls, including a backoff mechanism for the primary provider when transient failures occur. - Bug: The service treats
InterruptedExceptionand allOpenAIIoExceptioninstances as eligible for primary-provider backoff. In a Reactor-based application, these exceptions are the standard way blocking IO operations signal that a request has been cancelled by the user. - Actual vs. expected: When a user cancels a request, the thread is interrupted, causing an
InterruptedExceptionor anOpenAIIoException. Currently, this disables the primary provider for all users for 10 minutes (default). Expected behavior is to ignore cancellations when determining provider health. - Impact: A single user cancelling a request can disable the primary LLM provider for all users for 10 minutes. This leads to unnecessary service degradation or total unavailability if only one provider is configured.
Code with bug
boolean shouldBackoffPrimary(Throwable throwable) {
if (isRateLimit(throwable)) {
return true;
}
if (throwable instanceof OpenAIIoException) { // <-- BUG 🔴 [Treats all IO errors, including cancellation-induced ones, as failures]
return true;
}
if (throwable instanceof InterruptedException) { // <-- BUG 🔴 [Treats thread interruption as provider failure]
Thread.currentThread().interrupt();
return true;
}Evidence
1. Logic Verification
In OpenAIStreamingService.java, requests are executed on the boundedElastic scheduler. When a request is cancelled by the client, Reactor interrupts the executing thread. The OpenAI SDK (using OkHttp) responds by throwing an InterruptedIOException wrapped in an OpenAIIoException.
2. Unit Test Reproduction
The following test case (added to a temporary test file) confirms that the current implementation of shouldBackoffPrimary returns true for interruption-related exceptions, which immediately triggers a 10-minute backoff in recordProviderFailure.
@Test
void shouldNotBackoffOnInterruptedIoException() {
RateLimitService rateLimitService = mock(RateLimitService.class);
OpenAiProviderRoutingService routingService = new OpenAiProviderRoutingService(rateLimitService, 600, "github_models");
// Simulate an interruption-caused IO exception (standard on cancellation)
InterruptedIOException interruptedIo = new InterruptedIOException("interrupted");
OpenAIIoException openAiIoException = new OpenAIIoException("io", interruptedIo);
assertTrue(routingService.shouldBackoffPrimary(openAiIoException)); // Returns true, causing backoff
}Why has this bug gone undetected?
In testing environments, user cancellations are rare. In production, if multiple providers are configured, the system silently falls back to the secondary provider, hiding the fact that the primary was unnecessarily disabled. The 10-minute duration is long enough to cause real degradation but short enough that it might be mistaken for "intermittent network issues".
Recommended fix
Update shouldBackoffPrimary to explicitly exclude interruptions and InterruptedIOException:
boolean shouldBackoffPrimary(Throwable throwable) {
if (throwable instanceof InterruptedException ||
throwable.getCause() instanceof InterruptedException ||
throwable.getCause() instanceof java.io.InterruptedIOException) {
if (throwable instanceof InterruptedException) {
Thread.currentThread().interrupt();
}
return false; // <-- FIX 🟢 [Ignore cancellations when deciding to back off]
}
// ...