Skip to content

[Detail Bug] Primary provider backoff incorrectly triggered by cancellations (Interrupted/OpenAIIoException) #38

@detail-app

Description

@detail-app

Summary

  • Context: The OpenAiProviderRoutingService manages provider selection and health for OpenAI-compatible LLM calls, including a backoff mechanism for the primary provider when transient failures occur.
  • Bug: The service treats InterruptedException and all OpenAIIoException instances as eligible for primary-provider backoff. In a Reactor-based application, these exceptions are the standard way blocking IO operations signal that a request has been cancelled by the user.
  • Actual vs. expected: When a user cancels a request, the thread is interrupted, causing an InterruptedException or an OpenAIIoException. Currently, this disables the primary provider for all users for 10 minutes (default). Expected behavior is to ignore cancellations when determining provider health.
  • Impact: A single user cancelling a request can disable the primary LLM provider for all users for 10 minutes. This leads to unnecessary service degradation or total unavailability if only one provider is configured.

Code with bug

    boolean shouldBackoffPrimary(Throwable throwable) {
        if (isRateLimit(throwable)) {
            return true;
        }
        if (throwable instanceof OpenAIIoException) { // <-- BUG 🔴 [Treats all IO errors, including cancellation-induced ones, as failures]
            return true;
        }
        if (throwable instanceof InterruptedException) { // <-- BUG 🔴 [Treats thread interruption as provider failure]
            Thread.currentThread().interrupt();
            return true;
        }

Evidence

1. Logic Verification

In OpenAIStreamingService.java, requests are executed on the boundedElastic scheduler. When a request is cancelled by the client, Reactor interrupts the executing thread. The OpenAI SDK (using OkHttp) responds by throwing an InterruptedIOException wrapped in an OpenAIIoException.

2. Unit Test Reproduction

The following test case (added to a temporary test file) confirms that the current implementation of shouldBackoffPrimary returns true for interruption-related exceptions, which immediately triggers a 10-minute backoff in recordProviderFailure.

    @Test
    void shouldNotBackoffOnInterruptedIoException() {
        RateLimitService rateLimitService = mock(RateLimitService.class);
        OpenAiProviderRoutingService routingService = new OpenAiProviderRoutingService(rateLimitService, 600, "github_models");

        // Simulate an interruption-caused IO exception (standard on cancellation)
        InterruptedIOException interruptedIo = new InterruptedIOException("interrupted");
        OpenAIIoException openAiIoException = new OpenAIIoException("io", interruptedIo);

        assertTrue(routingService.shouldBackoffPrimary(openAiIoException)); // Returns true, causing backoff
    }

Why has this bug gone undetected?

In testing environments, user cancellations are rare. In production, if multiple providers are configured, the system silently falls back to the secondary provider, hiding the fact that the primary was unnecessarily disabled. The 10-minute duration is long enough to cause real degradation but short enough that it might be mistaken for "intermittent network issues".

Recommended fix

Update shouldBackoffPrimary to explicitly exclude interruptions and InterruptedIOException:

    boolean shouldBackoffPrimary(Throwable throwable) {
        if (throwable instanceof InterruptedException || 
            throwable.getCause() instanceof InterruptedException ||
            throwable.getCause() instanceof java.io.InterruptedIOException) {
            if (throwable instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            return false; // <-- FIX 🟢 [Ignore cancellations when deciding to back off]
        }
        // ...

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingdetail

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions