-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Summary
- Context: The
RateLimitServicecoordinates provider availability using persistent state inRateLimitStateand in-memory state inProviderCircuitStateto handle API rate limits (HTTP 429). - Bug:
RateLimitStateapplies an extremely aggressive exponential backoff usingDuration.ofHoursstarting from the second consecutive failure, whileRateLimitHeaderParserincorrectly picks the minimum reset time when multiple headers (requests vs. tokens) are provided. - Actual vs. expected: A second consecutive rate limit hit causes a 2-hour lockout (plus reset time), whereas it should apply a much smaller backoff (e.g., seconds or minutes); additionally, the parser picks the shortest wait time among multiple headers instead of the longest.
- Impact: The system suffers from significant availability issues, where a provider can be disabled for hours after just two normal rate-limit events, and premature retries caused by the parser bug increase the likelihood of triggering this severe lockout.
Code with bug
1. Excessive Backoff in RateLimitState.java
// src/main/java/com/williamcallahan/javachat/service/RateLimitState.java
if (failures > 1) {
Duration additionalBackoff = Duration.ofHours((long) Math.pow(2, failures - 1)); // <-- BUG 🔴 [Uses hours instead of seconds/minutes; 2nd failure = 2hr lockout]
Duration maxBackoff = Duration.ofDays(7); // Never back off more than a week
if (additionalBackoff.compareTo(maxBackoff) > 0) {
additionalBackoff = maxBackoff;
}
state.rateLimitedUntil = state.rateLimitedUntil.plus(additionalBackoff);2. Incorrect Minimum Reset Selection in RateLimitHeaderParser.java
// src/main/java/com/williamcallahan/javachat/service/RateLimitHeaderParser.java
long candidateSeconds = minPositive( // <-- BUG 🔴 [Should use max() to ensure all limits have reset before retrying]
firstHeaderValue(headers, "x-ratelimit-reset-requests")
.map(this::parseDurationSeconds)
.orElse(0L),
firstHeaderValue(headers, "x-ratelimit-reset-tokens")
.map(this::parseDurationSeconds)
.orElse(0L),
// ...Evidence
Reproduction: Excessive Backoff
I created a test ExcessiveBackoffTest.java that records two consecutive failures and checks the remaining wait time. The test confirms that after the second failure, the lockout duration is over 2 hours.
@Test
void recordRateLimit_appliesExcessiveBackoffOnSecondFailure() {
Instant now = Instant.now();
// First failure sets state.consecutiveFailures to 1
rateLimitState.recordRateLimit(PROVIDER_NAME, now.plusSeconds(1), "1m");
// Second failure sets state.consecutiveFailures to 2, triggering backoff
rateLimitState.recordRateLimit(PROVIDER_NAME, now.plusSeconds(1), "1m");
Duration waitTime2 = rateLimitState.getRemainingWaitTime(PROVIDER_NAME);
// Actual: ~7201s (2 hours + 1s)
assertTrue(waitTime2.toHours() >= 2);
}Reproduction: Header Parser Bug
I created RateLimitHeaderParserBugTest.java to verify the reset time selection. When provided with a 1s request reset and a 60s token reset, the parser incorrectly returned a 1s reset.
@Test
void parseResetInstant_shouldPickMaxResetTime() {
Headers headers = Headers.builder()
.put("x-ratelimit-reset-requests", "1s")
.put("x-ratelimit-reset-tokens", "60s")
.build();
Optional<Instant> resetInstant = parser.parseResetInstant(headers);
long secondsUntilReset = Duration.between(Instant.now(), resetInstant.get()).getSeconds();
// Actual: 1s (due to minPositive)
// Expected: 60s
assertTrue(secondsUntilReset < 10);
}Why has this bug gone undetected?
These bugs have likely gone undetected because 429 errors are infrequent under low load or with higher-tier API keys. Furthermore, many users might assume a long lockout is a strict policy from the API provider (OpenAI/GitHub) rather than a bug in the local throttling logic. The interplay between the two bugs (one causing premature retries and the other penalizing them heavily) creates a "death spiral" that only becomes apparent when a provider's limits are actually reached.
Recommended fix
- In
RateLimitState.java, changeDuration.ofHoursto a more appropriate unit such asDuration.ofSecondsorDuration.ofMinutes, matching the in-memory backoff logic inProviderCircuitState. - In
RateLimitHeaderParser.java, replace theminPositivehelper with amaxPositivelogic to ensure the system waits until ALL relevant rate limits have reset before attempting a new request.