fix: improve 429 retry handling with exponential backoff #2524

BrendanWalsh wants to merge 1 commit into master
Conversation
Hey @BrendanWalsh 👋! We use semantic commit messages to streamline the release process. Examples of commit messages with semantic prefixes:

To test your commit locally, please follow our guide on building from source.
Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Snapshot Warnings

Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files: None
Pull request overview
This PR updates SynapseML’s HTTP retry helper used by cognitive service wrappers to reduce flakiness from 429 (Too Many Requests) responses by moving from a fixed-delay strategy toward an exponential backoff approach.
Changes:
- Add capped exponential backoff state for 429 responses in `HandlingUtils.sendWithRetries`.
- Remove the prior sleep that occurred inside the 429 `Retry-After` logging block to avoid double-sleep behavior.
- Extend `sendWithRetries` to carry 429 backoff state across recursive retries.
Comments suppressed due to low confidence (3)
core/src/main/scala/com/microsoft/azure/synapse/ml/io/http/HTTPClients.scala:149
- The new 429 backoff logic doesn't add jitter or log the computed retry delay/attempt, but the PR description says it should use exponential backoff + random jitter and add informational retry logging. Consider adding a small random jitter to `nextBackoff` and logging the attempt number and final delay to reduce synchronized retries (thundering herd) and improve diagnosability.
```scala
  // No Retry-After header; use exponential backoff
  val current = math.max(backoff429Ms, retriesLeft.head.toLong)
  math.min(current * 2, MaxBackoffMs)
}
Thread.sleep(nextBackoff)
```
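A jittered variant of this computation could look like the sketch below. The helper name and the "equal jitter" strategy are illustrative, not part of the PR; only the 60s cap comes from the commit message.

```scala
import scala.util.Random

// Cap mirrored from the snippet above; 60s matches the commit message.
val MaxBackoffMs: Long = 60000L

// Hypothetical helper (not in the PR): doubles the previous delay up to
// the cap, then picks a random duration in [half, full] of that value so
// parallel workers don't retry in lockstep ("equal jitter").
def nextBackoffWithJitter(prevBackoffMs: Long, initialMs: Long, rng: Random): Long = {
  val doubled = math.min(math.max(prevBackoffMs, initialMs) * 2, MaxBackoffMs)
  doubled / 2 + (rng.nextDouble() * (doubled / 2)).toLong
}
```

With a previous delay of 1s this yields a value between 1s and 2s, so concurrent clients spread out instead of all waking at the same instant.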
core/src/main/scala/com/microsoft/azure/synapse/ml/io/http/HTTPClients.scala:151
- For 429 responses, the recursive call uses the same `retriesLeft` array (not `.tail`), which makes the retry loop effectively unbounded. Since this recursion sits inside a `try` block, it won't be tail-call optimized and can overflow the stack if the service keeps returning 429 for long enough. Consider consuming retry budget (e.g., decrement after each 429) or adding an explicit max-attempt / max-elapsed-time guard for 429 retries.
```scala
Thread.sleep(nextBackoff)
sendWithRetries(client, request, retriesLeft, extraCodesToRetry, nextBackoff)
} else {
```
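A stack-safe, bounded variant of that loop might look like the following sketch. The names (`retry429`, `attempt`, `maxAttempts`, `doRequest`) are assumptions for illustration; the key points are that the recursive call sits outside the `try` block, so `@tailrec` can verify it compiles to a loop, and that an attempt counter bounds the retries.

```scala
import scala.annotation.tailrec

// Sketch of a bounded, stack-safe 429 retry loop. Keeping the recursive
// call in tail position (outside try/catch) lets @tailrec guarantee the
// compiler turns it into a loop, so it cannot overflow the stack.
@tailrec
def retry429(attempt: Int, maxAttempts: Int, backoffMs: Long, doRequest: () => Int): Option[Int] = {
  val code = try doRequest() catch { case _: java.io.IOException => 429 }
  if (code != 429) Some(code)
  else if (attempt >= maxAttempts) None // retry budget exhausted
  else {
    Thread.sleep(backoffMs)
    retry429(attempt + 1, maxAttempts, math.min(backoffMs * 2, 60000L), doRequest)
  }
}
```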
core/src/main/scala/com/microsoft/azure/synapse/ml/io/http/HTTPClients.scala:160
- On `IOException`, the retry path delegates to `keepTrying(...)`, which calls `sendWithRetries(client, request, retriesLeft.tail)` without forwarding `extraCodesToRetry` (and now also drops `backoff429Ms`). That means callers relying on `extraCodesToRetry` (e.g., retrying 404) may silently stop retrying those codes after a transient connection error. Consider threading these parameters through `keepTrying`/`sendWithRetries` so retry behavior is consistent across error types.
```scala
} catch {
  case e: java.io.IOException =>
    logError("Encountering a connection error", e)
    keepTrying(client, request, retriesLeft, e)
}
```
```scala
val hasRetryAfter = code == 429 && response.getFirstHeader("Retry-After") != null
val retryAfterMs = if (hasRetryAfter) {
  response.getFirstHeader("Retry-After").getValue.toLong * 1000
} else 0L
```
`Retry-After` is parsed with `.toLong`, but that header is allowed to be an HTTP date (not just delta-seconds). A non-numeric value will throw `NumberFormatException` here and bypass the `IOException` catch, ultimately causing `advanced` to return `null`. Consider parsing with `Try` (and optionally supporting the HTTP-date format), falling back to exponential backoff when parsing fails.
This issue also appears in the following locations of the same file:
- line 145
- line 149
- line 156
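A lenient parser along the lines the comment suggests could be sketched like this. `parseRetryAfterMs` is an illustrative name, not SynapseML API; it tries delta-seconds first, then the RFC 1123 HTTP-date form, and returns `None` so the caller can fall back to exponential backoff.

```scala
import scala.util.Try
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter

// Hypothetical helper: parse Retry-After as delta-seconds or HTTP-date,
// returning None (instead of throwing) when the value is unusable.
def parseRetryAfterMs(value: String, nowMs: Long): Option[Long] = {
  val deltaSeconds = Try(value.trim.toLong * 1000L).toOption
  def httpDate: Option[Long] = Try {
    val when = ZonedDateTime.parse(value.trim, DateTimeFormatter.RFC_1123_DATE_TIME)
    when.toInstant.toEpochMilli - nowMs
  }.toOption
  deltaSeconds.orElse(httpDate).filter(_ >= 0L)
}
```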
Force-pushed from 6c096a6 to 7b9f912 (Compare)
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master    #2524      +/-   ##
==========================================
- Coverage   84.61%   84.57%   -0.04%
==========================================
  Files         335      335
  Lines       17708    17717       +9
  Branches     1612     1624      +12
==========================================
+ Hits        14984    14985       +1
- Misses       2724     2732       +8
```

☔ View full report in Codecov by Sentry.
- Add exponential backoff for 429 responses without Retry-After header: starts at initial delay, doubles each retry, capped at 60s
- Fix double-sleep bug: when the Retry-After header is present, the code was sleeping both in the case match and in the retry logic. Now captures retryAfterMs before response.close() and uses it as the sole backoff delay
- 429 retries remain unlimited (rate limiting is transient)
- Add scalastyle:off for method length (sendWithRetries is necessarily long)
Force-pushed from 7b9f912 to b9d8bc0 (Compare)
Summary
Fixes intermittent 429 (Too Many Requests) failures in cognitive services tests by implementing proper exponential backoff with jitter.
Changes
Motivation
OpenAI and Azure Cognitive Services endpoints return 429 when rate-limited. The previous fixed-delay retry strategy caused test jobs to pile up requests, extending some jobs from ~11m to ~50m. Exponential backoff reduces contention and lets jobs complete faster.
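To see why doubling helps, the delay schedule can be computed directly. This sketch assumes a 1s initial delay (the actual initial value is not stated in the PR) and the 60s cap from the commit message:

```scala
// Illustration of capped exponential growth: each retry doubles the
// previous delay until the cap is reached.
def backoffSchedule(initialMs: Long, capMs: Long, attempts: Int): Seq[Long] =
  Iterator.iterate(initialMs)(d => math.min(d * 2, capMs)).take(attempts).toSeq
```

Eight attempts at a 1s start give delays of 1s, 2s, 4s, 8s, 16s, 32s, then 60s repeated, so a rate-limited job backs off quickly instead of hammering the endpoint at a fixed interval.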
Testing
Split from #2506 for independent review.