
fix: improve 429 retry handling with exponential backoff#2524

Open
BrendanWalsh wants to merge 1 commit into master from brwals/fix-429-retry

Conversation

@BrendanWalsh
Collaborator

Summary

Fixes intermittent 429 (Too Many Requests) failures in cognitive services tests by implementing proper exponential backoff with jitter.

Changes

  • HTTPClients.scala: Replace fixed retry delay with exponential backoff (base 1s, max 60s) plus random jitter (0-1s) for 429 responses
  • Add informational logging for retry attempts with delay duration
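The delay computation described above can be sketched as follows. This is a hedged illustration only: the object and method names (`BackoffSketch`, `delayFor`) are hypothetical and not the actual identifiers in HTTPClients.scala.

```scala
import scala.util.Random

// Hedged sketch of the backoff described above: exponential growth from a
// 1s base, doubling per attempt, capped at 60s, plus 0-1s of random jitter.
// Names (BackoffSketch, delayFor) are hypothetical, not from HTTPClients.scala.
object BackoffSketch {
  val BaseDelayMs: Long = 1000L
  val MaxDelayMs: Long = 60000L

  def delayFor(attempt: Int, rng: Random = new Random()): Long = {
    val exponential = BaseDelayMs << math.min(attempt, 16) // 2^attempt growth, shift-capped
    math.min(exponential, MaxDelayMs) + rng.nextInt(1000)  // cap at 60s, add 0-999ms jitter
  }
}
```

The jitter spreads out retries from concurrent workers so they do not all hit the rate limiter again at the same instant.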

Motivation

OpenAI and Azure Cognitive Services endpoints return 429 when rate-limited. The previous fixed-delay retry strategy caused test jobs to pile up requests, extending some jobs from ~11m to ~50m. Exponential backoff reduces contention and lets jobs complete faster.

Testing

  • No behavioral change for non-429 responses
  • Validated in container CI builds where all cognitive service tests pass with this fix

Split from #2506 for independent review.

Copilot AI review requested due to automatic review settings March 27, 2026 05:26
@github-actions

Hey @BrendanWalsh 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guide on building from source.
Check out the developer guide for additional guidance on testing your change.

@github-actions

github-actions bot commented Mar 27, 2026

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA b9d8bc0.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

Contributor

Copilot AI left a comment


Pull request overview

This PR updates SynapseML’s HTTP retry helper used by cognitive service wrappers to reduce flakiness from 429 (Too Many Requests) responses by moving from a fixed-delay strategy toward an exponential backoff approach.

Changes:

  • Add capped exponential backoff state for 429 responses in HandlingUtils.sendWithRetries.
  • Remove the prior sleep that occurred inside the 429 Retry-After logging block to avoid double-sleep behavior.
  • Extend sendWithRetries to carry 429 backoff state across recursive retries.
Comments suppressed due to low confidence (3)

core/src/main/scala/com/microsoft/azure/synapse/ml/io/http/HTTPClients.scala:149

  • The new 429 backoff logic doesn't add jitter or log the computed retry delay/attempt, but the PR description says it should use exponential backoff + random jitter and add informational retry logging. Consider adding a small random jitter to nextBackoff and logging the attempt number and final delay to reduce synchronized retries (thundering herd) and improve diagnosability.
            // No Retry-After header; use exponential backoff
            val current = math.max(backoff429Ms, retriesLeft.head.toLong)
            math.min(current * 2, MaxBackoffMs)
          }
          Thread.sleep(nextBackoff)

core/src/main/scala/com/microsoft/azure/synapse/ml/io/http/HTTPClients.scala:151

  • For 429 responses, the recursive call uses the same retriesLeft array (not .tail), which makes the retry loop effectively unbounded. Since this recursion sits inside a try block, it won't be tail-call optimized and can overflow the stack if the service keeps returning 429 for long enough. Consider consuming retry budget (e.g., decrement after each 429) or adding an explicit max-attempt / max-elapsed-time guard for 429 retries.
          Thread.sleep(nextBackoff)
          sendWithRetries(client, request, retriesLeft, extraCodesToRetry, nextBackoff)
        } else {
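One way to implement the bounded-retry suggestion above can be sketched with a hypothetical signature; the real sendWithRetries takes an HTTP client and request, whereas this version takes a send thunk returning a status code so the termination logic stands alone.

```scala
// Hypothetical sketch of consuming a retry budget on each 429 so the retry
// loop terminates even if the service keeps rate-limiting. The real
// sendWithRetries takes an HTTP client and request rather than a thunk.
object RetryBudgetSketch {
  @annotation.tailrec
  def sendWith429Budget(attemptsLeft: Int, backoffMs: Long, maxBackoffMs: Long)
                       (send: () => Int): Boolean =
    send() match {
      case 429 if attemptsLeft > 0 =>
        Thread.sleep(backoffMs) // capped exponential backoff between attempts
        sendWith429Budget(attemptsLeft - 1,
                          math.min(backoffMs * 2, maxBackoffMs), maxBackoffMs)(send)
      case code =>
        code < 400 // success iff the final status is non-error
    }
}
```

Because the recursive call is in tail position, `@annotation.tailrec` guarantees constant stack usage, addressing the overflow risk noted above.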

core/src/main/scala/com/microsoft/azure/synapse/ml/io/http/HTTPClients.scala:160

  • On IOException, the retry path delegates to keepTrying(...), which calls sendWithRetries(client, request, retriesLeft.tail) without forwarding extraCodesToRetry (and now also drops backoff429Ms). That means callers relying on extraCodesToRetry (e.g., retrying 404) may silently stop retrying those codes after a transient connection error. Consider threading these parameters through keepTrying / sendWithRetries so retry behavior is consistent across error types.
    } catch {
      case e: java.io.IOException =>
        logError("Encountering a connection error", e)
        keepTrying(client, request, retriesLeft, e)
    }

Comment on lines +135 to +138
val hasRetryAfter = code == 429 && response.getFirstHeader("Retry-After") != null
val retryAfterMs = if (hasRetryAfter) {
  response.getFirstHeader("Retry-After").getValue.toLong * 1000
} else 0L

Copilot AI Mar 27, 2026


Retry-After is parsed with .toLong, but that header is allowed to be an HTTP date (not just delta-seconds). A non-numeric value will throw NumberFormatException here and bypass the IOException catch, ultimately causing advanced to return null. Consider parsing with Try (and optionally supporting HTTP-date format), falling back to exponential backoff when parsing fails.

This issue also appears in the following locations of the same file:

  • line 145
  • line 149
  • line 156
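The Try-based parse suggested above could look like the following sketch. The helper name and signature are hypothetical; the actual code reads the header via response.getFirstHeader rather than an Option.

```scala
import scala.util.Try

// Hypothetical helper for the suggested fix: parse Retry-After as
// delta-seconds with Try, falling back to the computed exponential backoff
// when the header is absent, an HTTP date, or otherwise non-numeric.
object RetryAfterSketch {
  def retryAfterMs(headerValue: Option[String], fallbackMs: Long): Long =
    headerValue
      .flatMap(v => Try(v.trim.toLong * 1000L).toOption) // delta-seconds -> ms
      .getOrElse(fallbackMs)                             // non-numeric: fall back
}
```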

@BrendanWalsh force-pushed the brwals/fix-429-retry branch from 6c096a6 to 7b9f912 on March 27, 2026 19:07
@BrendanWalsh
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter

codecov-commenter commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 8.33333% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.57%. Comparing base (895752c) to head (7b9f912).

Files with missing lines | Patch % | Lines
...crosoft/azure/synapse/ml/io/http/HTTPClients.scala | 8.33% | 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2524      +/-   ##
==========================================
- Coverage   84.61%   84.57%   -0.04%     
==========================================
  Files         335      335              
  Lines       17708    17717       +9     
  Branches     1612     1624      +12     
==========================================
+ Hits        14984    14985       +1     
- Misses       2724     2732       +8     


- Add exponential backoff for 429 responses without Retry-After header:
  starts at initial delay, doubles each retry, capped at 60s
- Fix double-sleep bug: when Retry-After header present, code was sleeping
  both in the case match and in the retry logic. Now captures retryAfterMs
  before response.close() and uses it as the sole backoff delay
- 429 retries remain unlimited (rate limiting is transient)
- Add scalastyle:off for method length (sendWithRetries is necessarily long)
@BrendanWalsh force-pushed the brwals/fix-429-retry branch from 7b9f912 to b9d8bc0 on March 31, 2026 00:49