
fix: improve 429 retry handling with exponential backoff#2524

Open
BrendanWalsh wants to merge 1 commit into master from brwals/fix-429-retry

Conversation

@BrendanWalsh
Collaborator

Summary

Fixes intermittent 429 (Too Many Requests) failures in cognitive services tests by implementing proper exponential backoff with jitter.

Changes

  • HTTPClients.scala: Replace fixed retry delay with exponential backoff (base 1s, max 60s) plus random jitter (0-1s) for 429 responses
  • Add informational logging for retry attempts with delay duration
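The delay computation described above can be sketched as follows. This is a hedged illustration only: the object and method names (`BackoffSketch`, `delayFor`) are hypothetical and not the actual identifiers in HTTPClients.scala.

```scala
import scala.util.Random

// Hedged sketch of the backoff described above: exponential growth from a
// 1s base, doubling per attempt, capped at 60s, plus 0-1s of random jitter.
// Names (BackoffSketch, delayFor) are hypothetical, not from HTTPClients.scala.
object BackoffSketch {
  val BaseDelayMs: Long = 1000L
  val MaxDelayMs: Long = 60000L

  def delayFor(attempt: Int, rng: Random = new Random()): Long = {
    val exponential = BaseDelayMs << math.min(attempt, 16) // 2^attempt growth, shift-capped
    math.min(exponential, MaxDelayMs) + rng.nextInt(1000)  // cap at 60s, add 0-999ms jitter
  }
}
```

The jitter spreads out retries from concurrent workers so they do not all hit the rate limiter again at the same instant.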

Motivation

OpenAI and Azure Cognitive Services endpoints return 429 when rate-limited. The previous fixed-delay retry strategy caused test jobs to pile up requests, extending some jobs from ~11m to ~50m. Exponential backoff reduces contention and lets jobs complete faster.

Testing

  • No behavioral change for non-429 responses
  • Validated in container CI builds where all cognitive service tests pass with this fix

Split from #2506 for independent review.

Copilot AI review requested due to automatic review settings March 27, 2026 05:26
@github-actions

Hey @BrendanWalsh 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guide on building from source.
Check out the developer guide for additional guidance on testing your change.

@github-actions

github-actions bot commented Mar 27, 2026

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA b9d8bc0.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

Contributor

Copilot AI left a comment


Pull request overview

This PR updates SynapseML’s HTTP retry helper used by cognitive service wrappers to reduce flakiness from 429 (Too Many Requests) responses by moving from a fixed-delay strategy toward an exponential backoff approach.

Changes:

  • Add capped exponential backoff state for 429 responses in HandlingUtils.sendWithRetries.
  • Remove the prior sleep that occurred inside the 429 Retry-After logging block to avoid double-sleep behavior.
  • Extend sendWithRetries to carry 429 backoff state across recursive retries.
Comments suppressed due to low confidence (3)

core/src/main/scala/com/microsoft/azure/synapse/ml/io/http/HTTPClients.scala:149

  • The new 429 backoff logic doesn't add jitter or log the computed retry delay/attempt, but the PR description says it should use exponential backoff + random jitter and add informational retry logging. Consider adding a small random jitter to nextBackoff and logging the attempt number and final delay to reduce synchronized retries (thundering herd) and improve diagnosability.
            // No Retry-After header; use exponential backoff
            val current = math.max(backoff429Ms, retriesLeft.head.toLong)
            math.min(current * 2, MaxBackoffMs)
          }
          Thread.sleep(nextBackoff)

core/src/main/scala/com/microsoft/azure/synapse/ml/io/http/HTTPClients.scala:151

  • For 429 responses, the recursive call uses the same retriesLeft array (not .tail), which makes the retry loop effectively unbounded. Since this recursion sits inside a try block, it won't be tail-call optimized and can overflow the stack if the service keeps returning 429 for long enough. Consider consuming retry budget (e.g., decrement after each 429) or adding an explicit max-attempt / max-elapsed-time guard for 429 retries.
          Thread.sleep(nextBackoff)
          sendWithRetries(client, request, retriesLeft, extraCodesToRetry, nextBackoff)
        } else {
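One way to implement the bounded-retry suggestion above can be sketched with a hypothetical signature; the real sendWithRetries takes an HTTP client and request, whereas this version takes a send thunk returning a status code so the termination logic stands alone.

```scala
// Hypothetical sketch of consuming a retry budget on each 429 so the retry
// loop terminates even if the service keeps rate-limiting. The real
// sendWithRetries takes an HTTP client and request rather than a thunk.
object RetryBudgetSketch {
  @annotation.tailrec
  def sendWith429Budget(attemptsLeft: Int, backoffMs: Long, maxBackoffMs: Long)
                       (send: () => Int): Boolean =
    send() match {
      case 429 if attemptsLeft > 0 =>
        Thread.sleep(backoffMs) // capped exponential backoff between attempts
        sendWith429Budget(attemptsLeft - 1,
                          math.min(backoffMs * 2, maxBackoffMs), maxBackoffMs)(send)
      case code =>
        code < 400 // success iff the final status is non-error
    }
}
```

Because the recursive call is in tail position, `@annotation.tailrec` guarantees constant stack usage, addressing the overflow risk noted above.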

core/src/main/scala/com/microsoft/azure/synapse/ml/io/http/HTTPClients.scala:160

  • On IOException, the retry path delegates to keepTrying(...), which calls sendWithRetries(client, request, retriesLeft.tail) without forwarding extraCodesToRetry (and now also drops backoff429Ms). That means callers relying on extraCodesToRetry (e.g., retrying 404) may silently stop retrying those codes after a transient connection error. Consider threading these parameters through keepTrying / sendWithRetries so retry behavior is consistent across error types.
    } catch {
      case e: java.io.IOException =>
        logError("Encountering a connection error", e)
        keepTrying(client, request, retriesLeft, e)
    }

Comment on lines +135 to +138
val hasRetryAfter = code == 429 && response.getFirstHeader("Retry-After") != null
val retryAfterMs = if (hasRetryAfter) {
  response.getFirstHeader("Retry-After").getValue.toLong * 1000
} else 0L

Copilot AI Mar 27, 2026


Retry-After is parsed with .toLong, but that header is allowed to be an HTTP date (not just delta-seconds). A non-numeric value will throw NumberFormatException here and bypass the IOException catch, ultimately causing advanced to return null. Consider parsing with Try (and optionally supporting HTTP-date format), falling back to exponential backoff when parsing fails.

This issue also appears in the following locations of the same file:

  • line 145
  • line 149
  • line 156
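The Try-based parse suggested above could look like the following sketch. The helper name and signature are hypothetical; the actual code reads the header via response.getFirstHeader rather than an Option.

```scala
import scala.util.Try

// Hypothetical helper for the suggested fix: parse Retry-After as
// delta-seconds with Try, falling back to the computed exponential backoff
// when the header is absent, an HTTP date, or otherwise non-numeric.
object RetryAfterSketch {
  def retryAfterMs(headerValue: Option[String], fallbackMs: Long): Long =
    headerValue
      .flatMap(v => Try(v.trim.toLong * 1000L).toOption) // delta-seconds -> ms
      .getOrElse(fallbackMs)                             // non-numeric: fall back
}
```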

@BrendanWalsh force-pushed the brwals/fix-429-retry branch from 6c096a6 to 7b9f912 on March 27, 2026 19:07
@BrendanWalsh
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter

codecov-commenter commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 8.33333% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.57%. Comparing base (895752c) to head (7b9f912).

Files with missing lines | Patch % | Lines
...crosoft/azure/synapse/ml/io/http/HTTPClients.scala | 8.33% | 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2524      +/-   ##
==========================================
- Coverage   84.61%   84.57%   -0.04%     
==========================================
  Files         335      335              
  Lines       17708    17717       +9     
  Branches     1612     1624      +12     
==========================================
+ Hits        14984    14985       +1     
- Misses       2724     2732       +8     


- Add exponential backoff for 429 responses without Retry-After header:
  starts at initial delay, doubles each retry, capped at 60s
- Fix double-sleep bug: when Retry-After header present, code was sleeping
  both in the case match and in the retry logic. Now captures retryAfterMs
  before response.close() and uses it as the sole backoff delay
- 429 retries remain unlimited (rate limiting is transient)
- Add scalastyle:off for method length (sendWithRetries is necessarily long)
@BrendanWalsh force-pushed the brwals/fix-429-retry branch from 7b9f912 to b9d8bc0 on March 31, 2026 00:49