fix(http): disable connection pooling to prevent stale connections in Lambda #1094
Conversation
## Problem

After upgrading from extension v92 to v93, customers reported a sharp increase in "Max retries exceeded, returning request error" errors (SVLS-8672, GitHub issue #1092).

## Root Cause

PR #1018 introduced HTTP client caching for performance improvements. However, the cached client maintains a connection pool that doesn't work well with Lambda's freeze/resume execution model:

1. Lambda executes; the HTTP client is created with a connection pool.
2. The extension flushes traces; connections remain open in the pool.
3. Lambda freezes (paused between invocations for seconds to minutes).
4. Lambda resumes; the cached client reuses the stale connections.
5. TCP errors → "Max retries exceeded".

In v92, a new HTTP client was created per flush, so there were never stale connections to reuse.

## Solution

Disable connection pooling by setting `pool_max_idle_per_host(0)`. This ensures each request gets a fresh connection, avoiding stale-connection issues while still benefiting from client caching.

This matches the pattern used in libdatadog's `new_client_periodic()`, which explicitly disables pooling with the comment: "This client does not keep connections because otherwise we would get a pipe closed every second connection because of low keep alive in the agent."

Fixes: SVLS-8672
Fixes: #1092

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
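The stale-connection failure mode described above can be simulated with a std-only sketch. This is a hypothetical stand-in, not the extension's code: a local TCP server plays the role of the intake endpoint and closes the connection while the "Lambda" holds it, mimicking what happens to a pooled connection during a freeze.

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};

/// Simulates a pooled connection going stale during a Lambda freeze:
/// the remote side closes while we hold the connection, so the next
/// "flush" that reuses it sees EOF or a reset instead of a response.
fn reuse_stale_connection() -> usize {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let server = std::thread::spawn(move || {
        let (stream, _) = listener.accept().unwrap();
        // Server-side close: what happens to idle pooled connections
        // while the execution environment is frozen.
        stream.shutdown(Shutdown::Both).unwrap();
    });

    // "Pooled" connection created during the first invocation.
    let mut pooled = TcpStream::connect(addr).unwrap();
    server.join().unwrap();

    // After "resume", the cached client reuses the stale connection.
    let _ = pooled.write_all(b"payload"); // may already fail with EPIPE/RST
    let mut buf = [0u8; 8];
    pooled.read(&mut buf).unwrap_or(0) // EOF (0) or error, never a response
}

fn main() {
    let n = reuse_stale_connection();
    assert_eq!(n, 0, "stale pooled connection yields no response");
    println!("bytes read from stale connection: {n}");
}
```

With `pool_max_idle_per_host(0)`, the client would instead open a fresh `TcpStream::connect(addr)`-style connection per request and never touch the dead socket.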
What's the overhead in creating a fresh connection on every request?

This is what we originally had in v92. Only in v93 are we not doing this.

In v92, we were creating one client per request, and the configuration you're adding wasn't set. I'd say these are fundamentally different, although similar in what they're doing. It might be good to have an RC with this change to see the potential overhead difference, since v93 has performance improvements.
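To make the overhead question concrete, here is a std-only sketch (a stand-in measurement against a loopback listener, not the extension's code): with pooling disabled, every request pays at least a TCP handshake, and in production a TLS handshake on top.

```rust
use std::net::{TcpListener, TcpStream};
use std::time::{Duration, Instant};

/// Times `n` fresh TCP connects against a local stand-in server: the cost
/// that pool_max_idle_per_host(0) pays on every request (plus TLS in
/// production, which is considerably more expensive than raw TCP).
fn time_fresh_connects(n: usize) -> Duration {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let server = std::thread::spawn(move || {
        for _ in 0..n {
            let _ = listener.accept();
        }
    });

    let start = Instant::now();
    // Hold the connections open so the server side accepts them cleanly.
    let conns: Vec<TcpStream> = (0..n)
        .map(|_| TcpStream::connect(addr).unwrap())
        .collect();
    let elapsed = start.elapsed();
    server.join().unwrap();
    drop(conns);
    elapsed
}

fn main() {
    println!("50 fresh loopback connects took {:?}", time_fresh_connects(50));
}
```

Loopback numbers are a lower bound only; a release-candidate comparison against the real intake, as suggested above, is the meaningful measurement.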
Pull request overview
Disables HTTP connection pooling in the trace/stats HTTP client to prevent reuse of stale connections across AWS Lambda freeze/resume cycles, addressing retry errors introduced by client caching.
Changes:
- Configures the Hyper client builder with `pool_max_idle_per_host(0)` for both proxied and non-proxied connectors.
- Adds inline documentation explaining the Lambda freeze/resume stale-connection failure mode and rationale.
```rust
// Disable connection pooling to avoid stale connections after Lambda freeze/resume.
// In Lambda, the execution environment can be frozen for seconds to minutes between
// invocations. Pooled connections become stale during this time, causing failures
// when reused. Setting pool_max_idle_per_host(0) ensures each request gets a fresh
// connection, matching the pattern used in libdatadog's new_client_periodic().
let client = http_common::client_builder()
    .pool_max_idle_per_host(0)
    .build(proxy_connector);
```
The pooling configuration and explanatory comment are duplicated across both proxy branches. To reduce drift risk and simplify future changes, consider factoring the builder configuration into a single `let builder = http_common::client_builder().pool_max_idle_per_host(0);` (or a small helper) and reusing it for both `build(...)` calls; likewise, keep one canonical comment (or a brief reference) in one place.
```rust
// Disable connection pooling to avoid stale connections after Lambda freeze/resume.
// See comment above for detailed explanation.
Ok(http_common::client_builder()
    .pool_max_idle_per_host(0)
    .build(proxy_connector))
```
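The suggested refactor can be sketched with stand-in types. `Builder` and `Client` below are self-contained stubs, not the real hyper/`http_common` types; only the shape of the deduplication is illustrated.

```rust
// Stand-in stubs for the real builder/client types.
#[derive(Clone)]
struct Builder {
    pool_max_idle_per_host: usize,
}

struct Client {
    connector: &'static str,
    pool_max_idle_per_host: usize,
}

impl Builder {
    fn new() -> Self {
        // Mimics the default of keeping idle connections per host.
        Self { pool_max_idle_per_host: usize::MAX }
    }
    fn pool_max_idle_per_host(mut self, n: usize) -> Self {
        self.pool_max_idle_per_host = n;
        self
    }
    fn build(self, connector: &'static str) -> Client {
        Client { connector, pool_max_idle_per_host: self.pool_max_idle_per_host }
    }
}

/// One canonical place that disables pooling (stale connections after
/// Lambda freeze/resume); both proxy branches reuse it.
fn client_builder_no_pool() -> Builder {
    Builder::new().pool_max_idle_per_host(0)
}

fn main() {
    // Both the proxied and non-proxied branches call the same helper,
    // so the setting (and its comment) lives in exactly one place.
    let proxied = client_builder_no_pool().build("proxy_connector");
    let direct = client_builder_no_pool().build("https_connector");
    assert_eq!(proxied.pool_max_idle_per_host, 0);
    assert_eq!(direct.pool_max_idle_per_host, 0);
    println!("{} and {} built without pooling", proxied.connector, direct.connector);
}
```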
### Evidence: v92 had per-flush client creation, v93 introduced caching

**v92: per-flush client creation.** The struct has no `http_client` field:

```rust
pub struct ServerlessTraceFlusher {
    pub aggregator_handle: AggregatorHandle,
    pub config: Arc<Config>,
    pub api_key_factory: Arc<ApiKeyFactory>,
    pub additional_endpoints: Vec<Endpoint>,
    // NO http_client field!
}
```

The client is created inside `send()` on every call, and `get_http_client()` creates a new client each time (no caching):

```rust
async fn send(...) -> Option<Vec<SendData>> {
    // ...
    let Ok(http_client) =
        ServerlessTraceFlusher::get_http_client(proxy_https.as_ref(), tls_cert_file.as_ref())
    else {
        error!("TRACES | Failed to create HTTP client");
        return None;
    };
    // ...
}
```

**v93: cached client (introduced in PR #1018).** The struct has an `http_client` field wrapped in `OnceCell` (cached):

```rust
pub struct TraceFlusher {
    // ...
    /// Cached HTTP client, lazily initialized on first use.
    http_client: OnceCell<HttpClient>, // <-- CACHED!
}
```

`get_or_init_http_client()` returns the same cached client on every call:

```rust
async fn get_or_init_http_client(&self) -> Option<HttpClient> {
    match self
        .http_client
        .get_or_try_init(|| async { // OnceCell - only runs once!
            http_client::create_client(...)
        })
        .await
    // ...
}
```

**The PR that introduced caching.** PR #1018, "chore(flushing): standardize code with refactoring on some flushers and retries". The PR description explicitly states:

**Why this matters for Lambda.** The cached client maintains a connection pool. When Lambda freezes between invocations, pooled connections become stale. When Lambda resumes and the cached client tries to reuse these stale connections, they fail → "Max retries exceeded". In v92, each flush created a new client with an empty pool, so there were never stale connections to reuse.
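The v93 caching behavior can be reproduced with a std-only sketch. `OnceLock` here is a synchronous stand-in for the async `OnceCell` in the real flusher; the point is that the initializer runs exactly once, so every flush shares one client (and therefore one connection pool).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

static CLIENT: OnceLock<String> = OnceLock::new();
static INITS: AtomicUsize = AtomicUsize::new(0);

/// Returns the cached "client", creating it on the first call only,
/// mirroring the shape of v93's get_or_init_http_client().
fn get_or_init_client() -> &'static String {
    CLIENT.get_or_init(|| {
        INITS.fetch_add(1, Ordering::SeqCst);
        String::from("http client (stand-in)")
    })
}

fn main() {
    // Three simulated flushes: the initializer runs once, so all flushes
    // reuse the same client instance.
    for _ in 0..3 {
        let _client = get_or_init_client();
    }
    assert_eq!(INITS.load(Ordering::SeqCst), 1);
    println!("initializations after 3 flushes: {}", INITS.load(Ordering::SeqCst));
}
```

This is exactly why the fix targets the pool rather than the cache: the single client is kept (cheap), but with `pool_max_idle_per_host(0)` its pool never holds a connection across a freeze.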
## Summary

Fixes a regression introduced in v93 where customers see a sharp increase in "Max retries exceeded, returning request error" errors after upgrading from v92, by setting `pool_max_idle_per_host(0)`.

## Problem

PR #1018 introduced HTTP client caching for performance improvements. However, the cached client maintains a connection pool that doesn't work well with Lambda's freeze/resume execution model.

In v92, a new HTTP client was created per flush, so there were never stale connections to reuse.

## Solution

Disable connection pooling by setting `pool_max_idle_per_host(0)`. This ensures each request gets a fresh connection, avoiding stale-connection issues while still benefiting from client caching (TLS session reuse, configuration reuse, etc.).

This matches the pattern used in libdatadog's `new_client_periodic()`, which explicitly disables pooling with the comment: "This client does not keep connections because otherwise we would get a pipe closed every second connection because of low keep alive in the agent."

## Related

🤖 Generated with Claude Code