fix(http): disable connection pooling to prevent stale connections in Lambda #1094
Conversation
## Problem

After upgrading from extension v92 to v93, customers reported a sharp increase in "Max retries exceeded, returning request error" errors (SVLS-8672, GitHub issue #1092).

## Root Cause

PR #1018 introduced HTTP client caching for performance improvements. However, the cached client maintains a connection pool that doesn't work well with Lambda's freeze/resume execution model:

1. Lambda executes; the HTTP client is created with a connection pool.
2. The extension flushes traces; connections remain open in the pool.
3. Lambda freezes (paused between invocations for seconds to minutes).
4. Lambda resumes; the cached client reuses the stale connections.
5. TCP errors → "Max retries exceeded".

In v92, a new HTTP client was created per flush, so there were never stale connections to reuse.

## Solution

Disable connection pooling by setting `pool_max_idle_per_host(0)`. This ensures each request gets a fresh connection, avoiding stale-connection issues while still benefiting from client caching.

This matches the pattern used in libdatadog's `new_client_periodic()`, which explicitly disables pooling with the comment: "This client does not keep connections because otherwise we would get a pipe closed every second connection because of low keep alive in the agent."

Fixes: SVLS-8672
Fixes: #1092

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
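The stale-connection failure mode described above can be simulated with a std-only sketch. This is a hypothetical stand-in, not the extension's code: a local TCP server plays the role of the intake endpoint and closes the connection while the "Lambda" holds it, mimicking what happens to a pooled connection during a freeze.

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};

/// Simulates a pooled connection going stale during a Lambda freeze:
/// the remote side closes while we hold the connection, so the next
/// "flush" that reuses it sees EOF or a reset instead of a response.
fn reuse_stale_connection() -> usize {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let server = std::thread::spawn(move || {
        let (stream, _) = listener.accept().unwrap();
        // Server-side close: what happens to idle pooled connections
        // while the execution environment is frozen.
        stream.shutdown(Shutdown::Both).unwrap();
    });

    // "Pooled" connection created during the first invocation.
    let mut pooled = TcpStream::connect(addr).unwrap();
    server.join().unwrap();

    // After "resume", the cached client reuses the stale connection.
    let _ = pooled.write_all(b"payload"); // may already fail with EPIPE/RST
    let mut buf = [0u8; 8];
    pooled.read(&mut buf).unwrap_or(0) // EOF (0) or error, never a response
}

fn main() {
    let n = reuse_stale_connection();
    assert_eq!(n, 0, "stale pooled connection yields no response");
    println!("bytes read from stale connection: {n}");
}
```

With `pool_max_idle_per_host(0)`, the client would instead open a fresh `TcpStream::connect(addr)`-style connection per request and never touch the dead socket.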
What's the overhead in creating a fresh connection on every request?

This is what we originally had in v92. Only in v93 are we not doing this.

In v92, we were creating one client per request, and the configuration you're adding wasn't set. I'd say these are fundamentally different, although similar in what they're doing. It might be good to have an RC with this change to see the potential overhead difference, since v93 has performance improvements.
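To make the overhead question concrete, here is a std-only sketch (a stand-in measurement against a loopback listener, not the extension's code): with pooling disabled, every request pays at least a TCP handshake, and in production a TLS handshake on top.

```rust
use std::net::{TcpListener, TcpStream};
use std::time::{Duration, Instant};

/// Times `n` fresh TCP connects against a local stand-in server: the cost
/// that pool_max_idle_per_host(0) pays on every request (plus TLS in
/// production, which is considerably more expensive than raw TCP).
fn time_fresh_connects(n: usize) -> Duration {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let server = std::thread::spawn(move || {
        for _ in 0..n {
            let _ = listener.accept();
        }
    });

    let start = Instant::now();
    // Hold the connections open so the server side accepts them cleanly.
    let conns: Vec<TcpStream> = (0..n)
        .map(|_| TcpStream::connect(addr).unwrap())
        .collect();
    let elapsed = start.elapsed();
    server.join().unwrap();
    drop(conns);
    elapsed
}

fn main() {
    println!("50 fresh loopback connects took {:?}", time_fresh_connects(50));
}
```

Loopback numbers are a lower bound only; a release-candidate comparison against the real intake, as suggested above, is the meaningful measurement.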
Pull request overview
Disables HTTP connection pooling in the trace/stats HTTP client to prevent reuse of stale connections across AWS Lambda freeze/resume cycles, addressing retry errors introduced by client caching.
Changes:
- Configures the Hyper client builder with `pool_max_idle_per_host(0)` for both proxied and non-proxied connectors.
- Adds inline documentation explaining the Lambda freeze/resume stale-connection failure mode and rationale.
```rust
// Disable connection pooling to avoid stale connections after Lambda freeze/resume.
// In Lambda, the execution environment can be frozen for seconds to minutes between
// invocations. Pooled connections become stale during this time, causing failures
// when reused. Setting pool_max_idle_per_host(0) ensures each request gets a fresh
// connection, matching the pattern used in libdatadog's new_client_periodic().
let client = http_common::client_builder()
    .pool_max_idle_per_host(0)
    .build(proxy_connector);
```
The pooling configuration and explanatory comment are duplicated across both proxy branches. To reduce drift risk and simplify future changes, consider factoring the builder configuration into a single `let builder = http_common::client_builder().pool_max_idle_per_host(0);` (or a small helper) and reusing it for both `build(...)` calls; likewise, keep one canonical comment (or a brief reference) in one place.
```rust
// Disable connection pooling to avoid stale connections after Lambda freeze/resume.
// See comment above for detailed explanation.
Ok(http_common::client_builder()
    .pool_max_idle_per_host(0)
    .build(proxy_connector))
```
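The suggested refactor can be sketched with stand-in types. `Builder` and `Client` below are self-contained stubs, not the real hyper/`http_common` types; only the shape of the deduplication is illustrated.

```rust
// Stand-in stubs for the real builder/client types.
#[derive(Clone)]
struct Builder {
    pool_max_idle_per_host: usize,
}

struct Client {
    connector: &'static str,
    pool_max_idle_per_host: usize,
}

impl Builder {
    fn new() -> Self {
        // Mimics the default of keeping idle connections per host.
        Self { pool_max_idle_per_host: usize::MAX }
    }
    fn pool_max_idle_per_host(mut self, n: usize) -> Self {
        self.pool_max_idle_per_host = n;
        self
    }
    fn build(self, connector: &'static str) -> Client {
        Client { connector, pool_max_idle_per_host: self.pool_max_idle_per_host }
    }
}

/// One canonical place that disables pooling (stale connections after
/// Lambda freeze/resume); both proxy branches reuse it.
fn client_builder_no_pool() -> Builder {
    Builder::new().pool_max_idle_per_host(0)
}

fn main() {
    // Both the proxied and non-proxied branches call the same helper,
    // so the setting (and its comment) lives in exactly one place.
    let proxied = client_builder_no_pool().build("proxy_connector");
    let direct = client_builder_no_pool().build("https_connector");
    assert_eq!(proxied.pool_max_idle_per_host, 0);
    assert_eq!(direct.pool_max_idle_per_host, 0);
    println!("{} and {} built without pooling", proxied.connector, direct.connector);
}
```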
### Evidence: v92 had per-flush client creation, v93 introduced caching

**v92: per-flush client creation.** The struct has no `http_client` field:

```rust
pub struct ServerlessTraceFlusher {
    pub aggregator_handle: AggregatorHandle,
    pub config: Arc<Config>,
    pub api_key_factory: Arc<ApiKeyFactory>,
    pub additional_endpoints: Vec<Endpoint>,
    // NO http_client field!
}
```

The client is created inside `send()` on every call, and `get_http_client()` creates a new client each time (no caching):

```rust
async fn send(...) -> Option<Vec<SendData>> {
    // ...
    let Ok(http_client) =
        ServerlessTraceFlusher::get_http_client(proxy_https.as_ref(), tls_cert_file.as_ref())
    else {
        error!("TRACES | Failed to create HTTP client");
        return None;
    };
    // ...
}
```

**v93: cached client (introduced in PR #1018).** The struct has an `http_client` field wrapped in `OnceCell` (cached):

```rust
pub struct TraceFlusher {
    // ...
    /// Cached HTTP client, lazily initialized on first use.
    http_client: OnceCell<HttpClient>, // <-- CACHED!
}
```

`get_or_init_http_client()` returns the same cached client on every call:

```rust
async fn get_or_init_http_client(&self) -> Option<HttpClient> {
    match self
        .http_client
        .get_or_try_init(|| async { // OnceCell - only runs once!
            http_client::create_client(...)
        })
        .await
    // ...
}
```

**The PR that introduced caching.** PR #1018, "chore(flushing): standardize code with refactoring on some flushers and retries". The PR description explicitly states:

**Why this matters for Lambda.** The cached client maintains a connection pool. When Lambda freezes between invocations, pooled connections become stale. When Lambda resumes and the cached client tries to reuse these stale connections, they fail → "Max retries exceeded". In v92, each flush created a new client with an empty pool, so there were never stale connections to reuse.
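The v93 caching behavior can be reproduced with a std-only sketch. `OnceLock` here is a synchronous stand-in for the async `OnceCell` in the real flusher; the point is that the initializer runs exactly once, so every flush shares one client (and therefore one connection pool).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

static CLIENT: OnceLock<String> = OnceLock::new();
static INITS: AtomicUsize = AtomicUsize::new(0);

/// Returns the cached "client", creating it on the first call only,
/// mirroring the shape of v93's get_or_init_http_client().
fn get_or_init_client() -> &'static String {
    CLIENT.get_or_init(|| {
        INITS.fetch_add(1, Ordering::SeqCst);
        String::from("http client (stand-in)")
    })
}

fn main() {
    // Three simulated flushes: the initializer runs once, so all flushes
    // reuse the same client instance.
    for _ in 0..3 {
        let _client = get_or_init_client();
    }
    assert_eq!(INITS.load(Ordering::SeqCst), 1);
    println!("initializations after 3 flushes: {}", INITS.load(Ordering::SeqCst));
}
```

This is exactly why the fix targets the pool rather than the cache: the single client is kept (cheap), but with `pool_max_idle_per_host(0)` its pool never holds a connection across a freeze.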
## Summary

Fixes a regression introduced in v93 where customers see a sharp increase in "Max retries exceeded, returning request error" errors after upgrading from v92, by setting `pool_max_idle_per_host(0)`.

## Problem

PR #1018 introduced HTTP client caching for performance improvements. However, the cached client maintains a connection pool that doesn't work well with Lambda's freeze/resume execution model.

In v92, a new HTTP client was created per flush, so there were never stale connections to reuse.

## Solution

Disable connection pooling by setting `pool_max_idle_per_host(0)`. This ensures each request gets a fresh connection, avoiding stale-connection issues while still benefiting from client caching (TLS session reuse, configuration reuse, etc.).

This matches the pattern used in libdatadog's `new_client_periodic()`, which explicitly disables pooling with the comment: "This client does not keep connections because otherwise we would get a pipe closed every second connection because of low keep alive in the agent."

## Related

🤖 Generated with Claude Code