Skip to content

fix: disable TCP keepalive by default to avoid connection timeouts#276

Open
viraatc wants to merge 2 commits intomainfrom
feat/viraatc-drop-keepalive
Open

fix: disable TCP keepalive by default to avoid connection timeouts#276
viraatc wants to merge 2 commits intomainfrom
feat/viraatc-drop-keepalive

Conversation

@viraatc
Copy link
Copy Markdown
Collaborator

@viraatc viraatc commented Apr 10, 2026

Keepalive probes caused connection timeout errors in offline and high-concurrency modes. Dead connections are already handled by connection_lost/eof_received callbacks in the protocol layer.

attempt to close #202

What does this PR do?

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

Keepalive probes caused connection timeout errors in offline and
high-concurrency modes. Dead connections are already handled by
connection_lost/eof_received callbacks in the protocol layer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@viraatc viraatc requested review from a team and Copilot April 10, 2026 01:09
@github-actions github-actions bot requested a review from arekay-nv April 10, 2026 01:09
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 10, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request disables TCP keepalive for the inference endpoint client to mitigate connection timeout errors observed in high-concurrency environments. A review comment points out that disabling keepalive may lead to "zombie" connections during silent network drops, as the current implementation lacks timeouts in the read methods; it is recommended to implement application-level timeouts or use TCP_USER_TIMEOUT as a safeguard.

Comment thread src/inference_endpoint/endpoint_client/http.py
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Disables TCP keepalive probes by default in the endpoint HTTP client to reduce connection timeout errors observed in offline and high-concurrency modes (issue #202), relying instead on protocol/connection lifecycle handling in the client.

Changes:

  • Set _SocketConfig.SO_KEEPALIVE default to 0 (disabled).
  • Gate application of TCP_KEEP* tuning behind SO_KEEPALIVE being enabled.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/inference_endpoint/endpoint_client/http.py Outdated
Comment thread src/inference_endpoint/endpoint_client/http.py Outdated
- Reword misleading "(disabled)" comments on TCP_KEEP* fields to
  "only used when SO_KEEPALIVE is enabled"
- Add test asserting socket SO_KEEPALIVE option on pool connections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

"max_throughput" mode causes connection timeouts

2 participants