Fix database connection churn in production and Celery workers #882
- Enable CONN_HEALTH_CHECKS in production settings so Django validates persistent connections before reuse, preventing stale connection errors and reducing TIME_WAIT churn from failed-then-replaced connections.
- Add TCP keepalive OPTIONS (connect_timeout, keepalives, keepalives_idle, keepalives_interval, keepalives_count) to the production database config. These prevent the Cloud SQL proxy and intermediate infrastructure from silently dropping idle connections. Settings are environment-configurable.
- Add a Celery worker_process_init signal to close inherited database connections when child workers are forked, preventing stale connection accumulation from the prefork pool parent process.

https://claude.ai/code/session_01HWzoYENK11CAvkCBLwraz8
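For concreteness, here is a minimal sketch of what the settings described above could look like. This is not the PR's actual code: the `env_int` helper stands in for django-environ's `env.int`, and the `DATABASE_CONN_MAX_AGE` and `DATABASE_CONNECT_TIMEOUT` variable names plus the specific defaults are illustrative assumptions (only `DATABASE_KEEPALIVES` and `DATABASE_KEEPALIVES_INTERVAL` appear verbatim elsewhere in this thread).

```python
import os


def env_int(name: str, default: int) -> int:
    """Minimal stand-in for django-environ's env.int() helper."""
    return int(os.environ.get(name, default))


DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        # Reuse connections for up to 60s; CONN_HEALTH_CHECKS only
        # applies when CONN_MAX_AGE > 0 (persistent connections enabled).
        "CONN_MAX_AGE": env_int("DATABASE_CONN_MAX_AGE", 60),
        # Validate a persistent connection before each reuse.
        "CONN_HEALTH_CHECKS": True,
        "OPTIONS": {
            # libpq connection parameters, all seconds except the flags.
            "connect_timeout": env_int("DATABASE_CONNECT_TIMEOUT", 10),
            "keepalives": env_int("DATABASE_KEEPALIVES", 1),
            "keepalives_idle": env_int("DATABASE_KEEPALIVES_IDLE", 30),
            "keepalives_interval": env_int("DATABASE_KEEPALIVES_INTERVAL", 10),
            "keepalives_count": env_int("DATABASE_KEEPALIVES_COUNT", 5),
        },
    }
}
```

Every value can be overridden per environment without a code change, which matches the "environment-configurable" goal stated above.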
### Code Review: Database Connection Churn Fix

#### Overview

This PR addresses production database connection issues by implementing three complementary fixes: connection health checks, TCP keepalives, and fork-safe connection cleanup in Celery workers.
#### Positive Aspects ✓

1. Excellent Problem Analysis
2. Strong Documentation
3. Consistency with Test Configuration
4. Best Practice Implementation
#### Critical Issues
### Code Review: Database Connection Churn Fix

Thank you for this well-documented PR addressing production database connection issues. This is a solid fix with good attention to detail. Here's my comprehensive review:

#### ✅ Strengths

1. Well-Researched Solution
2. Code Quality
3. Configuration Management
#### 🔍 Observations & Questions

**1. TCP Keepalive Timing Differences**

Production (`config/settings/production.py:31`):

```python
"keepalives_interval": env.int("DATABASE_KEEPALIVES_INTERVAL", default=10),
```

Test (`config/settings/test.py:26`):

```python
"keepalives_interval": 5,
```

**Question:** Is the 10-second interval in production intentional? Test uses 5 seconds. For Cloud SQL proxy, the more aggressive 5-second interval might be better to detect stale connections faster, but 10 seconds reduces keepalive traffic. Was this a deliberate tradeoff?

**Recommendation:** Add a brief comment in production.py explaining the rationale (e.g., noting the traffic vs. detection-speed tradeoff).

**2. Connection Health Checks + Keepalive Interaction**

You've enabled both `CONN_HEALTH_CHECKS` and TCP keepalives. However, with aggressive keepalive settings (30s idle, 10s interval, 5 retries = ~80s total), most connections won't go idle long enough to trigger health check failures.

**Question:** Have you profiled in production to see how often health checks actually fail?

**Recommendation:** This is fine as-is (redundancy is good for production stability), but consider logging when health checks fail to validate that both mechanisms are working as expected.
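The timing interplay discussed above can be checked with quick arithmetic. The 30s/10s/5 values come from this review comment; the 60-second `CONN_MAX_AGE` figure is the default discussed elsewhere in the thread.

```python
# Worst-case time for TCP keepalives to declare a peer dead, using the
# values quoted in the review (all in seconds).
keepalives_idle = 30      # idle time before the first probe is sent
keepalives_interval = 10  # gap between unanswered probes
keepalives_count = 5      # unanswered probes before the OS drops the socket

# First probe fires after the idle period; the remaining probes follow
# at the configured interval.
detection_window = keepalives_idle + keepalives_interval * keepalives_count
print(detection_window)  # 80

# Django's persistent-connection lifetime.
conn_max_age = 60

# Django discards the connection before a full keepalive cycle completes,
# so keepalives protect the TCP layer but do not extend connection lifetime.
print(detection_window > conn_max_age)  # True
```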
### Codecov Report

✅ All modified and coverable lines are covered by tests.
Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
### Code Review

This is a solid, focused fix targeting a real production pain point. The three-pronged approach (health checks + keepalives + fork-safe cleanup) is the correct solution for connection churn in a prefork Celery deployment. The changes are minimal and well-documented. A few items are worth addressing before merge:

#### Issues

**1. Missing blank line in CHANGELOG.md**

The new changelog entry is missing a blank line separator before the next section heading:

```diff
+#### High Database Connection Count and Churn in Production
+- ...
+- ...
+- ...
 #### Security: LLM Prompt Injection Protection for Approval Bypass
```

A blank line is needed between the two `####` headings.

**2. `CONN_HEALTH_CHECKS` has no effect without persistent connections**

Per Django docs, `CONN_HEALTH_CHECKS` only applies when `CONN_MAX_AGE > 0`. A guard comment at the setting would help:

```python
# CONN_HEALTH_CHECKS requires CONN_MAX_AGE > 0 to have any effect.
# If CONN_MAX_AGE is 0, persistent connections are disabled and health
# checks are never reached.
DATABASES["default"]["CONN_HEALTH_CHECKS"] = True  # noqa: F405
```

**3. Test config divergence**

The test config uses a 5-second `keepalives_interval`, while production defaults to 10 seconds; consider aligning them or documenting the difference.

**4. Keepalive timeout vs `CONN_MAX_AGE`**

With the defaults as set, the worst-case keepalive detection window (30s idle + 10s interval × 5 retries ≈ 80s) exceeds `CONN_MAX_AGE` (60s). This means Django will discard the connection at 60s, before the keepalive mechanism has a chance to rescue it from a stale state. The keepalives do still provide value (they prevent the OS from silently dropping the TCP connection before Django closes it cleanly), but they do not extend connection lifetime. This is fine architecturally; it is just worth a comment so future maintainers don't wonder why keepalives are set to values that seem to exceed `CONN_MAX_AGE`.

If the intent is to have keepalives actively rescue idle connections, consider either tightening the keepalive timings or raising `CONN_MAX_AGE`.

**5. Signal handler is prefork-specific**

The signal handler is correct for the default prefork pool, but it is a no-op (or not invoked at all) for the gevent, eventlet, and solo pools. A clarifying comment would help:

```python
# NOTE: This signal fires only in the prefork pool (the default).
# For gevent/eventlet/solo pools, no fork occurs and this handler is never called.
```

#### Minor / Non-blocking

- All other keepalive parameters are env-overridable, but:

  ```python
  "keepalives": env.int("DATABASE_KEEPALIVES", default=1),
  ```

- Deferred import in signal handler.

#### Verdict

Approve with minor suggestions. The core fix is correct and addresses a well-understood class of Celery+PostgreSQL connection problems. Items 1 and 4 are the most actionable; the rest are documentation/commentary improvements. None are blockers if the team is comfortable with the current behavior.
- Align keepalives_interval default to 5s (was 10s) to match test config and improve stale connection detection speed through Cloud SQL proxy
- Expand Celery signal handler docstring to document the expected pool type (prefork) and note harmless behavior on other pool types
- Add unit test for the worker_process_init signal handler
- Add manual test script for verifying connection health post-deployment
- Add DATABASE_SSL_MODE to sample env file for completeness
- Condense CHANGELOG entry per review feedback

https://claude.ai/code/session_01HWzoYENK11CAvkCBLwraz8
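The manual test script referenced above is not included in this thread. A hypothetical minimal version might look like the following, parameterized over a cursor factory so the core check can run outside Django; in a real deployment you would pass `django.db.connection.cursor`.

```python
def check_database_connection(cursor_factory) -> bool:
    """Run a trivial query through a fresh cursor and verify the result.

    cursor_factory: a zero-argument callable returning a context-manager
    cursor, e.g. django.db.connection.cursor in an actual deployment.
    Returns True when the round trip to the database succeeds.
    """
    with cursor_factory() as cursor:
        cursor.execute("SELECT 1")
        (value,) = cursor.fetchone()
    return value == 1
```

Run after deployment (for example via `python manage.py shell`), this exercises the full connect/query path and therefore the new keepalive and health-check settings.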
### PR Review: Fix database connection churn in production and Celery workers

Overall this is a well-scoped, well-documented fix for a real production issue. The changes are minimal, targeted, and include tests and docs. A few items worth considering before merging:
| Area | Status |
|---|---|
| Correctness of core fix | ✅ Correct and idiomatic |
| Test coverage | ✅ Unit test covers the signal handler |
| Documentation | ✅ Test script + env sample + changelog updated |
| Security impact | ✅ Neutral — no new attack surface |
Flagged items:

- Docstring accuracy (gevent/eventlet)
- DATABASE_SSL_MODE comment in env file
No blocking issues. The two flagged items are minor; address them at your discretion.
Mistakenly committed to main as 1e073cf
### Summary

This PR addresses high database connection count and churn issues in production by implementing connection health checks, TCP keepalive settings, and proper connection cleanup in Celery worker processes.

### Key Changes

1. **Enabled Django connection health checks** (`config/settings/production.py`): Added `CONN_HEALTH_CHECKS = True` to validate that persistent connections are still alive before reuse. This prevents errors from stale connections that may have been closed by Cloud SQL proxy or PostgreSQL idle timeouts.
2. **Added TCP keepalive options to production database config** (`config/settings/production.py`): Configured `connect_timeout`, `keepalives`, `keepalives_idle`, `keepalives_interval`, and `keepalives_count` to prevent intermediate infrastructure (Cloud SQL proxy, firewalls, load balancers) from silently dropping idle connections. These settings were already present in the test configuration but were missing from production.
3. **Added Celery worker process initialization signal** (`config/celery_app.py`): Implemented a `worker_process_init` signal handler to explicitly close inherited database connections when Celery forks worker child processes. This prevents stale connection accumulation and errors, since connections inherited from the parent process are invalid in the child.
4. **Updated environment variable documentation** (`docs/sample_env_files/backend/production/django.env`): Added commented-out examples for the new database connection timeout and keepalive configuration options.

### Implementation Details
The Celery signal handler closes all database connections immediately after a worker process is initialized, forcing Django to establish fresh connections. This is critical because when Celery uses the prefork pool, child processes inherit file descriptors for database connections that are no longer valid in the child process context.
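As a sketch of this mechanism (the handler and helper names here are assumptions, not the PR's actual code), the fork-time cleanup can be written as:

```python
def close_inherited_connections(connections_iterable):
    """Close every DB connection handed down from the prefork parent.

    The sockets behind these connections are shared with (and owned by)
    the parent, so closing them in the child forces Django to open a
    fresh connection lazily on the first query.
    """
    for conn in connections_iterable:
        conn.close()


try:
    from celery.signals import worker_process_init
    from django.db import connections

    @worker_process_init.connect
    def reset_db_connections(**kwargs):
        # Fires once per child right after fork (prefork pool only;
        # gevent/eventlet/solo never fork, so this never runs there).
        close_inherited_connections(connections.all())
except ImportError:
    # Celery/Django not installed; the helper above remains usable standalone.
    pass
```

The try/except guard is only for this standalone sketch; in the real `config/celery_app.py` the imports would be unconditional.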
The TCP keepalive settings work in conjunction with connection health checks to maintain connection validity across network boundaries and prevent silent connection drops that would otherwise cause connection churn (TIME_WAIT accumulation).
https://claude.ai/code/session_01HWzoYENK11CAvkCBLwraz8