Add exponential backoff to coordinator for persistent lock failures#888
Conversation
When a lock is persistently unreachable, the coordinator now tracks consecutive failures and applies exponential backoff after 3 failures. For poll-based providers, the update_interval is dynamically increased (60s * 2^n, capped at 30 minutes). Drift checks are also skipped during backoff. On success, the counter resets and the original interval is restored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Adds exponential backoff behavior to LockUsercodeUpdateCoordinator so repeated lock communication failures reduce retry noise/resource usage, with counters reset on success and drift checks gated during backoff.
Changes:
- Add backoff constants (threshold, initial, max) to `const.py`.
- Track consecutive update failures in the coordinator and apply exponential backoff to `update_interval` for poll-based locks.
- Add tests covering failure tracking, interval growth/capping, reset behavior, and drift-check gating.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `custom_components/lock_code_manager/coordinator.py` | Implements consecutive-failure tracking, exponential backoff, and drift-check gating during backoff. |
| `custom_components/lock_code_manager/const.py` | Adds coordinator backoff constants used by the coordinator logic. |
| `tests/test_coordinator.py` | Adds a suite of tests validating backoff behavior for poll vs push providers and drift-check gating. |
…nore

- Reset backoff state in push_update() when data actually changes, so push-based providers recover from backoff without needing async_get_usercodes() to succeed first
- Differentiate log messages for poll vs push providers in _apply_backoff()
- Remove incorrect type: ignore comment in _reset_backoff()
- Only restore update_interval in _reset_backoff() for poll-based providers
- Add tests for push_update backoff reset (data changed + unchanged)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 3b174f2659d4
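The push_update() recovery logic from this commit can be sketched as follows. This is a hedged sketch: the class body and data shape are assumptions for illustration; only the method names (`push_update`, `_reset_backoff`) and the "reset only when data actually changes" behavior come from the commit message.

```python
class CoordinatorSketch:
    """Minimal stand-in for LockUsercodeUpdateCoordinator (illustrative only)."""

    def __init__(self) -> None:
        self.data: dict[int, str] = {}
        self._consecutive_failures = 0

    def _reset_backoff(self) -> None:
        # In the real coordinator this also restores update_interval
        # for poll-based providers; here we only reset the counter.
        self._consecutive_failures = 0

    def push_update(self, new_data: dict[int, str]) -> None:
        # A genuine data change proves the lock is reachable again, so
        # backoff resets on change, not on every incoming push event.
        if new_data != self.data:
            self.data = new_data
            self._reset_backoff()
```

This lets push-based providers exit backoff without waiting for a successful async_get_usercodes() call.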
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
```python
self.update_interval = timedelta(seconds=backoff_secs)
_LOGGER.warning(
    "Update failed %d consecutive times for %s, "
    "backing off polling interval to %ds",
    self._consecutive_failures,
    self._lock.lock.entity_id,
    backoff_secs,
)
```
In _apply_backoff(), the first backoff step sets update_interval to BACKOFF_INITIAL_SECONDS (60s). With the current default usercode_scan_interval also being 60s, this doesn’t actually “back off” the polling interval, but the warning message claims it does. Consider only logging when the interval increases (compare previous vs new), and/or include both old/new intervals in the message so it’s not misleading/noisy.
Suggested change:

```diff
-self.update_interval = timedelta(seconds=backoff_secs)
-_LOGGER.warning(
-    "Update failed %d consecutive times for %s, "
-    "backing off polling interval to %ds",
-    self._consecutive_failures,
-    self._lock.lock.entity_id,
-    backoff_secs,
-)
+new_interval = timedelta(seconds=backoff_secs)
+# Use the current update interval if available, otherwise fall back to
+# the original configured interval for comparison.
+previous_interval = self.update_interval or self._original_update_interval
+if new_interval > previous_interval:
+    self.update_interval = new_interval
+    _LOGGER.warning(
+        "Update failed %d consecutive times for %s, "
+        "backing off polling interval from %ds to %ds",
+        self._consecutive_failures,
+        self._lock.lock.entity_id,
+        int(previous_interval.total_seconds()),
+        backoff_secs,
+    )
+elif new_interval != previous_interval:
+    # Interval changed but did not increase (unexpected with current
+    # backoff policy) – still apply without logging a misleading warning.
+    self.update_interval = new_interval
```
```python
if self._consecutive_failures >= BACKOFF_FAILURE_THRESHOLD:
    _LOGGER.debug(
        "Skipping drift check for %s (in backoff after %d failures)",
        self._lock.lock.entity_id,
        self._consecutive_failures,
    )
    return
```
_async_drift_check() skips drift checks when _consecutive_failures is above the threshold, but drift-check failures themselves never increment _consecutive_failures. For push-based locks (where update_interval=None), drift checks may be the only periodic lock I/O, so the system may never enter “backoff” and will keep attempting hard refreshes and logging warnings indefinitely when the lock is unreachable. Consider applying the same backoff accounting on hard-refresh failures (increment on LockCodeManagerError, reset on success), or track a separate failure counter for drift checks.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main     #888      +/-   ##
==========================================
+ Coverage   96.01%   96.13%   +0.11%
==========================================
  Files          29       29
  Lines        2611     2637      +26
  Branches       83       83
==========================================
+ Hits         2507     2535      +28
+ Misses        104      102       -2
```

Flags with carried forward coverage won't be shown.
Proposed change
When a lock is persistently unreachable, the coordinator retries at fixed intervals
indefinitely, wasting resources and generating noise. This adds exponential backoff
to the coordinator so that after repeated failures, retry intervals increase
(60s → 120s → 240s → ... → 30min max), then reset on successful reconnection.
Inspired by FutureTense/keymaster#571.
Changes:
- `LockUsercodeUpdateCoordinator` increases `update_interval` using exponential backoff (poll-based providers)

Constants added to `const.py`:

- `BACKOFF_FAILURE_THRESHOLD = 3`
- `BACKOFF_INITIAL_SECONDS = 60`
- `BACKOFF_MAX_SECONDS = 1800` (30 min)

Type of change

Additional information