
Add exponential backoff to coordinator for persistent lock failures#888

Open
raman325 wants to merge 2 commits into main from fix/coordinator-backoff

Conversation

raman325 (Owner) commented on Mar 2, 2026

Proposed change

When a lock is persistently unreachable, the coordinator retries at fixed intervals
indefinitely, wasting resources and generating noise. This adds exponential backoff
to the coordinator so that after repeated failures, retry intervals increase
(60s → 120s → 240s → ... → 30min max), then reset on successful reconnection.

Inspired by FutureTense/keymaster#571.

Changes:

  • Track consecutive update failures in LockUsercodeUpdateCoordinator
  • After 3 consecutive failures, dynamically increase update_interval using exponential backoff (poll-based providers)
  • Gate drift checks during backoff to avoid redundant hard refreshes
  • Reset counters and restore original interval on success
  • No new timers or parallel retry processes — works within DataUpdateCoordinator's standard scheduling

Constants added to const.py:

  • BACKOFF_FAILURE_THRESHOLD = 3
  • BACKOFF_INITIAL_SECONDS = 60
  • BACKOFF_MAX_SECONDS = 1800 (30 min)

Type of change

  • New feature (which adds functionality)

Additional information

  • This PR is related to issue:
  • 13 new tests covering all backoff behavior (failure tracking, exponential growth, cap at max, reset on success, drift check gating, push vs poll provider handling)

When a lock is persistently unreachable, the coordinator now tracks
consecutive failures and applies exponential backoff after 3 failures.
For poll-based providers, the update_interval is dynamically increased
(60s * 2^n, capped at 30 minutes). Drift checks are also skipped
during backoff. On success, the counter resets and the original
interval is restored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 2, 2026 21:03
github-actions bot added labels: python (Pull requests that update Python code), enhancement (New feature or request) — Mar 2, 2026
Copilot AI (Contributor) left a comment


Pull request overview

Adds exponential backoff behavior to LockUsercodeUpdateCoordinator so repeated lock communication failures reduce retry noise/resource usage, with counters reset on success and drift checks gated during backoff.

Changes:

  • Add backoff constants (threshold, initial, max) to const.py.
  • Track consecutive update failures in the coordinator and apply exponential backoff to update_interval for poll-based locks.
  • Add tests covering failure tracking, interval growth/capping, reset behavior, and drift-check gating.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
custom_components/lock_code_manager/coordinator.py Implements consecutive-failure tracking, exponential backoff, and drift-check gating during backoff.
custom_components/lock_code_manager/const.py Adds coordinator backoff constants used by the coordinator logic.
tests/test_coordinator.py Adds a suite of tests validating backoff behavior for poll vs push providers and drift-check gating.


…nore

- Reset backoff state in push_update() when data actually changes,
  so push-based providers recover from backoff without needing
  async_get_usercodes() to succeed first
- Differentiate log messages for poll vs push providers in _apply_backoff()
- Remove incorrect type: ignore comment in _reset_backoff()
- Only restore update_interval in _reset_backoff() for poll-based providers
- Add tests for push_update backoff reset (data changed + unchanged)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 3b174f2659d4
Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



Comment on lines +100 to +107
self.update_interval = timedelta(seconds=backoff_secs)
_LOGGER.warning(
    "Update failed %d consecutive times for %s, "
    "backing off polling interval to %ds",
    self._consecutive_failures,
    self._lock.lock.entity_id,
    backoff_secs,
)

Copilot AI Mar 2, 2026


In _apply_backoff(), the first backoff step sets update_interval to BACKOFF_INITIAL_SECONDS (60s). With the current default usercode_scan_interval also being 60s, this doesn’t actually “back off” the polling interval, but the warning message claims it does. Consider only logging when the interval increases (compare previous vs new), and/or include both old/new intervals in the message so it’s not misleading/noisy.

Suggested change
-self.update_interval = timedelta(seconds=backoff_secs)
-_LOGGER.warning(
-    "Update failed %d consecutive times for %s, "
-    "backing off polling interval to %ds",
-    self._consecutive_failures,
-    self._lock.lock.entity_id,
-    backoff_secs,
-)
+new_interval = timedelta(seconds=backoff_secs)
+# Use the current update interval if available, otherwise fall back to
+# the original configured interval for comparison.
+previous_interval = self.update_interval or self._original_update_interval
+if new_interval > previous_interval:
+    self.update_interval = new_interval
+    _LOGGER.warning(
+        "Update failed %d consecutive times for %s, "
+        "backing off polling interval from %ds to %ds",
+        self._consecutive_failures,
+        self._lock.lock.entity_id,
+        int(previous_interval.total_seconds()),
+        backoff_secs,
+    )
+elif new_interval != previous_interval:
+    # Interval changed but did not increase (unexpected with current
+    # backoff policy) - still apply without logging a misleading warning.
+    self.update_interval = new_interval

Comment on lines +154 to +160
if self._consecutive_failures >= BACKOFF_FAILURE_THRESHOLD:
    _LOGGER.debug(
        "Skipping drift check for %s (in backoff after %d failures)",
        self._lock.lock.entity_id,
        self._consecutive_failures,
    )
    return

Copilot AI Mar 2, 2026


_async_drift_check() skips drift checks when _consecutive_failures is above the threshold, but drift-check failures themselves never increment _consecutive_failures. For push-based locks (where update_interval=None), drift checks may be the only periodic lock I/O, so the system may never enter “backoff” and will keep attempting hard refreshes and logging warnings indefinitely when the lock is unreachable. Consider applying the same backoff accounting on hard-refresh failures (increment on LockCodeManagerError, reset on success), or track a separate failure counter for drift checks.


codecov bot commented Mar 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.13%. Comparing base (712bd04) to head (08aa7dd).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main     #888      +/-   ##
==========================================
+ Coverage   96.01%   96.13%   +0.11%     
==========================================
  Files          29       29              
  Lines        2611     2637      +26     
  Branches       83       83              
==========================================
+ Hits         2507     2535      +28     
+ Misses        104      102       -2     
Flag Coverage Δ
python 96.09% <100.00%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
custom_components/lock_code_manager/const.py 100.00% <100.00%> (ø)
custom_components/lock_code_manager/coordinator.py 97.80% <100.00%> (+3.68%) ⬆️

... and 1 file with indirect coverage changes


Labels

enhancement New feature or request python Pull requests that update Python code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants