fix: Skip Z-Wave commands to dead nodes and add exponential backoff #571

Open

richardpowellus wants to merge 7 commits into FutureTense:main from richardpowellus:fix/dead-node-backoff

Conversation

@richardpowellus commented Feb 28, 2026

Problem

When a Z-Wave node goes dead (e.g., a lock with a dead battery), KeyMaster continues sending Z-Wave commands to it on every coordinator update cycle. These commands all fail with error 204, but KeyMaster retries them indefinitely with no backoff.

In my case, this resulted in 19,694 failed Z-Wave commands over a few days, which flooded the Z-Wave mesh and caused other devices to become unresponsive. The lock battery reported 99% and then jumped straight to 0% (common with Schlage deadbolts), so there was no gradual degradation — just a sudden dead node with KeyMaster hammering it nonstop.

Root Cause

  1. Provider layer (providers/zwave_js.py): Z-Wave write commands (async_set_usercode, async_clear_usercode, async_refresh_usercode) execute regardless of node.status, even when the node is NodeStatus.DEAD.

  2. Coordinator layer (coordinator.py): _async_update_data calls _update_lock_data and _connect_and_update_lock on every cycle with no failure tracking or backoff. A dead node generates the same burst of failed commands every update.

Fix (3 layers)

Layer 1 — Provider: Dead-node gating on write commands

Adds _is_node_alive() to ZWaveJSLockProvider that checks node.status != NodeStatus.DEAD. Write operations that send actual Z-Wave RF commands return early when the node is dead:

  • async_refresh_usercode() — sends Z-Wave UserCode Get
  • async_set_usercode() — sends Z-Wave UserCode Set
  • async_clear_usercode() — sends Z-Wave UserCode Set

Read/connect operations are not blocked since they use cached data from zwave-js-server without transmitting Z-Wave RF commands:

  • async_connect() — warns when node is dead but proceeds (sets up internal references only)
  • async_is_connected() — checks connection state without node aliveness
  • async_get_usercodes() — reads cached usercode values from zwave-js-server

This ensures KeyMaster entities remain available and display cached data even when the node is temporarily dead, while preventing useless Z-Wave traffic.

Layer 2 — Coordinator: Exponential backoff

Adds per-lock failure tracking (_consecutive_failures, _next_retry_time dicts) to the coordinator. After BACKOFF_FAILURE_THRESHOLD (3) consecutive failures, the coordinator skips updates for that lock using exponential backoff from BACKOFF_INITIAL_SECONDS (60s) up to BACKOFF_MAX_SECONDS (1800s / 30 min). Counters auto-reset when a lock reconnects successfully.

Layer 3 — Constants

Adds three new constants: BACKOFF_INITIAL_SECONDS, BACKOFF_MAX_SECONDS, BACKOFF_FAILURE_THRESHOLD.
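The description above fixes the threshold, initial delay, and cap, but not the exact growth formula; one plausible reading, doubling the delay with each failure past the threshold, can be sketched as follows (constant names are the PR's; backoff_delay is a hypothetical helper, not a function from the diff):

```python
BACKOFF_INITIAL_SECONDS = 60
BACKOFF_MAX_SECONDS = 1800  # 30 min
BACKOFF_FAILURE_THRESHOLD = 3


def backoff_delay(consecutive_failures: int) -> int:
    """Seconds to wait before the next retry for a failing lock."""
    if consecutive_failures < BACKOFF_FAILURE_THRESHOLD:
        return 0  # below the threshold, retry on the normal update cycle
    exponent = consecutive_failures - BACKOFF_FAILURE_THRESHOLD
    return min(BACKOFF_INITIAL_SECONDS * 2**exponent, BACKOFF_MAX_SECONDS)
```

Under this reading, the coordinator would set _next_retry_time[lock] from backoff_delay(_consecutive_failures[lock]) after each failure and clear both dicts on a successful reconnect.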

Changes

  • providers/zwave_js.py: _is_node_alive() method + guards on 3 write methods + warning-only on connect
  • coordinator.py: backoff tracking in __init__, _connect_and_update_lock, and _update_lock_data
  • const.py: 3 backoff constants

Testing

Deployed to a Home Assistant Yellow (HAOS 17.1, HA 2026.2.3) with a Zooz ZAC93 800 Series controller, 14 Z-Wave nodes, and two Schlage Touchscreen Deadbolts managed by KeyMaster. After applying the fix:

  • KeyMaster loads cleanly with zero errors
  • Dead lock node generates zero Z-Wave write commands (previously ~15,000+)
  • KeyMaster entities remain available and show cached data when node is dead
  • Live lock (garage rear door) continues to sync code slots normally
  • When the dead lock's batteries were replaced, the backoff reset and KeyMaster resumed normal operation automatically

When a Z-Wave lock node goes dead, KeyMaster was continuing to send
commands every 15-60 seconds, generating thousands of failed Z-Wave
commands (ZW0204) that flood the mesh and degrade all other devices.

This fix adds three layers of protection:

1. Provider: Check node.status before every Z-Wave command. If the node
   is dead, return immediately without sending radio commands.

2. Provider: Gate async_connect/async_is_connected on node liveness.
   Dead nodes report as disconnected so the coordinator skips them.

3. Coordinator: Exponential backoff on consecutive connection failures.
   After 3 failures, back off from 60s up to 30 minutes between retries.
   Automatically resets when the lock reconnects.

The original _is_node_alive() guards were too aggressive: they blocked
async_connect() and async_get_usercodes(), which don't send Z-Wave
commands and can safely use cached data.

Changes:
- async_connect(): warn when node is dead but proceed with connection
  (cached data still accessible, backoff handles retries at coordinator)
- async_is_connected(): remove _is_node_alive() from connection check
  (node references remain valid even when dead)
- async_get_usercodes(): remove _is_node_alive() guard (reads cached
  usercode values from zwave-js-server, no Z-Wave RF commands sent)

Guards retained on actual Z-Wave write commands:
- async_refresh_usercode() (sends Z-Wave UserCode Get)
- async_set_usercode() (sends Z-Wave UserCode Set)
- async_clear_usercode() (sends Z-Wave UserCode Set)
@firstof9 left a comment

Please run ruff to fix formatting.

@codecov-commenter commented Mar 1, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 96.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.90%. Comparing base (cdb4922) to head (7bcfdaa).
⚠️ Report is 47 commits behind head on main.

Files with missing lines | Patch % | Lines
custom_components/keymaster/coordinator.py | 93.10% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #571      +/-   ##
==========================================
+ Coverage   84.14%   86.90%   +2.76%     
==========================================
  Files          10       25      +15     
  Lines         801     2811    +2010     
==========================================
+ Hits          674     2443    +1769     
- Misses        127      368     +241     
Flag Coverage Δ
python 86.90% <96.66%> (?)


- Fix import sorting in zwave_js.py (NodeStatus)
- Simplify conditional: usercode if usercode else None -> usercode or None
- Fix TRY300: move return from try to else block in _is_node_alive()
- Fix E225: remove spaces around power operator in coordinator.py
- Add 8 provider tests for dead-node detection (_is_node_alive, skip
  operations when dead, connect warns but proceeds)
- Add 6 coordinator tests for exponential backoff (failure tracking,
  threshold activation, skip during backoff, retry after expiry, counter
  reset on success, max cap)

@firstof9 commented Mar 1, 2026

Sometimes sending a ping to the node will bring it back to life. Perhaps adding in a ping request on the first failure?

On the first connection failure, ping the node (NoOperation CC) to try
to bring it back. Nodes can be falsely marked dead due to RF interference
or routing issues, and a single ping can recover them. If it works, the
next update cycle reconnects automatically.

Also applies ruff formatting fixes from prior review.

@richardpowellus (Author)

Great suggestion — you're right. I dug into the zwave-js source and the driver does not automatically attempt to recover dead nodes (no periodic re-ping). Recovery only happens if something externally pings the node or it spontaneously sends a message. Since false-dead scenarios are common (RF interference, routing issues, controller hiccups), a single ping attempt before backoff makes a lot of sense.

Added in f6de9ec — on the first connection failure, we ping the node. If it recovers, the next update cycle reconnects naturally. The zwave-js node.ping() already uses tryReallyHard=true for dead nodes (explorer frames, route discovery), so it gives the best chance of recovery with a single RF command.

@firstof9 left a comment

Looks like some tests need to be added for the new code.

- Add TestZWaveJSLockProviderPingNode (4 tests): no node, success,
  failure, and exception handling for async_ping_node
- Add TestBaseLockProviderPingNode (1 test): verify base class
  async_ping_node returns False by default

@richardpowellus (Author)

Hi @firstof9 — tests have been added for all new code:

Commit ac8197a (already on the PR):

  • TestZWaveJSLockProviderDeadNode — 8 tests covering _is_node_alive() (alive/dead/no-node/exception) and dead-node gating on async_refresh_usercode, async_set_usercode, async_clear_usercode, and async_connect
  • TestUpdateLockDataBackoff — 9 tests covering consecutive failure tracking, backoff activation after threshold, skip during backoff, retry after expiry, counter reset on success, max cap, and ping-on-first-failure behavior

Commit e2f3b13 (just pushed):

  • TestZWaveJSLockProviderPingNode — 4 tests covering async_ping_node(): no node, success, failure, exception
  • TestBaseLockProviderPingNode — 1 test verifying the base class default return False

All 395 tests pass (152 in the changed test files). Overall coverage is 81% (above the 80% threshold). All new code lines are covered.

@firstof9 commented Mar 3, 2026

Looks like there are some ruff errors to address. I'd prefer to see 100% patch test coverage if possible.

Thanks again for contributing!

@tykeal commented Mar 6, 2026

@richardpowellus please correct the ruff linting issues. You've got 100% coverage now from the looks of things, so it's just CI blocking on this.

@tykeal commented Mar 7, 2026

@richardpowellus you've dropped below 100% coverage of your change again. Please fix!

@tykeal tykeal added the bugfix Fixes a bug label Mar 9, 2026
… _update_child_code_slots

Adds tests for three previously-uncovered code paths that were
reformatted in this PR, bringing patch coverage to 100%:

- TestResetCodeSlot: exercises reset_code_slot success, lock-not-found,
  and slot-not-found paths
- TestUpdateSlotActiveState: exercises update_slot_active_state success,
  lock-not-found, and slot-not-found paths
- TestUpdateChildCodeSlotsSync: exercises _update_child_code_slots
  attribute sync, no-parent-slots early return, and override_parent guard

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Labels: bugfix (Fixes a bug)

4 participants