fix: Skip Z-Wave commands to dead nodes and add exponential backoff #571
richardpowellus wants to merge 7 commits into FutureTense:main from
Conversation
When a Z-Wave lock node goes dead, KeyMaster was continuing to send commands every 15-60 seconds, generating thousands of failed Z-Wave commands (ZW0204) that flood the mesh and degrade all other devices. This fix adds three layers of protection:

1. Provider: Check `node.status` before every Z-Wave command. If the node is dead, return immediately without sending radio commands.
2. Provider: Gate `async_connect`/`async_is_connected` on node liveness. Dead nodes report as disconnected, so the coordinator skips them.
3. Coordinator: Exponential backoff on consecutive connection failures. After 3 failures, back off from 60s up to 30 minutes between retries. Automatically resets when the lock reconnects.
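The backoff schedule in (3) can be reduced to a small pure function. This is a sketch under stated assumptions: the doubling formula and the function name `backoff_seconds` are illustrative, not copied from the PR diff; only the three constants (3 failures, 60s initial, 30-minute cap) come from the description above.

```python
BACKOFF_FAILURE_THRESHOLD = 3  # consecutive failures before backoff kicks in
BACKOFF_INITIAL_SECONDS = 60   # first backoff interval
BACKOFF_MAX_SECONDS = 1800     # 30-minute cap


def backoff_seconds(consecutive_failures: int) -> int:
    """Return how long to wait before retrying a failing lock.

    No backoff until the threshold is reached; after that the interval
    doubles per additional failure, capped at BACKOFF_MAX_SECONDS.
    """
    if consecutive_failures < BACKOFF_FAILURE_THRESHOLD:
        return 0
    exponent = consecutive_failures - BACKOFF_FAILURE_THRESHOLD
    return min(BACKOFF_INITIAL_SECONDS * 2**exponent, BACKOFF_MAX_SECONDS)


print([backoff_seconds(n) for n in range(2, 10)])
# [0, 60, 120, 240, 480, 960, 1800, 1800]
```

With these constants the coordinator reaches the 30-minute ceiling after eight consecutive failures and stays there until the lock reconnects.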
The original `_is_node_alive()` guards were too aggressive: they blocked `async_connect()` and `async_get_usercodes()`, which don't send Z-Wave commands and can safely use cached data.

Changes:
- `async_connect()`: warn when the node is dead but proceed with the connection (cached data stays accessible; backoff handles retries at the coordinator)
- `async_is_connected()`: remove `_is_node_alive()` from the connection check (node references remain valid even when dead)
- `async_get_usercodes()`: remove the `_is_node_alive()` guard (reads cached usercode values from zwave-js-server; no Z-Wave RF commands sent)

Guards retained on actual Z-Wave write commands:
- `async_refresh_usercode()` (sends Z-Wave UserCode Get)
- `async_set_usercode()` (sends Z-Wave UserCode Set)
- `async_clear_usercode()` (sends Z-Wave UserCode Set)
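A minimal sketch of the liveness check the guards are built on. `NodeStatus` is stubbed here rather than imported from `zwave_js_server` (the enum values are illustrative), and `FakeNode` stands in for a real zwave-js node object:

```python
from __future__ import annotations

from enum import IntEnum


class NodeStatus(IntEnum):
    """Stub of zwave-js's node status enum (values illustrative)."""
    UNKNOWN = 0
    ASLEEP = 1
    AWAKE = 2
    DEAD = 3
    ALIVE = 4


class FakeNode:
    """Stand-in for a zwave-js node object."""
    def __init__(self, status: NodeStatus) -> None:
        self.status = status


def is_node_alive(node: FakeNode | None) -> bool:
    """Return True unless the node is missing or reported DEAD.

    ASLEEP/AWAKE/UNKNOWN nodes still accept queued commands, so only
    DEAD nodes are filtered out.
    """
    return node is not None and node.status != NodeStatus.DEAD


print(is_node_alive(FakeNode(NodeStatus.ALIVE)))  # True
print(is_node_alive(FakeNode(NodeStatus.DEAD)))   # False
print(is_node_alive(None))                        # False
```

The key design point from the PR is that this check gates only the three RF-write methods listed above, never the cached-read paths.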
Codecov Report ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #571 +/- ##
==========================================
+ Coverage 84.14% 86.90% +2.76%
==========================================
Files 10 25 +15
Lines 801 2811 +2010
==========================================
+ Hits 674 2443 +1769
- Misses 127 368 +241
Flags with carried forward coverage won't be shown.
- Fix import sorting in zwave_js.py (NodeStatus)
- Simplify conditional: `usercode if usercode else None` -> `usercode or None`
- Fix TRY300: move return from try to else block in `_is_node_alive()`
- Fix E225: remove spaces around power operator in coordinator.py
- Add 8 provider tests for dead-node detection (`_is_node_alive`, skip operations when dead, connect warns but proceeds)
- Add 6 coordinator tests for exponential backoff (failure tracking, threshold activation, skip during backoff, retry after expiry, counter reset on success, max cap)
Sometimes sending a ping to the node will bring it back to life. Perhaps add a ping request on the first failure?
On the first connection failure, ping the node (NoOperation CC) to try to bring it back. Nodes can be falsely marked dead due to RF interference or routing issues, and a single ping can recover them. If it works, the next update cycle reconnects automatically. Also applies ruff formatting fixes from prior review.
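The ping-on-first-failure flow can be sketched as below. The node object is stubbed: in the real provider the ping would go out through the zwave-js client as a NoOperation CC, and the method name `async_ping` on the stub is an assumption for illustration, not a claim about the PR's exact API.

```python
import asyncio


class StubNode:
    """Stand-in for a zwave-js node; ping simulates a NoOperation CC."""

    def __init__(self, responds: bool) -> None:
        self._responds = responds

    async def async_ping(self) -> bool:
        # A real ping returns True if the node acknowledged the frame.
        return self._responds


async def handle_failure(node: StubNode, consecutive_failures: int) -> bool:
    """On the first failure only, try one ping to revive a false-dead node.

    Returns True if the ping got a response (the next update cycle will
    then reconnect); later failures fall through to exponential backoff.
    """
    if consecutive_failures == 1:
        return await node.async_ping()
    return False


print(asyncio.run(handle_failure(StubNode(responds=True), 1)))  # True
print(asyncio.run(handle_failure(StubNode(responds=True), 3)))  # False
```

Limiting the ping to the first failure keeps the recovery attempt from itself becoming repeated traffic toward a genuinely dead node.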
Great suggestion — you're right. I dug into the zwave-js source and the driver does not automatically attempt to recover dead nodes (no periodic re-ping). Recovery only happens if something externally pings the node or it spontaneously sends a message. Since false-dead scenarios are common (RF interference, routing issues, controller hiccups), a single ping attempt before backoff makes a lot of sense. Added in f6de9ec — on the first connection failure, we ping the node. If it recovers, the next update cycle reconnects naturally. The zwave-js
firstof9
left a comment
Looks like some tests need to be added for the new code.
- Add TestZWaveJSLockProviderPingNode (4 tests): no node, success, failure, and exception handling for `async_ping_node`
- Add TestBaseLockProviderPingNode (1 test): verify the base class `async_ping_node` returns False by default
Hi @firstof9 — tests have been added for all new code: Commit
Commit
All 395 tests pass (152 in the changed test files). Overall coverage is 81% (above the 80% threshold). All new code lines are covered.
Looks like there are some ruff errors to address. I'd prefer to see 100% patch test coverage if possible. Thanks again for contributing!
@richardpowellus please correct the ruff linting issues. You've got 100% coverage now from the looks of things, so it's just CI blocking on this. |
@richardpowellus you've dropped below 100% coverage of your change again. Please fix! |
… _update_child_code_slots

Adds tests for three previously-uncovered code paths that were reformatted in this PR, bringing patch coverage to 100%:
- TestResetCodeSlot: exercises reset_code_slot success, lock-not-found, and slot-not-found paths
- TestUpdateSlotActiveState: exercises update_slot_active_state success, lock-not-found, and slot-not-found paths
- TestUpdateChildCodeSlotsSync: exercises _update_child_code_slots attribute sync, no-parent-slots early return, and override_parent guard

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
When a Z-Wave node goes dead (e.g., a lock with a dead battery), KeyMaster continues sending Z-Wave commands to it on every coordinator update cycle. These commands all fail with error 204, but KeyMaster retries them indefinitely with no backoff.
In my case, this resulted in 19,694 failed Z-Wave commands over a few days, which flooded the Z-Wave mesh and caused other devices to become unresponsive. The lock battery reported 99% and then jumped straight to 0% (common with Schlage deadbolts), so there was no gradual degradation — just a sudden dead node with KeyMaster hammering it nonstop.
Root Cause
Provider layer (`providers/zwave_js.py`): Z-Wave write commands (`async_set_usercode`, `async_clear_usercode`, `async_refresh_usercode`) execute regardless of `node.status`, even when the node is `NodeStatus.DEAD`.

Coordinator layer (`coordinator.py`): `_async_update_data` calls `_update_lock_data` → `_connect_and_update_lock` on every cycle with no failure tracking or backoff. A dead node generates the same burst of failed commands every update.

Fix (3 layers)
Layer 1 — Provider: Dead-node gating on write commands
Adds `_is_node_alive()` to `ZWaveJSLockProvider` that checks `node.status != NodeStatus.DEAD`. Write operations that send actual Z-Wave RF commands return early when the node is dead:

- `async_refresh_usercode()` — sends Z-Wave UserCode Get
- `async_set_usercode()` — sends Z-Wave UserCode Set
- `async_clear_usercode()` — sends Z-Wave UserCode Set

Read/connect operations are not blocked since they use cached data from zwave-js-server without transmitting Z-Wave RF commands:

- `async_connect()` — warns when the node is dead but proceeds (sets up internal references only)
- `async_is_connected()` — checks connection state without node liveness
- `async_get_usercodes()` — reads cached usercode values from zwave-js-server

This ensures KeyMaster entities remain available and display cached data even when the node is temporarily dead, while preventing useless Z-Wave traffic.
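The write-gated/read-open split can be sketched with a simplified, synchronous stand-in for the async provider (class name, statuses, and method bodies here are illustrative, not the PR's actual code):

```python
import logging

logging.basicConfig(level=logging.WARNING)
_LOGGER = logging.getLogger(__name__)

DEAD = "dead"
ALIVE = "alive"


class LockProviderSketch:
    """Illustrative provider: writes are gated, reads serve cached data."""

    def __init__(self, node_status: str, cached_codes: dict) -> None:
        self.node_status = node_status
        self._cached_codes = cached_codes
        self.sent_rf_commands = []  # records what would hit the Z-Wave radio

    def _is_node_alive(self) -> bool:
        return self.node_status != DEAD

    def set_usercode(self, slot: int, code: str) -> bool:
        """Write path: would send a Z-Wave UserCode Set, so skip dead nodes."""
        if not self._is_node_alive():
            _LOGGER.warning("Node dead; skipping set_usercode for slot %s", slot)
            return False
        self.sent_rf_commands.append(f"UserCode Set slot={slot}")
        self._cached_codes[slot] = code
        return True

    def get_usercodes(self) -> dict:
        """Read path: cached values from zwave-js-server, no RF traffic."""
        return dict(self._cached_codes)


provider = LockProviderSketch(DEAD, {1: "1234"})
print(provider.set_usercode(2, "5678"))  # False: write skipped, no RF sent
print(provider.get_usercodes())          # {1: '1234'}: cached read still works
```

This is what keeps KeyMaster entities populated from cache while the dead node generates zero radio traffic.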
Layer 2 — Coordinator: Exponential backoff
Adds per-lock failure tracking (`_consecutive_failures`, `_next_retry_time` dicts) to the coordinator. After `BACKOFF_FAILURE_THRESHOLD` (3) consecutive failures, the coordinator skips updates for that lock using exponential backoff from `BACKOFF_INITIAL_SECONDS` (60s) up to `BACKOFF_MAX_SECONDS` (1800s / 30 min). Counters auto-reset when a lock reconnects successfully.

Layer 3 — Constants
Adds three new constants: `BACKOFF_INITIAL_SECONDS`, `BACKOFF_MAX_SECONDS`, `BACKOFF_FAILURE_THRESHOLD`.

Changes

- `providers/zwave_js.py`: `_is_node_alive()` method + guards on 3 write methods + warning-only on connect
- `coordinator.py`: backoff tracking in `__init__`, `_connect_and_update_lock`, and `_update_lock_data`
- `const.py`: 3 backoff constants

Testing
Deployed to a Home Assistant Yellow (HAOS 17.1, HA 2026.2.3) with a Zooz ZAC93 800 Series controller, 14 Z-Wave nodes, and two Schlage Touchscreen Deadbolts managed by KeyMaster. After applying the fix: