fix: Skip Z-Wave commands to dead nodes and add exponential backoff #571

Open

richardpowellus wants to merge 7 commits into FutureTense:main from richardpowellus:fix/dead-node-backoff

Conversation

@richardpowellus commented Feb 28, 2026

Problem

When a Z-Wave node goes dead (e.g., a lock with a dead battery), KeyMaster continues sending Z-Wave commands to it on every coordinator update cycle. These commands all fail with error 204, but KeyMaster retries them indefinitely with no backoff.

In my case, this resulted in 19,694 failed Z-Wave commands over a few days, which flooded the Z-Wave mesh and caused other devices to become unresponsive. The lock battery reported 99% and then jumped straight to 0% (common with Schlage deadbolts), so there was no gradual degradation — just a sudden dead node with KeyMaster hammering it nonstop.

Root Cause

  1. Provider layer (providers/zwave_js.py): Z-Wave write commands (async_set_usercode, async_clear_usercode, async_refresh_usercode) execute regardless of node.status, even when the node is NodeStatus.DEAD.

  2. Coordinator layer (coordinator.py): _async_update_data calls _update_lock_data and _connect_and_update_lock on every cycle with no failure tracking or backoff. A dead node generates the same burst of failed commands every update.

Fix (3 layers)

Layer 1 — Provider: Dead-node gating on write commands

Adds _is_node_alive() to ZWaveJSLockProvider that checks node.status != NodeStatus.DEAD. Write operations that send actual Z-Wave RF commands return early when the node is dead:

  • async_refresh_usercode() — sends Z-Wave UserCode Get
  • async_set_usercode() — sends Z-Wave UserCode Set
  • async_clear_usercode() — sends Z-Wave UserCode Set

Read/connect operations are not blocked since they use cached data from zwave-js-server without transmitting Z-Wave RF commands:

  • async_connect() — warns when node is dead but proceeds (sets up internal references only)
  • async_is_connected() — checks connection state without node aliveness
  • async_get_usercodes() — reads cached usercode values from zwave-js-server

This ensures KeyMaster entities remain available and display cached data even when the node is temporarily dead, while preventing useless Z-Wave traffic.

Layer 2 — Coordinator: Exponential backoff

Adds per-lock failure tracking (_consecutive_failures, _next_retry_time dicts) to the coordinator. After BACKOFF_FAILURE_THRESHOLD (3) consecutive failures, the coordinator skips updates for that lock using exponential backoff from BACKOFF_INITIAL_SECONDS (60s) up to BACKOFF_MAX_SECONDS (1800s / 30 min). Counters auto-reset when a lock reconnects successfully.

Layer 3 — Constants

Adds three new constants: BACKOFF_INITIAL_SECONDS, BACKOFF_MAX_SECONDS, BACKOFF_FAILURE_THRESHOLD.
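The description above fixes the threshold, initial delay, and cap, but not the exact growth formula; one plausible reading, doubling the delay with each failure past the threshold, can be sketched as follows (constant names are the PR's; backoff_delay is a hypothetical helper, not a function from the diff):

```python
BACKOFF_INITIAL_SECONDS = 60
BACKOFF_MAX_SECONDS = 1800  # 30 min
BACKOFF_FAILURE_THRESHOLD = 3


def backoff_delay(consecutive_failures: int) -> int:
    """Seconds to wait before the next retry for a failing lock."""
    if consecutive_failures < BACKOFF_FAILURE_THRESHOLD:
        return 0  # below the threshold, retry on the normal update cycle
    exponent = consecutive_failures - BACKOFF_FAILURE_THRESHOLD
    return min(BACKOFF_INITIAL_SECONDS * 2**exponent, BACKOFF_MAX_SECONDS)
```

Under this reading, the coordinator would set _next_retry_time[lock] from backoff_delay(_consecutive_failures[lock]) after each failure and clear both dicts on a successful reconnect.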

Changes

  • providers/zwave_js.py: _is_node_alive() method + guards on 3 write methods + warning-only on connect
  • coordinator.py: backoff tracking in __init__, _connect_and_update_lock, and _update_lock_data
  • const.py: 3 backoff constants

Testing

Deployed to a Home Assistant Yellow (HAOS 17.1, HA 2026.2.3) with a Zooz ZAC93 800 Series controller, 14 Z-Wave nodes, and two Schlage Touchscreen Deadbolts managed by KeyMaster. After applying the fix:

  • KeyMaster loads cleanly with zero errors
  • Dead lock node generates zero Z-Wave write commands (previously ~15,000+)
  • KeyMaster entities remain available and show cached data when node is dead
  • Live lock (garage rear door) continues to sync code slots normally
  • When the dead lock's batteries were replaced, the backoff reset and KeyMaster resumed normal operation automatically

When a Z-Wave lock node goes dead, KeyMaster was continuing to send
commands every 15-60 seconds, generating thousands of failed Z-Wave
commands (ZW0204) that flood the mesh and degrade all other devices.

This fix adds three layers of protection:

1. Provider: Check node.status before every Z-Wave command. If the node
   is dead, return immediately without sending radio commands.

2. Provider: Gate async_connect/async_is_connected on node liveness.
   Dead nodes report as disconnected so the coordinator skips them.

3. Coordinator: Exponential backoff on consecutive connection failures.
   After 3 failures, back off from 60s up to 30 minutes between retries.
   Automatically resets when the lock reconnects.

The original _is_node_alive() guards were too aggressive: they blocked
async_connect() and async_get_usercodes(), which don't send Z-Wave
commands and can safely use cached data.

Changes:
- async_connect(): warn when node is dead but proceed with connection
  (cached data still accessible, backoff handles retries at coordinator)
- async_is_connected(): remove _is_node_alive() from connection check
  (node references remain valid even when dead)
- async_get_usercodes(): remove _is_node_alive() guard (reads cached
  usercode values from zwave-js-server, no Z-Wave RF commands sent)

Guards retained on actual Z-Wave write commands:
- async_refresh_usercode() (sends Z-Wave UserCode Get)
- async_set_usercode() (sends Z-Wave UserCode Set)
- async_clear_usercode() (sends Z-Wave UserCode Set)
@firstof9 left a comment

Please run ruff to fix formatting.

@codecov-commenter commented Mar 1, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 96.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.90%. Comparing base (cdb4922) to head (7bcfdaa).
⚠️ Report is 47 commits behind head on main.

Files with missing lines | Patch % | Lines
custom_components/keymaster/coordinator.py | 93.10% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #571      +/-   ##
==========================================
+ Coverage   84.14%   86.90%   +2.76%     
==========================================
  Files          10       25      +15     
  Lines         801     2811    +2010     
==========================================
+ Hits          674     2443    +1769     
- Misses        127      368     +241     
Flag Coverage Δ
python 86.90% <96.66%> (?)


- Fix import sorting in zwave_js.py (NodeStatus)
- Simplify conditional: usercode if usercode else None -> usercode or None
- Fix TRY300: move return from try to else block in _is_node_alive()
- Fix E225: remove spaces around power operator in coordinator.py
- Add 8 provider tests for dead-node detection (_is_node_alive, skip
  operations when dead, connect warns but proceeds)
- Add 6 coordinator tests for exponential backoff (failure tracking,
  threshold activation, skip during backoff, retry after expiry, counter
  reset on success, max cap)

@firstof9 commented Mar 1, 2026

Sometimes sending a ping to the node will bring it back to life. Perhaps adding in a ping request on the first failure?

On the first connection failure, ping the node (NoOperation CC) to try
to bring it back. Nodes can be falsely marked dead due to RF interference
or routing issues, and a single ping can recover them. If it works, the
next update cycle reconnects automatically.

Also applies ruff formatting fixes from prior review.

@richardpowellus (Author)

Great suggestion — you're right. I dug into the zwave-js source and the driver does not automatically attempt to recover dead nodes (no periodic re-ping). Recovery only happens if something externally pings the node or it spontaneously sends a message. Since false-dead scenarios are common (RF interference, routing issues, controller hiccups), a single ping attempt before backoff makes a lot of sense.

Added in f6de9ec — on the first connection failure, we ping the node. If it recovers, the next update cycle reconnects naturally. The zwave-js node.ping() already uses tryReallyHard=true for dead nodes (explorer frames, route discovery), so it gives the best chance of recovery with a single RF command.

@firstof9 left a comment

Looks like some tests need to be added for the new code.

- Add TestZWaveJSLockProviderPingNode (4 tests): no node, success,
  failure, and exception handling for async_ping_node
- Add TestBaseLockProviderPingNode (1 test): verify base class
  async_ping_node returns False by default

@richardpowellus (Author)

Hi @firstof9 — tests have been added for all new code:

Commit ac8197a (already on the PR):

  • TestZWaveJSLockProviderDeadNode — 8 tests covering _is_node_alive() (alive/dead/no-node/exception) and dead-node gating on async_refresh_usercode, async_set_usercode, async_clear_usercode, and async_connect
  • TestUpdateLockDataBackoff — 9 tests covering consecutive failure tracking, backoff activation after threshold, skip during backoff, retry after expiry, counter reset on success, max cap, and ping-on-first-failure behavior

Commit e2f3b13 (just pushed):

  • TestZWaveJSLockProviderPingNode — 4 tests covering async_ping_node(): no node, success, failure, exception
  • TestBaseLockProviderPingNode — 1 test verifying the base class default return False

All 395 tests pass (152 in the changed test files). Overall coverage is 81% (above the 80% threshold). All new code lines are covered.

@firstof9 commented Mar 3, 2026

Looks like there are some ruff errors to address. I'd prefer to see 100% patch test coverage if possible.

Thanks again for contributing!

@tykeal commented Mar 6, 2026

@richardpowellus please correct the ruff linting issues. You've got 100% coverage now from the looks of things, so it's just CI blocking on this.

@tykeal commented Mar 7, 2026

@richardpowellus you've dropped below 100% coverage of your change again. Please fix!

@tykeal tykeal added the bugfix Fixes a bug label Mar 9, 2026
… _update_child_code_slots

Adds tests for three previously-uncovered code paths that were
reformatted in this PR, bringing patch coverage to 100%:

- TestResetCodeSlot: exercises reset_code_slot success, lock-not-found,
  and slot-not-found paths
- TestUpdateSlotActiveState: exercises update_slot_active_state success,
  lock-not-found, and slot-not-found paths
- TestUpdateChildCodeSlotsSync: exercises _update_child_code_slots
  attribute sync, no-parent-slots early return, and override_parent guard

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Labels: bugfix (Fixes a bug)

4 participants