Skip to content

sd: defer SUBSCRIBE_NACK when TCP connection not yet committed (#971)#1025

Open
aki1770-del wants to merge 1 commit intoCOVESA:masterfrom
aki1770-del:fix/vsomeip-tcp-accept-subscription-race-971
Open

sd: defer SUBSCRIBE_NACK when TCP connection not yet committed (#971)#1025
aki1770-del wants to merge 1 commit intoCOVESA:masterfrom
aki1770-del:fix/vsomeip-tcp-accept-subscription-race-971

Conversation

@aki1770-del
Copy link
Copy Markdown

Problem

With ≥2 io_context threads, accept_cbk (which writes the new TCP connection into connections_) and handle_eventgroup_subscription (which reads connections_ via is_tcp_connected) run on different threads with no ordering guarantee. If the UDP SUBSCRIBE arrives before accept_cbk finishes committing the connection, is_tcp_connected returns false and a spurious SUBSCRIBE_NACK is sent — even though the TCP three-way handshake already completed.

Maintainer confirmed: "It rarely happened."

Race window:

Thread A: accept_cbk — blocked at socket-option setup (connections_ not yet written)
Thread B: handle_eventgroup_subscription — is_tcp_connected reads connections_ → empty → NACK
Thread A: accept_cbk completes, writes connections_[remote] — too late

connections_mutex_ prevents map corruption but enforces no happens-before between the two async paths.

Fix

When is_tcp_connected returns false for a TTL > 0 subscription, park the subscription in pending_tcp_subscriptions_ with a 100 ms steady_timer instead of immediately NACKing.

When the timer fires, retry_pending_tcp_subscription re-checks is_tcp_connected:

  • Connection now present → call handle_eventgroup_subscription normally → ACK sent
  • Still absent after 100 ms → genuine failure → definitive NACK sent

The 100 ms window covers TCP handshake + kernel accept + accept_cbk dispatch latency by a large margin. A duplicate SUBSCRIBE for the same (service, instance, eventgroup, address, port) tuple cancels the previous timer before arming a new one.

Scope

2 files, SD layer only — no cross-layer dependency introduced:

  • implementation/service_discovery/include/service_discovery_impl.hpppending_tcp_sub_t struct + retry_pending_tcp_subscription declaration + pending_tcp_subscriptions_ map + mutex
  • implementation/service_discovery/src/service_discovery_impl.cpp — deferred-NACK path at both is_tcp_connected call sites + retry_pending_tcp_subscription implementation

Behavior

Scenario Before After
Single io thread (no race possible) NACK if not connected Same — accept_cbk always runs before SUBSCRIBE on one thread
Race window (≥2 threads) Spurious NACK 100 ms defer → retry → ACK on connection present
Genuine connection failure NACK NACK after 100 ms (within SD retry interval of ~3 s)
Duplicate SUBSCRIBE during defer Second NACK Previous timer cancelled, new timer armed

Fixes #971.

…OVESA#971)

With >=2 io_context threads, accept_cbk (which writes the new TCP
connection into connections_) and handle_eventgroup_subscription (which
reads connections_ via is_tcp_connected) can execute concurrently with
no ordering guarantee. If the SUBSCRIBE arrives over UDP before
accept_cbk commits the connection, is_tcp_connected returns false and
a spurious SUBSCRIBE_NACK is emitted even though the TCP handshake
already completed.

Fix: when is_tcp_connected returns false for a TTL>0 subscription,
park the subscription in pending_tcp_subscriptions_ with a 100 ms
steady_timer instead of immediately NACKing. When the timer fires,
retry is_tcp_connected: if the connection is now present, complete the
subscription normally; if still absent after 100 ms, send the definitive
NACK. The 100 ms window covers the TCP handshake + kernel accept +
accept_cbk dispatch latency by a large margin.

connections_mutex_ already prevents map corruption; this fix closes
the logical TOCTOU ordering gap without adding cross-layer dependencies.
The change is self-contained in the SD layer (2 files).

Fixes COVESA#971.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: <title> Race conditon between tcp_server_enpoint_impl::accept_cbk, handle_eventgroup_subscription

1 participant