sd: defer SUBSCRIBE_NACK when TCP connection not yet committed (#971) by aki1770-del · Pull Request #1025 · COVESA/vsomeip

aki1770-del · 2026-04-02T22:37:25Z

Problem

With ≥2 io_context threads, accept_cbk (which writes the new TCP connection into connections_) and handle_eventgroup_subscription (which reads connections_ via is_tcp_connected) run on different threads with no ordering guarantee. If the UDP SUBSCRIBE arrives before accept_cbk finishes committing the connection, is_tcp_connected returns false and a spurious SUBSCRIBE_NACK is sent — even though the TCP three-way handshake already completed.

Maintainer confirmed: "It rarely happened."

Race window:

Thread A: accept_cbk — blocked at socket-option setup (connections_ not yet written)
Thread B: handle_eventgroup_subscription — is_tcp_connected reads connections_ → empty → NACK
Thread A: accept_cbk completes, writes connections_[remote] — too late

connections_mutex_ prevents map corruption but enforces no happens-before between the two async paths.

Fix

When is_tcp_connected returns false for a TTL > 0 subscription, park the subscription in pending_tcp_subscriptions_ with a 100 ms steady_timer instead of immediately NACKing.

When the timer fires, retry_pending_tcp_subscription re-checks is_tcp_connected:

Connection now present → call handle_eventgroup_subscription normally → ACK sent
Still absent after 100 ms → genuine failure → definitive NACK sent

The 100 ms window covers TCP handshake + kernel accept + accept_cbk dispatch latency by a large margin. A duplicate SUBSCRIBE for the same (service, instance, eventgroup, address, port) tuple cancels the previous timer before arming a new one.

Scope

2 files, SD layer only — no cross-layer dependency introduced:

implementation/service_discovery/include/service_discovery_impl.hpp — pending_tcp_sub_t struct + retry_pending_tcp_subscription declaration + pending_tcp_subscriptions_ map + mutex
implementation/service_discovery/src/service_discovery_impl.cpp — deferred-NACK path at both is_tcp_connected call sites + retry_pending_tcp_subscription implementation

Behavior

Scenario	Before	After
Single io thread (no race possible)	NACK if not connected	Same — `accept_cbk` always runs before SUBSCRIBE on one thread
Race window (≥2 threads)	Spurious NACK	100 ms defer → retry → ACK on connection present
Genuine connection failure	NACK	NACK after 100 ms (within SD retry interval of ~3 s)
Duplicate SUBSCRIBE during defer	Second NACK	Previous timer cancelled, new timer armed

Fixes #971.

…OVESA#971) With >=2 io_context threads, accept_cbk (which writes the new TCP connection into connections_) and handle_eventgroup_subscription (which reads connections_ via is_tcp_connected) can execute concurrently with no ordering guarantee. If the SUBSCRIBE arrives over UDP before accept_cbk commits the connection, is_tcp_connected returns false and a spurious SUBSCRIBE_NACK is emitted even though the TCP handshake already completed. Fix: when is_tcp_connected returns false for a TTL>0 subscription, park the subscription in pending_tcp_subscriptions_ with a 100 ms steady_timer instead of immediately NACKing. When the timer fires, retry is_tcp_connected: if the connection is now present, complete the subscription normally; if still absent after 100 ms, send the definitive NACK. The 100 ms window covers the TCP handshake + kernel accept + accept_cbk dispatch latency by a large margin. connections_mutex_ already prevents map corruption; this fix closes the logical TOCTOU ordering gap without adding cross-layer dependencies. The change is self-contained in the SD layer (2 files). Fixes COVESA#971.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sd: defer SUBSCRIBE_NACK when TCP connection not yet committed (#971)#1025

sd: defer SUBSCRIBE_NACK when TCP connection not yet committed (#971)#1025
aki1770-del wants to merge 1 commit intoCOVESA:masterfrom
aki1770-del:fix/vsomeip-tcp-accept-subscription-race-971

aki1770-del commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aki1770-del commented Apr 2, 2026

Problem

Fix

Scope

Behavior

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant