FFE: add delivery delay test proving Agent bypass gap #6396

Draft
leoromanovsky wants to merge 1 commit into main from worktree-ffe-fast-lane
@leoromanovsky
Contributor

Motivation

Datadog's feature flagging SDK (FFE) relies on Remote Configuration to deliver flag definitions to tracers. In production, tracers wait 60-130+ seconds at startup before flag configuration arrives. The root cause is a structural mismatch in the Agent: the RC cache bypass fires when a new client connects, but not when an existing client registers a new product. All tracer products (APM, ASM, FFE) share a single RC client. By the time FFE_FLAGS is registered, the client is already active -- no bypass fires, and the Agent only discovers FFE_FLAGS on its next background poll (50-60s later).
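The mismatch can be illustrated with a toy model. This is a sketch of the behavior described above, not the Agent's code: the class, method, and attribute names are hypothetical, and the product labels are just the ones used in this description.

```python
class ToyAgentRCCache:
    """Toy model of the bypass rule described above (all names hypothetical)."""

    def __init__(self):
        self.active_clients = set()
        self.fetched_products = set()

    def handle_client_request(self, client_id, products):
        """Return True if this request triggers an immediate backend fetch."""
        if client_id not in self.active_clients:
            # New client: the bypass fires and the Agent fetches right away
            # for whatever products the client declares at this moment.
            self.active_clients.add(client_id)
            self.fetched_products |= set(products)
            return True
        # Known client: no bypass, even if the product list has grown.
        return False

    def background_poll(self, products):
        """Periodic refresh: the only path that picks up late-registered products."""
        self.fetched_products |= set(products)


agent = ToyAgentRCCache()
first = agent.handle_client_request("tracer-1", {"APM", "ASM"})
second = agent.handle_client_request("tracer-1", {"APM", "ASM", "FFE_FLAGS"})
ffe_known_before_poll = "FFE_FLAGS" in agent.fetched_products
agent.background_poll({"APM", "ASM", "FFE_FLAGS"})
ffe_known_after_poll = "FFE_FLAGS" in agent.fetched_products
```

In this model the second request returns `False`, so `FFE_FLAGS` is only fetched once `background_poll` runs, which mirrors the delay described above.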

This PR adds a system-test that codifies this delay so we can prove the problem exists today, and later prove it is fixed when the Agent ships a product-aware bypass.

Changes

  • containers.py: Changed the Agent RC refresh interval from a hard assignment to setdefault, so scenarios can override it via agent_env. Existing debugger scenarios are unaffected (they do not pass this key).
  • __init__.py: Added a new FEATURE_FLAGGING_AND_EXPERIMENTATION_BACKEND scenario. Same as the existing FFE scenario but with rc_backend_enabled=True (configs flow through the real Agent instead of being mocked by the proxy) and a 60s Agent refresh interval to match real customer defaults. Tracer polls every 1s.
  • test_dynamic_evaluation.py: Added Test_FFE_RC_Delivery_Delay. Posts an FFE_FLAGS config to the mocked backend, sleeps 5 seconds, then asserts the tracer has NOT received it. With a 60s Agent refresh and no product-aware bypass, the Agent has not polled the backend yet, so the tracer gets nothing.
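The containers.py change comes down to the difference between assigning a key and calling `dict.setdefault`. A minimal sketch follows, assuming a placeholder key name: `AGENT_RC_REFRESH_INTERVAL` and `build_agent_environment` are hypothetical stand-ins, not the identifiers used in containers.py.

```python
# Placeholder key; the real environment variable name is not shown in this PR text.
REFRESH_KEY = "AGENT_RC_REFRESH_INTERVAL"


def build_agent_environment(agent_env=None):
    """Merge scenario-provided env with defaults without clobbering overrides."""
    environment = dict(agent_env or {})
    # Before: environment[REFRESH_KEY] = "5s" would overwrite any scenario value.
    # After: setdefault only fills the key when the scenario did not set it.
    environment.setdefault(REFRESH_KEY, "5s")
    return environment


default_env = build_agent_environment()
backend_env = build_agent_environment({REFRESH_KEY: "60s"})
```

Scenarios that do not pass the key still get the 5s value, while the new scenario keeps its 60s override.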

Decisions

  • 60s Agent refresh, not 1s: The whole point is to use the default customer configuration. With a fast refresh, FFE_FLAGS arrives within 5s through the background poll alone, so the test cannot tell whether a bypass fired. With 60s, the only way FFE_FLAGS can arrive in under 5s is if the bypass fires. This makes the test a proof by contradiction.
  • Negative assertion (config NOT delivered): Today with a stock Agent this test should pass -- FFE_FLAGS is not delivered within 5s. When the Agent ships the product-aware bypass fix, we flip the assertion to assert delivery and the test becomes a regression gate.
  • Separate scenario instead of modifying existing: The existing FEATURE_FLAGGING_AND_EXPERIMENTATION scenario uses proxy-mocked RC (no Agent in the loop). We need rc_backend_enabled=True to exercise the real Agent RC path, which requires a separate scenario.
  • setdefault instead of a new constructor parameter: Minimal change. The agent_env dict is already passed through; we just needed to stop the hardcoded 5s from clobbering it.
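The negative assertion and its later flip can be sketched with a small helper. This is an illustration only: `delivered_within` and `get_applied_ids` are hypothetical names, the helper polls instead of using the single 5-second sleep the test uses, and the clock and sleep functions are injectable so the logic can be checked without waiting.

```python
import time


def delivered_within(get_applied_ids, config_id, window_s,
                     interval_s=0.5, clock=time.monotonic, sleep=time.sleep):
    """Return True if config_id shows up in get_applied_ids() within window_s."""
    deadline = clock() + window_s
    while True:
        if config_id in get_applied_ids():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Fake time so the example runs instantly.
now = [0.0]


def fake_clock():
    return now[0]


def fake_sleep(seconds):
    now[0] += seconds


# Today (no product-aware bypass): nothing arrives inside the 5s window.
not_delivered = delivered_within(lambda: set(), "ffe-config", 5,
                                 clock=fake_clock, sleep=fake_sleep)

# After a hypothetical fix: the config shows up 2s in, well inside the window.
now[0] = 0.0
delivered = delivered_within(lambda: {"ffe-config"} if now[0] >= 2 else set(),
                             "ffe-config", 5, clock=fake_clock, sleep=fake_sleep)
```

With a helper of this shape, today's test would assert `not delivered_within(...)`, and flipping it to a regression gate would mean dropping the `not`.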

Adds a backend-mode FFE scenario and a test that demonstrates FFE_FLAGS
is not delivered to the tracer within 5 seconds when the Agent uses a
60s background refresh interval. This codifies the root cause described
in the RC fast-lane proposal: the Agent's cache bypass fires on new
clients, not new products, so FFE_FLAGS sits undelivered until the next
background poll.
@github-actions
Contributor

github-actions bot commented Mar 1, 2026

CODEOWNERS have been resolved as:

tests/ffe/test_dynamic_evaluation.py                                    @DataDog/feature-flagging-and-experimentation-sdk @DataDog/system-tests-core
utils/_context/_scenarios/__init__.py                                   @DataDog/system-tests-core
utils/_context/containers.py                                            @DataDog/system-tests-core
