[system_tests] Fix flaky test_app_started_is_first_message by using seq_id ordering#6387
Conversation
|
|
eba115d to
c14cbfb
Compare
c14cbfb to
a418fd7
Compare
There was a problem hiding this comment.
From framework usage, AGTM.
But about the test logic, I'm a little bit concerned by the fact we are using seq_id as source of truth. The previous logic was using the capture order, and in consequence, there is an assertion that seq_id is coherent with that order.
I understand that at some point we may hit some painful non-deterministic behavior. But on the other hand, this assertion may be quite useful : are we 100% that a incoherent seq_id order has no effect on backend ?
Is there a way to keep a certain degree of assertion on this order?
This point is not my decision to make, but do you mind getting a review from someone familiar with the feature, to ensure the new test logic does not open the door to a downgrade of those assertions ?
Thanks @cbeauchesne for catching this potential issue. I'm not sure who the feature owner might be: |
Previously, we reset the seq_id whenever a process forked. Because of that, we had to rely on capture order to estimate when a payload was generated. This worked since payloads were sent in FIFO order, but as of a few weeks ago, that’s no longer the case (a child process can now submit telemetry before it's parent and this is the source of the current flakiness). We recently changed this behavior so that seq_id is no longer reset on fork. As a result, we can now reliably use seq_id to determine which telemetry payload was generated first. |
My question is about this point : seq_id now become the source of truth, rather timestamps. In consequence, we don't test it. So it means that if at some point someone breaks this behavior (for instance, an error is pushed that reset again Just to be clear, I'm not opposing to this change, but I just wanted to be sure that it's an educated choice, and not a new loophole. |
https://datadoghq.atlassian.net/browse/PROF-13796
Motivation
test_telemetry.py::Test_Telemetry::test_app_started_is_first_messagewas flaky (e.g. 8 hits in dd-trace-py main CI, 2026). The test asserted thatapp-startedis the first telemetry message, but it used capture order (log file order) instead of send order (seq_id). Capture order can be non-deterministic—when the first captured batch is not the first sent, the test fails.Changes
test_app_started_is_first_message: Useseq_idas the source of truth for ordering. Flatten all lifecycle telemetry (single messages + message-batch payloads), skip non-lifecycle types (sketches,generate-metrics,logs,distributions), group byruntime_id, sort by(seq_id, batch_index), and assert per runtime that the first message isapp-started. This removes dependence on capture/file order.Reviewer checklist
tests/ormanifests/is modified ? I have the approval from R&P teambuild-XXX-imagelabel is present