[PROF-12646] Update OTel process context support with latest version of reference library#414
[PROF-12646] Update OTel process context support with latest version of reference library#414
Conversation
…of reference library **What does this PR do?** This PR updates the vendored version of the OTel process context support with the latest code from open-telemetry/sig-profiling#79 (which will soon be merged upstream). **Motivation:** This latest version of the library fixes a number of small issues, but more importantly includes support for the thread context configuration keys which we need for #347 (so we will be able to shed a bunch of code once we rebase on top of this one). **Additional Notes:** From the Java API consumer side, not a lot changes with this version. **How to test the change?** Validate tests still pass!
There was a problem hiding this comment.
The changes from this file are a bit big but all comes from upstream -- same sha1sum as the PR I linked in the description :)
Scan-Build Report
Bug Summary
Reports
|
||||||||||||||||||||||||||||||||||||
CI Test ResultsRun: #23251054896 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Summary: Total: 32 | Passed: 32 | Failed: 0 Updated: 2026-03-18 15:10:49 UTC |
Ooops I guess too old libc; I'll add a fallback |
I'll make sure to propagate this to the upstream version as well, so we don't need to keep maintaining this separately.
TL;DR if there were repeated keys when parsing the data in the testing code we could leak their values. This matches the latest version upstream in open-telemetry/sig-profiling#79 .
jbachorik
left a comment
There was a problem hiding this comment.
I have a few nits and some comments where I am looking for clarification before approving.
| uint64_t otel_process_monotonic_published_at_ns; // Timestamp from when the context was published in nanoseconds from CLOCK_BOOTTIME. 0 during updates. | ||
| char *otel_process_payload; // Always non-null, points to the storage for the data; expected to be a protobuf map of string key/value pairs, null-terminated |
There was a problem hiding this comment.
Nit: This formatting looks funny
|
|
||
| if (monotonic_published_at_ns == published_state.mapping->otel_process_monotonic_published_at_ns) { | ||
| // Advance published_at_ns to allow readers to detect the update | ||
| monotonic_published_at_ns++; |
There was a problem hiding this comment.
Is the clock value used just as a monotonic counter seeded by the system time?
There was a problem hiding this comment.
In practice, yes, although readers can use it to get a timestamp of how long ago the context was published (because they can also get the same clock and compare timestamps)
| .deployment_environment_name = env_str.c_str(), | ||
| .service_instance_id = runtime_id_str.c_str(), | ||
| .service_name = service_str.c_str(), | ||
| .service_version = version_str.c_str(), | ||
| .telemetry_sdk_language = "java", | ||
| .telemetry_sdk_version = tracer_version_str.c_str(), | ||
| .telemetry_sdk_name = "dd-trace-java", | ||
| .resource_attributes = host_name_attrs, | ||
| .extra_attributes = NULL, | ||
| .thread_ctx_config = NULL |
There was a problem hiding this comment.
Is ProcessContextText testing also the newly added stuff?
There was a problem hiding this comment.
At this point, the only new bits are the extra_attributes and the thread_ctx_config but these are not yet exposed to the Java code so there's nothing to test yet.
Once we wire those two to be set from Java (if needed/ever), we'll need to update the test.
(I can do that now if you prefer, but since it's not clear when we'll use them I was doing a YAGNI thing :D)
| // Try to create mapping from memfd | ||
| if (ftruncate(fd, mapping_size) == -1) { | ||
| otel_process_ctx_drop_current(); | ||
| return (otel_process_ctx_result) {.success = false, .error_message = "Failed to truncate memfd (" __FILE__ ":" ADD_QUOTES(__LINE__) ")"}; |
There was a problem hiding this comment.
Do you need to close(fd) here to avoid leaking it?
| #include <stdlib.h> | ||
| #include <string.h> | ||
| #include <sys/mman.h> | ||
| #include <sys/prctl.h> |
There was a problem hiding this comment.
This one does not exist on macos - probably the easiest would be to ifdef the include. Alternative would be to follow the pattern in this repo and have otel_process_ctx_linux.cpp and otel_process_ctx_macos.cpp with the later one being virtually empty.
There was a problem hiding this comment.
Arghh yeah. I've tweaked it in b8c7d1e to not even include the headers with the no-op build. Because we have testing on the other repo with both kinds of builds, this should make sure that we don't accidentally break the no-op build again.
The macos file is indeed an option, but I'm somewhat trying to avoid it since I'm trying to keep it easier to embed the library by having just the one .h/.c. Once more files are in play I think it complicates the lives of library consumers...
| *ptr++ = total_pairs & 0xFF; // low byte of count | ||
| // ProcessContext.resource (field 1) | ||
| write_protobuf_tag(&ptr, 1); | ||
| write_protobuf_varint(&ptr, resource_size); |
There was a problem hiding this comment.
resource_size should probably checked to fit in UINT14_MAX
There was a problem hiding this comment.
Yes, very correctly spotted out! I did a full pass to see if we missed a few more and found another. Fixed in 19fe5dc
| free(published_state.payload); // This was still pointing to the old payload | ||
| published_state.payload = payload; |
There was a problem hiding this comment.
This looks like we might race here - the reader might have obtained the payload before monotonic_published_at_ns has been zeroed.
And at this point it might be still reading from it?
There was a problem hiding this comment.
Yes -- this is acknowledged in the spec and is somewhat intentional (to avoid heavier synchronization).
Readers SHOULD gracefully handle missing, incomplete, or invalid mappings. If a process does not publish context or if decoding fails, readers SHOULD fall back to default resource detection mechanisms.
We recommend the use of APIs such as
process_vm_readvor equivalent to read data, as they allow gracefully handling of issues inherent to racy cross-process memory access.
So the intention is that the reader will be on the look out for the read possibly failing or being incorrect -- only after making sure it has a consistent read (monotonic_published_at_ns was the same before/after the read) does it know it can use the data it got.
(In general the eBPF Profiler is already taking care of these kinds of sharp edges -- there's a bunch of other situations e.g. for unwinding where it's racing the app and can see intermediate/inconsistent data)
In practice I'm not sure `ftruncate` could fail unless the kernel is already on fire but no reason not to fix the leak anyway.
We were accidentally importing at least one Linux-specific header, which made NOOP not work. To make sure NOOP keeps working, let's move everything it doesn't need inside the ifdef -- this way it'll be very obvious what's being compiled for NOOP and what's not.
Really well-spotted by @jbachorik .
What does this PR do?
This PR updates the vendored version of the OTel process context support with the latest code from
open-telemetry/sig-profiling#79 (which will soon be merged upstream).
Motivation:
This latest version of the library fixes a number of small issues, but more importantly includes support for the thread context configuration keys which we need for #347 (so we will be able to shed a bunch of code once we rebase on top of this one).
Additional Notes:
From the Java API consumer side, not a lot changes with this version.
How to test the change?
Validate tests still pass!
For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.