Skip to content

[PROF-12646] Update OTel process context support with latest version of reference library#414

Open
ivoanjo wants to merge 9 commits intomainfrom
ivoanjo/update-process-context
Open

[PROF-12646] Update OTel process context support with latest version of reference library#414
ivoanjo wants to merge 9 commits intomainfrom
ivoanjo/update-process-context

Conversation

@ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Mar 11, 2026

What does this PR do?

This PR updates the vendored version of the OTel process context support with the latest code from
open-telemetry/sig-profiling#79 (which will soon be merged upstream).

Motivation:

This latest version of the library fixes a number of small issues, but more importantly includes support for the thread context configuration keys which we need for #347 (so we will be able to shed a bunch of code once we rebase on top of this one).

Additional Notes:

From the Java API consumer side, not a lot changes with this version.

How to test the change?

Validate tests still pass!

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: PROF-12646

…of reference library

**What does this PR do?**

This PR updates the vendored version of the OTel process context
support with the latest code from
open-telemetry/sig-profiling#79 (which will soon
be merged upstream).

**Motivation:**

This latest version of the library fixes a number of small issues, but
more importantly includes support for the thread context configuration
keys which we need for #347
(so we will be able to shed a bunch of code once we rebase on top of
this one).

**Additional Notes:**

From the Java API consumer side, not a lot changes with this version.

**How to test the change?**

Validate tests still pass!
@ivoanjo ivoanjo requested a review from a team as a code owner March 11, 2026 16:20
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes from this file are a bit big but all comes from upstream -- same sha1sum as the PR I linked in the description :)

@dd-octo-sts
Copy link

dd-octo-sts bot commented Mar 11, 2026

Scan-Build Report

User:runner@runnervm0kj6c
Working Directory:/home/runner/work/java-profiler/java-profiler/ddprof-lib/src/test/make
Command Line:make -j4 all
Clang Version:Ubuntu clang version 18.1.3 (1ubuntu1)
Date:Wed Mar 11 17:51:58 2026

Bug Summary

Bug TypeQuantityDisplay?
All Bugs1
Memory error
Memory leak1

Reports

Bug Group Bug Type ▾ File Function/Method Line Path Length
Memory errorMemory leakotel_process_ctx.cppotel_process_ctx_decode_payload69343

@dd-octo-sts
Copy link

dd-octo-sts bot commented Mar 11, 2026

CI Test Results

Run: #23251054896 | Commit: c800bbf | Duration: 11m 0s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-03-18 15:10:49 UTC

@ivoanjo
Copy link
Member Author

ivoanjo commented Mar 11, 2026

Execution failed for task ':ddprof-lib:compileDebug'.
> Failed to compile otel_process_ctx.cpp: exit code 1
  /home/runner/work/java-profiler/java-profiler/ddprof-lib/src/main/cpp/otel_process_ctx.cpp:148:71: error: use of undeclared identifier 'MFD_NOEXEC_SEAL'
    int fd = memfd_create("OTEL_CTX", MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_NOEXEC_SEAL);

Ooops I guess too old libc; I'll add a fallback #define for it.

ivoanjo added 4 commits March 11, 2026 17:28
I'll make sure to propagate this to the upstream version as well, so
we don't need to keep maintaining this separately.
TL;DR if there were repeated keys when parsing the data in the testing
code we could leak their values.

This matches the latest version upstream in
open-telemetry/sig-profiling#79 .
Copy link
Collaborator

@jbachorik jbachorik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few nits and some comments where I am looking for clarification before approving.

Comment on lines +87 to +88
uint64_t otel_process_monotonic_published_at_ns; // Timestamp from when the context was published in nanoseconds from CLOCK_BOOTTIME. 0 during updates.
char *otel_process_payload; // Always non-null, points to the storage for the data; expected to be a protobuf map of string key/value pairs, null-terminated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This formatting looks funny

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d3fd5b3


if (monotonic_published_at_ns == published_state.mapping->otel_process_monotonic_published_at_ns) {
// Advance published_at_ns to allow readers to detect the update
monotonic_published_at_ns++;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the clock value used just as a monotonic counter seeded by the system time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, yes, although readers can use it to get a timestamp of how long ago the context was published (because they can also get the same clock and compare timestamps)

Comment on lines +463 to +472
.deployment_environment_name = env_str.c_str(),
.service_instance_id = runtime_id_str.c_str(),
.service_name = service_str.c_str(),
.service_version = version_str.c_str(),
.telemetry_sdk_language = "java",
.telemetry_sdk_version = tracer_version_str.c_str(),
.telemetry_sdk_name = "dd-trace-java",
.resource_attributes = host_name_attrs,
.extra_attributes = NULL,
.thread_ctx_config = NULL
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ProcessContextText testing also the newly added stuff?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, the only new bits are the extra_attributes and the thread_ctx_config but these are not yet exposed to the Java code so there's nothing to test yet.

Once we wire those two to be set from Java (if needed/ever), we'll need to update the test.

(I can do that now if you prefer, but since it's not clear when we'll use them I was doing a YAGNI thing :D)

// Try to create mapping from memfd
if (ftruncate(fd, mapping_size) == -1) {
otel_process_ctx_drop_current();
return (otel_process_ctx_result) {.success = false, .error_message = "Failed to truncate memfd (" __FILE__ ":" ADD_QUOTES(__LINE__) ")"};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to close(fd) here to avoid leaking it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely spotted! Fixed in ccf7ed4

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one does not exist on macos - probably the easiest would be to ifdef the include. Alternative would be to follow the pattern in this repo and have otel_process_ctx_linux.cpp and otel_process_ctx_macos.cpp with the later one being virtually empty.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arghh yeah. I've tweaked it in b8c7d1e to not even include the headers with the no-op build. Because we have testing on the other repo with both kinds of builds, this should make sure that we don't accidentally break the no-op build again.

The macos file is indeed an option, but I'm somewhat trying to avoid it since I'm trying to keep it easier to embed the library by having just the one .h/.c. Once more files are in play I think it complicates the lives of library consumers...

*ptr++ = total_pairs & 0xFF; // low byte of count
// ProcessContext.resource (field 1)
write_protobuf_tag(&ptr, 1);
write_protobuf_varint(&ptr, resource_size);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resource_size should probably checked to fit in UINT14_MAX

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, very correctly spotted out! I did a full pass to see if we missed a few more and found another. Fixed in 19fe5dc

Comment on lines +291 to +292
free(published_state.payload); // This was still pointing to the old payload
published_state.payload = payload;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like we might race here - the reader might have obtained the payload before monotonic_published_at_ns has been zeroed.
And at this point it might be still reading from it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- this is acknowledged in the spec and is somewhat intentional (to avoid heavier synchronization).

Quoting https://github.com/open-telemetry/opentelemetry-specification/pull/4719/changes#diff-c02a091b4a954164a90867c19771116b26d717419c16b715bce8b56d80ba49edR161 :

Readers SHOULD gracefully handle missing, incomplete, or invalid mappings. If a process does not publish context or if decoding fails, readers SHOULD fall back to default resource detection mechanisms.

We recommend the use of APIs such as process_vm_readv or equivalent to read data, as they allow gracefully handling of issues inherent to racy cross-process memory access.

So the intention is that the reader will be on the look out for the read possibly failing or being incorrect -- only after making sure it has a consistent read (monotonic_published_at_ns was the same before/after the read) does it know it can use the data it got.

(In general the eBPF Profiler is already taking care of these kinds of sharp edges -- there's a bunch of other situations e.g. for unwinding where it's racing the app and can see intermediate/inconsistent data)

ivoanjo added 4 commits March 18, 2026 14:55
In practice I'm not sure `ftruncate` could fail unless the kernel is
already on fire but no reason not to fix the leak anyway.
We were accidentally importing at least one Linux-specific header,
which made NOOP not work.

To make sure NOOP keeps working, let's move everything it doesn't need
inside the ifdef -- this way it'll be very obvious what's being compiled
for NOOP and what's not.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants