
feat(openfeature): add flag evaluation tracking via OTel Metrics #4489

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 10 commits into main from leo.romanovsky/flageval-metrics on Mar 3, 2026

Conversation

leoromanovsky (Contributor) commented Mar 2, 2026

Motivation

Per the RFC "Flag evaluations tracking for APM tracers" (Oleksii Shmalko, 2026-01-20): we want to collect a metric for flag evaluations to track usage of flags. This data powers the FFE product's change tracking ("which services evaluate this flag?") and usage analytics.

The RFC evaluated six alternatives (including tracer-side aggregation via EVP, custom agent aggregation, Metrics Platform, reuse of the agent pipeline with a custom intake, and OTel Events) and recommends the Metrics Platform approach: implement flag evaluation tracking as regular custom metrics sent via the OpenTelemetry Metrics API. Tracers aggregate metrics via OTel, send the aggregated metrics to the agent via OTLP, and the agent forwards them to Metrics Platform. This approach requires the lowest SDK-team effort, no backend or agent changes, and performs well.

Key RFC constraints:

  • No high-cardinality attributes (targeting key, evaluation context) — each unique attribute combination creates a custom metric, increasing load and cost
  • Independent from exposure events — exposures are per-subject deduplicated events that are already implemented; evaluation metrics are aggregate counts
  • Sampling is OK — since pricing shifted from charging per-evaluation to charging per-configuration request, we don't need exact counts

Changes

  • New openfeature/flageval_metrics.go: Creates a dedicated MeterProvider via dd-trace-go's OTel metrics support (ddmetric.NewMeterProvider()). Defines an Int64Counter instrument (feature_flag.evaluations, delta temporality, 10s export interval). Provides record() to emit the metric with attributes feature_flag.key, feature_flag.result.variant, feature_flag.result.reason, and error.type (on error); error classification uses a declarative errorTypeTags map from sentinel errors to low-cardinality strings. A sketch of this shape follows the list.
  • Modified openfeature/provider.go: Added flagEvalMetrics field to DatadogProvider. Wired into newDatadogProvider() (creates metrics on init), evaluate() (records metric via defer after every evaluation, reason lowercased directly from OpenFeature constants), and ShutdownWithContext() (graceful meter provider shutdown).
  • New openfeature/flageval_metrics_test.go: Table-driven unit tests using OTel SDK ManualReader for in-memory metric collection. Covers success/error/default/disabled attributes, multiple evaluations aggregation, different flag series, all error types, and integration with evaluate().
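
A minimal sketch of the shape described above, using only the standard OTel metric API. The ddmetric.NewMeterProvider() wiring is omitted, and the newFlagEvalMetrics constructor and the record() signature here (plain strings plus an error-type tag) are illustrative assumptions — the actual code may differ:

package openfeature

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// flagEvalMetrics holds the counter instrument; in the PR the provider
// comes from ddmetric.NewMeterProvider() (delta temporality, 10s export).
type flagEvalMetrics struct {
	evaluations metric.Int64Counter
}

func newFlagEvalMetrics(mp metric.MeterProvider) (*flagEvalMetrics, error) {
	counter, err := mp.Meter("openfeature").Int64Counter("feature_flag.evaluations")
	if err != nil {
		return nil, err
	}
	return &flagEvalMetrics{evaluations: counter}, nil
}

// record emits one evaluation with low-cardinality attributes only;
// errType is empty on success.
func (m *flagEvalMetrics) record(ctx context.Context, key, variant, reason, errType string) {
	attrs := []attribute.KeyValue{
		attribute.String("feature_flag.key", key),
		attribute.String("feature_flag.result.variant", variant),
		attribute.String("feature_flag.result.reason", reason),
	}
	if errType != "" {
		attrs = append(attrs, attribute.String("error.type", errType))
	}
	m.evaluations.Add(ctx, 1, metric.WithAttributes(attrs...))
}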

Decisions

  • OTel Metrics (Metrics Platform path): Per RFC recommendation. Lowest SDK effort, no agent/backend changes needed, no custom aggregation code — the OTel SDK handles it all.
  • Dedicated MeterProvider: Self-contained; works without requiring the user to set up OTel separately. Returns noop if DD_METRICS_OTEL_ENABLED is not true — zero overhead when disabled.
  • 10s export interval: Matches the flush cadence of the EVP track implementations (iOS/Unity) for responsive tracking data. A sketch of an equivalent setup with the plain OTel SDK follows the list.
  • Low-cardinality attributes only: feature_flag.key, feature_flag.result.variant, feature_flag.result.reason, error.type. High-cardinality attributes (targeting_key, context, allocation) explicitly excluded per RFC to avoid blowing up custom metric cardinality. feature_flag.provider.name also excluded — always "Datadog", adds no value.
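
For context, here is how such a dedicated provider could be assembled with the plain OTel SDK — 10s periodic export, delta temporality, endpoint taken from OTEL_EXPORTER_OTLP_METRICS_ENDPOINT. The PR instead calls ddmetric.NewMeterProvider(), which presumably encapsulates equivalent wiring, so treat this as a sketch, not the PR's code:

package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func newMeterProvider(ctx context.Context) (*sdkmetric.MeterProvider, error) {
	// The OTLP HTTP exporter reads OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
	// from the environment by default.
	exporter, err := otlpmetrichttp.New(ctx,
		// Delta temporality, matching the instrument configuration above.
		otlpmetrichttp.WithTemporalitySelector(
			func(sdkmetric.InstrumentKind) metricdata.Temporality {
				return metricdata.DeltaTemporality
			}),
	)
	if err != nil {
		return nil, err
	}
	return sdkmetric.NewMeterProvider(
		// 10s export cadence, matching the EVP track flush interval.
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter,
			sdkmetric.WithInterval(10*time.Second))),
	), nil
}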

Enabling OTLP in production / dogfooding

The following is needed on the deployment side to receive these metrics:

  1. On the app: Set DD_METRICS_OTEL_ENABLED=true and OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://<agent-host>:4318/v1/metrics
  2. On the Datadog Agent: Enable the OTLP HTTP receiver. The env var DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT doesn't properly nest the config — you need to mount a datadog.yaml with the nested YAML structure:
    otlp_config:
      receiver:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
  3. On macOS Docker Desktop: The agent container needs pid: host to avoid "failed to register process metrics: process does not exist", which crashes the OTLP pipeline. (A consolidated docker-compose sketch follows.)
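
A hypothetical docker-compose fragment tying the three steps together; the service names, the image tag, and the ./datadog.yaml path are illustrative, not taken from the dogfooding repo:

services:
  app-go:
    build: .
    environment:
      - DD_METRICS_OTEL_ENABLED=true
      - OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://datadog-agent:4318/v1/metrics
  datadog-agent:
    image: gcr.io/datadoghq/agent:7
    pid: host  # macOS Docker Desktop: avoids the process-metrics crash (step 3)
    environment:
      - DD_API_KEY=${DD_API_KEY}
    volumes:
      # Mount the nested otlp_config from step 2; the flat env var does not nest.
      - ./datadog.yaml:/etc/datadog-agent/datadog.yaml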

Dogfooding branch: https://github.com/DataDog/ffe-dogfooding/tree/leo.romanovsky/flageval-metrics-dogfooding

Dogfooding evidence

Metric feature_flag.evaluations confirmed registered in Datadog (Eppo org, datadoghq.com) with the Go dogfooding app running dd-trace-go v2.7.0-dev.1:

$ docker logs app-go 2>&1 | head -2
Datadog Tracer v2.7.0-dev.1 INFO: DATADOG TRACER CONFIGURATION ...
Go server starting on port 8081

$ curl -s -X POST http://localhost:8081/evaluate -H "Content-Type: application/json" \
  -d '{"flag":{"key":"dogfood-test-flag","type":"string","defaultVariant":"off"},"subject":{"id":"user-1","attributes":{}}}'
{"timestamp":1772488246999,"allocation":{"key":"default-allocation"},"flag":{"key":"dogfood-test-flag"},"variant":{"key":"off"},...}

Metric metadata confirmed in Datadog:

feature_flag.evaluations — origin_product: Other, registered with no upload errors

Local test evidence

[Screenshot: local test run, 2026-03-02 at 5:01:46 PM]

Unit tests (all pass)

--- PASS: TestRecord/success_with_targeting_match (0.00s)
--- PASS: TestRecord/error_flag_not_found (0.00s)
--- PASS: TestRecord/default_reason (0.00s)
--- PASS: TestRecord/disabled_flag (0.00s)
--- PASS: TestRecordMultipleEvaluations (0.00s)
--- PASS: TestRecordDifferentFlags (0.00s)
--- PASS: TestRecordAllErrorTypes (0.00s)
--- PASS: TestShutdownClean (0.00s)
--- PASS: TestIntegrationEvaluate/targeting_match_records_metric (0.00s)
--- PASS: TestIntegrationEvaluate/non-existent_flag_records_error_metric (0.00s)
--- PASS: TestIntegrationEvaluate/no_configuration_records_error_metric (0.00s)
ok  	github.com/DataDog/dd-trace-go/v2/openfeature	0.651s

System tests (all 17 FFE tests pass — 0 regressions)

Scenario: FEATURE_FLAGGING_AND_EXPERIMENTATION
Library: golang@2.7.0-dev.1

tests/ffe/test_dynamic_evaluation.py ..                                  [ 11%]
tests/ffe/test_exposures.py ...........                                  [ 76%]
tests/ffe/test_flag_eval_metrics.py ....                                 [100%]

=============== 17 passed, 2224 deselected in 228.93s (0:03:48) ================

Companion PRs

  • DataDog/system-tests#6410 — FFE system-test updates (detailed in a follow-up comment below)

Commit message (initial implementation):

Count feature flag evaluations as custom metrics using the OTel Metrics API.
The OTel SDK handles aggregation; metrics export to the Datadog agent via OTLP;
the agent forwards to Metrics Platform.

Metric: feature_flag.evaluations (Int64Counter, delta temporality)
Attributes: feature_flag.key, feature_flag.provider.name,
feature_flag.result.variant, feature_flag.result.reason, error.type

Gated by DD_METRICS_OTEL_ENABLED=true (noop otherwise).
codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 87.23404% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.91%. Comparing base (f0b4b24) to head (6e577d2).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
openfeature/flageval_metrics.go | 88.88% | 2 Missing and 2 partials ⚠️
openfeature/provider.go | 81.81% | 1 Missing and 1 partial ⚠️

Additional details and impacted files:

Files with missing lines | Coverage Δ
openfeature/provider.go | 70.21% <81.81%> (ø)
openfeature/flageval_metrics.go | 88.88% <88.88%> (ø)

... and 371 files with indirect coverage changes


pr-commenter bot commented Mar 2, 2026

Benchmarks

Benchmark execution time: 2026-03-03 16:13:42

Comparing candidate commit 7a57d24 in PR branch leo.romanovsky/flageval-metrics with baseline commit f0b4b24 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 155 metrics, 9 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

Commit messages from intermediate cleanups:

  • The OpenFeature Reason constants (TARGETING_MATCH, DEFAULT, DISABLED, ERROR) just need lowercasing for metric attributes; the explicit switch mapping was unnecessary indirection.
  • feature_flag.provider.name is always "Datadog" — it adds no value as a tag dimension, since this metric is only emitted by our own provider.
  • The error-classification switch was a mechanical mapping from sentinel errors to snake_case strings; a declarative map is clearer and eliminates the indirection.
@leoromanovsky leoromanovsky marked this pull request as ready for review March 2, 2026 22:02
@leoromanovsky leoromanovsky requested review from a team as code owners March 2, 2026 22:02
Comment thread openfeature/provider.go Outdated
Comment on lines +430 to +434
	// Record flag evaluation metric
	if p.flagEvalMetrics != nil {
		p.flagEvalMetrics.record(ctx, flagKey, res.VariantKey,
			strings.ToLower(string(res.Reason)), res.Error)
	}
Member:

nitpick: having 3 string parameters is error-prone. It would be cleaner if record() accepted evaluationResult:

Suggested change:

	// Record flag evaluation metric
	if p.flagEvalMetrics != nil {
		p.flagEvalMetrics.record(ctx, flagKey, res)
	}

Contributor Author:

Thanks, agreed with that!

Comment thread openfeature/provider.go Outdated
Comment thread openfeature/flageval_metrics_test.go Outdated
Comment thread openfeature/flageval_metrics.go Outdated
Commit e6b2adf6d:

Remove the ownsProvider field from flagEvalMetrics since it was a
test-only knob that made the shutdown test a no-op. ddmetric.Shutdown()
already handles noop and SDK providers gracefully, so the guard is
unnecessary.

Remove TestShutdownClean which was meaningless because setupTestMetrics
set ownsProvider=false, causing shutdown() to skip all real work.
Commit 7a57d2433:

Move metric recording from a defer in evaluate() to an OpenFeature
Finally hook. The old approach missed type conversion errors (e.g.,
calling BooleanValue on a string flag) and "not ready" state
evaluations because those happen after evaluate() returns.

The Finally hook fires after ALL evaluation logic completes, including
type-specific conversions in BooleanEvaluation/StringEvaluation/etc.,
so it captures the full picture.

Also simplify record() to accept InterfaceEvaluationDetails instead
of 3 separate string params, and use OpenFeature ErrorCode for error
classification instead of matching against sentinel errors.
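
A minimal sketch of what that hook could look like, assuming a go-sdk version whose Hook interface passes the evaluation details to Finally and provides of.UnimplementedHook for no-op defaults — both assumptions about the SDK surface, not the PR's exact code:

import (
	"context"

	of "github.com/open-feature/go-sdk/openfeature"
)

// flagEvalHook mirrors the exposureHook pattern: it records one metric
// per evaluation, after type conversion has succeeded or failed.
type flagEvalHook struct {
	of.UnimplementedHook // assumed no-op defaults for Before/After/Error
	metrics *flagEvalMetrics
}

// Finally fires after ALL evaluation logic, including type-specific
// conversions, so type_mismatch errors are visible here. The signature
// is assumed from the go-sdk's hook interface.
func (h *flagEvalHook) Finally(ctx context.Context, hookCtx of.HookContext,
	details of.InterfaceEvaluationDetails, hints of.HookHints) {
	// record() was changed by this commit to accept the details struct.
	h.metrics.record(ctx, details)
}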
leoromanovsky (Contributor Author) commented:

Addressed all review comments

Commits

e6b2adf6d — Remove ownsProvider field and bogus shutdown test (comments #3 + #4)

  • Removed ownsProvider from flagEvalMetrics — it was a test-only knob that made TestShutdownClean a no-op.
  • shutdown() now always calls ddmetric.Shutdown(), which already handles noop and SDK providers gracefully (returns nil for noop, calls Shutdown() on SDK providers).
  • Removed TestShutdownClean since setupTestMetrics set ownsProvider=false, making the test meaningless.

7a57d2433 — Move flag evaluation metrics to a Finally hook (comments #1 + #2)

  • Added flagEvalHook struct (follows exposureHook pattern) with a Finally() method that fires after ALL evaluation logic — including type conversions in BooleanEvaluation/StringEvaluation/etc.
  • Changed record() to accept of.InterfaceEvaluationDetails instead of 3 separate string params + error. Variant, reason, and error code are pulled directly from the details struct.
  • Replaced sentinel error matching (errors.Is against errFlagNotFound, etc.) with an of.ErrorCode enum mapping (FlagNotFoundCode → "flag_not_found", TypeMismatchCode → "type_mismatch", etc.). This eliminates the errorTypeTags map entirely; a sketch of the mapping follows the list.
  • Hooks() now returns both the exposure hook and the flag eval metrics hook.
  • Removed the metric recording defer from evaluate().
  • Integration tests now go through the full OF client lifecycle (of.SetNamedProviderWithContextAndWait → client.BooleanValue), proving hooks fire in the real OF pipeline.
  • Added type conversion error test: calls BooleanValue on a STRING flag, verifies error.type=type_mismatch in the metric. This test would fail with the old evaluate()-level defer because the metric was recorded before the type conversion error happened.
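
A plausible shape for that mapping, using the go-sdk's ErrorCode constants; the variable name, helper, and fallback choice are illustrative, not the PR's exact code:

import of "github.com/open-feature/go-sdk/openfeature"

// errorCodeTags maps OpenFeature error codes to the low-cardinality
// error.type attribute values.
var errorCodeTags = map[of.ErrorCode]string{
	of.FlagNotFoundCode:        "flag_not_found",
	of.TypeMismatchCode:        "type_mismatch",
	of.ParseErrorCode:          "parse_error",
	of.ProviderNotReadyCode:    "provider_not_ready",
	of.TargetingKeyMissingCode: "targeting_key_missing",
	of.InvalidContextCode:      "invalid_context",
	of.GeneralCode:             "general",
}

func errorTypeTag(code of.ErrorCode) string {
	if tag, ok := errorCodeTags[code]; ok {
		return tag
	}
	return "general" // fallback for unrecognized codes
}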

System tests

Also updated system-tests in DataDog/system-tests#6410:

  1. Fixed Go weblog (ffe.go): Was always calling ofClient.Object() regardless of variationType, meaning type conversion errors could never occur. Now dispatches to BooleanValue/StringValue/IntValue/FloatValue based on variationType, matching the Python and Node.js weblogs.

  2. Added Test_FFE_Eval_Metric_Type_Mismatch: Configures a STRING flag, evaluates it as BOOLEAN, asserts the metric has reason:error and error.type:type_mismatch. This test would fail with the old evaluate()-level recording (which would see reason:targeting_match with no error tag) and only passes with the Finally hook approach.

Local test results

All 18 FFE system tests pass (previously 17, +1 new type mismatch test):

tests/ffe/test_dynamic_evaluation.py ..                                  [ 11%]
tests/ffe/test_exposures.py ...........                                  [ 72%]
tests/ffe/test_flag_eval_metrics.py .....                                [100%]

=============== 18 passed, 2224 deselected in 259.67s (0:04:19) ================

Unit tests also pass:

go test ./openfeature/... -count=1
ok  	github.com/DataDog/dd-trace-go/v2/openfeature	2.182s

leoromanovsky (Contributor Author) commented:

/merge

gh-worker-devflow-routing-ef8351 Bot commented Mar 3, 2026

View all feedback in the Devflow UI.

2026-03-03 21:04:18 UTC ℹ️ Start processing command /merge


2026-03-03 21:04:23 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 29m (p90).


2026-03-03 21:17:32 UTC ℹ️ MergeQueue: This merge request was merged

