
feat(openfeature): add flag evaluation tracking via OTel Metrics #4489

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 10 commits into main from leo.romanovsky/flageval-metrics on Mar 3, 2026

Conversation

leoromanovsky (Contributor) commented Mar 2, 2026

Motivation

Per the RFC "Flag evaluations tracking for APM tracers" (Oleksii Shmalko, 2026-01-20): we want to collect a metric for flag evaluations to track usage of flags. This data powers the FFE product's change tracking ("which services evaluate this flag?") and usage analytics.

The RFC evaluated six alternatives (including tracer-side aggregation via EVP, custom agent aggregation, Metrics Platform, reuse of the agent pipeline with a custom intake, and OTel Events) and recommends the Metrics Platform approach: implement flag evaluation tracking as regular custom metrics sent via the OpenTelemetry Metrics API. Tracers aggregate metrics via OTel, send the aggregated metrics to the agent via OTLP, and the agent forwards them to Metrics Platform. This approach requires the lowest SDK-team effort, no backend or agent changes, and performs well.

Key RFC constraints:

  • No high-cardinality attributes (targeting key, evaluation context) — each unique attribute combination creates a custom metric, increasing load and cost
  • Independent from exposure events — exposures are per-subject deduplicated events that are already implemented; evaluation metrics are aggregate counts
  • Sampling is OK — since pricing shifted from charging per-evaluation to charging per-configuration request, we don't need exact counts

Changes

  • New openfeature/flageval_metrics.go: Creates a dedicated MeterProvider via dd-trace-go's OTel metrics support (ddmetric.NewMeterProvider()). Defines an Int64Counter instrument (feature_flag.evaluations, delta temporality, 10s export interval). Provides record() to emit the metric with attributes feature_flag.key, feature_flag.result.variant, feature_flag.result.reason, and error.type (on error); error classification uses a declarative errorTypeTags map from sentinel errors to low-cardinality strings. A sketch of this shape follows the list.
  • Modified openfeature/provider.go: Added flagEvalMetrics field to DatadogProvider. Wired into newDatadogProvider() (creates metrics on init), evaluate() (records metric via defer after every evaluation, reason lowercased directly from OpenFeature constants), and ShutdownWithContext() (graceful meter provider shutdown).
  • New openfeature/flageval_metrics_test.go: Table-driven unit tests using OTel SDK ManualReader for in-memory metric collection. Covers success/error/default/disabled attributes, multiple evaluations aggregation, different flag series, all error types, and integration with evaluate().
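
A minimal sketch of the shape described above, using only the standard OTel metric API. The ddmetric.NewMeterProvider() wiring is omitted, and the newFlagEvalMetrics constructor and the record() signature here (plain strings plus an error-type tag) are illustrative assumptions — the actual code may differ:

package openfeature

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// flagEvalMetrics holds the counter instrument; in the PR the provider
// comes from ddmetric.NewMeterProvider() (delta temporality, 10s export).
type flagEvalMetrics struct {
	evaluations metric.Int64Counter
}

func newFlagEvalMetrics(mp metric.MeterProvider) (*flagEvalMetrics, error) {
	counter, err := mp.Meter("openfeature").Int64Counter("feature_flag.evaluations")
	if err != nil {
		return nil, err
	}
	return &flagEvalMetrics{evaluations: counter}, nil
}

// record emits one evaluation with low-cardinality attributes only;
// errType is empty on success.
func (m *flagEvalMetrics) record(ctx context.Context, key, variant, reason, errType string) {
	attrs := []attribute.KeyValue{
		attribute.String("feature_flag.key", key),
		attribute.String("feature_flag.result.variant", variant),
		attribute.String("feature_flag.result.reason", reason),
	}
	if errType != "" {
		attrs = append(attrs, attribute.String("error.type", errType))
	}
	m.evaluations.Add(ctx, 1, metric.WithAttributes(attrs...))
}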

Decisions

  • OTel Metrics (Metrics Platform path): Per RFC recommendation. Lowest SDK effort, no agent/backend changes needed, no custom aggregation code — the OTel SDK handles it all.
  • Dedicated MeterProvider: Self-contained; works without requiring the user to set up OTel separately. Returns noop if DD_METRICS_OTEL_ENABLED is not true — zero overhead when disabled.
  • 10s export interval: Matches the flush cadence of the EVP track implementations (iOS/Unity) for responsive tracking data. A sketch of an equivalent setup with the plain OTel SDK follows the list.
  • Low-cardinality attributes only: feature_flag.key, feature_flag.result.variant, feature_flag.result.reason, error.type. High-cardinality attributes (targeting_key, context, allocation) explicitly excluded per RFC to avoid blowing up custom metric cardinality. feature_flag.provider.name also excluded — always "Datadog", adds no value.
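
For context, here is how such a dedicated provider could be assembled with the plain OTel SDK — 10s periodic export, delta temporality, endpoint taken from OTEL_EXPORTER_OTLP_METRICS_ENDPOINT. The PR instead calls ddmetric.NewMeterProvider(), which presumably encapsulates equivalent wiring, so treat this as a sketch, not the PR's code:

package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func newMeterProvider(ctx context.Context) (*sdkmetric.MeterProvider, error) {
	// The OTLP HTTP exporter reads OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
	// from the environment by default.
	exporter, err := otlpmetrichttp.New(ctx,
		// Delta temporality, matching the instrument configuration above.
		otlpmetrichttp.WithTemporalitySelector(
			func(sdkmetric.InstrumentKind) metricdata.Temporality {
				return metricdata.DeltaTemporality
			}),
	)
	if err != nil {
		return nil, err
	}
	return sdkmetric.NewMeterProvider(
		// 10s export cadence, matching the EVP track flush interval.
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter,
			sdkmetric.WithInterval(10*time.Second))),
	), nil
}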

Enabling OTLP in production / dogfooding

The following is needed on the deployment side to receive these metrics:

  1. On the app: Set DD_METRICS_OTEL_ENABLED=true and OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://<agent-host>:4318/v1/metrics
  2. On the Datadog Agent: Enable the OTLP HTTP receiver. The env var DD_OTLP_CONFIG_RECEIVER_PROTOCOLS_HTTP_ENDPOINT doesn't properly nest the config — you need to mount a datadog.yaml with the nested YAML structure:
    otlp_config:
      receiver:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
  3. On macOS Docker Desktop: The agent container needs pid: host to avoid "failed to register process metrics: process does not exist", which crashes the OTLP pipeline. (A consolidated docker-compose sketch follows.)
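
A hypothetical docker-compose fragment tying the three steps together; the service names, the image tag, and the ./datadog.yaml path are illustrative, not taken from the dogfooding repo:

services:
  app-go:
    build: .
    environment:
      - DD_METRICS_OTEL_ENABLED=true
      - OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://datadog-agent:4318/v1/metrics
  datadog-agent:
    image: gcr.io/datadoghq/agent:7
    pid: host  # macOS Docker Desktop: avoids the process-metrics crash (step 3)
    environment:
      - DD_API_KEY=${DD_API_KEY}
    volumes:
      # Mount the nested otlp_config from step 2; the flat env var does not nest.
      - ./datadog.yaml:/etc/datadog-agent/datadog.yaml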

Dogfooding branch: https://github.com/DataDog/ffe-dogfooding/tree/leo.romanovsky/flageval-metrics-dogfooding

Dogfooding evidence

Metric feature_flag.evaluations confirmed registered in Datadog (Eppo org, datadoghq.com) with the Go dogfooding app running dd-trace-go v2.7.0-dev.1:

$ docker logs app-go 2>&1 | head -2
Datadog Tracer v2.7.0-dev.1 INFO: DATADOG TRACER CONFIGURATION ...
Go server starting on port 8081

$ curl -s -X POST http://localhost:8081/evaluate -H "Content-Type: application/json" \
  -d '{"flag":{"key":"dogfood-test-flag","type":"string","defaultVariant":"off"},"subject":{"id":"user-1","attributes":{}}}'
{"timestamp":1772488246999,"allocation":{"key":"default-allocation"},"flag":{"key":"dogfood-test-flag"},"variant":{"key":"off"},...}

Metric metadata confirmed in Datadog:

feature_flag.evaluations — origin_product: Other, registered with no upload errors

Local test evidence

[Screenshot: local test run, 2026-03-02 at 5:01:46 PM]

Unit tests (all pass)

--- PASS: TestRecord/success_with_targeting_match (0.00s)
--- PASS: TestRecord/error_flag_not_found (0.00s)
--- PASS: TestRecord/default_reason (0.00s)
--- PASS: TestRecord/disabled_flag (0.00s)
--- PASS: TestRecordMultipleEvaluations (0.00s)
--- PASS: TestRecordDifferentFlags (0.00s)
--- PASS: TestRecordAllErrorTypes (0.00s)
--- PASS: TestShutdownClean (0.00s)
--- PASS: TestIntegrationEvaluate/targeting_match_records_metric (0.00s)
--- PASS: TestIntegrationEvaluate/non-existent_flag_records_error_metric (0.00s)
--- PASS: TestIntegrationEvaluate/no_configuration_records_error_metric (0.00s)
ok  	github.com/DataDog/dd-trace-go/v2/openfeature	0.651s

System tests (all 17 FFE tests pass — 0 regressions)

Scenario: FEATURE_FLAGGING_AND_EXPERIMENTATION
Library: golang@2.7.0-dev.1

tests/ffe/test_dynamic_evaluation.py ..                                  [ 11%]
tests/ffe/test_exposures.py ...........                                  [ 76%]
tests/ffe/test_flag_eval_metrics.py ....                                 [100%]

=============== 17 passed, 2224 deselected in 228.93s (0:03:48) ================

Companion PRs

  • DataDog/system-tests#6410 — FFE system-test updates (detailed in a follow-up comment below)

Commit message (initial implementation):

Count feature flag evaluations as custom metrics using the OTel Metrics API.
The OTel SDK handles aggregation; metrics export to the Datadog agent via OTLP;
the agent forwards to Metrics Platform.

Metric: feature_flag.evaluations (Int64Counter, delta temporality)
Attributes: feature_flag.key, feature_flag.provider.name,
feature_flag.result.variant, feature_flag.result.reason, error.type

Gated by DD_METRICS_OTEL_ENABLED=true (noop otherwise).
codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 87.23404% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.91%. Comparing base (f0b4b24) to head (6e577d2).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
openfeature/flageval_metrics.go | 88.88% | 2 Missing and 2 partials ⚠️
openfeature/provider.go | 81.81% | 1 Missing and 1 partial ⚠️

Additional details and impacted files:

Files with missing lines | Coverage Δ
openfeature/provider.go | 70.21% <81.81%> (ø)
openfeature/flageval_metrics.go | 88.88% <88.88%> (ø)

... and 371 files with indirect coverage changes


pr-commenter bot commented Mar 2, 2026

Benchmarks

Benchmark execution time: 2026-03-03 16:13:42

Comparing candidate commit 7a57d24 in PR branch leo.romanovsky/flageval-metrics with baseline commit f0b4b24 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 155 metrics, 9 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

Commit messages from intermediate cleanups:

  • The OpenFeature Reason constants (TARGETING_MATCH, DEFAULT, DISABLED, ERROR) just need lowercasing for metric attributes; the explicit switch mapping was unnecessary indirection.
  • feature_flag.provider.name is always "Datadog" — it adds no value as a tag dimension, since this metric is only emitted by our own provider.
  • The error-classification switch was a mechanical mapping from sentinel errors to snake_case strings; a declarative map is clearer and eliminates the indirection.
@leoromanovsky leoromanovsky marked this pull request as ready for review March 2, 2026 22:02
@leoromanovsky leoromanovsky requested review from a team as code owners March 2, 2026 22:02
Comment thread openfeature/provider.go Outdated
Comment on lines +430 to +434
	// Record flag evaluation metric
	if p.flagEvalMetrics != nil {
		p.flagEvalMetrics.record(ctx, flagKey, res.VariantKey,
			strings.ToLower(string(res.Reason)), res.Error)
	}
Member:

nitpick: having 3 string parameters is error-prone. It would be cleaner if record() accepted evaluationResult:

Suggested change:

	// Record flag evaluation metric
	if p.flagEvalMetrics != nil {
		p.flagEvalMetrics.record(ctx, flagKey, res)
	}

Contributor Author:

Thanks, agreed with that!

Comment thread openfeature/provider.go Outdated
Comment thread openfeature/flageval_metrics_test.go Outdated
Comment thread openfeature/flageval_metrics.go Outdated
Commit e6b2adf6d:

Remove the ownsProvider field from flagEvalMetrics since it was a
test-only knob that made the shutdown test a no-op. ddmetric.Shutdown()
already handles noop and SDK providers gracefully, so the guard is
unnecessary.

Remove TestShutdownClean which was meaningless because setupTestMetrics
set ownsProvider=false, causing shutdown() to skip all real work.
Commit 7a57d2433:

Move metric recording from a defer in evaluate() to an OpenFeature
Finally hook. The old approach missed type conversion errors (e.g.,
calling BooleanValue on a string flag) and "not ready" state
evaluations because those happen after evaluate() returns.

The Finally hook fires after ALL evaluation logic completes, including
type-specific conversions in BooleanEvaluation/StringEvaluation/etc.,
so it captures the full picture.

Also simplify record() to accept InterfaceEvaluationDetails instead
of 3 separate string params, and use OpenFeature ErrorCode for error
classification instead of matching against sentinel errors.
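
A minimal sketch of what that hook could look like, assuming a go-sdk version whose Hook interface passes the evaluation details to Finally and provides of.UnimplementedHook for no-op defaults — both assumptions about the SDK surface, not the PR's exact code:

import (
	"context"

	of "github.com/open-feature/go-sdk/openfeature"
)

// flagEvalHook mirrors the exposureHook pattern: it records one metric
// per evaluation, after type conversion has succeeded or failed.
type flagEvalHook struct {
	of.UnimplementedHook // assumed no-op defaults for Before/After/Error
	metrics *flagEvalMetrics
}

// Finally fires after ALL evaluation logic, including type-specific
// conversions, so type_mismatch errors are visible here. The signature
// is assumed from the go-sdk's hook interface.
func (h *flagEvalHook) Finally(ctx context.Context, hookCtx of.HookContext,
	details of.InterfaceEvaluationDetails, hints of.HookHints) {
	// record() was changed by this commit to accept the details struct.
	h.metrics.record(ctx, details)
}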
leoromanovsky (Contributor Author) commented:

Addressed all review comments

Commits

e6b2adf6d — Remove ownsProvider field and bogus shutdown test (comments #3 + #4)

  • Removed ownsProvider from flagEvalMetrics — it was a test-only knob that made TestShutdownClean a no-op.
  • shutdown() now always calls ddmetric.Shutdown(), which already handles noop and SDK providers gracefully (returns nil for noop, calls Shutdown() on SDK providers).
  • Removed TestShutdownClean since setupTestMetrics set ownsProvider=false, making the test meaningless.

7a57d2433 — Move flag evaluation metrics to a Finally hook (comments #1 + #2)

  • Added flagEvalHook struct (follows exposureHook pattern) with a Finally() method that fires after ALL evaluation logic — including type conversions in BooleanEvaluation/StringEvaluation/etc.
  • Changed record() to accept of.InterfaceEvaluationDetails instead of 3 separate string params + error. Variant, reason, and error code are pulled directly from the details struct.
  • Replaced sentinel error matching (errors.Is against errFlagNotFound, etc.) with an of.ErrorCode enum mapping (FlagNotFoundCode → "flag_not_found", TypeMismatchCode → "type_mismatch", etc.). This eliminates the errorTypeTags map entirely; a sketch of the mapping follows the list.
  • Hooks() now returns both the exposure hook and the flag eval metrics hook.
  • Removed the metric recording defer from evaluate().
  • Integration tests now go through the full OF client lifecycle (of.SetNamedProviderWithContextAndWait → client.BooleanValue), proving hooks fire in the real OF pipeline.
  • Added type conversion error test: calls BooleanValue on a STRING flag, verifies error.type=type_mismatch in the metric. This test would fail with the old evaluate()-level defer because the metric was recorded before the type conversion error happened.
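
A plausible shape for that mapping, using the go-sdk's ErrorCode constants; the variable name, helper, and fallback choice are illustrative, not the PR's exact code:

import of "github.com/open-feature/go-sdk/openfeature"

// errorCodeTags maps OpenFeature error codes to the low-cardinality
// error.type attribute values.
var errorCodeTags = map[of.ErrorCode]string{
	of.FlagNotFoundCode:        "flag_not_found",
	of.TypeMismatchCode:        "type_mismatch",
	of.ParseErrorCode:          "parse_error",
	of.ProviderNotReadyCode:    "provider_not_ready",
	of.TargetingKeyMissingCode: "targeting_key_missing",
	of.InvalidContextCode:      "invalid_context",
	of.GeneralCode:             "general",
}

func errorTypeTag(code of.ErrorCode) string {
	if tag, ok := errorCodeTags[code]; ok {
		return tag
	}
	return "general" // fallback for unrecognized codes
}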

System tests

Also updated system-tests in DataDog/system-tests#6410:

  1. Fixed Go weblog (ffe.go): Was always calling ofClient.Object() regardless of variationType, meaning type conversion errors could never occur. Now dispatches to BooleanValue/StringValue/IntValue/FloatValue based on variationType, matching the Python and Node.js weblogs.

  2. Added Test_FFE_Eval_Metric_Type_Mismatch: Configures a STRING flag, evaluates it as BOOLEAN, asserts the metric has reason:error and error.type:type_mismatch. This test would fail with the old evaluate()-level recording (which would see reason:targeting_match with no error tag) and only passes with the Finally hook approach.

Local test results

All 18 FFE system tests pass (previously 17, +1 new type mismatch test):

tests/ffe/test_dynamic_evaluation.py ..                                  [ 11%]
tests/ffe/test_exposures.py ...........                                  [ 72%]
tests/ffe/test_flag_eval_metrics.py .....                                [100%]

=============== 18 passed, 2224 deselected in 259.67s (0:04:19) ================

Unit tests also pass:

go test ./openfeature/... -count=1
ok  	github.com/DataDog/dd-trace-go/v2/openfeature	2.182s

leoromanovsky (Contributor Author) commented:

/merge

gh-worker-devflow-routing-ef8351 Bot commented Mar 3, 2026

View all feedback in the Devflow UI.

2026-03-03 21:04:18 UTC ℹ️ Start processing command /merge


2026-03-03 21:04:23 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 29m (p90).


2026-03-03 21:17:32 UTC ℹ️ MergeQueue: This merge request was merged

