[OTEL] Add OpenTelemetry observability support #285

Open
royischoss wants to merge 15 commits into mlrun:development from royischoss:ceml-641

@royischoss royischoss commented Apr 5, 2026

Adds OTel-based observability to MLRun CE with automatic Python instrumentation, deployment-mode metrics collection, and Prometheus integration.
https://iguazio.atlassian.net/browse/CEML-685

Changes
OTel operator sub-chart

  • Added opentelemetry-operator v0.78.1 as an optional dependency
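
For illustration, an optional sub-chart is usually declared in `Chart.yaml` with a `condition` toggle. This is a minimal sketch, not the PR's actual file: the `condition` key name is an assumption, and the repository URL is the public OpenTelemetry Helm charts repo.

```yaml
# Chart.yaml (fragment): optional sub-chart dependency
dependencies:
  - name: opentelemetry-operator
    version: 0.78.1
    repository: https://open-telemetry.github.io/opentelemetry-helm-charts
    # Hypothetical toggle; the PR may use a different values key.
    condition: opentelemetry-operator.enabled
```

With a condition like this, setting the key to `false` in values skips rendering the operator sub-chart.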

New templates (templates/opentelemetry/)

  • Pre-install hook to label/annotate the namespace for OTel webhook injection and namespace-wide Python auto-instrumentation
  • Post-install hook that waits for the OTel CRDs before creating the OpenTelemetryCollector and Instrumentation CRs, avoiding an operator/CR race condition
  • RBAC for hook jobs
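
A post-install hook of the kind described above could look roughly like the following Job. This is a sketch under assumptions: the Job name, service account name, and `bitnami/kubectl` image are hypothetical and not taken from the PR; the CRD names are the ones the OpenTelemetry operator installs.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: otel-wait-for-crds              # hypothetical name
  annotations:
    "helm.sh/hook": post-install
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 3
  template:
    spec:
      serviceAccountName: otel-hook     # hypothetical; granted access by the hook RBAC
      restartPolicy: Never
      containers:
        - name: wait
          image: bitnami/kubectl:latest # assumed image providing kubectl and a shell
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -e
              for crd in opentelemetrycollectors.opentelemetry.io instrumentations.opentelemetry.io; do
                # kubectl wait fails on a missing resource, so poll until the CRD exists first.
                until kubectl get crd "$crd" >/dev/null 2>&1; do sleep 2; done
                kubectl wait --for=condition=Established "crd/$crd" --timeout=120s
              done
              # The hook would create the collector and Instrumentation CRs after this point.
```

Polling before `kubectl wait` matters because the operator may not have registered its CRDs yet when the hook starts.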

Instrumentation CR

  • Deployment-mode collector — single pod per namespace, exports metrics to Prometheus on port 8889
  • Disabled aws_lambda OTel instrumentor to suppress irrelevant Lambda warnings
  • Removed duplicate OTEL_RESOURCE_ATTRIBUTES_* env vars (they are auto-injected by the operator and caused "hides previous definition" warnings on every pod)
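
As a rough sketch of the two CRs described here, assuming a collector named `otel` (so the operator creates a Service named `otel-collector`) and a hypothetical Instrumentation name. The `aws_lambda` value is copied from this description and should be checked against the instrumentor's entry-point name.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel                            # assumed name
spec:
  mode: deployment                      # a Deployment rather than a sidecar or DaemonSet
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      prometheus:
        endpoint: 0.0.0.0:8889          # port Prometheus scrapes
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [prometheus]
---
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: python-instrumentation          # hypothetical name
spec:
  exporter:
    endpoint: http://otel-collector:4318  # assumed Service name and OTLP/HTTP port
  python:
    env:
      # Skip the Lambda instrumentor; value as named in the PR description.
      - name: OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
        value: aws_lambda
```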

MLRun API crash fix

  • Added mlrun.api.extraEnvKeyValue.PYTHONPATH — OTel operator injects PYTHONPATH=/otel-auto-instrumentation-python:$(PYTHONPATH) using K8s env var expansion, which can't see Docker image ENV vars. Without this explicit
    K8s env var, $(PYTHONPATH) resolves to empty, dropping the MLRun services package path and crashing the API
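
In values form, the fix amounts to something like the fragment below. The path is left as a placeholder because the description does not state it; only the key path `mlrun.api.extraEnvKeyValue.PYTHONPATH` comes from the PR.

```yaml
mlrun:
  api:
    extraEnvKeyValue:
      # Placeholder: set to the MLRun services package path that the image's
      # Docker ENV would otherwise provide.
      PYTHONPATH: "<mlrun-services-package-path>"
```

Declaring the variable at the Kubernetes level gives the operator's injected `$(PYTHONPATH)` reference something to resolve against.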

Admin/non-admin split

  • Admin: installs OTel operator with namespace selector webhook, CRs disabled
  • User namespace: operator disabled, collector + instrumentation CRs enabled
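
The split could be expressed as two values overrides along these lines. All key names below are hypothetical and only illustrate the shape; the chart's real keys may differ.

```yaml
# admin values (hypothetical keys): operator on, CRs off
opentelemetry-operator:
  enabled: true
opentelemetry:
  collector:
    enabled: false
  instrumentation:
    enabled: false
---
# user-namespace values (hypothetical keys): operator off, CRs on
opentelemetry-operator:
  enabled: false
opentelemetry:
  collector:
    enabled: true
  instrumentation:
    enabled: true
```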

Other

  • Azure Blob storage path helpers in _helpers.tpl (branching on storage.mode)
  • Prometheus scrape config simplified to a single otel-collector job
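
A single scrape job of this kind might look like the sketch below. The target host name is an assumption; the port and the `metrics_source` relabel rule follow this PR.

```yaml
scrape_configs:
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8889"]   # assumed Service name; port from the collector exporter
    relabel_configs:
      # Tag every series from this job so its origin is easy to filter on.
      - action: replace
        target_label: metrics_source
        replacement: otel_collector
```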

- action: replace
  target_label: metrics_source
  replacement: otel_collector
kube-state-metrics:
Contributor Author

This is a design limitation: with no conditions in values.yaml, the scrape job will run even if OTel is disabled.

Contributor

Why do we need this extra scrape job instead of just adding the web.enable-otlp-receiver flag to the Prometheus deployment?

As you can see here
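
For context, the alternative raised in this comment would look roughly like the sketch below: Prometheus accepts OTLP pushes and the collector exports over OTLP/HTTP instead of being scraped. The `server.extraFlags` layout assumes the prometheus-community chart, the host placeholder is hypothetical, and the flag requires a Prometheus version that ships `--web.enable-otlp-receiver`.

```yaml
# Prometheus chart values (prometheus-community layout assumed)
server:
  extraFlags:
    - web.enable-otlp-receiver
---
# Collector config fragment: push metrics instead of exposing port 8889
exporters:
  otlphttp:
    # The exporter appends /v1/metrics to this base path.
    endpoint: http://<prometheus-host>:9090/api/v1/otlp
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
```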

@royischoss royischoss marked this pull request as ready for review April 9, 2026 07:50
2 participants