Add ECMWF observability guidelines for logging and metrics by sametd · Pull Request #39 · ecmwf/codex

sametd · 2026-02-11T14:19:51Z

Context

This is a draft proposal to align observability guidelines across ECMWF software and services.
The objective of this PR is to collect broad feedback early, before we finalize requirements and structure.

Review approach

Please focus on whether the proposed direction is workable for your teams and platforms.
You are welcome to add colleagues as reviewers where you think it is useful.

Discussion scope for this PR

Please avoid deep implementation debates in this PR thread.
If needed, we can open follow-up issues/PRs for detailed technical discussions.

Next steps

After this review round:

Incorporate feedback into a revised version.
Add alerting guidance.
Add environment-specific collection guidance.
Add tracing guidance.

jameshawkes · 2026-02-16T16:37:49Z

Observability/observability-guidelines.md

+- Includes identifiers and outcome.
+- Uses stable field names.
+- Supports correlation:
+  - Include `trace_id` and `span_id` when context exists.


It's not clear what trace_id and span_id are (and their differences). Would be worth explaining how to use these attributes.

Good point. I added a short dedicated subsection in 4.5.1.
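
To illustrate the distinction asked about above, here is a minimal sketch, not taken from the guideline itself. It assumes the W3C Trace Context sizes (a 16-byte trace ID and an 8-byte span ID, matching the 32- and 16-character hex values in the PR's example record). The helper names and the `timestamp` and `message` fields are hypothetical. In a real service these IDs would normally come from the tracing library's active context rather than being generated by hand.

```python
import json
import secrets
from datetime import datetime, timezone


def new_trace_id() -> str:
    # 16 random bytes -> 32 lowercase hex characters.
    return secrets.token_hex(16)


def new_span_id() -> str:
    # 8 random bytes -> 16 lowercase hex characters.
    return secrets.token_hex(8)


def log_record(message: str, trace_id: str, span_id: str) -> str:
    """Serialize one structured log line carrying the correlation fields."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),  # hypothetical field
        "message": message,                                   # hypothetical field
        "trace_id": trace_id,  # shared by every operation in one request
        "span_id": span_id,    # unique to one operation within that request
    })


# One request (one trace) containing two operations (two spans).
trace_id = new_trace_id()
fetch_line = log_record("fetch started", trace_id, new_span_id())
store_line = log_record("store finished", trace_id, new_span_id())
```

Filtering logs on `trace_id` returns both lines, while `span_id` narrows the result to a single operation within that request.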

jameshawkes · 2026-02-16T16:38:27Z

Observability/observability-guidelines.md

+Use stable event names (`event.name`) where possible, and make messages
+explicit about outcome, target, and reason.
+
+For severity mapping guidance, follow OpenTelemetry severity concepts in the


Can you add a link to the OTel docs on this?

Added a direct link in 4.6 to the OpenTelemetry severity definition.

jameshawkes · 2026-02-16T16:40:12Z

Observability/observability-guidelines.md

+
+## 5. Metrics Standard
+
+Metrics MUST be exposed in Prometheus/OpenMetrics-compatible format.


We should probably explain what this means. Specifically, I think it means we should be serving them over HTTP at a relevant endpoint owned by the service (e.g. hello-world.ecmwf.int/metrics). Not sure how this applies to non-web-based services (e.g. HPC services, MARS)?

I expanded 5.1 to clarify expected exposure/collection patterns.
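
As a rough illustration of what "serving metrics over HTTP at a `/metrics` endpoint" can look like, here is a minimal standard-library sketch. It is not the pattern prescribed in section 5.1. The metric name `hello_world_requests_total`, the `status` label, and the hard-coded value are hypothetical, and a real service would more likely use a Prometheus client library than render the text format by hand.

```python
from http.server import BaseHTTPRequestHandler


def render_counter(name: str, help_text: str, samples: list) -> str:
    """Render one counter in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Label values are
    not escaped here, so this sketch only suits simple values.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on GET /metrics and 404 elsewhere."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_counter(
            "hello_world_requests_total",  # hypothetical metric name
            "Total handled requests.",
            [({"status": "200"}, 42)],     # hypothetical sample
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Running `HTTPServer(("", 8000), MetricsHandler).serve_forever()` (with `HTTPServer` imported from `http.server`) would expose the endpoint locally for a scraper. This covers only the web-service case; it does not answer the question about batch or HPC workloads.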

jameshawkes · 2026-02-16T16:42:05Z

Observability/observability-guidelines.md

+
+Ownership split for compliance:
+
+| Control | App Team | Platform Team |


App Team -> Development Team (just to be clearer)
Platform Team -> Platform Engineering Team (just to be clearer)

Maybe add the role of Production team here too? Requirements for new metrics/validation of metric setup?

Done. I also added a Production Team column in 5.9.

jameshawkes · 2026-02-16T16:45:25Z

Observability/observability-guidelines.md

+
+Out of scope in this version:
+
+- Detailed environment-specific collection pipelines and agent deployment patterns.


I think it would be worth explaining, at a high level, that the collection pipelines are part of the deployment environment, and covering some key ideas there. For example: for k8s-based deployments, PET deploys an OTel collector/forwarder; for HPC/VM, a collector should be deployed alongside the app. All logs forwarded to a central ECMWF collector; all metrics collected by a central Prometheus?

Just to give an idea of the overall strategy.

A diagram may help, even.

Added section 3.1 with a high-level strategy across Kubernetes, VM, and HPC, plus a Mermaid diagram showing workload-local collectors/forwarders and central ingestion flow for logs and metrics.

peshence

Looks great!

peshence · 2026-02-16T16:53:21Z

Observability/observability-guidelines.md

+    "request.id": "req-8f31c9",
+    "job.id": "job-42a7",
+    "trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93",
+    "span_id": "f9c3a29d03ef154f"


Might be worth talking about the reasoning behind some fields being "thing.id" and some "thing_id".

In general, fields like "thing.thing" can collide with nested dicts, so I would avoid them entirely, but I'm open to hearing other opinions.

That's a good catch. Fields like job.id and request.id here are custom keywords, so we can change them as we want. There is a good OpenTelemetry document on naming conventions: https://opentelemetry.io/docs/specs/semconv/general/naming/

Adding a naming section can make this document larger than intended, @jameshawkes what do you think?
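
The collision concern raised above can be made concrete with a small sketch. It assumes a log backend (or any intermediate tool) that expands dotted keys such as `job.id` into nested objects; whether a given ECMWF pipeline does this is not stated in the thread. The function name `expand_dotted` and the `"nightly"` value are hypothetical.

```python
def expand_dotted(flat: dict) -> dict:
    """Expand keys like "job.id" into nested dicts, failing on collisions."""
    nested: dict = {}
    for key, value in flat.items():
        parts = key.split(".")
        node = nested
        # Walk or create the intermediate objects for every prefix part.
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"{key!r} collides with scalar field {part!r}")
            node = child
        leaf = parts[-1]
        # A scalar cannot overwrite an object created by an earlier dotted key.
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"{key!r} collides with nested object {leaf!r}")
        node[leaf] = value
    return nested


# Dotted keys on their own expand cleanly.
ok = expand_dotted({"job.id": "job-42a7", "request.id": "req-8f31c9"})

# A scalar "job" field next to "job.id" cannot be represented once expanded.
try:
    expand_dotted({"job": "nightly", "job.id": "job-42a7"})
    collided = False
except ValueError:
    collided = True
```

Underscore-separated names such as `trace_id` never trigger this expansion. That is one argument for them, at the cost of diverging from the dotted attribute names used in the OpenTelemetry naming document linked above.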

peshence · 2026-02-16T16:54:29Z

Observability/observability-guidelines.md

+ECMWF services MUST use Prometheus metric types and naming conventions, and
+MUST expose metrics in a Prometheus/OpenMetrics-compatible text format.
+Metrics defined in this section are the source for alerting rules defined in
+the Alerting section.


This does not exist currently I guess?

Yes, it will be added in the next iteration

Add ECMWF observability guidelines for logging and metrics

23c53db

sametd requested review from EddyCMWF, Ozaq, carletes, jameshawkes, peshence and tbkr February 11, 2026 14:19

linting and wrapping

7ff4198

jameshawkes marked this pull request as ready for review February 16, 2026 16:35

jameshawkes requested changes Feb 16, 2026


peshence approved these changes Feb 16, 2026


sametd added 2 commits February 16, 2026 20:23

Clarify observability model across environments and trace correlation

d665791

Refine ownership model with explicit team roles

b0db173

sametd requested a review from jameshawkes February 16, 2026 19:54




3 participants
