Add ECMWF observability guidelines for logging and metrics #39

Open
sametd wants to merge 4 commits into main from codex/observability-guidelines

@sametd (Member) commented Feb 11, 2026

Context

This is a draft proposal to align observability guidelines across ECMWF software and services.
The objective of this PR is to collect broad feedback early, before we finalize requirements and structure.

Review approach

Please focus on whether the proposed direction is workable for your teams and platforms.
You are welcome to add colleagues as reviewers where you think it is useful.

Discussion scope for this PR

Please avoid deep implementation debates in this PR thread.
If needed, we can open follow-up issues/PRs for detailed technical discussions.

Next steps

After this review round:

  1. Incorporate feedback into a revised version.
  2. Add alerting guidance.
  3. Add environment-specific collection guidance.
  4. Add tracing guidance.

@jameshawkes marked this pull request as ready for review on February 16, 2026, 16:35

- Includes identifiers and outcome.
- Uses stable field names.
- Supports correlation:
- Include `trace_id` and `span_id` when context exists.
Contributor

It's not clear what `trace_id` and `span_id` are, or how they differ. It would be worth explaining how to use these attributes.

Member Author

Good point. I added a short dedicated subsection in 4.5.1.
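
For readers of this thread, here is a minimal sketch of the distinction, not the text of 4.5.1. It assumes the identifier sizes from W3C Trace Context, which OpenTelemetry also uses: a `trace_id` is 16 bytes (32 hex characters) and stays the same for a whole request, while a `span_id` is 8 bytes (16 hex characters) and identifies one operation within that request. The helper names `new_trace_id`, `new_span_id`, and `log_record` are hypothetical.

```python
import json
import secrets


def new_trace_id() -> str:
    # 16 random bytes rendered as 32 lowercase hex characters.
    return secrets.token_hex(16)


def new_span_id() -> str:
    # 8 random bytes rendered as 16 lowercase hex characters.
    return secrets.token_hex(8)


def log_record(message: str, trace_id: str, span_id: str) -> str:
    """Render one structured log line as JSON (hypothetical helper)."""
    return json.dumps({"message": message, "trace_id": trace_id, "span_id": span_id})


# One incoming request gets a single trace_id for its whole lifetime.
trace_id = new_trace_id()

# Each unit of work inside that request gets its own span_id.
fetch_span = new_span_id()
store_span = new_span_id()

records = [
    log_record("fetch started", trace_id, fetch_span),
    log_record("fetch finished", trace_id, fetch_span),
    log_record("store finished", trace_id, store_span),
]
for line in records:
    print(line)
```

Filtering logs by `trace_id` returns all three lines, while filtering by a `span_id` narrows the result to a single operation.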

Use stable event names (`event.name`) where possible, and make messages
explicit about outcome, target, and reason.

For severity mapping guidance, follow OpenTelemetry severity concepts in the
Contributor

Can you add a link to the OTel docs on this?

Member Author

Added a direct link in 4.6 to the OpenTelemetry severity definition.
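
As an illustration of what a severity mapping could look like, the sketch below maps Python `logging` levels to OpenTelemetry `SeverityNumber` values. The ranges (TRACE 1-4, DEBUG 5-8, INFO 9-12, WARN 13-16, ERROR 17-20, FATAL 21-24) come from the OpenTelemetry logs data model. Choosing the lowest number of each range, and rounding custom levels down, are assumptions of this sketch, not something the guideline prescribes. `to_otel_severity` is a hypothetical helper name.

```python
import logging

# Lowest SeverityNumber of each OpenTelemetry range, keyed by Python level.
OTEL_SEVERITY = {
    logging.DEBUG: (5, "DEBUG"),
    logging.INFO: (9, "INFO"),
    logging.WARNING: (13, "WARN"),
    logging.ERROR: (17, "ERROR"),
    logging.CRITICAL: (21, "FATAL"),
}


def to_otel_severity(levelno: int) -> tuple[int, str]:
    """Map a Python logging level to (severity_number, severity_text)."""
    # Custom levels are rounded down to the nearest standard level.
    for std_level in sorted(OTEL_SEVERITY, reverse=True):
        if levelno >= std_level:
            return OTEL_SEVERITY[std_level]
    # Anything below DEBUG is treated as TRACE.
    return (1, "TRACE")


print(to_otel_severity(logging.WARNING))
print(to_otel_severity(logging.CRITICAL))
```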


## 5. Metrics Standard

Metrics MUST be exposed in Prometheus/OpenMetrics-compatible format.
Contributor

We should probably explain what this means. Specifically, I think it means we should be serving them over HTTP at a relevant endpoint owned by the service (e.g. hello-world.ecmwf.int/metrics). I'm not sure how this applies to non-web-based services (e.g. HPC services, MARS).

Member Author

I expanded 5.1 to clarify expected exposure/collection patterns.
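
To make the HTTP case from the question above concrete, here is a minimal standard-library sketch of a service rendering the Prometheus text exposition format and serving it at `/metrics`. The metric names, the `METRICS` store, and the port are hypothetical, and a real service would more likely use a Prometheus client library. It does not cover non-web services such as HPC jobs.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metric store: name -> (type, help text, value).
METRICS = {
    "hello_world_requests_total": ("counter", "Total requests handled.", 42),
    "hello_world_queue_depth": ("gauge", "Jobs currently queued.", 3),
}


def render_metrics(metrics: dict) -> str:
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (mtype, help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    # The format requires a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and serve /metrics; call this from the service's entry point."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(METRICS), end="")
```

A scraper that requests `/metrics` receives the same text that `render_metrics` prints here.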


Ownership split for compliance:

| Control | App Team | Platform Team |
Contributor

App Team -> Development Team (just to be clearer)
Platform Team -> Platform Engineering Team (just to be clearer)

Maybe add the role of Production team here too? Requirements for new metrics/validation of metric setup?

Member Author

Done. I also added a Production Team column in 5.9.


Out of scope in this version:

- Detailed environment-specific collection pipelines and agent deployment patterns.
Contributor

I think it would be worth explaining, at a high level, that the collection pipelines are part of the deployment environment, and outlining some key ideas there. For example: for k8s-based deployments, PET deploys an OTel collector/forwarder; for HPC/VM, a collector should be deployed alongside the app. Are all logs forwarded to a central ECMWF collector, and all metrics collected by a central Prometheus?

This is just to give an idea of the overall strategy.

A diagram might even help.

Member Author

Added section 3.1 with a high-level strategy across Kubernetes, VM, and HPC, plus a Mermaid diagram showing workload-local collectors/forwarders and central ingestion flow for logs and metrics.

@peshence (Contributor) left a comment

Looks great!

Comment on lines +120 to +123
"request.id": "req-8f31c9",
"job.id": "job-42a7",
"trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93",
"span_id": "f9c3a29d03ef154f"
Contributor

Might be worth talking about the reasoning behind some fields being "thing.id" and some "thing_id".

In general, fields like "thing.thing" can collide with nested dicts, so I would avoid them entirely, but I'm open to hearing other opinions.

Member Author

That's a good catch. Fields like job.id and request.id here are custom keys, so we can change them as we want. OpenTelemetry has a good document on naming conventions: https://opentelemetry.io/docs/specs/semconv/general/naming/

Adding a naming section could make this document larger than intended. @jameshawkes, what do you think?
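
The collision concern raised above can be shown with a small sketch. It assumes a log backend that expands dotted keys into nested objects on ingestion, which some do. `expand_dotted` is a hypothetical helper written for this illustration, not part of any library.

```python
def expand_dotted(flat: dict) -> dict:
    """Expand {"a.b": 1} into {"a": {"b": 1}}, raising on collisions."""
    nested: dict = {}
    for key, value in flat.items():
        parts = key.split(".")
        node = nested
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                # A scalar field already occupies the parent name.
                raise ValueError(f"{key!r} collides with scalar field {part!r}")
            node = child
        leaf = parts[-1]
        if isinstance(node.get(leaf), dict):
            # An object already occupies the name this scalar wants.
            raise ValueError(f"{key!r} collides with object field {leaf!r}")
        node[leaf] = value
    return nested


# Dotted and underscore keys coexist as long as no prefix is reused.
print(expand_dotted({"job.id": "job-42a7", "trace_id": "7f3fbbf5"}))

# A scalar "job" next to "job.id" cannot be represented after expansion.
try:
    expand_dotted({"job": "nightly", "job.id": "job-42a7"})
except ValueError as err:
    print("collision:", err)
```

Underscore-only names sidestep the problem entirely, whereas dotted names stay safe only if no key is also used as a prefix of another key.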

ECMWF services MUST use Prometheus metric types and naming conventions, and
MUST expose metrics in a Prometheus/OpenMetrics-compatible text format.
Metrics defined in this section are the source for alerting rules defined in
the Alerting section.
Contributor

I guess this does not exist yet?

Member Author

Yes, it will be added in the next iteration.

@sametd requested a review from @jameshawkes on February 16, 2026, 19:54