Add ECMWF observability guidelines for logging and metrics #39
| - Includes identifiers and outcome. | ||
| - Uses stable field names. | ||
| - Supports correlation: | ||
| - Include `trace_id` and `span_id` when context exists. |
It's not clear what `trace_id` and `span_id` are, or how they differ. It would be worth explaining how to use these attributes.
Good point, I added a short dedicated subsection in 4.5.1
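For readers of this thread, the distinction can be sketched as follows. This is a minimal illustration, not the text of 4.5.1: it assumes the ID sizes used in the example log record (32 hex characters for `trace_id`, 16 for `span_id`), and `start_span` is a hypothetical helper rather than an OpenTelemetry API. A `trace_id` is shared by every operation belonging to one end-to-end request, while each operation gets its own `span_id`.

```python
import secrets


def start_span(name, parent=None):
    """Hypothetical helper: build a span record, inheriting the trace from the parent."""
    return {
        "name": name,
        # One trace_id per end-to-end request: reuse the parent's if there is one.
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(16),
        # Every operation (span) gets a fresh identifier.
        "span_id": secrets.token_hex(8),
        # Link back to the caller so the call tree can be rebuilt.
        "parent_span_id": parent["span_id"] if parent else None,
    }


root = start_span("handle_request")
child = start_span("query_database", parent=root)

print(root["trace_id"] == child["trace_id"])  # same trace
print(root["span_id"] == child["span_id"])    # different spans
```

Filtering logs by `trace_id` gives everything that happened for one request; filtering by `span_id` narrows that to a single step within it.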
| Use stable event names (`event.name`) where possible, and make messages | ||
| explicit about outcome, target, and reason. | ||
| For severity mapping guidance, follow OpenTelemetry severity concepts in the |
Can you add a link to the OTel docs on this?
Added a direct link in 4.6 to the OpenTelemetry severity definition
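As a rough illustration of what such a mapping looks like in practice, the sketch below maps Python `logging` levels onto the OpenTelemetry severity ranges (TRACE 1-4, DEBUG 5-8, INFO 9-12, WARN 13-16, ERROR 17-20, FATAL 21-24). Choosing the lowest number in each range, and the name `otel_severity`, are assumptions made for this example, not requirements from the guideline.

```python
import logging

# (Python level threshold, OTel SeverityNumber, OTel severity text),
# ordered from most to least severe.
_LEVEL_MAP = [
    (logging.CRITICAL, 21, "FATAL"),
    (logging.ERROR, 17, "ERROR"),
    (logging.WARNING, 13, "WARN"),
    (logging.INFO, 9, "INFO"),
    (logging.DEBUG, 5, "DEBUG"),
]


def otel_severity(levelno):
    """Return (severity_number, severity_text) for a Python logging level."""
    for threshold, number, text in _LEVEL_MAP:
        if levelno >= threshold:
            return number, text
    # Anything below DEBUG is treated as TRACE.
    return 1, "TRACE"


print(otel_severity(logging.WARNING))
```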
| ## 5. Metrics Standard | ||
| Metrics MUST be exposed in Prometheus/OpenMetrics-compatible format. |
We should probably explain what this means. Specifically, I think it means we should be serving them over HTTP at a relevant endpoint owned by the service (e.g. hello-world.ecmwf.int/metrics). I'm not sure how this applies to non-web-based services (e.g. HPC services, MARS)?
I expanded 5.1 to clarify expected exposure/collection patterns
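To make "Prometheus/OpenMetrics-compatible format" concrete, here is a minimal standard-library sketch of a service rendering the Prometheus text exposition format and serving it at `/metrics`. The metric name `http_requests_total`, its labels, and the sample counts are hypothetical; a real service would more likely use a Prometheus client library than hand-render the text.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical in-memory counter values, keyed by (method, status).
REQUESTS_TOTAL = {("GET", "200"): 3, ("GET", "500"): 1}


def render_metrics(counts):
    """Render a counter in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics on GET /metrics and 404 elsewhere."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


print(render_metrics(REQUESTS_TOTAL))
```

The handler can be mounted with `http.server.HTTPServer(("", port), MetricsHandler)` for local experiments. For non-web workloads the same rendered text would need another collection route, which is the open question raised above.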
| Ownership split for compliance: | ||
| | Control | App Team | Platform Team | |
App Team -> Development Team (just to be clearer)
Platform Team -> Platform Engineering Team (just to be clearer)
Maybe add the role of Production team here too? Requirements for new metrics/validation of metric setup?
Done. I also added a Production Team column in 5.9.
| Out of scope in this version: | ||
| - Detailed environment-specific collection pipelines and agent deployment patterns. |
I think it would be worth explaining, at a high level, that the collection pipelines are part of the deployment environment, and covering some key ideas there. For example: for k8s-based deployments, PET deploys an OTel collector/forwarder; for HPC/VM, a collector should be deployed alongside the app. Are all logs forwarded to a central ECMWF collector, and all metrics collected by a central Prometheus?
Just to give an idea of the overall strategy.
A diagram might even help.
Added section 3.1 with a high-level strategy across Kubernetes, VM, and HPC, plus a Mermaid diagram showing workload-local collectors/forwarders and central ingestion flow for logs and metrics.
| "request.id": "req-8f31c9", | ||
| "job.id": "job-42a7", | ||
| "trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93", | ||
| "span_id": "f9c3a29d03ef154f" |
It might be worth explaining the reasoning behind some fields being "thing.id" and some "thing_id".
In general, fields like "thing.thing" can collide with nested dicts, so I would avoid them entirely, but I'm open to hearing other opinions.
That's a good catch. Fields like job.id and request.id here are custom keys, so we can change them as we want. OpenTelemetry has a good document on naming conventions: https://opentelemetry.io/docs/specs/semconv/general/naming/
Adding a naming section could make this document larger than intended. @jameshawkes, what do you think?
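The collision concern can be shown with a small sketch. It assumes a log backend that expands dotted keys into nested objects. The `expand_dotted` helper below is hypothetical and only stands in for that behaviour. A record that contains both a scalar `job` field and a `job.id` field cannot be represented once the dots become nesting.

```python
def expand_dotted(flat):
    """Expand {"a.b": 1} into {"a": {"b": 1}}, raising on key collisions."""
    nested = {}
    for key, value in flat.items():
        parts = key.split(".")
        node = nested
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                # A scalar already occupies the slot where an object is needed.
                raise ValueError(f"{key!r} collides with scalar field {part!r}")
            node = child
        leaf = parts[-1]
        if isinstance(node.get(leaf), dict):
            # An object already occupies the slot where a scalar is needed.
            raise ValueError(f"{key!r} collides with nested field {leaf!r}")
        node[leaf] = value
    return nested


# Dotted and underscore keys coexist as long as no prefix is also a field.
print(expand_dotted({"job.id": "job-42a7", "trace_id": "7f3fbbf5"}))

# A scalar "job" plus "job.id" cannot both survive the expansion.
try:
    expand_dotted({"job": "nightly", "job.id": "job-42a7"})
except ValueError as err:
    print("collision:", err)
```

Underscore-separated names such as `trace_id` never trigger this, which is one argument for choosing a single convention for custom fields.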
| ECMWF services MUST use Prometheus metric types and naming conventions, and | ||
| MUST expose metrics in a Prometheus/OpenMetrics-compatible text format. | ||
| Metrics defined in this section are the source for alerting rules defined in | ||
| the Alerting section. |
I guess this section does not exist yet?
Yes, it will be added in the next iteration
Context
This is a draft proposal to align observability guidelines across ECMWF software and services.
The objective of this PR is to collect broad feedback early, before we finalize requirements and structure.
Review approach
Please focus on whether the proposed direction is workable for your teams and platforms.
You are welcome to add colleagues as reviewers where you think it is useful.
Discussion scope for this PR
Please avoid deep implementation debates in this PR thread.
If needed, we can open follow-up issues/PRs for detailed technical discussions.
Next steps
After this review round: