
[TRUNK-17697] Document Prometheus metrics endpoint for merge queue #460

Open
samgutentag wants to merge 3 commits into main from sam-gutentag/prometheus-metrics-endpoint


@samgutentag
Member

Summary

• Added new "Prometheus metrics endpoint" section to Metrics and monitoring with metrics table, scrape config, and Grafana query examples
• Added "Metrics Endpoints" section to API reference linking to the Prometheus docs

Context links

  • trunk-io/trunk2#2859, #2963, #3171, #3095 (all merged)

Files changed

  • merge-queue/administration/metrics.md — new Prometheus section
  • merge-queue/reference/merge.md — new Metrics Endpoints section

Open questions

  1. Critical: What is the exact endpoint URL? (Draft assumes /v1/merge-queue/metrics on api.trunk.io)
  2. Critical: What auth mechanism? (Draft assumes bearer token via Trunk API token)
  3. What are the exact metric names and labels? (Draft provides best guesses from PR code)
  4. Are there any rate limits on scraping?
  5. Is this available on all plans or only certain tiers?

Test plan

  • Check GitBook preview for both pages
  • Eng review required — metric names, endpoint URL, and auth need verification
  • Verify Grafana query examples are syntactically valid

🤖 Generated with Claude Code

Add Prometheus scrape endpoint docs with available metrics, example
scrape config, and Grafana queries. New section in metrics page and
reference from API page.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@trunk-staging-io

trunk-staging-io bot commented Mar 12, 2026

✨ Submitted to Merge by github-actions[bot]. It will be added to the merge queue once all branch protection rules pass and there are no merge conflicts with the target branch. See more details here.

@samgutentag samgutentag changed the title from "Document Prometheus metrics endpoint for merge queue" to "[TRUNK-17697] Document Prometheus metrics endpoint for merge queue" on Mar 12, 2026
@claude

claude bot commented Mar 12, 2026

Claude finished @samgutentag's task in 1m 53s


Review complete

  • Read changed files
  • Check CLAUDE.md for repo guidelines (none found)
  • Review for typos, grammar, and formatting
  • Review documentation best practices
  • Post inline comments
  • Final summary

Summary

Good foundation — the structure is clear and the cross-reference from the API page is well done. A few issues to address before merging:

Blocking issues:

  1. Broken PromQL query (metrics.md line 157): the histogram_quantile example is missing the required sum(rate(...)) by (le) wrapper. As written, it will fail or return incorrect results.

  2. Auth mechanism mismatch (metrics.md line 139): the scrape config uses Prometheus bearer_token (which sets the Authorization: Bearer header), but the rest of the API uses the x-api-token header. Eng needs to confirm which auth the Prometheus endpoint actually accepts before this is documented.

Non-blocking:

  1. Heading level inconsistency (metrics.md line 110): ## Prometheus metrics endpoint uses h2 while all other sections in the file use h3 (###). It should be ###, with subsections bumped to ####.

The author's own open questions (endpoint URL, exact metric names, rate limits, plan availability) are the real blockers for this content being accurate — eng review is clearly needed before this merges.


**P95 wait time:**
```promql
histogram_quantile(0.95, trunk_merge_queue_wait_time_seconds_bucket{repo="my-org/my-repo"})
```


This histogram_quantile query is incomplete — it will produce incorrect results or errors. Prometheus histogram quantile requires wrapping with sum(rate(...)) by (le). The corrected form:

Suggested change:
- histogram_quantile(0.95, trunk_merge_queue_wait_time_seconds_bucket{repo="my-org/my-repo"})
+ histogram_quantile(0.95, sum(rate(trunk_merge_queue_wait_time_seconds_bucket{repo="my-org/my-repo"}[5m])) by (le))

rate() requires a range vector selector such as [5m]. Choose a window that spans several scrape intervals so that each evaluation sees more than one sample.
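
To illustrate why the query needs cumulative bucket values grouped by `le`, here is a minimal Python sketch of the linear interpolation that histogram_quantile performs over classic histogram buckets. It is a simplified approximation, not Prometheus's implementation: edge cases such as an out-of-range quantile are ignored, and the bucket bounds and counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (le, count) buckets.

    Simplified sketch of the interpolation used for classic Prometheus
    histograms. `buckets` must include the +Inf bucket.
    """
    buckets = sorted(buckets)        # ascending by upper bound `le`
    total = buckets[-1][1]           # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0   # lowest bucket is assumed to start at 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # fall back to the last finite upper bound seen.
                return prev_le
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return prev_le


# Hypothetical wait-time buckets in seconds (values invented for the example).
example_buckets = [(60, 10), (300, 40), (900, 50), (math.inf, 50)]
median = histogram_quantile(0.5, example_buckets)   # 180.0
p95 = histogram_quantile(0.95, example_buckets)     # about 750
```

Without the `le` label surviving aggregation, there are no bucket boundaries left to interpolate between, which is why the grouping clause matters.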

Member Author


@blx can you verify this?

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
@samgutentag samgutentag requested review from blx and pv72895 March 13, 2026 17:18
…ndpoint details

Replaces placeholder metric names (trunk_merge_queue_*) with actual implementation
(mq_queue_depth_current, mq_pr_conclusions_1h_total, etc.). Adds correct endpoint
path, auth mechanism, query params, histogram details, gauge semantics caveat,
sample output, and PromQL examples from the Prometheus Metrics Spec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude

claude bot commented Mar 13, 2026

Claude finished @samgutentag's task in 1m 57s


Review complete

  • Read changed files
  • Check CLAUDE.md for repo guidelines (none found)
  • Review for typos, grammar, and formatting
  • Review documentation best practices
  • Final summary

Summary

The two blocking issues from the previous review have been resolved in this update:

  1. Auth mechanism — now correctly uses x-api-token header (metrics.md lines 188–189) ✅
  2. PromQL queries — no longer apply rate() to gauge-semantics metrics, consistent with the inline warning ✅
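
For anyone testing the endpoint by hand, the header difference noted in item 1 can be sketched in Python as follows. This is an assumption-laden illustration: `build_scrape_request` is a hypothetical helper, and the URL is a placeholder because the confirmed endpoint path is not stated in this thread. Only the `x-api-token` header name comes from the review above.

```python
import urllib.request


def build_scrape_request(metrics_url, api_token):
    """Build (but do not send) a GET request authenticated via x-api-token.

    Hypothetical helper: `metrics_url` must be supplied by the caller,
    since the real endpoint path is not given in this conversation.
    """
    return urllib.request.Request(
        metrics_url,
        headers={"x-api-token": api_token},  # not Authorization: Bearer
        method="GET",
    )


# Placeholder URL and token, for illustration only.
request = build_scrape_request("https://example.invalid/metrics", "YOUR_API_TOKEN")
```

Sending the request with `urllib.request.urlopen(request)` would return the Prometheus text payload, assuming the URL and token are valid.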

Non-blocking issues remaining:

  1. Heading hierarchy (metrics.md lines 118, 135, 173, 194, 237): the parent section is ### Prometheus metrics endpoint (h3), but all subsections (Endpoint, Available metrics, Scrape configuration, Example queries, Sample output) also use ###. They should be ####, and #### Point-in-time gauges / #### Rolling 1-hour window metrics (lines 145, 155) should become #####.

  2. Histogram PromQL missing by (le) (metrics.md lines 225, 231): the histogram_quantile queries work as-is for single-series cases, but best practice is to wrap them in sum(...) by (le) so that buckets aggregate correctly across label combinations (different repos/branches). Given the gauge semantics, rate() should be omitted; the corrected form would be histogram_quantile(0.90, sum(mq_pr_wait_duration_1h_seconds_bucket) by (le)).
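
As a rough illustration of what sum(...) by (le) does in item 2, the Python sketch below collapses bucket samples from several label sets into one cumulative series keyed only by `le`. The function name, repo names, and sample values are all made up for the example; real label sets would come from the scraped output.

```python
from collections import defaultdict


def sum_by_le(series):
    """Aggregate *_bucket samples across label sets, keeping only `le`.

    `series` is an iterable of (labels, value) pairs for a single bucket
    metric. This is a loose Python analogue of PromQL `sum(...) by (le)`.
    """
    totals = defaultdict(float)
    for labels, value in series:
        # float() accepts "+Inf", so the overflow bucket sorts last.
        totals[float(labels["le"])] += value
    return sorted(totals.items())


# Hypothetical samples for two repositories (values invented).
samples = [
    ({"repo": "my-org/repo-a", "le": "60"}, 3),
    ({"repo": "my-org/repo-a", "le": "300"}, 7),
    ({"repo": "my-org/repo-a", "le": "+Inf"}, 8),
    ({"repo": "my-org/repo-b", "le": "60"}, 2),
    ({"repo": "my-org/repo-b", "le": "300"}, 5),
    ({"repo": "my-org/repo-b", "le": "+Inf"}, 7),
]
combined = sum_by_le(samples)
```

The combined series has one value per bucket boundary, which is the shape histogram_quantile expects when the query spans more than one repo or branch.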

The author's open questions about exact metric names, rate limits, and plan availability remain — eng sign-off is still needed before merging.

@github-actions

/trunk merge


@github-actions github-actions bot left a comment


Auto-approved: Claude code review passed.

Contributor

@blx blx left a comment


A few notes:

Comment on lines +114 to +116
{% hint style="info" %}
The Prometheus metrics endpoint is available on select plans. Contact your Trunk representative for access.
{% endhint %}
Contributor


It is currently available to all merge queue users.

| Parameter | Required | Description |
| --- | --- | --- |
| `repo` | No | Repository in `owner/name` format (e.g., `my-org/my-repo`). If omitted, returns metrics for all repositories in the organization. |
| `repoHost` | No | Repository host. Defaults to `github.com`. |
Contributor


repo and repoHost must be provided together if either is specified. repoHost does not default to github.com. So the request takes either no query params, or both repo and repoHost.

For repoHost, I would keep the current text but turn the default into an example ("Example: github.com").
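
The both-or-neither rule could be expressed client-side roughly like this. It is a sketch only: `build_metrics_params` is a hypothetical helper, not part of any Trunk client, and the only details taken from this review are the parameter names `repo` and `repoHost` and the rule that they travel together.

```python
from urllib.parse import urlencode


def build_metrics_params(repo=None, repo_host=None):
    """Return query params for the metrics endpoint.

    Hypothetical helper enforcing the rule described above: pass both
    `repo` and `repoHost`, or neither. No default host is assumed.
    """
    if (repo is None) != (repo_host is None):
        raise ValueError("repo and repoHost must be provided together")
    if repo is None:
        return {}  # no filter: metrics for all repositories in the org
    return {"repo": repo, "repoHost": repo_host}


# Both parameters supplied; urlencode percent-encodes the slash in the repo.
query_string = urlencode(build_metrics_params("my-org/my-repo", "github.com"))
```

Raising early on a lone `repo` or `repoHost` avoids relying on a server-side default that, per the comment above, does not exist.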

| `repo` | No | Repository in `owner/name` format (e.g., `my-org/my-repo`). If omitted, returns metrics for all repositories in the organization. |
| `repoHost` | No | Repository host. Defaults to `github.com`. |

The response uses content type `text/plain; version=0.0.4; charset=utf-8` (standard Prometheus format). Metrics are computed on-demand with a 60-second cache.
Contributor


I would omit the sentence about on-demand computation and the 60-second cache. That is an internal detail, subject to change, and not part of the API contract we expose.

| --- | --- | --- |
| `repo` | Repository name | `my-org/my-repo` |
| `branch` | Base branch name | `main`, `develop` |
| `queue_type` | Queue type | `main`, `bisection` |
Contributor


To clarify: queue_type is an enum and will only ever be main or bisection, whereas the examples for repo and branch are good but those values are ultimately freeform text.


| Metric | Type | Description |
| --- | --- | --- |
| `mq_queue_depth_current` | Gauge | Number of PRs currently in the queue (excludes NOT\_READY PRs) |
Contributor


Suggested change:
- | `mq_queue_depth_current` | Gauge | Number of PRs currently in the queue (excludes NOT\_READY PRs) |
+ | `mq_depth_current` | Gauge | Number of PRs currently in the queue (excludes PRs that are waiting to be mergeable before being admitted to the queue) |

| `mq_pr_restarts_1h_total` | Gauge || PR restarts (TESTING to PENDING transitions) in the last hour |
| `mq_pr_wait_duration_1h_seconds` | Histogram | `le` (bucket boundary) | Distribution of time PRs spent waiting before testing starts |
| `mq_pr_test_duration_1h_seconds` | Histogram | `le` (bucket boundary) | Distribution of time PRs spent in the testing phase |
| `mq_pr_time_in_queue_1h_seconds` | Histogram | `conclusion`, `le` | Distribution of total time in queue for PRs that concluded in the last hour |
Contributor


This metric (`mq_pr_time_in_queue_1h_seconds`) is not available yet. Please remove it for now.

Comment on lines +230 to +231
# P95 total time in queue for merged PRs
histogram_quantile(0.95, mq_pr_time_in_queue_1h_seconds_bucket{conclusion="merged"})
Contributor


Omit this query until the time-in-queue metric is added.

| Parameter | Required | Description |
| --- | --- | --- |
| `repo` | No | Repository in `owner/name` format. If omitted, returns metrics for all repositories in the organization. |
| `repoHost` | No | Repository host. Defaults to `github.com`. |
Contributor


repoHost does not have a default, and it is required if repo is passed. See the note above.

