fix: add health check to detect and recover stale EphemeralRunner registrations #4399
Conversation
## Summary

Adds a GitHub-side registration health check to the `EphemeralRunnerReconciler`. During reconciliation of a running EphemeralRunner that has a valid `RunnerId`, the controller now calls `GetRunner()` against the GitHub Actions API to verify the registration still exists. If the API returns 404 (registration gone), the runner is marked as failed so the `EphemeralRunnerSet` can provision a replacement.

This also fixes `deleteRunnerFromService` to tolerate 404 from `RemoveRunner`, since a runner whose registration was already invalidated will also 404 on removal. Previously this caused `markAsFailed` to error and trigger a requeue loop, preventing cleanup of stale runners.

A redundant pod phase check was removed from the health check path: `checkRunnerRegistration` is only called from within the `cs.State.Terminated == nil` branch (container confirmed running), so the phase check was unnecessary and could cause false negatives, since the local `EphemeralRunner` object may not have the updated phase yet.

## Changes

- **`ephemeralrunner_controller.go`**: New `checkRunnerRegistration()` method that calls `GetRunner()` and returns unhealthy only on a confirmed 404. Transient API errors (500, timeouts, etc.) are logged and ignored to avoid false positives. Called during reconciliation when a running pod has a non-zero `RunnerId`.
- **`ephemeralrunner_controller.go`**: `deleteRunnerFromService()` now treats 404 from `RemoveRunner` as success, since the runner is already gone.
- **`ephemeralrunner_controller_test.go`**: Two new test cases: one verifying that a 404 from `GetRunner` marks the runner as failed, another verifying that a 500 does not. Both simulate a running pod container status before the health check triggers.
- **`fake/client.go`**: New `WithRemoveRunnerError` option for the fake client to support testing the 404 removal path.

## Issues Fixed

### Directly fixes

- **actions#4396** (*ARC runners did not recover automatically after GitHub Outage*): Runners that lost their GitHub-side registration during the [2026-03-05 GitHub Actions incident](https://www.githubstatus.com/incidents/g9j4tmfqdd09) remained stuck indefinitely because the `EphemeralRunnerReconciler` never verified registration validity. This change adds exactly that verification: on each reconciliation of a running runner, `GetRunner()` confirms the registration exists. If it returns 404, the runner is marked failed and replaced.
- **actions#4395** (*GHA Self-hosted runner pods are running but the runner status is showing offline*): Runner pods were running for 5-7+ hours showing "Listening for Jobs" but marked offline in GitHub. The pods' registrations were invalidated server-side (same incident), but the controller had no mechanism to detect this. The new health check detects the 404 and triggers cleanup, preventing these zombie runners from accumulating.

### Partially addresses

- **actions#4397** (*Stale TotalAssignedJobs causes permanent over-provisioning after platform incidents*): This issue has two components: (1) zombie runner pods occupying slots, and (2) the listener's `TotalAssignedJobs` remaining inflated. This PR fixes component (1): by detecting and cleaning up runners with invalidated registrations, the controller stops accumulating zombie pods. However, the listener-side `TotalAssignedJobs` inflation (a separate code path in `listener.go` → `worker.go`) is not addressed by this change and still requires either a session reconnect mechanism or CR deletion to clear.
- **actions#4307** (*Ephemeral Runners seem to get stuck when job is canceled or interrupted*): Reports runners getting stuck with `Registration <uuid> was not found` errors in BrokerServer backoff loops after job cancellation. The health check would detect that the registration is gone (404 from `GetRunner`) and mark these runners as failed rather than leaving them in an infinite backoff loop.
- **actions#4155** (*EphemeralRunner and its pods left stuck Running after runner OOMKill*): Runners that are OOMKilled can end up in a state where the pod is technically running but the runner process is non-functional, and the registration may become stale. The health check provides a secondary detection mechanism: if the GitHub-side registration is invalidated for a non-functional runner, it will be caught and cleaned up.
- **actions#3821** (*AutoscalingRunnerSet gets stuck thinking runners exist when they do not*): Reports the AutoscalingRunnerSet believing runners exist when no pods are present, requiring CR deletion to recover. While the root cause may vary, stale registrations that the controller cannot clean up (due to `RemoveRunner` 404 errors causing requeue loops) are one contributing factor. The 404 tolerance in `deleteRunnerFromService` directly addresses this cleanup path.

## Test Plan

- [x] Unit test: `GetRunner` returning 404 → runner marked as failed
- [x] Unit test: `GetRunner` returning 500 → runner NOT marked as failed (transient error tolerance)
- [x] Unit test: `RemoveRunner` returning 404 → deletion succeeds (runner already gone)
- [ ] Integration: Deploy to a test cluster, manually delete a runner's registration via the GitHub API, verify the controller detects and replaces it
- [ ] Integration: Simulate incident conditions by blocking `broker.actions.githubusercontent.com` for running runners, then unblocking; verify stale runners are detected and replaced
Pull request overview
Adds a GitHub-side registration health check to EphemeralRunnerReconciler so the controller can detect runners whose GitHub registration was invalidated (e.g., during a GitHub outage) and trigger replacement/cleanup. Also makes runner removal resilient to 404s so stale runners don’t get stuck in a requeue loop.
Changes:
- Add `checkRunnerRegistration()` that calls `GetRunner()` and marks the EphemeralRunner failed only on confirmed 404.
- Treat 404 from `RemoveRunner()` as success in `deleteRunnerFromService()`.
- Add controller tests for 404 (marks failed) vs 500 (does not mark failed) and extend the fake client to simulate `RemoveRunner` errors.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `controllers/actions.github.com/ephemeralrunner_controller.go` | Adds registration health check during reconciliation and tolerates 404 on runner removal. |
| `controllers/actions.github.com/ephemeralrunner_controller_test.go` | Adds tests covering 404 invalidation and transient GitHub API errors. |
| `github/actions/fake/client.go` | Adds `WithRemoveRunnerError` option to support new test scenarios. |
```go
// Check if runner's GitHub registration is still valid
healthy, err := r.checkRunnerRegistration(ctx, ephemeralRunner, log)
if err != nil {
	return ctrl.Result{}, err
}
```
checkRunnerRegistration is invoked whenever cs.State.Terminated == nil, but that condition also covers Waiting container states (image pull, crashloop backoff, etc.) and doesn’t guarantee the runner container is actually running. This can trigger GitHub registration checks before the runner process is up and may incorrectly mark the EphemeralRunner failed on a 404. Consider gating the health check on cs.State.Running != nil (and/or pod.Status.Phase == PodRunning) before calling it.
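The difference between the two gates can be shown with minimal stand-in types (the real ones live in `k8s.io/api/core/v1`; this sketch only mirrors the pointer-nil semantics): a Waiting container passes the current `Terminated == nil` check but fails the suggested `Running != nil` check.

```go
package main

import "fmt"

// containerState is a minimal stand-in for corev1.ContainerState:
// exactly one of the three pointers is non-nil at a time.
type containerState struct {
	Waiting    *struct{}
	Running    *struct{}
	Terminated *struct{}
}

// oldGate mirrors the current condition: anything not Terminated,
// which also matches Waiting states like ImagePullBackOff.
func oldGate(cs containerState) bool { return cs.Terminated == nil }

// suggestedGate fires only when the container is confirmed Running.
func suggestedGate(cs containerState) bool { return cs.Running != nil }

func main() {
	waiting := containerState{Waiting: &struct{}{}}
	running := containerState{Running: &struct{}{}}

	fmt.Println(oldGate(waiting), suggestedGate(waiting)) // true false
	fmt.Println(oldGate(running), suggestedGate(running)) // true true
}
```

The `true false` line is the case the review comment is worried about: the old gate would run the registration check against a container that has not started yet.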
```go
// Check if runner's GitHub registration is still valid
healthy, err := r.checkRunnerRegistration(ctx, ephemeralRunner, log)
if err != nil {
	return ctrl.Result{}, err
}
```
This adds an external GetRunner API call on the hot path for running pods and will execute on every reconcile triggered by pod updates. Since pod status updates can be frequent, this may significantly increase GitHub API usage and risk rate limiting at scale. Consider throttling (e.g., record last check time in status/annotations and only check every N minutes) and/or explicitly RequeueAfter a fixed interval to make the check periodic rather than event-frequency dependent.
```go
if err := r.markAsFailed(ctx, ephemeralRunner, "Runner registration no longer exists on GitHub", "RegistrationInvalidated", log); err != nil {
	return ctrl.Result{}, err
}
return ctrl.Result{}, nil
```
The failure reason is introduced as a string literal ("RegistrationInvalidated"), while other failure reasons in this package are constants (e.g., ReasonInvalidPodFailure). To keep reasons consistent and avoid typos/drift, consider defining a ReasonRegistrationInvalidated constant (likely in constants.go) and using it here.
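The suggested change is small; a sketch (the constant name follows the review comment, and placing it in `constants.go` is an assumption about this repo's layout):

```go
package main

import "fmt"

// Suggested reason constant, mirroring the style of existing reasons
// such as ReasonInvalidPodFailure. Defining it once means the call site
// and any status matchers cannot drift apart via a typo.
const ReasonRegistrationInvalidated = "RegistrationInvalidated"

func main() {
	// The markAsFailed call site would then pass the constant
	// instead of the string literal.
	fmt.Println(ReasonRegistrationInvalidated)
}
```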
Hey @rob-howie-depop,

Thank you for creating this PR, but this is not the way we should approach the issue. The original problem is that the runner doesn't exit. If we were to check the runner on each reconcile loop, you would quickly get rate limited at a large enough scale. Since the source of the issue is that the runner doesn't exit, the runner should dictate how we react, not the controller.
Hey @nikola-jokic -
Which specific issue are you talking about? There are several linked, and mine had runners that were running fine with no GitHub registration, hence could never have jobs scheduled on them. This, coupled with some poorly implemented scaling logic within ARC, meant that I had stale runners which could never receive jobs, at the same time as not being able to scale. This deadlocked my CI and, based on the issues raised after the GitHub incident, other people's CIs as well.
This is a valid point and something I was considering. The default sync period is ~1m, so this is an extra 60 API calls per runner per hour. With a limit of 5,000 requests/hour, we quickly run into that issue. Sadly, GitHub provides no batch endpoint to fetch all runner registrations, so we will need to implement a time gate of some kind to avoid blowing through the API rate limit. Once every 10 minutes could work? That would balance the rate of API calls against the ability of runners to recover after the next GitHub outage.

I am open to other suggestions, but we need to fix this issue: ARC needs to be resilient to GitHub outages, given how common these have been in the past couple of years. I am happy to do the work to implement and test the PR if you can give me some direction on how best to implement this.
This should not occur. This is the point I'm trying to make. The runners should not run without registration. They should start, register, and talk to the API. If they can't, they should try to recover. If they still can't, they should exit. It is a simple contract between the controller and the runner. That is why I've said the runner should exit here, which would allow the controller to react properly. Checking the state with the API is simply not a good path forward.

May I ask, what do you mean by "some poorly implemented scaling logic within ARC"?
I guess this is another way to view the incident: we had one runner out of 50 with an active registration (so it could accept jobs); the rest were functionally dead (not accepting jobs, as they were not registered with GitHub) but still existed from the viewpoint of the ARC controller. In my mind, ARC should have scaled from 50 upwards since there were jobs pending (even if a number of existing runners were idle, from its view), but as I understand it, it was comparing the queue size of pending jobs (on the GitHub side) to the number of available runners (from its view), so it didn't scale because there were fewer jobs than available runners. There is another perspective here on what this looked like (a much better write-up).
Yep, agreed on this. The runners should not run without registration.
This is where I am a bit confused: we can move the logic that verifies the registration is still valid into the container (either directly or by proxy), but it's just that, shifting the polling of GitHub into the container. Does that make sense? Remember, we are talking about a container that starts, registers, and talks to the API, and then the registration goes missing after it has started; in our case this happened due to a GitHub incident.
Wanted to chime in on this thread as well. We too have experienced similar issues running ARC. Regardless of whether it is caused by a GitHub incident or other reasons, sometimes the runners do lose their registration and the pods do not terminate, which basically means the cluster fills up with unusable runner pods and will eventually cause jobs not to schedule at all (what I like to call a catastrophic queuing issue) as cluster resource constraints (CPU, memory) are reached. Currently, we run out-of-band automation that greps the pod logs and kills the underlying pods.

If I understand what you are saying, @nikola-jokic, the issue is somewhere in the runner logic (e.g., in actions/runner) where the runner doesn't exit with a non-zero exit code when the registration is lost. Is that a correct understanding? But to @rob-howie-depop's point, I'm a bit confused as to how that would work over in the actions/runner repo.