
fix: retry SQS messages when GitHub API fails after EC2 instance creation #5029

Draft
edersonbrilhante wants to merge 1 commit into github-aws-runners:main from edersonbrilhante:fix-retry-gh-errors

Conversation

@edersonbrilhante
Contributor

Closes #5024

When createStartRunnerConfig fails with a GitHub HTTP error (e.g., 401, 404, 422) after EC2 instances have already been created, the SQS messages are consumed and permanently deleted. The workflow jobs are silently lost: runners can't register, and the jobs remain queued in GitHub until they time out after 24 hours.
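
For background on why deletion is the default, here is a minimal sketch of Lambda's SQS partial batch response contract (ReportBatchItemFailures): only message IDs listed in `batchItemFailures` are returned to the queue for redelivery, and everything else in the batch is deleted. The handler body is illustrative and is not the project's actual lambda.ts:

```typescript
import { SQSEvent, SQSBatchResponse, SQSHandler } from 'aws-lambda';

// Illustrative handler shape only, not the project's actual lambda.ts.
export const handler: SQSHandler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  try {
    // ... create EC2 instances, then fetch runner registration config from GitHub ...
    return { batchItemFailures: [] }; // empty list => every message in the batch is deleted
  } catch {
    // Listing a messageId here makes SQS redeliver that message later.
    return {
      batchItemFailures: event.Records.map((r) => ({ itemIdentifier: r.messageId })),
    };
  }
};
```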

This change introduces GHHttpError, a subclass of ScaleError, that catches Octokit HttpError exceptions in createStartRunnerConfig and wraps them. Because GHHttpError extends ScaleError, the existing ScaleError catch in scaleUpHandler handles it via polymorphism, so no changes are needed in lambda.ts. Unlike ScaleError (which retries only failedInstanceCount messages), GHHttpError overrides toBatchItemFailures() to retry all messages, since the GitHub API failure affects every instance in the batch.
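
For illustration, a minimal sketch of the class relationship described above. Only the names GHHttpError, ScaleError, failedInstanceCount, and toBatchItemFailures come from this PR; the constructor signatures, the statusCode field, and the "tail of the batch" heuristic are assumptions:

```typescript
import { SQSRecord, SQSBatchItemFailure } from 'aws-lambda';

// Sketch only: the real ScaleError lives in the scale-up module and may differ.
class ScaleError extends Error {
  constructor(
    message: string,
    public readonly failedInstanceCount: number = 0,
  ) {
    super(message);
    this.name = 'ScaleError';
  }

  // Retry only the messages whose instances failed to launch
  // (assumed here to be the tail of the batch).
  toBatchItemFailures(records: SQSRecord[]): SQSBatchItemFailure[] {
    return records
      .slice(records.length - this.failedInstanceCount)
      .map((r) => ({ itemIdentifier: r.messageId }));
  }
}

// A GitHub API failure affects every instance in the batch,
// so every message must be retried.
class GHHttpError extends ScaleError {
  constructor(message: string, public readonly statusCode: number) {
    super(message);
    this.name = 'GHHttpError';
  }

  override toBatchItemFailures(records: SQSRecord[]): SQSBatchItemFailure[] {
    return records.map((r) => ({ itemIdentifier: r.messageId }));
  }
}
```

Under this shape, a single catch in scaleUpHandler can test `e instanceof ScaleError` and return `{ batchItemFailures: e.toBatchItemFailures(event.Records) }`; dynamic dispatch then picks the GHHttpError override when the failure came from the GitHub API.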

@Brend-Smits
Contributor

This seems similar to #4990; could you please have a look?

@edersonbrilhante
Contributor Author

@Brend-Smits your implementation is so much better than mine.



Development

Successfully merging this pull request may close these issues.

Scale-up Lambda silently drops entire SQS batch on non-ScaleError exceptions
