Problem
When a repo pod fails to start due to ErrImageNeverPull (the agent image doesn't exist locally), the task worker retries indefinitely. Each retry creates a new pod that immediately fails, waits 120s for timeout, then re-queues. This creates a runaway loop that:
- Accumulates dozens of dead pods (55+ observed in one incident)
- Wastes cluster resources
- Never succeeds (the image won't magically appear)
- Fills logs with repeated timeout errors
Observed behavior
Task ebdf2deb entered this loop for 2+ hours, creating a new pod every ~2.5 minutes, accumulating 55 dead pods all stuck in ErrImageNeverPull.
Root cause
The provisioning retry logic in task-worker.ts catches pod timeout errors and re-queues with a provretry job, but it:
- Has no max retry count for provisioning failures (separate from the task's maxRetries, which is for agent failures)
- Doesn't inspect the pod's actual failure reason — it only sees "timed out waiting for Running state"
- Doesn't clean up the failed pod before creating a new one
Proposed fix
1. Detect unrecoverable pod failures early
In waitForPodRunning() or the provisioning path in repo-pool-service.ts, check the pod's container status before/during the 120s wait. If the pod is in ErrImageNeverPull, ImagePullBackOff, or InvalidImageName, fail immediately — don't wait for the timeout.
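A minimal sketch of the fail-fast check, assuming the pod status is available in the standard Kubernetes shape (containerStatuses[].state.waiting.reason). The helper name findTerminalImageFailure and the local interfaces are hypothetical and only mirror the fields used here; the real waitForPodRunning() may be structured differently.

```typescript
// Minimal local shapes mirroring the Kubernetes pod status fields used below.
interface ContainerStatus {
  name: string;
  state?: { waiting?: { reason?: string; message?: string } };
}

interface PodStatus {
  containerStatuses?: ContainerStatus[];
  initContainerStatuses?: ContainerStatus[];
}

// Hypothetical result type for illustration.
interface TerminalImageFailure {
  container: string;
  reason: string;
  message: string | undefined;
}

// Waiting reasons this issue proposes to treat as unrecoverable.
const TERMINAL_IMAGE_REASONS = new Set<string>([
  "ErrImageNeverPull",
  "ImagePullBackOff",
  "InvalidImageName",
]);

// Returns the first container stuck in a terminal image state, or null.
function findTerminalImageFailure(
  status: PodStatus | undefined,
): TerminalImageFailure | null {
  const all = [
    ...(status?.initContainerStatuses ?? []),
    ...(status?.containerStatuses ?? []),
  ];
  for (const cs of all) {
    const waiting = cs.state?.waiting;
    if (waiting?.reason && TERMINAL_IMAGE_REASONS.has(waiting.reason)) {
      return {
        container: cs.name,
        reason: waiting.reason,
        message: waiting.message,
      };
    }
  }
  return null;
}
```

Calling this on each poll inside the wait loop would let the caller throw as soon as a non-null result appears, instead of sleeping until the 120s timeout.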
2. Cap provisioning retries
Add a max provisioning retry count (e.g., 3-5). After that, fail the task permanently with a clear error message: "Pod failed to start: image 'optio-node:latest' not found locally. Run ./images/build.sh node to build it."
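The cap could be expressed as a small pure decision function that the provisioning retry path consults before re-queueing. This is a sketch only: MAX_PROVISIONING_RETRIES, decideProvisioningRetry, and the ProvisioningDecision type are hypothetical names, and the value 3 is just the low end of the range suggested above.

```typescript
// Assumed cap; the issue suggests something in the 3-5 range.
const MAX_PROVISIONING_RETRIES = 3;

// Hypothetical decision type for illustration.
type ProvisioningDecision =
  | { action: "retry"; nextAttempt: number }
  | { action: "fail"; message: string };

// attempt: number of provisioning attempts already made (1-based).
// unrecoverable: true when the pod hit a terminal state such as ErrImageNeverPull.
function decideProvisioningRetry(
  attempt: number,
  unrecoverable: boolean,
  reason: string,
  max: number = MAX_PROVISIONING_RETRIES,
): ProvisioningDecision {
  if (unrecoverable || attempt >= max) {
    return {
      action: "fail",
      message: `Pod failed to start after ${attempt} provisioning attempt(s): ${reason}`,
    };
  }
  return { action: "retry", nextAttempt: attempt + 1 };
}
```

Keeping this counter separate from the task's maxRetries preserves the existing distinction between provisioning failures and agent failures.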
3. Clean up failed pods on retry
When a provisioning attempt fails, delete the failed pod before creating a new one. Currently they pile up.
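One way to guarantee cleanup is to wrap the wait in a try/catch that deletes the pod before rethrowing. The PodRuntime interface and provisionWithCleanup below are hypothetical stand-ins for whatever createRepoPod() and the container runtime actually expose; only waitForPodRunning is a name taken from this issue.

```typescript
// Hypothetical runtime interface; the real container-runtime API may differ.
interface PodRuntime {
  createPod(name: string): Promise<void>;
  waitForPodRunning(name: string): Promise<void>;
  deletePod(name: string): Promise<void>;
}

// Creates a pod and waits for it; on failure, deletes the pod before rethrowing
// so that failed pods do not accumulate across retries.
async function provisionWithCleanup(
  runtime: PodRuntime,
  podName: string,
): Promise<void> {
  await runtime.createPod(podName);
  try {
    await runtime.waitForPodRunning(podName);
  } catch (err) {
    try {
      await runtime.deletePod(podName);
    } catch {
      // Best-effort cleanup: a failed delete must not mask the original error.
    }
    throw err;
  }
}
```

The original error is always the one that propagates, so the retry path still sees the real failure reason.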
4. Classify the error
Add ErrImageNeverPull / ImagePullBackOff to the error classifier in packages/shared/src/error-classifier.ts with category image and a helpful remedy message.
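A sketch of what the classifier entry might look like. The actual shape of error-classifier.ts is not shown in this issue, so the ClassifiedError interface, classifyImageError function, and remedy wording are assumptions; only the image category and the build command come from the text above.

```typescript
// Hypothetical classifier result shape.
interface ClassifiedError {
  category: string;
  reason: string;
  remedy: string;
}

// Remedy text per Kubernetes waiting reason (wording is illustrative).
const IMAGE_ERROR_REMEDIES: Record<string, string> = {
  ErrImageNeverPull:
    "The agent image was not found locally. Run ./images/build.sh node to build it.",
  ImagePullBackOff:
    "The image could not be pulled. Check the image name, tag, and registry access.",
};

const IMAGE_ERROR_PATTERN = /ErrImageNeverPull|ImagePullBackOff/;

// Returns a classification when the message mentions a known image pull error.
function classifyImageError(message: string): ClassifiedError | null {
  const match = IMAGE_ERROR_PATTERN.exec(message);
  if (!match) return null;
  const reason = match[0];
  return {
    category: "image",
    reason,
    remedy: IMAGE_ERROR_REMEDIES[reason] ?? "Check the pod's image configuration.",
  };
}
```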
Files to modify
packages/container-runtime/src/kubernetes.ts — waitForPodRunning(): check container statuses for terminal image errors, fail fast
apps/api/src/services/repo-pool-service.ts — createRepoPod(): delete failed pod on retry, add retry counter
apps/api/src/workers/task-worker.ts — provisioning retry path: respect max provisioning retries, pass failure reason
packages/shared/src/error-classifier.ts — add image pull error patterns