fix: provisioning retry loop should detect ErrImageNeverPull and fail permanently #190

@jonwiggins

Problem

When a repo pod fails to start due to ErrImageNeverPull (the agent image doesn't exist locally), the task worker retries indefinitely. Each retry creates a new pod that fails immediately, waits out the 120s timeout, then re-queues the task. This creates a runaway loop that:

  • Accumulates dozens of dead pods (55+ observed in one incident)
  • Wastes cluster resources
  • Never succeeds (the image won't magically appear)
  • Fills logs with repeated timeout errors

Observed behavior

Task ebdf2deb entered this loop for 2+ hours, creating a new pod every ~2.5 minutes, accumulating 55 dead pods all stuck in ErrImageNeverPull.

Root cause

The provisioning retry logic in task-worker.ts catches pod timeout errors and re-queues with a provretry job, but it:

  1. Has no max retry count for provisioning failures (separate from the task's maxRetries which is for agent failures)
  2. Doesn't inspect the pod's actual failure reason — it only sees "timed out waiting for Running state"
  3. Doesn't clean up the failed pod before creating a new one

Proposed fix

1. Detect unrecoverable pod failures early

In waitForPodRunning() or the provisioning path in repo-pool-service.ts, check the pod's container statuses before and during the 120s wait. If any container's waiting reason is ErrImageNeverPull, ImagePullBackOff, or InvalidImageName, fail immediately instead of waiting for the timeout.
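
A minimal sketch of the check described above, assuming the pod status has already been fetched. It only reads the standard Kubernetes fields containerStatuses[].state.waiting.reason and initContainerStatuses; the type names and findTerminalImageFailure() are hypothetical and do not come from the existing code.

```typescript
// Minimal shape of the Kubernetes pod status fields this check reads.
interface ContainerStatusLike {
  name: string;
  state?: { waiting?: { reason?: string; message?: string } };
}

interface PodStatusLike {
  initContainerStatuses?: ContainerStatusLike[];
  containerStatuses?: ContainerStatusLike[];
}

// Waiting reasons this issue proposes treating as unrecoverable.
const TERMINAL_IMAGE_REASONS = new Set<string>([
  "ErrImageNeverPull",
  "ImagePullBackOff",
  "InvalidImageName",
]);

// Hypothetical result type: which container is stuck and why.
interface TerminalImageFailure {
  container: string;
  reason: string;
  message: string;
}

// Returns the first container stuck on a terminal image error, or null if none is.
function findTerminalImageFailure(status: PodStatusLike): TerminalImageFailure | null {
  const all = [
    ...(status.initContainerStatuses ?? []),
    ...(status.containerStatuses ?? []),
  ];
  for (const c of all) {
    const reason = c.state?.waiting?.reason;
    if (reason !== undefined && TERMINAL_IMAGE_REASONS.has(reason)) {
      return { container: c.name, reason, message: c.state?.waiting?.message ?? "" };
    }
  }
  return null;
}
```

A polling loop could call this on every status read and throw as soon as it returns non-null. One caveat: ImagePullBackOff can also follow a transient registry error, so it may be worth tolerating it for a few polls before failing, while ErrImageNeverPull and InvalidImageName will not resolve on their own.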

2. Cap provisioning retries

Add a max provisioning retry count (e.g., 3-5). After that, fail the task permanently with a clear error message: "Pod failed to start: image 'optio-node:latest' not found locally. Run ./images/build.sh node to build it."
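
One way the cap could look, as a sketch only: a pure decision function the worker consults before re-queuing. MAX_PROVISIONING_RETRIES, ProvisioningDecision, and decideProvisioningRetry() are hypothetical names, and the default of 3 is just the low end of the range suggested above.

```typescript
// Hypothetical default; this issue suggests a cap somewhere in the 3-5 range.
const MAX_PROVISIONING_RETRIES = 3;

type ProvisioningDecision =
  | { action: "retry"; retryCount: number }
  | { action: "fail"; error: string };

// retriesSoFar: provisioning retries already spent on this task (0 after the first failure).
// terminal: true when the failure can never succeed on retry (e.g. ErrImageNeverPull).
function decideProvisioningRetry(
  retriesSoFar: number,
  failureReason: string,
  terminal: boolean,
  maxRetries: number = MAX_PROVISIONING_RETRIES,
): ProvisioningDecision {
  if (terminal) {
    // No point spending retries on an error that cannot resolve itself.
    return { action: "fail", error: `Pod failed to start: ${failureReason}` };
  }
  if (retriesSoFar >= maxRetries) {
    return {
      action: "fail",
      error: `Pod failed to start after ${retriesSoFar} provisioning retries: ${failureReason}`,
    };
  }
  return { action: "retry", retryCount: retriesSoFar + 1 };
}
```

Keeping this counter separate from the task's maxRetries preserves the distinction drawn under Root cause: provisioning failures and agent failures are budgeted independently.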

3. Clean up failed pods on retry

When a provisioning attempt fails, delete the failed pod before creating a new one. Currently they pile up.
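
A sketch of the cleanup step, assuming the runtime exposes create, wait, and delete operations. The PodRuntime interface and createPodWithCleanup() are hypothetical stand-ins; the real signatures in kubernetes.ts and repo-pool-service.ts may differ.

```typescript
// Hypothetical slice of the container runtime; method names are illustrative only.
interface PodRuntime {
  createPod(name: string): Promise<void>;
  waitForPodRunning(name: string, timeoutMs: number): Promise<void>;
  deletePod(name: string): Promise<void>;
}

// Creates a pod and waits for it to run. If the wait fails, the pod is deleted
// before the original error is rethrown, so no dead pod is left behind.
async function createPodWithCleanup(
  runtime: PodRuntime,
  name: string,
  timeoutMs: number = 120000,
): Promise<void> {
  await runtime.createPod(name);
  try {
    await runtime.waitForPodRunning(name, timeoutMs);
  } catch (err) {
    // Best-effort delete: a cleanup failure must not mask the provisioning error.
    await runtime.deletePod(name).catch(() => undefined);
    throw err;
  }
}
```

Deleting in the failure path, rather than at the start of the next attempt, means the pod is also removed when the task is failed permanently and no further attempt happens.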

4. Classify the error

Add ErrImageNeverPull / ImagePullBackOff to the error classifier in packages/shared/src/error-classifier.ts with category image and a helpful remedy message.
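
A rough sketch of what such a classifier entry might look like. The ClassifiedError shape and classifyImageError() are assumptions made for illustration; the existing structure of error-classifier.ts is not shown in this issue and may differ.

```typescript
// Hypothetical classifier output; the real type in error-classifier.ts may differ.
interface ClassifiedError {
  category: string;
  title: string;
  remedy: string;
}

// Matches the Kubernetes waiting reasons named in this issue.
const IMAGE_ERROR_PATTERN = /ErrImageNeverPull|ImagePullBackOff|InvalidImageName/;

// Returns an "image" classification when the message mentions an image error, else null.
function classifyImageError(message: string): ClassifiedError | null {
  const match = IMAGE_ERROR_PATTERN.exec(message);
  if (match === null) {
    return null;
  }
  return {
    category: "image",
    title: `Container image could not be used (${match[0]})`,
    remedy:
      "Check that the agent image exists and that its name is valid. " +
      "For a local image, build it first, e.g. ./images/build.sh node.",
  };
}
```

This only works if the provisioning path passes the pod's real failure reason into the error message; the generic "timed out waiting for Running state" text would not match.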

Files to modify

  • packages/container-runtime/src/kubernetes.ts, waitForPodRunning(): check container statuses for terminal image errors, fail fast
  • apps/api/src/services/repo-pool-service.ts, createRepoPod(): delete failed pod on retry, add retry counter
  • apps/api/src/workers/task-worker.ts, provisioning retry path: respect max provisioning retries, pass failure reason
  • packages/shared/src/error-classifier.ts: add image pull error patterns

Labels: optio (Assigned to Optio AI agent)
