fix: provisioning retry loop should detect ErrImageNeverPull and fail permanently #190

## Problem

When a repo pod fails to start due to `ErrImageNeverPull` (the agent image isn't present on the node and the pull policy forbids pulling it), the task worker retries indefinitely. Each retry creates a new pod that fails immediately; the worker still waits out the full 120s timeout, then re-queues. This creates a runaway loop that:

- Accumulates dozens of dead pods (55+ observed in one incident)
- Wastes cluster resources
- Never succeeds (the image won't magically appear)
- Fills logs with repeated timeout errors

### Observed behavior

Task `ebdf2deb` entered this loop for 2+ hours, creating a new pod every ~2.5 minutes, accumulating 55 dead pods all stuck in `ErrImageNeverPull`.

## Root cause

The provisioning retry logic in `task-worker.ts` catches pod timeout errors and re-queues with a `provretry` job, but it:

1. Has no max retry count for provisioning failures (separate from the task's `maxRetries` which is for agent failures)
2. Doesn't inspect the pod's actual failure reason — it only sees "timed out waiting for Running state"
3. Doesn't clean up the failed pod before creating a new one

## Proposed fix

### 1. Detect unrecoverable pod failures early
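In `waitForPodRunning()` or the provisioning path in `repo-pool-service.ts`, check the pod's container status before and during the 120s wait. If a container is waiting with reason `ErrImageNeverPull`, `ImagePullBackOff`, or `InvalidImageName`, fail immediately instead of waiting for the timeout. Note that `ImagePullBackOff` can also be transient (for example, during a registry outage), so it may deserve a retry or two before being treated as permanent.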
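
A minimal sketch of the status check, assuming the caller already has the pod's `status` object from the Kubernetes API. `findTerminalImageFailure`, `PodStatusLike`, and `TERMINAL_IMAGE_REASONS` are hypothetical names, not existing code in this repo; the field path (`containerStatuses[].state.waiting.reason`) follows the Kubernetes pod status schema.

```typescript
// Only the pod status fields this check reads (hypothetical local types).
interface ContainerStatusLike {
  name: string;
  state?: { waiting?: { reason?: string; message?: string } };
}

interface PodStatusLike {
  phase?: string;
  initContainerStatuses?: ContainerStatusLike[];
  containerStatuses?: ContainerStatusLike[];
}

// Waiting reasons this issue proposes to treat as terminal.
const TERMINAL_IMAGE_REASONS = new Set<string>([
  "ErrImageNeverPull",
  "ImagePullBackOff",
  "InvalidImageName",
]);

interface TerminalImageFailure {
  container: string;
  reason: string;
  message: string;
}

// Returns the first container stuck on a terminal image error, or null.
function findTerminalImageFailure(
  status: PodStatusLike,
): TerminalImageFailure | null {
  const all = [
    ...(status.initContainerStatuses ?? []),
    ...(status.containerStatuses ?? []),
  ];
  for (const c of all) {
    const reason = c.state?.waiting?.reason;
    if (reason !== undefined && TERMINAL_IMAGE_REASONS.has(reason)) {
      return {
        container: c.name,
        reason,
        message: c.state?.waiting?.message ?? "",
      };
    }
  }
  return null;
}
```

`waitForPodRunning()` could call this on every poll and throw a typed error as soon as it returns non-null, so the caller sees the real reason rather than a generic timeout.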

### 2. Cap provisioning retries
Add a max provisioning retry count (e.g., 3 to 5), tracked separately from the task's `maxRetries`. Once the cap is reached, fail the task permanently with a clear error message: "Pod failed to start: image 'optio-node:latest' not found locally. Run `./images/build.sh node` to build it."
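
One way the cap could look, as a sketch only. `decideProvisioningRetry`, `MAX_PROVISIONING_RETRIES`, and the `ProvisioningDecision` shape are assumptions for illustration; how the retry count is carried on the `provretry` job is not specified in this issue.

```typescript
// Assumed cap; the issue suggests something in the 3 to 5 range.
const MAX_PROVISIONING_RETRIES = 3;

type ProvisioningDecision =
  | { action: "retry"; nextRetry: number }
  | { action: "fail"; error: string };

// retriesSoFar: provisioning retries already consumed by this task.
// terminal: true when the failure cannot be fixed by retrying
//           (for example ErrImageNeverPull).
function decideProvisioningRetry(
  retriesSoFar: number,
  terminal: boolean,
  reason: string,
  maxRetries: number = MAX_PROVISIONING_RETRIES,
): ProvisioningDecision {
  if (terminal) {
    // Retrying cannot help, so skip the remaining budget.
    return { action: "fail", error: `Pod failed to start: ${reason}` };
  }
  if (retriesSoFar >= maxRetries) {
    return {
      action: "fail",
      error: `Pod failed to start after ${retriesSoFar} provisioning retries: ${reason}`,
    };
  }
  return { action: "retry", nextRetry: retriesSoFar + 1 };
}
```

Keeping the decision in a pure function makes the cap easy to unit test apart from the queue and the cluster.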

### 3. Clean up failed pods on retry
When a provisioning attempt fails, delete the failed pod before creating a new one. Currently they pile up.
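
A sketch of cleanup on failure, written against a hypothetical `PodRuntime` interface rather than the real container-runtime API, whose signatures are not shown in this issue. `provisionOnce` is likewise an illustrative name.

```typescript
// Hypothetical abstraction over the container runtime.
interface PodRuntime {
  createPod(name: string): Promise<void>;
  // Rejects if the pod does not reach Running (timeout or terminal error).
  waitForPodRunning(name: string, timeoutMs: number): Promise<void>;
  deletePod(name: string): Promise<void>;
}

// Runs one provisioning attempt and deletes the pod if the attempt fails.
async function provisionOnce(
  runtime: PodRuntime,
  podName: string,
  timeoutMs: number = 120_000,
): Promise<void> {
  await runtime.createPod(podName);
  try {
    await runtime.waitForPodRunning(podName, timeoutMs);
  } catch (err) {
    // Best-effort cleanup: a delete failure is swallowed so it cannot
    // mask the original provisioning error.
    await runtime.deletePod(podName).catch(() => undefined);
    throw err;
  }
}
```

Deleting inside the failing attempt, rather than at the start of the next one, means the pod is also removed when the task fails permanently and no further retry runs.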

### 4. Classify the error
Add `ErrImageNeverPull` / `ImagePullBackOff` to the error classifier in `packages/shared/src/error-classifier.ts` with category `image` and a helpful remedy message.
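
The existing classifier's shape is not shown here, so the following is only a sketch of what the new patterns might look like. `ClassifiedError`, `IMAGE_ERROR_PATTERNS`, and `classifyImageError` are hypothetical names, and the remedy text for `ImagePullBackOff` is an assumption.

```typescript
// Hypothetical result shape; the real classifier may differ.
interface ClassifiedError {
  category: string;
  title: string;
  remedy: string;
}

interface ImageErrorPattern {
  pattern: RegExp;
  title: string;
  remedy: string;
}

const IMAGE_ERROR_PATTERNS: ImageErrorPattern[] = [
  {
    pattern: /ErrImageNeverPull/,
    title: "Agent image not found locally",
    // Remedy taken from the error message proposed in this issue.
    remedy: "Run `./images/build.sh node` to build it.",
  },
  {
    pattern: /ImagePullBackOff/,
    title: "Agent image could not be pulled",
    // Assumed wording, not from the issue.
    remedy: "Check that the image name and tag exist and that the cluster can reach the registry.",
  },
];

// Returns an `image` classification for known image errors, else null.
function classifyImageError(message: string): ClassifiedError | null {
  for (const p of IMAGE_ERROR_PATTERNS) {
    if (p.pattern.test(message)) {
      return { category: "image", title: p.title, remedy: p.remedy };
    }
  }
  return null;
}
```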

## Files to modify

- `packages/container-runtime/src/kubernetes.ts` — `waitForPodRunning()`: check container statuses for terminal image errors, fail fast
- `apps/api/src/services/repo-pool-service.ts` — `createRepoPod()`: delete failed pod on retry, add retry counter
- `apps/api/src/workers/task-worker.ts` — provisioning retry path: respect max provisioning retries, pass failure reason
- `packages/shared/src/error-classifier.ts` — add image pull error patterns
