[Bug] The operator continuously reconciles the failed job without detecting the permanent failure

When an ApprovedImage is applied with an image that causes compute-pcrs to fail permanently (e.g., corrupted PE header), the operator continuously reconciles the failed job without detecting the permanent failure. This results in resource waste and misleading status.

To reproduce:
1. Apply an ApprovedImage CR whose image causes the compute-pcrs job to fail
2. Watch the status of the job, and check the logs of the operator and of the compute-pcrs pods

Current behavior:
1. Exactly 7 compute-pcrs-* pods exist (1 initial + 6 retries with backoffLimit: 6). The newest pod is 10m old and no new pods are being created, so Kubernetes has correctly stopped retrying ✓
```
$ oc get jobs 
NAME                                                              STATUS   COMPLETIONS   DURATION   AGE
compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo   Failed   0/1           67m        67m
$ oc get job compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo -o yaml
......
status:
  conditions:
  - lastProbeTime: "2026-04-02T05:58:34Z"
    lastTransitionTime: "2026-04-02T05:58:34Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: FailureTarget
  - lastProbeTime: "2026-04-02T05:58:34Z"
    lastTransitionTime: "2026-04-02T05:58:34Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 7
  ready: 0
  startTime: "2026-04-02T05:46:57Z"
  terminating: 0
  uncountedTerminatedPods: {}
```
2. The operator continues to reconcile the failed job every 300s, which is unexpected, and the ApprovedImage status still reports Computing
```
$ oc get approvedimage  image-latest -o yaml
......
status:
  conditions:
  - lastTransitionTime: "2026-04-02T05:46:58Z"
    message: Computation is ongoing. Check jobs for progress.
    observedGeneration: 1
    reason: Computing  <------------------------------------------- still shows Computing
    status: "False"
    type: Committed

$ oc logs -f confidential-cluster-operator-6c7f547f8-km8p5 --timestamps
...
2026-04-02T06:58:34.627721122Z [INFO  kube_runtime::controller] reconciling object; object.ref=Job.v1.batch/compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo.confidential-clusters object.reason=reconciler requested retry
2026-04-02T06:58:34.627721122Z [INFO  operator::reference_values] Job compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo changed, but had not completed
2026-04-02T06:58:34.627803038Z [INFO  operator] reconciled (ObjectRef { dyntype: (), name: "compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo", namespace: Some("confidential-clusters"), extra: Extra { resource_version: Some("25925947"), uid: Some("494ef462-9837-4002-b3d4-60d783447c25") } }, Action { requeue_after: Some(300s) })
```
Expected result:
- Detect permanent failures: once the backoffLimit (6) retries are exhausted, recognize that the job has permanently failed
- Update ApprovedImage status to reason: Failed
- Stop retrying the computation
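
For illustration, the expected behavior could be sketched roughly as below. This is a minimal, self-contained sketch and not the operator's actual code: `JobCondition`, `JobOutcome`, `classify_job` and `should_requeue` are hypothetical names, and the struct is a simplified stand-in for the entries of the Job's `status.conditions` shown above. It assumes that a `Failed` condition with status `"True"` marks the job as terminal, while `FailureTarget` alone does not.

```rust
/// Hypothetical, simplified stand-in for one entry of a Job's
/// `status.conditions` (only the fields used in this sketch).
#[derive(Debug, Clone)]
struct JobCondition {
    type_: String,
    status: String,
    reason: Option<String>,
}

/// Hypothetical classification of a Job from the operator's point of view.
#[derive(Debug, PartialEq)]
enum JobOutcome {
    Succeeded,
    PermanentlyFailed { reason: String },
    InProgress,
}

/// Inspect the conditions and decide whether the Job is finished.
/// Assumption: only conditions with status "True" count, `Complete` means
/// success, and `Failed` (e.g. reason BackoffLimitExceeded) means the Job
/// will not be retried by Kubernetes any more.
fn classify_job(conditions: &[JobCondition]) -> JobOutcome {
    for c in conditions.iter().filter(|c| c.status == "True") {
        match c.type_.as_str() {
            "Complete" => return JobOutcome::Succeeded,
            "Failed" => {
                return JobOutcome::PermanentlyFailed {
                    reason: c.reason.clone().unwrap_or_else(|| "Unknown".to_string()),
                }
            }
            // Other condition types (e.g. FailureTarget) are not treated
            // as terminal in this sketch.
            _ => {}
        }
    }
    JobOutcome::InProgress
}

/// Only a Job that is still in progress needs a periodic requeue.
fn should_requeue(outcome: &JobOutcome) -> bool {
    matches!(outcome, JobOutcome::InProgress)
}
```

With the conditions from the job above (`FailureTarget` and `Failed`, both with reason `BackoffLimitExceeded`), `classify_job` would return `PermanentlyFailed`, at which point the operator could set the ApprovedImage reason to Failed and return without a requeue.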
