[Bug] The operator continuously reconciles the failed job without detecting the permanent failure #232
Open
Description
When an ApprovedImage is applied with an image that causes compute-pcrs to fail permanently (e.g., corrupted PE header), the operator continuously reconciles the failed job without detecting the permanent failure. This results in resource waste and misleading status.
To reproduce:
- Apply an ApprovedImage CR that causes the compute-pcrs job to fail
- Watch the job status, and check the logs of the operator and of the compute-pcrs pod
Current behavior:
- Exactly 7 compute-pcrs-* pods exist (1 initial + 6 retries with backoffLimit: 6). The newest pod is 10m old and no new pods are being created, so Kubernetes has correctly stopped retrying ✓
$ oc get jobs
NAME                                                              STATUS   COMPLETIONS   DURATION   AGE
compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo   Failed   0/1           67m        67m
$ oc get job compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo -o yaml
......
status:
  conditions:
  - lastProbeTime: "2026-04-02T05:58:34Z"
    lastTransitionTime: "2026-04-02T05:58:34Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: FailureTarget
  - lastProbeTime: "2026-04-02T05:58:34Z"
    lastTransitionTime: "2026-04-02T05:58:34Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 7
  ready: 0
  startTime: "2026-04-02T05:46:57Z"
  terminating: 0
  uncountedTerminatedPods: {}
- The operator continues to reconcile every 300s, which is unexpected
$ oc get approvedimage image-latest -o yaml
......
status:
  conditions:
  - lastTransitionTime: "2026-04-02T05:46:58Z"
    message: Computation is ongoing. Check jobs for progress.
    observedGeneration: 1
    reason: Computing   # <-- still shows Computing
    status: "False"
    type: Committed
$ oc logs -f confidential-cluster-operator-6c7f547f8-km8p5 --timestamps
...
2026-04-02T06:58:34.627721122Z [INFO kube_runtime::controller] reconciling object; object.ref=Job.v1.batch/compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo.confidential-clusters object.reason=reconciler requested retry
2026-04-02T06:58:34.627721122Z [INFO operator::reference_values] Job compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo changed, but had not completed
2026-04-02T06:58:34.627803038Z [INFO operator] reconciled (ObjectRef { dyntype: (), name: "compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo", namespace: Some("confidential-clusters"), extra: Extra { resource_version: Some("25925947"), uid: Some("494ef462-9837-4002-b3d4-60d783447c25") } }, Action { requeue_after: Some(300s) })
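The Job status above already carries a `Failed` condition with `status: "True"`, yet the operator logs "changed, but had not completed". One possible way to tell a terminal failure apart from a job that is still running is sketched below. This is only an illustration: `JobCondition`, `JobOutcome`, and `classify_job` are hypothetical names, and the plain structs stand in for the Kubernetes API types the operator actually uses.

```rust
/// Hypothetical stand-in for one entry of a Job's `status.conditions`.
/// The real operator would read these fields from the Kubernetes API types.
#[derive(Debug, Clone)]
pub struct JobCondition {
    pub type_: String,
    pub status: String,
    pub reason: Option<String>,
}

/// Hypothetical classification of a Job, as seen by the reconciler.
#[derive(Debug, PartialEq, Eq)]
pub enum JobOutcome {
    /// No terminal condition is set yet.
    InProgress,
    /// A `Complete` condition with status "True" is present.
    Succeeded,
    /// A `Failed` condition with status "True" is present.
    Failed { reason: String },
}

/// Inspect the conditions and report whether the Job reached a terminal state.
/// Only conditions whose status is "True" are considered; `FailureTarget`
/// is ignored here, so the Job counts as failed once `Failed` itself is set.
pub fn classify_job(conditions: &[JobCondition]) -> JobOutcome {
    for c in conditions.iter().filter(|c| c.status == "True") {
        match c.type_.as_str() {
            "Failed" => {
                return JobOutcome::Failed {
                    reason: c.reason.clone().unwrap_or_else(|| "Unknown".to_string()),
                };
            }
            "Complete" => return JobOutcome::Succeeded,
            _ => {}
        }
    }
    JobOutcome::InProgress
}
```

Fed the two conditions from the Job shown above (`FailureTarget` and `Failed`, both with reason `BackoffLimitExceeded`), this sketch would classify the job as failed rather than in progress.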
Expected result:
- Detect Permanent Failures
- After backoffLimit (6) retries, detect that the job has permanently failed
- Update ApprovedImage status to reason: Failed
- Stop retrying the computation
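The expected behavior could be expressed as a small decision function that maps the job's state to the `Committed` condition and to a requeue interval. The following is a rough sketch under stated assumptions, not the operator's actual code: `JobOutcome`, `Committed`, `Decision`, and `decide` are hypothetical names, and the success reason `Computed` is invented for illustration. The `Computing` reason, its message, and the 300s interval are taken from the output above.

```rust
use std::time::Duration;

/// Interval seen in the operator log above (`requeue_after: Some(300s)`).
const POLL_INTERVAL: Duration = Duration::from_secs(300);

/// Hypothetical summary of the compute-pcrs Job's state.
#[derive(Debug, PartialEq, Eq)]
pub enum JobOutcome {
    InProgress,
    Succeeded,
    Failed { reason: String },
}

/// Hypothetical view of the ApprovedImage `Committed` condition.
#[derive(Debug, PartialEq, Eq)]
pub struct Committed {
    pub status: &'static str,
    pub reason: &'static str,
    pub message: String,
}

/// What the reconciler would write to the status and when it would run again.
#[derive(Debug, PartialEq, Eq)]
pub struct Decision {
    pub condition: Committed,
    /// `None` means: do not poll again, wait for the object to change.
    pub requeue_after: Option<Duration>,
}

pub fn decide(outcome: &JobOutcome) -> Decision {
    match outcome {
        // Still running: keep the current status and keep polling.
        JobOutcome::InProgress => Decision {
            condition: Committed {
                status: "False",
                reason: "Computing",
                message: "Computation is ongoing. Check jobs for progress.".to_string(),
            },
            requeue_after: Some(POLL_INTERVAL),
        },
        // Finished: the reason string here is an assumption, not from the issue.
        JobOutcome::Succeeded => Decision {
            condition: Committed {
                status: "True",
                reason: "Computed",
                message: "Computation finished.".to_string(),
            },
            requeue_after: None,
        },
        // Permanent failure: surface it in the status and stop retrying.
        JobOutcome::Failed { reason } => Decision {
            condition: Committed {
                status: "False",
                reason: "Failed",
                message: format!("compute-pcrs job failed permanently: {reason}"),
            },
            requeue_after: None,
        },
    }
}
```

In a kube-rs reconciler, `Some(d)` would roughly correspond to returning `Action::requeue(d)` and `None` to `Action::await_change()`, so a permanently failed job would no longer be reconciled every 300s.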