[Bug] The operator continuously reconciles the failed job without detecting the permanent failure #232

@yalzhang

Description

When an ApprovedImage is applied with an image that causes compute-pcrs to fail permanently (e.g., corrupted PE header), the operator continuously reconciles the failed job without detecting the permanent failure. This results in resource waste and misleading status.

To reproduce:

  1. Apply an ApprovedImage CR whose image causes the compute-pcrs job to fail
  2. Watch the status of the job, and check the logs of the operator and of the compute-pcrs pods

Current behavior:

  1. Exactly 7 compute-pcrs-* pods exist (1 initial + 6 retries with backoffLimit: 6). The newest pod is 10m old and no new pods are being created, so Kubernetes has correctly stopped retrying ✓
$ oc get jobs 
NAME                                                              STATUS   COMPLETIONS   DURATION   AGE
compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo   Failed   0/1           67m        67m
$ oc get job compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo -o yaml
......
status:
  conditions:
  - lastProbeTime: "2026-04-02T05:58:34Z"
    lastTransitionTime: "2026-04-02T05:58:34Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: FailureTarget
  - lastProbeTime: "2026-04-02T05:58:34Z"
    lastTransitionTime: "2026-04-02T05:58:34Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 7
  ready: 0
  startTime: "2026-04-02T05:46:57Z"
  terminating: 0
  uncountedTerminatedPods: {}
  2. The operator continues to reconcile the failed job every 300s, which is unexpected
$ oc get approvedimage  image-latest -o yaml
......
status:
  conditions:
  - lastTransitionTime: "2026-04-02T05:46:58Z"
    message: Computation is ongoing. Check jobs for progress.
    observedGeneration: 1
    reason: Computing  <------------------------------------------- still shows Computing
    status: "False"
    type: Committed

$ oc logs -f confidential-cluster-operator-6c7f547f8-km8p5 --timestamps
...
2026-04-02T06:58:34.627721122Z [INFO  kube_runtime::controller] reconciling object; object.ref=Job.v1.batch/compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo.confidential-clusters object.reason=reconciler requested retry
2026-04-02T06:58:34.627721122Z [INFO  operator::reference_values] Job compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo changed, but had not completed
2026-04-02T06:58:34.627803038Z [INFO  operator] reconciled (ObjectRef { dyntype: (), name: "compute-pcrs-139c43a838-quay-io-trusted-execution-clusters-fedo", namespace: Some("confidential-clusters"), extra: Extra { resource_version: Some("25925947"), uid: Some("494ef462-9837-4002-b3d4-60d783447c25") } }, Action { requeue_after: Some(300s) })

Expected result:

  • Detect Permanent Failures
  • After backoffLimit (6) retries, detect that the job has permanently failed
  • Update ApprovedImage status to reason: Failed
  • Stop retrying the computation
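The expected behavior above could be sketched roughly as follows. This is not the operator's actual code: `JobCondition`, `JobOutcome`, `classify_job`, and `requeue_after` are hypothetical names, and the struct is a simplified stand-in for the entries under the Job's `status.conditions` shown earlier. The 300s interval is taken from the operator log; the condition types `Failed` and `Complete` are the standard Kubernetes Job condition types.

```rust
use std::time::Duration;

/// Simplified, hypothetical stand-in for one entry of a Job's
/// `status.conditions` (only the fields used here).
#[derive(Debug, Clone)]
pub struct JobCondition {
    pub type_: String,
    pub status: String,
    pub reason: Option<String>,
    pub message: Option<String>,
}

/// Hypothetical classification of a compute-pcrs Job.
#[derive(Debug, PartialEq, Eq)]
pub enum JobOutcome {
    /// No terminal condition yet; the Job may still make progress.
    Ongoing,
    /// The Job has a `Complete` condition with status "True".
    Completed,
    /// The Job has a `Failed` condition with status "True",
    /// e.g. reason `BackoffLimitExceeded`. It will not be retried.
    Failed { reason: String, message: String },
}

/// Returns the first condition of the given type whose status is "True".
fn find_true<'a>(conditions: &'a [JobCondition], type_: &str) -> Option<&'a JobCondition> {
    conditions
        .iter()
        .find(|c| c.type_ == type_ && c.status == "True")
}

/// Classifies a Job from its conditions. A terminal `Failed` condition
/// takes precedence so that a permanent failure is never reported as ongoing.
pub fn classify_job(conditions: &[JobCondition]) -> JobOutcome {
    if let Some(c) = find_true(conditions, "Failed") {
        return JobOutcome::Failed {
            reason: c.reason.clone().unwrap_or_default(),
            message: c.message.clone().unwrap_or_default(),
        };
    }
    if find_true(conditions, "Complete").is_some() {
        return JobOutcome::Completed;
    }
    JobOutcome::Ongoing
}

/// Decides whether to requeue. Only an ongoing Job is polled again
/// (300s, as seen in the operator log); terminal outcomes are not requeued.
pub fn requeue_after(outcome: &JobOutcome) -> Option<Duration> {
    match outcome {
        JobOutcome::Ongoing => Some(Duration::from_secs(300)),
        JobOutcome::Completed | JobOutcome::Failed { .. } => None,
    }
}
```

With the conditions from the Job above, `classify_job` would return `JobOutcome::Failed` with reason `BackoffLimitExceeded`, and `requeue_after` would return `None`. The reconciler could then set the ApprovedImage condition to reason: Failed and carry the Job's message into the status instead of reporting Computing.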
