Skip to content

Add finalizer management for application credentials#685

Open
Deydra71 wants to merge 1 commit intoopenstack-k8s-operators:mainfrom
Deydra71:appcred-service-labels
Open

Add finalizer management for application credentials#685
Deydra71 wants to merge 1 commit intoopenstack-k8s-operators:mainfrom
Deydra71:appcred-service-labels

Conversation

@Deydra71
Copy link
Copy Markdown
Contributor

@Deydra71 Deydra71 commented Apr 1, 2026

Jira: OSPRH-28176, OSPRH-27512

  • Delete unused GetApplicationCredentialFromSecret function
  • Add a service label to the AC secret for easy discovery, e.g. application-credential-service: barbican
  • Introduce immutable per-rotation AC secrets with deterministic names
  • Add a new status field previousSecretName used for tracking previously used AC secret
  • Add Keystone-side revocation of unused rotated ACs - currently used secret and previously used secret are protected, pre-previous secret is deleted and revoked
  • Suppress Owns() create events on the secret watch to prevent a race condition caused by stale informer cach and sometimes causing additional AC secret to be created and deleted immediately during app cred rotation
  • Add migration support for old mutable secrets - on the first reconcile after upgrade, the application-credential-service label is added to any existing secret that lacks it, making it visible to the label-based deletion. This ensures old secrets are properly revoked and cleaned up on the next rotation or CR deletion, and prevents orphaned protection finalizers

Each service operator that consumes an AC secret now places a openstack.org/<service>-ac-consumer finalizer on the AC secret it is actively using. This ensures the keystone-operator cannot revoke or clean up secret while a service is still holding a reference to it.

NOTE: This PR doesn't incldue changes to tracking services that have credentials deployed on EDPM, that depends on openstack-k8s-operators/openstack-operator#1781

Tested with openstack-k8s-operators/barbican-operator#356

Assisted-by: Claude Opus 4.6 noreply@anthropic.com

@Deydra71 Deydra71 requested review from fmount, stuggi and vakwetu April 1, 2026 07:19
@openshift-ci openshift-ci bot requested review from abays and afaranha April 1, 2026 07:19
@softwarefactory-project-zuul
Copy link
Copy Markdown

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/61090ad340f246b2a571069f7190fad7

openstack-k8s-operators-content-provider FAILURE in 8m 49s
⚠️ keystone-operator-kuttl SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider (non-voting)
⚠️ keystone-operator-tempest SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider (non-voting)

if k8s_errors.IsNotFound(err) {
// If the ACID is set but the secret is not found, requeue to let the cache sync
if instance.Status.ACID != "" {
return ctrl.Result{RequeueAfter: time.Second * 5}, nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the ACID is set in status but the Secret is not found, the controller assumes an informer cache lag and requeues every 5 seconds. If the Secret was genuinely deleted (e.g. manual kubectl delete secret), this becomes an infinite requeue loop.

Consider adding a bounded retry - for example, an annotation counter or a timestamp check - and after a threshold (e.g. 30 seconds or N retries), fall through to doRotate = true so the controller self-heals instead of spinning indefinitely.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't actually need the requeue anymore - we suppress reconcile trigger on create events, so this scenario can't happen anymore. I will remove the guard and that fixes the infinite loop possibility, which is valid and I replicated manually the problem when I manually deleted the finalziers.

// finalizer should only be removed once all nodes across all NodeSets have
// been redeployed with the new credentials. This depends on per-node secret
// rotation tracking: https://github.com/openstack-k8s-operators/openstack-operator/pull/1781
func hasConsumerFinalizer(secret *corev1.Secret) bool {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The substring check strings.Contains(f, "-ac-consumer") could match unrelated finalizers that happen to contain that substring. A tighter check would reduce false-positive risk:

if strings.HasPrefix(f, "openstack.org/") && strings.HasSuffix(f, "-ac-consumer") {
    return true
}

This still matches all service-operator consumer finalizers (openstack.org/barbican-ac-consumer, openstack.org/cinder-ac-consumer, etc.) while being robust against accidental collisions.


for i := range secretList.Items {
s := &secretList.Items[i]
if protected[s.Name] || hasConsumerFinalizer(s) || !controllerutil.ContainsFinalizer(s, acSecretFinalizer) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The !controllerutil.ContainsFinalizer(s, acSecretFinalizer) guard creates a permanent orphan scenario. If the controller crashes between line 625 (removing the protection finalizer via Update) and line 631 (deleting the Secret), the Secret is left without a finalizer. On the next reconcile, this guard causes the loop to skip it, and it is never cleaned up.

One fix is to remove the ContainsFinalizer pre-condition entirely - if the Secret is not protected by name and has no consumer finalizer, it should be deleted regardless of whether it still carries the protection finalizer:

if protected[s.Name] || hasConsumerFinalizer(s) {
    continue
}

Alternatively, invert the order: issue the Delete first (the API server won't actually remove the object while the finalizer exists), then remove the finalizer. That way, even on crash between the two operations, the object is already marked for deletion and the next reconcile just removes the finalizer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we don't need that additional check. Old mutable secrets with only the protection finalizers should be safe since the protected is checked before hasConsumerFinalizer if service operator would be late to add consumer finalizer to the secret.

I don't think inversing the order is a good way to go, we would need to handle the Terminating state, because deletion would be blocked.

@@ -455,26 +728,25 @@ func (r *ApplicationCredentialReconciler) storeACSecret(

op, err := controllerutil.CreateOrPatch(ctx, helperObj.GetClient(), secret, func() error {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since each rotation produces a unique Secret name (ac-<svc>-<first5>-secret), CreateOrPatch will effectively always Create. However, if the controller crashes after the Create succeeds but before the status is patched, the retry will find the Secret already exists and attempt a Patch - which will try to update .data on an immutable Secret and fail.

Consider replacing this with a plain Create + IsAlreadyExists check:

err := helperObj.GetClient().Create(ctx, secret)
if k8s_errors.IsAlreadyExists(err) {
    logger.Info("Immutable AC secret already exists (likely retry), proceeding", "secret", secretName)
    return secretName, nil
}
if err != nil {
    return "", fmt.Errorf("failed to create immutable AC secret %s: %w", secretName, err)
}

userID string,
) error {
logger := r.GetLogger(ctx)
serviceName := strings.TrimPrefix(instance.Name, "ac-")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This derivation assumes every KeystoneApplicationCredential CR name follows the ac-<service> convention. If a CR is created with a name that doesn't start with ac-, TrimPrefix returns the full name unchanged and the resulting service label / label-based queries silently target the wrong set of Secrets.

Consider either:

  • Validating the naming convention via a webhook at admission time, or
  • Adding an explicit serviceName field to the Spec so the derivation is unnecessary and the contract is explicit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

openstack-operator creates AC CRs with a name convention defined in keystone-operator -https://github.com/openstack-k8s-operators/openstack-operator/blob/main/internal/openstack/applicationcredential.go#L113

It's still possible to manually create the AC CR ( we do that for tests bypassing openstack-op), but we want openstack-operator to be the sole creator of AC CR based on config in controlplane CR.

We could create additional api/v1beta1/keystoneapplicationcredential_webhook.go , let's see what other reviewers comment.

Copy link
Copy Markdown

@mauricioharley mauricioharley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work on this, Veronika. The overall architecture is solid - immutable per-rotation secrets, Keystone-side revocation, and consumer finalizer coordination are the right design choices.

I left five inline comments on areas that could lead to subtle issues in production:

  1. Infinite requeue loop (line 223) - unbounded retry when a Secret is genuinely deleted
  2. Fragile substring match (line 564) - hasConsumerFinalizer could match unrelated finalizers
  3. Orphaned secrets (line 606) - crash between finalizer removal and delete leaves permanently uncleaned secrets
  4. Immutable secret retry failure (line 729) - CreateOrPatch fails on retry against an immutable Secret
  5. Implicit naming convention (line 584) - TrimPrefix silently misbehaves if CR name doesn't follow ac-<service>

Please address these before merging.

Deleted unused `GetApplicationCredentialFromSecret` function and introduce immutable per-rotation AC secrets with deterministic names,
add Keystone-side revocation of unused rotated ACs, and suppress Owns() create events on the secret
watch to prevent a race condition caused by stale informer cach and sometimes causing additional AC secret to be created and deleted immediately during rotation.

Signed-off-by: Veronika Fisarova <vfisarov@redhat.com>
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
@Deydra71 Deydra71 force-pushed the appcred-service-labels branch from 7af98fc to 45357af Compare April 1, 2026 12:49
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Deydra71
Once this PR has been reviewed and has the lgtm label, please ask for approval from mauricioharley. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants