Skip to content

fix: clean up Temporal server-side versioning data on TWD deletion#240

Open
anujagrawal380 wants to merge 5 commits intotemporalio:mainfrom
anujagrawal380:fix/twd-leaves-stale-versioning-data
Open

fix: clean up Temporal server-side versioning data on TWD deletion#240
anujagrawal380 wants to merge 5 commits intotemporalio:mainfrom
anujagrawal380:fix/twd-leaves-stale-versioning-data

Conversation

@anujagrawal380
Copy link

@anujagrawal380 anujagrawal380 commented Mar 24, 2026

  • Add a finalizer to TemporalWorkerDeployment to run Temporal server-side cleanup before K8s deletion
  • Add a finalizer to TemporalConnection to prevent it from being deleted while any TWD still references it
  • On TWD deletion, set current version to unversioned, clear ramping version, and delete registered versions

Problem

When a TemporalWorkerDeployment CRD is deleted (e.g., switching back to plain Deployments), the Temporal server retains the build ID routing configuration. The matching service continues routing new tasks to the deleted build ID's physical queue, while unversioned workers poll a different physical queue. Tasks sit in Scheduled state indefinitely with no errors.

A secondary race condition exists: Helm deletes both the TemporalConnection and TWD in the same upgrade. Without the connection, the controller cannot talk to Temporal to clean up. This is solved by adding a finalizer to the TemporalConnection that blocks its deletion until all referencing TWDs are gone.

Changes

internal/controller/worker_controller.go:

TWD finalizer (temporal.io/worker-deployment-cleanup):

  • Added to all TWD resources during normal reconciliation
  • On deletion, triggers handleDeletion() which:
    1. Sets the current version to unversioned (BuildID: "") -- the critical step that unblocks task dispatch
    2. Clears any ramping version
    3. Deletes all registered versions with SkipDrainage: true
    4. Attempts to delete the deployment record itself
    5. Removes the connection finalizer if no other TWDs reference it
    6. Removes its own finalizer, allowing K8s to complete deletion

TemporalConnection finalizer (temporal.io/connection-in-use):

  • Added to the TemporalConnection during normal TWD reconciliation via ensureConnectionFinalizer()
  • Prevents the connection from being deleted while any TWD still references it
  • Removed by removeConnectionFinalizerIfUnused() during TWD deletion, after checking no other TWDs in the same namespace reference the connection
  • Guarantees the connection is always available during TWD cleanup -- no race condition with Helm deleting both resources simultaneously

RBAC updates:

  • Added update;patch verbs for temporalconnections (was get;list;watch)
  • Added update verb for temporalconnections/finalizers

Deletion flow

Helm upgrade (TWD disabled)
  |
  v
Helm deletes TWD CRD + TemporalConnection CRD simultaneously
  |
  +--> TemporalConnection: has finalizer, K8s sets DeletionTimestamp but blocks deletion
  |
  +--> TWD: has finalizer, K8s sets DeletionTimestamp, triggers Reconcile
         |
         v
       handleDeletion() runs:
         1. Fetches TemporalConnection (guaranteed to exist via finalizer)
         2. Connects to Temporal server
         3. Sets current version to unversioned
         4. Deletes versions
         5. Removes connection finalizer (no other TWDs reference it)
         6. Removes TWD finalizer
         |
         v
       TWD deleted by K8s
         |
         v
       TemporalConnection: no more finalizers, deleted by K8s

Issue #55
Issue #166

…blocking unversioned workers

Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
@anujagrawal380 anujagrawal380 requested review from a team and jlegrone as code owners March 24, 2026 18:18
@CLAassistant
Copy link

CLAassistant commented Mar 24, 2026

CLA assistant check
All committers have signed the CLA.

@anujagrawal380
Copy link
Author

PTAL @carlydf

Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
Copy link

@jaypipes jaypipes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anujagrawal380 awesome contribution, thank you so much for this PR! I couple really minor comments below, but overall excellent work.

Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
@anujagrawal380
Copy link
Author

@anujagrawal380 awesome contribution, thank you so much for this PR! I couple really minor comments below, but overall excellent work.

Thanks, resolved both the comments!

Copy link

@jaypipes jaypipes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rock on :) nice work on this @anujagrawal380!

Copy link
Collaborator

@carlydf carlydf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need integration tests for this before merging to main / including it in a release

Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
@anujagrawal380
Copy link
Author

we need integration tests for this before merging to main / including it in a release

@carlydf Added the integration tests. PTAL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants