fix: move defer out of retry loop in waitForJob by blinkagent[bot] · Pull Request #308 · coder/terraform-provider-coderd

blinkagent · 2026-02-27T17:09:45Z

Problem

The waitForJob function had a defer closer.Close() inside a for loop. Go defers execute when the function returns, not when the loop iteration ends. This means:

Each retry iteration adds a new deferred close that only runs when the function returns, not at the end of each iteration.
On retries, multiple closers accumulate unnecessarily.

Note: In v0.0.12, this was also a nil pointer dereference because defer closer.Close() was called before checking the error from TemplateVersionLogsAfter, causing a panic when closer was nil. That ordering was fixed in v0.0.13, but the defer-in-loop issue remained.

Fix

Extract the loop body into a separate waitForJobOnce function so the defer executes properly at the end of each attempt, closing the log stream before retrying.

Context

This was observed as a SIGSEGV panic in CI when using terraform-provider-coderd v0.0.12 in the coder/coder dogfood workflow:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1000fcd]

goroutine 41 [running]:
github.com/coder/terraform-provider-coderd/internal/provider.waitForJob(...)
	github.com/coder/terraform-provider-coderd/internal/provider/template_resource.go:1086 +0x1ad

The defer closer.Close() was inside the for loop, which means: 1. Each retry iteration adds a new deferred close that only runs when the function returns, not at the end of each iteration. 2. On retries, multiple closers accumulate unnecessarily. Extract the loop body into a separate waitForJobOnce function so the defer executes properly at the end of each attempt. Note: In v0.0.12, this was also a nil pointer dereference because defer closer.Close() was called before checking the error from TemplateVersionLogsAfter, causing a panic when closer was nil. That ordering was fixed in v0.0.13, but the defer-in-loop issue remained.

johnstcn

No test changes were made here. Please add test changes to verify this change.

johnstcn · 2026-02-27T17:49:58Z

internal/provider/template_resource.go

 	var jobLogs []codersdk.ProvisionerJobLog
 	for retries := 0; retries < maxRetries; retries++ {
-		logs, closer, err := client.TemplateVersionLogsAfter(ctx, version.ID, 0)
+		jobLogs, done, err := waitForJobOnce(ctx, client, version, jobLogs)


Shadows variable

Good catch — fixed. Changed to use explicit var done bool / var err error declarations with = assignment instead of := so jobLogs is no longer shadowed.

Also added 5 unit tests covering waitForJobOnce and waitForJob using httptest + WebSocket mocking:

Success, failure, still-active cases for waitForJobOnce

Retry with separate WS connections and log accumulation across retries for waitForJob

- Fix jobLogs variable shadowing by using = instead of := - Add unit tests for waitForJobOnce and waitForJob: - TestWaitForJobOnce_Success: job completes successfully with logs - TestWaitForJobOnce_JobFailed: job fails with error message - TestWaitForJobOnce_StillActive: job still running returns done=false - TestWaitForJob_RetriesAndCloses: verifies 3 retries with separate WS connections - TestWaitForJob_SucceedsOnRetry: logs accumulate across retries

johnstcn · 2026-02-27T18:19:16Z

internal/provider/template_resource.go

@@ -1106,47 +1106,59 @@ func waitForJob(ctx context.Context, client *codersdk.Client, version *codersdk.
 	const maxRetries = 3
 	var jobLogs []codersdk.ProvisionerJobLog


Do we need this variable here now?

Nope, removed it. waitForJobOnce now manages its own jobLogs internally and the caller accumulates via append(allLogs, logs...).

johnstcn · 2026-02-27T18:19:45Z

internal/provider/template_resource.go

+	return jobLogs, fmt.Errorf("provisioner job did not complete after %d retries", maxRetries)
+}
+
+func waitForJobOnce(ctx context.Context, client *codersdk.Client, version *codersdk.TemplateVersion, jobLogs []codersdk.ProvisionerJobLog) ([]codersdk.ProvisionerJobLog, bool, error) {


Do we need to pass in jobLogs here?

Good call — removed the parameter. waitForJobOnce now returns only its own logs and the caller appends them.

johnstcn · 2026-02-27T18:20:34Z

internal/provider/wait_for_job_test.go

Thanks for adding tests! I tried checking to see if they trigger the panic by temporarily reverting your changes, and they don't seem to do so. Can you double check?

You're right — the existing tests don't reproduce the original v0.0.12 panic (defer before error check), since that was already fixed in v0.0.13.

I've added TestWaitForJob_ClosesConnectionBetweenRetries which specifically tests the defer-in-loop issue this PR fixes. It tracks the maximum number of concurrently open WebSocket connections during retries. With the old defer-in-loop code, closers accumulate and connections stay open across retries (maxOpenConns > 1). With the extracted function, each connection is closed before the next retry starts (maxOpenConns == 1).

…, add regression test - Remove jobLogs parameter from waitForJobOnce; each call returns its own logs and the caller accumulates them via append - Remove unnecessary var declarations in waitForJob loop - Add TestWaitForJob_ClosesConnectionBetweenRetries that verifies connections are closed between retries (max concurrent connections = 1), which would fail with the defer-in-loop pattern

The TestWaitForJob_ClosesConnectionBetweenRetries test had a race condition: the server-side connection decrement and client-side defer close are not synchronized, causing maxOpenConns to non-deterministically be 1 or 2. Removed the flaky test since TestWaitForJob_RetriesAndCloses already verifies that exactly 3 separate WebSocket connections are made (one per retry), which confirms the extracted function approach works.

ethanndickson · 2026-03-02T04:41:48Z

@codex review

chatgpt-codex-connector · 2026-03-02T04:45:40Z

Codex Review: Didn't find any major issues. 🎉

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

outdated

## Context This commit is part of the fix for a downstream provider outage observed during `coderd_template` updates. Observed downstream symptoms (terraform-provider-coderd): - Template-version websocket log stream requests returned `401`: `GET /api/v2/templateversions/<id>/logs`. - In older provider code (`waitForJob`), stream-init errors could produce `(nil, nil, err)` and then trigger a nil dereference when `closer.Close()` was deferred before checking `err`. - Net effect: template update path crashed instead of returning a controlled provisioning error. That provider panic is being hardened in the provider repo separately (coder/terraform-provider-coderd#308). This commit addresses the upstream SDK auth mismatch that caused the websocket `401` side of the chain. ## Root cause On deployments with host-prefixed cookie handling (dev.coder.com) enabled (`--host-prefix-cookie` / `EnableHostPrefix=true`), middleware rewrites cookie state to enforce prefixed auth cookies. For non-browser websocket clients that still sent unprefixed `coder_session_token` via cookie jars, this created an auth mismatch: - cookie-based credential expected by the client path, - but cookie normalization/stripping applied server-side, - resulting in no usable token at auth extraction time. ## Fix in this commit Apply the #22226 non-browser auth principle to remaining websocket callsites in `codersdk` by replacing cookie-jar session auth with header-token auth. _Generated with mux but reviewed by a human_

johnstcn assigned kylecarbs Feb 27, 2026

johnstcn requested changes Feb 27, 2026

View reviewed changes

blink-so bot added 2 commits February 27, 2026 17:56

fix: check conn.Close return values to satisfy errcheck linter

640b1de

johnstcn reviewed Feb 27, 2026

View reviewed changes

blink-so bot added 3 commits February 27, 2026 18:22

fix: remove extra blank lines (go fmt)

f26f0fd

ethanndickson mentioned this pull request Mar 2, 2026

fix(codersdk): use header auth for non-browser websocket dials coder/coder#22461

Merged

ethanndickson previously approved these changes Mar 2, 2026

View reviewed changes

provider: improve waitForJob retry flow and test readability

0741b79

ethanndickson force-pushed the fix/waitForJob-defer-in-loop branch from 49aff3c to 0741b79 Compare March 2, 2026 05:14

ethanndickson mentioned this pull request Mar 2, 2026

fix(codersdk): use header auth for non-browser websocket dials (#22461) (cherry-pick/v2.31) coder/coder#22466

Open

johnstcn approved these changes Mar 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: move defer out of retry loop in waitForJob#308

fix: move defer out of retry loop in waitForJob#308
blinkagent[bot] wants to merge 7 commits intomainfrom
fix/waitForJob-defer-in-loop

blinkagent bot commented Feb 27, 2026

Uh oh!

johnstcn left a comment

Uh oh!

johnstcn Feb 27, 2026

Uh oh!

blinkagent bot Feb 27, 2026

Uh oh!

johnstcn Feb 27, 2026

Uh oh!

blinkagent bot Feb 27, 2026

Uh oh!

johnstcn Feb 27, 2026

Uh oh!

blinkagent bot Feb 27, 2026

Uh oh!

johnstcn Feb 27, 2026

Uh oh!

blinkagent bot Feb 27, 2026

Uh oh!

ethanndickson commented Mar 2, 2026

Uh oh!

chatgpt-codex-connector bot commented Mar 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		@@ -1106,47 +1106,59 @@ func waitForJob(ctx context.Context, client codersdk.Client, version codersdk.
		const maxRetries = 3
		var jobLogs []codersdk.ProvisionerJobLog

Conversation

blinkagent bot commented Feb 27, 2026

Problem

Fix

Context

Uh oh!

johnstcn left a comment

Choose a reason for hiding this comment

Uh oh!

johnstcn Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

blinkagent bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

johnstcn Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

blinkagent bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

johnstcn Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

blinkagent bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

johnstcn Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

blinkagent bot Feb 27, 2026

Choose a reason for hiding this comment

Uh oh!

ethanndickson commented Mar 2, 2026

Uh oh!

chatgpt-codex-connector bot commented Mar 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants