Skip to content

fix: move defer out of retry loop in waitForJob#308

Open
blinkagent[bot] wants to merge 7 commits intomainfrom
fix/waitForJob-defer-in-loop
Open

fix: move defer out of retry loop in waitForJob#308
blinkagent[bot] wants to merge 7 commits intomainfrom
fix/waitForJob-defer-in-loop

Conversation

@blinkagent
Copy link
Contributor

@blinkagent blinkagent bot commented Feb 27, 2026

Problem

The waitForJob function had a defer closer.Close() inside a for loop. Go defers execute when the function returns, not when the loop iteration ends. This means:

  1. Each retry iteration adds a new deferred close that only runs when the function returns, not at the end of each iteration.
  2. On retries, multiple closers accumulate unnecessarily.

Note: In v0.0.12, this was also a nil pointer dereference because defer closer.Close() was called before checking the error from TemplateVersionLogsAfter, causing a panic when closer was nil. That ordering was fixed in v0.0.13, but the defer-in-loop issue remained.

Fix

Extract the loop body into a separate waitForJobOnce function so the defer executes properly at the end of each attempt, closing the log stream before retrying.

Context

This was observed as a SIGSEGV panic in CI when using terraform-provider-coderd v0.0.12 in the coder/coder dogfood workflow:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1000fcd]

goroutine 41 [running]:
github.com/coder/terraform-provider-coderd/internal/provider.waitForJob(...)
	github.com/coder/terraform-provider-coderd/internal/provider/template_resource.go:1086 +0x1ad

The defer closer.Close() was inside the for loop, which means:
1. Each retry iteration adds a new deferred close that only runs
   when the function returns, not at the end of each iteration.
2. On retries, multiple closers accumulate unnecessarily.

Extract the loop body into a separate waitForJobOnce function so
the defer executes properly at the end of each attempt.

Note: In v0.0.12, this was also a nil pointer dereference because
defer closer.Close() was called before checking the error from
TemplateVersionLogsAfter, causing a panic when closer was nil.
That ordering was fixed in v0.0.13, but the defer-in-loop issue
remained.
Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No test changes were made here. Please add test changes to verify this change.

var jobLogs []codersdk.ProvisionerJobLog
for retries := 0; retries < maxRetries; retries++ {
logs, closer, err := client.TemplateVersionLogsAfter(ctx, version.ID, 0)
jobLogs, done, err := waitForJobOnce(ctx, client, version, jobLogs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shadows variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed. Changed to use explicit var done bool / var err error declarations with = assignment instead of := so jobLogs is no longer shadowed.

Also added 5 unit tests covering waitForJobOnce and waitForJob using httptest + WebSocket mocking:

  • Success, failure, still-active cases for waitForJobOnce
  • Retry with separate WS connections and log accumulation across retries for waitForJob

blink-so bot added 2 commits February 27, 2026 17:56
- Fix jobLogs variable shadowing by using = instead of :=
- Add unit tests for waitForJobOnce and waitForJob:
  - TestWaitForJobOnce_Success: job completes successfully with logs
  - TestWaitForJobOnce_JobFailed: job fails with error message
  - TestWaitForJobOnce_StillActive: job still running returns done=false
  - TestWaitForJob_RetriesAndCloses: verifies 3 retries with separate WS connections
  - TestWaitForJob_SucceedsOnRetry: logs accumulate across retries
@@ -1106,47 +1106,59 @@ func waitForJob(ctx context.Context, client *codersdk.Client, version *codersdk.
const maxRetries = 3
var jobLogs []codersdk.ProvisionerJobLog
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this variable here now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, removed it. waitForJobOnce now manages its own jobLogs internally and the caller accumulates via append(allLogs, logs...).

return jobLogs, fmt.Errorf("provisioner job did not complete after %d retries", maxRetries)
}

func waitForJobOnce(ctx context.Context, client *codersdk.Client, version *codersdk.TemplateVersion, jobLogs []codersdk.ProvisionerJobLog) ([]codersdk.ProvisionerJobLog, bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to pass in jobLogs here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — removed the parameter. waitForJobOnce now returns only its own logs and the caller appends them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding tests! I tried checking to see if they trigger the panic by temporarily reverting your changes, and they don't seem to do so. Can you double check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — the existing tests don't reproduce the original v0.0.12 panic (defer before error check), since that was already fixed in v0.0.13.

I've added TestWaitForJob_ClosesConnectionBetweenRetries which specifically tests the defer-in-loop issue this PR fixes. It tracks the maximum number of concurrently open WebSocket connections during retries. With the old defer-in-loop code, closers accumulate and connections stay open across retries (maxOpenConns > 1). With the extracted function, each connection is closed before the next retry starts (maxOpenConns == 1).

blink-so bot added 3 commits February 27, 2026 18:22
…, add regression test

- Remove jobLogs parameter from waitForJobOnce; each call returns its own
  logs and the caller accumulates them via append
- Remove unnecessary var declarations in waitForJob loop
- Add TestWaitForJob_ClosesConnectionBetweenRetries that verifies
  connections are closed between retries (max concurrent connections = 1),
  which would fail with the defer-in-loop pattern
The TestWaitForJob_ClosesConnectionBetweenRetries test had a race
condition: the server-side connection decrement and client-side
defer close are not synchronized, causing maxOpenConns to
non-deterministically be 1 or 2.

Removed the flaky test since TestWaitForJob_RetriesAndCloses already
verifies that exactly 3 separate WebSocket connections are made
(one per retry), which confirms the extracted function approach works.
@ethanndickson
Copy link
Member

@codex review

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. 🎉

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ethanndickson
ethanndickson previously approved these changes Mar 2, 2026
@ethanndickson ethanndickson force-pushed the fix/waitForJob-defer-in-loop branch from 49aff3c to 0741b79 Compare March 2, 2026 05:14
ethanndickson added a commit to coder/coder that referenced this pull request Mar 2, 2026
## Context
This commit is part of the fix for a downstream provider outage observed
during
`coderd_template` updates.

Observed downstream symptoms (terraform-provider-coderd):
- Template-version websocket log stream requests returned `401`:
  `GET /api/v2/templateversions/<id>/logs`.
- In older provider code (`waitForJob`), stream-init errors could
produce
`(nil, nil, err)` and then trigger a nil dereference when
`closer.Close()`
  was deferred before checking `err`.
- Net effect: template update path crashed instead of returning a
controlled
  provisioning error.

That provider panic is being hardened in the provider repo separately
(coder/terraform-provider-coderd#308). This
commit addresses the upstream SDK auth mismatch that caused the
websocket `401`
side of the chain.

## Root cause

On deployments with host-prefixed cookie handling (dev.coder.com)
enabled
(`--host-prefix-cookie` / `EnableHostPrefix=true`), middleware rewrites
cookie
state to enforce prefixed auth cookies.

For non-browser websocket clients that still sent unprefixed
`coder_session_token` via cookie jars, this created an auth mismatch:
- cookie-based credential expected by the client path,
- but cookie normalization/stripping applied server-side,
- resulting in no usable token at auth extraction time.

## Fix in this commit

Apply the #22226 non-browser auth principle to remaining websocket
callsites in
`codersdk` by replacing cookie-jar session auth with header-token auth.

_Generated with mux but reviewed by a human_
ethanndickson added a commit to coder/coder that referenced this pull request Mar 2, 2026
## Context
This commit is part of the fix for a downstream provider outage observed
during
`coderd_template` updates.

Observed downstream symptoms (terraform-provider-coderd):
- Template-version websocket log stream requests returned `401`:
  `GET /api/v2/templateversions/<id>/logs`.
- In older provider code (`waitForJob`), stream-init errors could
produce
`(nil, nil, err)` and then trigger a nil dereference when
`closer.Close()`
  was deferred before checking `err`.
- Net effect: template update path crashed instead of returning a
controlled
  provisioning error.

That provider panic is being hardened in the provider repo separately
(coder/terraform-provider-coderd#308). This
commit addresses the upstream SDK auth mismatch that caused the
websocket `401`
side of the chain.

## Root cause

On deployments with host-prefixed cookie handling (dev.coder.com)
enabled
(`--host-prefix-cookie` / `EnableHostPrefix=true`), middleware rewrites
cookie
state to enforce prefixed auth cookies.

For non-browser websocket clients that still sent unprefixed
`coder_session_token` via cookie jars, this created an auth mismatch:
- cookie-based credential expected by the client path,
- but cookie normalization/stripping applied server-side,
- resulting in no usable token at auth extraction time.

## Fix in this commit

Apply the #22226 non-browser auth principle to remaining websocket
callsites in
`codersdk` by replacing cookie-jar session auth with header-token auth.

_Generated with mux but reviewed by a human_
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants