fix(gastown): Hourly token refresh wakes sleeping containers on idle towns — 50% duty cycle #1975

@jrf0110

Bug

Idle town containers wake up every hour despite having zero work. refreshContainerToken() fires once per hour for any town with rigs, regardless of active work. Each call does container.setEnvVar() + container.fetch('POST /refresh-token'), which wakes the container and restarts the sleepAfter: 30m timer. The result is a 50% duty cycle: the container wakes at T+0, sleeps at T+30m, stays asleep until T+60m (the next refresh), and repeats indefinitely.

With ~1000 towns, most of which are idle, this means hundreds of containers cycling awake→sleep→awake instead of staying asleep.
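The 50% figure follows directly from the two timers. Below is a minimal sketch of the arithmetic, assuming the refresh itself takes negligible time and nothing else touches the container. `dutyCycle` and `awakeHoursPerDay` are illustrative helpers, not functions from the codebase, and the count of 500 idle towns is only an example value.

```typescript
// Fraction of time a container is awake when something touches it every
// `wakeIntervalMin` minutes and it sleeps after `sleepAfterMin` idle minutes.
function dutyCycle(wakeIntervalMin: number, sleepAfterMin: number): number {
  if (wakeIntervalMin <= 0) {
    throw new RangeError("wakeIntervalMin must be positive");
  }
  // If the idle timer is at least as long as the interval, the container never sleeps.
  const awakeMin = Math.min(Math.max(sleepAfterMin, 0), wakeIntervalMin);
  return awakeMin / wakeIntervalMin;
}

// Total awake container-hours per day across a fleet of idle towns.
function awakeHoursPerDay(idleTowns: number, cycle: number): number {
  return idleTowns * 24 * cycle;
}

const cycle = dutyCycle(60, 30); // hourly refresh, sleepAfter: 30m
console.log(cycle); // 0.5
console.log(awakeHoursPerDay(500, cycle)); // 6000
```

Under the same assumptions, a longer sleepAfter would make the cycle worse, not better: with sleepAfter at 60m or more, an hourly refresh would keep the container awake permanently.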

Root Cause

Town.do.ts:3338 calls refreshContainerToken() gated on hasRigs (not hasActiveWork):

// Line 3334 comment:
// "Gated on hasRigs (not hasActiveWork) because the container may still
// be running with an idle mayor accepting user messages."

The comment explains the intent: keep the token fresh for the mayor. But on an idle town the mayor's container is already asleep, so refreshing its token only wakes it up for no reason.

The Only Idle-Town Container Wake

An exhaustive audit of every container-touching code path in the alarm loop confirms refreshContainerToken() is the only path that wakes a sleeping container on an idle town:

| Code path | Fires on idle town? |
| --- | --- |
| ensureContainerReady() → GET /health | NO: gated on hasActiveWork |
| refreshContainerToken() → setEnvVar + POST /refresh-token | YES: gated only on hasRigs |
| refreshKilocodeTokenIfExpiring() → syncConfigToContainer() | RARELY: only near token expiry |
| checkAgentContainerStatus() → GET /agents/:id/status | NO: zero working agents |
| All reconciler side effects | NO: zero open beads |
| deliverPendingMail() | NO: zero working agents |
| maybeDispatchTriageAgent() | NO: zero triage requests |

Fix

Option A (Recommended): Gate token refresh on active work OR container awake

Don't refresh the token if the container is sleeping. The token will be refreshed on the next wake (when actual work is dispatched, a user opens the town UI, etc.):

private async refreshContainerToken(): Promise<void> {
  // Skip if no active work — the container is sleeping and doesn't need a fresh token.
  // The token will be refreshed when work is next dispatched (ensureContainerToken
  // is called at dispatch time).
  if (!scheduling.hasActiveWork(this.sql)) return;
  
  // ... existing throttle + refresh logic
}

This is safe because ensureContainerToken() is ALREADY called at the start of startAgentInContainer() (container-dispatch.ts:322). When a container wakes for real work, the token is refreshed before any agent is started. There's no window where the container is awake with an expired token.
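The effect of swapping the gate can be checked with a toy model of the hourly alarm tick. This is only a sketch: `TownState`, `countRefreshWakes`, and the two gate functions are hypothetical stand-ins for the real scheduling checks, and it assumes exactly one refresh attempt per hour.

```typescript
// State of a town as seen by one hourly alarm tick.
type TownState = { hasRigs: boolean; hasActiveWork: boolean };

// Counts how many hourly ticks would refresh the token, and so wake the container.
function countRefreshWakes(
  ticks: TownState[],
  gate: (s: TownState) => boolean,
): number {
  return ticks.filter(gate).length;
}

// Current behaviour: refresh whenever the town has rigs.
const gateOnRigs = (s: TownState): boolean => s.hasRigs;
// Option A: additionally require active work.
const gateOnActiveWork = (s: TownState): boolean => s.hasRigs && s.hasActiveWork;

// 24 hourly ticks for a town that has rigs but no work at all.
const idleDay: TownState[] = Array.from({ length: 24 }, () => ({
  hasRigs: true,
  hasActiveWork: false,
}));

console.log(countRefreshWakes(idleDay, gateOnRigs)); // 24
console.log(countRefreshWakes(idleDay, gateOnActiveWork)); // 0
```

For a town that is busy all day, both gates return the same count, which matches the expectation that active towns see no change.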

Option B: Check container status before refreshing

private async refreshContainerToken(): Promise<void> {
  // ... existing throttle check
  
  // Don't wake a sleeping container just to refresh its token
  const container = getTownContainerDOStub(this.env, this.townId);
  const status = await container.getStatus(); // must NOT wake the container for this to work (see caveat below)
  if (status === 'sleeping') return;
  
  // ... existing refresh logic
}

However, container.getStatus() may itself wake the Container (any fetch to a Container DO wakes it). This option only works if Cloudflare provides a way to check Container sleep status without waking it. Option A is simpler and doesn't require this.

Option C: Refresh token lazily at dispatch time only

Remove refreshContainerToken() from the alarm loop entirely. Rely solely on ensureContainerToken() at dispatch time (already called in startAgentInContainer()). Add it to ensureContainerReady() as well, so the token is refreshed whenever the container is confirmed alive.

This means the token could expire while the container is sleeping, but that's fine — the container isn't doing anything. On wake, ensureContainerToken() mints a fresh one.
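The lazy approach can be sketched as a check-then-mint helper. This is not the actual ensureContainerToken() implementation: `ContainerToken`, `ensureToken`, the 5-minute safety margin, and the 8-hour lifetime are all assumptions made for illustration.

```typescript
interface ContainerToken {
  value: string;
  expiresAtMs: number;
}

// Hypothetical stand-in for ensureContainerToken(): reuse the cached token if it
// still has enough lifetime left, otherwise mint a new one.
function ensureToken(
  cached: ContainerToken | undefined,
  nowMs: number,
  mint: (nowMs: number) => ContainerToken,
  minRemainingMs: number = 5 * 60_000, // assumed safety margin
): ContainerToken {
  if (cached && cached.expiresAtMs - nowMs > minRemainingMs) return cached;
  return mint(nowMs);
}

const TTL_MS = 8 * 60 * 60_000; // illustrative lifetime, not taken from the codebase
let minted = 0;
const mintToken = (nowMs: number): ContainerToken => ({
  value: `token-${++minted}`,
  expiresAtMs: nowMs + TTL_MS,
});

const first = ensureToken(undefined, 0, mintToken); // nothing cached: mint
const reused = ensureToken(first, 60 * 60_000, mintToken); // one hour later: still valid
const afterSleep = ensureToken(first, TTL_MS * 2, mintToken); // long sleep: expired, mint again
console.log(first.value, reused.value, afterSleep.value); // token-1 token-1 token-2
```

In this model the token that expired during sleep is never used: the first dispatch after wake replaces it before any agent starts.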

Impact

With Option A deployed:

  • Idle towns: container stays asleep permanently (0% duty cycle instead of 50%)
  • Active towns: no change (token refreshed via dispatch path)
  • ~500+ idle town containers stop cycling, reducing Cloudflare Container billing significantly

Files

  • src/dos/Town.do.ts:3338: refreshContainerToken() call site (add hasActiveWork guard)
  • src/dos/Town.do.ts:3660: refreshContainerToken() method

Labels: P0 (Blocks soft launch), bug (Something isn't working), gt:container (Container management, agent processes, SDK, heartbeat), gt:core (Reconciler, state machine, bead lifecycle, convoy flow)
