fix(gastown): Hourly token refresh wakes sleeping containers on idle towns — 50% duty cycle #1975
Description
Bug
Idle town containers wake up every hour despite having zero work. The refreshContainerToken() function fires once per hour for ANY town with rigs, regardless of active work. Each call does container.setEnvVar() + container.fetch('POST /refresh-token'), resetting the sleepAfter: 30m timer. This creates a 50% duty cycle: container wakes at T+0, sleeps at T+30m, stays asleep until T+60m (next refresh), repeating indefinitely.
With ~1000 towns, most of which are idle, this means hundreds of containers cycling awake→sleep→awake instead of staying asleep.
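The 50% figure follows directly from the two timers. As a rough sketch of the arithmetic, assuming the refresh is the only thing touching the container and that the container stays awake for exactly sleepAfter after each refresh (the helper name is illustrative and not part of the codebase):

```typescript
// Illustrative helper, not part of the codebase: fraction of time a container
// spends awake when a periodic refresh is the only thing that touches it.
function idleDutyCycle(refreshIntervalMin: number, sleepAfterMin: number): number {
  if (refreshIntervalMin <= 0) {
    throw new Error("refreshIntervalMin must be positive");
  }
  // Each refresh resets the sleep timer, so the container is awake for
  // sleepAfterMin minutes per interval, capped at the full interval.
  const awakeMin = Math.min(Math.max(sleepAfterMin, 0), refreshIntervalMin);
  return awakeMin / refreshIntervalMin;
}

// Values from this issue: hourly refresh, sleepAfter of 30 minutes.
const dutyCycle = idleDutyCycle(60, 30); // 0.5
```

Under the same assumptions, any sleepAfter at or above the refresh interval would keep the container awake continuously.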
Root Cause
Town.do.ts:3338 calls refreshContainerToken() gated on hasRigs (not hasActiveWork):
// Line 3334 comment:
// "Gated on hasRigs (not hasActiveWork) because the container may still
// be running with an idle mayor accepting user messages."
The comment explains the intent: keep the token fresh for the mayor. But on an idle town the mayor's container is already sleeping, so refreshing its token while it's asleep wakes it up for no reason.
The Only Idle-Town Container Wake
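An exhaustive audit of every container-touching code path in the alarm loop confirms refreshContainerToken() is the only path that regularly wakes a sleeping container on an idle town (refreshKilocodeTokenIfExpiring() can also touch the container, but only near token expiry):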
| Code path | Fires on idle town? |
|---|---|
| ensureContainerReady() → GET /health | NO (gated on hasActiveWork) |
| refreshContainerToken() → setEnvVar + POST /refresh-token | YES (gated only on hasRigs) |
| refreshKilocodeTokenIfExpiring() → syncConfigToContainer() | RARELY (only near token expiry) |
| checkAgentContainerStatus() → GET /agents/:id/status | NO (zero working agents) |
| All reconciler side effects | NO (zero open beads) |
| deliverPendingMail() | NO (zero working agents) |
| maybeDispatchTriageAgent() | NO (zero triage requests) |
Fix
Option A (Recommended): Gate token refresh on active work OR container awake
Don't refresh the token if the container is sleeping. The token will be refreshed on the next wake (when actual work is dispatched, a user opens the town UI, etc.):
private async refreshContainerToken(): Promise<void> {
  // Skip if no active work: the container is sleeping and doesn't need a fresh token.
  // The token will be refreshed when work is next dispatched (ensureContainerToken
  // is called at dispatch time).
  if (!scheduling.hasActiveWork(this.sql)) return;
  // ... existing throttle + refresh logic
}
This is safe because ensureContainerToken() is ALREADY called at the start of startAgentInContainer() (container-dispatch.ts:322). When a container wakes for real work, the token is refreshed before any agent is started. There's no window where the container is awake with an expired token.
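To make the ordering of the two checks explicit, the decision can be pulled out into a pure function. This is only a sketch: the input shape, the field names, and the function name are assumptions for illustration, and the real method reads hasActiveWork from SQL and the throttle timestamp from ctx.storage.

```typescript
// Hypothetical pure helper; names and shape are assumptions for illustration,
// not the actual Town.do.ts code.
interface RefreshDecisionInput {
  hasActiveWork: boolean;         // e.g. the result of scheduling.hasActiveWork(sql)
  lastRefreshAtMs: number | null; // persisted throttle timestamp, null if never refreshed
  nowMs: number;
  refreshIntervalMs: number;      // one hour in the behavior described in this issue
}

function shouldRefreshContainerToken(input: RefreshDecisionInput): boolean {
  // Option A guard comes first: an idle town never triggers a refresh,
  // so the alarm loop never touches (and never wakes) its container.
  if (!input.hasActiveWork) return false;
  // Existing throttle: refresh at most once per interval.
  if (input.lastRefreshAtMs === null) return true;
  return input.nowMs - input.lastRefreshAtMs >= input.refreshIntervalMs;
}
```

Putting the guard before the throttle means an idle town returns before any container call is made, regardless of how stale the last refresh is.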
Option B: Check container status before refreshing
private async refreshContainerToken(): Promise<void> {
  // ... existing throttle check
  // Don't wake a sleeping container just to refresh its token
  const container = getTownContainerDOStub(this.env, this.townId);
  // Assumes a lightweight status call that does NOT wake the container (see caveat below)
  const status = await container.getStatus();
  if (status === 'sleeping') return;
  // ... existing refresh logic
}
However, container.getStatus() may itself wake the Container (any fetch to a Container DO wakes it). This option only works if Cloudflare provides a way to check Container sleep status without waking it. Option A is simpler and doesn't require this.
Option C: Refresh token lazily at dispatch time only
Remove refreshContainerToken() from the alarm loop entirely. Rely solely on ensureContainerToken() at dispatch time (already called in startAgentInContainer()). Add it to ensureContainerReady() as well, so the token is refreshed whenever the container is confirmed alive.
This means the token could expire while the container is sleeping, but that's fine — the container isn't doing anything. On wake, ensureContainerToken() mints a fresh one.
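The lazy approach can be sketched as a small helper that either reuses the current token or mints a new one at dispatch time. The token shape, the minimum-remaining-lifetime check, and the function name below are assumptions for illustration; the issue does not describe how ensureContainerToken() decides internally.

```typescript
// Hypothetical token shape and helper, for illustration only.
interface ContainerToken {
  value: string;
  expiresAtMs: number;
}

function ensureFreshToken(
  current: ContainerToken | null,
  nowMs: number,
  minRemainingMs: number,
  mint: (nowMs: number) => ContainerToken,
): ContainerToken {
  // Reuse the token only if it still has enough lifetime left.
  if (current !== null && current.expiresAtMs - nowMs > minRemainingMs) {
    return current;
  }
  // Otherwise mint a new one now, at the moment work is dispatched.
  // A token that expired while the container slept is simply replaced here.
  return mint(nowMs);
}
```

With this shape, nothing runs on a timer: minting happens only on the code path that is about to use the container anyway.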
Impact
With Option A deployed:
- Idle towns: container stays asleep permanently (0% duty cycle instead of 50%)
- Active towns: no change (the hourly refresh still runs while there is active work, and ensureContainerToken() still runs at dispatch)
- ~500+ idle town containers stop cycling, reducing Cloudflare Container billing significantly
Files
- src/dos/Town.do.ts:3338 - refreshContainerToken() call site (add hasActiveWork guard)
- src/dos/Town.do.ts:3660 - refreshContainerToken() method
Related
- fix(gastown): Town containers never go idle — mayor holds alarm at 5s, constant health checks reset sleep timer #1450 — Town containers never idle (the mayor alarm issue — already fixed in PR feat(gastown): staging batch — circuit breaker, drain/eviction, token refresh, Cloudflare links, fullscreen #1828)
- fix(gastown): Container token refresh fires every minute on idle towns due to in-memory throttle #1409 — Token refresh in-memory throttle (already fixed — throttle persisted in ctx.storage)
- perf(gastown): TownDO runs 47+ SQL queries per alarm tick — deduplicate and guard wasteful queries #1855 — Alarm query deduplication (already fixed)