- Symptom: `workspace sync failed: Failed to sync workspace to pod: Transfer protocol error: stamp fetch failed: SSH connection failed: TCP connection failed: failed to lookup address information: nodename nor servname provided, or not known`
- Cause: When the relay connects but the SSH upgrade hasn't happened yet, the sync tries to resolve a hostname that isn't available. Seems to happen when the relay connection is slow or when the pod provisions before SSH is ready.
- Workaround: Retry — sometimes works on the second or third attempt after a daemon restart
- Severity: Intermittent blocking — ~50% of provisioning attempts fail
- Fields in reference docs that don't match the actual schema:
  - `hooks.readiness.command` (string) → actual schema requires `run` (array of strings)
  - `inputs` requires `key` + `label` fields, not `name`
  - `inputs.options` requires objects with `label` + `value`, not plain strings
- Reference file: `/references/config.md` shows simplified examples that don't pass schema validation
- Severity: Medium — causes config validation errors on first attempt
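A minimal `gpu.jsonc` fragment using the field shapes the schema actually accepts, per the mismatches above. Only the `run` array, `key`/`label`, and `label`/`value` shapes come from these notes; the surrounding structure and values are illustrative assumptions.

```jsonc
{
  "hooks": {
    "readiness": {
      // `run` must be an array of strings, not a `command` string
      "run": ["curl", "-sf", "http://localhost:8000/health"]
    }
  },
  "inputs": [
    {
      "key": "model",   // `key`, not `name`
      "label": "Model",
      "options": [
        // options are objects with `label` + `value`, not plain strings
        { "label": "Small", "value": "small" },
        { "label": "Large", "value": "large" }
      ]
    }
  ]
}
```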
- Symptom: `bash: line 1: uv: command not found` during the pip install phase, causing all dependency installation to fail
- Cause: GPU CLI tries to use `uv` as the package manager, but some pod images don't have `uv` installed
- Note: On later attempts, the CLI auto-installed `uv` ("Installing uv package manager... downloading uv 0.10.4"). Inconsistent behavior — sometimes it auto-installs, sometimes it doesn't.
- Workaround: Remove `environment.python` entirely and use `environment.shell.steps` with `{ "run": "pip install -r requirements.txt" }` instead.
- Severity: Intermittent blocking
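A sketch of that workaround as a full config fragment: drop `environment.python` and run pip through a shell step instead. Only `environment.shell.steps` and the `run` string come from the note above; the rest of the layout is assumed.

```jsonc
{
  "environment": {
    "shell": {
      "steps": [
        { "run": "pip install -r requirements.txt" }
      ]
    }
  }
}
```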
- Symptom: `E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 748 (apt-get)`
- Cause: When a pod is reused from a previous run, an earlier apt-get process may still be holding the dpkg lock
- Workaround: No reliable one — the apt package install just fails silently
- Severity: Medium — apt packages may not install on reused pods
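One possible mitigation sketch, untested against this CLI: make the first shell step wait for any leftover apt-get to release the dpkg lock before installing. The `fuser` wait loop is a standard Debian/Ubuntu idiom; `<package>` is a placeholder, and the step shape follows the `environment.shell.steps` form shown elsewhere in these notes.

```jsonc
{
  "environment": {
    "shell": {
      "steps": [
        // Poll until no process holds the dpkg lock, then install
        { "run": "while fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do sleep 2; done && apt-get install -y <package>" }
      ]
    }
  }
}
```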
- Symptom: `workspace sync failed: Failed to sync workspace to pod: Remote helper bootstrap failed: No SSH metadata for pod after 10s`
- Cause: The SSH upgrade from relay to direct SSH fails or is too slow, leaving the sync without SSH metadata
- Related to: Bug #1 (DNS resolution failure) — both are sync-phase connectivity issues
- Workaround: Retry after daemon restart
- Severity: Intermittent blocking
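Since both sync failures above are worked around by retrying, a small POSIX retry helper can wrap the flaky command. The helper itself is generic; wrapping `gpu run` with it (e.g. `retry 3 gpu run`) is the assumed usage.

```shell
# retry N CMD ARGS... — run CMD up to N times, pausing between attempts;
# returns 0 on the first success, 1 if all attempts fail.
retry() {
  n=$1; shift
  i=1
  while [ "$i" -le "$n" ]; do
    "$@" && return 0
    echo "attempt $i failed; retrying..." >&2
    i=$((i + 1))
    sleep 2
  done
  return 1
}
```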
- Symptom: Even with no `environment` section in gpu.jsonc, GPU CLI auto-detects `requirements.txt` and runs `uv pip install` during the install phase
- Impact: This is actually helpful but undocumented. However, it makes the `environment.python` config redundant and confusing.
- Note: The auto-install uses `uv` (auto-downloaded) and caches packages on the global volume — subsequent runs show `Python dependencies already installed (hash: ...)`
- Severity: Not a bug per se, but confusing behavior that conflicts with the docs
- Symptom: `RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error: partially initialized module 'torchvision' has no attribute 'extension' (most likely due to a circular import)`
- Cause: Pod images come with pre-installed PyTorch (e.g. `torch==2.4.1+cu124`, `torchvision==0.19.1+cu124`). When `uv pip install` upgrades torch/torchaudio from requirements.txt (to `2.6.0`), the old torchvision (`0.19.1`) becomes incompatible. The `transformers` library triggers the circular import when loading models.
- Workaround: Add `torchvision` explicitly to `requirements.txt` so it gets upgraded alongside torch/torchaudio to a compatible version
- Severity: Blocking — any package that imports `transformers` models will fail on first deploy
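A requirements.txt sketch of that workaround: pin the stack together so torchvision moves in lockstep with torch. The `2.6.0` version comes from the notes above; torchvision `0.21.0` is the release believed to pair with torch `2.6.0`, so verify against the official PyTorch compatibility matrix before relying on it.

```text
torch==2.6.0
torchvision==0.21.0
torchaudio==2.6.0
```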
- Symptom: `gpu run` shows pip download and uninstall progress but stops streaming output before showing the "Installed X packages" line. The health check loop runs blind while pip finishes installing in the background.
- Cause: The relay connection or log streaming truncates long pip install output. The install actually completes (visible in `gpu logs`), but `gpu run` doesn't stream it.
- Severity: Low — cosmetic, but makes debugging difficult during deploys
- Symptom: After editing files locally (e.g. `tts_server.py`, `gpu.jsonc`), `gpu run` syncs OLD versions from the pod back to the local machine, reverting all local changes
- Cause: GPU CLI performs a bidirectional sync — it syncs the workspace TO the pod before running, but also syncs FROM the pod back to local after the run (or when the relay dies). The `outputs` field in gpu.jsonc (`["output/"]`) should limit what syncs back, but the entire workspace appears to sync bidirectionally.
- Impact: Any local edits made between `gpu run` calls get silently overwritten. This is especially destructive when iterating on server code — you fix a bug, the sync reverts the fix.
- Workaround: Stop the pod before editing files, or re-apply edits after each `gpu run`
- Severity: Blocking — makes iterative development extremely difficult
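A defensive sketch of the re-apply workaround: commit tracked files before each run so that stale versions synced back from the pod can be restored afterwards. `gpu run` and the `output/` directory come from the notes above; the git dance itself is generic and assumes the workspace is a git repo.

```shell
# snapshot: commit everything so local edits survive a reverse sync
snapshot() { git add -A && git commit -qm "pre-run snapshot" --allow-empty; }

# restore_reverted: put every tracked file back to the snapshot, but leave
# run artifacts under output/ alone (run from the repo root)
restore_reverted() { git checkout HEAD -- . ':(exclude)output'; }

# usage: snapshot; gpu run; restore_reverted
```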