GPU TTS - Remaining Bugs & Issues

GPU CLI Issues

1. Workspace sync DNS resolution failure (intermittent)

  • Symptom: workspace sync failed: Failed to sync workspace to pod: Transfer protocol error: stamp fetch failed: SSH connection failed: TCP connection failed: failed to lookup address information: nodename nor servname provided, or not known
  • Cause: When the relay connects but the SSH upgrade hasn't completed yet, the sync tries to resolve a hostname that isn't yet available. This seems to happen when the relay connection is slow or when the pod finishes provisioning before SSH is ready.
  • Workaround: Retry — sometimes works on second/third attempt after daemon restart
  • Severity: Intermittent blocking — ~50% of provisioning attempts fail

2. gpu.jsonc schema documentation mismatch

  • Fields in reference docs that don't match actual schema:
    • hooks.readiness.command (string) → actual schema requires run (array of strings)
    • inputs requires key + label fields, not name
    • inputs.options requires objects with label + value, not plain strings
  • Reference file: /references/config.md shows simplified examples that don't pass schema validation
  • Severity: Medium — causes config validation errors on first attempt
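Putting the three corrections above together, a gpu.jsonc fragment that should pass schema validation might look like this. Only the field names and shapes listed above come from these notes; the surrounding structure, the health-check command, and the example input are illustrative guesses (it's also unverified whether each run entry is a full command line or a single argv token):

```jsonc
{
  "hooks": {
    "readiness": {
      // "run" must be an array of strings, not a "command" string
      "run": ["curl -sf http://localhost:8000/health"]
    }
  },
  "inputs": [
    {
      // "key" + "label", not "name"
      "key": "voice",
      "label": "Voice preset",
      // options are { label, value } objects, not plain strings
      "options": [
        { "label": "Default", "value": "default" }
      ]
    }
  ]
}
```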

3. uv not installed on pod images (intermittent)

  • Symptom: bash: line 1: uv: command not found during pip install phase, causing all dependency installation to fail
  • Cause: GPU CLI tries to use uv as the package manager but some pod images don't have uv installed
  • Note: On later attempts, the CLI auto-installed uv ("Installing uv package manager... downloading uv 0.10.4"). Inconsistent behavior — sometimes it auto-installs, sometimes it doesn't.
  • Workaround: Remove environment.python entirely and use environment.shell.steps with { "run": "pip install -r requirements.txt" } instead.
  • Severity: Intermittent blocking
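The workaround above as a gpu.jsonc fragment; only the shell.steps step shape is taken from these notes, the surrounding nesting is assumed:

```jsonc
{
  // environment.python removed entirely — install deps via shell steps instead
  "environment": {
    "shell": {
      "steps": [
        { "run": "pip install -r requirements.txt" }
      ]
    }
  }
}
```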

4. apt-get lock contention on reused pods

  • Symptom: E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 748 (apt-get)
  • Cause: When a pod is reused from a previous run, a previous apt-get process may still be holding the dpkg lock
  • Workaround: No reliable one — the apt package install simply fails silently on reused pods
  • Severity: Medium — apt packages may not install on reused pods
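One possible mitigation, run from a shell step rather than anything the GPU CLI provides: wait for the stale apt-get process to release the lock before installing. A minimal sketch — the lock path is the standard Debian/Ubuntu one, and fuser (from the psmisc package) may itself need installing on minimal images:

```shell
# Wait until no process holds the dpkg frontend lock, up to a timeout.
wait_for_dpkg_lock() {
  local lock="${1:-/var/lib/dpkg/lock-frontend}" timeout="${2:-120}" waited=0
  while fuser "$lock" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "dpkg lock on $lock still held after ${timeout}s" >&2
      return 1
    fi
    sleep 2
    waited=$((waited + 2))
  done
}

# usage: wait_for_dpkg_lock && sudo apt-get install -y <packages>
```

Note fuser is the right probe here (apt uses fcntl locks, which flock -n would not detect).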

5. SSH metadata not available during workspace sync

  • Symptom: workspace sync failed: Failed to sync workspace to pod: Remote helper bootstrap failed: No SSH metadata for pod after 10s
  • Cause: The SSH upgrade from relay to direct SSH fails or is too slow, leaving the sync without SSH metadata
  • Related to: Bug #1 (DNS resolution failure) — both are sync-phase connectivity issues
  • Workaround: Retry after daemon restart
  • Severity: Intermittent blocking
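Since the workaround for both this bug and Bug #1 is "retry", a blanket retry wrapper can paper over the sync-phase flakiness. A sketch — the daemon restart between attempts is left as a comment because the exact restart command isn't documented in these notes:

```shell
# Retry a command up to N times with a short pause between attempts.
retry() {
  local attempts="$1"; shift
  local i
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    echo "attempt $i/$attempts failed" >&2
    # restart the GPU daemon here before retrying, if needed
    sleep 5
  done
  return 1
}

# usage: retry 3 gpu run
```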

6. GPU CLI auto-installs from requirements.txt even without environment.python config

  • Symptom: Even with no environment section in gpu.jsonc, GPU CLI auto-detects requirements.txt and runs uv pip install during the install phase
  • Impact: Helpful but undocumented. It also makes the environment.python config redundant and confusing.
  • Note: The auto-install uses uv (auto-downloaded) and caches packages on the global volume — subsequent runs show Python dependencies already installed (hash: ...)
  • Severity: Not a bug per se, but confusing behavior that conflicts with docs

7. torchvision circular import on pod images with pre-installed PyTorch

  • Symptom: RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error: partially initialized module 'torchvision' has no attribute 'extension' (most likely due to a circular import)
  • Cause: Pod images come with pre-installed PyTorch (e.g. torch==2.4.1+cu124, torchvision==0.19.1+cu124). When uv pip install upgrades torch/torchaudio from requirements.txt (to 2.6.0), the old torchvision (0.19.1) becomes incompatible. The transformers library triggers the circular import when loading models.
  • Workaround: Add torchvision explicitly to requirements.txt so it gets upgraded alongside torch/torchaudio to a compatible version
  • Severity: Blocking — any package that imports transformers models will fail on first deploy
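Concretely, the workaround means pinning a mutually compatible trio in requirements.txt. The versions below assume the upgrade target named above (torch 2.6.0); torchvision 0.21.0 and torchaudio 2.6.0 are the releases published alongside that torch version — double-check against the official PyTorch compatibility matrix:

```
torch==2.6.0
torchaudio==2.6.0
torchvision==0.21.0
```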

8. gpu run output doesn't stream pip install completion

  • Symptom: gpu run shows pip download and uninstall progress but stops streaming output before showing the "Installed X packages" line. The health check loop runs blind while pip finishes installing in the background.
  • Cause: The relay connection or log streaming truncates long pip install output. The install actually completes (visible in gpu logs) but gpu run doesn't stream it.
  • Severity: Low — cosmetic, but makes debugging difficult during deploys

9. Reverse sync overwrites local file edits

  • Symptom: After editing files locally (e.g. tts_server.py, gpu.jsonc), gpu run syncs OLD versions from the pod back to the local machine, reverting all local changes
  • Cause: GPU CLI performs a bidirectional sync — it syncs workspace TO the pod before running, but also syncs FROM the pod back to local after the run (or when the relay dies). The outputs field in gpu.jsonc (["output/"]) should limit what syncs back, but the entire workspace appears to sync bidirectionally.
  • Impact: Any local edits made between gpu run calls get silently overwritten. This is especially destructive when iterating on server code — you fix a bug, the sync reverts the fix.
  • Workaround: Stop the pod before editing files, or re-apply edits after each gpu run
  • Severity: Blocking — makes iterative development extremely difficult
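Until the reverse sync is fixed, local edits can be guarded by snapshotting with git before each run and restoring afterwards. A sketch, assuming the workspace is (or can be made) a git repo — the temp directory and file contents below just simulate the revert for illustration:

```shell
# Simulate the workspace in a temp dir (stand-in for the real project).
cd "$(mktemp -d)"
git init -q . && git config user.email you@example.com && git config user.name you

echo "fixed bug" > tts_server.py          # a local edit you want to keep
git add -A && git commit -qm "pre-run snapshot"

# ... gpu run would go here; suppose the reverse sync reverts the file:
echo "stale pod copy" > tts_server.py

git checkout -- tts_server.py             # restore the snapshotted edit
cat tts_server.py                          # → fixed bug
```

This doesn't stop the overwrite, but it makes the revert a one-command recovery instead of lost work.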