Add JARVIS-style TTS status reports with neural voice and HTTP API#9

Open
DannyNs wants to merge 1 commit into dyoburon:main from DannyNs:feature/tts-status-reports
Conversation

@DannyNs DannyNs commented Mar 18, 2026

  • Neural TTS via edge-tts (en-GB-RyanNeural) with chunked playback for low-latency long text. The first chunk plays immediately while the rest generates in the background. Falls back to platform TTS (SAPI/say/espeak-ng) when offline.

  • HTTP API server on port 7865: POST /api/speak lets any external tool (Claude Code, scripts, browser) trigger spoken feedback.

  • Feedback hotkey mode (Cmd+Shift+F) pastes transcription with endpoint instructions so the receiving LLM can speak back via the API.

  • Cross-platform: Python CLI, Windows native (C#), macOS native (Swift). Native apps shell out to edge-tts CLI with ffplay/afplay for headless playback, falling back to built-in speech synthesizers.

  • Configurable via ~/.vibetotext/config.json: tts_enabled, tts_voice, tts_edge_rate, tts_edge_pitch, tts_rate, tts_volume.
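The HTTP endpoint above can be exercised from any language. A minimal Python sketch follows; note the `{"text": ...}` JSON payload shape is an assumption for illustration, since the PR only documents the route and port:

```python
import json
import urllib.request

def build_speak_request(text: str) -> urllib.request.Request:
    """Build a POST to the local speak API.

    The endpoint (127.0.0.1:7865, POST /api/speak) comes from the PR
    description; the JSON body field "text" is a hypothetical schema.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        "http://127.0.0.1:7865/api/speak",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_speak_request("Build finished, all tests passing.")
# Sending it would be: urllib.request.urlopen(req)  # requires the server running
```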

```mermaid
flowchart TD
    A[🎤 User speaks\nhold hotkey + speak] --> B[Whisper transcribes\nlocal, offline]
    B --> C{Which hotkey mode?}

    C -->|Ctrl+Shift| D[Transcribe]
    C -->|Alt+Shift| E[Cleanup\nGemini refines]
    C -->|Cmd+Alt| F[Plan\nGemini generates]
    C -->|Greppy| G[Greppy\nsemantic search]
    C -->|Cmd+Shift+F| H[Feedback mode]

    D --> I[paste_at_cursor\ntext lands in editor]
    E --> I
    F --> I
    G --> I

    I --> J[speak_status via edge-tts]
    J --> K[Chunked playback pipeline]

    K --> K1{text > 100 chars?}
    K1 -->|No| K2[Single mp3 → 🔊]
    K1 -->|Yes| K3[Rolling chunks of 2-3 sentences]
    K3 --> K4[Chunk 1: generate + play 🔊]
    K3 --> K5[Chunk 2: generate in parallel...]
    K5 --> K6[Play when ready 🔊]

    H --> L[Paste transcription\n+ endpoint info]
    L --> M[LLM reads paste\ne.g. Claude Code]
    M --> N[LLM does work, then calls\nPOST /api/speak]

    N --> O[HTTP API Server\n127.0.0.1:7865]
    O --> P[edge-tts → mp3\nffplay/afplay → 🔊]

    style A fill:#4CAF50,color:#fff
    style H fill:#FF9800,color:#fff
    style O fill:#2196F3,color:#fff
    style K2 fill:#8BC34A,color:#fff
    style K6 fill:#8BC34A,color:#fff
    style P fill:#8BC34A,color:#fff
```
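The chunking decision in the diagram (short text stays whole; longer text becomes rolling chunks of 2-3 sentences) can be sketched as follows. The sentence-splitting regex and the exact thresholds are simplifying assumptions, not the PR's implementation:

```python
import re

def chunk_sentences(text: str, per_chunk: int = 3, threshold: int = 100) -> list[str]:
    """Split long text into rolling chunks of a few sentences each.

    Mirrors the diagram's branch: text at or under `threshold` chars is
    returned as a single chunk; longer text is cut so the first chunk can
    start playing while later chunks are still being generated.
    """
    if len(text) <= threshold:
        return [text]
    # Naive sentence boundary: whitespace after ., !, or ? (assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + per_chunk])
            for i in range(0, len(sentences), per_chunk)]
```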

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DannyNs added a commit to DannyNsITServices/devglide that referenced this pull request on Mar 18, 2026:
Split long text into rolling chunks of 2-3 sentences and pipeline
generation + playback so the first chunk plays almost immediately
while subsequent chunks generate in the background.

Reference: dyoburon/vibetotext#9
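The "pipeline generation + playback" idea in that commit can be illustrated with a small producer/consumer sketch. `generate` and `play` are hypothetical callables standing in for edge-tts synthesis and ffplay/afplay playback; the one-deep buffer is an assumed design choice:

```python
import queue
import threading

def pipeline(chunks, generate, play):
    """Overlap TTS generation with playback.

    A worker thread generates each chunk's audio while the main thread
    plays finished chunks in order, so chunk 1 starts playing while
    chunk 2 is still rendering.
    """
    q: queue.Queue = queue.Queue(maxsize=1)  # stay ~1 chunk ahead of playback

    def worker():
        for chunk in chunks:
            q.put(generate(chunk))  # blocks while the buffer is full
        q.put(None)                 # sentinel: no more audio

    threading.Thread(target=worker, daemon=True).start()
    while (audio := q.get()) is not None:
        play(audio)
```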