TTS Service (Text-to-Speech)
The TTS Service (port 8003) converts text into natural-sounding speech. It supports both high-quality offline models and fast online providers.
Supported Providers
1. Genie-TTS (High Quality)
A high-performance CPU inference engine for GPT-SoVITS models.
- Character Consistency: Uses pre-trained character models.
- Language Support: Optimized for Japanese, Chinese, and English.
- Latency: GENIE optimizes the original model for outstanding CPU performance.
2. Edge-TTS (Fast & Easy)
Uses Microsoft Edge's online TTS engine.
- No GPU Required: Processing happens on Microsoft servers.
- Wide Voice Selection: Access to dozens of neural voices.
- Requirement: ffmpeg must be installed locally for format conversion.
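
Because the Edge-TTS path depends on a local ffmpeg binary, it can help to check for it at startup rather than fail on the first request. The sketch below is illustrative only: the helper names, the 24 kHz mono default, and the choice of raw 16-bit PCM output are assumptions, not the service's documented conversion settings.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def build_pcm_command(sample_rate: int = 24000, channels: int = 1) -> list[str]:
    """Build an ffmpeg command that reads encoded audio from stdin
    and writes raw signed 16-bit little-endian PCM to stdout.

    The sample rate and channel count are assumed defaults.
    """
    return [
        "ffmpeg",
        "-hide_banner", "-loglevel", "error",
        "-i", "pipe:0",          # encoded audio arrives on stdin
        "-f", "s16le",           # raw 16-bit little-endian PCM
        "-ar", str(sample_rate),
        "-ac", str(channels),
        "pipe:1",                # converted audio leaves on stdout
    ]


if __name__ == "__main__":
    if not ffmpeg_available():
        print("ffmpeg not found on PATH; Edge-TTS format conversion would fail")
    else:
        print(" ".join(build_pcm_command()))
```

The command list can be passed to `subprocess.run` with the encoded audio supplied as `input=`.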
3. ChatTTS
Optimized for conversational speech, including fillers like [laugh] and [um].
WebSocket Streaming Protocol
To minimize "Time to First Byte" (TTFB), the TTS Service supports streaming synthesis:
- Client Request:
- Audio Streaming: The service begins generating audio and sends it back as raw binary PCM chunks.
- Finalization: When finalize: true is sent, the service finishes the current buffer and closes the stream segment.
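
The client side of this exchange might look roughly like the following transport-agnostic sketch. The request schema is not shown in this section, so the JSON shape and the "text" field are hypothetical; only the finalize flag and the binary PCM chunks come from the description above. `PcmCollector` is an illustrative helper, not part of the service.

```python
import json


def build_request(text: str, finalize: bool = False) -> str:
    """Serialize one client message.

    The "text" key is an assumed field name; "finalize" mirrors
    the flag described in the protocol.
    """
    return json.dumps({"text": text, "finalize": finalize})


class PcmCollector:
    """Accumulates binary PCM chunks received from the WebSocket."""

    def __init__(self) -> None:
        self._chunks: list[bytes] = []

    def feed(self, message) -> None:
        # Binary frames are treated as audio; text frames
        # (for example status messages) are ignored here.
        if isinstance(message, (bytes, bytearray)):
            self._chunks.append(bytes(message))

    def pcm(self) -> bytes:
        """Return all audio received so far as one byte string."""
        return b"".join(self._chunks)


if __name__ == "__main__":
    print(build_request("Hello there.", finalize=True))
    collector = PcmCollector()
    for frame in (b"\x00\x01", "status", b"\x02\x03"):
        collector.feed(frame)
    print(len(collector.pcm()), "bytes of PCM collected")
```

In a real client, each incoming WebSocket frame would be passed to `feed`, and playback could start as soon as the first chunk arrives rather than after `pcm()` is complete.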
Latency Optimization
- Pre-connection: The Orchestrator's TTSManager opens a WebSocket connection as soon as the LLM starts generating, reducing the handshake overhead when the first sentence is ready.
- Sentence-Level Chunking: The system doesn't wait for the full LLM response. As soon as a sentence is complete (detected by punctuation or tags), it is sent to the TTS service.
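
Sentence-level chunking can be sketched as a small buffer that consumes LLM tokens and emits each sentence once its closing punctuation arrives. This is a minimal illustration, not the Orchestrator's actual logic: `SentenceChunker` and its punctuation set are assumptions, and tag-based detection is left out.

```python
import re

# Sentence-ending punctuation (ASCII and CJK). This set is an assumption,
# not the rule set the Orchestrator actually uses.
_SENTENCE_END = re.compile(r"[.!?。！？]+\s*")


class SentenceChunker:
    """Buffers streamed tokens and yields complete sentences."""

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, token: str) -> list[str]:
        """Add one token; return any sentences completed by it."""
        self._buffer += token
        sentences = []
        while True:
            match = _SENTENCE_END.search(self._buffer)
            if match is None:
                break
            sentence = self._buffer[:match.end()].strip()
            self._buffer = self._buffer[match.end():]
            if sentence:
                sentences.append(sentence)
        return sentences

    def flush(self) -> str:
        """Return whatever text remains when the LLM stream ends."""
        rest, self._buffer = self._buffer.strip(), ""
        return rest


if __name__ == "__main__":
    chunker = SentenceChunker()
    for token in ["Hel", "lo there. ", "How are", " you? I am", " fine"]:
        for sentence in chunker.feed(token):
            print("send to TTS:", sentence)
    tail = chunker.flush()
    if tail:
        print("send to TTS (final):", tail)
```

A punctuation-only rule like this splits on decimals and abbreviations ("3.14", "e.g."), so a production splitter would need extra guards; the remaining text returned by `flush` would be the natural place to send with finalize set.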