Audio Player Service

The Audio Player provides low-latency streaming audio playback for the voice agent system. It plays audio chunks in real time as the TTS service generates them, which keeps voice interactions responsive.

Features

  • Streaming Playback: Plays audio chunks as they arrive, without waiting for the complete audio stream.
  • Low Latency: Optimized for minimal delay between audio generation and playback (the output buffer holds roughly 50 ms of audio).
  • Automatic Resampling: Handles sample rate conversion between TTS output and playback device requirements.
  • Device Selection: Supports custom audio output device selection via configuration.
  • State Management: Provides callbacks for play state changes, enabling coordination with other system components.

Architecture

The AudioPlayer splits its work across three cooperating components:

  1. Main Thread: Receives audio chunks via play_audio_chunk() and queues them for processing.
  2. Playback Loop: An async task that processes the queue, resamples audio, and feeds a buffer.
  3. Audio Thread: A separate thread (managed by sounddevice) that continuously reads from the buffer and outputs to the audio device.

Because device output runs on its own thread, playback is not blocked by other operations, and incoming audio chunks can be queued immediately.

sequenceDiagram
    participant Q as audio_queue
    participant L as _playback_loop
    participant B as _audio_buffer
    participant C as _audio_callback (Thread)

    L->>Q: await get()
    Q-->>L: item arrives
    L->>B: push data
    L->>L: set _audio_active = True
    Note over L: notify on_play_state(True)

    loop Playback
        C->>B: pop data
        B-->>C: chunk
        Note over C: output to speakers
    end

    C->>B: pop data (empty!)
    B-->>C: empty
    C->>L: call_soon_threadsafe(event.set)
    Note over L: wake up immediately
    L->>L: set _audio_active = False
    Note over L: notify on_play_state(False)
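
The hand-off from the audio thread back to the playback loop in the diagram above can be sketched with the standard library alone. This is a minimal illustration, not the AudioPlayer's actual code: the `states` list stands in for the on_play_state callback, and a plain `threading.Thread` stands in for the callback thread that sounddevice manages.

```python
import asyncio
import threading


async def main() -> list[bool]:
    loop = asyncio.get_running_loop()
    buffer_empty = asyncio.Event()  # set from the audio thread, awaited by the loop
    states: list[bool] = []         # records what on_play_state would receive

    def audio_callback_thread() -> None:
        # Stand-in for the device callback: pretend the buffer just ran empty.
        # asyncio.Event is not thread-safe, so the set() call is handed to the
        # event loop instead of being invoked directly from this thread.
        loop.call_soon_threadsafe(buffer_empty.set)

    states.append(True)             # data was pushed: playback became active
    worker = threading.Thread(target=audio_callback_thread)
    worker.start()
    await buffer_empty.wait()       # resumes once the loop has run event.set
    states.append(False)            # buffer drained: playback became inactive
    worker.join()
    return states


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Scheduling `event.set` through `call_soon_threadsafe` lets the loop wake up without polling the buffer, which matches the "wake up immediately" note in the diagram.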

Configuration

Audio Player settings are configured in config.yaml under the audio.output section:

audio:
  output:
    sample_rate: 32000  # Playback sample rate (Hz)
    channels: 1         # Number of audio channels (1 = mono, 2 = stereo)
    device: null        # Output device index or name (null = system default)
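
As an illustration of how these settings might be consumed, the sketch below reads the audio.output section from an already-parsed mapping (for example the result of `yaml.safe_load`). The `OutputConfig` class and its `from_mapping` method are hypothetical names, and the defaults simply mirror the example values above rather than the project's real defaults.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass(frozen=True)
class OutputConfig:
    """Hypothetical container for the audio.output settings."""

    sample_rate: int = 32000                    # playback sample rate in Hz
    channels: int = 1                           # 1 = mono, 2 = stereo
    device: Optional[Union[int, str]] = None    # None = system default

    @classmethod
    def from_mapping(cls, config: dict[str, Any]) -> "OutputConfig":
        # Missing sections fall back to the defaults declared above.
        section = (config.get("audio") or {}).get("output") or {}
        settings = cls(**section)
        if settings.sample_rate <= 0:
            raise ValueError("sample_rate must be positive")
        if settings.channels < 1:
            raise ValueError("channels must be at least 1")
        return settings


if __name__ == "__main__":
    parsed = {"audio": {"output": {"sample_rate": 48000, "channels": 2, "device": None}}}
    print(OutputConfig.from_mapping(parsed))
```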

Sample Rate

The playback sample rate should match your audio device's capabilities. Common values:

  • 16000 Hz: Standard for voice/speech
  • 32000 Hz: Higher quality for voice
  • 44100 Hz: CD quality
  • 48000 Hz: Professional audio standard

Device Selection

To use a specific audio output device:

  1. List available devices using the provided script:

    uv run scripts/list_output_devices.py
    

  2. Set the device in config.yaml using one of the following forms:

     • By index: device: 1
     • By name: device: "Speakers (Realtek Audio)"
     • System default: device: null
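
The three forms of the device setting can be resolved to a concrete device index along the following lines. This is a sketch under assumptions: `resolve_output_device` is a hypothetical helper, and the device list is a hand-written stand-in shaped like the entries returned by `sounddevice.query_devices()`, which carry `name` and `max_output_channels` keys.

```python
from typing import Optional, Sequence, Union


def resolve_output_device(
    setting: Optional[Union[int, str]],
    devices: Sequence[dict],
) -> Optional[int]:
    """Map a config value to a device index; None means "use the system default"."""
    if setting is None:
        return None
    if isinstance(setting, int):
        if not 0 <= setting < len(devices):
            raise ValueError(f"no device with index {setting}")
        candidates = [setting]
    else:
        # Case-insensitive substring match on the device name.
        wanted = setting.lower()
        candidates = [i for i, dev in enumerate(devices) if wanted in dev["name"].lower()]
    for index in candidates:
        # Only devices with at least one output channel can play audio.
        if devices[index]["max_output_channels"] > 0:
            return index
    raise ValueError(f"no output device matches {setting!r}")


if __name__ == "__main__":
    fake_devices = [
        {"name": "Microphone (Realtek Audio)", "max_output_channels": 0},
        {"name": "Speakers (Realtek Audio)", "max_output_channels": 2},
    ]
    print(resolve_output_device("Speakers (Realtek Audio)", fake_devices))
```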

Audio Format

Input Format

  • Data Type: bytes containing float32 samples
  • Normalization: Samples should be in the range [-1.0, 1.0]
  • Layout: Interleaved samples (for multi-channel audio)
  • Sample Rate: Any rate supported by the source (TTS service typically outputs 24000 Hz)
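
To make the input format concrete, the following stdlib-only sketch builds a chunk of float32 samples and decodes it back into frames. The helper names are hypothetical, and the `array` module is used only to keep the example dependency-free; with NumPy the same decoding is `np.frombuffer(data, dtype=np.float32)`. The sketch assumes native byte order.

```python
import math
from array import array


def make_tone_chunk(frequency_hz: float, sample_rate: int, num_frames: int,
                    amplitude: float = 0.5) -> bytes:
    """Return mono float32 samples of a sine tone, packed as bytes."""
    samples = array(
        "f",
        (amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
         for n in range(num_frames)),
    )
    return samples.tobytes()  # 4 bytes per sample


def decode_chunk(data: bytes, channels: int = 1) -> list[tuple[float, ...]]:
    """Split interleaved float32 bytes into frames of one sample per channel."""
    samples = array("f")
    samples.frombytes(data)
    if len(samples) % channels:
        raise ValueError("chunk does not contain a whole number of frames")
    return [tuple(samples[i:i + channels]) for i in range(0, len(samples), channels)]


if __name__ == "__main__":
    chunk = make_tone_chunk(440.0, 24000, 240)   # 10 ms at 24000 Hz
    frames = decode_chunk(chunk)
    print(len(chunk), len(frames))
```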

Processing Pipeline

  1. Queue: Incoming chunks are added to an async queue
  2. Conversion: Bytes are converted to NumPy arrays
  3. Resampling: Audio is resampled to the playback sample rate (if needed)
  4. Buffering: Resampled audio is added to a thread-safe buffer
  5. Playback: The audio device callback reads from the buffer in real-time
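
The first four stages can be sketched as an asyncio producer/consumer pair. This is an illustration under assumptions, not the AudioPlayer's code: the `None` sentinel is a hypothetical shutdown signal, resampling is left out, and a plain list stands in for the thread-safe buffer that the device callback would read.

```python
import asyncio
from array import array


async def playback_loop(queue: asyncio.Queue, buffer: list[float]) -> None:
    """Drain queued chunks into the playback buffer until a None sentinel arrives."""
    while True:
        chunk = await queue.get()
        if chunk is None:            # hypothetical shutdown signal
            break
        samples = array("f")
        samples.frombytes(chunk)     # step 2: bytes -> float32 samples
        # Step 3 (resampling) is omitted in this sketch.
        buffer.extend(samples)       # step 4: hand samples to the playback buffer


async def main() -> list[float]:
    queue: asyncio.Queue = asyncio.Queue()
    buffer: list[float] = []
    consumer = asyncio.create_task(playback_loop(queue, buffer))
    for values in ([0.0, 0.25], [0.5, -0.5]):
        queue.put_nowait(array("f", values).tobytes())  # step 1: enqueue a chunk
    queue.put_nowait(None)
    await consumer
    return buffer


if __name__ == "__main__":
    print(asyncio.run(main()))
```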

Resampling

The AudioPlayer converts sample rates automatically using scipy.signal.resample(), which resamples in the frequency domain (FFT-based). Resampling happens on the fly, so there is no need to pre-convert audio to the playback sample rate.
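
A chunk-level conversion with scipy.signal.resample() might look like the sketch below. The helper names are hypothetical, and the rounding rule for the output length is an assumption rather than the AudioPlayer's documented behaviour. SciPy is imported inside the function so the length helper can be used without it.

```python
def resampled_length(num_samples: int, source_rate: int, target_rate: int) -> int:
    """Number of samples a chunk has after conversion to target_rate."""
    return round(num_samples * target_rate / source_rate)


def resample_chunk(samples, source_rate: int, target_rate: int):
    """Resample a 1-D float32 NumPy array; the input is returned as-is if rates match."""
    if source_rate == target_rate:
        return samples
    # Imported lazily: SciPy is only needed when a conversion actually happens.
    from scipy.signal import resample

    target_len = resampled_length(len(samples), source_rate, target_rate)
    # resample() may return float64, so convert back to float32 for playback.
    return resample(samples, target_len).astype("float32")


if __name__ == "__main__":
    # A 20 ms chunk at 24000 Hz (480 samples) becomes 640 samples at 32000 Hz.
    print(resampled_length(480, 24000, 32000))
```

Because scipy.signal.resample() uses a Fourier method, it treats its input as periodic, which is worth keeping in mind when short chunks are converted independently.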

Latency Optimization

The AudioPlayer is optimized for low latency:

  • Small Buffer: ~50ms buffer size minimizes delay
  • Streaming: Audio starts playing as soon as the first chunk arrives
  • No Pre-buffering: The system doesn't wait for complete audio before starting playback

This enables natural conversation flow, as the agent can start speaking almost immediately after generating the first audio chunk.
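
For a sense of scale, the arithmetic behind a 50 ms buffer can be worked through in a few lines. The helper names are hypothetical, and the 4 bytes per sample follow from the float32 input format; how the AudioPlayer actually sizes its buffer is not specified here.

```python
def buffer_frames(sample_rate: int, buffer_ms: float) -> int:
    """Frames needed to hold buffer_ms milliseconds of audio."""
    return round(sample_rate * buffer_ms / 1000)


def buffer_bytes(sample_rate: int, buffer_ms: float, channels: int = 1,
                 bytes_per_sample: int = 4) -> int:
    """Size in bytes of the same buffer for float32 (4-byte) samples."""
    return buffer_frames(sample_rate, buffer_ms) * channels * bytes_per_sample


if __name__ == "__main__":
    # 32000 Hz mono: 1600 frames, 6400 bytes
    print(buffer_frames(32000, 50), buffer_bytes(32000, 50))
    # 48000 Hz stereo: 2400 frames, 19200 bytes
    print(buffer_frames(48000, 50), buffer_bytes(48000, 50, channels=2))
```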

Thread Safety

The AudioPlayer is designed to be thread-safe:

  • Queue Operations: The async queue handles concurrent access safely
  • Buffer Access: Thread locks protect the audio buffer during read/write operations
  • State Management: Play state changes are synchronized via the event loop

You can safely call play_audio_chunk() from any async context without additional synchronization.
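
The lock-protected buffer access described above can be sketched as follows. `LockedSampleBuffer` is a hypothetical class, not the AudioPlayer's internal buffer; padding with silence on underrun is an assumption about how an empty buffer would be handled.

```python
import threading
from collections import deque


class LockedSampleBuffer:
    """Sample FIFO shared between a producer and the audio callback thread."""

    def __init__(self) -> None:
        self._samples: deque = deque()
        self._lock = threading.Lock()

    def push(self, samples) -> None:
        """Append samples; called by the producer side."""
        with self._lock:
            self._samples.extend(samples)

    def pop(self, count: int) -> tuple[list[float], bool]:
        """Remove up to count samples, padding with silence; also report underrun."""
        with self._lock:
            available = min(count, len(self._samples))
            out = [self._samples.popleft() for _ in range(available)]
        underrun = available < count
        out.extend([0.0] * (count - available))  # fill the gap with silence
        return out, underrun


if __name__ == "__main__":
    buf = LockedSampleBuffer()
    buf.push([0.1, 0.2, 0.3])
    print(buf.pop(2))
    print(buf.pop(2))
```

The underrun flag is the kind of signal a callback could use to tell the playback loop that the buffer has run empty.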

Error Handling

The AudioPlayer handles three classes of errors:

  • Device Errors: Logs warnings if the audio device reports issues
  • Processing Errors: Catches and logs exceptions during audio processing
  • Stream Errors: Gracefully handles stream closure and cleanup

If an error occurs, the playback loop will stop and clean up resources, but the player can be reused by calling play_audio_chunk() again.

Integration with TTS Service

The AudioPlayer is typically used by the Orchestrator's TTSManager to play audio received from the TTS service:

  1. TTS service generates audio chunks via WebSocket
  2. TTSManager receives chunks and calls player.play_audio_chunk()
  3. AudioPlayer streams the audio to the output device
  4. Play state callbacks notify the system when playback starts/stops

This integration enables seamless voice interactions with minimal latency.