
Voice AI & Speech

Bit: In 2024, talking to AI felt like talking to Siri — robotic and frustrating. By 2026, voice AI agents do customer support calls, conduct job interviews, and have natural conversations with sub-300ms latency. The interface is disappearing — you just talk.


★ TL;DR

  • What: AI systems that understand speech (STT), generate speech (TTS), and enable real-time voice conversations
  • Why: Voice is the most natural human interface. Voice AI agents are the fastest-growing GenAI application vertical. deeplearning.ai has a dedicated course on it.
  • Key point: Modern voice AI isn't just STT→LLM→TTS glued together. End-to-end models process speech directly, achieving human-like latency and expressiveness.

★ Overview

Definition

  • STT (Speech-to-Text): Converting spoken audio to text (also called ASR — Automatic Speech Recognition)
  • TTS (Text-to-Speech): Converting text to natural-sounding audio
  • Voice Agent: An AI agent that converses in real-time via voice, handling turn-taking, interruptions, and context

Scope

Covers the voice AI stack for GenAI applications. For the broader dialogue-systems view, see Conversational AI & Dialogue Systems. For general multimodal AI (images, video), see Multimodal AI.

Provider and product examples in this note were last verified 2026-04.


★ Deep Dive

Voice AI Architecture

TRADITIONAL PIPELINE (cascaded):
  ┌─────────────┐    ┌───────────┐    ┌──────────────┐
  │     STT     │───▶│    LLM    │───▶│     TTS      │
  │ (Whisper V4)│    │ (GPT-5.4) │    │ (ElevenLabs) │
  └─────────────┘    └───────────┘    └──────────────┘
  Audio → Text → Text → Audio

  Latency: STT(300ms) + LLM(500ms) + TTS(200ms) = ~1000ms
  ❌ Loses tone, emotion, context from audio
  ❌ Error compounds across stages

MODERN END-TO-END (speech-to-speech):
  ┌──────────────────────────────────┐
  │  MULTIMODAL MODEL                │
  │  (GPT-5.4, Gemini 3.1 Live)      │
  │                                  │
  │  Audio IN ──────▶ Audio OUT      │
  │  (understands tone, emotion,     │
  │   generates expressive speech)   │
  └──────────────────────────────────┘

  Latency: ~250-400ms (one model, no pipeline)
  ✅ Preserves audio context
  ✅ Natural turn-taking and interruptions

STT (Speech-to-Text) Models

Model             By          Languages  Key Feature
Whisper (V4)      OpenAI      100+       Open-source, diarization, real-time streaming
Deepgram Nova-3   Deepgram    40+        5.26% WER, real-time, speaker diarization
Deepgram Flux     Deepgram    Multi      Conversational AI, turn detection, ultra-low latency
Google Chirp 3    Google      100+       Multilingual champion, diarization
Azure Speech      Microsoft   100+       Enterprise, custom models
AssemblyAI        AssemblyAI  Multi      Summarization built-in

STT KEY CONCEPTS:
  WER (Word Error Rate) — lower is better, top models < 5% (see the sketch after this list)
  Diarization — who said what ("Speaker 1: ... Speaker 2: ...")
  VAD (Voice Activity Detection) — detect when someone is speaking
  Streaming vs Batch — real-time transcription vs file processing
  Code-switching — handling mid-sentence language switches
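
To make WER concrete, here is a minimal sketch using the jiwer package (the package choice and example strings are illustrative, not from this note):

# pip install jiwer
from jiwer import wer

reference  = "book a table for two at seven pm"    # ground-truth transcript
hypothesis = "book a table for two at eleven pm"   # STT output with one substitution

# WER = (substitutions + deletions + insertions) / words in the reference
error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")   # 1 error / 8 words = 12.50%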

TTS (Text-to-Speech) Models

Model             By           Latency  Key Feature
ElevenLabs        ElevenLabs   ~100ms   Best quality, voice cloning, emotional
OpenAI TTS        OpenAI       ~200ms   Simple API, good quality
Google Cloud TTS  Google       ~150ms   Neural2 voices, SSML
Cartesia Sonic    Cartesia     ~50ms    Ultra-low latency
Fish Speech       Open-source  Varies   Open, voice cloning
XTTS (Coqui)      Open-source  Varies   Open, multilingual

TTS KEY CONCEPTS:
  Voice cloning — reproduce a specific voice from seconds of audio
  Emotional tags — <happy>, <serious>, <whisper> control
  SSML — Speech Synthesis Markup Language (pauses, emphasis); see the example after this list
  Prosody — rhythm, stress, intonation patterns
  Streaming — start playing audio before the full response is ready
  Zero-shot — generate a believable voice from a short reference sample, without fine-tuning
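
A short SSML sketch, assuming Google Cloud TTS with a Neural2 voice (the voice name, output filename, and message text are illustrative; any SSML-capable engine accepts similar markup):

# pip install google-cloud-texttospeech   (requires GCP credentials)
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# SSML expresses pauses, emphasis, and prosody that plain text cannot
ssml = """
<speak>
  Your appointment is confirmed.
  <break time="400ms"/>
  <emphasis level="strong">Please arrive ten minutes early.</emphasis>
  <prosody rate="slow" pitch="-2st">Reply STOP to cancel.</prosody>
</speak>
"""

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(ssml=ssml),
    voice=texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Neural2-F"),
    audio_config=texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3),
)
with open("confirmation.mp3", "wb") as f:
    f.write(response.audio_content)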

Real-Time Voice APIs

OPENAI REALTIME API (GPT-5.4):
  - WebSocket-based, full-duplex
  - Speech-to-speech (no intermediate text needed)
  - Sub-300ms latency
  - Interruption handling (user can cut in)
  - Function calling during voice conversation
  - Tools: search, compute, external APIs

  Architecture:
    User speaks → WebSocket → GPT-5.4 processes audio →
    Generates audio response → Streams back → User hears
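
Below is a minimal single-turn sketch of that WebSocket flow using the websockets package. The URL, beta header, event names, and pcm16 audio format follow the Realtime API reference at the time of writing; the model name is illustrative, and all of these should be verified against the current docs:

# pip install websockets>=14   (older versions use extra_headers= instead of additional_headers=)
# ⚠️ Sketch only; verify event names and model against the current Realtime API reference
import asyncio, base64, json, os
import websockets

async def one_voice_turn(pcm16_audio: bytes) -> bytes:
    """Send one user utterance (raw 16-bit, 24 kHz mono PCM) and collect the audio reply."""
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    reply_audio = bytearray()
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Manual turn mode: disable server VAD so this client decides when the turn ends
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"modalities": ["audio", "text"], "voice": "alloy",
                        "turn_detection": None},
        }))
        # Stream the user's audio into the input buffer, then commit the turn
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_audio).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))
        # Collect streamed audio deltas until the response finishes
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                reply_audio.extend(base64.b64decode(event["delta"]))
            elif event["type"] in ("response.done", "error"):
                break
    return bytes(reply_audio)

# asyncio.run(one_voice_turn(open("turn.pcm", "rb").read()))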

GOOGLE GEMINI 3.1 LIVE:
  - Multimodal real-time: voice + video + screen
  - ADK integration for building voice agents
  - 3-tier thinking for adaptive complexity

VOICE AGENT ARCHITECTURE:
  ┌───────────────────────────────────────────────┐
  │  VAD: Is there speech? Start listening.       │
  │       │                                       │
  │       ▼                                       │
  │  STT/End-to-end: Transcribe or process audio  │
  │       │                                       │
  │       ▼                                       │
  │  LLM: Understand intent, generate response    │
  │       │ ← Tools: calendar, CRM, database      │
  │       ▼                                       │
  │  TTS: Generate natural speech response        │
  │       │                                       │
  │       ▼                                       │
  │  Turn management: Handle interruptions,       │
  │  silence, back-channeling ("uh-huh")          │
  └───────────────────────────────────────────────┘
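
A minimal end-of-turn sketch for the VAD/turn-management stages, using webrtcvad with a fixed silence threshold. The frame size, 700 ms threshold, and helper name are assumptions for illustration; production agents adapt the threshold or add a semantic end-of-turn model, as the gotchas below note.

# pip install webrtcvad
import webrtcvad

SAMPLE_RATE = 16000                       # 16 kHz, 16-bit mono PCM
FRAME_MS = 30                             # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2
END_OF_TURN_MS = 700                      # naive fixed silence threshold; tune per use case

def detect_end_of_turn(pcm_frames, aggressiveness: int = 2) -> bool:
    """Return True once enough consecutive silent frames follow speech."""
    vad = webrtcvad.Vad(aggressiveness)   # 0 = most permissive, 3 = most aggressive
    heard_speech = False
    silent_ms = 0
    for frame in pcm_frames:
        if len(frame) != FRAME_BYTES:
            continue                      # skip partial frames at stream edges
        if vad.is_speech(frame, SAMPLE_RATE):
            heard_speech = True
            silent_ms = 0
        elif heard_speech:
            silent_ms += FRAME_MS
            if silent_ms >= END_OF_TURN_MS:
                return True               # user finished speaking: hand off to STT/LLM
    return False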

Applications (2026)

Application        Example                              Stack
Customer support   AI handles tier-1 calls              STT + LLM + TTS + CRM tools
Voice assistants   Next-gen Siri/Alexa                  End-to-end multimodal
Interview bots     Screening candidates                 Voice agent + scoring
Healthcare         Patient intake, symptom checker      Voice + medical knowledge
Language learning  Conversational practice              TTS + pronunciation scoring
Accessibility      Screen readers, voice control        STT + TTS
Podcasting         AI-generated podcasts (NotebookLM)   TTS from text content

◆ Quick Reference

VOICE AI DECISION TREE:
  Need batch transcription?        → Whisper (free, open)
  Need real-time transcription?    → Deepgram Nova-3
  Need best voice quality?         → ElevenLabs
  Need lowest latency TTS?         → Cartesia Sonic
  Need full voice conversation?    → OpenAI Realtime API
  Need voice agent framework?      → ADK + Gemini Live
  Need open-source voice?          → Whisper + XTTS/Fish

KEY METRICS:
  STT WER:           < 5% is excellent
  TTS latency:       < 100ms for real-time feel
  End-to-end:        < 500ms for natural conversation
  Voice clone quality: MOS > 4.0 (out of 5.0)

○ Gotchas & Common Mistakes

  • ⚠️ Cascaded latency: STT + LLM + TTS adds up. Use end-to-end models (Realtime API) for conversational applications.
  • ⚠️ Turn-taking is HARD: Detecting when the user is done speaking vs pausing to think is a major UX challenge. VAD alone isn't enough.
  • ⚠️ Voice cloning ethics: Cloning someone's voice without consent is illegal in many jurisdictions. Always get permission.
  • ⚠️ Accents and noise: STT accuracy drops significantly with heavy accents, background noise, or domain jargon. Test with real users.
  • ⚠️ Cost: Voice tokens are more expensive than text tokens. Budget carefully for voice applications.

○ Interview Angles

  • Q: How would you build a real-time voice AI agent?
  • A: Option 1 (simplest): OpenAI Realtime API — WebSocket-based, speech-to-speech, handles turn-taking and interruptions natively. Option 2 (customizable): Pipeline of Deepgram STT → LLM (with function calling for tools) → ElevenLabs TTS, with a VAD layer for turn management. Option 3 (Google ecosystem): ADK + Gemini Live for multi-agent voice systems. Key challenges: latency optimization, interruption handling, and graceful error recovery.

★ Code & Implementation

Speech-to-Text with Whisper + GPT Response

# pip install openai>=1.60
# ⚠️ Last tested: 2026-04 | Requires: openai>=1.60, OPENAI_API_KEY, a .wav/.mp3 file
from openai import OpenAI

client = OpenAI()

def voice_pipeline(audio_file: str, system_prompt: str = "You are a helpful voice assistant.") -> dict:
    """Full voice pipeline: STT → LLM → TTS."""
    # Step 1: Speech → Text (Whisper)
    with open(audio_file, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            language="en",    # omit for auto-detect
        )
    user_text = transcript.text
    print(f"Transcribed: {user_text}")

    # Step 2: Text → LLM Response
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user",   "content": user_text},
        ],
        max_tokens=200,
    )
    answer_text = response.choices[0].message.content
    print(f"LLM Answer: {answer_text}")

    # Step 3: Text → Speech (TTS)
    speech = client.audio.speech.create(
        model="tts-1",          # tts-1-hd for higher quality
        voice="nova",           # alloy|echo|fable|onyx|nova|shimmer
        input=answer_text,
        response_format="mp3",
    )
    output_path = "response.mp3"
    speech.write_to_file(output_path)   # write_to_file avoids the deprecated stream_to_file on non-streaming responses
    return {"transcript": user_text, "answer": answer_text, "audio_file": output_path}

# Streaming TTS (lower latency for real-time)
def streaming_tts(text: str, output_path: str = "stream_output.mp3") -> None:
    """Stream TTS bytes as they arrive — good for low-latency voice assistants."""
    with client.audio.speech.with_streaming_response.create(
        model="tts-1", voice="nova", input=text
    ) as resp:
        resp.stream_to_file(output_path)
    print(f"Saved streaming TTS to {output_path}")

★ Connections

Relationship   Topics
Builds on      Multimodal AI, AI Agents
Leads to       Conversational AI, Accessibility, IoT/Edge AI
Compare with   Text-based chatbots, Screen-based UI
Cross-domain   Signal processing, Linguistics, UX design

◆ Production Failure Modes

Latency kills UX
  Symptoms:    Users hang up or disengage during voice interaction
  Root cause:  STT + LLM + TTS pipeline too slow
  Mitigation:  Streaming STT/TTS, edge processing, speculative responses

Accent/dialect failures
  Symptoms:    System fails on non-standard English or regional accents
  Root cause:  STT trained on limited accent diversity
  Mitigation:  Accent-specific models, preprocessing, user adaptation

Turn-taking confusion
  Symptoms:    System talks over user or has awkward pauses
  Root cause:  No barge-in detection, fixed silence thresholds
  Mitigation:  Voice activity detection (VAD), adaptive silence detection

◆ Hands-On Exercises

Exercise 1: Build a Voice-to-Voice Pipeline

Goal: Create an end-to-end voice conversation system
Time: 45 minutes
Steps:
  1. Set up STT (Whisper or Deepgram)
  2. Connect to an LLM for response generation
  3. Add TTS (ElevenLabs or Coqui) for audio output
  4. Measure end-to-end latency for 5 conversation turns (see the timing sketch below)
Expected Output: Working voice pipeline with latency measurements per stage
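
For step 4, a minimal per-stage timing sketch. The transcribe/respond/synthesize names are placeholders for whichever STT, LLM, and TTS calls you wired up in steps 1-3:

import time

def timed(label: str, fn, *args, **kwargs):
    """Run one pipeline stage and report wall-clock latency in milliseconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{label:<6}{elapsed_ms:8.1f} ms")
    return result, elapsed_ms

# Per-turn measurement (uncomment once the stage wrappers exist):
# text, stt_ms  = timed("STT", transcribe, "turn1.wav")
# reply, llm_ms = timed("LLM", respond, text)
# audio, tts_ms = timed("TTS", synthesize, reply)
# print(f"TOTAL {stt_ms + llm_ms + tts_ms:8.1f} ms   (cascaded target: < 1000 ms)")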


◆ Learning Resources

Type         Resource                          Why
🔧 Hands-on  OpenAI Audio API                  Production speech-to-text and text-to-speech
🔧 Hands-on  ElevenLabs Documentation          State-of-the-art voice synthesis
📄 Paper     Radford et al., "Whisper" (2022)  Robust speech recognition via large-scale supervision

★ Sources

  • OpenAI Realtime API — https://platform.openai.com/docs/guides/realtime
  • Whisper — https://github.com/openai/whisper
  • ElevenLabs — https://elevenlabs.io/docs
  • Deepgram — https://deepgram.com/docs
  • deeplearning.ai, "Building Live Voice Agents with Google's ADK" (2025)