OpenAI describes how to build voice agents that understand audio and respond in natural language using the OpenAI API and the Agents SDK. Two primary architectures are available: a speech-to-speech approach built on the Realtime API, and a chained approach that converts audio to text, runs the text through a large language model, and converts the reply back to speech. The speech-to-speech architecture relies on a single multimodal model, gpt-4o-realtime-preview, that processes audio input and output in real time without depending on transcripts. Because it works on the audio directly, it can perceive emotion, intent, and background noise and respond in speech, which suits highly interactive, low-latency use cases such as language tutoring, conversational search, and interactive customer service.
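The chained approach, by contrast, decomposes into three ordinary API calls. The following is a minimal, illustrative sketch using the OpenAI Python SDK and the models recommended in the next paragraph; the chained_turn helper, the audio file handling, and the choice of the alloy voice are assumptions made for the example rather than details from the guide.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chained_turn(audio_path: str) -> bytes:
    """One conversational turn: audio in -> transcript -> LLM reply -> audio out."""
    # 1. Speech to text
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="gpt-4o-transcribe", file=f)

    # 2. Text model generates the reply
    reply = client.responses.create(
        model="gpt-4.1",
        instructions="You are a concise, friendly voice assistant.",
        input=transcript.text,
    )

    # 3. Text to speech (wav keeps decoding overhead and latency low)
    speech = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=reply.output_text,
        response_format="wav",
    )
    return speech.content  # raw audio bytes, ready to play back
```

Each hop adds latency, which is why the guide positions this design around control and transparency rather than around the most interactive experiences.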
The chained architecture runs audio through gpt-4o-transcribe, then through a text model such as gpt-4.1, and finally through gpt-4o-mini-tts for speech synthesis. This approach is recommended for developers who are new to voice agents or who are converting existing large language model applications into voice experiences, because it emphasizes control, transparency, robust function calling, and structured workflows, making it a good fit for customer support, sales triage, and scenarios that require transcripts or scripted responses.

Building a speech-to-speech agent instead requires establishing a realtime connection via WebRTC or WebSocket, creating a session with the Realtime API, and using a model that supports realtime audio input and output. WebRTC is generally the better choice for client-side browser agents, while WebSockets are preferred for server-side agents such as phone call handlers; the TypeScript Agents SDK selects the appropriate transport automatically.
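As a server-side illustration, the sketch below uses the OpenAI Python SDK's beta realtime interface over WebSocket. It is a sketch under stated assumptions rather than the guide's own code: a real agent would stream caller audio into the session with input_audio_buffer.append events, whereas this example sends one text item to trigger a spoken response, and the event names should be checked against the current Realtime API reference.

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def main() -> None:
    # WebSocket transport, as preferred for server-side agents such as phone call handlers.
    async with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as connection:
        await connection.session.update(session={
            "modalities": ["audio", "text"],
            "instructions": "You are a friendly, fast customer service agent.",
            "turn_detection": {"type": "server_vad"},  # let the API detect end of speech
        })

        # A production agent streams microphone or telephony audio instead;
        # a single text turn is enough to exercise the session here.
        await connection.conversation.item.create(item={
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "Greet the caller in one sentence."}],
        })
        await connection.response.create()

        async for event in connection:
            if event.type == "response.audio.delta":
                ...  # event.delta is base64-encoded audio to decode and play back
            elif event.type == "response.done":
                break


asyncio.run(main())
```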
Designing an effective voice agent starts with focusing on a single task, limiting the available tools, and providing a clear escape hatch such as a handoff to a human or to another specialized agent. Critical information should often be included directly in the prompt rather than fetched through a tool call. Prompting is especially important for speech-to-speech agents: the guide provides detailed templates covering identity, task, demeanor, tone, enthusiasm, formality, emotion, filler words, pacing, and specific instructions, and conversation flows can be encoded as JSON-based state machines. The guide also shows how to implement agent handoff using tools like transferAgents, how to use the Realtime API's session.update event to switch to specialized agents, and how to extend agents with dedicated models by exposing text-based agents as function tools, including a supervisorAgent example that forwards cases to another service.
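A minimal sketch of the session.update handoff mechanics, reusing the realtime connection object from the previous example; the SPECIALISTS table, the agent names, and the hand_off helper are all hypothetical, and the surrounding application is assumed to have already detected the model's transferAgents-style tool call.

```python
# Illustrative specialist configurations; the names and prompts are placeholders.
SPECIALISTS = {
    "returns": {
        "instructions": "You handle product returns. Confirm the order number, then resolve or escalate.",
        "tools": [],  # tools specific to the returns agent would be listed here
    },
    "human_escalation": {
        "instructions": "Apologize briefly and tell the caller you are transferring them to a person.",
        "tools": [],
    },
}


async def hand_off(connection, agent_name: str) -> None:
    """Switch the live session to a specialized agent by rewriting its instructions and tools."""
    specialist = SPECIALISTS[agent_name]
    await connection.session.update(session={
        "instructions": specialist["instructions"],
        "tools": specialist["tools"],
    })
    # Ask the newly configured agent to pick up the conversation immediately.
    await connection.response.create()
```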
For chained architectures, the Python Agents SDK offers a VoicePipeline that runs a speech-to-text model, an agentic workflow, and then a text-to-speech model, installed with pip install openai-agents[voice] (a minimal pipeline sketch follows at the end of this section). Developers must decide how to capture audio and handle turn detection, choosing between manual turn detection for push-to-talk scenarios and automatic turn detection using Voice Activity Detection with gpt-4o-transcribe and gpt-4o-mini-transcribe through the Realtime Transcription API or the Audio Transcription API. When adapting text-based agents to voice, prompts should encourage a concise, conversational tone with short sentences and should avoid complex punctuation, emojis, formatting, lists, and enumerations, and responses should be streamed to reduce latency. For audio output, the Speech API and the latest model, gpt-4o-mini-tts, produce high-quality, expressive audio; wav or pcm formats are recommended for the lowest latency, and developers can implement chunking to send complete sentences to the speech model as soon as they are available. Voice style and personality are controlled through the instructions field, illustrated with example prompts for a patient teacher and a fitness instructor, and further customization is available in the text-to-speech guide.
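Here is that VoicePipeline sketch, closely following the pattern in the Python Agents SDK voice quickstart; the silent placeholder buffer stands in for real microphone capture, and the agent's instructions simply apply the guide's advice about concise, conversational prompts.

```python
import asyncio

import numpy as np

from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

agent = Agent(
    name="Assistant",
    instructions=(
        "You are a helpful voice assistant. Keep answers short and conversational. "
        "Do not use lists, emojis, or complex punctuation."
    ),
)


async def main() -> None:
    # Speech-to-text -> agentic workflow -> text-to-speech, wired together by the pipeline.
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

    # Placeholder input: three seconds of silence at 24 kHz. In practice, capture
    # microphone audio (for example with the sounddevice package) into this buffer.
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))

    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            ...  # event.data holds PCM audio chunks to stream to the speaker


asyncio.run(main())
```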
