How to build low-latency voice agents with the OpenAI API

OpenAI outlines two main architectures for building voice agents and explains how to design prompts, handle audio, and integrate specialized models for real-world use cases.

OpenAI describes how to build voice agents that understand audio and respond in natural language using the OpenAI API and Agents SDK. Two primary architectures are available: a speech-to-speech approach using the Realtime API, and a chained approach that converts audio to text, passes the text to a large language model, and then converts the reply back to speech. The speech-to-speech architecture relies on a single multimodal model, gpt-4o-realtime-preview, that processes audio input and output in real time without depending on transcripts. Because it works on the audio itself, the model can perceive emotion, intent, and noise and respond directly in speech, which suits highly interactive, low-latency use cases such as language tutoring, conversational search, and interactive customer service.

The chained architecture runs audio through gpt-4o-transcribe, then a text model such as gpt-4.1, and finally gpt-4o-mini-tts for speech synthesis. This approach is recommended for developers new to voice agents or those converting existing large language model applications into voice experiences. It emphasizes control, transparency, robust function calling, and structured workflows, which suits customer support, sales triage, and scenarios requiring transcripts or scripted responses.

Building a speech-to-speech agent requires establishing a realtime connection via WebRTC or WebSocket, creating a session with the Realtime API, and using a model with realtime audio input and output. WebRTC is generally better for client-side browser agents, while WebSocket is preferred for server-side agents such as phone call handlers. The TypeScript Agents SDK automatically selects the appropriate transport.
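The chained flow described above (transcription, then a text model, then speech synthesis) can be sketched as three pluggable stages. This is a minimal illustration, not the Agents SDK: the ChainedVoicePipeline class and the fake_* functions are hypothetical stand-ins that make no API calls, and in a real agent each stage would wrap a call to the corresponding model.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class ChainedVoicePipeline:
    """Hypothetical three-stage pipeline: speech-to-text, text agent, text-to-speech."""

    transcribe: Callable[[bytes], str]   # stand-in for a model such as gpt-4o-transcribe
    respond: Callable[[str], str]        # stand-in for a text model such as gpt-4.1
    synthesize: Callable[[str], bytes]   # stand-in for a model such as gpt-4o-mini-tts

    def run(self, audio_in: bytes) -> Tuple[str, str, bytes]:
        transcript = self.transcribe(audio_in)
        reply = self.respond(transcript)
        audio_out = self.synthesize(reply)
        # The intermediate text is returned too, so it can be logged or reviewed.
        return transcript, reply, audio_out


# Stub stages (assumptions for illustration) so the sketch runs offline.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = ChainedVoicePipeline(fake_transcribe, fake_respond, fake_synthesize)
transcript, reply, audio_out = pipeline.run(b"where is my order")
```

Because every stage exchanges plain text, the transcript and the reply are available at each step, which is the control and transparency the chained design is chosen for.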

Designing an effective voice agent starts with focusing on a single task, limiting available tools, and providing a clear escape hatch such as a handoff to a human or another specialized agent. Critical information should often be included directly in the prompt rather than requiring a tool call. Prompting is especially important for speech-to-speech agents: the guide provides detailed templates covering identity, task, demeanor, tone, enthusiasm, formality, emotion, filler words, pacing, and specific instructions, and conversation flows can be encoded as JSON-based state machines. The guide also shows how to implement agent handoff using tools like transferAgents, how to use the Realtime API’s session.update event to switch to specialized agents, and how to extend agents with dedicated models by exposing text-based agents as function tools. Its supervisorAgent example forwards cases to another service.
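As a rough sketch of a JSON-based conversation flow, the example below defines a small state machine, checks that every transition points at a defined state, and embeds the result in the agent's instructions. The state names, the field names (id, description, instructions, transitions, next_step, condition), and the helper functions are illustrative assumptions. The session.update payload shows only the event type and an instructions field; the full event shape should be checked against the Realtime API reference.

```python
import json

# Hypothetical conversation flow; field names are illustrative assumptions.
CONVERSATION_STATES = [
    {
        "id": "1_greeting",
        "description": "Greet the caller and explain the process.",
        "instructions": ["Greet the caller warmly.", "Say you need a few details."],
        "transitions": [
            {"next_step": "2_get_name", "condition": "After the greeting is complete."}
        ],
    },
    {
        "id": "2_get_name",
        "description": "Ask for the caller's name.",
        "instructions": ["Ask for the name and repeat it back to confirm."],
        "transitions": [
            {"next_step": "3_handoff", "condition": "Once the name is confirmed."}
        ],
    },
    {
        "id": "3_handoff",
        "description": "Escape hatch: transfer to a human or a specialized agent.",
        "instructions": ["Tell the caller they are being transferred."],
        "transitions": [],
    },
]


def undefined_targets(states):
    """Return transition targets that do not match any state id."""
    known = {state["id"] for state in states}
    return sorted(
        t["next_step"]
        for state in states
        for t in state["transitions"]
        if t["next_step"] not in known
    )


def build_instructions(identity, states):
    """Embed the validated flow in the prompt as a JSON block."""
    missing = undefined_targets(states)
    if missing:
        raise ValueError(f"Transitions point at unknown states: {missing}")
    return (
        f"# Identity\n{identity}\n\n"
        "# Conversation States\n" + json.dumps(states, indent=2)
    )


def build_session_update(instructions):
    """Wrap the prompt in a session.update client event (minimal, assumed shape)."""
    return {"type": "session.update", "session": {"instructions": instructions}}


prompt = build_instructions("You are a calm front-desk assistant.", CONVERSATION_STATES)
event = build_session_update(prompt)
```

Validating transitions before sending the prompt catches a mistyped state id early, rather than leaving the model with a flow it cannot follow.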

For chained architectures, the Python Agents SDK offers a VoicePipeline that runs a speech-to-text model, an agentic workflow, and then a text-to-speech model; it is installed with pip install openai-agents[voice]. Developers must decide how to capture audio and handle turn detection, choosing between manual turn detection for push-to-talk scenarios and automatic turn detection using Voice Activity Detection, with gpt-4o-transcribe or gpt-4o-mini-transcribe accessed through the Realtime Transcription API or the Audio Transcription API.

When adapting text-based agents to voice, prompts should encourage a concise, conversational tone and short sentences, and should tell the model to avoid complex punctuation, emojis, formatting, lists, and enumerations. Responses should be streamed to reduce latency. For audio output, the Speech API and the latest model, gpt-4o-mini-tts, produce high-quality, expressive audio; wav or pcm formats are recommended for the lowest latency, and developers can implement chunking to send complete sentences as soon as they are available. Voice style and personality are controlled through the instructions field, illustrated with example prompts for a patient teacher and a fitness instructor, and further customization is covered in the text-to-speech guide.
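The sentence chunking mentioned above can be sketched as a small generator that buffers streamed text and emits each sentence once it is complete. This is a simplified illustration, not code from the guide: the sentence_chunks function is hypothetical, and the rule that a sentence ends at ".", "!", or "?" followed by whitespace is an assumption that will mis-split abbreviations such as "e.g. this".

```python
import re
from typing import Iterable, Iterator

# Assumed rule: a sentence ends at ., !, or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(deltas: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the streamed text contains them."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()


# Simulated text deltas from a streaming model response.
stream = ["Sure. I can ", "help with that! What is your ", "order number?"]
chunks = list(sentence_chunks(stream))
```

Each yielded chunk could be sent to the speech model immediately, so playback of the first sentence can begin while later sentences are still being generated. A sentence is only emitted once whitespace follows its terminator, which avoids splitting decimals such as "3.5" at the cost of holding the final sentence until the stream ends.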


