How to build low-latency voice agents with the OpenAI API

OpenAI outlines two main architectures for building voice agents and explains how to design prompts, handle audio, and integrate specialized models for real-world use cases.

OpenAI describes how to build voice agents that understand audio and respond in natural language using the OpenAI API and Agents SDK. Two primary architectures are available: a speech-to-speech approach using the Realtime API, and a chained approach that converts audio to text, runs it through a large language model, and converts the text back to speech. The speech-to-speech architecture relies on a single multimodal model, gpt-4o-realtime-preview, which processes audio input and output in real time without depending on transcripts. Because it never round-trips through text, it can perceive emotion, intent, and background noise and respond directly in speech, making it well suited to highly interactive, low-latency use cases such as language tutoring, conversational search, and interactive customer service.
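The speech-to-speech path runs over a single realtime connection. As a minimal sketch, a server-side agent might build its WebSocket connection parameters as below; the endpoint shape and the OpenAI-Beta header follow the documented Realtime API beta interface, and an OPENAI_API_KEY environment variable is assumed:

```python
import os

def realtime_connection_params(model: str = "gpt-4o-realtime-preview") -> tuple[str, dict]:
    """Build the WebSocket URL and headers for a Realtime API session.

    Pass the result to any WebSocket client; the actual audio exchange
    then happens as JSON events over this single connection.
    """
    url = f"wss://api.openai.com/v1/realtime?model={model}"
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        "OpenAI-Beta": "realtime=v1",
    }
    return url, headers
```

Keeping the connection parameters in one place makes it easy to swap models or move the agent between client-side WebRTC and server-side WebSocket transports.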

The chained architecture runs audio through gpt-4o-transcribe, then a text model such as gpt-4.1, and finally gpt-4o-mini-tts for speech synthesis. This path is recommended for developers who are new to voice agents or who are converting existing large-language-model applications into voice experiences, and it emphasizes control, transparency, robust function calling, and structured workflows for customer support, sales triage, and scenarios requiring transcripts or scripted responses. Building a speech-to-speech agent requires establishing a realtime connection via WebRTC or WebSocket, creating a session with the Realtime API, and using a model with realtime audio input and output. WebRTC is generally better for client-side browser agents, while WebSockets are preferred for server-side agents such as phone-call handlers; the TypeScript Agents SDK automatically selects the appropriate transport.
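The three chained stages compose cleanly. In this sketch each stage is injected as a plain callable (in production they would wrap gpt-4o-transcribe, gpt-4.1, and gpt-4o-mini-tts API calls), so the flow itself can be exercised without network access; the function name and structure are illustrative assumptions, not the SDK's API:

```python
from typing import Callable

def chained_voice_agent(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # would wrap gpt-4o-transcribe
    respond: Callable[[str], str],        # would wrap a text model such as gpt-4.1
    synthesize: Callable[[str], bytes],   # would wrap gpt-4o-mini-tts
) -> bytes:
    """Run the chained architecture: speech-to-text, LLM, text-to-speech."""
    transcript = transcribe(audio)  # every turn produces a transcript...
    reply = respond(transcript)     # ...which the text model answers...
    return synthesize(reply)        # ...and the TTS stage voices.
```

Keeping each stage as a separate callable is what gives the chained design its control and transparency: every intermediate transcript can be logged, inspected, or routed through structured workflows.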

Designing an effective voice agent starts with focusing on a single task, limiting the available tools, and providing a clear escape hatch, such as a handoff to a human or another specialized agent. Critical information should often be included directly in the prompt rather than requiring a tool call. Prompting matters especially for speech-to-speech agents: the guide provides detailed templates covering identity, task, demeanor, tone, enthusiasm, formality, emotion, filler words, pacing, and specific instructions, and conversation flows can be encoded as JSON-based state machines. It also shows how to implement agent handoff using tools like transferAgents, how to use the Realtime API's session.update event to switch to specialized agents, and how to extend agents with dedicated models by exposing text-based agents as function tools, including a supervisorAgent example that forwards cases to another service.
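A handoff can be expressed as a session.update event that swaps in the specialized agent's instructions and tools. The event type and session fields below follow the Realtime API; the transferAgents tool definition is a hypothetical minimal version, since the guide names the tool but this parameter schema is an assumption:

```python
def handoff_event(instructions: str, tools: list[dict]) -> dict:
    """Build a Realtime API session.update event that reconfigures the
    live session to act as a different, specialized agent."""
    return {
        "type": "session.update",
        "session": {"instructions": instructions, "tools": tools},
    }

# Hypothetical minimal transferAgents function tool (the name comes from
# the guide; the parameter schema here is illustrative).
TRANSFER_AGENTS_TOOL = {
    "type": "function",
    "name": "transferAgents",
    "description": "Hand the conversation off to another specialized agent.",
    "parameters": {
        "type": "object",
        "properties": {
            "destination_agent": {"type": "string"},
            "rationale": {"type": "string"},
        },
        "required": ["destination_agent"],
    },
}
```

When the model calls transferAgents, the application would send the corresponding handoff_event over the open realtime connection, so the same session continues under the new agent's instructions.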

For chained architectures, the Python Agents SDK offers a VoicePipeline that runs a speech-to-text model, an agentic workflow, and then a text-to-speech model, installed with pip install openai-agents[voice]. Developers must decide how to capture audio and handle turn detection: manual turn detection for push-to-talk scenarios, or automatic turn detection using voice activity detection with gpt-4o-transcribe and gpt-4o-mini-transcribe through the Realtime Transcription API or the Audio Transcription API.

When adapting text-based agents to voice, prompts should encourage a concise conversational tone and short sentences, avoid complex punctuation, emojis, formatting, lists, and enumerations, and responses should be streamed to reduce latency. For audio output, the Speech API and the latest model, gpt-4o-mini-tts, produce high-quality expressive audio, with wav or pcm formats recommended for lowest latency; developers can implement chunking to send complete sentences to speech synthesis as soon as they are available. Voice style and personality are controlled through the instructions field, illustrated with example prompts for a patient teacher and a fitness instructor, with further customization covered in the text-to-speech guide.
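The sentence-level chunking mentioned above can be sketched as a small generator that buffers streamed LLM tokens and emits each sentence as soon as it is complete, so text-to-speech can begin before the full response has finished; the sentence-boundary regex is a deliberate simplification:

```python
import re
from typing import Iterable, Iterator

_SENTENCE_END = re.compile(r"[.!?]\s+")

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a streamed response so each can be
    sent to speech synthesis immediately, reducing time to first audio."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit every finished sentence currently in the buffer.
        while (m := _SENTENCE_END.search(buffer)):
            yield buffer[: m.end()].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence
```

Each yielded chunk would be passed to the Speech API (gpt-4o-mini-tts) as an independent synthesis request, with the resulting wav or pcm audio played back in order.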

Impact Score: 55

Anu Bradford on tech sovereignty and regulatory fragmentation

Anu Bradford argues that Europe is wavering in its role as the world’s digital rule-setter just as governments everywhere move toward more state control over technology. Global companies are being pushed to treat geopolitical risk, data sovereignty, and Artificial Intelligence governance as core strategic issues.

Mistral launches text-to-speech model

Mistral has expanded its Voxtral family with a text-to-speech system aimed at enterprise voice applications. The company is positioning the open-weights model as a flexible alternative for organizations that want more control over deployment, cost and customization.

UK Parliament opens workforce inquiry on Artificial Intelligence

A UK Parliament committee is examining how Artificial Intelligence is changing business and work, with a focus on both economic opportunity and labour disruption. The inquiry is seeking evidence on government priorities as adoption expands across the economy.

Windows 11 tightens kernel trust for older drivers

Microsoft is changing Windows 11 kernel policy so new drivers must be signed through the Windows Hardware Compatibility Program. Older trusted drivers will still be allowed in some cases to preserve compatibility during the transition.
