How to build low-latency voice agents with the OpenAI API

OpenAI outlines two main architectures for building voice agents and explains how to design prompts, handle audio, and integrate specialized models for real-world use cases.

OpenAI describes how to build voice agents that understand audio and respond in natural language using the OpenAI API and the Agents SDK. Two primary architectures are available: a speech-to-speech approach using the Realtime API, and a chained approach that converts audio to text, runs it through a large language model, and converts the text back to speech. The speech-to-speech architecture relies on a single multimodal model, gpt-4o-realtime-preview, which processes audio input and output in real time without depending on transcripts. Because it works on the audio directly, it can perceive emotion, intent, and background noise and respond in speech, which suits highly interactive, low-latency use cases such as language tutoring, conversational search, and interactive customer service.
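To make the speech-to-speech path concrete, here is a minimal sketch of a realtime session using the openai Python package's beta realtime client. The connection call and event names follow that client's interface; mic_chunks and play_pcm16 are hypothetical stand-ins for audio capture and playback, not part of any SDK.

```python
import asyncio
import base64

from openai import AsyncOpenAI  # pip install openai


async def send_audio(conn) -> None:
    # mic_chunks() is a placeholder for your audio-capture loop; it should
    # yield raw PCM16 chunks at the sample rate the session expects.
    async for chunk in mic_chunks():
        await conn.input_audio_buffer.append(
            audio=base64.b64encode(chunk).decode("ascii")
        )


async def receive_audio(conn) -> None:
    # Audio arrives as base64-encoded deltas; decode and play as they stream.
    async for event in conn:
        if event.type == "response.audio.delta":
            play_pcm16(base64.b64decode(event.delta))  # placeholder playback


async def main() -> None:
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    async with client.beta.realtime.connect(
        model="gpt-4o-realtime-preview"
    ) as conn:
        # One session handles audio in and audio out; server-side voice
        # activity detection takes care of turn taking.
        await conn.session.update(
            session={
                "modalities": ["audio", "text"],
                "instructions": "You are a friendly, concise language tutor.",
                "turn_detection": {"type": "server_vad"},
            }
        )
        await asyncio.gather(send_audio(conn), receive_audio(conn))


asyncio.run(main())
```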

The chained architecture runs audio through gpt-4o-transcribe, then a text model such as gpt-4.1, and finally gpt-4o-mini-tts for speech synthesis. It is the recommended starting point for developers new to voice agents or converting existing large language model applications into voice experiences, because it emphasizes control, transparency, robust function calling, and structured workflows for customer support, sales triage, and scenarios that require transcripts or scripted responses. Building a speech-to-speech agent, by contrast, requires establishing a realtime connection via WebRTC or WebSocket, creating a session with the Realtime API, and using a model that supports realtime audio input and output. WebRTC is generally the better choice for client-side browser agents, while WebSockets are preferred for server-side agents such as phone call handlers; the TypeScript Agents SDK selects the appropriate transport automatically.
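Returning to the chained stack, each conversational turn maps onto three API calls. The sketch below uses the openai Python package with the models named above; turn.wav is an assumed recording of one user turn.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()

# 1. Speech to text: transcribe the caller's turn.
with open("turn.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
    )

# 2. Text model: generate the agent's reply from the transcript.
reply = client.responses.create(
    model="gpt-4.1",
    instructions="You are a concise, friendly customer-support voice agent.",
    input=transcript.text,
)

# 3. Text to speech: synthesize the reply, streaming for lower latency.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=reply.output_text,
    response_format="wav",  # wav/pcm are the low-latency formats
) as speech:
    speech.stream_to_file("reply.wav")
```

Each stage is an independent, inspectable step, which is where the chained design gets its transparency: the transcript and the text reply are both available for logging or guardrails before any audio is produced.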

Designing an effective voice agent starts with focusing on a single task, limiting the available tools, and providing a clear escape hatch such as a handoff to a human or another specialized agent. Critical information should often be included directly in the prompt rather than fetched through a tool call. Prompting is especially important for speech-to-speech agents: detailed templates cover identity, task, demeanor, tone, enthusiasm, formality, emotion, filler words, pacing, and specific instructions, and conversation flows can be encoded as JSON-based state machines. The guide shows how to implement agent handoff using tools like transferAgents, how to use the Realtime API's session.update event to switch to specialized agents, and how to extend agents with dedicated models by exposing text-based agents as function tools, including a supervisorAgent example that forwards cases to another service.
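A hedged sketch of the session.update handoff, assuming conn is the live realtime connection from the earlier sketch and that the model has just invoked a transferAgents-style tool; SPECIALIST_PROMPTS is a hypothetical registry, not part of the API.

```python
# Hypothetical registry mapping handoff destinations to specialist prompts.
SPECIALIST_PROMPTS = {
    "returns": "You are a returns specialist. Confirm the order ID first.",
    "billing": "You are a billing specialist. Never read card numbers aloud.",
}


async def hand_off(conn, destination: str) -> None:
    """Repoint the live session at a specialized agent via session.update."""
    await conn.session.update(
        session={
            "instructions": SPECIALIST_PROMPTS[destination],
            # Narrow the tool surface to what the specialist actually needs.
            "tools": [],
        }
    )
```

The same mechanism extends to the supervisorAgent pattern: a text-based agent is exposed to the realtime model as a function tool, and the tool handler forwards the case to that agent's service and returns its answer.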

For chained architectures, the Python Agents SDK offers a VoicePipeline that chains a speech-to-text model, an agentic workflow, and a text-to-speech model; it is installed with pip install openai-agents[voice]. Developers must decide how to capture audio and handle turn detection, choosing between manual turn detection for push-to-talk scenarios and automatic turn detection using voice activity detection, with gpt-4o-transcribe and gpt-4o-mini-transcribe available through the Realtime Transcription API or the Audio Transcription API.

When adapting text-based agents to voice, prompts should encourage a concise, conversational tone with short sentences and should steer the model away from complex punctuation, emojis, formatting, lists, and enumerations; responses should be streamed to reduce latency. For audio output, the Speech API and the latest model, gpt-4o-mini-tts, produce high-quality, expressive audio, with wav or pcm formats recommended for the lowest latency, and developers can implement chunking to send complete sentences to the synthesizer as soon as they are available. Voice style and personality are controlled through the instructions field, illustrated with example prompts for a patient teacher and a fitness instructor; further customization is covered in the text-to-speech guide.
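A minimal sketch of that pipeline, following the shape of the Agents SDK's voice quickstart; the silent numpy buffer is a stand-in for real captured microphone audio.

```python
import asyncio

import numpy as np

from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

agent = Agent(
    name="Assistant",
    instructions=(
        "You are a helpful voice assistant. Keep answers short and "
        "conversational, and avoid lists, emojis, and complex punctuation."
    ),
)

# VoicePipeline wires speech-to-text, the agent workflow, and text-to-speech.
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))


async def main() -> None:
    # Three seconds of silence at 24 kHz stands in for captured microphone
    # audio; a real app would fill this buffer from the user's device.
    buffer = np.zeros(24_000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))

    # Audio comes back in chunks, so playback can begin before the full
    # response has been synthesized.
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            pass  # hand event.data to your audio playback device


asyncio.run(main())
```

Streaming the result is what delivers the latency win the guide describes: the first synthesized sentence can be playing while later sentences are still being generated.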

Impact Score: 55

Intel shuts down software-defined silicon paywall for server features

Intel has quietly ended its software-defined silicon On Demand program for Xeon servers after customers rejected the idea of paying extra to unlock built-in hardware features. The move signals a pullback from hardware paywalls that had raised concerns about feature gating beyond traditional software subscriptions.

Discord rolls out global age verification and teen default settings

Discord is introducing global teen-by-default settings in early March 2026, requiring age verification via government ID or facial scan to access age-gated content. The rollout expands an existing system used in the UK and Australia and is already drawing privacy concerns.
