Azure OpenAI’s GPT Realtime API supports interactive, low-latency ‘speech in, speech out’ conversations and is part of the GPT-4o model family. You can stream audio to the model and receive audio responses in real time over WebRTC or WebSocket. The documentation recommends WebRTC for client-side applications such as web and mobile apps because it is designed for low-latency audio streaming, and WebSocket for server-to-server scenarios where ultra-low latency is not a requirement.
The article lists the supported realtime models and their recommended versions: gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview (both version 2024-12-17), gpt-realtime (version 2025-08-28), and gpt-realtime-mini (version 2025-10-06). It notes that Realtime API support was first added in API version 2024-10-01-preview (now retired) and recommends the generally available API version 2025-08-28 where possible. To deploy a model, follow the Azure AI Foundry portal workflow: create or select a project, open Models + endpoints under My assets, choose Deploy model > Deploy base model, select gpt-realtime, and complete the deployment wizard.
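For the server-to-server WebSocket path, a minimal connection sketch in Python might look like the following. This is an illustration rather than the article’s own sample: the /openai/realtime path with api-version and deployment query parameters follows the pattern used in the preview documentation, the Authorization bearer header for keyless auth is an assumption, and the resource and deployment names are placeholders.

```python
# Minimal server-to-server connection sketch (not the article's sample).
# Assumptions: the /openai/realtime path with api-version and deployment query
# parameters, a bearer token in the Authorization header for keyless auth, and
# placeholder resource/deployment names.
import asyncio
import json

import websockets                                   # pip install websockets
from azure.identity import DefaultAzureCredential   # pip install azure-identity

RESOURCE = "your-resource-name"   # placeholder Azure OpenAI resource name
DEPLOYMENT = "gpt-realtime"       # your deployment name
API_VERSION = "2025-08-28"        # GA API version recommended by the article

URI = (
    f"wss://{RESOURCE}.openai.azure.com/openai/realtime"
    f"?api-version={API_VERSION}&deployment={DEPLOYMENT}"
)


async def main() -> None:
    # Keyless (Microsoft Entra ID) authentication: exchange the signed-in
    # identity for a bearer token scoped to Cognitive Services.
    credential = DefaultAzureCredential()
    token = credential.get_token("https://cognitiveservices.azure.com/.default")
    headers = {"Authorization": f"Bearer {token.token}"}

    # Note: older releases of the websockets package use extra_headers= instead.
    async with websockets.connect(URI, additional_headers=headers) as ws:
        # The server announces the new session with a session.created event.
        first_event = json.loads(await ws.recv())
        print("received:", first_event.get("type"))


if __name__ == "__main__":
    asyncio.run(main())
```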
The guide covers prerequisites and authentication options. You need an Azure subscription, a deployed gpt-realtime or gpt-realtime-mini model, and a Node.js or Python environment, depending on which sample code you follow. Microsoft Entra ID keyless authentication is recommended; it requires the Azure CLI and assignment of the Cognitive Services User role. The article also describes API-key authentication and cautions that keys must be stored securely. The example session configuration shows audio input and output settings (transcription with whisper-1, audio/pcm at 24000 Hz, server_vad turn detection, and the output voice ‘alloy’). Client samples for JavaScript, Python, and TypeScript demonstrate event handling for session.created, session.updated, response.output_audio.delta, response.output_audio_transcript.delta, and response.done, and include example console output of transcript deltas and audio chunk sizes to illustrate real-time interaction patterns.
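Once connected, the session configuration described above could be expressed as a session.update event along these lines. The field names and nesting here are a guess shaped by the settings the article lists (whisper-1 transcription, audio/pcm at 24 kHz, server_vad turn detection, voice ‘alloy’); treat the session configuration shown in the article’s own samples as authoritative.

```python
# Hypothetical session.update payload assembled from the settings the article
# lists; the exact field names and nesting are assumptions, so defer to the
# configuration shown in the article's own samples.
import json


async def configure_session(ws) -> None:
    """Send a session.update event with the audio settings described above."""
    session_update = {
        "type": "session.update",
        "session": {
            "audio": {
                "input": {
                    "format": {"type": "audio/pcm", "rate": 24000},
                    "transcription": {"model": "whisper-1"},
                    "turn_detection": {"type": "server_vad"},
                },
                "output": {
                    "format": {"type": "audio/pcm", "rate": 24000},
                    "voice": "alloy",
                },
            },
        },
    }
    await ws.send(json.dumps(session_update))
```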
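A receive loop over the event types the samples handle might be sketched as follows. The assumptions here are that incremental data arrives in a "delta" field and that output audio is base64-encoded; the printed output mirrors the transcript deltas and audio chunk sizes the article describes.

```python
# Receive-loop sketch for the event types the article's samples handle.
# Assumptions: deltas arrive in a "delta" field and output audio is
# base64-encoded PCM.
import base64
import json


async def handle_events(ws) -> None:
    async for message in ws:
        event = json.loads(message)
        etype = event.get("type")

        if etype in ("session.created", "session.updated"):
            print(etype)
        elif etype == "response.output_audio_transcript.delta":
            # Incremental transcript of the audio the model is generating.
            print(event.get("delta", ""), end="", flush=True)
        elif etype == "response.output_audio.delta":
            # Decode the audio chunk and report its size, echoing the console
            # output described in the article.
            chunk = base64.b64decode(event.get("delta", ""))
            print(f"\n[audio chunk: {len(chunk)} bytes]")
        elif etype == "response.done":
            print("\nresponse complete")
            break
```

Wired together, the main() function from the connection sketch would call configure_session(ws) and then handle_events(ws) inside the connection block.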
