GPT realtime API for speech and audio

Azure OpenAI's GPT Realtime API delivers low-latency, speech-in/speech-out conversational capabilities and can be used via WebRTC for client apps or WebSocket for server-to-server scenarios. This article covers supported models, authentication options, deployment steps in the Azure AI Foundry portal, and example client code in JavaScript, Python, and TypeScript.

Azure OpenAI's GPT Realtime API supports interactive, low-latency speech-in/speech-out conversations and is part of the GPT-4o model family. You can stream audio to the model and receive audio responses in real time over WebRTC or WebSocket. The documentation recommends WebRTC for client-side applications such as web and mobile apps because it is designed for low-latency audio streaming, and suggests WebSocket for server-to-server scenarios where the lowest possible latency is not required.
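
To make the transport choice concrete, here is a minimal server-to-server sketch in Python. It assumes the openai package's Azure client and its realtime WebSocket helper (client.beta.realtime.connect), an endpoint and API key supplied through placeholder environment variables, and a deployment named gpt-realtime; none of these names come from the article's own samples, and the WebRTC path recommended for browsers and mobile apps is not shown.

```python
# Sketch: open a Realtime WebSocket session from a server process.
# Assumptions: openai Python SDK with the beta realtime client; placeholder
# environment variables; a deployment named "gpt-realtime".
import asyncio
import os

from openai import AsyncAzureOpenAI


async def main() -> None:
    client = AsyncAzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
        api_key=os.environ["AZURE_OPENAI_API_KEY"],          # keyless auth is covered later in the article
        api_version="2025-08-28",  # version recommended in this article; adjust to what your resource supports
    )
    # The SDK opens the Realtime WebSocket connection; "model" is your deployment name.
    async with client.beta.realtime.connect(model="gpt-realtime") as connection:
        async for event in connection:
            print(event.type)  # the first event from the service should be session.created
            break


asyncio.run(main())
```

A browser or mobile client would instead negotiate a WebRTC connection, which keeps the audio path closer to the user and avoids routing every audio frame through your server.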

The article lists the supported realtime models and their versions: gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview (both version 2024-12-17), gpt-realtime (version 2025-08-28), and gpt-realtime-mini (version 2025-10-06). It notes that Realtime API support was first added in API version 2024-10-01-preview (now retired) and recommends using the generally available API version 2025-08-28 when possible. To deploy a model, follow the Azure AI Foundry portal workflow: create or select a project, open Models + endpoints under My assets, choose Deploy model > Deploy base model, select gpt-realtime, and complete the deployment wizard.
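
If you want to confirm the deployment outside the portal, a management-plane sketch like the one below can list what is deployed on the resource. This is not part of the article's workflow; it assumes the azure-identity and azure-mgmt-cognitiveservices packages and uses placeholder subscription, resource group, and resource names.

```python
# Sketch: list deployments on an Azure OpenAI resource to confirm gpt-realtime exists.
# Assumptions: azure-identity and azure-mgmt-cognitiveservices packages are installed;
# the subscription, resource group, and resource names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

client = CognitiveServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

for deployment in client.deployments.list(
    resource_group_name="<resource-group>",
    account_name="<azure-openai-resource>",
):
    model = deployment.properties.model
    print(f"{deployment.name}: {model.name} (version {model.version})")
```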

The guide covers prerequisites and authentication options. You need an Azure subscription, a deployed gpt-realtime or gpt-realtime-mini model, and a Node.js, Python, or TypeScript environment, depending on which sample you follow. Microsoft Entra ID keyless authentication is recommended; it requires the Azure CLI and assignment of the Cognitive Services User role. The article also describes API-key authentication and cautions that keys must be stored securely. An example session configuration shows the audio input and output settings (transcription with whisper-1, audio/pcm at 24000 Hz, server_vad turn detection, and the output voice 'alloy'). Client samples for JavaScript, Python, and TypeScript demonstrate event handling for session.created, session.updated, response.output_audio.delta, response.output_audio_transcript.delta, and response.done, and include example console output of transcript deltas and audio chunk sizes to illustrate real-time interaction patterns.
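
Pulling those pieces together, the following is a hedged Python sketch of keyless authentication plus the session settings and event types described above. The endpoint, deployment name, and exact field and event names are assumptions (shapes differ between preview and GA API versions), so treat it as an illustration rather than the article's own sample code.

```python
# Sketch: keyless (Microsoft Entra ID) auth, the session settings the article lists,
# and a loop over the event types it names. Assumptions: openai and azure-identity
# packages; placeholder endpoint and deployment; field and event names follow the
# GA shapes cited in the article and may differ on older preview API versions.
import asyncio
import os

from azure.identity.aio import DefaultAzureCredential, get_bearer_token_provider
from openai import AsyncAzureOpenAI


async def main() -> None:
    credential = DefaultAzureCredential()
    token_provider = get_bearer_token_provider(
        credential, "https://cognitiveservices.azure.com/.default"
    )
    client = AsyncAzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        azure_ad_token_provider=token_provider,  # requires the Cognitive Services User role
        api_version="2025-08-28",                # version recommended in this article
    )

    async with client.beta.realtime.connect(model="gpt-realtime") as connection:
        # Session settings mirroring the article: whisper-1 transcription, 24 kHz PCM
        # in and out, server-side voice activity detection, and the "alloy" voice.
        await connection.session.update(
            session={
                "audio": {
                    "input": {
                        "transcription": {"model": "whisper-1"},
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "turn_detection": {"type": "server_vad"},
                    },
                    "output": {
                        "format": {"type": "audio/pcm", "rate": 24000},
                        "voice": "alloy",
                    },
                },
            }
        )

        async for event in connection:
            if event.type in ("session.created", "session.updated"):
                print(event.type)
            elif event.type == "response.output_audio_transcript.delta":
                # Transcript text as it streams in.
                print(getattr(event, "delta", ""), end="", flush=True)
            elif event.type == "response.output_audio.delta":
                pass  # base64-encoded audio chunk; buffer or play it
            elif event.type == "response.done":
                break

    await credential.close()


asyncio.run(main())
```

With server_vad turn detection, the service decides when the speaker has finished, so the client mainly streams input audio and reacts to the response events above, which matches the console output of transcript deltas and audio chunk sizes shown in the article's samples.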
