GPT realtime API for speech and audio

Azure OpenAI's GPT Realtime API delivers low-latency, speech-in/speech-out conversational capabilities and can be used via WebRTC for client apps or WebSocket for server-to-server scenarios. This article covers supported models, authentication options, deployment steps in the Azure AI Foundry portal, and example client code in JavaScript, Python, and TypeScript.

Azure OpenAI's GPT Realtime API supports interactive, low-latency "speech in, speech out" conversations and is part of the GPT-4o model family. You can stream audio to the model and receive audio responses in real time via WebRTC or WebSocket. The documentation recommends WebRTC for client-side applications such as web and mobile apps because it is designed for low-latency audio streaming, and WebSocket for server-to-server scenarios where ultra-low latency is not a requirement.
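As a rough illustration of the server-to-server path, the sketch below opens a WebSocket to an Azure OpenAI realtime deployment from Python. The endpoint path, query parameters, resource and deployment names, and the api-key header are assumptions made for the example rather than details taken from the article; check the service reference for the exact values your resource expects.

```python
# Minimal sketch: connect to an Azure OpenAI realtime deployment over WebSocket.
# Assumptions (not from the article): endpoint path /openai/realtime, query
# parameters api-version and deployment, and API-key auth via an "api-key" header.
import asyncio
import json
import os

import websockets  # pip install websockets

RESOURCE = os.environ["AZURE_OPENAI_RESOURCE"]   # placeholder resource name
DEPLOYMENT = "gpt-realtime"                      # your deployment name
API_VERSION = "2025-08-28"                       # GA version recommended by the article
API_KEY = os.environ["AZURE_OPENAI_API_KEY"]

URL = (
    f"wss://{RESOURCE}.openai.azure.com/openai/realtime"
    f"?api-version={API_VERSION}&deployment={DEPLOYMENT}"
)

async def main() -> None:
    # Newer releases of the websockets package take `additional_headers`;
    # older releases use `extra_headers` instead.
    async with websockets.connect(URL, additional_headers={"api-key": API_KEY}) as ws:
        # The server announces the session as soon as the connection is accepted.
        first_event = json.loads(await ws.recv())
        print("server event:", first_event.get("type"))  # expect "session.created"

asyncio.run(main())
```

WebRTC, by contrast, is negotiated from the browser or mobile client and is the recommended transport when audio is captured and played back on the user's device.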

The article lists the supported realtime models and recommended model versions: gpt-4o-realtime-preview and gpt-4o-mini-realtime-preview (both version 2024-12-17), gpt-realtime (version 2025-08-28), and gpt-realtime-mini (version 2025-10-06). It notes that Realtime API support was first added in API version 2024-10-01-preview (now retired) and recommends the generally available API version 2025-08-28 where possible. To deploy a model, follow the Azure AI Foundry portal workflow: create or select a project, open Models + endpoints under My assets, choose Deploy model > Deploy base model, select gpt-realtime, and complete the deployment wizard.
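For teams that script their infrastructure, a hedged Azure CLI equivalent of the portal wizard might look like the following. The resource group, account name, SKU, and capacity are placeholders; only the model name and version come from the list above.

```bash
# Rough Azure CLI equivalent of the portal deployment wizard (values are placeholders).
az cognitiveservices account deployment create \
  --resource-group my-resource-group \
  --name my-aoai-resource \
  --deployment-name gpt-realtime \
  --model-name gpt-realtime \
  --model-version "2025-08-28" \
  --model-format OpenAI \
  --sku-name GlobalStandard \
  --sku-capacity 1
```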

The guide covers prerequisites and authentication options. You need an Azure subscription, a deployed gpt-realtime or gpt-realtime-mini model, and a Node.js, Python, or TypeScript environment, depending on which sample you follow. Microsoft Entra ID keyless authentication is recommended; it requires the Azure CLI and assignment of the Cognitive Services User role. The article also describes API-key authentication and cautions that keys must be stored securely.

An example session configuration shows the audio input and output settings: transcription with whisper-1, audio/pcm at 24000 Hz, server_vad turn detection, and the output voice "alloy". Client samples for JavaScript, Python, and TypeScript demonstrate event handling for session.created, session.updated, response.output_audio.delta, response.output_audio_transcript.delta, and response.done, and include example console output of transcript deltas and audio chunk sizes to illustrate real-time interaction patterns.
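Continuing the raw-WebSocket sketch above, here is one way the described session configuration and event handling could be wired together in Python. The nesting of fields inside the session.update payload is an approximation based on the settings the article names (whisper-1 transcription, audio/pcm at 24000 Hz, server_vad, the alloy voice); only the event type names come from the article, and the keyless path assumes DefaultAzureCredential with the Cognitive Services User role assigned.

```python
# Sketch: keyless (Microsoft Entra ID) auth plus the session configuration and
# event handling described above. Field names in the session.update payload are
# approximations of the article's description, not a verbatim schema.
import asyncio
import json
import os

import websockets                                   # pip install websockets
from azure.identity import DefaultAzureCredential   # pip install azure-identity

RESOURCE = os.environ["AZURE_OPENAI_RESOURCE"]      # placeholder resource name
DEPLOYMENT = "gpt-realtime"
API_VERSION = "2025-08-28"
URL = (
    f"wss://{RESOURCE}.openai.azure.com/openai/realtime"
    f"?api-version={API_VERSION}&deployment={DEPLOYMENT}"
)

# Session settings from the article: whisper-1 transcription, audio/pcm at
# 24000 Hz in and out, server_vad turn detection, and the "alloy" output voice.
SESSION_UPDATE = {
    "type": "session.update",
    "session": {
        "audio": {
            "input": {
                "format": {"type": "audio/pcm", "rate": 24000},
                "transcription": {"model": "whisper-1"},
                "turn_detection": {"type": "server_vad"},
            },
            "output": {
                "format": {"type": "audio/pcm", "rate": 24000},
                "voice": "alloy",
            },
        },
    },
}

async def main() -> None:
    # Keyless auth: exchange your Entra identity (Cognitive Services User role)
    # for a bearer token instead of sending an api-key header.
    token = DefaultAzureCredential().get_token(
        "https://cognitiveservices.azure.com/.default"
    )
    headers = {"Authorization": f"Bearer {token.token}"}

    async with websockets.connect(URL, additional_headers=headers) as ws:
        await ws.send(json.dumps(SESSION_UPDATE))

        async for raw in ws:
            event = json.loads(raw)
            etype = event.get("type")

            if etype in ("session.created", "session.updated"):
                print(etype)
            elif etype == "response.output_audio_transcript.delta":
                # Incremental transcript text of the spoken response.
                print(event.get("delta", ""), end="", flush=True)
            elif etype == "response.output_audio.delta":
                # Base64-encoded audio chunk; report its size as the samples do.
                print(f"\n[audio chunk: {len(event.get('delta', ''))} base64 chars]")
            elif etype == "response.done":
                print("\nresponse complete")
                break

asyncio.run(main())
```

The official JavaScript, Python, and TypeScript samples wrap this same event stream in SDK helpers; the raw event loop is shown here only to make the interaction pattern explicit.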

Progressive autonomy with model evolution

Models often internalize capabilities previously enforced by agent scaffolding; the article recommends auditing and removing unnecessary prompts and orchestration as newer models arrive.

Korea joins the AI industrial revolution with NVIDIA partnership

At the APEC Summit in Gyeongju, NVIDIA CEO Jensen Huang announced a national-scale sovereign AI initiative that will deploy more than a quarter-million NVIDIA GPUs across South Korea. The plan combines government-led cloud deployments, massive private AI factories, and coordinated research and training programs.

What the EU Artificial Intelligence Act means for U.S. employers

The EU Artificial Intelligence Act, effective August 1, 2024, reaches U.S. employers that use AI systems affecting EU candidates or workers, and it treats many HR uses as high risk. Employers should inventory their tools, prepare worker notices and human oversight, and strengthen vendor contracts ahead of phased obligations running through 2026 and 2027.
