OpenAI Realtime API Goes GA: What It Means for Voice App Developers
If you've tried to build a voice AI app before, you know the pain: stitch together a speech-to-text model, pipe the transcript into a language model, run the output through a text-to-speech engine, and somehow keep the whole chain under a second of latency. The OpenAI Realtime API replaces that three-step chain with a single persistent connection that handles audio in and audio out, end to end.
The API moved to general availability, which means it's no longer behind a waitlist and is considered production-ready. Here's what that actually changes for developers building voice applications today.
What the GA Release Actually Delivers
The core promise of the Realtime API is speech-to-speech interaction without intermediate text. Audio goes in over a persistent connection, the model processes it, and audio comes back, all in one hop. That eliminates the serialization overhead of converting audio to text before sending it to a language model.
Key capabilities that shipped as generally available:
- Bidirectional audio streaming over WebSockets and WebRTC
- Server-side voice activity detection (VAD) so you don't have to build your own turn detection logic
- Interruption handling: the model can stop mid-sentence when the user starts talking again
- Function calling in voice sessions, so your assistant can trigger backend actions during a conversation
- Multiple built-in voices covering different tonal profiles
The underlying model is a variant of GPT-4o trained specifically for audio understanding and generation. It retains context across a session, so it understands references to earlier parts of the conversation without you re-sending the full transcript on every turn.
How the Realtime API Works Under the Hood
The Realtime API uses an event-driven architecture over a persistent connection. You open a session, send audio chunks as events, and receive response events back. There's no request-response cycle in the traditional HTTP sense; it's closer to a duplex stream.
The session lifecycle looks like this:
- Create a session and receive a short-lived ephemeral token (for browser clients) or use your API key directly (for server-side clients).
- Open a WebSocket or WebRTC connection using that token.
- Send input_audio_buffer.append events with base64-encoded PCM audio chunks.
- The server detects speech activity and commits the buffer automatically (or you commit it manually).
- The model generates a response and streams back response.audio.delta events you can play in real time.
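The append step in the lifecycle above can be sketched as follows. This is a minimal illustration, assuming 24kHz mono PCM16 input; the helper name build_append_events and the 100 ms chunk size are choices made for this example, not requirements of the API.

```python
import base64
import json

SAMPLE_RATE = 24_000      # Hz; PCM16 mono, as described in this article
BYTES_PER_SAMPLE = 2      # 16-bit samples


def build_append_events(pcm: bytes, chunk_ms: int = 100) -> list[str]:
    """Split raw PCM16 audio into JSON input_audio_buffer.append events."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    events = []
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        events.append(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    return events


# One second of silence splits into ten 100 ms events.
events = build_append_events(b"\x00\x00" * SAMPLE_RATE)
```

Each string in events is ready to pass to ws.send() on an open connection.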
Session configuration (the system prompt, voice, VAD settings, and which tools are available) is sent in the initial session.update event. You can update it mid-session if needed, though changing the voice mid-session doesn't affect audio already being generated.
WebSockets vs WebRTC: Choosing the Right Transport
The API supports two transport mechanisms, and choosing the right one depends on where your audio is coming from.
WebSockets
WebSockets are the right choice for server-to-server integrations. If your application is capturing audio on a server (from a PSTN call, for example, or from a media processing pipeline), WebSockets give you direct control over every byte. The connection goes from your server to OpenAI's servers, and your server is responsible for proxying audio to and from the end user.
Use WebSockets when you need to intercept, log, or transform audio events, for example for compliance recording, analytics, or routing logic.
WebRTC
WebRTC is designed for browser-to-OpenAI connections. The browser negotiates a peer connection directly with OpenAI's infrastructure, which means audio takes the shortest possible path and benefits from WebRTC's built-in jitter buffering, packet loss concealment, and adaptive bitrate handling.
For a consumer-facing voice app where users are on mobile browsers or web apps, WebRTC gives you noticeably better audio quality and lower perceived latency. The tradeoff is less control: audio isn't routed through your server, so you can't intercept it without additional setup.
If you're building a cross-platform mobile app, it's worth reading how Flutter handles real-time calling use cases before committing to a transport strategy.
Getting Started: Your First Realtime Session
The fastest way to verify your setup is a minimal WebSocket client on the server side. You'll need Node.js and the ws package, or Python with websockets.
Here's a minimal Python example that opens a session and sends a text message to get a spoken response back:
import asyncio
import base64
import json

import websockets

API_KEY = "sk-..."  # your OpenAI key
URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"


async def run_session():
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: websockets 14+ names this keyword additional_headers
    # instead of extra_headers.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "voice": "alloy",
                "instructions": "You are a helpful assistant. Keep responses brief.",
                "turn_detection": {"type": "server_vad"}
            }
        }))

        # Send a text message to trigger a spoken response
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Say hello in one sentence."}]
            }
        }))
        await ws.send(json.dumps({"type": "response.create"}))

        # Collect base64-decoded audio from each delta event until the response ends
        audio_chunks = []
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.audio.delta":
                audio_chunks.append(base64.b64decode(event["delta"]))
            elif event["type"] == "response.done":
                print(f"Done. Received {len(audio_chunks)} audio chunks.")
                break


asyncio.run(run_session())
The audio chunks are raw 24kHz 16-bit PCM by default. You can write them to a WAV file or stream them directly to a speaker using pyaudio. For production, you'll want to handle reconnection logic and error events; the snippet above is just enough to prove the connection works.
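Writing the collected chunks to a WAV file might look like the sketch below. It uses only the standard-library wave module and assumes the chunks are 24kHz 16-bit mono PCM as described above; write_wav is a hypothetical helper name.

```python
import wave


def write_wav(path: str, chunks: list[bytes], sample_rate: int = 24_000) -> None:
    """Write raw PCM16 mono chunks to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(chunks))
```

In the snippet above, calling write_wav("hello.wav", audio_chunks) after the loop would give you a file you can open in any audio player.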
If you're already working with async patterns in Python, the practical guide to Python async/await on this blog covers the concurrency model you'll rely on here.
Handling Audio Input and Output
The Realtime API expects audio in PCM16 format at 24kHz, mono channel. If your source audio is in a different format (most microphone APIs on mobile and web will give you something different), you need to resample and re-encode before sending.
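To make the resampling step concrete, here is a rough pure-Python sketch that converts mono PCM16 audio to 24kHz with linear interpolation. It is an illustration only: resample_pcm16 is a hypothetical helper, it applies no anti-aliasing filter, and a production pipeline would more likely use a dedicated resampling library.

```python
import sys
from array import array


def resample_pcm16(pcm: bytes, src_rate: int, dst_rate: int = 24_000) -> bytes:
    """Resample little-endian mono PCM16 audio using linear interpolation."""
    if src_rate == dst_rate or len(pcm) < 2:
        return pcm
    src = array("h")
    src.frombytes(pcm)                # requires an even number of bytes
    if sys.byteorder == "big":
        src.byteswap()                # PCM16 data is little-endian
    dst_len = len(src) * dst_rate // src_rate
    dst = array("h")
    for i in range(dst_len):
        pos = i * src_rate / dst_rate             # position in source samples
        left = int(pos)
        right = min(left + 1, len(src) - 1)
        frac = pos - left
        dst.append(round(src[left] * (1 - frac) + src[right] * frac))
    if sys.byteorder == "big":
        dst.byteswap()
    return dst.tobytes()
```

A typical call would be resample_pcm16(mic_bytes, 48_000) for a device that captures at 48kHz.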
On the browser side, the Web Audio API's AudioWorklet is the modern approach for capturing and resampling mic input in real time. Avoid ScriptProcessorNode: it's deprecated and runs on the main thread, which introduces perceptible jitter.
For output, you receive PCM16 chunks in response.audio.delta events. The simplest playback approach in a browser is to decode each chunk into an AudioBuffer and schedule it on an AudioContext with tight timing. If you let chunks queue up before playing, you'll introduce buffering delay that kills the real-time feel.
One thing that catches developers off guard: if the user's microphone input leaks into the audio output (no echo cancellation), the server VAD will interpret the model's own voice as user speech and trigger an interruption loop. Always enable the browser's built-in echo cancellation when capturing mic audio:
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    sampleRate: 24000
  }
});
Managing Conversations and Turn Detection
The server VAD mode handles turn detection automatically. The model listens for silence after speech activity and commits the audio buffer, triggering a response. You can tune the VAD sensitivity with threshold, prefix_padding_ms, and silence_duration_ms parameters in the session config.
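As a sketch of that tuning, the helper below builds a session.update event carrying the three parameters named above. The function name is hypothetical and the default values are illustrative starting points, not recommendations; verify the accepted ranges against the API reference.

```python
import json


def vad_session_update(threshold: float = 0.5,
                       prefix_padding_ms: int = 300,
                       silence_duration_ms: int = 500) -> str:
    """Build a session.update event that tunes server VAD."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,                      # speech sensitivity
                "prefix_padding_ms": prefix_padding_ms,      # audio kept before speech
                "silence_duration_ms": silence_duration_ms,  # silence that ends a turn
            }
        },
    })
```

Raising silence_duration_ms, for instance vad_session_update(silence_duration_ms=800), makes the server wait longer before treating a pause as the end of a turn.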
If server VAD is too aggressive for your use case (say, users often pause mid-thought), you can switch to manual turn detection by setting turn_detection to null. You then control when the buffer is committed by sending input_audio_buffer.commit explicitly. A push-to-talk UI maps naturally to this mode.
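A push-to-talk flow could be sketched like this. The two helper names are hypothetical; the sketch assumes that, with turn detection disabled, the client asks for a reply by sending response.create after committing the buffer, as the getting-started snippet does for text input.

```python
import json


def disable_vad_event() -> str:
    """session.update that switches to manual turn detection."""
    # None serializes to JSON null.
    return json.dumps({
        "type": "session.update",
        "session": {"turn_detection": None},
    })


def end_of_turn_events() -> list[str]:
    """Events to send when the user releases the talk button."""
    return [
        json.dumps({"type": "input_audio_buffer.commit"}),
        json.dumps({"type": "response.create"}),
    ]
```

While the button is held, the client keeps sending input_audio_buffer.append events; on release it sends the two events from end_of_turn_events() in order.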
Interruptions work out of the box in server VAD mode. When the server detects user speech while the model is generating a response, it emits an input_audio_buffer.speech_started event and cancels the in-progress response, which then finishes with a response.done event whose status is cancelled. Your client should stop playing audio as soon as it sees speech_started; if you keep playing buffered chunks, the user will hear the model keep talking even though the server already stopped.
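A client-side sketch of that stop-playback rule follows. It assumes the interruption signal is the server event listed in INTERRUPT_EVENTS (an assumption to check against the current event reference); PlaybackQueue and handle_event are hypothetical names for this example.

```python
import base64
from collections import deque
from typing import Optional

# Assumption: user speech during playback is signalled by this server event.
INTERRUPT_EVENTS = {"input_audio_buffer.speech_started"}


class PlaybackQueue:
    """Holds decoded audio chunks that have not been played yet."""

    def __init__(self) -> None:
        self._chunks = deque()

    def enqueue(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None if nothing is queued."""
        return self._chunks.popleft() if self._chunks else None

    def flush(self) -> int:
        """Drop everything still queued; return how many chunks were dropped."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped


def handle_event(event: dict, queue: PlaybackQueue) -> None:
    """Queue incoming audio, and discard unplayed audio on interruption."""
    if event["type"] == "response.audio.delta":
        queue.enqueue(base64.b64decode(event["delta"]))
    elif event["type"] in INTERRUPT_EVENTS:
        queue.flush()
```

The audio output callback would pull from next_chunk(), so a flush takes effect on the very next chunk boundary.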
Common Pitfalls and Gotchas
Session timeouts. Realtime sessions have a maximum duration. If a user leaves your app open without interacting, the session will eventually close. Build reconnection logic that creates a new session and restores the conversation context from your own state store.
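The restore step might be sketched as replaying stored turns into the new session as conversation.item.create events. This assumes you keep a text transcript of each turn in your own store; build_restore_events is a hypothetical helper, and the content type used for assistant turns is an assumption to verify against the API reference.

```python
import json

# Assumed content types per role; verify against the current API reference.
CONTENT_TYPE_BY_ROLE = {"user": "input_text", "assistant": "text"}


def build_restore_events(history: list[tuple[str, str]]) -> list[str]:
    """Turn stored (role, text) pairs into conversation.item.create events."""
    events = []
    for role, text in history:
        events.append(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": role,
                "content": [{"type": CONTENT_TYPE_BY_ROLE[role], "text": text}],
            },
        }))
    return events
```

After reconnecting and sending session.update, the client would send these events in order before resuming audio.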
Audio cost scales with duration, not with transcript length. Audio is billed in tokens, and the number of audio tokens tracks how many seconds of audio are processed, regardless of how much was actually spoken. A 30-second silence is not free. Use VAD settings to avoid sending long silent buffers, and close sessions when the user is clearly inactive.
Function call latency adds up. When the model triggers a function call, it pauses audio generation until your function returns a result. If your function hits a slow database or external API, the user hears silence. Keep tool implementations fast (under 200ms where possible), and return a brief acknowledgment message before doing heavy work if you need to.
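One way to enforce that budget is sketched below: run the tool with a timeout and fall back to an acknowledgment if it is too slow. The helper names and the fallback payload are hypothetical, and the function_call_output item shape is an assumption to confirm against the API reference.

```python
import asyncio
import json

TOOL_BUDGET_S = 0.2  # the 200 ms target mentioned above


async def call_tool_with_budget(tool, args: dict) -> dict:
    """Run an async tool; return an acknowledgment if it exceeds the budget."""
    try:
        return await asyncio.wait_for(tool(**args), timeout=TOOL_BUDGET_S)
    except asyncio.TimeoutError:
        # The slow call is cancelled here; a real app might hand it to a job queue.
        return {"status": "pending", "message": "Still working on that."}


def build_function_result_events(call_id: str, result: dict) -> list[str]:
    """Events that return a tool result and ask the model to continue."""
    return [
        json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps(result),
            },
        }),
        json.dumps({"type": "response.create"}),
    ]
```

The call_id comes from the function call the model emitted, so the result is matched to the right request.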
Context window management. Long sessions accumulate conversation history that counts against the model's context window. For very long sessions, you'll need to truncate older conversation items. The API exposes conversation item IDs so you can delete specific entries; prune from the oldest end rather than summarizing, since summarization introduces its own latency.
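Pruning from the oldest end could look like the following sketch. It tracks item IDs on the client and emits a conversation.item.delete event whenever the window overflows; ConversationWindow is a hypothetical class, and it assumes you record each item's ID as the server reports it.

```python
import json
from collections import deque


class ConversationWindow:
    """Tracks item IDs client-side and emits delete events for the oldest."""

    def __init__(self, max_items: int) -> None:
        self.max_items = max_items
        self._item_ids = deque()

    def add(self, item_id: str) -> list[str]:
        """Record a new item; return delete events for any items pushed out."""
        self._item_ids.append(item_id)
        deletes = []
        while len(self._item_ids) > self.max_items:
            oldest = self._item_ids.popleft()
            deletes.append(json.dumps({
                "type": "conversation.item.delete",
                "item_id": oldest,
            }))
        return deletes
```

Counting items is the simplest policy; a token-based limit would follow the same pattern with a different overflow check.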
Audio format mismatches. Sending audio at the wrong sample rate is the single most common setup error. The model will process it, but the output will sound wrong or the VAD will misfire. Always verify your audio pipeline is outputting 24kHz PCM16 before debugging anything else.
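A quick format check on a recorded test file can be sketched with the standard-library wave module. check_wav_format is a hypothetical helper; it only inspects the WAV header and returns a list of mismatches against 24kHz PCM16 mono.

```python
import wave


def check_wav_format(path: str) -> list[str]:
    """Return a list of problems; an empty list means 24kHz PCM16 mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 24_000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 24000")
        if wav.getsampwidth() != 2:
            problems.append(f"sample width is {wav.getsampwidth()} bytes, expected 2")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected 1")
    return problems
```

Running this on a five-second capture from your pipeline catches a wrong rate or channel count before any audio reaches the API.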
Pricing and What It Costs to Run a Voice App
The Realtime API is priced per token, with separate rates for audio input tokens and audio output tokens, plus the standard text token rates for any text-based context. Audio tokens are significantly more expensive than text tokens on a per-token basis, which reflects the cost of the specialized audio processing.
A rough way to think about it: a one-minute voice conversation involves audio input tokens (the user speaking), audio output tokens (the model responding), and text tokens (system prompt, conversation history, function definitions). The audio portions dominate the cost for a typical call.
Before going to production, benchmark a representative conversation and calculate cost per session. For high-volume consumer apps, even small optimizations (shorter system prompts, aggressive session closing, VAD tuning to avoid processing silence) compound into meaningful savings. This kind of cost-awareness matters as much as the technical integration, and it's worth thinking through alongside broader model selection decisions for your product.
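The per-session calculation is simple arithmetic once you have token counts per category. The sketch below is illustrative only: estimate_session_cost and the category keys are hypothetical, and the rates in the usage example are placeholders, not OpenAI's prices. Substitute the current rate card.

```python
def estimate_session_cost(tokens: dict[str, int],
                          usd_per_million: dict[str, float]) -> float:
    """Sum token counts times their per-million-token rates, in USD."""
    return sum(
        count * usd_per_million[kind] / 1_000_000
        for kind, count in tokens.items()
    )


# Placeholder rates for illustration only; replace with the real rate card.
PLACEHOLDER_RATES = {"audio_in": 10.0, "audio_out": 20.0, "text_in": 1.0, "text_out": 2.0}

benchmark_tokens = {"audio_in": 1_000, "audio_out": 2_000, "text_in": 500, "text_out": 100}
cost = estimate_session_cost(benchmark_tokens, PLACEHOLDER_RATES)
```

Logging this figure per session during development gives you a baseline to compare against after each optimization.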
Also keep in mind that the API landscape is evolving fast. OpenAI is not the only player with low-latency audio capabilities. Google's Gemini 2.5 Flash has its own audio streaming features worth evaluating for your specific use case, particularly if you're already in the Google Cloud ecosystem.
Next Steps
If you're ready to move from reading to building, here are the concrete actions to take:
- Set up a minimal WebSocket client using the Python snippet above and verify you can get an audio response back. This takes about 15 minutes and proves your credentials and network path work.
- Pick your transport early. Decide WebSockets or WebRTC based on whether your architecture is server-mediated or browser-direct. Switching later requires rebuilding your audio pipeline.
- Fix your audio format before anything else. Confirm your microphone capture pipeline outputs 24kHz PCM16 mono. Run a quick test: record 5 seconds, check the file header, and send it to the API. This will save hours of debugging.
- Enable echo cancellation on mic capture from day one. Skipping this causes a feedback loop that's tricky to diagnose if you don't know to look for it.
- Instrument your session costs early. Log audio duration per session from the start. It's much easier to set budgets and alerts before you have real users than to retroactively understand a billing spike.
The GA release removes the main barrier (access) and signals that the API surface is stable enough to build on. The fundamentals are solid; what you do with them is up to you.
Frequently Asked Questions
What audio format does the OpenAI Realtime API require for input?
The Realtime API expects PCM16 audio at 24kHz sample rate, mono channel. If your microphone or audio source produces a different format, you need to resample and re-encode it before sending chunks to the API.
Can I use the OpenAI Realtime API in a mobile app, not just a browser?
Yes, you can use it in mobile apps via the WebSocket transport on the server side, or by integrating WebRTC support in your mobile client. Flutter and React Native both have WebRTC libraries that can negotiate a direct peer connection with OpenAI's infrastructure.
How does interruption handling work in the Realtime API?
When server-side VAD detects the user speaking while the model is generating audio, the server emits an input_audio_buffer.speech_started event and cancels the in-progress response. Your client should immediately stop playing any buffered audio chunks so the model's voice doesn't continue after cancellation.
How much does it cost to run a one-minute conversation with the Realtime API?
Exact pricing depends on OpenAI's current rate card, but audio input and output tokens are billed separately at rates higher than standard text tokens. A one-minute conversation typically involves substantial audio token usage on both sides, so benchmark a representative session during development and track cost per session before scaling.
What is the difference between server VAD and manual turn detection in the Realtime API?
Server VAD automatically detects when the user stops speaking and triggers a response, which works well for natural conversation. Manual turn detection gives you full control: you commit the audio buffer explicitly, which is ideal for push-to-talk interfaces or situations where users frequently pause mid-sentence.