OpenAI explains its low-latency voice stack: relay + transceiver WebRTC architecture
OpenAI detailed how it reworked WebRTC at global scale to keep voice interactions responsive. The design splits packet routing (relay) from session termination (transceiver) to reduce public UDP surface area while preserving session ownership.
OpenAI published a technical deep dive into how it delivers **low-latency voice AI** for ChatGPT voice and the Realtime API.
## The problem: voice UX punishes latency
Voice feels natural only when turn-taking is fast and stable. At OpenAI’s scale, that translates to:
- fast connection setup
- low, stable media round-trip time (RTT)
- low jitter and low packet loss across global networks
## Why WebRTC
OpenAI emphasizes WebRTC’s standardized solutions for:
- NAT traversal (ICE)
- encrypted transport (DTLS + SRTP)
- codec negotiation
- network adaptation and quality control (RTCP)
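In a typical WebRTC deployment these protocols share one UDP flow per session, so any server-side component has to tell STUN, DTLS, and (S)RTP/RTCP packets apart before handling them. A minimal sketch of that step, assuming the first-byte ranges defined in RFC 7983; the function name `classify_packet` is hypothetical and is not taken from OpenAI's write-up:

```python
def classify_packet(datagram: bytes) -> str:
    """Classify a UDP payload by its first byte (ranges from RFC 7983)."""
    if not datagram:
        return "empty"
    first = datagram[0]
    if first <= 3:
        return "stun"          # ICE connectivity checks
    if 20 <= first <= 63:
        return "dtls"          # key exchange for SRTP
    if 128 <= first <= 191:
        return "rtp_or_rtcp"   # media and RTCP feedback
    return "unknown"           # other ranges (e.g. TURN channels) ignored here


print(classify_packet(bytes([0x00, 0x01])))  # STUN binding request prefix
print(classify_packet(bytes([22, 0xFE])))    # DTLS handshake record
print(classify_packet(bytes([0x80, 0x6F])))  # RTP version 2 header
```

Only STUN packets carry ICE credentials, which is why the routing trick described later applies to the first packets of a session rather than to the media itself.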
For AI, streaming audio enables real-time transcription, reasoning, and speech generation **while the user is still talking**.
## Key architectural shift: split relay + transceiver
OpenAI reports that the classic model of terminating each session on its own UDP port (“one UDP port per session”) becomes operationally painful at high concurrency: port management, a larger security policy surface area, and Kubernetes autoscaling friction.
Their approach separates responsibilities:
- **Relay:** a lightweight UDP forwarder with a small public footprint
- **Transceiver:** the stateful owner of each WebRTC session (ICE/DTLS/SRTP/session lifecycle)
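A crucial trick is routing a session's first packets with an identifier WebRTC already provides: the ICE username fragment (**ufrag**), which is carried in the USERNAME attribute of STUN connectivity checks. The relay can read it before any DTLS or SRTP state exists.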
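The ufrag-based routing step can be sketched as follows. This is an illustration under stated assumptions, not OpenAI's implementation: `extract_server_ufrag`, `RelayTable`, and `build_binding_request` are hypothetical names, and pinning later packets by client address is an assumed follow-up step that the write-up summarized here does not describe. The parsing follows the STUN message layout (20-byte header, type-length-value attributes padded to 4 bytes) and the ICE rule that USERNAME is `receiver-ufrag:sender-ufrag`, so the server-side ufrag comes first in checks sent by the client.

```python
import struct
from typing import Dict, Optional, Tuple

Addr = Tuple[str, int]

STUN_MAGIC_COOKIE = 0x2112A442
STUN_BINDING_REQUEST = 0x0001
ATTR_USERNAME = 0x0006


def extract_server_ufrag(datagram: bytes) -> Optional[str]:
    """Return the receiver-side ufrag from a STUN binding request, else None."""
    if len(datagram) < 20:
        return None
    msg_type, msg_len, cookie = struct.unpack_from("!HHI", datagram, 0)
    if msg_type != STUN_BINDING_REQUEST or cookie != STUN_MAGIC_COOKIE:
        return None
    end = 20 + msg_len  # msg_len excludes the 20-byte header
    if end > len(datagram):
        return None
    offset = 20
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", datagram, offset)
        value_start = offset + 4
        value_end = value_start + attr_len
        if value_end > end:
            return None
        if attr_type == ATTR_USERNAME:
            try:
                username = datagram[value_start:value_end].decode("utf-8")
            except UnicodeDecodeError:
                return None
            ufrag, sep, _ = username.partition(":")
            return ufrag if sep and ufrag else None
        offset = value_end + (-attr_len % 4)  # skip padding to 4-byte boundary
    return None


class RelayTable:
    """Hypothetical relay state: ufrag -> transceiver, then client -> transceiver."""

    def __init__(self) -> None:
        self._by_ufrag: Dict[str, Addr] = {}
        self._by_client: Dict[Addr, Addr] = {}

    def register_session(self, ufrag: str, transceiver: Addr) -> None:
        self._by_ufrag[ufrag] = transceiver

    def route(self, client: Addr, datagram: bytes) -> Optional[Addr]:
        owner = self._by_client.get(client)
        if owner is not None:
            return owner  # assumed fast path for DTLS/SRTP packets
        ufrag = extract_server_ufrag(datagram)
        if ufrag is None:
            return None  # unknown flow and no ufrag: drop
        owner = self._by_ufrag.get(ufrag)
        if owner is not None:
            self._by_client[client] = owner  # pin the flow to its owner
        return owner


def build_binding_request(username: str) -> bytes:
    """Build a bare binding request for demonstration (all-zero transaction ID)."""
    value = username.encode("utf-8")
    attr = struct.pack("!HH", ATTR_USERNAME, len(value)) + value
    attr += b"\x00" * (-len(value) % 4)
    header = struct.pack("!HHI", STUN_BINDING_REQUEST, len(attr), STUN_MAGIC_COOKIE)
    return header + b"\x00" * 12 + attr


table = RelayTable()
table.register_session("srvfrag", ("10.0.0.5", 4000))
client = ("203.0.113.7", 5555)
print(table.route(client, build_binding_request("srvfrag:clifrag")))
```

In this sketch the relay keeps only two small lookup tables, while ICE, DTLS, and SRTP state stay with the transceiver that owns the session. A production relay would also need entry expiry and message-integrity handling, which are omitted here.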
## Takeaways for builders
If you are building real-time voice or agentic systems:
- budget for networking architecture early (especially first-hop latency)
- treat session ownership as a core scaling constraint
- minimize externally exposed UDP ranges where possible
This write-up is especially relevant for teams building on WebRTC for client-to-server AI interactions.
Source: OpenAI