May 8, 2026 · API · Agents · Infrastructure

OpenAI Drops Three Voice Agent Pieces at Once

OpenAI launched three new voice models on May 7, all aimed at the same place: voice agents that don't sound like answering machines. GPT-Realtime-2 inherits GPT-5-class reasoning so the voice can handle multi-step requests without losing track. GPT-Realtime-Translate does live translation across 70 input languages and 13 output languages at conversational pace. GPT-Realtime-Whisper transcribes speech to text in real time as the conversation unfolds.

The framing OpenAI used in the launch post is the giveaway: voice should "listen, reason, translate, transcribe, and take action as a conversation unfolds." That's not a TTS upgrade — that's the voice equivalent of a tool-using agent loop. You can stack the three pieces to build voice agents that translate live for a customer service call, transcribe the entire interaction for compliance, and route the conversation to actions on the backend, all within one Realtime API session.
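To make the one-session claim concrete, here's a minimal sketch of what stacking the pieces might look like, assuming the new models slot into the existing Realtime API session shape (a WebSocket plus a session.update event). The model slugs, the placement of the transcription model, and the route_ticket tool are illustrative assumptions, not confirmed API surface.

```python
# Sketch only: assumes the new models drop into today's Realtime API session
# shape. Model slugs are guessed from the launch names; the route_ticket tool
# and field placement are assumptions for illustration.
import json
import os

import websocket  # pip install websocket-client

API_KEY = os.environ["OPENAI_API_KEY"]
URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"  # assumed slug

ws = websocket.create_connection(
    URL,
    header=[
        f"Authorization: Bearer {API_KEY}",
        "OpenAI-Beta: realtime=v1",
    ],
)

# One session, three jobs: reason and speak, translate replies live for the
# caller, and transcribe the whole interaction for compliance. Backend
# actions come back as tool calls on the same socket.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "modalities": ["audio", "text"],
        "instructions": (
            "You are a customer-service voice agent. Reply in the caller's "
            "language, translating on the fly."
        ),
        # Assumed: the new transcription model is selectable here the way
        # whisper-1 is today.
        "input_audio_transcription": {"model": "gpt-realtime-whisper"},
        "tools": [{
            "type": "function",
            "name": "route_ticket",  # hypothetical backend action
            "description": "Open or update a support ticket for this call.",
            "parameters": {
                "type": "object",
                "properties": {
                    "summary": {"type": "string"},
                    "priority": {"type": "string",
                                 "enum": ["low", "normal", "high"]},
                },
                "required": ["summary"],
            },
        }],
    },
}))

# Read server events: transcripts arrive alongside audio and tool calls, so
# compliance logging and backend routing happen in the same loop.
for _ in range(10):  # small demo loop; a real agent streams indefinitely
    event = json.loads(ws.recv())
    if event.get("type") == "conversation.item.input_audio_transcription.completed":
        print("transcript:", event.get("transcript"))
    elif event.get("type") == "response.function_call_arguments.done":
        print("backend action:", event.get("arguments"))

ws.close()
```

The point of the sketch is the shape, not the specifics: translation rides in the instructions, transcription streams back as events, and backend actions arrive as tool calls, all on one socket.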

Pricing keeps the categories separate. Translate and Whisper bill per minute. Realtime-2 is token-priced through the Realtime API. The split says OpenAI sees Realtime-2 as the agent core and Translate/Whisper as utility primitives anyone can drop into their stack — the way Stripe split Payments from Checkout from Connect.
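A back-of-the-envelope sketch of what that split means for a builder, using placeholder rates rather than OpenAI's published prices: per-minute products cost the same whether the agent says ten words or a thousand, while the token-priced core scales with how much it reasons and speaks.

```python
# Back-of-the-envelope only: the rates passed in below are placeholders, not
# OpenAI's published prices. The point is the shape of the split.

def per_minute_cost(minutes: float, rate_per_min: float) -> float:
    """Translate / Whisper style billing: cost tracks call length only."""
    return minutes * rate_per_min

def token_cost(input_tokens: int, output_tokens: int,
               in_rate_per_1k: float, out_rate_per_1k: float) -> float:
    """Realtime-2 style billing: cost tracks how much the agent reasons and speaks."""
    return (input_tokens / 1000) * in_rate_per_1k + (output_tokens / 1000) * out_rate_per_1k

# A 12-minute support call, priced both ways with made-up rates:
print(per_minute_cost(12, rate_per_min=0.06))                                  # utility primitive
print(token_cost(9_000, 3_000, in_rate_per_1k=0.005, out_rate_per_1k=0.02))    # agent core
```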

The competitive read: ElevenLabs has the voices, Deepgram has the transcription, but nobody else has the GPT-5 reasoning loop fused with both at sub-second latency. Customer service is the headline target, but the broader bet is that voice becomes the dominant interface for agents in the next 12 months, and OpenAI just made sure the building blocks for that interface live on its API.

https://techcrunch.com/2026/05/07/openai-launches-new-voice-intelligence-features-in-its-api/
