Thinking Machines Builds Voice AI That Listens While Talking

Mira Murati's Thinking Machines is researching full-duplex voice AI that can listen and speak simultaneously, tackling one of the biggest UX gaps in current voice assistants and synthetic speech systems.

Thinking Machines Lab, the high-profile AI startup founded by former OpenAI CTO Mira Murati, is tackling one of the most stubborn limitations in conversational AI: the inability of voice models to listen and speak at the same time. According to a new TechCrunch report, the company is researching full-duplex voice systems — AI that can process incoming audio while it is still generating its own response, much like humans do during natural conversation.

The Turn-Taking Problem

Nearly every voice assistant in production today — including OpenAI's Advanced Voice Mode, Google's Gemini Live, and ElevenLabs' Conversational AI — operates in a fundamentally half-duplex manner. The model waits for the user to finish speaking, detects end-of-turn via voice activity detection (VAD) or a dedicated turn-taking classifier, then generates a response. While modern systems have shrunk latency dramatically (OpenAI's Realtime API claims sub-300ms turn latency), the underlying architecture still assumes strict alternation.
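The half-duplex loop described above can be sketched in a few lines. This is a toy illustration, not any vendor's actual pipeline: an energy-based VAD scans incoming audio frames and declares end-of-turn after a run of silent frames, and only then may the assistant respond. The thresholds and frame sizes are invented for clarity.

```python
# Minimal sketch of a half-duplex turn-taking loop (illustrative only).
# An energy-based VAD watches incoming audio frames; once it sees a run of
# silent frames (the end-of-turn heuristic), the assistant may start speaking.

def is_speech(frame, energy_threshold=0.01):
    """Classify one audio frame as speech by mean absolute amplitude."""
    return sum(abs(s) for s in frame) / len(frame) > energy_threshold

def detect_end_of_turn(frames, silence_frames_needed=3):
    """Return the frame index after which the user is judged done talking,
    or None if no end-of-turn is found. Strictly half-duplex: the system
    cannot produce output until this returns."""
    silent_run = 0
    for i, frame in enumerate(frames):
        if is_speech(frame):
            silent_run = 0
        else:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i + 1
    return None

# Four "speech" frames followed by three near-silent frames: end of turn
# is declared after the third silent frame.
speech = [0.2, -0.3, 0.25, -0.2]
silence = [0.001, -0.002, 0.001, 0.0]
frames = [speech] * 4 + [silence] * 3
print(detect_end_of_turn(frames))  # → 7
```

The fragility is visible even in this sketch: a mid-sentence pause longer than the silence window triggers a false end-of-turn, which is exactly the class of error full-duplex architectures aim to eliminate.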

This breaks down in real human conversation, which is full of backchannels ("mhm," "yeah," "right"), interruptions, overlapping speech, and mid-sentence corrections. Current systems either ignore these signals or get derailed by them. Thinking Machines argues that solving this is a prerequisite for voice AI that feels genuinely natural rather than transactional.

What Full-Duplex Actually Requires

Building a full-duplex model is harder than it sounds. It requires the AI to:

  • Continuously encode incoming audio while simultaneously running an autoregressive decoder producing speech output.
  • Maintain two parallel attention streams — one for listening, one for speaking — and decide in real time whether to keep talking, yield the floor, or react to a backchannel without stopping.
  • Reason about prosody and intent, distinguishing an encouraging "uh-huh" from a genuine interruption that should cause the model to stop and re-plan.
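The per-tick decision implied by these requirements can be caricatured as a tiny policy function. The event labels and rules below are invented for illustration; a real full-duplex model would infer them jointly from prosody, content, and timing rather than from discrete tags.

```python
# Toy sketch of the per-tick floor-control decision in a full-duplex system.
# Every tick, the model consumes an incoming audio event AND decides what to
# do with its own output stream: keep talking, yield the floor, or briefly
# acknowledge. Labels and policy are hypothetical, for illustration only.

def full_duplex_step(incoming_event, currently_speaking):
    """Return 'continue', 'yield', or 'acknowledge' for this tick."""
    if incoming_event == "backchannel":    # "mhm", "yeah" — keep the floor
        return "continue" if currently_speaking else "acknowledge"
    if incoming_event == "interruption":   # genuine overlap — stop, re-plan
        return "yield"
    return "continue"                      # silence, or echo of own speech

# The assistant talks through a backchannel but yields to a real barge-in.
print(full_duplex_step("backchannel", currently_speaking=True))   # → continue
print(full_duplex_step("interruption", currently_speaking=True))  # → yield
```

The hard part, of course, is the classification this sketch takes for granted: distinguishing a backchannel from an interruption is the prosody-and-intent problem named in the last bullet.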

Recent academic work — notably Kyutai's Moshi model and Meta's research on dGSLM (dialogue Generative Spoken Language Model) — has demonstrated that full-duplex architectures are feasible. Moshi, for instance, models two audio streams (user and assistant) jointly, with a large temporal transformer running over a smaller depth transformer that handles the audio codebook levels at each step, achieving a theoretical latency of about 160ms. Thinking Machines appears to be pursuing a related but more ambitious direction, aiming for production-grade conversational quality.

Why It Matters for Synthetic Media

For the synthetic voice and deepfake landscape, full-duplex voice AI is a double-edged development. On one hand, it dramatically improves accessibility, customer service, and assistive tech. On the other, it makes real-time voice impersonation far more convincing. Today, voice-cloned scam calls — already a fast-growing fraud vector flagged by financial regulators — are detectable in part because the cloned voice cannot react naturally to interruptions or overlapping speech. A full-duplex cloned voice that responds to backchannels, pauses gracefully when interrupted, and resumes mid-thought would be substantially harder for victims to detect in the moment.

This has direct implications for voice authentication systems used by banks and call centers. Liveness checks that rely on conversational dynamics — asking the caller to interrupt, repeat, or engage in rapid back-and-forth — would need to evolve. Detection vendors like Pindrop, Reality Defender, and GetReal will likely need new signal classes focused on micro-timing of turn transitions and prosodic alignment.

Thinking Machines' Strategic Position

Thinking Machines launched in 2025 with a reported $2 billion seed round at a $10 billion valuation, pulling in senior researchers from OpenAI, Meta, and Google. The company has been deliberately quiet about its product roadmap, but its public research output — including work on reproducible inference and reinforcement learning stability — suggests a focus on foundational infrastructure rather than chasing chatbot benchmarks.

A push into full-duplex voice fits this pattern. Rather than competing on text-model leaderboards, the lab is targeting a clear capability gap where current frontier models all perform poorly. If Thinking Machines ships a production-quality full-duplex model, it would put pressure on OpenAI, Google, and ElevenLabs to follow — and potentially redefine what users expect from voice AI.

The Open Question

What remains unclear is whether Thinking Machines plans to release this work openly, license it via API, or embed it in a consumer product. Murati has hinted at an emphasis on open research, but the commercial stakes for real-time voice are enormous. Either way, full-duplex conversation is shaping up to be the next major axis of competition in voice AI — and a new frontier for both synthetic media creators and the authenticity systems trying to keep up with them.


Stay informed on AI video and digital authenticity. Follow Skrew AI News.