Thinking Machines shows off preview of near-realtime AI voice and video conversation with new 'interaction models'
Is AI leaving the era of "turn-based" chat? Right now, anyone who uses AI models regularly for work or in their personal life knows that the basic interaction mode across text, imagery, audio, and video remains the same: the human user provides an input, waits anywhere from milliseconds to minutes (or, for particularly tough queries, hours or days), and the AI model provides an output. But if AI is to really take on jobs requiring natural interaction, it will need to do more than this kind of "turn-based" interactivity. It will ultimately need to respond more fluidly and naturally to human inputs, even responding while also processing the next human input, be it text or another format.

That, at least, seems to be the contention of Thinking Machines, the well-funded AI startup founded last year by former OpenAI chief technology officer Mira Murati and former OpenAI researcher and co-founder John Schulman, among others.

Today, the firm announced a research preview of what it deems "interaction models": a new class of native multimodal systems that treats interactivity as a first-class citizen of the model architecture rather than an external software "harness." The company reports impressive gains on third-party benchmarks and reduced latency as a result.

However, the models are not yet available to the general public or even to enterprises. The company says in its announcement blog post: "In the coming months, we will open a limited research preview to collect feedback, with a wider release later this year."

'Full duplex' simultaneous input/output processing

At the heart of this announcement is a fundamental shift in how AI perceives time and presence. Current frontier models typically experience reality in a single thread: they wait for a user to finish an input before they begin processing, and their perception freezes while they generate a response. In their blog post, the Thinking Machines researchers describe this status quo as a limitation that forces humans to "contort themselves" to AI interfaces, phrasing questions like emails and batching their thoughts.

To solve this "collaboration bottleneck," Thinking Machines has moved away from the standard alternating token sequence. Instead, it uses a multi-stream, micro-turn design that processes 200-millisecond chunks of input and output simultaneously. This "full-duplex" architecture allows the model to listen, talk, and see in real time, enabling it to backchannel while a user speaks or to interject when it notices a visual cue, such as a user writing a bug into a code snippet or a friend entering the video frame.

Technically, the model uses encoder-free early fusion. Rather than relying on massive standalone encoders like Whisper for audio, the system takes in raw audio signals as dMel features and 40x40 image patches through a lightweight embedding layer, co-training all components from scratch within the transformer.
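For intuition, here is a minimal PyTorch sketch of what encoder-free early fusion over micro-turns could look like. Every name, shape, and hyperparameter below is an assumption for illustration (the announcement specifies only dMel audio features, 40x40 image patches, and 200ms chunks), not Thinking Machines' actual implementation; the point is simply that raw modality chunks pass through lightweight linear embeddings into one shared transformer stream rather than through large pretrained encoders.

```python
# Illustrative sketch only: hypothetical shapes and names, not TML's real code.
# Idea: no Whisper-style encoder; raw modality chunks go through small linear
# embeddings straight into one shared transformer, one 200ms micro-turn at a time.
import torch
import torch.nn as nn

D_MODEL = 1024   # assumed transformer width
N_MEL = 80       # assumed dMel-style log-mel bins per audio frame
PATCH = 40       # 40x40 image patches, per the announcement

class EarlyFusionEmbedder(nn.Module):
    """Lightweight per-modality embeddings into a shared token stream."""
    def __init__(self):
        super().__init__()
        self.audio_proj = nn.Linear(N_MEL, D_MODEL)               # dMel frame -> token
        self.image_proj = nn.Linear(3 * PATCH * PATCH, D_MODEL)   # RGB patch -> token
        self.text_embed = nn.Embedding(50_000, D_MODEL)           # assumed vocab size

    def forward(self, audio_frames, image_patches, text_ids):
        # audio_frames: (T_a, N_MEL); image_patches: (T_i, 3*PATCH*PATCH); text_ids: (T_t,)
        tokens = [self.audio_proj(audio_frames),
                  self.image_proj(image_patches),
                  self.text_embed(text_ids)]
        # How modality tokens are interleaved is a design choice the announcement
        # does not specify; simple concatenation is shown here.
        return torch.cat(tokens, dim=0)

def micro_turn(transformer, embedder, context, audio_chunk, image_chunk, text_ids):
    """One assumed 200ms step: extend the shared context with freshly embedded
    input and decode the next output chunk in the same pass (full duplex)."""
    new_tokens = embedder(audio_chunk, image_chunk, text_ids)
    context = torch.cat([context, new_tokens], dim=0)
    output_chunk = transformer(context.unsqueeze(0))
    return context, output_chunk

if __name__ == "__main__":
    embedder = EarlyFusionEmbedder()
    # 200ms of audio at an assumed 10ms hop = 20 dMel frames; 4 image patches.
    audio = torch.randn(20, N_MEL)
    patches = torch.randn(4, 3 * PATCH * PATCH)
    text = torch.tensor([101, 2023])
    print(embedder(audio, patches, text).shape)  # torch.Size([26, 1024])
```

The micro_turn function gestures at the full-duplex loop described above: each step appends newly embedded input tokens to the shared context and produces an output chunk in the same pass, so generation never has to wait for the user to finish.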
Dual model system

The research preview introduces TML-Interaction-Small, a 276-billion-parameter Mixture-of-Experts (MoE) model with 12 billion active parameters. Because real-time interaction requires near-instantaneous response times that often conflict with deep reasoning, the company has architected a two-part system:

The Interaction Model: stays in a constant exchange with the user, handling dialog management, presence, and immediate follow-ups.

The Background Model: an asynchronous agent that handles sustained reasoning, web browsing, or complex tool calls, streaming results back to the interaction model to be woven naturally into the conversation.

This setup allows the AI to perform tasks like live translation or generating a UI chart while continuing to listen to user feedback, a capability demonstrated in the announcement video, where the model provided typical human reaction times for various cues while simultaneously generating a bar chart. (A sketch of what this coordination pattern might look like in code appears at the end of this article.)

Impressive performance on major benchmarks against other leading AI labs' fast interaction models

To prove the efficacy of this approach, the lab used FD-bench, a benchmark specifically designed to measure interaction quality rather than just raw intelligence. The results show that TML-Interaction-Small significantly outperforms existing real-time systems:

Responsiveness: It achieved a turn-taking latency of 0.40 seconds, compared to 0.57 seconds for Gemini-3.1-flash-live and 1.18 seconds for GPT-realtime-2.0 (minimal).

Interaction quality: On FD-bench V1.5, it scored 77.8, nearly doubling the scores of its primary competitors (GPT-realtime-2.0 minimal scored 46.8).

Visual proactivity: In specialized tests like RepCount-A (counting physical repetitions in video) and ProactiveVideoQA, Thinking Machines' model successfully engaged with the visual world while other frontier models remained silent or provided incorrect answers.

Metric                       TML-Interaction-Small   GPT-realtime-2.0 (min)   Gemini-3.1-flash-live (min)
Turn-taking latency (s)      0.40                    1.18                     0.57
Interaction quality (avg)    77.8                    46.8                     n/a
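To make the dual-model coordination described above concrete, here is a minimal Python sketch, assuming hypothetical names, messages, and timings throughout. It illustrates the general pattern (a fast loop that never blocks on a slow asynchronous agent), not Thinking Machines' implementation.

```python
# Hypothetical sketch of the interaction/background split -- not TML's code.
# A fast "interaction" loop keeps responding every micro-turn, while a slow
# "background" agent streams partial results back into the shared conversation.
import asyncio

MICRO_TURN_S = 0.2  # 200ms micro-turns, per the announcement

async def background_agent(task: str, results: asyncio.Queue) -> None:
    """Stands in for sustained reasoning, web browsing, or tool calls."""
    for step in ("searching...", "drafting chart...", "chart ready"):
        await asyncio.sleep(1.0)  # simulated slow work
        await results.put(f"[background] {task}: {step}")

async def interaction_loop(user_chunks, results: asyncio.Queue) -> None:
    """Responds every micro-turn, weaving in background results as they land."""
    for chunk in user_chunks:
        try:
            update = results.get_nowait()  # non-blocking: never stalls a turn
        except asyncio.QueueEmpty:
            update = None
        reply = f"(ack: {chunk!r}" + (f", also: {update})" if update else ")")
        print(reply)
        await asyncio.sleep(MICRO_TURN_S)  # stand-in for one 200ms micro-turn

async def main() -> None:
    results: asyncio.Queue = asyncio.Queue()
    user_chunks = [f"audio chunk {i}" for i in range(20)]
    await asyncio.gather(
        background_agent("generate bar chart", results),
        interaction_loop(user_chunks, results),
    )

asyncio.run(main())
```

The design point this illustrates: the interaction loop only polls for background results, so no turn is ever delayed by slow work; whatever has arrived by the next micro-turn gets woven into that turn's reply, which is what would let a model keep listening while a chart renders.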