Thesis

Voice remains our most intuitive form of communication, yet existing AI systems create an "uncanny valley" effect: not because they sound robotic, but because they lack the subtle imperfections that make human speech authentic. While digital voice assistant adoption continues to grow, with 153.5 million U.S. users expected by 2025, these systems still struggle with fundamental limitations in emotional intelligence, as reflected in accuracy ranking as users' primary concern.

Sesame's recently released CSM-1b (Conversational Speech Model) approaches this challenge differently: instead of perfecting pronunciation, it embraces the inherent messiness of human speech. Traditional cascading systems produce text first and hand it to a separate text-to-speech stage; CSM-1b instead generates emotionally authentic speech directly from text and conversational audio context, using an end-to-end multimodal architecture that combines a Llama backbone with a specialized audio decoder and eliminating the unnatural delays that plague cascaded voice AI.
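
At a high level, that architecture can be pictured as a single transformer operating over interleaved text and audio tokens, with a lightweight decoder head emitting discrete audio codes. The sketch below is a simplified illustration, not Sesame's actual code: the module names, layer counts, and dimensions are assumptions, and a generic encoder stands in for the causal Llama-style backbone.

```python
# Simplified sketch of a backbone + audio-decoder layout like the one
# described above. All names and sizes are illustrative assumptions; only
# the overall split (transformer over interleaved text/audio tokens, small
# decoder emitting per-codebook audio logits) comes from the text.
import torch
import torch.nn as nn

class VoiceModelSketch(nn.Module):
    def __init__(self, text_vocab=32_000, codebooks=8, codebook_size=1024, d=512):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d)
        # One embedding table per RVQ codebook; a frame's embeddings are summed.
        self.audio_emb = nn.ModuleList(
            [nn.Embedding(codebook_size, d) for _ in range(codebooks)]
        )
        # Stand-in for the Llama-style transformer backbone.
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # "Audio decoder": one classification head per RVQ codebook.
        self.heads = nn.ModuleList(
            [nn.Linear(d, codebook_size) for _ in range(codebooks)]
        )

    def forward(self, text_ids, audio_codes):
        # text_ids: (batch, text_len); audio_codes: (batch, frames, codebooks).
        audio = sum(emb(audio_codes[..., i]) for i, emb in enumerate(self.audio_emb))
        x = torch.cat([self.text_emb(text_ids), audio], dim=1)  # joint context
        h = self.backbone(x)
        # Predict the next audio frame's code in every codebook at once.
        return [head(h[:, -1]) for head in self.heads]
```

Because the same context window holds both the text and the audio history, every prosody decision can condition on the full conversation so far, which is precisely what a pipeline of separate stages cannot do.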

This breakthrough addresses three core limitations of existing voice technology:

  1. Emotional Intelligence: CSM maintains complete conversational context, allowing it to recognize and respond to emotional nuances in real time, a critical capability that traditional systems lack. In blind tests conducted in March 2025, participants couldn't distinguish between CSM and actual humans during short conversations, though longer interactions still revealed some limitations.
  2. Natural Imperfections: The model intentionally incorporates human-like elements such as micro-pauses, self-corrections, and subtle emotional cues. This approach creates what Sesame calls "voice presence": the magical quality that makes spoken interactions feel real and valued. As of March 2025, users report that this produces remarkably natural interactions.
  3. Unified Processing: By operating directly on RVQ audio tokens (a tokenization approach similar to Google's SoundStream and Meta's Encodec), CSM achieves near-instantaneous responsiveness that allows for natural conversation rhythms, including interruptions. This is essential because, per estimates cited as of May 2024, roughly 90% of emotional meaning in human communication is conveyed through non-verbal cues (see the RVQ sketch after this list).
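
Residual vector quantization (RVQ), the tokenization referenced in point 3, represents each short audio frame as a small stack of discrete codes: the first codebook quantizes the frame, and each subsequent codebook quantizes the error the previous one left behind. A minimal NumPy sketch, with random codebooks standing in for trained ones (SoundStream and Encodec learn theirs end-to-end):

```python
# Minimal residual vector quantization (RVQ) sketch. Real codecs such as
# SoundStream or Encodec learn these codebooks end-to-end; random codebooks
# here only illustrate the encode/decode mechanics.
import numpy as np

rng = np.random.default_rng(0)
DIM, CODEBOOKS, CODEBOOK_SIZE = 8, 4, 256
codebooks = rng.normal(size=(CODEBOOKS, CODEBOOK_SIZE, DIM))

def rvq_encode(frame):
    """Return one discrete token per codebook for a single audio frame."""
    residual, tokens = frame, []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))  # nearest entry
        tokens.append(idx)
        residual = residual - cb[idx]  # next codebook quantizes the residual
    return tokens

def rvq_decode(tokens):
    """Sum the selected codebook entries to reconstruct the frame."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

frame = rng.normal(size=DIM)
tokens = rvq_encode(frame)
print(tokens, "reconstruction error:", np.linalg.norm(frame - rvq_decode(tokens)))
```

Because each frame collapses to a handful of integers, a language-model-style transformer can consume and emit audio the same way it handles text tokens, which is what makes the unified processing above possible.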

The market implications of this innovation are substantial. The global voice assistant market is projected to grow from $4.87 billion in 2023 to $33.74 billion by 2030 at a CAGR of 28.9%, while the broader voice and speech recognition market is expected to reach $53.67 billion by 2030. With 8.4 billion voice assistants expected to be in use globally by the end of 2024, Sesame's breakthrough positions it to capture significant value in this rapidly evolving ecosystem by solving the fundamental problem that has limited voice AI adoption: the inability to engage in natural, emotionally intelligent conversation.

Founding Story

Sesame AI was founded in 2023 by a team of experienced industry professionals who combined expertise in AI, speech modeling, and consumer hardware. The founding team identified key limitations in existing voice interfaces through their previous work building VR platforms, production-scale AI systems, and speech recognition technologies.

Founding Team

The founders recognized three critical gaps in voice interfaces:

  1. Emotional Flatness: Existing systems felt transactional rather than conversational
  2. Hardware Limitations: AR/VR over-indexed on visual displays rather than audio-first interactions
  3. Technical Fragmentation: Traditional cascading TTS pipelines couldn't handle real-time vocal nuance (a latency sketch follows this list)
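
To make the fragmentation point concrete, the sketch below contrasts a cascaded pipeline, where each stage must finish before the next begins, with an end-to-end model that can start emitting audio tokens almost immediately. The stage latencies are illustrative assumptions, not measurements of any specific system.

```python
# Schematic comparison of time-to-first-audio in a cascaded voice pipeline
# versus an end-to-end model. All latency figures are illustrative
# assumptions, not measurements of any real system.
CASCADED_STAGES_MS = {
    "speech-to-text": 300,   # transcribe the user's turn
    "language model": 500,   # generate the full text reply
    "text-to-speech": 400,   # synthesize audio from that text
}

def cascaded_first_audio_ms() -> int:
    # Each stage waits for the previous stage's complete output,
    # so the delays simply add up.
    return sum(CASCADED_STAGES_MS.values())

def end_to_end_first_audio_ms(per_frame_ms: int = 40, frames_needed: int = 3) -> int:
    # A unified model emits audio tokens directly, so playback can start
    # after only a few frames are generated.
    return per_frame_ms * frames_needed

print(f"cascaded pipeline: ~{cascaded_first_audio_ms()} ms before any audio")
print(f"end-to-end model:  ~{end_to_end_first_audio_ms()} ms to first audio")
```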

The collaboration began through: