AI/ML, experimental

Voice-First Interface

An experiment in voice-driven web UIs with real-time transcription and natural language commands. The trigger was watching how often I reach for keyboard shortcuts in tools I use daily, and wondering whether voice could be a faster path for some of those interactions. The answer is: sometimes, on the right device, in the right room. This prototype combines Whisper transcription with an intent classifier and a small command router. It is interesting, not a product.

Web Speech API, Whisper, OpenAI, React

What this is

A lab, not a product.

5 Features
4 Learnings
4 Technologies

Capabilities

What it does

The features that actually got built and run in this prototype.

feature_01.ts
Real-time transcription with Whisper as a fallback when the browser API is poor
feature_02.ts
Natural language command routing with a small intent classifier on top of the transcript
feature_03.ts
Voice-controlled navigation across the app with predictable, stable command names
feature_04.ts
Multi-language support driven by Whisper, with auto-detection of input language
feature_05.ts
Accessibility-focused design, relevant to healthcare
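
The Whisper fallback in the first feature comes down to a per-utterance decision: trust the browser transcript or send the audio to Whisper. The sketch below shows one way that decision could be made. It is not the prototype's code. It assumes the browser result carries the `transcript` and `confidence` fields that the Web Speech API exposes on a recognition alternative; the type names, `chooseSource`, and the threshold values are hypothetical.

```typescript
// Hypothetical types and names, for illustration only.
// `transcript` and `confidence` mirror the fields on a Web Speech API
// SpeechRecognitionAlternative; everything else is an assumption.
type BrowserResult = { transcript: string; confidence: number };
type TranscriptSource = "browser" | "whisper";

interface FallbackPolicy {
  minConfidence: number; // below this, the browser transcript is not trusted
  minLength: number; // shorter transcripts are treated as unreliable
}

// Assumed defaults; real values would need tuning per device and browser.
const defaultPolicy: FallbackPolicy = { minConfidence: 0.75, minLength: 2 };

// Decide which engine's transcript to use for one utterance.
// `browser` is null when the browser API is unavailable or returned nothing.
function chooseSource(
  browser: BrowserResult | null,
  policy: FallbackPolicy = defaultPolicy
): TranscriptSource {
  if (browser === null) return "whisper";
  const text = browser.transcript.trim();
  if (text.length < policy.minLength) return "whisper";
  if (browser.confidence < policy.minConfidence) return "whisper";
  return "browser";
}
```

Keeping the decision in a pure function means the thresholds can be tested and tuned without a microphone or a network call.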

The stack

What it is built with

The libraries and runtimes I picked for this lab and why they earned their place.

Web Speech API
Whisper
OpenAI
React

What I learned

Learnings, in order of how much they surprised me

The things I would tell another engineer before they tried the same experiment.

01
Browser Speech API quality varies hugely across devices. Safari, Chrome, and Firefox give wildly different transcripts.
02
Command disambiguation needs robust intent classification, not pattern matching. Users will not say what you expect.
03
Visual feedback during voice input is the difference between trust and confusion. Show the live transcript.
04
Voice is great in private, awkward in shared offices, and unusable on a train. Context dominates the experience.
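
To make the second learning concrete, here is a minimal sketch of a classifier that returns one of three outcomes (a match, an ambiguous result, or nothing) and a router that acts on each. It uses plain keyword-overlap scoring as a stand-in, which is much closer to pattern matching than to the robust classification the learning calls for; the point is the shape of the interface, not the scoring. The command names, keywords, and function names are all hypothetical and do not come from the prototype.

```typescript
// Hypothetical command names and keywords, for illustration only.
type Intent = { name: string; keywords: string[] };

const intents: Intent[] = [
  { name: "nav.home", keywords: ["home", "start", "dashboard"] },
  { name: "nav.settings", keywords: ["settings", "preferences", "options"] },
  { name: "nav.back", keywords: ["back", "previous", "return"] },
];

type Classification =
  | { kind: "match"; name: string; score: number }
  | { kind: "ambiguous"; candidates: string[] }
  | { kind: "none" };

// Lowercase the transcript and split it into alphanumeric tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Score each intent by how many of its keywords appear in the transcript.
// A tie for the top score is reported as ambiguous instead of guessed.
function classify(transcript: string, known: Intent[] = intents): Classification {
  const tokens = new Set(tokenize(transcript));
  const scored = known
    .map((intent) => ({
      name: intent.name,
      score: intent.keywords.filter((k) => tokens.has(k)).length,
    }))
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score);

  const best = scored[0];
  if (best === undefined) return { kind: "none" };
  const tied = scored.filter((s) => s.score === best.score);
  if (tied.length > 1) {
    return { kind: "ambiguous", candidates: tied.map((s) => s.name) };
  }
  return { kind: "match", name: best.name, score: best.score };
}

type Handler = () => string;

// Route a transcript to a handler keyed by stable command name.
// Ambiguous and unmatched input produce a message the UI can show.
function route(transcript: string, handlers: Map<string, Handler>): string {
  const result = classify(transcript);
  switch (result.kind) {
    case "match": {
      const handler = handlers.get(result.name);
      return handler ? handler() : `no handler for ${result.name}`;
    }
    case "ambiguous":
      return `did you mean: ${result.candidates.join(", ")}?`;
    case "none":
      return "not understood";
  }
}
```

With this shape, "take me back home" hits one keyword each for two intents and comes back as ambiguous, which is the case a first-match pattern table would silently get wrong. Swapping the scorer for a real classifier would leave the router unchanged.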

Note: This is an experimental project. It is a learning exercise and technical exploration rather than a production-ready solution. Patterns and code may change.

Want me to build something like this for you?

If this kind of work fits your roadmap, I take on a small number of paid projects each quarter.