Building a Hands-Free AI Assistant: Speech Recognition Meets LLMs

Posted on May 20, 2025 by David H Sells


TL;DR

I built a hands-free AI assistant that lets you talk to an LLM without touching your keyboard. Speak, wait for silence, and let the AI respond with its synthesized voice. All using JavaScript, WebSockets, and the Web Speech API. Code included!

The "Why?"

Ever had that moment when you're elbow-deep in cookie dough and suddenly need to convert tablespoons to milliliters? Or maybe you're changing a tire and need to remember the proper torque settings?

I found myself constantly wanting to talk to AI assistants without having to touch anything. Sure, there are commercial solutions like Alexa and Google Assistant, but I wanted something:

  1. That I could customize completely
  2. That would use my choice of language model
  3. That wouldn't constantly listen and send audio to the cloud
  4. That I could host on my own hardware

So I built this hands-free LLM interface that uses speech recognition to understand you, sends your question to any LLM, and then speaks the response back to you.

The Magic Ingredients

Our speech-powered AI assistant requires three main components:

  1. Speech Recognition - To understand what you're saying
  2. LLM API Communication - To get intelligent responses
  3. Speech Synthesis - To speak those responses back to you

Let's dive into how we built each part!

The Architecture: Client and Server in Perfect Harmony

There are two main parts to our application:

  1. Client (index.html) - Handles speech recognition and synthesis in the browser
  2. Server (index.js) - Communicates with our LLM API

Client-Server Architecture Diagram

The Client: Teaching Your Browser to Listen and Speak

Our client code (in index.html) does two critical things:

  • Listens for your voice input until you stop talking
  • Speaks the AI's response back to you

Speech Recognition: It's All About the Silence

The challenge with speech recognition isn't getting the words—it's knowing when you're done talking! Our solution uses a silence detection approach that automatically stops listening after you've been quiet for a few seconds.

function handleSpeechResult(event) {
    // Get the text you've spoken so far
    const result = event.results[event.results.length - 1][0].transcript;
    
    // Reset our silence timer
    if (silenceTimer) clearTimeout(silenceTimer);
    
    // Start a new silence timer - if you stop talking, this will trigger
    silenceTimer = setTimeout(() => {
        // You've been quiet long enough, stop listening
        this.stop();
        processFinalSpeech(result);
    }, CONFIG.silenceTimeout);
}

This is genius in its simplicity. Every time you say something, we reset the timer. When you stop talking, the timer counts down and then triggers our processing function.
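
For context, here's roughly what the setup around that handler looks like. The post only shows the handler itself, so treat the CONFIG value and the recognition options below as a reconstruction rather than the exact code from the repo:

// Setup sketch: CONFIG and the recognition object the handler above relies on.
const CONFIG = {
    silenceTimeout: 2000   // ms of quiet before we decide you're done (about 2 seconds)
};

let silenceTimer = null;

// Chrome and Safari still expose the API under the webkit prefix.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.continuous = true;       // keep listening across natural pauses
recognition.interimResults = true;   // deliver results while you're still speaking

// Because handleSpeechResult runs as an event handler, `this` inside it is the
// recognition object, which is why this.stop() ends the session.
recognition.onresult = handleSpeechResult;

The interim results are what make the silence timer work: every partial transcript resets the countdown.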

Speech Synthesis: Making Your Computer Talk Back

Once we get the AI's response, we use the browser's built-in speech synthesis to read it aloud:

function speakText(text) {
    const utterance = new SpeechSynthesisUtterance(text);
    speechSynthesis.speak(utterance);
}

Browser speech synthesis might not sound like Morgan Freeman, but it's surprisingly good these days. And unlike recorded audio, it can say literally anything our AI responds with!
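
If you want more control over how it sounds, the same API exposes voice, rate, and pitch. Here's a slightly fancier version of speakText as a sketch; voice names vary by browser and OS, so the English-voice filter is just an example:

function speakText(text) {
    const utterance = new SpeechSynthesisUtterance(text);

    // Prefer an English voice if the browser has loaded one; otherwise use the default.
    // Note: getVoices() can return an empty array until the 'voiceschanged' event fires.
    const voices = speechSynthesis.getVoices();
    const english = voices.find(v => v.lang.startsWith('en'));
    if (english) utterance.voice = english;

    utterance.rate = 1.0;    // 0.1 to 10, where 1.0 is normal speed
    utterance.pitch = 1.0;   // 0 to 2, where 1.0 is normal pitch

    speechSynthesis.speak(utterance);
}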

The Server: Middleware Magic

The server part (in index.js) is the bridge between your voice and the AI's brain. It:

  1. Hosts the HTML interface
  2. Handles WebSocket connections for real-time communication
  3. Sends your spoken text to the LLM API

The most interesting part is how we talk to the LLM:

async function queryLLM(ws, message) {
    try {
        const response = await got(LLM_API_ENDPOINT, {
            method: 'POST',
            headers: {
                'content-type': 'application/json',
                'Authorization': `Bearer ${API_KEY}`,
            },
            json: {
                model: LLM_MODEL,
                messages: [
                    { role: 'user', content: message }
                ],
                max_tokens: 1000
            }
        });

        const data = JSON.parse(response.body);
        const content = data.choices[0].message.content;
        ws.send(content);
    } catch (error) {
        console.error('Error querying LLM:', error);
        ws.send('Sorry, I encountered an error processing your request.');
    }
}

This function takes what you said, packages it for the LLM API, and sends the response back to your browser via WebSockets. The beauty of this approach is that it works with virtually any LLM API that follows the standard chat completions format.
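
For completeness, here's a rough sketch of how the rest of index.js might wire that up. The post doesn't show this part, so the choice of Express for static hosting and the ws package for WebSockets (plus the port) is mine, one common way to do it:

import express from 'express';
import { WebSocketServer } from 'ws';
import got from 'got';   // used by the queryLLM function shown above

const app = express();
app.use(express.static('public'));   // serves index.html

const server = app.listen(3000, () => {
    console.log('Listening on http://localhost:3000');
});

// Attach the WebSocket server to the same HTTP server.
const wss = new WebSocketServer({ server });

wss.on('connection', (ws) => {
    // Each incoming message is the final speech transcript from the browser.
    ws.on('message', (data) => {
        queryLLM(ws, data.toString());
    });
});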

The Whole Conversation Flow

Here's what happens when you use this application:

  1. You click "Start Listening"
  2. Your browser asks for microphone permission
  3. You speak your question or command
  4. You stop talking and wait (for about 2 seconds)
  5. The browser detects silence and sends your speech text to the server
  6. The server forwards your text to the LLM API
  7. The LLM thinks and generates a response
  8. The response travels back through the server to your browser
  9. Your browser speaks the response aloud

It's like a digital game of telephone, except nothing gets lost in translation!
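
On the client side, steps 5, 8, and 9 of that flow are just a little WebSocket plumbing. Here's a sketch of what processFinalSpeech and the message handler might look like, assuming the page connects back to whichever host served it:

// Connect back to the server that served the page, matching its protocol.
const protocol = location.protocol === 'https:' ? 'wss' : 'ws';
const ws = new WebSocket(`${protocol}://${location.host}`);

// Step 5: silence detected, so send the final transcript to the server.
function processFinalSpeech(text) {
    if (text && text.trim()) {
        ws.send(text);
    }
}

// Steps 8 and 9: the LLM's reply arrives and gets spoken aloud.
ws.onmessage = (event) => {
    speakText(event.data);
};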

Customization Ideas

You can easily adapt this code for your needs:

  • Change the LLM to use OpenAI, Claude, PaLM, or any other API that speaks the chat completions format (see the sketch after this list)
  • Adjust the silence timeout (CONFIG.silenceTimeout) to suit how long you pause between sentences
  • Modify the UI design
  • Add conversation history
  • Implement voice authentication
  • Add wake word detection
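
The first two tweaks boil down to constants: the timeout is the CONFIG.silenceTimeout value from the client sketch earlier, and the LLM choice lives in a few server-side constants like these (the endpoint and model shown are examples, not necessarily what the repo ships with):

// index.js: point these at whichever chat-completions-style API you prefer
const LLM_API_ENDPOINT = 'https://api.openai.com/v1/chat/completions';
const LLM_MODEL = 'gpt-4o-mini';
const API_KEY = process.env.LLM_API_KEY;   // keep the key in the environment, not in source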

The Technical Challenges I Faced

Building this wasn't all sunshine and JavaScript. Here are some hurdles I overcame:

  1. Browser Compatibility: The Web Speech API isn't universally supported (I'm looking at you, Firefox)
  2. Silence Detection: Finding the right timeout value that doesn't cut you off mid-sentence but also doesn't wait forever
  3. WebSocket Stability: Ensuring connections remain stable and reconnect if broken (a reconnection sketch follows this list)
  4. API Rate Limiting: Managing requests to stay within API limits
  5. Voice Synthesis Quality: Working with the limitations of browser-based speech synthesis
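
For the WebSocket stability problem, a simple exponential-backoff reconnect on the client goes a long way. Here's one way to fold the connection code from earlier into a retry loop; it's a sketch, not the repo's exact approach:

let ws;
let retryDelay = 1000;   // start at 1 second, back off up to 30 seconds

function connect() {
    const protocol = location.protocol === 'https:' ? 'wss' : 'ws';
    ws = new WebSocket(`${protocol}://${location.host}`);

    ws.onopen = () => {
        retryDelay = 1000;   // reset the backoff once we're connected again
    };

    ws.onmessage = (event) => speakText(event.data);

    ws.onclose = () => {
        // Try again after a delay, doubling it each time up to the cap.
        setTimeout(connect, retryDelay);
        retryDelay = Math.min(retryDelay * 2, 30000);
    };
}

connect();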

Why This Matters: The Future of Human-Computer Interaction

Voice interfaces are becoming increasingly important. They're not just convenient—they're essential for:

  • Accessibility for those with mobility impairments
  • Hands-free operation in industrial, medical, or culinary settings
  • Reducing screen time while maintaining productivity
  • Creating more natural human-computer interactions

Conclusion: Talk Is No Longer Cheap—It's Valuable!

This project demonstrates how relatively simple web technologies can create a powerful hands-free AI assistant. The combination of speech recognition, LLMs, and speech synthesis opens up entirely new ways to interact with artificial intelligence.

By understanding when you've stopped talking, sending your words to an LLM, and speaking the response, we've created a system that feels almost like talking to a real person—except this person knows practically everything (or at least pretends to!).

So next time you're up to your elbows in engine grease, bread dough, or finger paint, remember that your AI assistant is just a few spoken words away!


Code Download

Full code is available on my GitHub: https://home.davidhsells.ca/Public/voicechat.git


Have you built something similar or have ideas for improvements? Let me know in the comments below!

Tags: #JavaScript #AI #SpeechRecognition #LLM #WebDevelopment #Accessibility