Experimenting with voice on the web using the Web Speech Synthesis and Recognition API
The Web Speech API is one of those web technologies that rarely gets talked or written about. In this blog post, we are going to take a closer look at what the API is capable of, what its strengths and limitations are, and how web developers can use it to enhance the user’s browsing experience.
“The Web Speech API enables you to incorporate voice data into web apps. The Web Speech API has two parts: SpeechSynthesis (Text-to-Speech), and SpeechRecognition (Asynchronous Speech Recognition)” — Mozilla Developer Network
Can the Web Speech API be used to interact with complex web forms? This is our research question for this blog post, and we are going to use both parts of the API to answer it. But let’s not get ahead of ourselves, let’s start by learning how the API actually works.
Making browsers talk with Speech Synthesis
Screenshot: Microsoft Edge (left) and Google Chrome (right).
Demo: https://experiment-web-speech.now.sh/pages/blog.html
Controlling the browser’s voice using Speech Utterance
We are going to use the Speech Synthesis API to read out one of our previous blog posts. The speechSynthesis interface is the service controller. It can be used to retrieve the available synthesis voices on the user’s device, play and pause the speech, and much more. The SpeechSynthesisUtterance interface represents a speech request. It contains the text that the speechSynthesis service will read out, and basic information like the voice’s pitch, volume, language, and rate.
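To illustrate the controller’s role, here is a minimal sketch of pause and resume controls built on the speechSynthesis interface. The pauseButton and resumeButton element references are assumptions made for this example:
// Sketch: basic playback controls using the global speechSynthesis controller.
// pauseButton and resumeButton are hypothetical DOM references.
pauseButton.addEventListener("click", () => {
  if (speechSynthesis.speaking && !speechSynthesis.paused) {
    speechSynthesis.pause();
  }
});

resumeButton.addEventListener("click", () => {
  if (speechSynthesis.paused) {
    speechSynthesis.resume();
  }
});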
const speechSynthesis = window.speechSynthesis;
const speechUtterance = new SpeechSynthesisUtterance();
function isPreferredVoice(voice) {
  return ["Google US English", "Microsoft Jessa Online"].some(preferredVoice =>
    voice.name.startsWith(preferredVoice)
  );
}
function onVoiceChange() {
  speechSynthesis.addEventListener("voiceschanged", () => {
    const voices = speechSynthesis.getVoices();
    speechUtterance.voice = voices.find(isPreferredVoice);
    speechUtterance.lang = "en-US";
    speechUtterance.volume = 1;
    speechUtterance.pitch = 1;
    speechUtterance.rate = 1;
  });
}
When a website is fully loaded, the speechSynthesis API fetches all available voices asynchronously. Once done, it fires a voiceschanged event letting us know that everything is ready to go. Utterances added to the queue before this event fires will still work. They will, however, use the browser’s default voice with its default settings.
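If you prefer promises over event listeners, the voiceschanged handshake can be wrapped in a small helper. This is a sketch of one possible approach, not something the API provides out of the box:
// Sketch: resolve with the voice list as soon as it is available.
function loadVoices() {
  return new Promise(resolve => {
    const voices = speechSynthesis.getVoices();

    if (voices.length > 0) {
      resolve(voices); // The voices were already loaded.
    } else {
      speechSynthesis.addEventListener(
        "voiceschanged",
        () => resolve(speechSynthesis.getVoices()),
        { once: true }
      );
    }
  });
}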
The getVoices() method returns every loaded voice. This list contains both native voices and browser-specific ones. Not every browser provides custom voice services, but both Google Chrome and Microsoft Edge do. Generally speaking, these voices sound much better, but you are sacrificing privacy for quality. They also require an internet connection.
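You can tell the two kinds apart with the localService flag on each voice; the remote, browser-provided voices report false. A quick way to inspect what your browser offers:
// Log every voice, marking which ones are processed locally on the device.
speechSynthesis.getVoices().forEach(voice => {
  console.log(`${voice.name} (${voice.lang}) - local: ${voice.localService}`);
});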
Warning: While using Google Chrome’s custom voice service, each utterance instance has a character limit of 200-300. If the utterance’s text changes, the limit is reset. It is unknown whether this is a limitation or a bug.
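One possible workaround is to split long content into short chunks and queue one utterance per chunk. The sketch below splits on sentence boundaries and assumes individual sentences stay under the limit:
// Sketch: queue one short utterance per sentence to stay under the limit.
function speakInChunks(text) {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];

  sentences.forEach(sentence => {
    const chunk = new SpeechSynthesisUtterance(sentence.trim());
    chunk.voice = speechUtterance.voice; // Reuse the configured voice.
    speechSynthesis.speak(chunk); // speak() queues utterances in order.
  });
}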
The onVoiceChange function sets up our SpeechSynthesisUtterance instance once the voiceschanged event fires. Using the Array.find() method, I select my voice of choice, gracefully falling back to the browser’s default.
Reading content out loud with .speak()
// playButton and blogPost are DOM references to the play button and
// the blog post container, defined elsewhere on the page.
function onPlay() {
  playButton.addEventListener("click", () => {
    speechSynthesis.cancel();
    speechUtterance.text = getElementText(blogPost);
    speechSynthesis.speak(speechUtterance);
  });

  function getElementText(element) {
    return Array.from(element.children)
      .map(item => item.textContent.trim())
      .map(addPunctuation)
      .join(" ");
  }

  function addPunctuation(text) {
    const hasPunctuation = /[.!?]$/.test(text);
    return hasPunctuation ? text : text + ".";
  }
}
Everything is ready and configured; now we need some content that speechSynthesis can read aloud. The getElementText function loops through our blog post, concatenating the content of each element into a single string.
You may have noticed the addPunctuation function. This function makes sure that the speechSynthesis interface adds a pause after each sentence.
There is a rare Chrome bug that can freeze the speechSynthesis API for the entire browser window. Running the cancel() method before the speak() method will always unfreeze the API. Finally, we set the SpeechSynthesisUtterance text to the value returned by getElementText and bind everything to a click listener.
if (speechSynthesis && speechUtterance) {
  onVoiceChange();
  onPlay();
}
Now, all we need to do is check whether both interfaces are supported.
SpeechSynthesis is supported in all major browsers except Internet Explorer.
Listening to users with Speech Recognition
Note: This part of the experiment requires Google Chrome.
Demo: https://experiment-web-speech.now.sh/
Wouldn’t it be awesome to navigate a website using voice commands? Yes, it would, so let’s build it! We can easily achieve this using SpeechRecognition. For this demo, we are going to create three voice commands: “search for…”, “go to…”, and “navigate to…”, the last two being identical. The API will continuously listen for new commands and react to each one accordingly.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
let recognition;

if (SpeechRecognition) {
  recognition = new SpeechRecognition();
  recognition.lang = "en-US";
  recognition.continuous = true;
  recognition.interimResults = false;
}
If the user’s device supports the SpeechRecognition API, we configure it straight away. We want the API to keep listening for new commands after parsing the previous one; setting continuous to true achieves exactly that. For our use case, we do not want to receive interim results, we only need the final recognition, so we set interimResults to false. The lang property speaks for itself.
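For context, had we enabled interimResults, partial transcripts would arrive while the user is still speaking, flagged through isFinal. A purely illustrative sketch of handling them:
// Sketch: with interimResults = true, partial transcripts arrive early.
recognition.addEventListener("result", event => {
  const result = event.results[event.results.length - 1];

  if (result.isFinal) {
    console.log("Final transcript:", result[0].transcript);
  } else {
    console.log("Still listening:", result[0].transcript);
  }
});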
Executing voice commands
To track and manage all valid commands, we store them in an object called commands.
const commands = {
  // searchInput is a DOM reference to the site's search field.
  "search for": term => (searchInput.value = term),
  "navigate to": destination => navigateTo(destination),
  "go to": destination => navigateTo(destination)
};

function navigateTo(destination) {
  const destinationLocations = {
    home: "/",
    homepage: "/",
    contact: "/contact",
    blog: "/blog"
  };
  const location = destinationLocations[destination];

  if (location) {
    window.location.href = location;
  } else {
    console.log(`Unknown destination: '${destination}'`);
  }
}
Each command has a callback function. The function receives a parameter containing the remainder of the recognized speech. This means that if the user says “navigate to contact”, the “navigate to” command fires, and its callback function receives the string contact as its parameter.
function onSpeechRecognitionEvents() {
  recognition.addEventListener("result", event => {
    if (typeof event.results === "undefined") return;

    // The last result in the list is the most recent transcript.
    const transcript = event.results[event.results.length - 1][0].transcript
      .toLowerCase()
      .trim();

    for (let command in commands) {
      if (transcript.indexOf(command) === 0) {
        if (transcript[command.length] === undefined) {
          // The transcript is the command itself, without a parameter.
          commands[command]();
        } else {
          // Everything after the command is passed along as the parameter.
          const param = transcript
            .substring(command.length, transcript.length)
            .trim();
          commands[command](param);
        }
      }
    }
  });
}
The SpeechRecognition interface’s result event fires once the API has finished transcribing some user input. The event.results array keeps track of every returned result, which means the last array item is the latest recognized transcript.
We then loop through our commands object and check whether the recognized transcript is something we are listening for. If it is, we check whether the command received a parameter; in both cases, we call the command’s callback function.
if (SpeechRecognition) {
  onSpeechRecognitionEvents();
  recognition.start();
}
We are now ready to initialize our event listeners and call recognition.start() to start listening for user commands.
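One caveat worth knowing about: in Chrome, even a continuous recognition session can end on its own, for example after a network hiccup or a long stretch of silence. A common workaround, sketched here, is to restart recognition whenever it ends unexpectedly:
// Sketch: keep listening by restarting recognition whenever it ends.
let keepListening = true;

recognition.addEventListener("end", () => {
  if (keepListening) {
    recognition.start();
  }
});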
Putting Everything Together
Note: This part of the experiment requires Google Chrome.
Demo: https://experiment-web-speech.now.sh/pages/contact.html
For the final demo, I wanted to combine both interfaces to improve the user’s experience of filling out web forms. Once a user focuses on an input field, textarea, or button, its associated label is read aloud using speechSynthesis. The user can then fill in the form using their voice.
I have extracted most of the required code from the previous two demos, so I won’t explain everything in detail. I will, however, explain any new concept or workaround if applicable.
function speak(message) {
  if (speechSynthesis.speaking) {
    speechSynthesis.cancel();
  }

  speechUtterance.text = message;
  speechSynthesis.speak(speechUtterance);
}
Every time a form element is focused, the function above is called. The message parameter equals the text of the associated label. The speechSynthesis.cancel() call makes sure the current utterance is replaced instead of a new one being added to the queue. This is important in case the user is skipping through input fields using the tab key.
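The wiring between the form elements and speak() is not shown above. A minimal sketch could look like this, assuming every field’s label is linked through a for attribute:
// Sketch: read the associated label aloud when a form element gains focus.
document.querySelectorAll("input, textarea, button").forEach(element => {
  element.addEventListener("focus", () => {
    const label = document.querySelector(`label[for="${element.id}"]`);

    if (label) {
      speak(label.textContent.trim());
    }
  });
});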
function onSpeechUtteranceEvents() {
  speechUtterance.addEventListener("end", () => {
    speechRecognition.abort();
    speechRecognition.start();
  });
}
The SpeechRecognition and speechSynthesis interfaces do not work well together, at least not when they are active simultaneously. To prevent strange bugs and unwanted results, we make sure to only start transcribing the user’s input after the speechSynthesis interface is done speaking.
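The last missing piece is writing the recognized speech into the form. Here is a sketch of one way to do it, assuming the user keeps the field focused while dictating:
// Sketch: write the final transcript into the currently focused field.
speechRecognition.addEventListener("result", event => {
  const transcript = event.results[event.results.length - 1][0].transcript.trim();
  const focused = document.activeElement;

  if (focused && (focused.tagName === "INPUT" || focused.tagName === "TEXTAREA")) {
    focused.value = transcript;
  }
});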
Conclusion
The Web Speech API is powerful and somewhat underused. However, there are a few annoying bugs, and the SpeechRecognition interface is poorly supported. speechSynthesis works surprisingly well once you iron out all of its quirks and issues. Some browsers provide custom voices, and these voices sound better than the native ones. Their quality does vary a lot from browser to browser; Microsoft Edge’s voices sound much more natural than Google Chrome’s, for example.
SpeechRecognition works fine for short sentences, but results tend to get inconsistent with longer paragraphs. SpeechRecognition is also quite bad at recognizing more complex character sequences like email addresses and phone numbers. Thus, I would say that the Web Speech API is not suited for interacting with complex web forms. Yet, it does have its use cases if you are creative enough.