JavaScript Text to Speech and Its Many Quirks / Coder's Block

Let’s talk about using JavaScript for speech synthesis, also known as text to speech. You can use it to make the browser read text out loud, which is pretty neat. It’s all done with vanilla JavaScript and surprisingly easy to get started with, though you’ll start uncovering quirks as you dive deeper into it.

Naturally, you’ll want to turn your sound on for the demos in this article.

Quick Start

Here’s the code to make text to speech happen.

const utterance = new SpeechSynthesisUtterance('Hello!');
window.speechSynthesis.speak(utterance);

Seriously, that’s it! Here’s a demo to hear for yourself.

See the Pen Simple Text to Speech by Will Boyd (@lonekorean) on CodePen.

The demo does add a line of code worth discussing.

window.speechSynthesis.cancel();

If window.speechSynthesis is already speaking and you ask it to speak something else, it’ll get added to a queue. For the demo, I’d rather nix the queue and cancel any speaking in progress so that new text can be spoken immediately. That’s what cancel() does.
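If you want that cancel-then-speak behavior in one place, here is a minimal sketch. The name speakNow and its two optional parameters are my own invention for illustration, not part of the Web Speech API. They exist only so the function can be tried with stand-in objects outside a browser.

```javascript
// Hypothetical helper: drop the queue, stop current speech, then speak.
// `synth` and `createUtterance` default to the real browser API and are
// only parameters so the function can be exercised with stand-ins.
function speakNow(
  text,
  synth = window.speechSynthesis,
  createUtterance = (t) => new SpeechSynthesisUtterance(t)
) {
  synth.cancel(); // clears the queue and stops anything in progress
  const utterance = createUtterance(text);
  synth.speak(utterance); // nothing is queued ahead of it now
  return utterance;
}
```

In a page, calling speakNow('Hello!') should be enough, since the defaults pick up window.speechSynthesis.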

Customizing the Voice

You can change the voice used when speaking. The available voices differ depending on browser and OS. Use the following to get an array of available voices, each one a SpeechSynthesisVoice object.

const voices = window.speechSynthesis.getVoices();

Be warned that while some voices are available immediately, other voices may be added asynchronously. Fortunately, you can detect when new voices are added.

window.speechSynthesis.onvoiceschanged = function() {
  const updatedVoices = window.speechSynthesis.getVoices();
};
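Putting the two snippets above together, one possible approach is to use the voices right away if they are already there, and otherwise wait for onvoiceschanged. This is only a sketch: whenVoicesReady is a hypothetical name, and it assumes nothing else in your page has already assigned onvoiceschanged, since it overwrites that handler.

```javascript
// Hypothetical helper: call `callback` with the voice list once it is non-empty.
// `synth` defaults to the real API and is a parameter only for testing.
function whenVoicesReady(callback, synth = window.speechSynthesis) {
  const voices = synth.getVoices();
  if (voices.length > 0) {
    callback(voices); // voices were available immediately
    return;
  }
  // Otherwise wait until the browser reports that voices were added.
  // Assumption: no other code relies on onvoiceschanged being set.
  synth.onvoiceschanged = function () {
    const updatedVoices = synth.getVoices();
    if (updatedVoices.length > 0) {
      synth.onvoiceschanged = null; // notify only once
      callback(updatedVoices);
    }
  };
}
```

Keep in mind that if a browser never reports any voices, the callback never runs, so you may want your own fallback for that case.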

To use a voice, get it from the array and set it on the SpeechSynthesisUtterance object before speaking. Here’s an example that uses the last voice in the array.

const voices = window.speechSynthesis.getVoices();
const lastVoice = voices[voices.length - 1]; // undefined if no voices have loaded yet

const utterance = new SpeechSynthesisUtterance('Hello!');
if (lastVoice) {
  utterance.voice = lastVoice; // change voice
}
window.speechSynthesis.speak(utterance);

You can also adjust pitch, rate, and volume properties. These all default to 1 if not specified.

const utterance = new SpeechSynthesisUtterance('Hello!');
utterance.pitch = 0.7;  // a little lower
utterance.rate = 1.4;   // a little faster
utterance.volume = 0.8; // a little quieter
window.speechSynthesis.speak(utterance);

Here’s where it gets messy. Different voices can have different ranges of usable values for pitch and rate. On top of that, different browsers have their own quirks when setting these properties. I’ll detail the quirks I found near the end of this article.

My testing found that it’s safest to keep both pitch and rate between 0.1 and 2, inclusive. Thankfully, volume is easy: 0 to 1, no surprises.
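One way to stick to those ranges is to clamp values before assigning them. Here is a small sketch. The names clamp and safeSpeechSettings are made up for this example, and the limits are just the ranges from my testing above, not anything the API enforces.

```javascript
// Keep a number within [min, max].
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Hypothetical helper: returns settings limited to the ranges suggested
// in this article (pitch and rate 0.1 to 2, volume 0 to 1).
// Missing properties fall back to the default of 1.
function safeSpeechSettings({ pitch = 1, rate = 1, volume = 1 } = {}) {
  return {
    pitch: clamp(pitch, 0.1, 2),
    rate: clamp(rate, 0.1, 2),
    volume: clamp(volume, 0, 1),
  };
}
```

You could then apply the result with Object.assign(utterance, safeSpeechSettings({ pitch: 3, rate: 0 })), which would end up setting pitch to 2 and rate to 0.1.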

Alright, let’s try all this stuff out!

See the Pen Text to Speech With Voice Customization by Will Boyd (@lonekorean) on CodePen.

Pausing and Resuming

Speaking can be paused and resumed. It’s pretty straightforward.

window.speechSynthesis.pause();
window.speechSynthesis.resume();

Sadly, pausing/resuming does not work on Android. I’ll talk more about it when I cover quirks later in this article.

To find out if speaking is paused, check window.speechSynthesis.paused. To find out if speaking is in progress, check window.speechSynthesis.speaking. Note that these are not mutually exclusive! Speaking is considered in progress even if it’s paused. Also, it’s possible to be in a paused state even when there is nothing to speak.
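Since the two flags overlap, it can help to spell out all four combinations. The function below is a sketch for illustration. describeSpeechState and the labels it returns are my own, and the synth parameter is only there so you can pass in a stand-in object.

```javascript
// Hypothetical helper: turn the paused/speaking flags into a readable label.
function describeSpeechState(synth = window.speechSynthesis) {
  if (synth.speaking && synth.paused) {
    return 'paused mid-speech'; // still counts as speaking in progress
  }
  if (synth.speaking) {
    return 'speaking';
  }
  if (synth.paused) {
    return 'paused with nothing to speak';
  }
  return 'idle';
}
```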

Events

You can add event listeners to a SpeechSynthesisUtterance object to react to the following events.

'start' fires when speaking starts.
'pause' fires when speaking is paused.
'resume' fires when speaking is resumed.
'end' fires when speaking reaches the end of the text. Browsers other than Safari will also fire this when speaking is cancelled.
'boundary' fires when speaking reaches a new word or sentence. It does not fire on Android, unfortunately. We’ll talk more about this one in a bit.

All of these are SpeechSynthesisEvent events that provide a few extra properties you might find handy. Here’s an example of handling one of them.

const utterance = new SpeechSynthesisUtterance('These are words!');
utterance.addEventListener('pause', function(event) {
  console.log('Paused after ' + event.elapsedTime + 'ms.');
});
window.speechSynthesis.speak(utterance);
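If you want to watch all five events while experimenting, a loop over the event names works. This is a rough sketch. logUtteranceEvents and UTTERANCE_EVENTS are hypothetical names, and the log parameter exists only so the output can be captured instead of going to the console.

```javascript
// The five event names listed above.
const UTTERANCE_EVENTS = ['start', 'pause', 'resume', 'end', 'boundary'];

// Hypothetical helper: log every event fired on an utterance.
function logUtteranceEvents(utterance, log = console.log) {
  for (const type of UTTERANCE_EVENTS) {
    utterance.addEventListener(type, function (event) {
      // The name property is only meaningful for 'boundary' events.
      const detail = type === 'boundary' ? ' (' + event.name + ')' : '';
      log(type + detail);
    });
  }
}
```

Call logUtteranceEvents(utterance) before passing the utterance to speak(), so the 'start' event is not missed.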

'boundary' events come with a name property that will be set to either 'word' or 'sentence' as appropriate. 'boundary' events also provide a charIndex value in all browsers and a charLength value in all browsers except Safari. Putting these together, you can tell which word in the text is being spoken at that moment.
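Here is a sketch of how that lookup might go. wordAtBoundary is a hypothetical helper. When charLength is missing (as in Safari) or 0, it falls back to reading up to the next whitespace, which is my own assumption about what counts as a word and will include trailing punctuation.

```javascript
// Hypothetical helper: find the word being spoken at a 'boundary' event.
function wordAtBoundary(text, charIndex, charLength) {
  // Use charLength when the browser provides a positive number.
  if (typeof charLength === 'number' && charLength > 0) {
    return text.slice(charIndex, charIndex + charLength);
  }
  // Fallback (assumption): take everything up to the next whitespace.
  const match = /^\S+/.exec(text.slice(charIndex));
  return match ? match[0] : '';
}
```

Inside a 'boundary' listener, you would call it as wordAtBoundary(utterance.text, event.charIndex, event.charLength), probably after checking that event.name is 'word'.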

Our final demo has buttons to play, pause, and stop speaking. It listens for 'start', 'pause', 'resume', and 'end' events to enable/disable buttons when appropriate and uses 'boundary' events to highlight the current word.

See the Pen Text to Speech With Event Handling by Will Boyd (@lonekorean) on CodePen.

Appendix of Quirks

JavaScript speech synthesis has pretty good browser support, but is still considered experimental. I found it easy to use at a basic level, but the more I dug in, the quirkier things got. I’ll list the bugs and gotchas I found here to hopefully help save some of your sanity.

Using a different SpeechSynthesisVoice for the voice:

[Chrome/Firefox on Android] The voice cannot be changed from the device’s default voice.
[Safari] addEventListener is undefined on window.speechSynthesis, so you can’t use it to listen for the 'voiceschanged' event. Assign a handler to window.speechSynthesis.onvoiceschanged instead, as shown in the Customizing the Voice section.
[Safari] Be mindful that SpeechSynthesisVoice.name is not always unique. If you need a unique identifier for voices, use SpeechSynthesisVoice.voiceURI.
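For example, if you save the user’s chosen voice and need to find it again later, matching on voiceURI avoids the duplicate name problem. This is a minimal sketch, and findVoiceByURI is a made-up name.

```javascript
// Hypothetical helper: look up a voice by its voiceURI rather than its name.
// `voices` is the array returned by getVoices().
function findVoiceByURI(voices, voiceURI) {
  const found = voices.find(function (voice) {
    return voice.voiceURI === voiceURI;
  });
  return found || null; // null when no voice matches
}
```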

Setting SpeechSynthesisUtterance properties:

[Safari] All values for pitch at 0.5 and below sound the same.
[Safari] If speaking happens with a rate of 0.5 or higher, and the rate is then changed to below 0.5, subsequent speaking will retain the previous higher rate.
[Edge] When using a non-local voice, any value set for pitch is ignored and will always sound like it’s at 1.
[Chrome] When using a non-local voice, setting pitch to 0 will revert to 1.
[Chrome] When using a non-local voice, speaking will not happen if rate is higher than 2.
[all browsers] Some voices may further constrain the usable range of values for pitch and rate.

Handling SpeechSynthesisEvent events:

[Chrome/Firefox on Android] The 'boundary' event does not fire.
[Safari] The 'end' event does not fire if speaking is stopped via cancel().
[Safari] charLength is not provided on 'boundary' events.
[macOS] The 'boundary' event will never have a name value of 'sentence'. Instead, you’ll see an extra event with 'word'.
[Chrome/Edge on Windows] 'boundary' events that happen for a sentence will provide a charLength of 0, not the length of the sentence.
[Chrome/Edge/Safari] charIndex is always 0 for events that are not 'boundary' events.

Other stuff:

[Chrome/Firefox on Android] Pausing and resuming do not work.
[iOS] Speaking is inaudible when the “soft mute” switch is on.
[Android/iOS] There are a couple more issues on mobile devices that I haven’t mentioned specifically, but you can read about them in this helpful write-up.

Conclusion

Whew! That’s a lot of quirks. It’s disappointing to see so many issues, especially the ones on Android, but again, this is an experimental feature.

JavaScript text to speech has really good cross-browser compatibility at its basic level. Just be careful if you go beyond the basics. I still had a lot of fun playing with it. Hopefully future development will iron out all these quirks.