Let’s talk about using JavaScript for speech synthesis, also known as text to speech. You can use it to make the browser read text out loud, which is pretty neat. It’s all done with vanilla JavaScript and surprisingly easy to get started with, though you’ll start uncovering quirks as you dive deeper into it.
Naturally, you’ll want to turn your sound on for the demos in this article.
Quick Start
Here’s the code to make text to speech happen.
const utterance = new SpeechSynthesisUtterance('Hello!');
window.speechSynthesis.speak(utterance);
Seriously, that’s it! Here’s a demo to hear for yourself.
See the Pen Simple Text to Speech by Will Boyd (@lonekorean) on CodePen.
The demo does add a line of code worth discussing.
window.speechSynthesis.cancel();
If window.speechSynthesis is already speaking and you ask it to speak something else, it’ll get added to a queue. For the demo, I’d rather nix the queue and cancel any speaking in progress so that new text can be spoken immediately. That’s what cancel() does.
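If you want that cancel-then-speak behavior in more than one place, it can be wrapped in a small helper. This is only a sketch: speakNow and its makeUtterance parameter are hypothetical names, not part of the Web Speech API. Passing the synthesizer and the utterance factory in as arguments keeps the helper free of browser globals.

```javascript
// Hypothetical helper (not part of the Web Speech API): drop the queue, then speak.
// `synth` is expected to be window.speechSynthesis; `makeUtterance` builds the
// utterance object from the text.
function speakNow(synth, makeUtterance, text) {
  synth.cancel(); // stop any speaking in progress and clear the queue
  const utterance = makeUtterance(text);
  synth.speak(utterance); // nothing is queued ahead of this utterance now
  return utterance;
}

// In a browser, call it from a click handler or similar:
// speakNow(window.speechSynthesis, (t) => new SpeechSynthesisUtterance(t), 'Hello!');
```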
Customizing the Voice
You can change the voice used when speaking. The available voices will differ depending on browser and OS. Use the following to get an array of available voices (SpeechSynthesisVoice objects as documented here).
const voices = window.speechSynthesis.getVoices();
Be warned that while some voices are available immediately, other voices may be added asynchronously. Fortunately, you can detect when new voices are added.
window.speechSynthesis.onvoiceschanged = function() {
  const updatedVoices = window.speechSynthesis.getVoices();
};
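Since voices can show up both immediately and later, it can help to handle the two cases in one place. Here’s a minimal sketch of that idea. watchVoices is a hypothetical helper name, and it assigns onvoiceschanged rather than calling addEventListener because of the Safari quirk listed in the appendix.

```javascript
// Hypothetical helper: call `callback` with the voice list right away (if any
// voices are already available) and again whenever the browser reports a change.
function watchVoices(synth, callback) {
  const report = () => {
    const voices = synth.getVoices();
    if (voices.length > 0) {
      callback(voices);
    }
  };
  synth.onvoiceschanged = report; // covers voices that are added asynchronously
  report(); // covers voices that are available immediately
}

// Browser usage:
// watchVoices(window.speechSynthesis, (voices) => {
//   console.log(voices.length + ' voices available');
// });
```

Note that the callback may run more than once, so anything it builds (such as a dropdown of voices) should be rebuilt rather than appended to.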
To use a voice, get it from the array and set it on the SpeechSynthesisUtterance object before speaking. Here’s an example that uses the last voice in the array.
const voices = window.speechSynthesis.getVoices();
const lastVoice = voices[voices.length - 1];
const utterance = new SpeechSynthesisUtterance('Hello!');
utterance.voice = lastVoice; // change voice
window.speechSynthesis.speak(utterance);
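Grabbing the last voice is fine for a demo, but in practice you’ll probably want a voice for a particular language. Here’s one possible approach, sketched under the assumption that matching on the lang property of each SpeechSynthesisVoice is good enough. pickVoice is a hypothetical helper, and its fallback order (exact match, same base language, default voice, first voice) is just one reasonable choice.

```javascript
// Hypothetical helper: choose a voice for a language tag such as 'en-GB'.
// Works on any array of objects with `lang` and `default` properties,
// which SpeechSynthesisVoice objects have.
function pickVoice(voices, lang) {
  const wanted = lang.toLowerCase();

  // 1. Exact language tag match, e.g. 'en-GB'.
  const exact = voices.find((v) => v.lang.toLowerCase() === wanted);
  if (exact) return exact;

  // 2. Same base language, e.g. any 'en-*' voice.
  const base = wanted.split('-')[0];
  const sameLanguage = voices.find((v) => v.lang.toLowerCase().split('-')[0] === base);
  if (sameLanguage) return sameLanguage;

  // 3. Fall back to the default voice, then the first voice, then nothing.
  return voices.find((v) => v.default) || voices[0] || null;
}

// Browser usage:
// const utterance = new SpeechSynthesisUtterance('Hello!');
// const voice = pickVoice(window.speechSynthesis.getVoices(), 'en-GB');
// if (voice) utterance.voice = voice;
// window.speechSynthesis.speak(utterance);
```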
You can also adjust pitch, rate, and volume properties. These all default to 1 if not specified.
const utterance = new SpeechSynthesisUtterance('Hello!');
utterance.pitch = 0.7; // a little lower
utterance.rate = 1.4; // a little faster
utterance.volume = 0.8; // a little quieter
window.speechSynthesis.speak(utterance);
Here’s where it gets messy. Different voices can have different ranges of usable values for pitch and rate. On top of that, different browsers have their own quirks when setting these properties. I’ll detail the quirks I found near the end of this article.
My testing found that it’s safest to keep both pitch and rate between 0.1 and 2, inclusive. Thankfully, volume is easy: 0 to 1, no surprises.
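If the values come from user input, one way to stay inside those ranges is to clamp them before setting them on the utterance. The sketch below uses the ranges mentioned above. SAFE_RANGES, clampSetting, and applySettings are hypothetical names, and falling back to 1 for non-numeric input is an assumption that mirrors the defaults.

```javascript
// Ranges taken from the text above; the fallback of 1 mirrors the default values.
const SAFE_RANGES = {
  pitch: { min: 0.1, max: 2, fallback: 1 },
  rate: { min: 0.1, max: 2, fallback: 1 },
  volume: { min: 0, max: 1, fallback: 1 },
};

// Hypothetical helper: force a single setting into its safe range.
function clampSetting(name, value) {
  const { min, max, fallback } = SAFE_RANGES[name];
  if (typeof value !== 'number' || Number.isNaN(value)) return fallback;
  return Math.min(max, Math.max(min, value));
}

// Hypothetical helper: copy only the provided settings onto an utterance.
function applySettings(utterance, settings) {
  for (const name of Object.keys(SAFE_RANGES)) {
    if (name in settings) {
      utterance[name] = clampSetting(name, settings[name]);
    }
  }
  return utterance;
}

// Browser usage:
// const utterance = new SpeechSynthesisUtterance('Hello!');
// applySettings(utterance, { pitch: 0.7, rate: 3 }); // rate is clamped to 2
// window.speechSynthesis.speak(utterance);
```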
Alright, let’s try all this stuff out!
See the Pen Text to Speech With Voice Customization by Will Boyd (@lonekorean) on CodePen.
Pausing and Resuming
Speaking can be paused and resumed. It’s pretty straightforward.
window.speechSynthesis.pause();
window.speechSynthesis.resume();
Sadly, pausing/resuming does not work on Android. I’ll talk more about it when I cover quirks later in this article.
To find out if speaking is paused, check window.speechSynthesis.paused. To find out if speaking is in progress, check window.speechSynthesis.speaking. Note that these are not mutually exclusive! Speaking is considered in progress even if it’s paused. Also, it’s possible to be in a paused state even when there is nothing to speak.
Events
You can add event listeners to a SpeechSynthesisUtterance object to react to the following events.
- 'start' fires when speaking starts.
- 'pause' fires when speaking is paused.
- 'resume' fires when speaking is resumed.
- 'end' fires when speaking reaches the end of the text. Browsers other than Safari will also fire this when speaking is cancelled.
- 'boundary' fires when speaking reaches a new word or sentence. It does not fire on Android, unfortunately. We’ll talk more about this one in a bit.
All of these are SpeechSynthesisEvent events that provide a few extra properties you might find handy. Here’s an example of handling one of them.
const utterance = new SpeechSynthesisUtterance('These are words!');
utterance.addEventListener('pause', function(event) {
  console.log('Paused after ' + event.elapsedTime + 'ms.');
});
window.speechSynthesis.speak(utterance);
'boundary' events come with a name property that will be set to either 'word' or 'sentence' as appropriate. 'boundary' events also provide a charIndex value in all browsers and a charLength value in all browsers except Safari. Putting these together, you can tell which word in the text is being spoken at that moment.
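Here’s a rough sketch of how that lookup could work. wordAtBoundary is a hypothetical helper. When charLength is missing (as in Safari) it falls back to reading up to the next whitespace, which is an assumption on my part and means trailing punctuation comes along for the ride.

```javascript
// Hypothetical helper: extract the word a 'boundary' event points at.
function wordAtBoundary(text, charIndex, charLength) {
  if (typeof charLength === 'number' && charLength > 0) {
    return text.slice(charIndex, charIndex + charLength);
  }
  // Fallback when charLength is unavailable: read up to the next whitespace.
  const match = text.slice(charIndex).match(/^\S+/);
  return match ? match[0] : '';
}

// Browser usage:
// const utterance = new SpeechSynthesisUtterance('These are words!');
// utterance.addEventListener('boundary', function(event) {
//   if (event.name !== 'word') return; // skip sentence boundaries
//   console.log(wordAtBoundary(utterance.text, event.charIndex, event.charLength));
// });
// window.speechSynthesis.speak(utterance);
```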
Our final demo has buttons to play, pause, and stop speaking. It listens for 'start', 'pause', 'resume', and 'end' events to enable/disable buttons when appropriate and uses 'boundary' events to highlight the current word.
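To give a feel for the button logic, here’s one way the event-to-button mapping might look. This is a guess at a sensible mapping, not the demo’s actual code. buttonStates is a hypothetical function, and which buttons should be enabled in each state is a design choice.

```javascript
// Hypothetical mapping from the most recent utterance event to which
// buttons should be enabled (true) or disabled (false).
function buttonStates(eventType) {
  switch (eventType) {
    case 'start':
    case 'resume':
      // Actively speaking: can pause or stop, but not start again.
      return { play: false, pause: true, stop: true };
    case 'pause':
      // Paused: play acts as resume, and stop is still allowed.
      return { play: true, pause: false, stop: true };
    case 'end':
    default:
      // Finished (or nothing started yet): only play makes sense.
      return { play: true, pause: false, stop: false };
  }
}

// Browser usage (playButton, pauseButton, and stopButton are assumed elements):
// ['start', 'pause', 'resume', 'end'].forEach(function(type) {
//   utterance.addEventListener(type, function() {
//     const states = buttonStates(type);
//     playButton.disabled = !states.play;
//     pauseButton.disabled = !states.pause;
//     stopButton.disabled = !states.stop;
//   });
// });
```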
See the Pen Text to Speech With Event Handling by Will Boyd (@lonekorean) on CodePen.
Appendix of Quirks
JavaScript speech synthesis has pretty good browser support, but is still considered experimental. I found it easy to use at a basic level, but the more I dug in, the quirkier things got. I’ll list the bugs and gotchas I found here to hopefully help save some of your sanity.
Using a different SpeechSynthesisVoice for the voice:
- [Chrome/Firefox on Android] The voice cannot be changed from the device’s default voice.
- [Safari] addEventListener is undefined on window.speechSynthesis, so you can’t use it to listen for the 'voiceschanged' event. Use onvoiceschanged instead, as documented here.
- [Safari] Be mindful that SpeechSynthesisVoice.name is not always unique. If you need a unique identifier for voices, use SpeechSynthesisVoice.voiceURI.
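For the voiceURI point, here’s a tiny sketch of keying voices by voiceURI instead of name, for example to back a dropdown. indexVoicesByURI is a hypothetical helper, and it assumes voiceURI values are unique within the list.

```javascript
// Hypothetical helper: build a lookup from voiceURI to voice object.
// Assumes each voice has a distinct voiceURI.
function indexVoicesByURI(voices) {
  const byURI = new Map();
  for (const voice of voices) {
    byURI.set(voice.voiceURI, voice);
  }
  return byURI;
}

// Browser usage:
// const byURI = indexVoicesByURI(window.speechSynthesis.getVoices());
// utterance.voice = byURI.get(selectElement.value); // option values hold voiceURI
```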
Setting SpeechSynthesisUtterance properties:
- [Safari] All values for pitch at 0.5 and below sound the same.
- [Safari] If speaking happens with a rate of 0.5 or higher, and the rate is then changed to below 0.5, subsequent speaking will retain the previous higher rate.
- [Edge] When using a non-local voice, any value set for pitch is ignored and will always sound like it’s at 1.
- [Chrome] When using a non-local voice, setting pitch to 0 will revert to 1.
- [Chrome] When using a non-local voice, speaking will not happen if rate is higher than 2.
- [all browsers] Some voices may further constrain the usable range of values for pitch and rate.
Handling SpeechSynthesisEvent events:
- [Chrome/Firefox on Android] The 'boundary' event does not fire.
- [Safari] The 'end' event does not fire if speaking is stopped via cancel().
- [Safari] charLength is not provided on 'boundary' events.
- [macOS] The 'boundary' event will never have a name value of 'sentence'. Instead, you’ll see an extra event with 'word'.
- [Chrome/Edge on Windows] 'boundary' events that happen for a sentence will provide a charLength of 0, not the length of the sentence.
- [Chrome/Edge/Safari] charIndex is always 0 for events that are not 'boundary' events.
Other stuff:
- [Chrome/Firefox on Android] Pausing and resuming does not work.
- [iOS] Speaking is inaudible when the “soft mute” switch is on.
- [Android/iOS] There are a couple more issues on mobile devices that I haven’t mentioned specifically, but you can read about them in this helpful write-up.
Conclusion
Whew! That’s a lot of quirks. It’s disappointing to see so many issues, especially the ones on Android, but again, this is an experimental feature.
JavaScript text to speech at its basic level actually has really good cross-browser compatibility; just be careful if you go beyond the basics. I still had a lot of fun playing with it. Hopefully future development will iron out all these quirks.