For me, voice control is one of those technologies that seems to be perennially almost but not quite there yet. We've spent the last 20 years correcting the Word documents we dictated to Dragon, having our in-car Bluetooth call the wrong person, sending incomprehensible text messages with Siri and, more recently, getting Alexa to play a record that isn't the one we wanted. (Although it might have a chance of being by the right artist, which is progress of a sort.)

But much like the great VR revolution we've been heralding since the '90s, there is some promise in recent developments, and niche areas where it can be useful. Here are some of the challenges of voice as it currently stands, and how to get around them.

Open-Ended vs. Closed-Ended

The problem with voice commands isn't just transcribing speech to text. The best technology is approaching 95% accuracy in ideal circumstances and, surprisingly, humans aren't much better than this at purely mechanical transcription. Where we have the machines licked is in processing context. A simple phrase like "turn off the light" could mean turn off the lights in the room you've just left, turn off all the lights in the room you're in, or turn off the main light while leaving a reading or auxiliary light on. Humans are very good at understanding this context, whereas Alexa or Google Home is left asking "which lamp?" or (worse) turning off a random light.

What helps the machines is giving them additional clues for the context. After all, it's how humans solve the "which light?" question - by considering recent events (did I just leave a room without switching off the light?) and other non-speech cues (if the requester is reading a book, they probably don't want all the lights off). With a machine, we have to do this by making our requests closed-ended.

In machine learning terms, a closed-ended question is one where there are at most a few hundred possible responses, and ideally only a dozen. "Where?" is open-ended: someone could answer with a town, a county, a postcode or even "just around the corner". By contrast, "which country?" is closed-ended: there is a small, definite list of possible responses.

Note that this is a very particular definition of closed-ended. Not only does the list of responses need to be fixed, it also needs to be small and its entries easy to tell apart. While the list of songs available on Spotify is finite, the set is just too big and contains too many homophones, which is how you can end up listening to Radio Ga Ga instead of Lady Gaga.

(Again, humans win at context: your friends probably know your opinions on Queen.)
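To make the "small and easily distinguishable" requirement concrete, here is a minimal sketch of matching a noisy transcript against a closed set. The country list and the `match_closed` helper are hypothetical, and string similarity from Python's `difflib` is only a rough stand-in for the acoustic similarity a real speech platform would work with.

```python
import difflib

# Hypothetical closed set of answers to "Which country?"
COUNTRIES = ["France", "Germany", "Spain", "Italy", "Portugal"]

def match_closed(transcript, options, cutoff=0.6):
    """Return the option closest to the transcript, or None if nothing is close.

    String similarity is an assumption made for illustration; it is not how
    any particular voice platform scores candidates.
    """
    by_key = {option.lower(): option for option in options}
    hits = difflib.get_close_matches(
        transcript.strip().lower(), list(by_key), n=1, cutoff=cutoff
    )
    return by_key[hits[0]] if hits else None
```

With five well-separated options, a slightly mangled transcript such as "germanie" still has one obvious nearest neighbour. With millions of song titles, many candidates would score almost equally well, which is the Spotify problem in miniature.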

Closed-Ended Questions and Affordance

So in order for our responses to be understandable, our questions need to have a fixed list of responses. This is great when it's obvious to a human what the available responses are. "Which country?", "which airline?", "which make of car?" - we're great at understanding what the machine needs from us, and it's great at interpreting our response.

Ask something like, "which mode of transport?" though, and suddenly we're having a fight with a bot because we keep saying "rail" and it's looking for "train".

The problem here is the lack of cues available in a voice interface. On a website, mode of transport would be a drop-down list, and we'd do the mapping ourselves. (Humans: winners at context!) Bots don't have that rich knowledge of synonyms, and although you can program them with a few, it takes a lot of work and correction before they can understand that "rail", "West Coast Main Line", "Greater Anglia" and "an 8-coach Class 455 combination manufactured in approximately 1982" all mean the same thing as "train".

So for an effective voice interface, we not only need closed-ended questions - we need closed-ended questions where it is obvious to the human what the allowable responses are.
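A rough sketch of both halves of that idea: a prompt that reads the allowable answers out loud, and a hand-maintained synonym table that maps what the user says back to one canonical value. The table contents and the function names are hypothetical, and the code assumes each question has at least two options.

```python
import re

# Hypothetical synonym table: canonical value -> words that should map to it.
TRANSPORT_SYNONYMS = {
    "train": {"train", "rail", "railway"},
    "bus": {"bus", "coach"},
    "plane": {"plane", "flight", "air"},
}

def build_prompt(question, options):
    """Append the allowed answers so the user doesn't have to guess them."""
    *head, last = options  # assumes at least two options
    return f"{question} You can say {', '.join(head)} or {last}."

def resolve(transcript, synonyms):
    """Map a transcript to one canonical value, or None if unclear."""
    words = set(re.findall(r"[a-z]+", transcript.lower()))
    matches = [canonical for canonical, alts in synonyms.items() if words & alts]
    # Zero matches or several matches both count as "not understood".
    return matches[0] if len(matches) == 1 else None
```

Here `build_prompt("Which mode of transport?", list(TRANSPORT_SYNONYMS))` gives "Which mode of transport? You can say train, bus or plane." and `resolve("by rail, please", TRANSPORT_SYNONYMS)` gives "train". The table still won't recognise "Greater Anglia", which is why the spoken list of options does most of the work.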

Hotel Echo Lima Papa!

With the above, we might decide to build a great voice interface that asks the user to confirm their identity:

"What's your customer number?"

"316529."

"What's your postcode?"

"E1 6QL."

...

"To confirm, that's customer number 316529 and postcode 'hey one sequel'?"

This is a problem common to many voice recognition platforms. Because the bots are intended to be general-purpose, they assume the user is speaking in dictionary words. And there's often no way to get around that: Lex, for example, doesn't let you provide regular expressions when you want to recognise something like a postcode.

This is compounded by postcodes being an open-ended problem from the point of view of a bot: there are many thousands of possible responses with a huge number of homophones. It gets worse when you consider that some users will try to be smart and use the phonetic alphabet or say things like, "that's 'M' for 'Maple'."

So we're now at the point where voice interfaces need to have closed-ended questions, which can be answered in numbers, dates or simple dictionary words, where the list of available answers is obvious to a human.
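If you do have to accept something like a postcode, one partial workaround is to post-process the transcript yourself. The sketch below is an assumption-laden illustration rather than a feature of any platform: it understands NATO alphabet words, spoken digits and "M for Maple" style spellings, checks the result against a deliberately simplified UK postcode shape, and returns None for anything else so the bot can re-prompt instead of reading back nonsense.

```python
import re

NATO_WORDS = (
    "alfa bravo charlie delta echo foxtrot golf hotel india juliett kilo lima "
    "mike november oscar papa quebec romeo sierra tango uniform victor whiskey "
    "xray yankee zulu"
).split()
NATO = {word: word[0].upper() for word in NATO_WORDS}
NATO.update({"alpha": "A", "juliet": "J"})  # common alternative spellings

DIGIT_WORDS = "zero one two three four five six seven eight nine".split()
DIGITS = {word: str(i) for i, word in enumerate(DIGIT_WORDS)}
DIGITS["oh"] = "0"

# "m for maple" or "m as in maple" -> "m"
SPELLED = re.compile(r"\b([a-z]) (?:for|as in) [a-z]+")
# Simplified postcode shape (an assumption): it ignores the real letter rules.
POSTCODE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]?[0-9][A-Z]{2}$")

def normalise_postcode(transcript):
    """Turn a spoken postcode into 'E1 6QL' form, or None if it doesn't fit."""
    text = re.sub(r"[-']", "", transcript.lower())   # "x-ray" -> "xray"
    text = re.sub(r"[^a-z0-9 ]", " ", text)          # drop other punctuation
    text = SPELLED.sub(r"\1", " ".join(text.split()))
    chars = []
    for token in text.split():
        if token in NATO:
            chars.append(NATO[token])
        elif token in DIGITS:
            chars.append(DIGITS[token])
        elif len(token) == 1 or any(c.isdigit() for c in token):
            chars.append(token.upper())              # already literal, e.g. "e1"
        else:
            return None                              # an ordinary word: give up
    compact = "".join(chars)
    if not POSTCODE.match(compact):
        return None
    return f"{compact[:-3]} {compact[-3:]}"
```

For example, `normalise_postcode("echo one six quebec lima")` returns "E1 6QL", while `normalise_postcode("hey one sequel")` returns None. It does nothing about the underlying homophone problem; it only catches the cases where the transcript is obviously not a postcode.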

Wait, that's not what I meant!

With all this in mind, suppose you've created a voice interface that satisfies all of the following:

  • The context/purpose for interaction is known
  • All the questions are closed-ended
  • All the questions can be answered with numbers, dates or simple dictionary words
  • All the questions make it obvious to the user what the possible responses are

And after all that… midway through question 8 of your 10-question flow, the user realises that they didn't mean to say "Tuesday" in answer to question 2, they meant to say "Thursday". Or maybe they said "Thursday" but there was some microphone noise and it got misinterpreted.

Currently with Lex and Alexa (and likely some other voice tech), the user's only option to correct this is at the end when the bot reads back a confirmation prompt. Lex is particularly bad at this, because if the user says "no, that's not correct" at this point, the entire intent is cancelled and they have to go back and start again.

So you need to do a confirmation at the end of each question. "Which day?"; "Thursday"; "So that's 'Thursday'?". Annoying! Or keep your question flow short enough that the user isn't too upset by having to go through the whole thing again - maybe 2 or 3 questions at most before a confirmation happens.
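The per-question confirmation pattern can be sketched without reference to any particular platform. In the code below, `ask` is a hypothetical callable that speaks a prompt and returns the transcribed reply; it is not a Lex or Alexa API, and the list of accepted "yes" words is an assumption.

```python
YES = {"yes", "yeah", "yep", "correct"}  # assumed set of affirmative replies

def run_flow(questions, ask, max_attempts=3):
    """Ask each question, read the answer back, and re-ask until confirmed.

    `questions` is a list of (slot_name, prompt) pairs. `ask` is any callable
    that takes a prompt string and returns the user's reply as text.
    """
    answers = {}
    for slot, prompt in questions:
        for _ in range(max_attempts):
            value = ask(prompt).strip()
            reply = ask(f"So that's '{value}'?").strip().lower()
            if reply in YES:
                answers[slot] = value
                break  # confirmed, move on to the next question
        else:
            # Ran out of attempts without a confirmed answer for this slot.
            raise RuntimeError(f"Could not confirm an answer for {slot!r}")
    return answers
```

A mistake now costs the user one repeated question rather than the whole flow, at the price of doubling the number of prompts. That trade-off is why keeping the flow to two or three questions is often the better option.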

Even then, you'll still get the occasional user who wants to change their response. This is the final horror voice interfaces have in store for us: no matter how carefully you design your closed-ended, well-afforded voice interface, some users will still give you unexpected open-ended input! (And get very annoyed when your bot responds only with, "I don't understand" or "You need to supply a country".)

The way out is to keep the interaction short and simple enough that the user doesn't need to go back and revisit earlier responses. Ideally, a single utterance or an utterance followed by a single question - it's no coincidence that this is how Alexa, Google Home and Siri all prefer to operate… even if not always well:

"Alexa, turn on the red lamp."

"Okay, turning on the bed lamp."

Where now?

If you've got a problem which can be solved by asking the user a simple question and responding to it in some way, then voice is worth investigating. Consider, say, an out-of-hours call centre that lets users ask simple questions like "when do you open?", "what is my account balance?" or "how much is a ticket from London to Guildford?". Or just a call routing system that doesn't involve the pain of going through a lengthy IVR tree.
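As a toy illustration of how small such a single-utterance system can be, here is a keyword-based router for the three example questions. The intent names and keyword sets are invented for this sketch, and a real deployment would use the platform's own intent matching rather than anything this crude.

```python
import re

# Hypothetical intents and trigger words for an out-of-hours line.
INTENTS = {
    "opening_hours": {"open", "opening", "hours", "close"},
    "account_balance": {"balance"},
    "ticket_price": {"ticket", "fare"},
}

def route(utterance):
    """Pick the intent with the most keyword hits, or hand off if unclear."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    scores = {intent: len(words & keywords) for intent, keywords in INTENTS.items()}
    best = max(scores, key=scores.get)
    top = scores[best]
    # No hits, or a tie between intents: don't guess, pass the caller on.
    if top == 0 or list(scores.values()).count(top) > 1:
        return "handoff"
    return best
```

Anything the router can't place goes to "handoff" (a voicemail or a human) rather than to a guess, which keeps the wrong-lamp class of error out of the call centre.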

However, natural conversations are a long way off. They involve a huge amount of additional data, context and conversational navigation that machines just don't have the skills for yet. With continuing developments in neural nets and machine learning we may one day see the voice bot that can handle, "hang on, go back to that last question" but we're not there yet.

Voice recognition has improved a lot over the last decade (from 80% to 95% transcription accuracy in ideal circumstances), but without that ability to process context or track forward and backward through a conversation it's a long way off doing more than executing simple commands. That, and we need to start naming our smart lamps less ambiguously.