Will intelligent assistants ever understand us better? Learn more about speech recognition accuracy in our series about the elements that will define the next age of intelligent assistants.
TRANSCRIPT:
My name is Tim Tuttle. I'm with Expect Labs, and I'm doing a four-part video blog about the four ways we think intelligent assistant technology is going to get better over the next three to five years. We've spent a lot of time at Expect Labs working on this, and we think this is an exciting time: this technology is going to get much better. We think it's going to get better in four areas: one, intelligent assistants are going to become faster; two, they're going to become more accurate; three, they're going to get smarter; and four, they're going to get better at anticipating what we want.
Today I'm going to talk about how intelligent assistants are going to get more accurate over the next three to five years. When you look at the challenge of making intelligent assistants more accurate, it really boils down to making each individual component more accurate. Specifically, we're talking about intelligent assistants that can listen to and understand what you say. That means they have to get more accurate in their speech recognition, and they have to get more accurate in their ability to understand the meaning of what you say, the semantic understanding. Those are the two areas.
So when you look at the speech recognition piece, speech recognition is typically a big data problem. There are modules that try to capture the acoustic characteristics and quality of the audio, which are called the acoustic models, and there are modules that try to capture the probability, the likelihood, that you're going to use certain word combinations, which are called the language models. In both of these areas we're seeing dramatic improvements in accuracy because the models are getting better. The initial acoustic models that speech recognition systems relied on a few years ago were typically trained on a very specific set of environmental conditions, like talking in a quiet room very close to the microphone. But today we use our smartphones and other devices when we're outside, when we're walking down the street, when the TV's on in the background. These acoustic models need to be able to adapt to those situations. So what's happening is that acoustic models are now being trained for a wide variety of conditions as training data sets get much larger. Over the next few years we're going to be able to use speech recognition systems even if the stereo is on in the background or we're in a noisy room, and that's already starting to happen.
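To make the acoustic side concrete, here's a minimal sketch of one common way this kind of training is done, often called multi-condition or noise-augmented training: clean recordings are mixed with background noise at several signal-to-noise ratios, so the acoustic model sees everything from quiet-room speech to street noise and TV audio during training. The waveforms below are random stand-ins, and the function name and SNR levels are just for illustration; a real pipeline would load actual speech and noise recordings.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a clean speech waveform with background noise at a target SNR (in dB)."""
    # Tile the noise so it covers the whole utterance, then trim to length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]

    # Scale the noise so the speech-to-noise power ratio matches snr_db.
    speech_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / (noise_power + 1e-12))

    return clean + noise

# Stand-ins for one second of 16 kHz speech and recorded background noise.
rng = np.random.default_rng(0)
clean_utterance = rng.standard_normal(16000)
street_noise = rng.standard_normal(16000)

# One clean utterance becomes several training examples, from fairly
# quiet (20 dB SNR) down to very noisy (0 dB SNR).
training_examples = [
    mix_at_snr(clean_utterance, street_noise, snr_db)
    for snr_db in (20, 10, 5, 0)
]
```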
The second part of speech recognition is the language models. The language models attempt to guess the most likely combinations of words you're going to use, based on how common those word combinations are in the English language. Traditionally those language models were not very personalized to any individual user. So I'll give you an example. I have a friend named Marsal Gavaldà, and since the models have no idea he's a friend of mine, they'll always think I'm saying something like "magic marker" instead. Or if I live in Australia and say "koala bear" all the time, the speech recognition engine will think I'm saying "qualifier," or something like that, because people in the United States don't say "koala bear" very often. What's going to change this and improve things in the next couple of years is that these language models are going to become much more personalized around your individual language. For example, if I'm using an intelligent assistant like Siri, that assistant will be able to learn how often I use certain words, not only by listening to the questions I ask, but perhaps also by looking at the e-mails I send or the documents I put online, in places like Dropbox or Google Docs, and from those it can estimate the likelihood that I'm going to say "koala bear" versus, say, the word "qualifier." That's starting to happen right now. It's going to happen significantly more in the next few years, and speech recognition accuracy is going to get much better.
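As a rough illustration of that last idea (a toy sketch, not how Siri actually works), here's a tiny bigram language model that interpolates word-pair counts from a general English corpus with counts from a user's own text. The corpora and the interpolation weight below are invented for the example; the point is simply that mixing in personal text raises the probability of a phrase like "koala bear" for a user who writes it often.

```python
from collections import Counter

def bigram_counts(corpus):
    """Count adjacent word pairs across a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def bigram_prob(bigram, counts):
    """Relative frequency of one bigram within a set of counts."""
    total = sum(counts.values())
    return counts[bigram] / total if total else 0.0

def personalized_prob(bigram, general, personal, lam=0.5):
    """Interpolate a general-English model with the user's own text."""
    return lam * bigram_prob(bigram, general) + (1 - lam) * bigram_prob(bigram, personal)

# A general corpus where "koala bear" never appears...
general = bigram_counts([
    "the last qualifier was close",
    "she won her qualifier yesterday",
])

# ...and the user's own emails and documents, where it appears often.
personal = bigram_counts([
    "we saw a koala bear at the park",
    "the koala bear photos are attached",
])

print(bigram_prob(("koala", "bear"), general))                  # 0.0 without personalization
print(personalized_prob(("koala", "bear"), general, personal))  # ~0.083 once personal text is mixed in
```

In a real recognizer this language model score would be combined with the acoustic score for each candidate word sequence, so boosting "koala bear" here makes the engine less likely to mishear it as "qualifier."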