OK, Google: throw yourself away — making voice recognition disposable

Human voice-to-computer interfacing is one of the great frontiers of modern technology. It’s the user interface of the future, some would say. Cracking such a complex task could unlock A.I. for good. And while Siri, OK Google, Alexa and Cortana are all impressive specimens to be sure, none of them, I think you’d agree, is all the way there yet. I think all of us envision (or at least wish for) a Tony Stark-to-J.A.R.V.I.S. type of relationship between our vocal commands and the responsiveness of the machine to which we direct them.

Now, the machine in question will almost assuredly be our phone (or some web of connected devices interacting with it). So it makes sense to focus our collective technological efforts on getting the voice/computer interface right there first. But Pete Warden, an engineer at Google, wants you to think differently about the underlying premise of voice recognition. Instead of making it huge and all-powerful, let’s make it small, cheap and disposable. As reported in the MIT Technology Review:

“His idea is simple enough: cut down the neural networks that are usually used to process sound until they’re efficient enough to run on cheap, lightweight chips. ‘What I want is a 50-cent chip that can do simple voice recognition and run for a year on a coin battery,’ he explained during last week’s Arm Research Summit in Cambridge, U.K. ‘We’re not there yet … but I really think this is doable with even the current technology that we have now.’”

If achievable, it would open an entirely new world of possibilities for voice recognition and artificial intelligence. At that low a price, the hardware would, for all intents and purposes, be disposable, allowing creators to use those chips in all sorts of new ways we wouldn’t dream of at the moment. “The devices could be used to build cheap dolls that respond to your kids, for instance, or simple home electronics like lamps that are voice-activated,” Jamie Condliffe writes.

Warden imagines some other uses, though — industrial settings, believe it or not. In such a use case, the sensors could be listening for noises rather than voices: ‘hundreds of sensors spotting tell-tale audio signatures of squeaking wheels in factory equipment, or chirping crickets in a farm field.’

The problem Warden runs into, though, is that most voice recognition systems are incredibly resource-hungry, making the small, inexpensive and energy-efficient dream a difficult ask. Condliffe writes:

“squeezing down, say, the AI that powers Amazon’s AI assistant, Alexa, to run on simple battery-powered chips with clock speeds of just hundreds of megahertz isn’t feasible. That’s partly because Alexa has to interpret a lot of different sounds, but also because most voice recognition AIs use neural networks that are resource-hungry, which is why Alexa sends its processing to the cloud.”

To combat that, his team has zeroed in on a small set of specific words and commands to begin with. They’ve also developed a new voice recognition algorithm that processes sound differently. The method slices an audio clip into short snippets and then calculates the frequency content of each snippet. The system then lines up those frequency plots one after another to create a two-dimensional image of frequency content vs. time. Finally, they apply visual-recognition algorithms to that image to identify the tell-tale signs of a person saying a specific, individual word.
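The slice-and-stack step described above is essentially what is usually called a spectrogram. Here is a minimal sketch of that idea in plain Python. It is not the team's code: the frame length of 64 samples, the hop of 32, the 8 kHz sample rate and the synthetic 1 kHz test tone are all assumptions picked for illustration, and the visual-recognition step that would consume the image is left out.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Slice a list of samples into short, overlapping snippets."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Frequency content of one snippet via a naive DFT (bins 0..N/2).

    Fine for a sketch; a real system would use an FFT.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum


def spectrogram(samples, frame_len=64, hop=32):
    """Stack per-snippet spectra in time order.

    Rows are time steps, columns are frequency bins.
    """
    return [magnitude_spectrum(f)
            for f in frame_signal(samples, frame_len, hop)]


# Hypothetical input: one second of a pure 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE)]

image = spectrogram(tone)

# Bin width is 8000 / 64 = 125 Hz, so a 1 kHz tone should peak in bin 8.
peak_bins = {row.index(max(row)) for row in image}
print(len(image), "time steps x", len(image[0]), "frequency bins")
print("peak bins:", peak_bins)
```

A spoken word would produce a far richer pattern than a single bright column, and that pattern is what an image classifier would then be trained to spot.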

There was just one problem… Their first attempt required 8,000,000 calculations to crunch a one-second audio clip, coming in at 89 percent accuracy. Now, a modern smartphone could handle that and still feel responsive, and running it on the device would also improve efficiency and speed by removing the need to send everything to the cloud. But that still doesn’t solve the issue of getting the same performance out of a smaller, less powerful and cheaper chip.

So, the team “borrowed algorithmic tricks that help Android phones recognize the phrase ‘OK, Google,’ [and] the system was able to analyze a second of speech with 85 percent accuracy by performing just 750,000 calculations,” according to MIT Technology Review.
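To put those two results side by side, here is a quick back-of-the-envelope comparison. It uses only the figures quoted above; the variable names are made up for this sketch.

```python
# Figures quoted in the article, per one-second audio clip.
FIRST_ATTEMPT = {"calculations": 8_000_000, "accuracy_pct": 89}
OPTIMIZED = {"calculations": 750_000, "accuracy_pct": 85}

# How many times less work the optimized version does.
reduction_factor = FIRST_ATTEMPT["calculations"] / OPTIMIZED["calculations"]

# How much accuracy was given up, in percentage points.
accuracy_cost = FIRST_ATTEMPT["accuracy_pct"] - OPTIMIZED["accuracy_pct"]

print(f"{reduction_factor:.1f}x fewer calculations")
print(f"{accuracy_cost} percentage points of accuracy given up")
```

In other words, the team traded about four percentage points of accuracy for roughly a tenfold cut in computation.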

While that is a significant improvement, it still doesn’t get the system down to the size, speed and energy budget of, say, a Raspberry Pi or Arduino. And there are a lot of obstacles in the way of making that dream a reality. If progress continues on this front, however, it would totally change the game as far as how, where and why we deploy A.I. It also stands to reason that, unlike so many technologies when they first come out, A.I. this cheap might not add to economic inequality through lack of affordable access; it might very well combat it outright.

We’re certainly cheering Warden and his team on; what they’re working on could quite literally change the world.


Jeff Francis

Jeff Francis is a veteran entrepreneur and founder of Dallas-based digital product studio ENO8. Jeff founded ENO8 to empower companies of all sizes to design, develop and deliver innovative, impactful digital products. With more than 18 years working with early-stage startups, Jeff has a passion for creating and growing new businesses from the ground up, and has honed a unique ability to assist companies with aligning their technology product initiatives with real business outcomes.
