This article addresses the fundamentals of how voice recognition works, and how best to integrate it into your new consumer product.
Thanks to Amazon, Google, and Apple, voice recognition has finally gone mainstream.
I’ve personally dreamed about intelligent voice recognition since I was a kid tinkering with electronics and programming my Commodore 64.
Listening to my son interact with our Amazon Echo smart speaker the other day, I realized that my childhood dream is finally here. I am still waiting for levitating, self-driving cars, though.
So what is the best way to integrate voice command capabilities into your hardware product?
There are fundamentally two approaches for adding voice-activated features to a product: on-device or cloud-based.
On-device voice activation means that all voice processing is done locally on the device itself. For cloud-based solutions, most of the heavy-duty processing is performed in the cloud on fast servers.
However, before we look at the two methods of incorporating voice recognition into your product, it’s important for you to understand the fundamentals of how voice recognition works.
How does voice recognition work?
Recognizing speech starts with recognizing the individual words being spoken. This is usually done in hardware on the device itself.
The process begins by converting the incoming analog speech to digital data via an Analog-to-Digital Converter (ADC). This signal is then processed to remove background noise, and normalized to account for amplitude variations. It can then be re-sampled to account for differences in how fast the speaker talks.
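To make the clean-up step a little more concrete, here is a minimal Python sketch of what happens to the samples after the ADC. The function name preprocess and the noise_floor value are my own assumptions for illustration, and the simple noise gate is only a stand-in for the much more capable noise suppression a real product would use.

```python
def preprocess(samples, noise_floor=0.02):
    """Clean up raw ADC samples (a list of ints or floats)."""
    if not samples:
        return []
    # Remove any DC offset so the waveform is centred on zero.
    mean = sum(samples) / len(samples)
    centred = [s - mean for s in samples]
    # Normalize so the loudest sample has magnitude 1.0.
    peak = max(abs(s) for s in centred)
    if peak == 0:
        return [0.0] * len(centred)
    normalized = [s / peak for s in centred]
    # Crude noise gate: zero out anything below the assumed noise floor.
    return [s if abs(s) >= noise_floor else 0.0 for s in normalized]
```

Normalizing to the peak means a quiet speaker and a loud speaker produce signals of similar scale, which makes the later matching stages less sensitive to volume.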
Next is phonetic analysis.
At a simple level, normal speech is just sentences made up of individual words separated by short periods of silence. Each sentence is separated from the next one by a slightly longer pause. In turn, each word is made up of a combination of phonemes.
A phoneme is a fundamental unit of speech, such as the “b” sound in the word “but”, or the “p” sound in the word “place”.
A fricative is one type of phoneme, produced by forcing air through a narrow gap in the mouth. Examples of fricatives include the “sh” sound in “shut”, or the “s” sound in the word “yes”. English has roughly forty-four phonemes, nine of which are fricatives.
By properly isolating such speech segments, and matching their sequences and combinations, it is possible, through a dictionary lookup, to arrive at a very good, though not always perfect, determination of the word being spoken.
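The dictionary lookup can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the three-word PRONUNCIATIONS table and its ARPAbet-style symbols are hypothetical, and real engines use statistical models rather than the simple sequence similarity from Python's standard difflib module used here.

```python
from difflib import SequenceMatcher

# Hypothetical mini pronunciation dictionary (ARPAbet-style symbols).
PRONUNCIATIONS = {
    "go": ["G", "OW"],
    "stop": ["S", "T", "AA", "P"],
    "reset": ["R", "IY", "S", "EH", "T"],
}

def best_word(phonemes):
    """Return (word, score) for the dictionary entry closest to `phonemes`."""
    def similarity(word):
        return SequenceMatcher(None, PRONUNCIATIONS[word], phonemes).ratio()
    word = max(PRONUNCIATIONS, key=similarity)
    return word, similarity(word)
```

Because the match is scored rather than exact, a slightly misheard vowel still lands on the nearest word, which is why the result is very good but not always perfect.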
If the speech recognition application is simple word matching, or recognizing single-word commands, then the above is all that is required. As mentioned earlier, this can be accomplished locally by a microcontroller.
Otherwise, the next process is syntactic analysis. Syntactic analysis helps improve the accuracy of the word recognition.
For example, consider these sentences: “The man is a bird” and “the man his a bird”. Phonetic analysis alone may not be able to determine whether the third word is “is” or “his”. In this case, syntactic analysis can immediately determine that it should be “is”, since the second sentence would have no verb.
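Here is a toy Python version of that verb check. The PART_OF_SPEECH lexicon is a hypothetical hand-written table covering only the words in this example; a real syntactic analyzer relies on a full grammar or a trained parser, so treat this purely as a sketch of the idea.

```python
# Hypothetical hand-written lexicon: word -> part of speech.
PART_OF_SPEECH = {
    "the": "DET", "a": "DET",
    "man": "NOUN", "bird": "NOUN", "nerd": "NOUN",
    "is": "VERB",
    "his": "PRON",
}

def has_verb(sentence):
    """True if any word in the sentence is tagged as a verb."""
    words = sentence.lower().split()
    return any(PART_OF_SPEECH.get(w) == "VERB" for w in words)

def pick_syntactically_valid(candidates):
    """Return the first candidate sentence containing a verb, else None."""
    for sentence in candidates:
        if has_verb(sentence):
            return sentence
    return None
```

Given the two candidates from the phonetic stage, the function rejects “the man his a bird” and keeps “the man is a bird”.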
Continuing with the sentence in the last paragraph, it is easy to see that while the sentence “The man is a bird” is syntactically correct, it simply does not make sense. The process of determining if the sentence makes any kind of sense, or not, is called semantic analysis.
This helps in determining what exactly is being spoken or requested. In this particular case, a smart semantic analyzer will probably decide that the proper sentence is: “the man is a nerd”.
Finally, sometimes even semantic analysis fails to determine the exact sentence being decoded. Consider these two phrases: “this is definitely CMOS”, and “this is definitely sea moss”.
If this were a conversation fragment between two electrical engineers, then the first phrase is probably the correct one.
On the other hand, two marine biologists would most likely be referring to sea moss, not CMOS.
So, context clues are important to proper determination in these cases. AI is usually employed to learn the speech style of the speaker based on previously collected and analyzed examples.
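A very rough way to picture the use of context is to score each candidate phrase against the words heard recently in the conversation. The DOMAIN_VOCAB sets below are hypothetical and hand-picked by me; a real system would learn these associations from collected examples rather than use a fixed table.

```python
# Hypothetical vocabulary associated with each candidate phrase.
DOMAIN_VOCAB = {
    "CMOS": {"transistor", "voltage", "chip", "circuit", "logic"},
    "sea moss": {"algae", "tide", "reef", "ocean", "species"},
}

def disambiguate(candidates, recent_words):
    """Pick the candidate whose vocabulary overlaps most with the context."""
    context = {w.lower() for w in recent_words}
    return max(candidates, key=lambda c: len(DOMAIN_VOCAB[c] & context))
```

If the recent conversation mentions chips and voltages, the sketch picks “CMOS”. If it mentions algae and reefs, it picks “sea moss”.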
On-Device Voice Commands
On-device voice capabilities are usually best for products with simple voice activation features, and/or for products that don’t have an internet connection.
For example, if your product needs to respond to simple, single-word commands like go, stop, reset, etc. then doing everything locally on your device itself makes the most sense. This is commonly called keyword spotting.
Implementing relatively simple voice command capabilities can be done via a low-cost embedded microcontroller, without the need for the speed and overhead of a faster, and more complex, microprocessor.
From a hardware design perspective, adding simple voice commands isn’t very complex, and most of the development work will be on the software side.
One software option is from a company called Sensory, which offers an embedded voice recognition engine called Truly Handsfree that features a small vocabulary. It can run on an ARM Cortex-M4 microcontroller.
ARM has also released an open-source library for keyword spotting applications that runs on Cortex-M microcontrollers.
Another software option is from a company called Snips. They offer a full voice recognition platform called Snips Flow that runs on Linux or Android operating systems. Snips Flow is pushing the boundaries of using AI on very small devices. They offer a nice user interface that allows you to customize your voice assistant.
They also offer a voice command solution called Snips Commands that runs on Cortex-M4 microcontrollers.
Snips believes you shouldn’t need to connect to the cloud to start a pot of coffee or turn your thermostat down. Instead, companies or entrepreneurs can create unique voice tools to run exclusively on their exact devices.
Snips is one way to add advanced voice activation to your product without relying on the tech giants Google or Amazon.
For full speech recognition, the additional stages past phonetic analysis are not easily performed by a microcontroller with limited resources.
However, despite the limitations of simple microcontroller-based speech recognition, some elaborate and seemingly intelligent speech recognition applications can still be implemented on standalone microcontrollers, especially 32-bit parts such as the STM32 family with suitably large memory sizes.
Consider, as an example, an automated banking application. The user may ask for:
“Can you tell me the exchange rate of the dollar against the euro?”, or
“What is the exchange rate of the dollar versus the euro?”, or
“What’s the rate of the dollar to euro like today?”
A full speech recognition system will try to fully analyze these requests before concluding that the user simply wants to know the exchange rate of the dollar to euro.
However, a simple word recognition picking up the words “dollar”, “euro” and “exchange” should be able to determine what the user is asking in the context of a banking application.
How exactly the question is posed is irrelevant in this case, and a microcontroller will do just as good a job as a full-blown speech recognition system.
Since there is no need to send the request to a remote server, the response will probably be faster too.
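The banking example can be sketched as simple keyword matching. The rule below is my own assumption for illustration: the request must mention both currencies plus either “exchange” or “rate”, since the third phrasing above never says “exchange”. The code is written in Python for readability; on a real microcontroller the same logic would typically be written in C.

```python
import re

# Hypothetical rule: every word in REQUIRED plus at least one in ANY_OF.
REQUIRED = {"dollar", "euro"}
ANY_OF = {"exchange", "rate"}

def wants_exchange_rate(utterance):
    """Return True if the utterance looks like a dollar/euro rate request."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    return REQUIRED <= words and bool(ANY_OF & words)
```

All three phrasings of the question trigger the same response, while an unrelated request such as asking for an account balance does not.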
The primary advantages of on-device voice recognition over cloud-based solutions are that no internet connection is required, the response is faster, and privacy and security are improved since all of your voice data remains local.
For products that only require simple word commands, it is almost always simpler to do all of the voice recognition on-device.
Cloud-Based Voice Recognition
At this point in time, Google Assistant and Amazon Alexa are the two main cloud-based solutions. Alexa is presently available in more products, and Amazon isn’t slowing down. So how do you choose?
Armen Gharabegian, chief executive at ShadeCraft, says they chose Alexa over Google Assistant for their voice-controlled garden umbrellas because “it was much easier and simpler to integrate”.
He envisions his customers controlling their umbrellas by the pool, but also being able to access the entire Amazon ecosystem. However, the company is also developing with Google Assistant with an eye on having their products compatible with both systems in the future.
The consensus still seems to be that Google is far superior for asking questions and searching the web, while Alexa has more apps, is supported by more third-party products, and does the best at juggling everything in your personal digital ecosphere.
Estimates show that Amazon has 41% of the worldwide smart speaker market, with Google coming in next at 28%. Alexa dominates in other product categories too.
Some products support both Alexa and Google Assistant. You can choose to talk to the U by Moen shower with Google Assistant, Amazon Alexa, or Siri via Apple HomeKit. However, it’s still unknown how confused users will get juggling multiple voice platforms.
Many developers are concurrently developing with Alexa and Google Assistant, although their current products may only feature one platform.
Google went all out at the Consumer Electronics Show (CES) this year to promote Google Assistant and its use in, well, practically everything. Use Google Assistant to do everything from boarding your United Airlines flight to charging your electric car.
While trailing Amazon Alexa, Google Assistant is quickly gaining traction, especially in the smart home ecosystem. Google tapped their Google Services resources and Google A.I. to create new ways for customers to interact with Google Assistant.
Google boasts that 1,600 home-automation products and 10,000 devices are now compatible with Google Assistant. Google aims for these devices to be available everywhere in your home.
Amazon has made it hassle-free to add Alexa to products, by offering their assistant on a single chip – the Alexa Connect Kit. Developers of commercial products can also apply for the “Works with Alexa” certification, which allows you to display your compatibility with Alexa on your packaging.
Google announced a similar chip early this year with a strikingly similar name – Google Assistant Connect. However, any product using that chip will have to wirelessly connect to a Google smart device, which will process voice data.
Google Assistant has exclusive access to some of the search firm’s other products, letting users control Chromecast audio streams or display YouTube and Google Maps on devices with screens.
Voice recognition has finally gone mainstream, and its momentum is increasing quickly. More and more companies are incorporating voice features into their products.
For simple products, or those without an internet connection, your best option is to implement simple voice commands with all of the processing done locally on the device. Higher-performance microcontrollers can typically perform the speech analysis up through the phonetic analysis stage.
Complex products that require full voice recognition will usually need a cloud-based solution. Cloud-based voice recognition systems will perform the syntactic and semantic analysis necessary for complex voice recognition capabilities.
Cloud-based solutions typically require a high performance microprocessor or Digital Signal Processor (DSP) to locally perform the necessary pre-processing steps.