In-Vehicle Voice Control: Evolution Through The Times

Divya MS
15 Feb 2022
09:30 AM
4 Min Read

Voice is a natural communication medium for humans. It is easier and less distracting to converse by voice compared to touch-screen operations or using mechanical knobs and switches.


The skyrocketing popularity of voice technology is largely attributed to the advancement of voice assistants like Amazon Alexa, Apple Siri, and Google Assistant. Yet the technology did not evolve overnight: several decades of research and development laid the foundation of the voice experience we enjoy today.

Intelligent assistants in vehicles will be our companions in the future, just as smartphones are today. In the automotive segment, voice technology was initially intended to reduce distractions that compromise safety. Voice assistants primarily helped operate infotainment functions while keeping the driver’s eyes on the road and hands on the steering wheel. As they matured, their applications expanded, and experiences once seen only in science fiction films became a reality.

Today, as we expect our vehicles to process and recognise human language across many functions, it is worth looking at how voice technology in cars began, how it evolved into its present form, and where it is heading.

Early Days

By the mid-1990s, primitive versions of locally embedded voice dialogue systems had appeared in certain luxury cars. Such systems allowed hands-free operation of functions like the car’s phone, radio, CD player, and air conditioning. The technology recognised a limited vocabulary of around 50-60 commands in total.

To initiate a voice command, a user had to press a Push To Talk (PTT) button located on the infotainment unit or steering wheel. The user then had to memorise all the different commands or listen to the system’s voice guidance to understand which command to use.

Here is a sample dialogue to invoke navigation:

User presses the PTT button.

System: ‘Please say a command.’

User: ‘Start navigation.’

System: ‘Start navigation. Which city?’

User: ‘Sanford.’

System: ‘Sanford. Which street?’

User: ‘Smith Street.’

System: ‘Smith Street. Which building?’

User: ‘Sanford Historical Museum.’

System: ‘Sanford Historical Museum. Show map or start navigation?’

User: ‘Start navigation.’

System: ‘Starting navigation.’

[Image: In earlier systems, the accuracy of speech recognition was very low.]

As this example shows, multiple commands were required to complete a single function. Such friction increased the cognitive load of drivers and slowed their reaction to driving events. In general, the pipeline of a voice dialogue system comprises three steps (a minimal sketch follows the list):

  • Speech recognition converts the user’s voice to text. A speech recognition module typically includes a feature extraction algorithm, an acoustic model, a language model and a search algorithm (decoder).
  • The recognised text is fed to the decision-making part of the software, which usually generates a text output.
  • Speech synthesis (text to speech) converts that text output to speech to give the user an auditory response.
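
To make the flow concrete, here is a minimal sketch of the pipeline in Python. The recogniser, decision logic and synthesiser are illustrative stubs standing in for real acoustic models and TTS engines; only the data flow between the three stages is representative.

    def recognise_speech(audio: bytes) -> str:
        # Stage 1: in a real module, feature extraction, an acoustic model,
        # a language model and a decoder run here; we pretend the audio
        # decoded cleanly to a fixed command.
        return "start navigation"

    def decide(text: str) -> str:
        # Stage 2: map the recognised text to a text response.
        responses = {
            "start navigation": "Start navigation. Which city?",
            "radio on": "Turning the radio on.",
        }
        return responses.get(text, "Please say a command.")

    def synthesise(text: str) -> bytes:
        # Stage 3: a TTS engine would return audio samples; encoded
        # text serves as a placeholder waveform here.
        return text.encode("utf-8")

    audio_in = b"<microphone samples>"
    print(synthesise(decide(recognise_speech(audio_in))))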

In earlier systems, the accuracy of speech recognition was very low. For the system to detect commands, users needed to speak slowly and clearly, and to ensure a quiet environment by rolling up the windows and even redirecting the blower. Such issues initially discouraged users from embracing voice control. Integrated sound signal pre-processing therefore became an essential part of the speech recognition module, mainly to suppress noise and interference from other sources.

[Image: In autonomous driving, drivers become passengers and voice control has to be allowed from any location in the vehicle.]

Present Era

The development of connectivity and artificial intelligence dramatically expanded the capabilities of voice control systems. Today’s high-end systems converse with users in a more natural tone without being restricted to a pre-set list of commands or a rigid dialogue sequence. For example, the navigation use case mentioned above is achievable with a single command: ‘Navigate to Sanford Historical Museum’.

Alternatively, any phrasing that conveys the same meaning can be used (a simple intent-matching sketch follows these examples), like:

‘Direct me to Sanford Historical Museum’.

‘Go to Sanford Historical Museum’.

‘Sanford Historical Museum, please’.
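
To illustrate, here is a hypothetical sketch of how an NLU front end could normalise these phrasings to a single intent. The patterns and intent names are invented for illustration; production systems use statistical language models rather than hand-written rules.

    import re

    # Several phrasings map to one 'navigate' intent with a destination slot.
    NAVIGATE_PATTERNS = [
        r"navigate to (?P<dest>.+)",
        r"direct me to (?P<dest>.+)",
        r"go to (?P<dest>.+)",
        r"(?P<dest>.+), please",
    ]

    def parse_intent(utterance: str) -> dict:
        text = utterance.lower().strip(" .")
        for pattern in NAVIGATE_PATTERNS:
            match = re.fullmatch(pattern, text)
            if match:
                return {"intent": "navigate", "destination": match.group("dest")}
        return {"intent": "unknown"}

    print(parse_intent("Direct me to Sanford Historical Museum"))
    # -> {'intent': 'navigate', 'destination': 'sanford historical museum'}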

This reduces the cognitive load of drivers and moves closer to the original goal of reduced driver distraction. However, such NLU (Natural Language Understanding) capable systems require more processing power and resources. Where it is not economical to dedicate a high-end SoC (System-on-Chip) to an embedded voice dialogue system, cloud-based speech recognition solutions are advantageous.

However, cloud-only solutions have shortcomings of their own, such as longer response times and unavailability when there is no internet connection. As a compromise, hybrid solutions are preferred: they process the speech signals both locally and in the cloud. The cloud server’s response is prioritised over the embedded system’s response, as the former is more accurate, more robust, and understands natural language.

If the cloud server fails to respond within a defined time interval, the system falls back to the embedded voice agent’s response for decision making, as sketched below. This approach ensures a seamless experience for the user, as voice commands for local vehicle control functions remain available in limited connectivity scenarios. Additionally, wake word support in newer systems enables hands-free invocation of the voice dialogue instead of pressing a PTT button.
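
A minimal sketch of this arbitration logic, assuming hypothetical cloud_recognise and embedded_recognise functions and an illustrative timeout value:

    from concurrent.futures import ThreadPoolExecutor, TimeoutError
    import time

    def cloud_recognise(audio: bytes) -> str:
        # Hypothetical cloud call: more accurate, but network-dependent.
        time.sleep(2.0)  # simulate network latency
        return "cloud: navigate to Sanford Historical Museum"

    def embedded_recognise(audio: bytes) -> str:
        # Hypothetical local recogniser: limited vocabulary, always available.
        return "embedded: start navigation"

    def recognise(audio: bytes, cloud_timeout: float = 1.5) -> str:
        # Run both recognisers in parallel; prefer the cloud result,
        # but fall back to the embedded result if the cloud misses
        # the deadline.
        with ThreadPoolExecutor(max_workers=2) as pool:
            cloud = pool.submit(cloud_recognise, audio)
            local = pool.submit(embedded_recognise, audio)
            try:
                return cloud.result(timeout=cloud_timeout)
            except TimeoutError:
                return local.result()

    print(recognise(b"<samples>"))  # the cloud misses the deadline here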

[Image: Users can also control their smart home devices from their car through voice assistants.]

When projection technologies (Android Auto and Apple CarPlay) became popular, voice assistants like Google Assistant and Siri entered infotainment systems directly. Third-party assistants like Amazon Alexa can also be integrated into these systems. Users can even control their smart home devices from their car through such voice assistants.

From an OEM perspective, it is more economical to integrate existing voice assistants and develop custom vehicle-specific skills as needed, rather than develop fully functional voice recognition solutions in-house. However, OEMs that place greater emphasis on the data privacy of users and their cars still prefer custom-tailored solutions for car-specific use cases.

Future

Both proprietary voice recognition technology and voice assistants are progressing rapidly, targeting a user experience that goes beyond voice. Future systems will be able to combine multiple commands more naturally, for example: ‘Go to Sanford Historical Museum if it is open for the next three hours and there isn’t any traffic congestion on the way’. The system will look up the museum’s operational hours, fetch traffic congestion data from the navigation provider, and start navigation if the desired conditions are met, as sketched below.
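
A sketch of how such a compound command might be executed, with hypothetical get_closing_time and has_traffic_congestion helpers standing in for a real places service and navigation provider:

    from datetime import datetime, timedelta

    def get_closing_time(place: str) -> datetime:
        # Hypothetical places lookup; stubbed to close four hours from now.
        return datetime.now() + timedelta(hours=4)

    def has_traffic_congestion(place: str) -> bool:
        # Hypothetical navigation-provider query; stubbed to clear roads.
        return False

    def handle_conditional_navigate(place: str, min_open_hours: int = 3) -> None:
        still_open = get_closing_time(place) >= (
            datetime.now() + timedelta(hours=min_open_hours))
        if still_open and not has_traffic_congestion(place):
            print(f"Starting navigation to {place}.")
        else:
            print(f"Not navigating to {place}: conditions not met.")

    handle_conditional_navigate("Sanford Historical Museum")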

[Image: Multi-modal inputs with voice, hand gestures, gaze, and handwriting will push the in-car user experience to the next level in the future.]

The addition of Teachable AI will enhance the learnability and predictability of future systems. Biometric identification from voice signals will recognise who is talking and enable a personalised experience. With separate audio zones, each passenger and the driver will be able to use voice control simultaneously. And with the advent of autonomous driving, drivers become passengers, so voice control has to be available from any location in the vehicle.

Voice search may become more popular than textual search. Speech recognition may happen fully or partially on edge devices to address data privacy challenges. Word error rates of speech recognition will drop drastically. Multi-modal inputs combining voice, hand gestures, gaze, and handwriting will push the in-car user experience to the next level. Conversational AI also stands a strong chance of becoming the primary interface for Metaverse applications.

Afterword

The rapid growth of voice technology in recent years and promising new developments in the arena confirm that tremendous opportunity lies ahead in the design, development, and testing of in-vehicle voice experience.

The in-car experience will be a critical differentiator as customers seek better automobile capabilities, and manufacturers who emphasise privacy and sleeker, more efficient technology will pull ahead of the pack.

About the Author: Divya MS is Senior Technical Architect at QuEST Global.
