Speech recognition is a deep and complex field. It has a rich history that dates back to the 1950s, but it’s only in recent years that the technology has matured enough to become widely used in products. In this article, we’ll take a deep dive into how speech recognition works, covering everything from audio capture and acoustic processing to language modeling and decoding.
How Speech Recognition Works
Speech recognition systems involve a number of steps:
1. Audio Capture
An audio recording is made of one or more speakers. The audio is often sampled at a rate of 8,000 Hz or 16,000 Hz. Recordings may be mono, stereo, or even multi-channel, depending on the recording setup.
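As a concrete illustration, here is a minimal sketch of loading a recording at 16,000 Hz using the librosa library (the file name is a placeholder):

```python
# A minimal sketch of loading (and resampling) a recording at 16,000 Hz.
# "call.wav" is a placeholder file name.
import librosa

audio, sample_rate = librosa.load("call.wav", sr=16000, mono=True)
print(f"{len(audio)} samples at {sample_rate} Hz "
      f"({len(audio) / sample_rate:.1f} seconds of audio)")
```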
2. Pre-processing
The audio is sometimes cleaned up using noise reduction and other signal-processing algorithms that reduce the impact of background noise and recording artifacts.
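As an illustration, here is a rough sketch of one simple noise-reduction technique (spectral gating); production systems may use more sophisticated methods or skip this step entirely:

```python
# A rough sketch of spectral gating: estimate a per-frequency noise floor from
# the quietest frames, then attenuate time-frequency bins below that floor.
import numpy as np
import librosa

def reduce_noise(audio, n_fft=512, threshold=1.5):
    stft = librosa.stft(audio, n_fft=n_fft)
    magnitude, phase = np.abs(stft), np.angle(stft)

    # Use the quietest 10% of frames as a crude estimate of the noise floor.
    frame_energy = magnitude.mean(axis=0)
    quiet = magnitude[:, frame_energy <= np.quantile(frame_energy, 0.1)]
    noise_floor = quiet.mean(axis=1, keepdims=True)

    # Keep only bins that are clearly above the noise floor.
    mask = magnitude > threshold * noise_floor
    cleaned = magnitude * mask * np.exp(1j * phase)
    return librosa.istft(cleaned, length=len(audio))
```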
3. Feature Extraction
At this stage, the audio is converted into a sequence of features. The most common type of feature historically used in speech recognition is the Mel Frequency Cepstral Coefficient, or MFCC: a compact representation of the short-term frequency content of the audio on a perceptually motivated (mel) scale. Many other features can be used at this stage, ranging from hand-engineered representations to (more recently) learned representations, or in some cases the raw audio waveform.
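For example, here is a minimal sketch of extracting MFCC features with librosa (continuing from the loading example above; the parameter values are just common choices):

```python
# A minimal sketch of turning audio into a sequence of MFCC feature vectors.
import librosa

audio, sr = librosa.load("call.wav", sr=16000)  # placeholder file, as above
# 13 coefficients per frame is a common historical choice; frames are taken
# every 10 ms (hop_length = 160 samples at 16 kHz) over 25 ms (400-sample) windows.
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfccs.shape)  # (13, number_of_frames)
```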
4. Prediction
Features are then converted into a probability distribution over words, letters, or phonemes. In state-of-the-art systems, this is typically done with a neural network. Prediction can involve large learned models that require GPUs or other accelerator hardware for fast processing.
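To make the shapes concrete, here is a toy sketch of a prediction network in PyTorch: a small recurrent model that maps feature frames to a per-frame distribution over characters. It is illustrative only – real systems are far larger and trained on huge datasets:

```python
# A toy acoustic model: MFCC frames in, per-frame log-probabilities over
# characters out. Untrained and tiny; it only illustrates the data flow.
import torch
import torch.nn as nn

VOCAB = list("abcdefghijklmnopqrstuvwxyz '") + ["<blank>"]

class TinyAcousticModel(nn.Module):
    def __init__(self, n_features=13, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, len(VOCAB))

    def forward(self, features):                 # features: (batch, time, n_features)
        hidden_states, _ = self.rnn(features)
        return self.out(hidden_states).log_softmax(dim=-1)   # (batch, time, vocab)

model = TinyAcousticModel()
frames = torch.randn(1, 200, 13)                 # e.g. 200 feature frames (~2 seconds)
log_probs = model(frames)
print(log_probs.shape)                           # torch.Size([1, 200, 29])
```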
5. Decoding
The per-frame predictions are converted into a final text transcription, often with the help of a language model. In the simplest case, the word or letter with the highest likelihood at each time step is taken as the output; more sophisticated decoders use beam search and a language model to choose the most likely overall word sequence. The language model is optional, and is often omitted when the model simply predicts raw letters or phonemes.
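Continuing the toy example above, here is a minimal sketch of greedy, CTC-style decoding: take the most likely symbol at each time step, collapse repeats, and drop blanks. A production decoder would typically use beam search with a language model instead:

```python
# Greedy CTC-style decoding of per-frame log-probabilities.
def greedy_decode(log_probs, vocab, blank="<blank>"):
    # log_probs: (time, vocab_size) for a single utterance
    best = log_probs.argmax(dim=-1).tolist()
    output, previous = [], None
    for index in best:
        if vocab[index] != blank and index != previous:
            output.append(vocab[index])
        previous = index
    return "".join(output)

text = greedy_decode(log_probs[0], VOCAB)
print(text)  # gibberish here, since the toy model above is untrained
```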
History Of Speech Recognition
Speech recognition has a rich history that goes back many decades. Here’s a brief look at some of the key milestones:
1. Early Research Systems
Several major research projects began investigating speech recognition in the 1950s, such as Bell Labs’ early digit recognizer, which could only handle a small vocabulary from a single speaker. These systems were not very effective and not widely used, but they did prove that speech recognition was possible.
2. Commercial Products
A number of notable companies in the field of speech recognition emerged in the 1960s and 1970s. Their systems were not capable of true real-time speech recognition and required hours of processing time. However, they did demonstrate some interesting capabilities and spurred further research. With the advent of MFCCs in the 1980s, these systems started to become more capable.
Products such as Dragon Dictate and ViaVoice were born in the 1990s. These systems were initially not very accurate, but their ability to capture the interest of mainstream consumers made them significant in the industry. Systems at the time required a significant amount of per-user training and were not capable of capturing speech in the wild; they were best suited to transcribing audio that had been deliberately dictated for transcription.
3. Deep Learning Revolution
In the 2010s, a new paradigm of machine learning emerged that we now refer to as deep learning. As with many advances in AI, neural networks were not a completely new idea – they had been studied since the 1950s and 1960s. However, it wasn’t until the 2010s that the latest generation of deep learning research began to produce dramatic improvements in the field of speech recognition. Today, virtually all major speech recognition systems are built using deep learning.
4. End-to-end Speech Recognition
In recent years, the speech recognition field has made great strides in the development of so-called “end-to-end” speech recognition systems. Traditional speech recognition systems involved chaining together multiple complex modules – such as an acoustic model, a pronunciation lexicon, and a language model – each of which had to be carefully developed. These systems were complex and difficult to engineer: the more modules, the harder they were to develop and maintain. With the latest developments in deep learning, researchers have started to build models that process the audio signal end-to-end. These models are capable of converting audio to text directly, without having to go through a series of intermediate steps. While these models are relatively new, they are rapidly becoming the standard among best-of-breed systems. Hyperia is one of the leaders in this area, and has built one of the most advanced end-to-end speech recognition systems specifically designed for call and meeting transcription.
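As an illustration of the end-to-end idea (using a publicly available pretrained model rather than Hyperia's system), a wav2vec 2.0 checkpoint from Hugging Face maps raw audio directly to characters with no separate acoustic, pronunciation, or language-model pipeline:

```python
# End-to-end transcription with a publicly available pretrained model.
# "call.wav" is a placeholder file name.
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio, _ = librosa.load("call.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits    # (batch, time, vocab)
transcript = processor.batch_decode(logits.argmax(dim=-1))[0]
print(transcript)
```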
5. Self-supervised Learning
One major limitation of traditional speech recognition systems is that they need to be trained extensively on thousands of hours of carefully labeled speech data. This data is extremely expensive and time-consuming to collect, and must span a wide variety of speakers, ages, genders, regional accents, and so on. With the advent of self-supervised learning, it became possible to build speech recognition systems that learn on their own from far more widely available unlabeled data. This technique has had a huge impact on the field, allowing systems trained with less labeled data to achieve superior accuracy and performance.
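To make the idea concrete, here is a highly simplified sketch of one self-supervised objective: mask some feature frames and train a model to reconstruct them from context, using only unlabeled audio. Real systems use more elaborate contrastive or quantization-based objectives; this only illustrates the principle:

```python
# Masked-frame prediction on unlabeled feature frames (simplified sketch).
import torch
import torch.nn as nn

class MaskedFramePredictor(nn.Module):
    def __init__(self, n_features=13, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.reconstruct = nn.Linear(2 * hidden, n_features)

    def forward(self, frames):
        context, _ = self.encoder(frames)
        return self.reconstruct(context)

model = MaskedFramePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

frames = torch.randn(8, 200, 13)            # a batch of unlabeled feature frames
mask = torch.rand(8, 200, 1) < 0.15         # hide roughly 15% of the frames
masked_input = frames.masked_fill(mask, 0.0)

predicted = model(masked_input)
loss = ((predicted - frames) ** 2)[mask.expand_as(frames)].mean()  # loss only on masked frames
loss.backward()
optimizer.step()
```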
Challenges In Speech Recognition
Speech recognition involves many technical challenges that must be accounted for when building systems that are robust in real-world conditions. These include:
1. Speaker Variability
An important challenge in speech recognition is the variability of speakers. Human speech is highly variable and is affected by numerous factors, including age, gender, and accent.
2. Acoustic Variability
Speech is also highly variable in terms of the acoustic characteristics of the audio signal. This includes acoustic intensity (whether the person is speaking loudly or softly), background noise, and speaking rate.
3. Lexical Variability
Human speech is also highly variable in terms of the vocabulary that is used. Languages such as English contain many hundreds of thousands of words (millions if you include proper nouns). Other lexical challenges include words that have similar pronunciations (e.g., “read” vs. “reed”) and colloquialisms.
Importance Of Large Datasets in Training
Modern speech recognition systems are data-hungry beasts, requiring massive datasets consisting of many tens of thousands of hours of carefully labeled audio. Quality training datasets must have several key characteristics, including:
1. Tens Of Millions Of Annotated Examples
One of the biggest challenges in building a speech recognition system is training it on enough data. Building a speech recognition system requires tens of millions of examples of transcribed speech from human speakers, and collecting that much data is difficult and expensive.
2. Diverse And Broad Training Material
The data needs to be diverse and broadly representative of the different types of speech that the system will encounter. The system needs to understand people speaking on the go – while driving, while working out, and so on – and it needs to understand different types of people speaking at different speeds about different topics.
3. Widely Varying Accents, Vocabulary And Speech Style
The system needs to be trained on data from all types of individuals, including people with different accents and people with different styles of speaking. Languages today contain many hundreds of thousands of words, and new words are constantly being created. Examples of all of these variations are needed for the system to be trained effectively.
4. Varying Background Noise And Compression
Finally, the training data needs to include examples of people speaking over varying amounts of background noise. The system needs to accurately recognize speech while car engines are running, vacuum cleaners are operating, and so on. Reverb and echo are also important to model, so that the system doesn’t make mistakes when, for example, someone is speaking over a speakerphone. Compression artifacts and other distortions of the audio signal are also an important part of the problem.
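One common way to build this variability into the training data is augmentation – for example, mixing recorded background noise into clean speech at a chosen signal-to-noise ratio. Here is a minimal sketch (file names are placeholders):

```python
# Mix background noise into clean speech at a target SNR (in dB).
import numpy as np
import librosa

def mix_at_snr(speech, noise, snr_db):
    # Loop or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

speech, _ = librosa.load("clean_speech.wav", sr=16000)
noise, _ = librosa.load("vacuum_cleaner.wav", sr=16000)
augmented = mix_at_snr(speech, noise, snr_db=10)
```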
5. Different Types Of Speech
In addition to variations in speakers, acoustics, and vocabulary, the system also needs to be trained on many different types of speech. For example, the system needs to be able to recognize speech from meetings, one-on-one conversations, TV shows, and more. In short, it needs to understand speech in a wide variety of contexts.
Challenges Of Transcribing Calls And Meetings
Transcribing calls and meetings is one of the most technically difficult problems in the speech recognition space, due to specific technical challenges that do not arise in more traditional use cases such as dictation or medical transcription. These include:
1. Spontaneous Speech
Calls and meetings often contain unexpected elements, such as spontaneous comments and questions. The cadence and delivery of spontaneous speech is often different from that of prepared speech. Furthermore, spontaneous speech can come from multiple speakers who are difficult to distinguish in real time.
2. Overtalk
It’s common for people to speak over each other during a conversation. This is especially true for meetings, where many people often speak at roughly the same time. The system needs to be able to recognize a speaker’s words even when others are talking over them.
3. Disfluencies
People often interrupt themselves with fillers like “uhm” and “uh”. They may also speak in fragmented sentences, or have other disfluencies such as run-on sentences and false starts. The system needs to be able to understand speech that is “broken”.
4. High-level Understanding Of Context
Because of the high degree of variability in human speech, an important goal of speech recognition is the ability to understand the high-level meaning of the speech and not just the exact words being spoken. This is often referred to as understanding the “intent” of the speaker.
What To Look For In A Speech Recognition Solution
There are many factors to consider when choosing a speech recognition solution that is right for your business, including:
1. Speaker Independent
Systems should be able to provide a high level of performance for many different speakers, without the need for training on each individual speaker.
2. Support For Accents And Dialects
The system should have support for different dialects and accents – for example, transcription of speakers from different US cities and states, or of non-native speakers.
3. Wide Vocabulary Support
The system should be able to transcribe words from a very large vocabulary, including words that are rarely used, such as words from specialized domains like medicine, law, or science. Vocabularies today contain many hundreds of thousands of words.
4. Fast Processing
The system should be able to quickly and accurately transcribe speech from calls and meetings. It’s common for transcribed calls to contain thousands or even tens of thousands of words.
5. Low-latency Transcription
It’s important for the system to produce results in near real-time, so that the transcription can be used to drive business decisions. For example, insights from a transcription of a call can be provided to a sales force so that they can quickly take action on a potential new sale.
Hyperia's Speech Technology
Hyperia's machine learning team has invested a massive amount of time and effort in building one of the most advanced speech engines available today. Reasons our solution is state-of-the-art include:
1. Unsupervised Pre-training
Hyperia uses unsupervised pre-training to learn from hundreds of thousands of hours of audio from a wide variety of sources. The pre-training yields an understanding of audio that serves as a more powerful feature representation than traditionally engineered input features such as mel spectrograms. Unsupervised pre-training is an important part of how Hyperia was able to achieve the state of the art in speech recognition.
2. Neural Network Architecture
Hyperia uses state-of-the-art neural network architectures for performing speech recognition. These architectures result in large, custom-built speech models containing many hundreds of millions of parameters.
3. Gigantic Training Dataset
Hyperia leverages the giant dataset of hundreds of thousands of hours of audio used in its pre-training process to train the neural network model. We then perform supervised training using many tens of thousands of hours of carefully labeled speech data, gathered from a wide variety of speakers with different ages, genders, dialects, and vocabularies.
Why Transcription Is Just Part Of The Problem
Accurately transcribing speech to text is just one of the problems that must be solved to successfully capture calls and meetings in a manner that will lead to actionable business insights. Other things that matter include:
1. Speaker Identification
Speaker identification is critical if a system needs to associate a transcription with a particular speaker.
2. Diarization In Calls
A call or meeting can contain a lot of back-and-forth activity. To identify which speaker is speaking at any given time, it’s important to detect when speaker changes occur – a process known as diarization.
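As a very rough sketch of the change-point idea (real diarization systems use learned speaker embeddings and clustering, not raw MFCC averages), adjacent windows of audio can be compared and flagged when they look acoustically different:

```python
# Crude speaker-change detection: compare spectral summaries of adjacent windows.
import numpy as np
import librosa

def change_points(audio, sr=16000, window_s=1.0, threshold=0.15):
    hop = int(window_s * sr)
    summaries = []
    for start in range(0, len(audio) - hop, hop):
        mfcc = librosa.feature.mfcc(y=audio[start:start + hop], sr=sr, n_mfcc=13)
        summaries.append(mfcc.mean(axis=1))

    changes = []
    for i in range(1, len(summaries)):
        a, b = summaries[i - 1], summaries[i]
        distance = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10)
        if distance > threshold:                  # windows look acoustically different
            changes.append(i * window_s)          # time (seconds) of a likely change
    return changes
```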
3. NLP For Conversations
Natural language processing (NLP) is an important tool for analyzing the content of human conversations. In many cases, after a transcription is produced, further NLP processing needs to be applied to generate insights from the transcribed conversation. For example, it’s common to use NLP to identify topics, facts, and opinions. Specialized NLP systems must be developed to deal with the complexities of conversations.
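As a small, hedged example of this kind of post-processing, an off-the-shelf NLP library can pull crude topics (noun phrases) and named entities out of a transcript; production conversation analytics would use models tuned for spoken, multi-speaker language:

```python
# Crude topic and entity extraction from a transcript with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")   # requires the small English model to be installed
transcript = "Let's move the product launch to March and loop in the Berlin team."

doc = nlp(transcript)
topics = {chunk.text.lower() for chunk in doc.noun_chunks}
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(topics)    # e.g. {'the product launch', 'the berlin team'}
print(entities)  # e.g. [('March', 'DATE'), ('Berlin', 'GPE')]
```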
4. Generating Searchable Knowledge
The information contained in a conversation can be difficult to search. To make information more easily searchable, it’s possible to use NLP to generate knowledge graphs from the transcribed conversation. For example, a knowledge graph for a meeting might include things like who said what about which topics at which times.
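The exact representation varies from system to system, but a toy sketch of the kind of record such a graph might be built from (with purely illustrative field names and values) could look like this:

```python
# A toy "who said what about which topic at what time" record.
from dataclasses import dataclass

@dataclass
class Statement:
    speaker: str
    topic: str
    quote: str
    timestamp_s: float

graph = [
    Statement("alice", "product launch", "We should move the launch to March.", 312.5),
    Statement("bob", "budget", "Marketing spend is already over plan.", 745.0),
]

# A simple query: everything a given speaker said about a given topic.
launch_notes = [s for s in graph if s.speaker == "alice" and s.topic == "product launch"]
```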
A Final Word
Speech recognition is an extremely difficult problem. In this article, we’ve just scratched the surface. There are many nuances and details that are beyond the scope of this overview. These include the details of things like acoustic processing, acoustic modeling, feature engineering, decoding, etc. However, we hope that this article has provided a high level overview of how speech recognition works, why it’s an important technology for businesses, and how Hyperia is using state of the art technology to produce advanced solutions for call and meeting transcription.
This is the most exciting time to be a speech recognition researcher. The amount of recent progress in the field is truly remarkable, and has already produced dramatic improvements in the accuracy, performance, and cost of speech recognition systems. One of the most exciting things about the field is that it’s still early days. The advances over the last few years have been driven by deep learning and neural networks, but there is still much left to be discovered. For example, unsupervised pre-training has only recently been used in production systems, but it’s already proven to be incredibly effective. Who knows what the next revolutionary approach will be?