Overview of Speech Recognition Systems
A speech recognition system is a technology that takes spoken input and converts it into text that a computer can process.
Speech recognition systems are used in many applications, including spoken dialogue systems, interactive voice response (IVR) systems, voice command control systems, and text editors that accept spoken input. In each case, speech recognition lets users enter information and interact with computers by voice.
A speech recognition system typically works in the following steps:
- Receive speech input: speech is received using a device such as a microphone or telephone.
- Speech preprocessing: The speech data is preprocessed, for example by removing noise and normalizing the speech signal.
- Feature extraction: Useful features are extracted from the speech signal. A common choice is Mel-frequency cepstral coefficients (MFCCs).
- Input to speech recognition model: The features are input to the speech recognition model, and the model is applied to convert speech into text.
- Text generation: Recognized text information is generated from the output of the speech recognition model.
- Application processing: The recognized text information is applied to the appropriate application or system, such as sending a text message, executing a command, or generating a search query.
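The preprocessing step in the list above can be sketched roughly as follows. This is a minimal illustration rather than a complete front end: it assumes the audio is already available as a NumPy array of float samples, the function name normalize_signal is hypothetical, and noise removal is left out. It only removes the DC offset and scales the signal to a peak magnitude of 1.

```python
import numpy as np

def normalize_signal(audio, eps=1e-10):
    """Remove the DC offset and scale the signal so its peak magnitude is about 1."""
    audio = np.asarray(audio, dtype=np.float64)
    centered = audio - audio.mean()      # remove the constant (DC) offset
    peak = np.max(np.abs(centered))      # largest absolute sample value
    return centered / (peak + eps)       # eps guards against an all-zero input

# Synthetic input for illustration: a quiet 440 Hz tone with a DC offset
sr = 16000
t = np.arange(sr) / sr
raw = 0.05 * np.sin(2 * np.pi * 440 * t) + 0.2
clean = normalize_signal(raw)
print(clean.mean(), np.max(np.abs(clean)))
```

Peak normalization is only one possible choice; RMS or loudness-based normalization are common alternatives.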
Speech recognition systems are now commonly trained using deep learning techniques, which use large training data sets to learn the correspondence between speech and text to achieve high recognition accuracy. Advanced signal processing methods and model optimization are also used to meet requirements such as real-time performance and noise immunity.
Next, we describe how this speech recognition system is built.
How to create a voice recognition system
The general approach and steps for creating a speech recognition system are as follows:
- Data collection and preparation: Training a speech recognition system requires a large amount of speech data. Speech is therefore collected from a variety of speakers, accents, and environments, and each recording is paired with its correct text transcription (label).
- Data preprocessing: Speech data is preprocessed to extract features. Common preprocessing includes the following steps:
- Segment the audio data into frames.
- Perform a short-time Fourier transform (STFT) on each frame to obtain the frequency spectrum.
- Extract features such as Mel-frequency cepstral coefficients (MFCCs) and mel filter bank energies from the frequency spectrum.
 
- Model building: Build a model for speech recognition. Common methods include the following.
- GMM-HMM models that combine hidden Markov models (HMM) and Gaussian mixture models (GMM)
- Deep learning models (recurrent neural networks, convolutional neural networks, the transformer described in “Overview of Transformer Models, Algorithms, and Examples of Implementations”, etc.)
- End-to-End (E2E) models (using CTC, Attention, Transformer, etc.)
 
- Model training and evaluation: The data is split into training and test sets to train and evaluate the model. Training algorithms include Stochastic Gradient Descent (SGD), described in “Overview of Stochastic Gradient Descent (SGD), its algorithms and examples of implementation”, as well as Adam and Adagrad. As training progresses, the model's performance is evaluated and hyperparameters are adjusted as needed.
- Inference: Inference is performed on new speech data using the trained model. The new audio goes through the same preprocessing and feature extraction as the training data, and the model's predictions are output as text transcriptions or word sequences.
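To make the inference step more concrete for an end-to-end model trained with CTC, the sketch below shows greedy decoding: pick the most likely symbol in each frame, merge consecutive repeats, then drop blanks. The toy vocabulary, the blank index of 0, and the function name ctc_greedy_decode are assumptions made for illustration. Real systems often use beam search, sometimes with a language model, instead.

```python
import numpy as np

def ctc_greedy_decode(log_probs, vocab, blank=0):
    """Greedy CTC decoding for an array of shape (num_frames, vocab_size)."""
    best = np.argmax(log_probs, axis=1)   # most likely symbol per frame
    decoded = []
    prev = None
    for idx in best:
        # Emit a symbol only when it differs from the previous frame and is not blank
        if idx != prev and idx != blank:
            decoded.append(vocab[idx])
        prev = idx
    return "".join(decoded)

# Hypothetical 4-symbol vocabulary; index 0 is the CTC blank
vocab = ["_", "a", "c", "t"]
frame_ids = [2, 2, 0, 1, 1, 0, 3, 3]      # c c _ a a _ t t
probs = np.full((len(frame_ids), len(vocab)), 0.1)
probs[np.arange(len(frame_ids)), frame_ids] = 0.7
text = ctc_greedy_decode(np.log(probs), vocab)
print(text)  # cat
```

The blank symbol is what lets CTC output a doubled letter: "a _ a" decodes to "aa", while "a a" decodes to "a".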
The implementation of a speech recognition system consists of the above steps: data preparation, preprocessing, model building, training, and inference. In each step, it is important to select appropriate algorithms and models and adjust hyperparameters. In actual implementation, libraries and frameworks (e.g., Kaldi, TensorFlow, PyTorch) can be used to streamline the work.
The following sections describe these libraries and frameworks in detail.
Libraries and frameworks for creating speech recognition systems
The following frameworks and libraries are commonly used to create speech recognition systems.
- Kaldi: Kaldi is a widely used open source framework for speech recognition. It is written in C++ and provides the tools and libraries needed to build reliable speech recognition models; it can also be used from Python through wrapper libraries.
- TensorFlow: TensorFlow is a popular framework for machine learning and deep learning. It also provides a powerful toolset for building speech recognition models.
- PyTorch: PyTorch is another popular machine learning and deep learning framework. With its dynamic computational graphs and an API well suited to research and prototyping, users can build flexible, high-performance speech recognition models.
- OpenSeq2Seq: OpenSeq2Seq is NVIDIA’s open source framework for speech recognition and natural language processing, built around the Seq2Seq model described in “Overview of the Seq2Seq (Sequence-to-Sequence) model and examples of algorithms and implementations”. It supports fast GPU-based training and inference and provides state-of-the-art model architectures (e.g., Transformer) and data augmentation methods.
Next, we describe an example implementation of a specific speech recognition system using Python.
Implementation in Python
A speech recognition system in Python is implemented in the following steps:
- Loading speech data: First, speech data must be loaded. Generally, audio files in WAV or MP3 format are used; audio files can be easily loaded using Librosa or Pydub, Python’s speech processing libraries.
<Example implementation in Librosa>
Librosa is a Python speech processing library that can easily read, analyze, and convert speech data. Below is a basic implementation example for loading voice data using Librosa.
import librosa
# Audio file path
audio_path = 'path/to/audio/file.wav'
# Loading voice data
audio, sr = librosa.load(audio_path)
# Displays information about the loaded audio data
print("sampling rate:", sr)  # Sampling rate (samples/second)
print("Length of voice data:", len(audio))  # Length of voice data (number of samples)
# Playback of audio data (requires a separate playback library; sounddevice is used here)
import sounddevice as sd
sd.play(audio, sr)
sd.wait()  # Block until playback has finished
In the above code, the librosa.load function loads the specified audio file and returns the audio data as a NumPy array together with the sampling rate (samples/second). By default, librosa.load converts the audio to mono and resamples it to 22,050 Hz; pass sr=None to keep the file's original sampling rate. Next, information about the loaded audio is displayed, where len(audio) is the length of the audio data in samples. Finally, the audio is played back using the sounddevice library; sd.play returns immediately, so sd.wait is called to block until playback finishes. Librosa itself does not play audio, so a separate playback library must be installed for the environment in which the code runs.
This code is a simple example that reads the specified audio file, displays the information, and plays it back; Librosa has a variety of audio processing functions that allow for more advanced analysis and feature extraction.
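To illustrate what "audio data and sampling rate" mean without depending on Librosa, the following sketch writes and reads a 16-bit mono PCM WAV file using only the standard library wave module and NumPy. The helper names write_wav and read_wav are hypothetical, and unlike librosa.load this sketch does no resampling and handles only mono 16-bit files.

```python
import os
import tempfile
import wave
import numpy as np

def write_wav(path, samples, sr):
    """Write mono float samples in [-1, 1] as 16-bit little-endian PCM."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2")
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # mono
        wf.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wf.setframerate(sr)
        wf.writeframes(pcm.tobytes())

def read_wav(path):
    """Read a mono 16-bit PCM WAV file; return (float samples, sampling rate)."""
    with wave.open(path, "rb") as wf:
        sr = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    samples = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    return samples, sr

# One second of a 440 Hz tone, written to a temporary file and read back
sr_in = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr_in) / sr_in)
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_wav(path, tone, sr_in)
audio, sr = read_wav(path)
duration = len(audio) / sr      # duration in seconds = samples / sampling rate
print(sr, len(audio), duration)
```

The duration of a recording in seconds is the number of samples divided by the sampling rate, which is why both values are needed by later steps such as frame segmentation.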
- Data preprocessing: Audio data must be preprocessed to extract features.
- Divide the audio data into frames.
- Perform a short-time Fourier transform (STFT) on each frame to obtain the frequency spectrum.
- Extract features such as Mel-frequency cepstral coefficients (MFCCs) and mel filter bank energies from the frequency spectrum.
 
<For feature extraction by Mel-frequency cepstrum coefficients>
Mel Frequency Cepstral Coefficients (MFCCs) are widely used features in signal processing and in the acoustic recognition of speech and music. The following describes the procedure for extracting MFCCs, from frame segmentation through frequency spectrum transformation of a speech signal.
- Splitting the speech signal into frames: Before extracting the frequency spectrum, the speech signal is split into short frames (typically 20 to 40 ms). The frame size determines the trade-off between time and frequency resolution: longer frames resolve frequencies more finely but localize events in time less precisely.
- Apply a window function to the frames: A window function (usually a Hamming window) is applied to each frame to taper its edges toward zero. This reduces the spectral leakage caused by the abrupt discontinuities at both ends of the frame.
- Calculate the frequency spectrum of the frames: For each frame, the frequency spectrum is calculated using the Fast Fourier Transform (FFT). This yields the amplitude spectrum for each frame.
- Create Mel Filter Banks: Create filter banks evenly spaced on the mel scale. The mel scale is a nonlinear scale based on human auditory characteristics.
- Apply the mel filter banks to the frequency spectrum: For the frequency spectrum of each frame, the mel filter banks are applied and the energy in each band is calculated. This transforms the frequency spectrum into a mel-frequency spectrum.
- Calculate Mel-frequency cepstral coefficients: The Discrete Cosine Transform (DCT) is applied to the logarithm of the mel-frequency spectrum to obtain the MFCCs. Usually, only the lowest-order coefficients (around 12 or 13) are retained and the higher-order ones are discarded.
- Compute dynamic features of the MFCCs (optional): In addition to the static features of each frame, dynamic features such as frame-to-frame differences (delta) and second-order differences (delta-delta) may be appended to the MFCCs.
Following the above procedure, mel-frequency cepstrum coefficients can be extracted from the frequency spectrum. This technique is commonly used in tasks such as speech recognition, speaker identification, and music information retrieval.
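The optional dynamic features in the last step can be sketched as follows. This is a simplified illustration that uses a central difference between neighboring frames; the function name delta is hypothetical, and production toolkits typically fit a regression over a wider window of frames. The MFCC matrix here is random placeholder data with shape (num_frames, num_ceps).

```python
import numpy as np

def delta(features):
    """Central difference along the time axis; output has the same shape as the input."""
    # Repeat the first and last frame so the edges have neighbors on both sides
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

# Placeholder MFCCs: 100 frames x 13 coefficients
mfcc = np.random.default_rng(0).normal(size=(100, 13))
d1 = delta(mfcc)                      # frame-to-frame differences (delta)
d2 = delta(d1)                        # second-order differences (delta-delta)
full = np.hstack([mfcc, d1, d2])      # static + delta + delta-delta
print(full.shape)  # (100, 39)
```

Stacking 13 static coefficients with their delta and delta-delta gives the 39-dimensional feature vector often seen in classical GMM-HMM systems.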
Next, we describe the specific implementation of these procedures. The implementation is described in two stages. One is “Python implementation of frame segmentation using STFT,” which covers everything from frame segmentation to frequency spectrum transformation, and the other is “Python implementation of feature extraction using Mel-frequency cepstrum coefficients,” which covers feature extraction using MFCC.
<Implementation in python of frame segmentation using STFT>
This section describes an example implementation of frame segmentation of speech data using the STFT (Short-Time Fourier Transform). The STFT splits time-domain speech data into short overlapping frames and computes the frequency spectrum of each frame; the sequence of these spectra forms a spectrogram.
import numpy as np
import librosa
def frame_split_stft(audio, frame_length, hop_length):
    # Calculate STFT
    stft = librosa.stft(audio, n_fft=frame_length, hop_length=hop_length)
    
    # Convert complex spectrograms to amplitude spectrograms
    magnitude = np.abs(stft)
    
    return magnitude
The above code defines the frame_split_stft function, which takes three arguments: audio (input audio data), frame_length (frame length in samples), and hop_length (the number of samples between the starts of successive frames).
Inside the function, librosa.stft computes the STFT of the input audio data, with the n_fft argument specifying the frame length and the hop_length argument specifying the step between frames; consecutive frames therefore overlap by frame_length - hop_length samples. The computed STFT is a complex spectrogram, so np.abs is used to take its absolute value and obtain the amplitude spectrogram, which is returned. The returned array has shape (1 + frame_length // 2, number of frames).
With this implementation example, the speech data is divided into frames by STFT based on the specified frame length and hop length, and the amplitude spectrogram is obtained. The amplitude spectrogram contains information in the frequency domain and can be used for tasks such as speech processing and speech recognition.
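To show what happens inside the STFT, the sketch below performs the same steps by hand with NumPy only: slice the signal into frames, apply a Hamming window, and take the FFT of each frame. It is an illustration under simplifying assumptions, not a replacement for librosa.stft: it does no padding or centering, it uses a Hamming window whereas librosa.stft defaults to a Hann window, and it returns frames along the first axis. The function name manual_stft_magnitude and the 25 ms / 10 ms frame settings are choices made here.

```python
import numpy as np

def manual_stft_magnitude(audio, frame_length, hop_length):
    """Return magnitudes with shape (num_frames, frame_length // 2 + 1)."""
    num_frames = 1 + (len(audio) - frame_length) // hop_length
    window = np.hamming(frame_length)
    # Slice the signal into overlapping frames and taper each one with the window
    frames = np.stack([
        audio[i * hop_length : i * hop_length + frame_length] * window
        for i in range(num_frames)
    ])
    # One-sided FFT of each frame, then magnitude
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16000
frame_length = round(0.025 * sr)   # 25 ms -> 400 samples
hop_length = round(0.010 * sr)     # 10 ms -> 160 samples
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 1000 * t)            # 1 kHz test tone, 1 second
mag = manual_stft_magnitude(audio, frame_length, hop_length)
peak_hz = np.argmax(mag[0]) * sr / frame_length  # frequency of the strongest bin
print(mag.shape, peak_hz)  # (98, 201) 1000.0
```

Each bin is sr / frame_length = 40 Hz wide in this example, so a longer frame gives finer frequency resolution at the cost of coarser time resolution.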
<Implementation in python of feature extraction using Mel-frequency cepstrum coefficients>
The following is a basic implementation for extracting Mel Frequency Cepstrum Coefficients (MFCCs) from a frequency spectrum.
import numpy as np
from scipy.fftpack import dct
# Function to convert Hertz to mel frequency
def hz_to_mel(hz):
    return 2595 * np.log10(1 + hz / 700)
# Function to convert mel frequencies to Hertz
def mel_to_hz(mel):
    return 700 * (10**(mel / 2595) - 1)
# Function to create a mel filter bank for a one-sided spectrum with num_bins bins
def mel_filter_bank(num_filters, num_bins, sample_rate):
    # Lower and upper frequency limits of the filter bank (0 Hz to the Nyquist frequency)
    min_mel = hz_to_mel(0)
    max_mel = hz_to_mel(sample_rate / 2)
    # Center frequencies equally spaced on the mel scale
    mel_points = np.linspace(min_mel, max_mel, num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    # Map each frequency to a bin index (bin num_bins - 1 is the Nyquist frequency)
    bins = np.floor((num_bins - 1) * hz_points / (sample_rate / 2)).astype(int)
    # Build one triangular filter per mel band
    filter_bank = np.zeros((num_filters, num_bins))
    for i in range(1, num_filters + 1):
        lower, center, upper = bins[i - 1], bins[i], bins[i + 1]
        filter_bank[i - 1, lower:center] = np.linspace(0, 1, center - lower, endpoint=False)
        filter_bank[i - 1, center:upper + 1] = np.linspace(1, 0, upper - center + 1)
    return filter_bank
# Function to extract MFCCs from a magnitude spectrum of shape (num_frames, num_bins)
def extract_mfccs(spectrum, sample_rate, num_filters=20, num_ceps=13):
    spectrum = np.atleast_2d(spectrum)  # Treat a 1-D spectrum as a single frame
    # Create mel filter bank
    filter_bank = mel_filter_bank(num_filters, spectrum.shape[1], sample_rate)
    # Calculate log mel frequency spectrum (1e-10 avoids taking the log of zero)
    mel_spectrum = np.log10(np.dot(spectrum, filter_bank.T) + 1e-10)
    # Apply the Discrete Cosine Transform (DCT-II) and keep the first num_ceps coefficients
    mfccs = dct(mel_spectrum, type=2, axis=1, norm='ortho')[:, :num_ceps]
    return mfccs
# Example magnitude spectrum (random placeholder data: 100 frames x 257 bins, i.e. n_fft = 512)
spectrum = np.random.rand(100, 257)
# Sample rate setting
sample_rate = 44100
# Extraction of MFCCs
mfccs = extract_mfccs(spectrum, sample_rate)
print(mfccs.shape)  # (num_frames, num_ceps) = (100, 13)
In this code, the mel_filter_bank function creates the mel filter bank and the extract_mfccs function extracts MFCCs from a magnitude spectrum of shape (num_frames, num_bins). The number of mel filters and the number of MFCC dimensions can be specified as arguments. Note that the amplitude spectrogram returned by librosa.stft has shape (num_bins, num_frames), so it must be transposed before being passed to extract_mfccs.
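For reference, the mel conversions and the cosine transform used in the code above can be written as follows. The symbols are chosen here for illustration: f is frequency in Hz, m is the corresponding mel value, E_j is the energy of the j-th of M mel filters, and c_k is the k-th of K retained cepstral coefficients. The transform is shown up to a constant scaling factor, since the exact scale depends on the normalization option passed to the DCT routine.

```latex
m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \qquad
f = 700\left(10^{m/2595} - 1\right)

c_k = \sum_{j=0}^{M-1} \log_{10}(E_j)\,
      \cos\!\left[\frac{\pi k}{M}\left(j + \tfrac{1}{2}\right)\right],
\qquad k = 0, 1, \dots, K-1
```

With the defaults in the code, M = 20 and K = 13.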
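- Model building: It is common to use deep learning frameworks (e.g., TensorFlow or PyTorch) to build speech recognition models. The following is a simple TensorFlow (Keras) example of a feed-forward classifier over feature vectors.
import numpy as np
import tensorflow as tf
# Placeholder settings and random data for illustration; replace with real features and labels
input_dim = 13      # e.g., number of MFCCs per frame
num_classes = 10    # number of target classes
X_train = np.random.rand(320, input_dim).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(num_classes, size=320), num_classes)
# Model Definition
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(input_dim,)))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))
# Model Compilation
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Model Training
model.fit(X_train, y_train, epochs=10, batch_size=32)
- Model training and evaluation: Labeled audio data is used to train the model. The data is split into training and test sets so that model performance can be evaluated on data not seen during training.
- Inference: Inference is performed on new speech data using the trained model. The new audio goes through the same feature extraction as the training data, and the model outputs the recognition result.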
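For the evaluation step, one widely used metric for speech recognition output is the word error rate (WER): the word-level edit distance between the reference transcription and the recognized text, divided by the number of reference words. The text above does not name a metric, so choosing WER here is an assumption, and the function name word_error_rate is hypothetical. A minimal pure-Python sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dist[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

# One deletion ("on") and one substitution ("light" -> "lights") over 5 reference words
wer = word_error_rate("turn on the kitchen light", "turn the kitchen lights")
print(wer)  # 0.4
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.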
Reference Information and Reference Books
For more information on voice recognition technology, please refer to “Speech Recognition Technology”.
Reference books include “Automatic Speech Recognition: A Deep Learning Approach”, “Robust Automatic Speech Recognition: A Bridge to Practical Applications”, and “Audio Processing and Speech Recognition: Concepts, Techniques and Research Overviews”.
1. Introductory and Foundational Texts
Speech and Language Processing
By Daniel Jurafsky and James H. Martin
- A comprehensive textbook covering speech recognition, language modeling, and natural language processing (NLP).
- The 3rd edition (draft) includes modern deep learning-based approaches.
Fundamentals of Speech Recognition
By Lawrence Rabiner and Biing-Hwang Juang
- A classic text that covers the fundamentals of speech recognition, especially statistical approaches using Hidden Markov Models (HMMs).
- Highly recommended for understanding the foundations of acoustic modeling and pattern recognition in speech.
2. Speech and Audio Signal Processing
Speech and Audio Signal Processing: Processing and Perception of Speech and Music
By Ben Gold, Nelson Morgan, and Dan Ellis
- Focuses on the signal processing techniques for both speech and music, including spectral analysis, filter banks, and feature extraction.
3. Deep Learning Approaches
Automatic Speech Recognition: A Deep Learning Approach
By Dong Yu and Li Deng
- A more practical guide for building modern speech recognition systems using deep learning techniques.
4. Practical and Real-World Systems
By Zhen-Hua Ling, Lei Xie
- Discusses building speech recognition and synthesis systems that work in real-world noisy conditions, with a focus on robustness and evaluation methods.
5. Hands-on and Implementation-Focused
Technical Report: A Practical Guide to Kaldi ASR Optimization
- A practical guide to using Kaldi, an open-source speech recognition toolkit.
- Includes tutorials on data preparation, training, decoding, and adaptation.
 
  
  
  
  