Speech Recognition Technology


About Speech Recognition Technology

Speech recognition technology involves the automatic recognition of human speech by a computer and its conversion into text format. This technology has developed significantly in recent years and is widely used in smartphones, smart speakers, and business applications with speech recognition capabilities.

Speech recognition technology is realized by combining various technological elements such as signal processing, A/D conversion, acoustic modeling, language modeling, real-time data processing, and machine learning. Since speech signals are analog, they are first digitized through A/D conversion so that a computer can process them; the digital signal is then transformed into the frequency domain with the Fourier transform, and features are extracted by spectral analysis. By combining acoustic models, language models, and other components, the speech is segmented into recognition units and converted into text data using speech recognition algorithms such as deep learning.
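As a minimal sketch of this front-end processing (assuming the librosa library and a placeholder file path), the following digitizes speech at 16 kHz and extracts a spectral representation with the short-time Fourier transform:

```python
import numpy as np
import librosa

# Load the waveform, resampled to 16 kHz; librosa returns the digitized
# samples as floats in [-1, 1]. "sample.wav" is a placeholder path.
y, sr = librosa.load("sample.wav", sr=16000)

# Short-time Fourier transform with a 25 ms analysis window and 10 ms hop,
# the framing commonly used in speech front-ends.
stft = librosa.stft(y, n_fft=512, win_length=400, hop_length=160)

# Log-power spectrogram: the kind of spectral feature fed to acoustic models.
spectrogram = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
print(spectrogram.shape)  # (frequency bins, time frames)
```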

These speech recognition algorithms can be broadly classified into the following categories:

  • Acoustic model-based methods: These methods use acoustic models to extract frame-by-frame acoustic features from speech waveforms, which are then used for speech recognition. Typical methods include the GMM-HMM method using Hidden Markov Models (HMM) and the DNN-HMM method using Deep Neural Networks (DNN). Acoustic models reflect language-dependent characteristics, and speech recognition is performed by combining acoustic features of speech with language models.
  • Neural network-based methods: These methods use deep neural networks to perform speech recognition, typically models based on Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM). They take speech waveforms directly as input and extract acoustic and linguistic features simultaneously for speech recognition.
  • End-to-end models: These methods perform speech-to-text conversion in a single step, directly mapping speech to text using a Seq2Seq model, described in “Overview of the Seq2Seq (Sequence-to-Sequence) model and examples of algorithms and implementations,” with a deep neural network (such as the Transformer) trained on speech-text pairs. This approach requires a large amount of training data, but it achieves higher recognition accuracy than the other methods and has become the most common in recent years (an illustrative sketch follows this list).
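As an illustrative sketch of the end-to-end approach, the following uses a publicly available pretrained Wav2Vec2 model through the Hugging Face transformers library; the model name and file path are examples, not a specific system described above:

```python
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# A pretrained end-to-end acoustic model trained with CTC on speech-text pairs.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load 16 kHz speech ("sample.wav" is a placeholder path).
speech, sr = librosa.load("sample.wav", sr=16000)
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")

# A single forward pass maps the waveform directly to character logits.
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding turns the logits into text.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```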

Speech recognition technology is not only an input technology that digitizes the user's voice for a computer; by recognizing and analyzing the sounds generated by IoT devices and machines as well as environmental sounds, it can also be used to build services that understand the operating status of equipment and the surrounding environment. It is thus a technology that can realize more advanced human-machine interaction.

This blog covers a wide range of topics from the basics of this speech recognition technology to its various applications.

Implementation

A speech recognition system is a technology that converts human speech into a form that can be understood by a computer. This section describes the procedure for building a speech recognition system and a concrete implementation using Python.

  • Preprocessing for speech recognition

Preprocessing for speech recognition is the step of converting speech data into a format that can be fed into a model so that training and inference work effectively; it requires several preprocessing methods, such as those sketched below.
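A minimal sketch of such a pipeline (assuming librosa; the file path is a placeholder) resamples, normalizes, trims silence, and extracts MFCC features:

```python
import librosa

# 1. Load and resample to the fixed rate the model expects.
y, sr = librosa.load("speech.wav", sr=16000)

# 2. Amplitude normalization to reduce recording-level variation.
y = librosa.util.normalize(y)

# 3. Trim leading and trailing silence.
y, _ = librosa.effects.trim(y, top_db=30)

# 4. Extract MFCCs, a common model input.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, time frames)
```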

The Seq2Seq (Sequence-to-Sequence) model is a deep learning model that takes sequence data as input and outputs sequence data; in particular, it is an approach that can handle input and output sequences of different lengths. It is widely used in a variety of natural language processing tasks such as machine translation and dialogue systems.
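A minimal encoder-decoder sketch in PyTorch, with illustrative names and dimensions rather than those of the referenced article:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal GRU encoder-decoder: input and output lengths may differ."""
    def __init__(self, in_vocab, out_vocab, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(in_vocab, hidden)
        self.tgt_emb = nn.Embedding(out_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, out_vocab)

    def forward(self, src, tgt):
        # Encode the source sequence into a context vector.
        _, context = self.encoder(self.src_emb(src))
        # Decode with teacher forcing, conditioned on the context.
        dec_out, _ = self.decoder(self.tgt_emb(tgt), context)
        return self.out(dec_out)

model = Seq2Seq(in_vocab=100, out_vocab=120)
src = torch.randint(0, 100, (2, 7))  # batch of 2, source length 7
tgt = torch.randint(0, 120, (2, 5))  # target length 5 differs from source
print(model(src, tgt).shape)         # torch.Size([2, 5, 120])
```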

The main approaches to using artificial intelligence techniques to extract emotions include (1) natural language processing, (2) speech recognition, (3) image recognition, and (4) biometric analysis. These methods are combined with machine learning and deep learning algorithms, and emotions are typically detected using models trained on large amounts of data. Approaches that combine different modalities (text, voice, images, biometric information, etc.) to understand emotions comprehensively can also achieve higher accuracy.

Automatic machine learning (AutoML) refers to methods and tools for automating the process of designing, training, and optimizing machine learning models. AutoML is particularly useful for users with limited machine learning expertise or those seeking to develop models efficiently. This section provides an overview of AutoML and examples of various implementations.

Contrastive Predictive Coding (CPC) is a representation learning technique used to learn semantically important representations from audio and image data. This method is a form of unsupervised learning, in which representations are learned by contrasting different observations in the training data.
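At the heart of CPC is the InfoNCE loss, which scores the true future representation against negatives drawn from the batch. A schematic computation with illustrative shapes (not a full CPC model):

```python
import torch
import torch.nn.functional as F

B, D = 16, 64
c_t = torch.randn(B, D)        # context vectors from an autoregressive model
z_future = torch.randn(B, D)   # encoder outputs k steps ahead
W = torch.randn(D, D, requires_grad=True)  # learned prediction transform

# Score every (context, future) pair; diagonal entries are the positives.
scores = c_t @ W @ z_future.t()  # (B, B) similarity matrix
labels = torch.arange(B)         # the true future of sample i is column i

# InfoNCE: classify the true future among the batch of negatives.
loss = F.cross_entropy(scores, labels)
loss.backward()
```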

Similarity is a concept that describes the degree to which two or more objects or things have common features or properties and are considered similar to each other, and plays an important role in evaluating, classifying, and grouping objects in terms of comparison and relatedness. This section describes the concept of similarity and general calculation methods for various cases.
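As one of the most common calculation methods, cosine similarity between two feature vectors can be computed as follows (a minimal NumPy sketch):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1 for the same direction, 0 for orthogonal vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(x, y))  # 1.0: the vectors point the same way
```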

The issue of having only a small amount of training data (small data) appears in various tasks as a factor that reduces the accuracy of machine learning. Machine learning with small data can be approached in various ways, taking into account data limitations and the risk of overfitting. This section discusses the details of each approach and implementation examples.
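One such approach is data augmentation; a minimal sketch for speech, assuming librosa and a placeholder file path, generates new training variants from a single recording:

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # placeholder path

# Time stretching: same content, different speaking rate.
slower = librosa.effects.time_stretch(y, rate=0.9)

# Pitch shifting: same content, different speaker pitch.
higher = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Additive noise: improves robustness to recording conditions.
noisy = y + 0.005 * np.random.randn(len(y))
```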

Self-Supervised Learning is a type of machine learning and can be considered as a type of supervised learning. While supervised learning uses labeled data to train models, self-supervised learning uses the data itself instead of labels to train models. This section describes various algorithms, applications, and implementations of self-supervised learning.

This section provides an overview of the Python library Keras and examples of its application to basic deep learning tasks (handwriting recognition using MNIST, autoencoders, CNNs, RNNs, and LSTMs).
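For reference, a minimal Keras example for the MNIST handwriting-recognition task mentioned above:

```python
from tensorflow import keras

# Load MNIST and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully connected classifier.
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=3, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```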

Sparse modeling is a technique that takes advantage of sparsity in the representation of signals and data. Sparsity refers to the property that non-zero elements in data or signals are limited to a very small portion. The purpose of sparse modeling is to efficiently represent data by utilizing sparsity, and to perform tasks such as noise removal, feature selection, and compression.

This section provides an overview of sparse modeling algorithms such as Lasso, compressed sensing, Ridge regularization, elastic nets, Fused Lasso, group regularization, message passing algorithms, and dictionary learning, and describes their implementation in various applications such as image processing, natural language processing, recommendation, machine learning, signal processing, and brain science.
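As a minimal example of the sparsity idea, Lasso regression with scikit-learn recovers a sparse coefficient vector from noisy observations:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# The true model uses only 3 of the 20 features (a sparse signal).
coef = np.zeros(20)
coef[[0, 5, 12]] = [3.0, -2.0, 1.5]
y = X @ coef + 0.1 * rng.normal(size=100)

# The L1 penalty drives most estimated coefficients exactly to zero.
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.nonzero(lasso.coef_)[0])  # mostly the true support {0, 5, 12}
```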

A topic model is a statistical model for automatically extracting topics (themes or categories) from large amounts of text data. Examples of such text data include news articles, blog posts, tweets, and customer reviews. A topic model analyzes the pattern of word occurrences in the data to estimate which topics are present and how strongly each word relates to each topic.

This section provides an overview of this topic model and various implementations (topic extraction from documents, social media analysis, recommendations, topic extraction from image information, and topic extraction from music information), mainly using the python library.
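A minimal sketch of topic extraction with latent Dirichlet allocation via scikit-learn, on an illustrative toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the football match",
    "the election results surprised voters",
    "the striker scored a late goal",
    "parliament debated the new policy",
]

# Bag-of-words counts, then LDA with 2 topics.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# The top words per topic reveal the estimated themes.
words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [words[i] for i in topic.argsort()[-4:]])
```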

Robust Principal Component Analysis (RPCA) is a method for finding a basis in data, and is characterized by its robustness to data containing outliers and noise. This section describes various applications of RPCA and a concrete implementation using Python.
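One standard RPCA algorithm is principal component pursuit solved by ADMM; the following NumPy sketch uses conventional default parameters and is illustrative rather than the implementation from the referenced article:

```python
import numpy as np

def rpca(M, max_iter=100):
    """Decompose M into low-rank L plus sparse S (principal component pursuit)."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))         # standard sparsity weight
    mu = m * n / (4.0 * np.abs(M).sum())   # common step-size heuristic
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    shrink = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
    for _ in range(max_iter):
        # Low-rank update: singular value thresholding.
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(shrink(s, 1.0 / mu)) @ Vt
        # Sparse update: elementwise soft thresholding absorbs the outliers.
        S = shrink(M - L + Y / mu, lam / mu)
        Y += mu * (M - L - S)              # dual ascent step
    return L, S

# Low-rank data corrupted by a few large outliers.
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 50))
M[rng.integers(0, 50, 20), rng.integers(0, 50, 20)] += 10.0
L, S = rpca(M)
```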

Online prediction is a technique that uses models to make predictions in real time as data arrive sequentially. Online learning, as described in “Overview of Online Learning, Various Algorithms, Application Examples, and Specific Implementations,” is characterized by models being learned sequentially, without the immediacy of applying the model being clearly defined; online prediction, by contrast, is characterized by predictions being made immediately upon the arrival of new data and the results being put to use.

This section discusses various applications and specific implementation examples of online prediction.
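A minimal sketch with scikit-learn's SGDRegressor: each arriving sample is predicted first, then immediately used to update the model:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01)
rng = np.random.default_rng(0)

# Simulated stream: y = 2*x0 - x1 + noise, arriving one sample at a time.
for t in range(500):
    x = rng.normal(size=(1, 2))
    y = np.array([2 * x[0, 0] - x[0, 1] + 0.1 * rng.normal()])
    if t > 0:
        y_hat = model.predict(x)  # predict immediately when data arrives
    model.partial_fit(x, y)       # then update the model sequentially
```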

RNN (Recurrent Neural Network) is a type of neural network for modeling time-series and sequence data; it can retain past information and combine it with new information. It is a widely used approach for a variety of tasks such as speech recognition, natural language processing, video analysis, and time-series prediction.

LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) and a very effective deep learning model, mainly for time-series data and natural language processing (NLP) tasks. LSTM can retain historical information and model long-term dependencies, making it suitable for learning long-term as well as short-term information.
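A minimal Keras sketch of an LSTM for sequence classification; the data and shapes are dummies for illustration:

```python
import numpy as np
from tensorflow import keras

# Dummy data: 200 sequences, 30 time steps, 8 features each.
X = np.random.randn(200, 30, 8)
y = np.random.randint(0, 2, size=200)

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(30, 8)),  # keeps the final hidden state
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32)
```

Wrapping the LSTM layer in keras.layers.Bidirectional gives the bidirectional variant discussed below.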

  • About GRU (Gated Recurrent Unit)

GRU (Gated Recurrent Unit) is a type of recurrent neural network (RNN) that is widely used in deep learning models, especially for processing time-series and sequence data. The GRU is designed to model long-term dependencies in the same way as the LSTM (Long Short-Term Memory) described in “Overview of LSTM and Examples of Algorithms and Implementations,” but it is characterized by a lower computational cost than the LSTM.

  • About Bidirectional RNN (BRNN)

Bidirectional Recurrent Neural Network (BRNN) is a type of recurrent neural network (RNN) model that can consider past and future information simultaneously. BRNN is particularly useful for processing sequence data and is widely used in tasks such as natural language processing and speech recognition.

  • Overview of Pointer-Generator Networks, Algorithms, and Examples of Implementations

The Pointer-Generator network is a type of deep learning model used in natural language processing (NLP) tasks, and is particularly suited to tasks such as abstractive sentence generation, summarization, and information extraction from documents. The network is characterized by its ability to copy portions of text from the original document verbatim when generating sentences.

Theory

Mechanism of sound perception and relationship between speech features (spectrum, volume) and linguistic features

On the A/D conversion of speech, analysis windows, the Fourier transform, and vector quantization for extracting features for speech recognition from speech.

On the theory of discrete word recognition and continuous word recognition using dynamic programming
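The dynamic-programming core of such word matching is dynamic time warping (DTW); a minimal NumPy sketch:

```python
import numpy as np

def dtw_distance(a, b):
    """Align two sequences of feature vectors with dynamic programming."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            # DP recurrence: best of match, insertion, and deletion.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# Two utterances of different lengths can still be aligned and compared.
x = np.random.randn(40, 13)  # e.g., 40 frames of 13-dimensional MFCCs
y = np.random.randn(55, 13)
print(dtw_distance(x, y))
```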

Theory and Algorithms for Speech Recognition Using Hidden Markov Models

On natural language processing techniques such as n-gram models for dealing with the complexity of word sequences.
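A minimal sketch of the simplest case, a bigram (n = 2) model estimated by counting:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams and their left contexts.
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

# Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1).
def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / contexts[w1] if contexts[w1] else 0.0

print(bigram_prob("the", "cat"))  # 2/3: "the" is followed by "cat" twice
```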

Large vocabulary continuous speech recognition (LVCSR) is a technique for recognizing unconstrained general speech. The current mainstream of LVCSR is based on HMMs.

First, we discuss learning and recognition using subword recognition units, a method that uses subwords obtained by segmenting words as the target of recognition. In actual large-vocabulary continuous recognition, the recognition dictionary typically contains about 30,000 words. An HMM would have to be prepared for each word, but since most of these words appear infrequently, it is difficult to collect a sufficient amount of training data for them.

Spectral subtraction, cepstral mean normalization, and microphone-array approaches for removing noise from speech.

Speaker Adaptation by Transfer Learning of Generative Models and Speaker Recognition by MAP Estimation in Speech Recognition

Application of deep learning to speech recognition (RNN, bi-directional RNN (BRNN), GMM-HMM, MLP-HMM)

Nonnegative matrix factorization (NMF), like linear dimensionality reduction, is a method for mapping data to a low-dimensional subspace. As the name suggests, the model assumes non-negativity for the observed data and all of its unobserved variables. Non-negative matrix factorization can be applied to any non-negative data, and can be used to compress and interpolate image data in the same way that linear dimensionality reduction is used.

When handling audio data in the frequency domain using the fast Fourier transform, a better representation can often be obtained with a model that assumes non-negativity. Moreover, since much of the data in recommendation algorithms and natural language processing can be assumed to be non-negative, a wide range of applications are being attempted. Various probabilistic models have been proposed for non-negative matrix factorization; here we construct a model using the Poisson distribution and the gamma distribution.
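For reference, a minimal scikit-learn sketch of NMF applied to a non-negative magnitude spectrogram; note that this optimizes a Frobenius objective rather than the Poisson-gamma probabilistic model constructed here:

```python
import numpy as np
from sklearn.decomposition import NMF

# A non-negative "spectrogram": 257 frequency bins x 100 frames (dummy data).
V = np.abs(np.random.randn(257, 100))

# Factorize V ≈ W @ H: W holds spectral basis vectors, H their activations.
model = NMF(n_components=8, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)  # (257, 8) spectral dictionary
H = model.components_       # (8, 100) time activations

print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative error
```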

Feature extraction algorithms in Music Informatics aim at deriving statistical and semantic information directly from audio signals. These may range from energies in several frequency bands to musical information such as key, chords, or rhythm. There is an increasing diversity and complexity of features and algorithms in this domain, and applications call for a common structured representation to facilitate interoperability, reproducibility, and machine interpretability. We propose a solution relying on Semantic Web technologies that is designed to serve a dual purpose: (1) to represent computational workflows of audio features, and (2) to provide a common structure for feature data to enable the use of Linked Open Data principles and technologies in Music Informatics. The Audio Feature Ontology is based on an analysis of existing tools and the music informatics literature, which was instrumental in guiding the ontology engineering process. The ontology provides a descriptive framework for expressing different conceptualisations of the audio feature extraction domain and enables designing linked data formats for representing feature data. In this paper, we discuss important modelling decisions and introduce a harmonised ontology library consisting of modular interlinked ontologies that describe the different entities and activities involved in music creation, production and publishing.

  • Overview of Multi-Task Learning and Examples of Applications and Implementations

Multi-Task Learning is a machine learning method that simultaneously learns multiple related tasks. Usually, each task has a different data set and objective function, but Multi-Task Learning aims to incorporate these tasks into a model at the same time so that they can complement each other by utilizing their mutual relevance and shared information.

Here, we provide an overview of methods such as shared-parameter models, model distillation, transfer learning, and multi-objective optimization for multi-task learning, and discuss examples of applications in natural language processing, image recognition, speech recognition, and medical diagnosis, as well as a simple implementation in Python.
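A minimal Keras sketch of the shared-parameter approach, with one shared trunk feeding two task-specific heads (all names, shapes, and data are illustrative):

```python
import numpy as np
from tensorflow import keras

# Shared trunk: both tasks reuse the same learned representation.
inputs = keras.Input(shape=(20,))
shared = keras.layers.Dense(64, activation="relu")(inputs)

# Task-specific heads: a classifier and a regressor.
cls_out = keras.layers.Dense(3, activation="softmax", name="cls")(shared)
reg_out = keras.layers.Dense(1, name="reg")(shared)

model = keras.Model(inputs, [cls_out, reg_out])
model.compile(optimizer="adam",
              loss={"cls": "sparse_categorical_crossentropy", "reg": "mse"},
              loss_weights={"cls": 1.0, "reg": 0.5})

# Dummy data: both tasks share the same inputs.
X = np.random.randn(100, 20)
model.fit(X, {"cls": np.random.randint(0, 3, size=100),
              "reg": np.random.randn(100)}, epochs=2)
```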
