Pre-processing for speech recognition
Speech recognition preprocessing converts raw speech data into a form that a model can learn from and perform inference on effectively. The following preprocessing methods are commonly required.
Conversion of speech data:
Changing the sampling rate: In many cases, the sampling rate of the audio data is changed to a common value; 16 kHz is typical for speech recognition, and 48 kHz is also used.
import librosa

# Load the audio data, resampling it to 16 kHz
audio_data, sample_rate = librosa.load("audio.wav", sr=16000)
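If the audio has already been loaded at its native rate, it can also be resampled afterwards. The following is a minimal sketch using librosa.resample; the 16 kHz target is an assumed value, not one prescribed by the text.
import librosa

# Load at the file's native sampling rate
y_native, orig_sr = librosa.load("audio.wav", sr=None)
# Resample to an assumed target rate of 16 kHz
y_16k = librosa.resample(y_native, orig_sr=orig_sr, target_sr=16000)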
Frame Segmentation:
Short-Time Fourier Transform (STFT): splits the audio data into short, overlapping frames and computes the frequency components of each frame. This makes it possible to capture both time and frequency information.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Calculate the STFT and convert the magnitude to decibels
spectrogram = librosa.amplitude_to_db(np.abs(librosa.stft(audio_data)), ref=np.max)
# Display the spectrogram
librosa.display.specshow(spectrogram, sr=sample_rate, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.show()
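The frame length and hop size determine the trade-off between time and frequency resolution. As a rough sketch, the STFT can be computed with explicit framing parameters; the 25 ms window and 10 ms hop below are assumed values often used for speech at 16 kHz, not parameters taken from the example above.
import librosa

# Assumed framing: 25 ms windows with a 10 ms hop at 16 kHz
n_fft = 400        # 25 ms * 16000 Hz
hop_length = 160   # 10 ms * 16000 Hz
stft = librosa.stft(audio_data, n_fft=n_fft, hop_length=hop_length)
print(stft.shape)  # (1 + n_fft // 2, number of frames)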
Feature Extraction: Extracts features such as mel spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs) in order to obtain useful information from the audio data.
Mel-Frequency Filter Bank: A mel spectrogram is obtained by applying a mel filter bank to the spectrum of the audio data.
# Mel filter bank calculation
mel_spectrogram = librosa.feature.melspectrogram(y=audio_data, sr=sample_rate, n_mels=128)
# Display the mel spectrogram
librosa.display.specshow(librosa.power_to_db(mel_spectrogram, ref=np.max), y_axis='mel', fmax=8000, x_axis='time')
plt.colorbar(format='%+2.0f dB')
plt.show()
Log Mel Filterbank Energies: Extracts energy by applying a Mel filterbank and taking the logarithm.
# Calculation of log mel filter bank energies (mel spectrogram on a log scale; 40 bands is a common choice)
log_mel_energy = librosa.power_to_db(librosa.feature.melspectrogram(y=audio_data, sr=sample_rate, n_mels=40))
# Display the log mel filter bank energies
librosa.display.specshow(log_mel_energy, sr=sample_rate, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.show()
Mel-Frequency Cepstral Coefficients (MFCC): MFCCs are generally considered effective features for speech recognition tasks.
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the audio data at its native sampling rate
input_audio_file = "input_audio.wav"
y, sr = librosa.load(input_audio_file, sr=None)
# MFCC extraction
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
# Display the MFCCs
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.show()
In this example, the librosa.feature.mfcc function is used to extract the MFCCs, which are then displayed with the librosa.display.specshow function. Internally, the MFCC computation applies a mel filter bank, takes the logarithm of the energies, and applies a discrete cosine transform. The n_mfcc parameter specifies the number of coefficients to extract; 13 coefficients are commonly used, but different values may be appropriate for different practical applications.
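In practice, MFCCs are often augmented with their first- and second-order time derivatives (delta and delta-delta features). The following is a minimal sketch using librosa.feature.delta, continuing from the mfccs array computed above.
import numpy as np
import librosa

# First- and second-order differences of the MFCCs over time
delta = librosa.feature.delta(mfccs)
delta2 = librosa.feature.delta(mfccs, order=2)
# Stack into a single (3 * n_mfcc, number of frames) feature matrix
features = np.vstack([mfccs, delta, delta2])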
Noise Reduction:
Speech data often contains various types of background noise, so filtering and noise reduction techniques are applied to suppress it.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
# Noise reduction by spectral subtraction
def apply_spectral_subtraction(y, noise, alpha=2.0):
    # Obtain the audio and noise spectra
    D = librosa.stft(y)
    N = librosa.stft(noise)
    # Average noise magnitude spectrum (one value per frequency bin)
    noise_mag = np.mean(np.abs(N), axis=1, keepdims=True)
    # Spectral subtraction
    magnitude = np.abs(D) - alpha * noise_mag
    magnitude = np.maximum(magnitude, 0.0)  # Clip negative values to 0
    # Inverse STFT with the original phase to obtain the noise-reduced audio
    y_reduced = librosa.istft(magnitude * np.exp(1j * np.angle(D)))
    return y_reduced

# Load the audio data
input_audio_file = "input_audio.wav"
y, sr = librosa.load(input_audio_file, sr=None)
# Load the noise data (e.g. a recording of the background noise alone)
noise_audio_file = "background_noise.wav"
noise, _ = librosa.load(noise_audio_file, sr=sr, duration=len(y) / sr)
# Apply noise reduction
y_reduced = apply_spectral_subtraction(y, noise, alpha=2.0)
# Display in graph
plt.figure(figsize=(12, 8))
plt.subplot(3, 1, 1)
librosa.display.waveshow(y, sr=sr)
plt.title('Original Audio')
plt.subplot(3, 1, 2)
librosa.display.waveshow(noise, sr=sr)
plt.title('Noise')
plt.subplot(3, 1, 3)
librosa.display.waveshow(y_reduced, sr=sr)
plt.title('Reduced Noise')
plt.tight_layout()
plt.show()
In this example, the apply_spectral_subtraction function performs spectral subtraction to reduce noise. The noise profile is prepared in advance from a separate recording, so a real application would require modifications such as estimating the noise in real time. The noise estimate and the subtraction parameter alpha also need tuning, and it is important to find values appropriate for the actual data.
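If no separate noise recording is available, one common workaround is to assume that the first few hundred milliseconds of the recording contain only noise and to estimate the noise profile from that segment. The sketch below reuses apply_spectral_subtraction from above; the 0.5-second duration is an assumption, not a value from the text.
# Assume the first 0.5 seconds of the recording are noise only
noise_duration = 0.5
noise_segment = y[: int(noise_duration * sr)]
# Use that segment as the noise profile for spectral subtraction
y_reduced = apply_spectral_subtraction(y, noise_segment, alpha=2.0)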
Normalization:
Normalize the amplitude (and, where necessary, the frequency range) of the speech data so that the model can be trained on a more consistent scale.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def normalize_audio(audio_file):
    # Load the audio data
    y, sr = librosa.load(audio_file, sr=None)
    # Get the maximum amplitude
    max_amplitude = np.max(np.abs(y))
    # Peak normalization to the range [-1, 1]
    normalized_audio = y / max_amplitude
    return y, normalized_audio, sr

# Usage example
input_audio_file = "sample_audio.wav"
y, normalized_audio, sr = normalize_audio(input_audio_file)
# Waveform of the audio before normalization
plt.figure(figsize=(12, 4))
plt.subplot(2, 1, 1)
librosa.display.waveshow(y, sr=sr)
plt.title('Original Audio')
# Waveform of the audio after normalization
plt.subplot(2, 1, 2)
librosa.display.waveshow(normalized_audio, sr=sr)
plt.title('Normalized Audio')
plt.tight_layout()
plt.show()
In this example, the librosa library is used to load the speech data and obtain its maximum amplitude, which is then used to normalize the signal so that the normalized audio falls within the range [-1, 1].
Normalization allows the model to absorb differences in amplitude between recordings and to learn more robustly. However, the normalization method and target range should be adjusted depending on the task and the nature of the data.
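As an alternative to peak normalization, loudness can be equalized by scaling each recording to a target root-mean-square (RMS) level. The following is a minimal sketch; the target RMS of 0.1 is a hypothetical value, not one taken from the text.
import numpy as np

def rms_normalize(y, target_rms=0.1):
    # Scale the signal so that its RMS matches the target level
    rms = np.sqrt(np.mean(y ** 2))
    return y * (target_rms / (rms + 1e-8))

rms_normalized_audio = rms_normalize(y)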
Language Model Integration:
Use language models to better understand the linguistic context of speech data. This improves recognition accuracy.
import speech_recognition as sr

def integrate_language_model(audio_file, language_model):
    # Load the audio data
    r = sr.Recognizer()
    with sr.AudioFile(audio_file) as source:
        audio_data = r.record(source)
    # Perform speech recognition
    recognized_text = r.recognize_google(audio_data)
    # Post-process the result with the language model
    processed_text = process_with_language_model(recognized_text, language_model)
    return processed_text

def process_with_language_model(text, language_model):
    # Process based on the language model here,
    # for example adding keywords, considering context, or semantic analysis.
    # As a hypothetical example, replace the word "hospital" with "medical facility".
    processed_text = text.replace("hospital", "medical facility")
    return processed_text

# Usage example
audio_file = "sample_audio.wav"
language_model = "sample_language_model"
processed_text = integrate_language_model(audio_file, language_model)
print("Processed Text:", processed_text)
In this example, speech recognition is performed using the speech_recognition library, and the recognized text is then post-processed by the process_with_language_model function. As a hypothetical example, if the word "hospital" appears in the text, it is simply replaced with "medical facility" to illustrate where language model integration would take place.
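A more realistic pattern is to rescore several candidate transcriptions with a statistical language model and keep the most probable one. The sketch below uses made-up bigram log-probabilities as a toy language model; the candidates and scores are hypothetical and are not produced by the code above.
# Hypothetical bigram log-probabilities acting as a toy language model
bigram_logprob = {
    ("go", "to"): -0.5, ("to", "the"): -0.4, ("the", "hospital"): -1.0,
    ("two", "the"): -6.0, ("the", "hospice"): -5.0,
}
UNKNOWN = -8.0  # back-off score for unseen bigrams

def lm_score(sentence):
    # Sum bigram log-probabilities over the word sequence
    words = sentence.lower().split()
    return sum(bigram_logprob.get(pair, UNKNOWN) for pair in zip(words, words[1:]))

# Candidate transcriptions, e.g. from an n-best list of the recognizer
candidates = ["go to the hospital", "go two the hospital", "go to the hospice"]
best = max(candidates, key=lm_score)
print(best)  # the candidate the toy language model scores highest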
Reference Information and Reference Books
For more information on speech recognition technology, please refer to "Speech Recognition Technology".
Reference books include "Automatic Speech Recognition: A Deep Learning Approach",
"Robust Automatic Speech Recognition: A Bridge to Practical Applications", and
"Audio Processing and Speech Recognition: Concepts, Techniques and Research Overviews".