Preprocessing in Natural Language Processing
Natural language processing (NLP) preprocessing prepares raw text data in a form suitable for machine learning models and analysis algorithms. Because no model or algorithm performs well on every kind of data, choosing appropriate preprocessing is an important requirement for the success of NLP tasks. Typical NLP preprocessing methods are described below; in practice they are applied on a trial-and-error basis according to the characteristics of the data and the task.
Text Lowercasing
Convert all text to lowercase to keep word forms consistent and reduce vocabulary size. This ensures that “Apple” and “apple” are treated as the same word.
<Example Implementation>
The following is a Python example that performs text lowercasing.
def lowercase_text(text):
    """
    Function to convert text to lowercase

    Parameters:
    - text (str): Text to be converted to lowercase

    Returns:
    - str: Text converted to lowercase
    """
    return text.lower()
# Text Example
example_text = "This is an EXAMPLE Text."
# Text Lowercasing
lowercased_text = lowercase_text(example_text)
# Display Results
print("Original Text:", example_text)
print("Lowercased Text:", lowercased_text)
In this example, the lowercase_text function is defined to convert the given text to lowercase, which allows words with different case to be treated as the same.
Tokenization
Splits text into tokens (words, punctuation, etc.). Tokenization can be based on simple whitespace splitting or performed with tools such as NLTK or spaCy.
<Example Implementation>
The following is an example implementation of tokenization using the Natural Language Toolkit (NLTK), a Python natural language processing library.
import nltk
from nltk.tokenize import word_tokenize
# Download the data you need from NLTK
nltk.download('punkt')
def tokenize_text(text):
    """
    Function to tokenize text

    Parameters:
    - text (str): Text to be tokenized

    Returns:
    - list: List of tokens
    """
    tokens = word_tokenize(text)
    return tokens
# Text Example
example_text = "Tokenization is an important step in natural language processing."
# Tokenization of text
tokens = tokenize_text(example_text)
# Display Results
print("Original Text:", example_text)
print("Tokens:", tokens)
In this example, NLTK's word_tokenize function is used to split the given text into words; nltk.download('punkt') is run first to download the data needed for tokenization. There are various other approaches to tokenization, and libraries such as spaCy and Stanford NLP are also used in addition to NLTK.
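As a supplementary illustration, the following is a minimal sketch of tokenization with spaCy. It assumes spaCy is installed and the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm); the helper name spacy_tokenize is illustrative.
import spacy

# Load the small English pipeline (assumes it has been downloaded beforehand)
nlp = spacy.load('en_core_web_sm')

def spacy_tokenize(text):
    """Tokenize text with spaCy and return the tokens as strings."""
    doc = nlp(text)
    return [token.text for token in doc]

print(spacy_tokenize("Tokenization is an important step in natural language processing."))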
Stopword Removal
Remove common words that carry little meaning (stopwords). This can improve the efficiency of analysis and modeling.
<Example Implementation>
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download the data you need from NLTK
nltk.download('stopwords')
nltk.download('punkt')
def remove_stopwords(text):
    """
    Function to remove stopwords from text

    Parameters:
    - text (str): Text from which stopwords are removed

    Returns:
    - str: Text with stopwords removed
    """
    # Get the NLTK English stopword list
    stop_words = set(stopwords.words('english'))
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    # Join the remaining tokens back into a string
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
# Text Example
example_text = "This is an example sentence with some stop words."
# Remove stopwords
text_without_stopwords = remove_stopwords(example_text)
# Display Results
print("Original Text:", example_text)
print("Text without Stopwords:", text_without_stopwords)
In this example, the NLTK stopwords set is used to obtain English stopwords and remove them from the given text.
Removing Special Characters and Numbers
Punctuation, special characters, and numbers are removed to reduce noise.
<Example of Implementation>
The following is an example implementation of removing special characters and numbers using Python.
import re
def remove_special_characters(text):
    """
    Function to remove special characters and numbers from text

    Parameters:
    - text (str): Text from which special characters and numbers are removed

    Returns:
    - str: Text with special characters and numbers removed
    """
    # Remove everything except letters and whitespace using a regular expression
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)
    return cleaned_text
# Text Example
example_text = "This is an example sentence with 123 and some special characters!@#"
# Delete special characters and numbers
text_without_special_chars = remove_special_characters(example_text)
# Display Results
print("Original Text:", example_text)
print("Text without Special Characters:", text_without_special_chars)
In this example, the re.sub() function replaces every character that matches the regular expression pattern [^a-zA-Z\s] (i.e., anything other than letters and whitespace) with the empty string. The result is clean text with special characters and numbers removed.
Word Normalization (Stemming and Lemmatization)
Converts words to their base forms. Stemming reduces a word to its stem, while lemmatization converts a word to its dictionary base form (lemma); both allow variants of a word to be treated as the same word.
<Example Implementation>
The following is an example implementation of Stemming and Lemmatization using the NLTK library in Python.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Download the data you need from NLTK
nltk.download('punkt')
nltk.download('wordnet')
def stemming_example(text):
    """
    Example of stemming text

    Parameters:
    - text (str): Text to be stemmed

    Returns:
    - str: Text with each word reduced to its stem
    """
    # Initialize the Porter stemmer
    stemmer = PorterStemmer()
    # Tokenize text
    tokens = word_tokenize(text)
    # Extract the stem of each word
    stemmed_text = ' '.join([stemmer.stem(word) for word in tokens])
    return stemmed_text

def lemmatization_example(text):
    """
    Example of lemmatizing text into its base form

    Parameters:
    - text (str): Text to be converted to base forms

    Returns:
    - str: Text converted to base forms
    """
    # Initialize the WordNet lemmatizer
    lemmatizer = WordNetLemmatizer()
    # Tokenize text
    tokens = word_tokenize(text)
    # Convert each word to its base form
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in tokens])
    return lemmatized_text
# Text Example
example_text = "Running, runners, ran: they all run on the race."
# Example of Stem Extraction
stemmed_text = stemming_example(example_text)
# Example of conversion to basic form
lemmatized_text = lemmatization_example(example_text)
# Display Results
print("Original Text:", example_text)
print("Stemmed Text:", stemmed_text)
print("Lemmatized Text:", lemmatized_text)
In this example, NLTK's PorterStemmer is used for stem extraction, and WordNetLemmatizer is used for conversion to the base form. This converts different variants of a word to a common form, reducing vocabulary size.
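One caveat: WordNetLemmatizer treats words as nouns by default, so verb forms such as “running” are not reduced to “run” unless a part of speech is supplied. The following is a minimal sketch of passing a part-of-speech tag explicitly via the pos argument of NLTK's lemmatize method.
import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet data if it is not already available
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Without a part-of-speech hint, the word is treated as a noun
print(lemmatizer.lemmatize('running'))           # -> 'running'
# With pos='v', the verb is reduced to its base form
print(lemmatizer.lemmatize('running', pos='v'))  # -> 'run'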
Using N-grams
By considering combinations of two or more consecutive words (bigrams, trigrams, etc.), context can be captured more accurately.
<Example Implementation>
Below is a simple example implementation of generating N-grams using Python.
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize

# Download the data needed for tokenization
nltk.download('punkt')

def generate_ngrams(text, n):
    """
    Function to generate n-grams from text

    Parameters:
    - text (str): Text from which to generate n-grams
    - n (int): The N of the n-gram (number of tokens per n-gram)

    Returns:
    - list: List of generated n-grams
    """
    # Tokenize text
    tokens = word_tokenize(text)
    # Generate n-grams
    n_grams = list(ngrams(tokens, n))
    return n_grams
# Text Example
example_text = "Natural Language Processing is a subfield of artificial intelligence."
# Example of 2-gram
ngrams_2 = generate_ngrams(example_text, 2)
# Example of 3-gram
ngrams_3 = generate_ngrams(example_text, 3)
# Display Results
print("Original Text:", example_text)
print("2-gram:", ngrams_2)
print("3-gram:", ngrams_3)
In this example, NLTK's ngrams function is used to generate n-grams of the specified size N from the given text. This makes it easier to capture contextual information based on word combinations, and by adjusting the size N of the n-grams, different amounts of context can be extracted.
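As a supplementary sketch, n-grams are often used as features when vectorizing text. The following assumes scikit-learn is installed and uses CountVectorizer with its ngram_range parameter (get_feature_names_out is available in recent scikit-learn versions); the example corpus is illustrative.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Natural Language Processing is a subfield of artificial intelligence.",
    "Artificial intelligence includes natural language processing."
]

# Count unigrams and bigrams together
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the unigram and bigram vocabulary
print(X.toarray())                         # document-term count matrix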
Encoding
Encoding converts text data into numerical data. For example, word embeddings may be used.
<Example of Implementation>
Word embeddings are commonly used to convert text data into numerical data as a preprocessing step in natural language processing (NLP). Below is a simple example implementation of word embeddings using Python; the spaCy library is used here. First, install spaCy:
pip install spacy
Next, download an English model (e.g., en_core_web_sm).
python -m spacy download en_core_web_sm
The following is an example of encoding implementation using word embedding.
import spacy
# Load spaCy models
nlp = spacy.load('en_core_web_sm')
def text_encoding(text):
    """
    Function to convert text to word embeddings

    Parameters:
    - text (str): Text to be converted

    Returns:
    - list: List of word embedding vectors (one numpy array per token)
    """
    # Parse the text and obtain each token
    doc = nlp(text)
    # Obtain a word embedding vector for each token
    word_embeddings = [token.vector for token in doc]
    return word_embeddings
# Text Example
example_text = "Natural Language Processing is fascinating."
# Conversion to word embedding
embeddings = text_encoding(example_text)
# Display Results
print("Original Text:", example_text)
print("Word Embeddings:", embeddings)
In this example, the text is analyzed using the spaCy model to obtain a word embedding vector for each word. This converts the text into numerical data. Note that although this example obtains the word embedding for each word, there is also a method to obtain the embedding for the entire sentence.
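As a minimal sketch of the sentence-level alternative mentioned above, spaCy's Doc object exposes a vector attribute that averages the token vectors; the helper name sentence_embedding is illustrative.
import spacy

nlp = spacy.load('en_core_web_sm')

def sentence_embedding(text):
    """Return a single vector for the whole text (average of the token vectors)."""
    doc = nlp(text)
    return doc.vector

embedding = sentence_embedding("Natural Language Processing is fascinating.")
print(embedding.shape)  # dimensionality of the sentence vector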
Processing Missing Data
When text data is missing, it is important to handle it appropriately. Missing values may be removed or supplemented with appropriate alternative methods.
<Example Implementation>
Below is a simple example implementation of processing missing data using Python.
def handle_missing_data(text):
    """
    Function to process missing data in text

    Parameters:
    - text (str): Text to be processed

    Returns:
    - str: Text with missing data processed
    """
    # As an example, remove the missing-data marker by replacing it with an empty string
    processed_text = text.replace('[MISSING]', '')
    return processed_text
# Example text (assuming missing data is included)
example_text_with_missing_data = "This is an example [MISSING] with missing data."
# Processing of missing data
processed_text = handle_missing_data(example_text_with_missing_data)
# Display Results
print("Original Text:", example_text_with_missing_data)
print("Processed Text:", processed_text)
This example shows how a particular missing-data marker in the text (here, [MISSING]) is replaced with an empty string. The appropriate handling depends on the characteristics of the missing data and the context. Other possible approaches include replacing the missing data with another word or phrase, imputing it, or deleting the sentence that contains it; the method should be chosen according to the characteristics of the data.
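For tabular text data, such handling is often done at the dataset level. The following is a minimal sketch, assuming pandas is available and a hypothetical 'text' column, that shows dropping rows with missing text or filling them with a placeholder string.
import pandas as pd

# Hypothetical dataset with a missing text entry
df = pd.DataFrame({'text': ["This is a sentence.", None, "Another sentence."]})

# Option 1: remove rows whose text is missing
df_dropped = df.dropna(subset=['text'])

# Option 2: fill missing text with a placeholder string
df_filled = df.fillna({'text': '[UNKNOWN]'})

print(df_dropped)
print(df_filled)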
Text Length Alignment
Padding or trimming text to a fixed length can be used to match the input size of a model.
<Example of Implementation>
Common ways to adjust text length include the use of padding or trimming. This is a technique to ensure that the length of the text matches the input size of the model. Below is an example implementation of adjusting text length using Python.
def adjust_text_length(text, max_length):
    """
    Function to adjust text length

    Parameters:
    - text (str): Text to be adjusted
    - max_length (int): Target maximum length

    Returns:
    - str: Length-adjusted text
    """
    # Add padding if the text is shorter than the target maximum length
    if len(text) < max_length:
        padded_text = text + ' ' * (max_length - len(text))
        return padded_text
    # Trim if the text is longer than the target maximum length
    elif len(text) > max_length:
        trimmed_text = text[:max_length]
        return trimmed_text
    # If the text length equals the target maximum length, return it unchanged
    else:
        return text
# Text Example
example_text = "This is an example sentence."
# Adjust text length (e.g., set maximum length to 10)
adjusted_text = adjust_text_length(example_text, max_length=10)
# Display Results
print("Original Text:", example_text)
print("Adjusted Text:", adjusted_text)
In this example, the adjust_text_length function is used to adjust the length of the text to the target maximum length. If the text is shorter than the target maximum length, padding is added; if longer, trimming is performed. If the target maximum length and the text length are the same, the text is returned unchanged.
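In practice, length alignment is usually applied to token sequences rather than raw character strings. The following is a minimal sketch (the pad_tokens helper and the '<PAD>' token are illustrative) that pads or truncates a token list to a fixed length.
def pad_tokens(tokens, max_length, pad_token='<PAD>'):
    """Pad a token list with pad_token or truncate it to max_length."""
    if len(tokens) < max_length:
        return tokens + [pad_token] * (max_length - len(tokens))
    return tokens[:max_length]

tokens = ["this", "is", "an", "example", "sentence"]
print(pad_tokens(tokens, 8))  # padded to 8 tokens
print(pad_tokens(tokens, 3))  # truncated to 3 tokens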
Reference Information and Reference Books
For more information on natural language processing in general, see “Natural Language Processing Technology” and “Overview of Natural Language Processing and Examples of Various Implementations”.
Reference books include “Natural Language Processing (NLP): Unleashing the Power of Human Communication through Machine Intelligence”,
“Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems”, and
“Natural Language Processing With Transformers: Building Language Applications With Hugging Face”.