Introduction to models of language (probabilistic unigram models and Bayesian probability)

Summary

Natural Language Processing (NLP) is a general term for technologies that mechanically process natural language used by humans for applications as diverse as text classification, document summarization, machine translation, sentiment analysis, and question answering.

The technology belongs to the field of artificial intelligence and draws on theories and techniques from machine learning, statistics, linguistics, and computer science. Basic NLP tasks include word segmentation, morphological analysis, part-of-speech (POS) tagging, syntactic analysis, semantic analysis, named entity recognition, and collocation analysis; these tasks form the preprocessing needed to handle text mechanically and apply machine learning algorithms. NLP also uses techniques such as word embedding, POS tagging, named entity recognition, parsing, classification, and regression, and more recently methods based on deep learning have become mainstream.

This article describes various NLP techniques and their applications, drawing on Iwanami Data Science Series Vol. 2, “Statistical Natural Language Processing: Machines for Handling Words”.

In this article, we will discuss probabilistic unigram models and Bayesian estimation as an introduction to language models.

Introduction to models of language (probabilistic unigram models and Bayesian probability)

In many areas of language processing, the language model has emerged as the key to handling language. Some textbooks introduce the term “language model” with a mathematical definition such as “a language is a subset L of the set Σ* of all sequences of characters x ∈ Σ.”

In a more intuitive sense, however, a language model is something every speaker of a language is familiar with and uses unconsciously all the time.

As an example, let’s consider the following document.

「こんちには みさなん おんげき ですか?  わしたは げんき です。」

If you look closely, you can see that the characters have been swapped in several places (the sentence is a scrambled version of “Hello everyone, how are you? I am fine.”), which is obviously wrong, yet the content is still readable. This is thanks to the language model in our minds.

Before considering language models, let us define what language is in the first place. First of all, let us limit the target to “written language.” In linguistics, as represented by Saussure, some theories are built on spoken language and its sounds, but including them would make the problem too complicated, so we restrict the domain for simplicity.

Next, “written language” is made up of characters. In the case of Japanese, these are hiragana, katakana, and kanji. Let us go back further and ask what a “character” is in the first place. For example, consider the surname “Watanabe”: there are dozens of variant forms of the character “nabe” (辺). Should these be treated as the same character or as different characters?

If the purpose of handling these “characters” is to build an address-printing system, they need to be treated as different characters. On the other hand, for a system that translates English into Japanese, it may not matter which variant of “Watanabe” the name is converted into. However, if you want a system that chooses the correct variant of “Watanabe” by looking at the context, you will have to treat them as different characters.

Beyond Japanese, depending on the environment the same character may be represented by multiple character codes in languages such as Persian and Romanian, and while many European languages use diacritics such as umlauts, these are often omitted on Twitter and Facebook. Characters in the real world come with many such problems.

The issue becomes even more complicated when we move up to the definition of a “word.” The dictionary definition of a word is roughly “the smallest unit of language that carries meaning.” For example, if you split “baseball” (野球) into “field” (野) and “ball” (球), the original meaning is lost.

However, should “high school baseball” be treated as two consecutive words, “high school” and “baseball,” or as a single word, “high school baseball”? The two-word view can be defended on the grounds that “high school baseball is simply baseball played in high school, so its meaning is not lost by splitting it.” The one-word view can be defended on the grounds that “there are people who are not interested in baseball but do watch high school baseball, so something is lost by splitting it.” There is no absolute correct answer to what constitutes a word.

These issues become even more complex when we extend the definition to “sentences.”

The term “model” carries the connotation of “a simplification of the problem that does not defeat its purpose”: complex questions whose answers depend on one’s standpoint are ignored or given simplified answers. The moment we start deciding what counts as a character or a word, modeling has already begun.

In this article’s working definition of natural language, a character or word is “anything that a computer can treat as a character or word” (we leave that to Unicode and morphological analyzers), and a sentence is “a sequence of words, and hence also a sequence of characters.”

Here, everything below is a “string of Japanese characters” (a sentence in the above definition).

  • 今日はいい天気です(It’s a beautiful day today.)
  • 今日はいいペンキです(It’s a good paint today.)
  • 無色の緑の概念が眠る(The colorless green concept sleeps.)
  • すは気今天でいい日(???)

Of these, the first sentence is the only one that everyone can agree is natural Japanese. For each of these sentences, a Japanese speaker would have the sense of “very Japanese,” “somewhat Japanese,” “not quite Japanese,” or “not Japanese,” and would be able to make a judgment with just one glance. This is the language model. Natural language processing is the process of transforming these human senses into a form that can be handled by a computer (handled numerically).

What kind of numerical value should a language model output as its degree of “Japanese-ness”? The simplest idea would be to output 1 if the text is valid Japanese and 0 if it is not, but as noted above, “Japanese-ness” comes in degrees and cannot be reduced to a binary value. A natural approach is to model it as a probability between 0 and 1.

Let’s consider the case of speech recognition, where speech is input and text is output. As an example, let’s assume that the following outputs are given as candidates for a given input voice.

  • 今日はいいペンキです(It’s a good paint today.)
  • 今日はいい天気です(It’s a beautiful day today.)

Even if noise and rapid speech make “paint” acoustically more likely, common sense says that “weather” is closer to the correct answer from the standpoint of “Japanese-ness.”

This tendency to “lean toward the Japanese-like interpretation” is a function of the language model inside humans. Modeling this function with probability, we can express it using Bayes’ theorem as follows.

\[
\begin{aligned}
P(\text{It's a beautiful day today}\mid[\text{speech}]) &= \frac{P(\text{It's a beautiful day today})\cdot P([\text{speech}]\mid\text{It's a beautiful day today})}{P([\text{speech}])}\\
P(\text{It's a good paint today}\mid[\text{speech}]) &= \frac{P(\text{It's a good paint today})\cdot P([\text{speech}]\mid\text{It's a good paint today})}{P([\text{speech}])}
\end{aligned}
\]

P([speech]|[sentence]) is the probability that a given sentence is pronounced as the observed speech, and is called the acoustic model. As the equations above show, the acoustic model multiplied by the language model P([sentence]) is a constant multiple of P([sentence]|[speech]). Therefore, by outputting the sentence that maximizes P([sentence])·P([speech]|[sentence]) as the best candidate transcription, the recognition result takes its “Japanese-ness” into account.
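As a rough sketch of this decision rule (the candidate sentences and all probability values below are made-up illustrative numbers, not the output of any real acoustic or language model), choosing the transcription that maximizes P([sentence])·P([speech]|[sentence]) can be written as follows:

```python
# Minimal sketch of noisy-channel decoding for speech recognition.
# All probabilities are hypothetical numbers chosen only for illustration.

# Language model P(sentence): the "Japanese-ness" of each candidate text.
language_model = {
    "It's a beautiful day today": 0.020,    # natural sentence -> higher prior
    "It's a good paint today":    0.00003,  # odd sentence -> much lower prior
}

# Acoustic model P(speech | sentence): how well each text explains the audio.
acoustic_model = {
    "It's a beautiful day today": 0.30,
    "It's a good paint today":    0.45,     # noisy audio slightly favors "paint"
}

# P(speech) is the same for every candidate, so it can be ignored:
# pick the sentence maximizing P(sentence) * P(speech | sentence).
best = max(language_model, key=lambda s: language_model[s] * acoustic_model[s])
print(best)  # -> "It's a beautiful day today"
```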

As described above, a language model assigns a probability expressing “Japanese-ness” to a sentence. Next, let us consider how this probability can actually be obtained from the input sentence.

As mentioned above, how characters and words are modeled differs depending on the purpose. Among such models, we will discuss the n-gram model, which is simple yet powerful.

As an example, let us consider the following problem.

<Problem> Choose the word that best fits the ( ) below from 1 to 3.
Taro reluctantly did ( ).
1. sneeze  2. skip  3. homework

If P(word|reluctantly) denotes the probability of a word following “reluctantly,” then common sense suggests something like the following:

P(sneeze|reluctantly) = 0.1 ← a bit strange as Japanese
P(skip|reluctantly) = 0.2 ← slightly strange
P(homework|reluctantly) = 0.7 ← a reasonable answer

These are probabilities assigned subjectively, and such subjective probabilities appear frequently in discussions of Bayesian probability. An n-gram is a language model that assigns probabilities to sequences of up to n words in this way. The example above is a 2-gram (also called a bigram), because each sequence consists of two words.
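A minimal sketch of how such a 2-gram table could be stored and queried, using the subjective probabilities from the example above (the dictionary layout and function name are only for illustration):

```python
# Bigram (2-gram) probabilities P(next word | previous word).
# The values are the subjective probabilities from the example above.
bigram = {
    "reluctantly": {
        "sneeze":   0.1,
        "skip":     0.2,
        "homework": 0.7,
    }
}

def most_likely_next(word):
    """Return the most probable continuation of `word` under the bigram model."""
    candidates = bigram[word]
    return max(candidates, key=candidates.get)

print(most_likely_next("reluctantly"))  # -> "homework"
```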

The simplest case of the n-gram, n = 1, is called a unigram: each “sequence” is a single word, so each word simply has its own probability of occurrence. This probability can be calculated by maximum likelihood estimation with the following simple formula.

\[p(w)=\displaystyle\frac{[\text{number of occurrences of word } w]}{[\text{total number of words}]} \]
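A minimal sketch of this maximum likelihood estimate, assuming a tiny hypothetical corpus that has already been split into words (in practice a morphological analyzer would produce the word list for Japanese):

```python
from collections import Counter

# A tiny, hypothetical corpus that has already been tokenized into words.
corpus = ["today", "is", "good", "weather", "today", "is", "a", "holiday"]

counts = Counter(corpus)
total = len(corpus)

# Maximum likelihood estimate: p(w) = count(w) / total number of words.
unigram = {word: count / total for word, count in counts.items()}

print(unigram["today"])  # 2 / 8 = 0.25
```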

The unigram model does not take word order into account, so “today is a good day” and “is today a good day” receive the same probability (as the sketch below shows concretely). This may seem unnatural from the standpoint of “Japanese-ness,” but recall that a model is a simplification that does not defeat its purpose. For example, in the problem of classifying a sentence into categories by topic, the presence of the word “home run” already makes the sentence likely to belong to a category such as “sports” or “baseball,” and it does not matter whether the sentence was about “the player who hit the home run…” or “the home run was a priceless hit….” On the other hand, if the problem is to generate a natural sentence, word order cannot be ignored and a unigram model is inappropriate.
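Because the unigram probability of a sentence is simply the product of its word probabilities, reordering the words leaves the score unchanged. A small sketch with hypothetical word probabilities:

```python
import math

# Hypothetical unigram probabilities (e.g. obtained by the MLE above).
unigram = {"today": 0.25, "is": 0.25, "a": 0.125, "good": 0.125, "day": 0.125}

def sentence_prob(words):
    """Unigram probability of a word sequence: product of per-word probabilities."""
    return math.prod(unigram[w] for w in words)

print(sentence_prob(["today", "is", "a", "good", "day"]))
print(sentence_prob(["is", "today", "a", "good", "day"]))  # same value: order is ignored
```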

In the maximum likelihood approach, the calculation is a simple division as in the formula above, so only a single answer is obtained: the probability p_w is estimated as one fixed value. For example, the estimate is the same p_w = 0.03 whether the word w appears 3 times out of 100 total words in the collected text or 30 times out of 1,000 total words.

The Bayesian statistical approach, by contrast, tries to distinguish between the “3 out of 100” and “30 out of 1,000” cases. It considers “the probability P(p_w) that the probability p_w takes a given value,” which can be expressed with Bayes’ theorem as follows.

\[P(p_w|y)=\displaystyle\frac{P(p_w)P(y|p_w)}{P(y)}\]

This formula means that “the distribution P(p_w) is updated to a new distribution P(p_w|y) given the observation y,” so the parameter p_w is estimated not as “a single value” but as “a value with uncertainty (blur).”

Under certain prior assumptions, the Bayesian formula above gives the estimated distributions of p_w for the cases where the word w appears “3 times out of 100” (solid line) and “30 times out of 1,000” (dashed line).

In both cases the peak of the distribution is around 0.03, but the solid line (the 100-word case) shows a wider, more “blurred” distribution. Using such Bayesian probability distributions can improve the accuracy of the unigram model.
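A sketch of this comparison for a single word, using the two-outcome special case of the conjugate prior (a Beta distribution, the one-dimensional marginal of the Dirichlet); the uniform Beta(1, 1) prior and the use of scipy here are assumptions for illustration:

```python
from scipy.stats import beta

# Uniform Beta(1, 1) prior over p_w (an illustrative assumption).
prior_a, prior_b = 1.0, 1.0

# Posterior after observing the word w k times in n total words:
# Beta(prior_a + k, prior_b + n - k).
small = beta(prior_a + 3,  prior_b + 100 - 3)     # "3 out of 100"
large = beta(prior_a + 30, prior_b + 1000 - 30)   # "30 out of 1,000"

# Both posteriors peak near 0.03, but the smaller sample is more spread out ("blurred").
print(small.mean(), small.std())  # ~0.039, std ~0.019
print(large.mean(), large.std())  # ~0.031, std ~0.005
```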

The Bayesian formula above raises the question of what to use as the prior probability distribution (the initial value before updating). The most commonly used choice is a prior that keeps the same functional form before and after the update (a conjugate prior), because it makes the calculation easy.

From this point of view, consider the unigram. A unigram follows a multinomial distribution, the same form as a die: “each of several events occurs with its own probability.”

The conjugate prior used for the unigram is the Dirichlet distribution, described in “Overview of Dirichlet distribution and related algorithms and implementation examples“. It can be pictured as dice of unstable quality: each die rolls its faces with slightly different probabilities because its center of gravity or shape is slightly off, and at the exit of the manufacturing machine the probability of each face is measured precisely.

This picture is shown below as a probability distribution over a three-faced die.
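As a sketch of that picture (the concentration parameters below are illustrative assumptions), each draw from a Dirichlet distribution is itself a probability vector over the three faces, i.e. one slightly different die coming off the production line:

```python
import numpy as np

rng = np.random.default_rng(0)

# Concentration parameters for a three-faced die (illustrative values).
# Larger values -> dice clustered more tightly around the fair (1/3, 1/3, 1/3) die.
alpha = [10.0, 10.0, 10.0]

# Each sample is a probability vector over the three faces,
# i.e. one "individual die" from the unstable manufacturing process.
dice = rng.dirichlet(alpha, size=5)
print(dice)              # 5 dice, each row is a set of face probabilities
print(dice.sum(axis=1))  # each row sums to 1
```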

Such Dirichlet distributions and their variants are often used when making language models Bayesian. Among them are the Dirichlet process and the Pitman-Yor process, which are used in CRFs and other models.

In the next article, we will discuss the topic model in more detail.
