Skipgram Overview, Algorithm and Implementation Examples

Overview of Skipgram

Skip-gram is one of the most widely used methods for learning distributed representations of words (word embeddings) in the field of Natural Language Processing (NLP). It captures the meaning of words as vector representations and makes it possible to quantify semantic similarity and relatedness. It is also used in graph embedding methods such as DeepWalk, which is described in “Overview of DeepWalk, Algorithms, and Examples of Implementations”.

Skip-gram is one of the two model architectures of Word2Vec (the other being CBOW), which is also described in “Word2Vec”. The basic idea of Skip-gram is to take each word in a sentence as input and learn to predict its surrounding words as output. For example, consider the sentence “I love deep learning”. If “love” is the central word (target), the surrounding words (context) such as “I”, “deep”, and “learning” are predicted, as illustrated in the sketch below.
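
As a minimal illustration (plain Python, not tied to any library), the following sketch enumerates the (target, context) pairs produced for this sentence with a window size of 2.

# Enumerate (target, context) pairs for a single sentence with window size 2
sentence = "I love deep learning".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs)
# For the target "love" this yields ("love", "I"), ("love", "deep"), ("love", "learning")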

Skip-gram consists of a neural network with the following three layers.

1. Input Layer: A one-hot vector of words is input. Each word is represented as a one-hot vector whose length corresponds to the size of the dictionary.

2. Hidden Layer: The one-hot vectors obtained from the input layer are transformed by matrix multiplication with a weight matrix. This weight matrix transforms each word into a dense vector representation.

3. Output Layer: The output of the hidden layer is passed through a softmax function and interpreted as a probability distribution over the vocabulary. The model is trained to maximize the probability of the context words that actually appear.

By training such a model, words are converted into dense vector representations (word embeddings) that preserve semantic similarities and relationships.
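
Formally, for an input (target) word \( w_I \) and a context word \( w_O \), the softmax in the output layer of the basic Skip-gram model of Mikolov et al. (2013) is

\[
p(w_O \mid w_I) = \frac{\exp\left( {v'_{w_O}}^{\top} v_{w_I} \right)}{\sum_{w=1}^{W} \exp\left( {v'_{w}}^{\top} v_{w_I} \right)}
\]

where \( v_w \) and \( v'_w \) are the “input” and “output” vector representations of word \( w \) and \( W \) is the vocabulary size. Because the denominator sums over the entire vocabulary, approximations such as hierarchical softmax or negative sampling are used in practice.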

Advantages of skip-grams include the following

  • Efficient training: Skip-gram can be trained efficiently even on large corpora, especially when combined with approximations such as hierarchical softmax or negative sampling, while converting one-hot word representations into dense vectors that retain a great deal of information.
  • Capturing semantic relations between words: The learned word vectors reflect semantic similarities and relations. For example, the difference between the vectors of “king” and “queen” is similar to the difference between the vectors of “man” and “woman”.
  • Reuse of word vector representations: Vector representations of words learned with Skip-gram can be reused as pre-trained embeddings in other natural language processing tasks (document classification, semantic analysis, machine translation, etc.).

References include:
– Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111–3119).

Algorithm related to Skipgram

The following is an overview of the Skip-gram algorithm.

1. Data preparation: A corpus (a large amount of text data) is prepared, a vocabulary (a set of words) is constructed, and a unique ID is assigned to each word.

2. Input and output: In a Skip-gram, each target word is given as input and the output is a prediction of the surrounding words (context).

3. Neural Network Structure: Skip-gram consists of a neural network with the following three layers.

  • Input Layer: Target words are represented as one-hot vectors.
  • Hidden Layer: The one-hot vectors in the Input Layer are converted into a dense vector representation by a weight matrix.
  • Output Layer: The output of the hidden layer is interpreted as a probability distribution over the vocabulary through a softmax function.

4. Learning: The network is trained on a large number of (input word, context word) pairs, and the goal of training is to maximize the probability of the context words; in other words, the model learns how well the surrounding words can be predicted from a given input word (the objective is shown after this list).

5. Learned vector representation: As training proceeds, the weight matrix of the hidden layer transforms each word into a dense vector representation, and these word vectors reflect semantic similarity and relatedness.
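
Concretely, given a training corpus of words \( w_1, w_2, \ldots, w_T \) and a context window of size \( c \), the Skip-gram objective of Mikolov et al. (2013) is to maximize the average log probability

\[
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \neq 0} \log p(w_{t+j} \mid w_t)
\]

where \( p(w_{t+j} \mid w_t) \) is the softmax probability shown in the overview section above.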

Skipgram Application Examples

The following are examples of Skip-gram applications.

1. Capturing semantic similarity of words: Skip-gram learns distributed representations of words (word embeddings) and can therefore capture semantic similarity between words; for example, the relationships between “king” and “queen” and between “man” and “woman” are reflected in the vector space.

2. Document classification: Document classification uses the distributed representations of words learned by Skip-gram to represent the semantic features of documents, which can then be used to train machine learning models (e.g., logistic regression, support vector machines, neural networks) to classify documents.

3. Semantic analysis: Semantic analysis uses the distributed representations of words learned by Skip-gram to represent the meaning of sentences and phrases, and this representation can be used to understand the similarity of word meanings and the semantic relationships between sentences. Specific examples include clustering of words, detection of similar words, and calculation of sentence similarity.

4. Machine translation: Machine translation uses the distributed representations of words learned by Skip-gram to model semantic correspondences between sentences, which makes it possible to build translation models between languages and obtain better translation results.

5. Semantic Question-Answering: Semantic question-answering systems use distributed representations of words learned by skip-grams to evaluate semantic similarities between questions and answers, thereby making it possible to construct more natural question-answering interfaces.

6. Semantic inference: Using the distributed representations of words learned by Skip-gram, inferences can be made based on semantic relations between words, e.g., “king” – “man” + “woman” ≈ “queen” (see the sketch after this list).
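
As a concrete illustration of the similarity and analogy use cases above, the following sketch trains a Skip-gram model with the gensim library (assuming gensim 4.x; the toy corpus is only a placeholder, and meaningful analogies such as the king/queen example require a large corpus).

from gensim.models import Word2Vec

# Toy corpus: in practice a large corpus is required for meaningful results
sentences = [
    ["i", "love", "deep", "learning"],
    ["i", "love", "nlp"],
    ["i", "enjoy", "studying", "algorithms"],
]

# sg=1 selects the Skip-gram architecture (sg=0 would be CBOW)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# Words most similar to "learning" in the learned vector space
print(model.wv.most_similar("learning", topn=3))

# Analogy-style query of the form king - man + woman (only meaningful with a large corpus)
# print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))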

Example implementation of Skipgram

The following is an example of Skip-gram implementation using Python and TensorFlow.

Importing the library:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import text
from tensorflow.keras.preprocessing.sequence import skipgrams

Data preparation: Prepare the text data used for training.

corpus = ["I love deep learning", "I love NLP", "I enjoy studying algorithms"]
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(corpus)
vocab_size = len(tokenizer.word_index) + 1

Skip-gram data generation: Generate data to be used for training a skip-gram model.

sequences = tokenizer.texts_to_sequences(corpus)
# skipgrams() generates (target, context) pairs with labels: 1 for pairs observed
# within the window, 0 for randomly drawn negative samples
skip_grams = [skipgrams(sequence, vocabulary_size=vocab_size, window_size=2) for sequence in sequences]
pairs, labels = skip_grams[0][0], skip_grams[0][1]
word_target, word_context = zip(*pairs)
word_target = np.array(word_target, dtype="int32")
word_context = np.array(word_context, dtype="int32")
labels = np.array(labels, dtype="int32")

Model construction: Build a Skip-gram model. A simple network consisting of a shared embedding layer and a sigmoid output is used here.

embed_size = 50

input_target = tf.keras.layers.Input((1,))
input_context = tf.keras.layers.Input((1,))

# Shared embedding layer: maps word IDs to dense vectors of size embed_size
embedding = tf.keras.layers.Embedding(vocab_size, embed_size, input_length=1, name='embedding')
target = embedding(input_target)
target = tf.keras.layers.Reshape((embed_size, 1))(target)
context = embedding(input_context)
context = tf.keras.layers.Reshape((embed_size, 1))(context)

# Dot product of target and context vectors measures their compatibility
dot_product = tf.keras.layers.Dot(axes=1)([target, context])
dot_product = tf.keras.layers.Reshape((1,))(dot_product)

# Sigmoid output: probability that the pair is a true (target, context) pair
output = tf.keras.layers.Dense(1, activation='sigmoid')(dot_product)

model = tf.keras.models.Model(inputs=[input_target, input_context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam')

Model training: train a Skip-gram model.

model.fit([word_target, word_context], labels, epochs=10, batch_size=16)

In this example, a simple Skip-gram model is built and trained on a small amount of text data. Real-world applications require a larger corpus and appropriate hyperparameter tuning, and the learned embedding vectors can then be reused for a variety of NLP tasks, as sketched below.
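
After training, the learned embeddings can be read out of the model and used directly, for example to look up the vector of a word or to compute the cosine similarity between two words. The following sketch assumes the model, tokenizer, and the layer name 'embedding' defined above.

# Extract the learned embedding matrix (shape: vocab_size x embed_size)
embeddings = model.get_layer('embedding').get_weights()[0]

def word_vector(word):
    # tokenizer.word_index maps (lowercased) words to integer IDs starting at 1
    return embeddings[tokenizer.word_index[word]]

def cosine_similarity(w1, w2):
    v1, v2 = word_vector(w1), word_vector(w2)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(cosine_similarity("deep", "learning"))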

Skipgram’s challenges and how to address them

Below are the main challenges of Skip-gram and measures to address them.

1. Handling low-frequency words:

Challenge: Low-frequency words do not appear sufficiently in the training data, and it is difficult to accurately capture their semantic relationships.

Solution:
Subsampling of frequent words: Very frequent words are randomly discarded during training, which speeds up learning and improves the representations of rarer words. Following Mikolov et al. (2013), a word with relative frequency \( f(w) \) is discarded with probability \( P(w) = 1 - \sqrt{\frac{t}{f(w)}} \), where \( t \) is a threshold hyperparameter (a small numerical sketch is shown at the end of this item).

Negative sampling: Instead of computing the full softmax over the vocabulary, the loss is computed only for the observed context word and a small number of randomly selected negative samples (words that do not appear in the context). This makes training efficient and also improves the embeddings of low-frequency words. See “Overview of Negative Sampling and Examples of Algorithms and Implementations” for details.
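
The following is a minimal numerical sketch of the subsampling rule above, using hypothetical word counts; it only illustrates how the discard probability grows with word frequency.

import numpy as np

# Hypothetical word counts (not from a real corpus)
counts = {"the": 1_000_000, "learning": 5_000, "skipgram": 50}
total = sum(counts.values())
t = 1e-5  # subsampling threshold

for word, count in counts.items():
    f = count / total                         # relative frequency f(w)
    p_discard = max(0.0, 1 - np.sqrt(t / f))  # P(w) = 1 - sqrt(t / f(w))
    print(f"{word}: f(w)={f:.6f}, discard probability={p_discard:.3f}")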

2. Handling compound words and proper nouns:

Challenge: Skip-grams learn word-level representations, making it difficult to accurately handle unique tokens such as compound words and proper nouns.

Solution:
Subword-level segmentation: Splitting words into smaller subwords makes it easier to capture the meaning of compound words and proper nouns. Examples include Byte Pair Encoding (BPE) and the WordPiece algorithm described in “Overview of WordPiece and examples of algorithms and implementations”. A minimal BPE sketch is shown below.
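
The following is a minimal sketch of the BPE merge procedure in the style of the commonly cited reference algorithm (Sennrich et al.), using a tiny hypothetical vocabulary; production systems use optimized implementations.

import re
import collections

def get_pair_stats(vocab):
    # Count frequencies of adjacent symbol pairs over the vocabulary
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # Merge every occurrence of the chosen pair into a single symbol
    merged = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, freq in vocab.items():
        merged[pattern.sub(''.join(pair), word)] = freq
    return merged

# Words are represented as space-separated characters with an end-of-word marker
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):
    pairs = get_pair_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)

print(vocab)  # vocabulary after the learned merge operations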

3. Polysemy and multiple word senses:

Challenge: Skip-gram assigns a single vector to each word form, making it difficult to accurately represent polysemy (one word having multiple meanings).

Solution:
Taking context into account: The meaning of a word can be captured from its context; for example, the context window can be adjusted so that a word's sense is identified by its surrounding words.

Representing multiple senses: There are methods that represent the different senses of a word as separate vectors; for example, approaches such as sense2vec could be used.

4. Large datasets and computational complexity:

Challenge: Skip-grams are computationally expensive when trained on large datasets, and thus take a long time to train.

Solution: 
Mini-batch learning: Split the training data into small batches for efficient use of memory and parallelization of computation. See “Overview of mini-batch learning and examples of algorithms and implementations” for details.

Distributed learning: Computation time can be reduced by parallelizing training across multiple compute nodes or devices, e.g., using TensorFlow's distributed training support (a sketch is shown below).
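
As a sketch of the latter approach (assuming TensorFlow's tf.distribute API and a hypothetical build_skipgram_model() helper that constructs the same Keras model as in the implementation section above), training can be wrapped in a distribution strategy as follows.

import tensorflow as tf

# Mirror the model across all available GPUs on one machine;
# other strategies (e.g. MultiWorkerMirroredStrategy) cover multi-node setups.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # build_skipgram_model() is a hypothetical helper that builds and returns
    # the Skip-gram model shown in the implementation example above
    model = build_skipgram_model(vocab_size, embed_size)
    model.compile(loss='binary_crossentropy', optimizer='adam')

# fit() automatically distributes each batch across the replicas
model.fit([word_target, word_context], labels, epochs=10, batch_size=64)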

Reference Information and Reference Books

For more information on natural language processing in general, see “Natural Language Processing Technology” and “Overview of Natural Language Processing and Examples of Various Implementations”.

Reference books include “Natural language processing (NLP): Unleashing the Power of Human Communication through Machine Intelligence“.

Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems

Natural Language Processing With Transformers: Building Language Applications With Hugging Face

Natural Language Processing with Python

Deep Learning

Speech and Language Processing

Word2Vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow

Neural Network Methods for Natural Language Processing
