Overview of negative sampling and examples of algorithms and implementations

Overview of negative sampling

Negative sampling is a learning technique used in natural language processing and machine learning, in particular in word embedding models such as Word2Vec, as described in ‘Word2Vec’. It is a method of selectively sampling the examples used for training.

In models such as Word2Vec, the relationship between a word and its context (surrounding words) is learnt in order to capture semantic similarity between words; for example, the Skip-gram model described in ‘Overview, algorithm and implementation examples of Skipgram’ learns to predict the surrounding words from a central word. With a large vocabulary, however, computing the output over every word in the vocabulary is very computationally expensive.

Rather than learning the relationship between each word and every other word, the Skip-gram model selectively samples ‘positive examples’ that actually co-occur and ‘negative examples’ that do not. The method for sampling these ‘negative examples’ is negative sampling.

The flow of this process is as follows.
1. As positive samples, pairs of a central word and its surrounding words are selected.
2. As negative samples, word pairs that do not actually co-occur with the central word are randomly selected.
3. The model is trained so that similarity is high for the positive samples and low for the negative samples.

This mechanism significantly reduces the training time of the model and enables efficient learning of word embeddings.
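
For a single positive pair consisting of a central word \(w\) and a context word \(c\), the objective maximised by Skip-gram with Negative Sampling (Mikolov et al., 2013) can be written as follows, where \(\mathbf{v}_w\) and \(\mathbf{v}_c\) are the input and output embeddings, \(\sigma\) is the sigmoid function, \(k\) is the number of negative samples and \(P_n(w)\) is the noise distribution (typically the unigram distribution raised to the 3/4 power):

\[
\log \sigma(\mathbf{v}_c^{\top}\mathbf{v}_w) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma(-\mathbf{v}_{w_i}^{\top}\mathbf{v}_w)\right]
\]

Maximising this objective pushes the score of the observed pair towards 1 and the scores of the \(k\) sampled negative pairs towards 0.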

The advantages of negative sampling include
– Reduced computational cost: the process is lighter because only a small number of negative examples are used instead of computations on all words.
– Efficient learning: explicitly contrasting positive pairs with sampled non-co-occurring word pairs makes it easier for the model to learn which words are unrelated.

Algorithms related to negative sampling

There are several algorithms related to negative sampling, mainly aimed at efficient learning in natural language processing and neural network optimisation. Typical examples are described below.

1. Word2Vec (in particular Skip-gram with Negative Sampling, SGNS): Word2Vec is an algorithm for learning word embeddings, and it has two approaches: the Skip-gram model and Continuous Bag of Words (CBOW). Negative sampling is most often combined with the Skip-gram model. The main steps are as follows.

– Skip-gram model: predicts the surrounding context words from a central word.
– Negative sampling: reduces computational complexity by training with word pairs that actually occur (positive examples) and word pairs that do not co-occur (negative examples). In particular, it enables efficient learning by ignoring many irrelevant words.

2. Noise Contrastive Estimation (NCE): Noise Contrastive Estimation (NCE) is a method for estimating probability distributions by distinguishing between positive examples and negative examples (noise data), similar to negative sampling. The main difference is that negative sampling focuses on the classification problem, whereas NCE aims to learn the probability distribution itself. The main steps are as follows.

– Adding negative samples (noise) in addition to the positive samples and learning to ensure that the model correctly identifies the positive samples.
– Mainly used in models dealing with probability distributions (e.g. language models and distribution-based models).
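
As a reference, NCE turns density estimation into a binary classification problem: with \(k\) noise samples drawn from a noise distribution \(q(w)\) per data sample, the probability that a word \(w\) observed with context \(c\) came from the data rather than from the noise is

\[
P(D=1 \mid w, c) = \frac{p_\theta(w \mid c)}{p_\theta(w \mid c) + k\, q(w)}
\]

and the model \(p_\theta\) is trained to maximise the log-likelihood of these binary labels. Negative sampling can be seen as a simplification in which this posterior is replaced by \(\sigma(s_\theta(w, c))\), so it no longer estimates a normalised probability distribution.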

For more information on Noise Contrastive Estimation (NCE), see also ‘Noise Contrastive Estimation (NCE) Overview, Algorithm and Implementation Examples’.

3. Contrastive Divergence (CD): Contrastive Divergence (CD) is an algorithm used for training Restricted Boltzmann Machines (RBMs) and Deep Belief Networks (DBNs), and it is an efficient method for such models to learn the probability distribution of the data. The main steps are as follows.

– Starting from positive samples (the training data), the distribution produced by the model is adjusted to approximate the data distribution.
– Comparison with negative samples (samples generated by the model, which do not follow the data distribution) makes learning more efficient.
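
For reference, for an RBM with visible units \(v\) and hidden units \(h\), the CD-1 approximation to the gradient gives the weight update

\[
\Delta W_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}}
\]

where the second term is computed from a ‘negative’ sample obtained by one step of Gibbs sampling starting from the data.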

For more information on Contrastive Divergence (CD), see also ‘Contrastive Divergence (CD) overview, algorithm and implementation examples’.

4. Max-Margin Approaches: the Max-Margin Approach, like Support Vector Machines (SVMs), is a method that maximises the margin between classes. Negative sampling is also related to the Max-Margin Objective, which aims to increase the margin between positive and negative samples during training. The main steps are as follows.

– Assigning higher scores to positive samples and lower scores to negative samples strengthens the boundaries of the classification model.
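
A common form of such a margin-based (ranking) loss for a positive sample \(y^{+}\) and a negative sample \(y^{-}\), with scoring function \(s\) and margin \(\gamma\), is

\[
L = \max\left(0,\; \gamma - s(x, y^{+}) + s(x, y^{-})\right)
\]

which is zero only when the positive sample is scored at least \(\gamma\) higher than the negative sample.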

For more information on the Max-Margin Approach, see also ‘Overview of the Max-Margin Approach and Examples of Algorithms and Implementations’.

5. Negative Log-Likelihood (NLL): Negative Log-Likelihood is often used as the objective function for negative sampling. In particular, in Word2Vec and similar models, the model is trained to score positive samples above negative samples by minimising the negative log-likelihood of this binary classification.

– This adjusts the model so that correct word pairs have high scores and incorrect pairs have low scores.
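
As a minimal illustration (the scores below are dummy values), the per-pair negative log-likelihood under the sigmoid-based negative sampling objective can be computed as follows:

import numpy as np

def sgns_loss(pos_score, neg_scores):
    # Negative log-likelihood for one positive pair and k negative pairs:
    # -log(sigmoid(pos_score)) - sum(log(sigmoid(-neg_score)))
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    neg_scores = np.asarray(neg_scores, dtype=float)
    return -np.log(sigmoid(pos_score)) - np.sum(np.log(sigmoid(-neg_scores)))

# A well-separated positive score and clearly negative scores give a small loss
print(sgns_loss(pos_score=2.5, neg_scores=[-1.0, -0.5, -2.0]))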

For more information on Negative Log-Likelihood, see also ‘Negative Log-Likelihood Overview, Algorithm and Example Implementation’.

Negative sampling is used as part of a wide range of methods for efficiently learning from large amounts of noisy data. It appears not only in word embedding models such as Word2Vec, but also in related algorithms such as Noise Contrastive Estimation and Contrastive Divergence, where it is applied to estimate probability distributions and to optimise learning.

Examples of negative sampling applications

Negative sampling is used for efficient learning, especially on large data sets, and has been applied in various fields. Specific examples of its application are described below.

1. word embedding learning with Word2Vec: the most representative application is the Word2Vec model, which employs Skip-gram with Negative Sampling (SGNS).

– Application areas: natural language processing (NLP)
– Details: negative sampling is used to efficiently learn word embeddings (vector representations) that capture semantic similarity between words. In the Skip-gram model, the relationship between a central word and its context (surrounding words) is learnt, and word pairs that do not actually co-occur are sampled as negative examples, which improves learning efficiency.
– Example: in a sentence, ‘dog’ and ‘run’ are highly related, while ‘dog’ and ‘maths’ are not. Rather than computing over all such unrelated word pairs, only a small number of them are sampled as negative examples, which reduces computational complexity.

2. ranking models in information retrieval systems: in search engines and recommendation systems, models are trained to efficiently rank items and documents using negative sampling. For more information on ranking algorithms, see ‘Overview of Ranking Algorithms and Examples of Implementations’.

– Application areas: information retrieval, recommendation systems.
– Details: content that users click on (or rate) is treated as a positive example, while content that is not clicked on, or items that are not displayed, is treated as a negative sample. The ranking model for search results and recommended items is trained to assign a high score to positive examples and a low score to negative samples, which improves the accuracy of the ranking.
– Example: if a user listens to a lot of ‘classical music’ on a music streaming service, other genres are used as negative samples to facilitate the recommendation of classical music.
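
As an illustration only (the item names and the interaction set below are hypothetical), a minimal sketch of sampling negative items for such a ranking model might look like this:

import random

# Hypothetical interaction log: items the user actually listened to (positive examples)
user_positive_items = {"bach_cello_suites", "beethoven_symphony_5", "chopin_nocturnes"}

# Hypothetical full catalogue of items
all_items = user_positive_items | {"heavy_metal_hits", "edm_party_mix", "country_classics", "jazz_standards"}

def sample_negative_items(positives, catalogue, k):
    # Randomly pick k items the user has not interacted with (negative samples)
    candidates = list(catalogue - positives)
    return random.sample(candidates, min(k, len(candidates)))

print(sample_negative_items(user_positive_items, all_items, k=2))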

3. graph embedding: negative sampling is also applied to machine learning models that deal with graph structures. For example, it is used to learn relationships in social networks and knowledge graphs.

– Application areas: social network analysis, knowledge graphs, link prediction.
– Details: graph embedding algorithms (e.g. Node2Vec and DeepWalk), which are also described in ‘Overview, algorithms and implementation examples of graph embedding’, learn relationships between nodes as vector representations. Node pairs with an actual link are used as positive examples, and node pairs without a link are sampled as negative examples, which makes model training more efficient.
– Example: in Facebook friend recommendation, users with many common friends are treated as positive examples, while users with no contact at all are used as negative samples.
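
As a minimal sketch under the assumption of a toy friendship graph (the node names below are hypothetical), negative node pairs for link prediction could be sampled as follows:

import random

# Hypothetical friendship graph represented as a set of edges
edges = {("alice", "bob"), ("bob", "carol"), ("alice", "dave")}
nodes = sorted({n for e in edges for n in e})

def sample_negative_edges(edges, nodes, k):
    # Sample k node pairs that are NOT connected (negative examples for link prediction)
    negatives = set()
    while len(negatives) < k:
        u, v = random.sample(nodes, 2)
        pair = tuple(sorted((u, v)))
        if pair not in edges and pair[::-1] not in edges:
            negatives.add(pair)
    return list(negatives)

print(sample_negative_edges(edges, nodes, k=2))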

4. knowledge embedding in relational databases and knowledge bases: negative sampling is also used to model the relationships between entities in relational databases, as described in ‘Database algorithms (2) Relational databases’, and in knowledge bases, as described in ‘Rule Bases, Knowledge Bases, Expert Systems and Relational Data’.

– Application areas: knowledge-based reasoning, relational databases.
– Details: entities in the knowledge base and their relationships are embedded as vectors, which are used to predict unknown relationships. Entity pairs with a real relationship are learnt as positive examples, and randomly chosen unrelated entity pairs are used as negative samples.
– Example: a knowledge base such as Wikidata, where correct relations such as ‘France’ and ‘Paris’ are learnt as positive examples and irrelevant pairs such as ‘France’ and ‘New York’ are used as negative samples.
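
As a sketch of one common way to generate such negatives in knowledge-graph embedding methods (e.g. TransE-style corruption; the triples below are a toy example), the head or tail of a correct triple is replaced by a random entity:

import random

# Toy knowledge-base triples (head, relation, tail)
triples = [("France", "capital", "Paris"), ("Japan", "capital", "Tokyo")]
entities = ["France", "Paris", "Japan", "Tokyo", "New York"]

def corrupt_triple(triple, entities):
    # Create a negative triple by replacing the head or tail with a random other entity
    head, rel, tail = triple
    if random.random() < 0.5:
        head = random.choice([e for e in entities if e != head])
    else:
        tail = random.choice([e for e in entities if e != tail])
    return (head, rel, tail)

print(corrupt_triple(triples[0], entities))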

5. multi-class classification in deep learning: negative sampling is also used for efficient training in the multi-class classification task of deep learning models, as described in ‘Overview of multi-class object detection models, algorithms and implementation examples’.

– Application areas: image recognition, natural language processing, speech recognition.
– Details: in multi-class classification tasks with many classes, computing scores for the target class against all other classes results in very high computational costs. By applying negative sampling, only the target class (positive example) and a small number of other classes (negative examples) are used during training.
– Example: in the task of classifying ‘dog’ in image recognition, other irrelevant classes such as ‘cat’ and ‘bird’ are efficiently learnt as negative samples.
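
A minimal sketch of this idea (sometimes implemented as sampled softmax; the class counts below are arbitrary) is to draw a few negative class indices per training example:

import numpy as np

def sample_negative_classes(target_class, num_classes, k):
    # Sample k class indices other than the target class (negative classes),
    # so the loss is computed over 1 + k classes instead of all num_classes
    candidates = np.setdiff1d(np.arange(num_classes), [target_class])
    return np.random.choice(candidates, size=k, replace=False)

# Example: target class 3 out of 1000 classes, with 5 negative classes
print(sample_negative_classes(target_class=3, num_classes=1000, k=5))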

Negative sampling is an approach used in many areas, ranging from word embedding and recommendation systems to graph analysis, knowledge-based reasoning and deep learning. Its aim is to train models efficiently by eliminating unnecessary computations and reducing computational costs, especially for tasks with large data sets and complex relations.

Examples of negative sampling implementations

Implementing negative sampling is a relatively simple process, especially for word embedding models in natural language processing (e.g. Word2Vec). Below is a basic example of implementing negative sampling using Python and NumPy.

Example implementation: negative sampling for Word2Vec

This example implements negative sampling using the Skip-gram model. The central word and its surrounding words are learnt as positive examples and irrelevant words are randomly sampled as negative examples.

Assumptions:

  • Central word w_t
  • Context word w_{t+j}
  • Number of vocabulary words V
  • Number of negative samples k (e.g. k = 5)

1. importing the required library:

import numpy as np
import random
from collections import Counter

2. data preparation: first, words are collected from the text data and a vocabulary is created. A small sample corpus is used here.

# Sample data (small corpus)
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the fox is quick and the dog is lazy"
]

# Create a list of words.
words = " ".join(corpus).split()

# Assign an index to a word.
word_to_index = {word: idx for idx, word in enumerate(set(words))}
index_to_word = {idx: word for word, idx in word_to_index.items()}
vocab_size = len(word_to_index)

3. creating word frequency distributions: based on word frequencies, probability distributions for negative sampling are created so that words with high frequencies are preferentially selected.

# Counting word frequencies.
word_counts = Counter(words)

# Probability distribution of words (for negative sampling)
total_count = sum(word_counts.values())
word_prob = {word: count / total_count for word, count in word_counts.items()}

# Create weighted arrays according to word probability.
neg_sampling_probs = np.array([word_prob[index_to_word[i]] for i in range(vocab_size)])
neg_sampling_probs = neg_sampling_probs ** (3/4)  # Apply the 3/4 power.
neg_sampling_probs /= np.sum(neg_sampling_probs)  # Normalisation to probability

4. obtaining a negative sample: next, implement a function to obtain a negative sample for a specific central word.

def get_negative_samples(target_word_idx, num_samples, vocab_size, neg_sampling_probs):
    negative_samples = []
    while len(negative_samples) < num_samples:
        sampled_idx = np.random.choice(vocab_size, p=neg_sampling_probs)
        if sampled_idx != target_word_idx:  # Avoid words that are the same as the central word
            negative_samples.append(sampled_idx)
    return negative_samples

5. training the Skip-gram with Negative Sampling model: the next step is to train the Skip-gram model using positive samples (pairs of central and surrounding words) and negative samples. Here, as a simple example, the weights are not updated and only the flow of obtaining the samples is shown; a sketch of an actual update step is given after the description below.

def train_skipgram_with_negative_sampling(center_word_idx, context_word_idx, num_neg_samples, vocab_size, neg_sampling_probs):
    # Positive sample (pairs of central and peripheral words)
    positive_sample = (center_word_idx, context_word_idx)
    
    # Obtaining negative samples.
    negative_samples = get_negative_samples(center_word_idx, num_neg_samples, vocab_size, neg_sampling_probs)
    
    # Show positive and negative samples.
    print("Positive sample: ", positive_sample)
    print("Negative samples: ", negative_samples)

# Try learning with the central word ‘fox’ and the peripheral word ‘quick’
center_word_idx = word_to_index["fox"]
context_word_idx = word_to_index["quick"]
train_skipgram_with_negative_sampling(center_word_idx, context_word_idx, num_neg_samples=5, vocab_size=vocab_size, neg_sampling_probs=neg_sampling_probs)

6. execution results (the actual indices depend on the vocabulary ordering):

Positive sample:  (5, 6)  # Index 5 is ‘fox’, 6 is ‘quick’.
Negative samples:  [3, 8, 2, 7, 1]  # Sampled negative samples.

Description:

  1. Data preparation: a list of words and an index are prepared and a probability distribution for negative sampling is created based on word frequencies.
  2. Obtaining negative samples: negative samples are randomly sampled with probability based on word frequency, and words identical to the central word are eliminated.
  3. Skip-gram with Negative Sampling: negative samples are used in addition to positive samples (pairs of central and peripheral words) in the skip-gram model for efficient training.
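
For completeness, the following is a minimal sketch of how the embedding weights could actually be updated for one positive pair and its negative samples. It reuses vocab_size, get_negative_samples, neg_sampling_probs, center_word_idx and context_word_idx defined above; the embedding matrices, dimensionality and learning rate are hypothetical additions.

embedding_dim = 10
learning_rate = 0.05

# Hypothetical input (central word) and output (context word) embedding matrices
W_in = np.random.rand(vocab_size, embedding_dim) * 0.01
W_out = np.random.rand(vocab_size, embedding_dim) * 0.01

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(center_idx, context_idx, num_neg_samples):
    # One SGD step of Skip-gram with Negative Sampling for a single positive pair
    v_center = W_in[center_idx]
    neg_indices = get_negative_samples(center_idx, num_neg_samples, vocab_size, neg_sampling_probs)

    grad_center = np.zeros(embedding_dim)
    # Label 1 for the positive (context) word, label 0 for each negative sample
    for idx, label in [(context_idx, 1.0)] + [(n, 0.0) for n in neg_indices]:
        score = sigmoid(np.dot(v_center, W_out[idx]))
        error = score - label                  # gradient of the log-loss w.r.t. the score
        grad_center += error * W_out[idx]      # accumulate gradient for the central word
        W_out[idx] -= learning_rate * error * v_center
    W_in[center_idx] -= learning_rate * grad_center

sgns_update(center_word_idx, context_word_idx, num_neg_samples=5)
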
Negative sampling issues and measures to address them

While negative sampling is useful for efficient learning on large data sets, there are some challenges. This section discusses measures to address them.

Challenge 1: Impact of negative sample selection on model performance
In negative sampling, it is important to select appropriate ‘negative samples’ during training: if negative samples are chosen purely at random, many of them are too easy, the signal the model learns from them is weak, and learning becomes inefficient.

Solution:
– Frequency-based sampling: frequently occurring words and entities are often important in natural language processing and knowledge graphs. Models such as Word2Vec sample negatives with probability proportional to the word frequency distribution raised to the 3/4 power.

– Hard negative sampling: a method that prioritises particularly difficult negative samples (those that the model is likely to mistake for positive examples). This strengthens the training of the model; a minimal sketch is shown below.
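
As a sketch under the assumption of randomly generated toy embeddings (pick_hard_negatives and the variables below are hypothetical names), hard negatives can be chosen as the candidates the current model scores highest:

import numpy as np

def pick_hard_negatives(center_vec, candidate_indices, output_embeddings, k):
    # Keep the k candidate negatives with the highest model score,
    # i.e. the ones the model is most likely to confuse with positives
    scores = output_embeddings[candidate_indices] @ center_vec
    hardest = np.argsort(-scores)[:k]
    return [candidate_indices[i] for i in hardest]

# Toy example: 8 candidate words with 4-dimensional embeddings
rng = np.random.default_rng(0)
output_embeddings = rng.normal(size=(8, 4))
center_vec = rng.normal(size=4)
print(pick_hard_negatives(center_vec, list(range(8)), output_embeddings, k=3))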

Challenge 2: Computational costs on large data sets
Negative sampling increases the computational load when the data is very large, as it deals with additional negative samples in addition to positive examples. In particular, when dealing with large lexicons or entity sets, it becomes difficult to select appropriate samples from the whole set.

Solution:
– Adjusting the number of samples: the computational cost can be controlled by adjusting the number of negative samples. Usually, 5 to 20 negative samples are said to give a good balance between model performance and efficiency, but it is important to determine the appropriate number depending on the task.

– Partial sampling: negative sampling within a subset of a data set or domain, rather than sampling from the entire vocabulary or data, can reduce computational cost. For example, local sampling, where only relevant words within a context are targeted, can be effective.

Challenge 3: Balancing the quality of negative samples
If negative samples are selected completely at random, they may not contain the ‘meaningful’ negative examples that the model should learn from. Conversely, if only very similar samples are used as negative samples, the model may overfit.

Solution:
– Dynamic sampling: a technique in which the selection of negative samples is varied as training progresses, sampling randomly at first and shifting to more difficult samples later, to ensure balanced learning.

– Adaptive negative sampling: a method of preferentially sampling negative samples from which the model is prone to error during the course of learning, thereby enabling learning to focus on areas where the model needs to improve.

Challenge 4: Capturing long-term dependencies and structural information
Negative sampling has the disadvantage that it focuses on pairs of words or entities, making it difficult to capture long-term dependencies and structural patterns.

Solution:
– Structural sampling: instead of treating examples purely as pairs, sampling can take into account relevance within a sentence or the overall graph structure, using context and node-distance information. For example, in graph embedding, sampling with a focus on a node's neighbourhood or on hub nodes can lead to better embedding representations.

– Improved models: using models that can learn long-term dependencies, such as RNNs and Transformers, enables learning that takes the whole context into account without relying on sampling.

Challenge 5: Bias issues
When relying on negative sampling, sampling bias can have a negative impact on learning: if certain words or entities are over-selected as negatives, this bias is reflected in the results.

Solution:
– Adjust sampling: deliberately adjusting the sampling probabilities for specific classes or words can reduce bias; frequency-based sampling and methods that use different probability distributions are effective here.

– Regularisation: applying regularisation to the model can prevent excessive bias and overfitting; L2 regularisation and dropout are particularly effective approaches.

Negative sampling has its challenges, such as the difficulty of selecting good negative samples, computational cost, balancing sample quality, ignoring long-term dependencies, and bias. These can be mitigated by adjusting the sampling strategy, implementing it in a computationally efficient manner, and using techniques such as adaptive sampling and hard negative sampling.

Reference information and reference books

Reference information and reference books are listed below.

1. reference information (online resources):

Word2Vec Explained

PyTorch’s Negative Sampling

TensorFlow – Word2Vec with Negative Sampling

Gensim’s Word2Vec Documentation

2. reference books:

– “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

– “Neural Network Methods for Natural Language Processing” by Yoav Goldberg

– “Speech and Language Processing” by Daniel Jurafsky and James H. Martin

– “Advances in Neural Information Processing Systems” (NIPS) Proceedings

3. related papers:

– “Efficient Estimation of Word Representations in Vector Space” by Tomas Mikolov et al.

– “Distributed Representations of Words and Phrases and their Compositionality” by Tomas Mikolov et al.
