Overview of Negative Log-Likelihood and examples of algorithms and implementations

Negative Log-Likelihood overview

Negative Log-Likelihood (NLL) is a loss function used in statistics and machine learning to optimise the parameters of a model, and it is used especially often in models based on probability distributions (such as classification models).

The NLL measures a model’s performance by the probability the model assigns to the observed data, and its purpose is to optimise the parameters so that the model explains the observed data with high probability.

The likelihood is the probability that data \( x \) occurs under a given parameter \( \theta \). Writing the probability that the data \( x \) is observed under the parameter as \( P(x|\theta) \), the likelihood is expressed as follows.

\[
L(\theta | x) = P(x | \theta)
\]

The logarithm of this likelihood function is the log-likelihood. Because the likelihood of a data set is a product of probabilities, which quickly becomes unwieldy to handle, the logarithm is taken to turn the product into a sum and simplify the calculation. It is expressed as follows.

\[
\log L(\theta | x) = \log P(x | \theta)
\]

The negative log-likelihood (Negative Log-Likelihood, NLL) is the log-likelihood with its sign reversed; since optimisation problems are usually posed as minimisation rather than maximisation, it is expressed by the following formula.

\[
NLL(\theta | x) = - \log P(x | \theta)
\]

As a concrete example, consider a two-class classification problem. If the probability the model assigns to the correct class for each sample is denoted \( P(y | x) \), then the NLL is small when the model predicts the correct class with high probability and large when it does so with low probability. Using NLL in this way allows the model to be trained to make more accurate predictions.
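
As a quick numerical illustration (a minimal sketch, not part of the original text), the per-sample loss is simply minus the log of the probability assigned to the correct class:

import math

# NLL for a single sample is -log of the probability assigned to the correct class
p_confident = 0.9   # the model is fairly sure of the correct class
p_unsure = 0.1      # the model assigns little probability to the correct class

print(-math.log(p_confident))  # ~0.105 -> small loss
print(-math.log(p_unsure))     # ~2.303 -> large loss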

In the case of logistic regression, writing the predicted probability that the label \( y_i \) of sample \( x_i \) is 1 as \( P(y_i = 1 | x_i) = \hat{y}_i \), the objective is to minimise the following NLL.

\[
NLL = - \sum_{i=1}^N \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right)
\]

Where:
– \( y_i \) is the actual label
– \( \hat{y}_i \) is the probability predicted by the model

This equation is also known as cross-entropy loss.
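
As a minimal sketch of this formula (the labels and predicted probabilities below are made up), it can be evaluated directly with NumPy:

import numpy as np

def binary_nll(y, y_hat, eps=1e-12):
    # Negative log-likelihood (binary cross-entropy) for labels y and predicted probabilities y_hat
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.4])
print(binary_nll(y, y_hat))  # smaller when y_hat is close to y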

Some of the features of NLL include the following
– Intuitive understanding: the NLL is a measure of how ‘sure’ a model is in predicting actual data, with smaller values indicating better model predictions.
– Application to stochastic models: often used in probability-based models, such as classification problems and generative models.
– Optimisation methods: models can be trained by minimising the NLL, for example by updating the model parameters with gradient descent methods.

Algorithms related to Negative Log-Likelihood

Algorithms related to NLL minimise the NLL of probabilistic models and statistical learning models, and are particularly common in classification problems and problems dealing with probability distributions. This section describes typical algorithms and related methods.

1. Maximum Likelihood Estimation (MLE): maximum likelihood estimation finds the parameters under which the observed data are most probable; specifically, the parameters of the model are estimated by maximising the likelihood function of the observed data. Since the NLL is the negative logarithm of the likelihood, minimising the NLL is equivalent to maximising the likelihood, so an algorithm that minimises the NLL can be used to perform maximum likelihood estimation. MLE is widely used for general probability models (e.g. the Gaussian distribution, logistic regression, etc.).
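
As a minimal sketch of this equivalence (a toy example, not from the original article): for a 1-D Gaussian with known variance, the mean that minimises the NLL is the sample mean.

import numpy as np

# Data drawn from a normal distribution with unknown mean (variance fixed at 1 for simplicity)
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)

def gaussian_nll(mu, x):
    # NLL of N(mu, 1) up to additive constants, which do not affect the minimiser
    return 0.5 * np.sum((x - mu) ** 2)

candidates = np.linspace(0.0, 4.0, 401)
nll_values = [gaussian_nll(mu, data) for mu in candidates]
best_mu = candidates[int(np.argmin(nll_values))]

print(best_mu, data.mean())  # the NLL-minimising mean is (approximately) the sample mean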

2. Gradient descent: gradient descent is the most common algorithm for minimising the NLL. The model parameters are updated step by step according to the gradient (slope) of the NLL function so that the NLL decreases.

– Stochastic Gradient Descent (SGD): instead of calculating using the entire data set, a randomly selected subset (mini-batch) of data is used to calculate the gradient and update the parameters. This improves computational efficiency and makes the algorithm suitable for large data sets.

– Variants of gradient descent: several variants of gradient descent exist; optimisers such as Adam, RMSProp and Adagrad automatically adapt the step size of each parameter update.

3. Neural networks and cross-entropy loss: classification tasks in neural networks use a cross-entropy loss function that is very closely related to the NLL; for multi-class classification with one-hot targets, the cross-entropy loss is equivalent to the NLL of the correct class. In neural network training, gradient descent is used to minimise this cross-entropy loss (i.e. the NLL).
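
In PyTorch this relationship is explicit: nn.CrossEntropyLoss applied to raw logits gives the same value as nn.NLLLoss applied to log-softmax outputs, as the minimal check below shows (the tensors are made up).

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 3)           # 4 samples, 3 classes (raw scores)
targets = torch.tensor([0, 2, 1, 2])

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

print(ce.item(), nll.item())  # identical up to floating-point error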

4. Expectation-maximisation (EM algorithm): the EM algorithm is used to optimise the NLL in the presence of unobserved (latent) variables. It is often used for parameter estimation in hidden Markov models (HMMs) and Gaussian mixture models (GMMs). The EM algorithm alternates between the following two steps until the NLL converges (a short GMM sketch follows the list).

– E-step: compute the expected values of the unobserved variables using the current parameters.
– M-step: update the parameters to minimise the NLL using the expected values calculated in the E-step.
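
A minimal sketch of EM-driven NLL minimisation using scikit-learn’s GaussianMixture, whose fit method runs the EM algorithm internally (the data below are synthetic):

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data drawn from two clusters
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(3.0, 1.0, 500)]).reshape(-1, 1)

# fit() runs the EM algorithm until the (log-)likelihood converges
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

avg_log_likelihood = gmm.score(data)  # average log-likelihood per sample
print(-avg_log_likelihood)            # average NLL: lower means a better fit
print(gmm.means_.ravel())             # estimated component means (close to -2 and 3)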

5. Logistic regression: logistic regression is a model widely used for binary classification problems, and it finds the optimal parameters by minimising the NLL. Logistic regression predicts the probability that the class is 1 for each data point and calculates the NLL based on that probability. Gradient descent methods are commonly used to train logistic regression.

6. Softmax regression: softmax regression is a multi-class extension of logistic regression that outputs a probability for each class. Again, the aim is to minimise the NLL: probabilities are calculated for each class and the parameters are adjusted to maximise the probability of the correct class. Cross-entropy loss, which corresponds to the NLL, is likewise used when a softmax function is applied in the final layer of a neural network.

7. Policy gradient methods in reinforcement learning: there are also reinforcement learning algorithms where the NLL is useful. For example, in the policy gradient method described in “Overview of the policy gradient method and examples of algorithms and implementations“, the policy (a probability distribution for selecting actions) is learnt by increasing the probability of actions that lead to high returns, which in practice is implemented by minimising a return-weighted negative log-likelihood of the selected actions.
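
A minimal sketch of this idea (illustrative only, assuming a categorical policy over a small action space and pre-computed returns; the network size and data are made up):

import torch
import torch.nn as nn

# Toy policy network over 4 actions for 2-dimensional states
policy = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

states = torch.randn(8, 2)                  # a small batch of states
dist = torch.distributions.Categorical(logits=policy(states))
actions = dist.sample()                     # actions chosen by the current policy
returns = torch.randn(8)                    # placeholder returns for illustration

# Return-weighted negative log-likelihood of the chosen actions (REINFORCE-style loss)
loss = -(dist.log_prob(actions) * returns).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()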

On the application of Negative Log-Likelihood

NLL is widely used in probabilistic approaches to statistical modelling and machine learning and has been applied in many areas, including the following.

1. Logistic regression:
Example: binary classification problems
Abstract: Logistic regression is a widely used algorithm for two-class classification. The model predicts the probability that each sample belongs to the correct class, a loss (the NLL) is calculated from this probability, and the model is trained to minimise this loss.
APPLICATIONS:
– Tasks that use medical data to predict the probability that a patient will suffer from a particular disease (e.g. predicting the risk of lung cancer).
– Email spam filtering and prediction of advertisement click-through rates.

2. neural network classification tasks:
Example: multi-class classification (softmax regression)
Abstract: In the multi-class classification problem, the NLL is used to calculate the loss. Predicting the probability for each class with a softmax function and minimising the NLL is important for training neural networks.
APPLICATIONS:
– Image classification: tasks to predict which class (e.g. cat, dog, car, etc.) an object in an image belongs to (e.g. image classification using CIFAR-10 or ImageNet datasets).
– Speech recognition: tasks to classify speaker intent and language from speech data.
– Natural language processing: document classification, sentiment analysis and question answering tasks.

3. Gaussian Mixture Model (GMM):
Example: clustering problem
Abstract: A GMM is a clustering method that assumes that the data follows several normal distributions (Gaussian distributions); the parameters of the GMM are estimated to minimise the NLL, and an expectation maximisation method (EM algorithm) is usually used for this optimisation.
APPLICATIONS:
– Clustering of customer data (e.g. segmentation of target groups in marketing).
– Image segmentation (techniques for automatically distinguishing regions within an image).

4. Hidden Markov Models (HMM):
Example: modelling time series data
Abstract: HMMs are probabilistic models with unobserved hidden states, applied to time series data; training of HMMs is done by minimising the NLL, whereby the hidden states and transition probabilities are optimised so that the observed data are explained by the model.
APPLICATIONS:
– Speech recognition: analysing speech data and estimating the hidden states corresponding to utterances and words.
– Stock price prediction: predicting future patterns of fluctuation based on past stock price trends.

5. Generative Models:
Example: stochastic generative models
Abstract: Generative models are used to learn the distribution of data and to generate new data. The parameters of these models are learnt by minimising the NLL; for example, the Variational Autoencoder (VAE) described in “Variational Autoencoder (VAE) Overview, Algorithm and Example Implementation“ optimises its model via an NLL-based reconstruction term.
APPLICATIONS:
– Image generation: face and object image generation using VAE.
– Music generation: generating music as time-series data.

6. natural language processing (NLP):
Example: training language models
Abstract: In language models, NLL is used for the task of predicting the next word or phrase. The model is trained by calculating the probability of each word in a sentence and minimising the NLL.
APPLICATIONS:
– Machine translation: tasks that generate correct translations for sentence input (e.g. English to Japanese).
– Speech recognition: predicting the correct sequence of words when transcribing from spoken data.
– Text generation: predicting the next word or phrase and generating natural sentences (e.g. GPT-based language models).

7. reinforcement learning:
Example: policy optimisation
Abstract: In reinforcement learning, an agent learns policies in order to select optimal behaviour. Methods such as the policy gradient method use NLL to model the probability of an agent’s action and maximise that probability.
APPLICATIONS:
– Game AI: learning policies for an agent’s optimal behaviour in a game.
– Robotics: robots choose the best behaviour in an environment to perform a task.

8. Bayesian statistics:
Example: training a Bayesian model
Abstract: In Bayesian estimation, maximum a posteriori (MAP) estimation minimises the NLL together with a negative log-prior term to estimate the most plausible parameters given the data. In the Bayesian approach, the parameters are thus optimised by considering prior probabilities in addition to the likelihood.
APPLICATIONS:
– Medical data analysis: Bayesian prediction of disease risk based on patient data.
– Anomaly detection: anomaly detection tasks on sensor or network data.

Example implementation of Negative Log-Likelihood

An example implementation of Negative Log-Likelihood (NLL) in Python and PyTorch is described. The implementation shows a typical method of minimising NLL using a neural network.

1. implementation of NLL in a binary classification problem: using the logistic regression model as an example, an implementation of a loss function using NLL is presented.

Procedure:

  1. Define the model using the torch.nn module.
  2. Generate data for a binary classification problem.
  3. Set NLL as loss function and optimise using gradient descent.

Example implementation:

import torch
import torch.nn as nn
import torch.optim as optim

# Data generation (simple two-class classification data)
torch.manual_seed(0)
X = torch.randn(100, 2)  # 100 samples, 2 features
y = torch.randint(0, 2, (100,)).float()  # Randomly generated labels of 0 or 1

# Definition of logistic regression models.
class LogisticRegressionModel(nn.Module):
    def __init__(self):
        super(LogisticRegressionModel, self).__init__()
        self.linear = nn.Linear(2, 1)  # 1 output from 2 features

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

# Instantiating the model
model = LogisticRegressionModel()

# Binary cross-entropy (equivalent to NLL) as a loss function
criterion = nn.BCELoss()

# Optimiser (gradient descent method)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# training
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass: forecasting
    y_pred = model(X).squeeze()

    # Calculation of losses (NLL)
    loss = criterion(y_pred, y)

    # Calculation of gradients and parameter updates
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

Description:

  • Data generation: 100 2D data (X) and corresponding 0 or 1 labels (y) are randomly created.
  • Model definition: simple logistic regression is defined with LogisticRegressionModel. It is a simple linear model with two inputs and one output, and a sigmoid function is applied to the outputs.
  • Loss function: binary cross-entropy (BCELoss). This is the loss function equivalent to the NLL in binary classification.
  • Optimisation: stochastic gradient descent (SGD) is used to minimise the NLL.
  • Training: predictions are made at each epoch, losses are calculated and parameters are updated using gradients.

2. implementation of NLL in multi-class classification (softmax regression): an example of multi-class classification using NLL is presented next, where NLL is minimised using PyTorch’s CrossEntropyLoss.

Example implementation:

import torch
import torch.nn as nn
import torch.optim as optim

# Creating datasets for multi-class classification.
torch.manual_seed(0)
X = torch.randn(100, 3)  # 100 samples, 3 features.
y = torch.randint(0, 3, (100,))  # 3 classes of labels 0, 1, 2

# Definition of softmax regression models.
class SoftmaxRegressionModel(nn.Module):
    def __init__(self):
        super(SoftmaxRegressionModel, self).__init__()
        self.linear = nn.Linear(3, 3)  # Linear transformation from 3 features to 3 classes.

    def forward(self, x):
        return self.linear(x)

# Instantiating the model
model = SoftmaxRegressionModel()

# Cross-entropy loss (loss function based on NLL)
criterion = nn.CrossEntropyLoss()

# Optimiser (gradient descent method)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# training
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass: forecasting
    y_pred = model(X)

    # Calculation of losses (cross-entropy based on NLL)
    loss = criterion(y_pred, y)

    # Calculation of gradients and parameter updates
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

Description:

  • Data generation: 100 randomly generated samples (X) with three features and three classes of labels (y).
  • Model definition: with SoftmaxRegressionModel, a linear model is defined to predict the 3-class output from the 3-dimensional input.
  • Loss function: CrossEntropyLoss computes the NLL based on the Softmax output. This is the standard loss function for multi-class classification.
  • Optimisation: the model is trained to minimise the NLL in SGD.
  • Training: predictions are made at each epoch and the parameters are updated by calculating the losses so as to minimise the NLL.

Summary:

  • In binary classification, the binary cross-entropy loss function (BCELoss) corresponds to the NLL.
  • In multiclass classification, the cross-entropy loss (CrossEntropyLoss) calculates the NLL based on the softmax output.
  • In both cases, minimising the NLL allows the parameters of the model to be learnt and the prediction accuracy to be improved.

Negative Log-Likelihood challenges and measures to address them

Negative Log-Likelihood (NLL) presents several challenges, which often arise when training and evaluating models that use it. The following section describes some of these challenges and how they can be addressed.

1. class imbalance:

Challenge:

NLL calculates the loss based on the probability of the correct class in the classification task. However, if the class distribution in the dataset is extremely skewed, learning can be biased because the NLL gives little weight to the losses from minority classes. The majority class is then rarely misclassified, while classification performance on minority classes commonly deteriorates.

Solution:

Adjust class weights: by introducing per-class weights to the loss function, NLL can be adjusted for class imbalances, e.g. CrossEntropyLoss in PyTorch allows the weight option to be used to adjust the importance of each class.

# Example of introducing weights per class in PyTorch.
class_weights = torch.tensor([0.3, 0.7])  # Example: weights for class 0 and class 1
criterion = nn.CrossEntropyLoss(weight=class_weights)

Oversampling/Undersampling: it is also useful to balance the class distribution by either increasing the data for minority classes (oversampling) or reducing the data for majority classes (undersampling).
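
A minimal oversampling sketch using PyTorch’s WeightedRandomSampler (the data and weights below are made up for illustration): minority-class samples are given larger sampling weights so that mini-batches come out roughly balanced.

import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

# Imbalanced toy data: 90 samples of class 0, 10 samples of class 1
X = torch.randn(100, 2)
y = torch.cat([torch.zeros(90), torch.ones(10)]).long()

# Per-sample weights: inverse class frequency, so class 1 is drawn more often
class_counts = torch.bincount(y).float()
sample_weights = 1.0 / class_counts[y]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=20, sampler=sampler)

for xb, yb in loader:
    print(yb.float().mean().item())  # roughly 0.5 per batch instead of 0.1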

2. overconfidence:

Challenge:

Models trained with NLL can become overconfident, outputting high probabilities even for incorrect predictions. This is frequently observed in deep learning models in particular, and results in poorly calibrated probabilities and an outsized loss when the model is wrong.

Solution:

Label Smoothing: Label smoothing prevents overconfidence in the correct label by redistributing a small amount of the target probability mass to the other classes. This keeps the model from becoming overconfident and produces better-calibrated predictions.

# Example of PyTorch label smoothing.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothing of 0.1

Regularisation: use techniques such as L2 regularisation (weight decay) and dropout to prevent the model from overfitting and to ensure that uncertainty is adequately represented.

3. the Vanishing Gradient Problem:

Challenge:

Particularly in deep learning models, a vanishing gradient problem can occur during NLL minimisation, making it difficult to update the parameters of the earlier layers. This can make training progress very slow or stall it entirely.

Solution:

Choice of activation function: use an activation function that mitigates vanishing gradients, such as ReLU (Rectified Linear Unit) or Leaky ReLU.
Batch Normalisation: standardise the output of each layer so that gradients propagate properly and do not vanish.
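
As an illustrative sketch (the layer sizes below are arbitrary and not from the original text), a deeper classifier can combine both remedies, ReLU activations and batch normalisation:

import torch.nn as nn

# Deeper classifier using ReLU activations and batch normalisation
# to keep gradients flowing through the earlier layers
model = nn.Sequential(
    nn.Linear(3, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 3),   # logits for 3 classes, to be used with nn.CrossEntropyLoss
)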

4. sensitivity to noisy data:

Challenge:

NLL calculates losses based on labelled data and is therefore very sensitive to mislabelled or noisy data, which can reduce the accuracy of the model.

Solution:

Use robust loss functions: noise-tolerant loss functions such as Focal Loss can be used, which emphasises loss for difficult samples and misclassification, while reducing the impact of easy samples.

# Example implementation of Focal Loss (for logits and integer class targets)
class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, inputs, targets):
        # Per-sample cross-entropy (NLL), left unreduced so the focal weight is applied per sample
        ce_loss = nn.CrossEntropyLoss(reduction='none')(inputs, targets)
        pt = torch.exp(-ce_loss)  # probability assigned to the correct class
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()

Data cleaning: improve the quality of training by detecting erroneous labels and noise in the data set in advance and removing or correcting them.

5. computational cost when probabilistic models are complex:

Challenge:

NLL calculations can be computationally expensive for complex probabilistic models and large amounts of data. Especially for large datasets and deep learning models, the NLL computation can become a bottleneck.

Solution:

Mini-batch learning: computing the NLL on mini-batches instead of processing the entire dataset at once reduces memory usage and computational cost, which is particularly effective for large datasets (a short sketch follows below). See detail in “Overview of mini-batch learning and examples of algorithms and implementations“.
Devising optimisation algorithms: instead of plain stochastic gradient descent (SGD), more efficient optimisation methods such as Adam or RMSProp can be used to improve computational efficiency.
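
A minimal mini-batch training sketch (reusing the softmax-regression setup from the implementation section above; the batch size and other values are arbitrary):

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
X = torch.randn(1000, 3)              # larger synthetic dataset
y = torch.randint(0, 3, (1000,))

loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(3, 3)               # simple softmax-regression model (outputs logits)
criterion = nn.CrossEntropyLoss()     # NLL-based loss
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(10):
    for xb, yb in loader:             # each step only touches one mini-batch
        loss = criterion(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()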

6. the possibility of falling into locally optimal solutions:

Challenge:

In the process of minimising the NLL there is a risk of falling into a local optimum, especially for complex non-linear models. In that case the global optimum cannot be reached and the model cannot be adequately trained.

Solution:

Use of different initialisation methods: careful parameter initialisation can reduce the risk of falling into a poor local optimum; He initialisation and Xavier initialisation are particularly effective for deep neural networks (a short sketch follows below).
Ensemble learning: running training several times with different initialisations and different models and combining their results can reduce the impact of local optima.
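
A minimal sketch of applying He (Kaiming) initialisation to the linear layers of a model with torch.nn.init (for Xavier initialisation, nn.init.xavier_uniform_ would be used instead; the model itself is arbitrary):

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 3),
)

def init_weights(m):
    # He (Kaiming) initialisation, suited to ReLU activations
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

model.apply(init_weights)  # applies init_weights to every submodule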

