Overview of LightGBM and its implementation in various languages

LightGBM Overview

LightGBM is a Gradient Boosting Machine (GBM) framework developed by Microsoft, designed to build fast and accurate machine learning models on large data sets. LightGBM trains gradient-boosted decision trees, and it achieves fast learning by bucketing continuous feature values into discrete bins with a histogram-based algorithm, which greatly reduces the cost of computing gradient statistics at each split. LightGBM also excels at handling categorical variables and can use them directly, without converting them to numerical values first.

The main features of LightGBM are as follows:

  • Fast training: models can be trained quickly even on large data sets
  • Memory efficiency: histogram binning keeps memory usage low
  • Categorical variable handling: categorical variables can be used directly, with no prior encoding required
  • High accuracy: produces accurate predictions

LightGBM can be used from many programming languages, including Python, R, Java, and C++, and supports a variety of tasks, including classification, regression, ranking, and multiclass classification.
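For example, the categorical-variable support can be used directly from the Python API via the categorical_feature argument. The following is a minimal sketch; the column names and data are invented for illustration:

import lightgbm as lgb
import pandas as pd

# Toy data; "city" is a categorical column (hypothetical example data).
df = pd.DataFrame({
    "city": pd.Categorical(["tokyo", "osaka", "tokyo", "nagoya"] * 25),
    "age": [23, 45, 31, 52] * 25,
    "target": [0, 1, 0, 1] * 25,
})

# Declare the categorical column by name; no one-hot or label encoding needed.
train_data = lgb.Dataset(
    df[["city", "age"]],
    label=df["target"],
    categorical_feature=["city"],
)
model = lgb.train({"objective": "binary", "verbose": -1}, train_data,
                  num_boost_round=10)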

In GBM, decision trees are used as weak learners to model the relationship between the objective variable and the explanatory variables by partitioning the data: trees are applied iteratively, with each round learning to reduce the error between the current prediction and the true value. LightGBM is an improved version of this GBM algorithm that trains faster and more efficiently, using the following techniques.

  • Gradient-based One-Side Sampling (GOSS): improves computational efficiency by preferentially sampling the data points with large gradients.
  • Exclusive Feature Bundling (EFB): improves computational efficiency by bundling mutually exclusive features (features that rarely take non-zero values at the same time), exploiting data sparsity.
  • Light Histogram: reduces memory usage and improves computational efficiency by employing a histogram-based binning algorithm.

Among these techniques, Gradient-based One-Side Sampling (GOSS) is a data sampling method that improves computational efficiency by concentrating on the data points with large gradients.

To put this more concretely: random sampling can speed up model training, but it discards large-gradient points (the poorly fit points that matter most for learning) at the same rate as small-gradient points, which hurts learning efficiency. GOSS instead always keeps the points with large gradients and randomly samples only a fraction of the points with small gradients. The procedure is as follows (a code sketch follows the steps):

  1. Calculate the gradient magnitude for every data point.
  2. Keep the points with the largest gradients, and randomly sample only a fixed fraction of the remaining small-gradient points, up-weighting them to keep the gradient estimate unbiased.
  3. Train the model using only the selected data points.
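The following is a minimal NumPy sketch of the GOSS idea, not LightGBM's internal implementation; the ratios a (fraction of large-gradient points kept) and b (fraction of the rest sampled) follow the notation of the original paper:

import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    # Keep the top `a` fraction of points by |gradient|, plus a random
    # `b` fraction of the remainder, up-weighted by (1 - a) / b so the
    # overall gradient estimate stays approximately unbiased.
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))      # descending by |gradient|
    top_k = int(a * n)
    top_idx = order[:top_k]                     # always kept
    rest = order[top_k:]
    rand_idx = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(idx))
    weights[top_k:] = (1.0 - a) / b             # compensate the dropped mass
    return idx, weights

# Example: reduce 100,000 gradients to ~30% of the data.
grads = np.random.default_rng(1).normal(size=100_000)
idx, w = goss_sample(grads, a=0.2, b=0.1)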

Exclusive Feature Bundling (EFB), another technique used in LightGBM, is a feature engineering method: EFB improves training efficiency by bundling mutually exclusive features, i.e. features that rarely take non-zero values at the same time.

To be more specific about EFB: whereas ordinary feature engineering creates new features by considering the relationships between features, EFB exploits sparsity to reduce the effective number of features. The procedure of EFB is as follows (a sketch follows the list).

  1. Group mutually exclusive features into bundles.
  2. For each bundle, verify that its features rarely conflict, i.e. rarely take non-zero values on the same rows.
  3. Merge each bundle into a single new feature, offsetting the value ranges of its members so the original features can still be distinguished.
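As a minimal sketch of the bundling idea (illustrative only, with toy one-hot-style columns; not LightGBM's internal code), two mutually exclusive features can be merged into one column by offsetting the second feature's value range:

import numpy as np

def bundle_exclusive(f1, f2):
    # Merge two mutually exclusive sparse features into one column by
    # shifting f2's values past f1's range, so the bundled column can
    # still be decoded back into its components.
    assert not np.any((f1 != 0) & (f2 != 0)), "features must be exclusive"
    offset = f1.max() + 1
    bundled = f1.astype(float)
    nz = f2 != 0
    bundled[nz] = f2[nz] + offset
    return bundled

# Two sparse columns that never fire on the same row (toy example).
f1 = np.array([1, 0, 2, 0, 0])
f2 = np.array([0, 3, 0, 0, 1])
print(bundle_exclusive(f1, f2))   # [1. 6. 2. 0. 4.]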

Light Histogram, the last of the techniques listed above, is a feature binning method used when training decision trees. In an ordinary GBM, the value range of each feature is typically divided into equally spaced bins and the values in each bin are summarized to discretize the feature. Light Histogram achieves more accurate discretization by assigning bins to the regions where data appear most frequently, rather than at equal intervals.

Since Light Histogram dynamically sets the bin boundaries, it uses less memory and is faster than the usual equally spaced bin partitioning, enabling the construction of highly accurate prediction models through fast and accurate feature discretization. However, Light Histogram has the disadvantage of being susceptible to outliers, so when outliers are present, other bin segmentation methods should be considered.
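As an illustration of frequency-aware binning (a sketch of the idea only, not LightGBM's internal code), quantile-based bin edges place more boundaries where the data are dense, unlike equal-width bins:

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10_000)   # skewed feature with a long tail

n_bins = 8
equal_edges = np.linspace(x.min(), x.max(), n_bins + 1)         # equal width
quantile_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))  # frequency-based

# Discretize: each value is replaced by the index of its (quantile) bin.
bins = np.digitize(x, quantile_edges[1:-1])
print(np.round(equal_edges, 2))
print(np.round(quantile_edges, 2))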

The following are examples of applications of LightGBM.

  • Click-through rate (CTR) prediction: LightGBM is used to predict the probability that an ad will be clicked, using user history, ad attributes, and other characteristics as inputs.
  • Image recognition: LightGBM is also used for image recognition tasks where it can be faster and more memory efficient than deep learning. Examples include human detection in security cameras and road recognition for self-driving cars.
  • Natural language processing (NLP): LightGBM is widely used for NLP tasks such as document classification and sentiment analysis.
  • Recommendation systems: LightGBM is used to make product recommendations from data such as user purchase history and browsing history.
  • Time-series prediction: LightGBM is also used to predict time-series data, for example stock price and temperature forecasts.
Python implementation of LightGBM

A Python implementation of LightGBM proceeds as follows.

  1. Install LightGBM.
pip install lightgbm
  2. Read the data set.
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
  3. Separate features and objective variables.
X_train = train_df.drop(['target'], axis=1)
y_train = train_df['target']

X_test = test_df.drop(['target'], axis=1)
y_test = test_df['target']
  4. Convert to LightGBM data sets.
import lightgbm as lgb

train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test)
  5. Set hyperparameters.
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}
  6. Train the model.
model = lgb.train(
    params,
    train_data,
    valid_sets=[test_data],
    # Early stopping is configured via a callback; the old
    # early_stopping_rounds argument was removed in LightGBM 4.x.
    callbacks=[lgb.early_stopping(stopping_rounds=10)],
)
  7. Make predictions with the model.
y_pred = model.predict(X_test, num_iteration=model.best_iteration)

When actually using LightGBM, it is necessary to adjust hyperparameters and perform feature engineering depending on the data and the problem.
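For example, a simple grid search over a few hyperparameters can be cross-validated with LightGBM's built-in lgb.cv. The following sketch reuses the train_data object built above; the grid values are illustrative, not recommendations:

import numpy as np

best_score, best_params = np.inf, None
for num_leaves in (15, 31, 63):
    for lr in (0.01, 0.05, 0.1):
        params = {
            'objective': 'regression',
            'metric': 'rmse',
            'num_leaves': num_leaves,
            'learning_rate': lr,
            'verbose': -1,
        }
        cv = lgb.cv(params, train_data, num_boost_round=500, nfold=5,
                    stratified=False,  # stratified folds are for classification
                    callbacks=[lgb.early_stopping(10)])
        # The result key name varies by LightGBM version
        # ('rmse-mean' in 3.x, 'valid rmse-mean' in 4.x).
        key = [k for k in cv if k.endswith('rmse-mean')][0]
        score = min(cv[key])
        if score < best_score:
            best_score, best_params = score, params

print(best_score, best_params)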

Next, we describe the implementation in R.

Implementation of LightGBM in R

The implementation of LightGBM in R is as follows.

  1. Install LightGBM.
install.packages("lightgbm")
  2. Read the data set.
train_df <- read.csv("train.csv")
test_df <- read.csv("test.csv")
  3. Separate features and objective variables.
X_train <- train_df[, setdiff(names(train_df), "target")]
y_train <- train_df$target

X_test <- test_df[, setdiff(names(test_df), "target")]
y_test <- test_df$target
  4. Convert to LightGBM data sets.
library(lightgbm)

# lgb.Dataset expects a matrix (or sparse matrix), not a data frame
train_data <- lgb.Dataset(as.matrix(X_train), label = y_train)
test_data <- lgb.Dataset(as.matrix(X_test), label = y_test)
  5. Set hyperparameters.
params <- list(
  objective = "regression",
  metric = "rmse",
  boosting_type = "gbdt",
  num_leaves = 31,
  learning_rate = 0.05,
  feature_fraction = 0.9
)
  6. Train the model.
model <- lgb.train(
  params = params,
  data = train_data,
  nrounds = 1000,
  valids = list(test = test_data),  # valids must be a named list
  early_stopping_rounds = 10
)
  7. Make predictions with the model.
y_pred <- predict(model, as.matrix(X_test))
Implementation of LightGBM in Clojure

To implement LightGBM in Clojure, you can use the Java Native Interface (JNI), just as when implementing LightGBM in Java. This can be done by following the steps below.

  1. Project setup: To implement LightGBM in Clojure, first set up a project and install a Clojure development environment, using a build tool such as Leiningen or Boot.
  2. Obtaining the LightGBM binary: Obtain the LightGBM binary files. You can download and build the source code from the official LightGBM website, or obtain pre-compiled binaries from the GitHub repository.
  3. Creating a JNI library: Create a JNI library that wraps the LightGBM binary. As in a Java implementation, this requires C++ code and Java code that use JNI. The procedure is as follows:
    1. Generating the JNI header file: The LightGBM source code contains the C++ code needed to build a JNI library, including a JNI header file named lightgbm_jni.h, which is used to generate the JNI bindings.
    2. Creating the JNI library: Write C++ code that includes the JNI header file and implements the JNI functions that interact with the LightGBM binary.
    3. Building the JNI library: Build the C++ code to generate the JNI library. The build process varies with the development environment: you can build with CMake or a Makefile, or use the build function of an IDE (Eclipse, Visual Studio, etc.). During the build, the paths to the LightGBM binary and the required libraries must be specified.
    4. Setting up the JNI library: After the build completes, incorporate the generated JNI library (e.g., a .so file) into your project, and write code that loads the library so that LightGBM's functionality can be accessed.
  4. Writing Clojure code: Once the LightGBM JNI library is set up, you can call LightGBM from Clojure via Java interop, writing code against LightGBM's Java API in the same way you would in Java.
