k-means
k-means is one of the algorithms used for clustering, a machine learning task applicable to a wide variety of problems. Clustering here refers to dividing data points into groups (clusters) of similar characteristics. The k-means algorithm aims to divide the given data into a specified number of clusters.
The main ideas and procedures of k-means are described below.
- Initialization: First, specify the number of clusters k and randomly select k center points (centroids). These centroids serve as representative positions for the clusters.
- Assignment: Assign each data point to the nearest centroid; that is, each data point joins the cluster of the centroid closest to it in feature space.
- Recalculation: Compute the mean of the data points within each cluster and set that mean as the new centroid. This moves each centroid to adapt to the data points in its cluster.
- Convergence check: Steps 2 and 3 are repeated until the centroid positions converge or a maximum number of iterations is reached. Once converged, the cluster assignment of each data point is fixed. A minimal implementation sketch of these steps is shown below.
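To make the procedure concrete, the following is a minimal NumPy sketch of the steps above; the function and variable names are illustrative, and for simplicity it assumes no cluster becomes empty during the iterations.
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: index of the nearest centroid for each data point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculation: mean of the points assigned to each cluster
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence check: stop when the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels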
The advantages and disadvantages of k-means are summarized below.
<Advantages>
- High computational efficiency: k-means is a relatively simple algorithm and is fast even for large data sets. It is particularly effective when the number of clusters is small.
- Scalability: k-means is highly scalable and can be applied to multi-dimensional data sets. It can also produce appropriate results for large data sets.
- Intuitive Interpretability: Because data is partitioned using cluster centers (centroids), it is relatively intuitive to understand what characteristics each cluster has.
- Simplicity of preprocessing: k-means requires relatively little data preprocessing, although, being distance-based, it generally benefits from standardizing features that have very different scales.
<Disadvantages>
- Effect of initialization on results: The choice of initial cluster center can significantly change the results. Therefore, different initializations should be tried.
- Specifying the number of clusters: k-means requires the number of clusters to be specified in advance, which can make selection difficult when the appropriate number is not known beforehand.
- Limitation to spherical clusters: k-means assumes that clusters are roughly spherical. Non-spherical clusters may not be segmented correctly.
- Sensitivity to outliers: if outliers exist, they can pull the cluster centers toward them and distort the clustering.
- Effect of the chosen distance measure: since k-means is a distance-based algorithm, the chosen distance measure affects the clustering results.
In summary, k-means is a simple and efficient clustering method that is often worth applying first in a variety of tasks. When doing so, one should be aware of its limitations: selecting an appropriate number of clusters, partitioning non-spherical clusters, sensitivity to outliers, and variation in results due to initialization.
Mathematical model
The mathematical model of k-means is described below.
- Data: In k-means, n data points in a d-dimensional Euclidean space are considered. They are collected in a data matrix X, where each data point xᵢ is a d-dimensional feature vector. \[X = [x₁, x₂, …, xₙ], xᵢ ∈ ℝᵈ\]
- Cluster centers: the goal of clustering is to divide the data into k clusters. Each cluster has one center (centroid), and the cluster centers are denoted μ₁, μ₂, …, μₖ.
- Assignment: each data point xᵢ is assigned to the nearest cluster center. That is, if data point xᵢ belongs to cluster j, then the assignment function c(xᵢ) = j holds.
- Cluster center update: The center of each cluster is updated to the mean (center of gravity) of the data points belonging to that cluster. The new cluster center is calculated as follows. \[μⱼ = (1 / |Cⱼ|) Σ_{i ∈ Cⱼ} xᵢ\] where Cⱼ = {i : c(xᵢ) = j} is the index set of the data points belonging to cluster j and |Cⱼ| is the number of its elements.
- Optimization goal: The optimization goal of k-means is to minimize the sum of squared distances from each data point to the center of its cluster, which is expressed as \[J = Σᵢ₌₁ⁿ ||xᵢ − μ_{c(xᵢ)}||²\] where ||⋅|| denotes the Euclidean norm and μ_{c(xᵢ)} is the center of the cluster to which data point xᵢ belongs.
- Optimization method: k-means converges by alternately updating cluster centers and data point assignments; a typical method iteratively performs the following steps:
- Assign data points to the nearest cluster center.
- Update the cluster centers with the average of the assigned data points.
- Finally, as k-means converges, each data point is assigned to its nearest cluster center, and each cluster center converges to the mean of its assigned data points.
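As a quick check on this notation, the objective J can be computed directly from the fitted labels and centers. The snippet below (dummy data, illustrative names) verifies that it matches scikit-learn's inertia_ attribute, which stores the same sum of squared distances.
import numpy as np
from sklearn.cluster import KMeans

# Dummy data and a fitted model
X = np.random.default_rng(0).normal(size=(200, 2))
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# J = sum over all points of the squared distance to the assigned center
J = np.sum(np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1) ** 2)
print(J, kmeans.inertia_)  # the two values agree up to floating-point error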
Derived algorithms of k-means and its combination with other machine learning techniques
<Derived Algorithms>
There are several limitations and problems with k-means, and several derived algorithms have been proposed to solve these problems. Some of the derived algorithms of k-means are described below.
- K-means++: A version with an improved method of selecting initial cluster centers. In plain k-means the initial centers are chosen uniformly at random, which can result in slow convergence or a high probability of converging to a poor locally optimal solution. k-means++ instead chooses each new initial center with probability proportional to its squared distance from the centers already chosen, spreading the initial centers apart and typically improving both convergence speed and solution quality.
- K-medoids (PAM: Partitioning Around Medoids): While k-means represents each cluster center by the mean of its points, k-medoids represents each cluster by an actual data point (the medoid), which makes it more robust to outliers. See also “Explainable Machine Learning (19) prototype and Criticism” for more information on k-medoids.
- K-means++-PAM: An algorithm that combines k-means++ and k-medoids: the initial cluster centers are selected with the k-means++ scheme, and clustering then proceeds with k-medoids.
- Bisecting K-means: A method for hierarchically partitioning clusters, starting with a single cluster for all data, then iteratively partitioning the largest cluster into two, up to a specified number of clusters. See also “Hierarchical Clustering with R” for more information.
- Spectral Clustering: Spectral clustering, also described in “Clustering of Symmetric Relational Data – Spectral Clustering”, is an approach based on graph theory: the data is viewed as a graph, and the eigenvectors of the graph’s Laplacian matrix are used to perform the clustering. It is particularly effective for non-convex cluster shapes and high-dimensional data.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN, described in “Overview of DBSCAN and Examples of Applications and Implementations”, takes a different approach to clustering than k-means: it detects clusters based on density and provides clustering that is robust to outliers.
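As a small illustration of the k-means++ initialization mentioned above, the following sketch compares the final objective (inertia_) obtained from a single run with random initialization versus k-means++ initialization in scikit-learn. The exact numbers depend on the data and seeds, but k-means++ is typically no worse and often better.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=1)

# One initialization each (n_init=1), differing only in the init scheme
for init in ("random", "k-means++"):
    km = KMeans(n_clusters=5, init=init, n_init=1, random_state=0).fit(X)
    print(init, km.inertia_)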
<Combination with other machine learning techniques>
Although k-means is a basic clustering technique, more advanced clustering and applications are possible by combining it with other machine learning techniques. The following are examples of combining k-means with other machine learning techniques.
- Feature Extraction and Clustering: k-means can also be applied after a prior feature-extraction step. For example, for text data, feature vectors such as TF-IDF (described in “Overview of tfidf and its implementation in Clojure”) can be extracted, and similar documents can then be clustered on those vectors using k-means.
- Dimensionality Reduction and Visualization: When clustering high-dimensional data, problems can arise from the curse of dimensionality; before applying k-means, it can therefore help to apply dimensionality reduction methods such as PCA, described in “About Principle Component Analysis (PCA)”, or t-SNE, described in “t-SNE (t-distributed Stochastic Neighbor Embedding)”.
- Ensemble Clustering: There are also ensemble clustering methods that combine the results of multiple clustering algorithms (k-means, hierarchical clustering, etc.). This can improve the stability and accuracy of clustering while taking advantage of the different algorithms. For more information on ensemble methods, see “Machine Learning with Ensemble Methods – Fundamentals and Algorithms” .
- Supervised Clustering: As part of supervised learning, data can be pre-clustered using k-means and the cluster information used as labels to augment the training data. This allows a classifier to be trained on a dataset with missing labels. See also “small data learning, combining logic with machine learning, and local/population learning”.
- Anomaly detection: clustering is performed using k-means to determine which cluster each data point belongs to; the distance between each data point and its cluster centroid can then be used to detect anomalous data points (a sketch is shown after this list). For more information on anomaly detection in general, see “Anomaly Detection and Change Detection Techniques”.
- Semi-supervised learning: a semi-supervised learning approach can be implemented by clustering data using k-means and then linking the unsupervised clusters to labeled data. See also “small data learning, combining logic with machine learning, and local/group learning”.
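As a sketch of the anomaly-detection idea above: in scikit-learn, KMeans.transform returns the distance from each point to every cluster center, so points far from all centers can be flagged. The 95th-percentile threshold below is an arbitrary illustrative choice.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance from each point to its nearest cluster center
dist_to_center = kmeans.transform(X).min(axis=1)

# Flag points unusually far from every center (illustrative threshold)
threshold = np.percentile(dist_to_center, 95)
anomalies = X[dist_to_center > threshold]
print(len(anomalies), "candidate anomalies")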
k-means applications
k-means is a clustering method used in a wide range of applications.
- Customer segmentation: In the marketing field, k-means clusters customers with similar purchasing patterns and characteristics based on customer purchase history and behavioral data. This enables the development of targeted marketing strategies for different segments of customers.
- Image Segmentation: In the image processing field, k-means is used to segment an image into regions of similar color or texture. This is useful for object detection and image analysis.
- Document Clustering: In the field of natural language processing, text data is analyzed to cluster documents with similar themes or content. This can be used to group documents and improve information retrieval.
- Biological data analysis: In the field of bioinformatics, k-means is used to gain biological insights by clustering species and functional groups of organisms based on gene expression data, protein information, and so on.
- Music genre classification: Clustering of similar music tracks based on musical characteristics (acoustic characteristics, instruments, etc.) is used for genre classification and playlist creation.
- Urban population distribution analysis: In urban planning and geographic information systems, districts within a city are clustered based on features such as population density and housing prices, and used to identify different types of neighborhoods.
- Financial Data Analysis: In the financial sector, k-means is used to cluster assets and portfolios using time series data and market indicators to assess risk and develop investment strategies.
k-means is used in diverse fields. By applying k-means to appropriate data and problems, patterns and structures can be discovered and insights gained.
Example implementation of k-means
An example of implementing the k-means algorithm with Python’s scikit-learn library on a simple data set is shown below.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# dummy data generation
data, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
# k-means clustering
k = 4 # Number of clusters
kmeans = KMeans(n_clusters=k)
kmeans.fit(data)
# Cluster center point
centroids = kmeans.cluster_centers_
# Labels for data points belonging to the cluster
labels = kmeans.labels_
# Visualization of data points and center points
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', s=200, color='red')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('k-means Clustering')
plt.show()
In this example, the make_blobs function in scikit-learn is used to generate dummy data with four clusters, which are then clustered using k-means: the number of clusters k is specified, clustering is performed with the KMeans class, and the data points and cluster centers are visualized.
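Once fitted, the same model can also assign new, unseen points to the learned clusters. Continuing from the variables in the example above:
# Assign new points to the nearest learned centroid
new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
print(kmeans.predict(new_points))  # cluster index for each new point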
Example of k-means implementation for multidimensional data
An example implementation of k-means for multidimensional data using Python and the scikit-learn library is shown.
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Generate dummy multidimensional data
np.random.seed(0)
n_samples = 300
n_features = 2
n_clusters = 3
X = np.random.randn(n_samples, n_features) + np.array([2, 2])
X = np.vstack((X, np.random.randn(n_samples, n_features) + np.array([0, -2])))
X = np.vstack((X, np.random.randn(n_samples, n_features) + np.array([-2, 2])))
# Performing k-means clustering
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(X)
# Get cluster center and labels
cluster_centers = kmeans.cluster_centers_
labels = kmeans.labels_
# Visualization of clustering results
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.5)
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='x', s=100)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('k-means Clustering')
plt.show()
In this example, dummy multidimensional data is first generated, then k-means clustering is performed using the KMeans class to obtain cluster centers and labels for each data point. Finally, the clustering results are visualized as a scatter plot.
When applying k-means to actual data, it is necessary to select an appropriate number of clusters. While the number of clusters may sometimes be known in advance, methods such as the elbow method or the silhouette score can otherwise be used to find it; data preprocessing and parameter tuning are also important factors.
Elbow Method Overview and Implementation
The Elbow Method is a common method for determining the appropriate number of clusters in k-means clustering. The idea is to plot how the within-cluster sum of squared errors decreases as the number of clusters is increased, and to select the number of clusters at which an elbow (bend) appears. Below are the steps of the elbow method and an example Python implementation.
- Start with 1 cluster and increase the number of clusters.
- For each number of clusters, calculate the Sum of Squared Errors (SSE). This is the sum of the squares of the distances between each data point and the cluster center to which it belongs.
- Graph the SSE and select the number of clusters at which an elbow (bend) appears.
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Generate dummy multidimensional data
np.random.seed(0)
data = np.random.randn(300, 2)
# Specify a range of cluster counts
cluster_range = range(1, 11)
sse = []
# Run K-means for each number of clusters and calculate SSE
for k in cluster_range:
    kmeans = KMeans(n_clusters=k, n_init=10)  # explicit n_init for consistency across scikit-learn versions
    kmeans.fit(data)
    sse.append(kmeans.inertia_)
# Find elbows by graphing SSE
plt.plot(cluster_range, sse, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
plt.title('Elbow Method for Optimal Clusters')
plt.show()
The code runs k-means with the number of clusters from 1 to 10, and calculates and graphs the SSE (within-cluster sum of squared errors) for each number of clusters. When using the elbow method, the goal is to find the bend in the graph where the rate of decrease of the SSE slows sharply; the number of clusters at that elbow is taken as appropriate for the data.
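If a numerical hint is wanted in addition to visual inspection, one rough heuristic is to take the point of maximum curvature of the SSE curve, approximated by discrete second differences. This is only an aid, not a definitive rule (and on unstructured random data like the example above, no clear elbow may exist). Continuing from the variables in the code above:
# Approximate the elbow as the point of maximum curvature of the SSE
# curve, using discrete second differences as a curvature proxy
second_diff = np.diff(np.array(sse), n=2)
elbow_k = cluster_range[int(np.argmax(second_diff)) + 1]  # +1 offsets the differencing
print("Estimated elbow at k =", elbow_k)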
For more information on the elbow method, see also “Evaluating Clustering to Master k-means“.
Overview and Implementation of the Silhouette Score Method
The Silhouette Score is one of the measures used to evaluate the results of k-means clustering. It is useful for evaluating the quality of the clusters and selecting the appropriate number of clusters. The Silhouette Score is calculated by considering how dense the data points in a cluster are (high similarity) and how far away they are from other clusters.
The silhouette score is calculated for each data point from the following two values:
- a(i): the average distance from data point i to the other data points in its own cluster.
- b(i): the average distance from data point i to the data points in the nearest other cluster.
The silhouette score s(i) is then calculated by the following formula:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
The quality of the clustering is evaluated by calculating the silhouette score of all data points and taking the average of the silhouette scores. The silhouette score ranges from -1 to 1, with higher values indicating greater similarity within a cluster and greater distance from other clusters.
When using the silhouette score to find the appropriate number of clusters, it is common to calculate the silhouette score for different numbers of clusters and select the number of clusters with the largest score.
Below is an example of calculating the silhouette score using Python’s scikit-learn library.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Generate dummy multidimensional data
np.random.seed(0)
data = np.random.randn(300, 2)
# Specify a range of cluster counts
cluster_range = range(2, 11)
silhouette_scores = []
# Run K-means for each number of clusters and calculate silhouette score
for k in cluster_range:
    kmeans = KMeans(n_clusters=k, n_init=10)
    labels = kmeans.fit_predict(data)
    score = silhouette_score(data, labels)
    silhouette_scores.append(score)
# Graph silhouette scores
plt.plot(cluster_range, silhouette_scores, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Optimal Clusters')
plt.show()
The code runs k-means for different numbers of clusters and calculates and graphs the silhouette score for each. The number of clusters with the largest silhouette score is typically chosen as the appropriate number for the data.
Example implementation of hierarchical k-means
An example of hierarchical clustering using Python and scikit-learn is described below. Note that while Bisecting K-means partitions data top-down, this example uses AgglomerativeClustering, which builds the hierarchy bottom-up; a Bisecting K-means sketch follows at the end of this section.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
# Generate dummy multidimensional data
data, labels = make_blobs(n_samples=300, centers=3, random_state=0, cluster_std=1.0)
# Perform hierarchical k-means clustering
agg_cluster = AgglomerativeClustering(n_clusters=3)
agg_labels = agg_cluster.fit_predict(data)
# Creating a dendrogram
dendrogram = sch.dendrogram(sch.linkage(data, method='ward'))
plt.show()
# Visualization of results
plt.scatter(data[:, 0], data[:, 1], c=agg_labels, cmap='viridis', alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Hierarchical K-means Clustering')
plt.show()
In this example, dummy multidimensional data is first generated, then hierarchical (agglomerative) clustering is performed using AgglomerativeClustering, and a dendrogram (a graph visualizing the hierarchical structure) is created. Finally, the clustering results are visualized as a scatter plot.
In hierarchical clustering, the results are represented as a hierarchy, so the number of clusters can be chosen at different levels, and the dendrogram can help in choosing an appropriate number of clusters.
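For the top-down Bisecting K-means approach itself, recent versions of scikit-learn (1.1 and later) provide a BisectingKMeans class. A minimal sketch, assuming such a version is installed:
from sklearn.cluster import BisectingKMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

data, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Repeatedly split one cluster in two (by default, the cluster with the
# largest SSE) until the requested number of clusters is reached
bkm = BisectingKMeans(n_clusters=3, random_state=0)
bkm_labels = bkm.fit_predict(data)

plt.scatter(data[:, 0], data[:, 1], c=bkm_labels, cmap='viridis', alpha=0.5)
plt.title('Bisecting K-means Clustering')
plt.show()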
Reference Information and Reference Books
See “General Machine Learning and Data Analysis” for general machine learning algorithms, including k-means.
For specific exercises on particular topics, see “python and algorithms”, “Machine Learning with python”, “Statistical modeling with python”, and “Optimization methods with python”.