Similarity in Machine Learning

Similarity

Similarity is a concept that describes the degree to which two or more objects share common characteristics or properties and can be regarded as similar to one another. It plays an important role in evaluating, classifying, and grouping objects by comparison and relatedness. Similarity is generally measured by mathematical or quantitative methods, but it may also involve subjective assessment.

The mathematical definition of similarity is as follows.

Definition 1 (Similarity)
A similarity σ : o × o → R is a function from a pair of entities to a real number, denoting the similarity between two objects, such that:
∀x, y ∈ o, σ(x, y) ≥ 0 (positivity)
∀x, y, z ∈ o, σ(x, x) ≥ σ(y, z) (maximality)
∀x, y ∈ o, σ(x, y) = σ(y, x) (symmetry)

The similarity function σ is positive, the similarity between an entity and itself is maximal, and the similarity between x and y equals the similarity between y and x. Dissimilarity, the counterpart of similarity, can now be defined as follows.

Definition 2 (Dissimilarity)
Given a set of entities o, a dissimilarity δ : o × o → R is a function from a pair of entities to a real number such that:
∀x, y ∈ o, δ(x, y) ≥ 0 (positivity)
∀x ∈ o, δ(x, x) = 0 (minimality)
∀x, y ∈ o, δ(x, y) = δ(y, x) (symmetry)

The dissimilarity function has the same positivity and symmetry properties as the similarity function, except that the dissimilarity between an entity and itself is minimal (zero).
There is also an approach (Tversky 1977) that drops the symmetry requirement and introduces asymmetric (dis)similarity; in that case the terms asymmetric measure or pre-similarity are used. Concepts more restrictive than dissimilarity include distance and ultrametrics (hyperdistances).

In addition, the “equivalence relation” mentioned in the section on computer-mathematics basics is defined by (1) a ~ a (reflexivity), (2) a ~ b implies b ~ a (symmetry), and (3) a ~ b and b ~ c imply a ~ c (transitivity). Compared with the definition of similarity, a relation can be interpreted as equivalence (identity) when the transitive law (3) also holds, and as mere similarity when only the reflexive law (1) and the symmetric law (2) hold.

Definition 3 (Distance)
A distance (or metric) δ : o × o → R is a dissimilarity function satisfying the definiteness condition and the triangle inequality:
∀x, y ∈ o, δ(x, y) = 0 if and only if x = y (definiteness)
∀x, y, z ∈ o, δ(x, y) + δ(y, z) ≥ δ(x, z) (triangle inequality)

That is, δ(x, y) is zero exactly when x = y, and the direct distance from x to z is never greater than the detour through y (the sum of the distances from x to y and from y to z).

Definition 4 (Hyperdistance Space)
Given a set of entities o, a hyperdistance (ultrametric) is a distance that additionally satisfies:
∀x, y, z ∈ o, δ(x, y) ≤ max(δ(x, z), δ(y, z)) (ultrametric inequality)

Especially when similarities of entities of different types have to be compared, the measures need to be normalized. A common normalization method is to rescale each value to a common range in proportion to the size of the space under consideration.

Definition 5 (Normalized (dis)similarity)
A (dis)similarity is said to be normalized if it spans the real unit interval [0, 1].
The normalized similarity and dissimilarity are denoted σ and δ below.

A normalized similarity σ corresponds to a normalized dissimilarity δ = 1 − σ, and vice versa. In what follows, we assume that most measures are normalized and that the dissimilarity between two entities is a real number between 0 and 1.
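
To make these definitions concrete, below is a minimal Python sketch (an illustration added here, not part of the original definitions) that checks positivity, maximality, and symmetry for a simple, hypothetical normalized string similarity and derives the corresponding dissimilarity δ = 1 − σ.

```python
# A minimal sketch (illustration only) of Definitions 1, 2, and 5: checking
# positivity, maximality, and symmetry of a similarity on a small entity set,
# and deriving the normalized dissimilarity delta = 1 - sigma.
from itertools import product

def check_similarity_axioms(entities, sigma):
    """Verify positivity, maximality, and symmetry of sigma over `entities`."""
    pairs = list(product(entities, repeat=2))
    positivity = all(sigma(x, y) >= 0 for x, y in pairs)
    min_self = min(sigma(x, x) for x in entities)          # least self-similarity
    maximality = all(sigma(y, z) <= min_self for y, z in pairs)
    symmetry = all(sigma(x, y) == sigma(y, x) for x, y in pairs)
    return positivity, maximality, symmetry

# A hypothetical normalized similarity on short strings: the proportion of
# positions with matching characters, after padding to equal length.
def sigma(x, y):
    n = max(len(x), len(y))
    return sum(a == b for a, b in zip(x.ljust(n), y.ljust(n))) / n

delta = lambda x, y: 1.0 - sigma(x, y)  # normalized dissimilarity (Definition 5)

print(check_similarity_axioms(["cat", "car", "dog"], sigma))  # (True, True, True)
print(delta("cat", "car"))  # ~0.333
```
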

Similarity in Machine Learning

Similarity in machine learning is a concept for measuring the similarity between different data points or objects, often calculated based on the similarity of feature vectors or attributes, and used in different tasks and applications. Below we describe the concept of similarity in different cases and the general calculation methods.

Metrics for calculating vector similarity

First, we discuss the measures used to calculate the similarity of vectors, the most basic form of data (a short code sketch follows the list).

  • Cosine Similarity: Cosine Similarity evaluates the similarity between two vectors by calculating the cosine of the angle between them. The value ranges from [-1, 1], with closer to 1 indicating higher similarity.
  • Euclidean Distance: The Euclidean distance calculates the point-to-point distance between two vectors and is the square root of the sum of the squares of the differences between the elements of the vectors. The smaller the distance, the higher the similarity is considered.
  • Manhattan Distance: Manhattan Distance calculates the sum of the absolute values of the element-wise differences between two vectors; unlike the Euclidean distance, it measures distance along the coordinate axes (a grid path) rather than in a straight line.
  • Pearson Correlation Coefficient: The Pearson correlation coefficient evaluates the strength of the linear relationship between two vectors. Values range from [-1, 1].
  • Jaccard Index: The Jaccard Index evaluates similarity as the proportion of common elements between two sets. When vectors are binarized (zero vs. non-zero elements), the Jaccard index is the proportion of positions where both vectors are non-zero among the positions where at least one is non-zero.
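
As a concrete illustration of the measures above, here is a minimal sketch using NumPy and SciPy (both assumed to be available; the distance functions are from scipy.spatial.distance).

```python
# A minimal sketch of the vector similarity measures listed above, using NumPy
# and SciPy (assumed dependencies).
import numpy as np
from scipy.spatial.distance import cosine, euclidean, cityblock, jaccard

a = np.array([1.0, 0.0, 2.0, 3.0])
b = np.array([1.0, 1.0, 2.0, 1.0])

cos_sim = 1.0 - cosine(a, b)          # cosine similarity in [-1, 1]
eucl = euclidean(a, b)                # straight-line (L2) distance
manh = cityblock(a, b)                # Manhattan (L1) distance
pearson = np.corrcoef(a, b)[0, 1]     # Pearson correlation coefficient

# Jaccard on binarized vectors (non-zero -> True), as described above.
jac_sim = 1.0 - jaccard(a != 0, b != 0)

print(cos_sim, eucl, manh, pearson, jac_sim)
```
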
Metrics for calculating graph similarity

Next, we will discuss the metrics used to evaluate the similarity of graphs.

  • Graph Edit Distance: Measures the number of minimum edit operations (adding, deleting, or modifying nodes or edges) between two graphs; similar graphs have smaller edit distances.
  • Graph Kernel: Measures the similarity of graphs by mapping a graph to a higher-dimensional feature space and calculating the inner product in that space.
  • Graph Spectrum Similarity: Evaluates similarity by comparing eigenvalues and eigenvectors of a graph’s Laplacian matrix and adjacency matrix; if the eigenvalues are similar, the graphs are considered similar.
  • Subgraph Isomorphism: Determines similarity by checking whether one graph can be embedded as a subgraph of another; note that the subgraph isomorphism problem is NP-complete, so this check can be computationally expensive.
  • Graph Feature-based Distance Metrics: A method to extract graph features and evaluate similarity based on those features, e.g., node degree distribution, clustering coefficient, average shortest path length, etc.
  • Jaccard Index or Cosine Similarity: Methods that treat the nodes and edges of graphs as sets or vectors: the Jaccard index is computed from the proportion of common elements, while cosine similarity is computed from the angle between the vectors (a set-based sketch follows this list).
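
As a simple illustration of the set-based approach in the last item, the following minimal sketch computes the Jaccard index of the edge sets of two small graphs (a hypothetical plain-Python example; real systems would typically use a graph library such as NetworkX).

```python
# A minimal sketch of set-based graph similarity: the Jaccard index of the
# edge sets of two graphs. Edges are stored as frozensets so that undirected
# edges compare equal regardless of endpoint order.
def edge_jaccard(edges1, edges2):
    """Jaccard index of two undirected edge sets."""
    s1 = {frozenset(e) for e in edges1}
    s2 = {frozenset(e) for e in edges2}
    if not s1 and not s2:
        return 1.0  # two empty graphs are treated as identical
    return len(s1 & s2) / len(s1 | s2)

g1 = [("a", "b"), ("b", "c"), ("c", "d")]
g2 = [("b", "a"), ("b", "c"), ("c", "e")]
print(edge_jaccard(g1, g2))  # 2 common edges out of 4 in the union -> 0.5
```
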
Metrics for calculating matrix similarity

Several metrics and methods exist for calculating matrix similarity. Matrices are an important concept for representing multidimensional data, and evaluating the similarity between matrices is useful for understanding data relevance and feature agreement. The following is a description of the indices related to matrix similarity calculations.

  • Cosine Similarity: Matrices are considered as vectors, and similarity is evaluated by calculating the cosine of the angles between the vectors, where each column or row of a matrix is treated as an element of a vector. It is mainly used to compare feature matrices.
  • Pearson correlation coefficient: Evaluates the similarity by calculating the Pearson correlation coefficient between vectors, where each column of a matrix is treated as a vector, and the linear relationship between the columns of a matrix is evaluated.
  • Jaccard coefficient: Treats a matrix as a set of binary elements and calculates similarity by dividing the number of elements common to both sets by the number of elements in their union. It is mainly used to compare set-valued (binary) matrices.
  • Euclidean or Manhattan Distance: Treats each matrix as a flattened vector of its elements and evaluates similarity by calculating the distance between these vectors. It is mainly used to compare numerical matrices.
  • Similarity using Singular Value Decomposition (SVD): Evaluates similarity by performing SVD to map a matrix to a low-dimensional feature space and comparing singular values. It is mainly used to evaluate dimensionality reduction and the agreement of singular-value spectra (a sketch follows this list).
  • KL Divergence (Kullback-Leibler Divergence): For matrices representing probability distributions, KL divergence is calculated to evaluate differences in the distributions. It is mainly used to compare probability matrices.
  • Matrix Norm: A method to evaluate the distance or similarity of matrices by computing their norm, mainly used to evaluate the similarity with respect to the size and shape of matrices.
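
The following minimal sketch illustrates three of the matrix measures above with NumPy (assumed available): cosine similarity of flattened matrices, the Frobenius norm of the difference, and a comparison of singular-value spectra via SVD.

```python
# A minimal sketch of matrix similarity measures using NumPy.
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.0, 2.5], [2.5, 4.0]])

# Cosine similarity between matrices viewed as flattened vectors.
a, b = A.ravel(), B.ravel()
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Frobenius norm of the difference as a distance between matrices.
frob_dist = np.linalg.norm(A - B, ord="fro")

# Compare singular-value spectra (overall shape/energy of the matrices).
sv_dist = np.linalg.norm(np.linalg.svd(A, compute_uv=False)
                         - np.linalg.svd(B, compute_uv=False))

print(cos_sim, frob_dist, sv_dist)
```
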
Metrics for calculating similarity between probability distributions
  • Kullback-Leibler divergence (KL divergence): The KL divergence measures the information lost when one probability distribution is used to approximate another. Note, however, that the KL divergence is asymmetric, so it does not satisfy the symmetry property required of a (dis)similarity.
  • Jensen-Shannon divergence: To overcome the asymmetry of the KL divergence, the Jensen-Shannon divergence averages the KL divergences of the two distributions to their mixture (midpoint) distribution; it is symmetric and bounded.
  • Earth Mover’s Distance (EMD): EMD is a measure of the distance between two probability distributions. It is defined via the problem of minimizing the cost of transforming one distribution into the other, and indicates the minimum cost of transporting probability mass from one distribution to the other.
  • Total Variation Distance (TV distance): The TV distance measures the difference between two probability distributions as (half) the sum of the absolute differences of the probability masses at each point (a sketch of these measures follows this list).
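
As a minimal sketch of these measures, the following uses SciPy (assumed available) for KL divergence, Jensen-Shannon divergence, and 1-D Wasserstein distance (EMD), and NumPy for total variation distance; the two distributions are hypothetical examples.

```python
# A minimal sketch of distribution-comparison measures with SciPy and NumPy.
import numpy as np
from scipy.stats import entropy, wasserstein_distance
from scipy.spatial.distance import jensenshannon

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])

kl_pq = entropy(p, q)          # KL(p || q); note KL(q || p) generally differs
js = jensenshannon(p, q) ** 2  # jensenshannon returns the square root of the JS divergence
emd = wasserstein_distance([0, 1, 2], [0, 1, 2], p, q)  # 1-D EMD on support {0, 1, 2}
tv = 0.5 * np.abs(p - q).sum()  # total variation distance

print(kl_pq, js, emd, tv)
```
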
Metrics for calculating class and instance similarity

In order to calculate the similarity of classes and instances, it is necessary to select the appropriate index and method for the subject. The following are some common cases of indices for calculating class and instance similarity.

  • Euclidean or Manhattan Distance: represents a class or instance as a vector and calculates the distance between the vectors. It is used when comparing numerical data.
  • Cosine similarity: Treats a class or instance as a feature vector and calculates the cosine of the angle between the vectors. It is mainly used to compare vector directions.
  • Jaccard coefficient: Treats classes and instances as a set, and calculates similarity by dividing the number of common elements in the set by the number of elements in the whole set. It is mainly used to evaluate the degree of congruence of categories and tags.
  • Index considering hierarchical relationships among classes: When classes are hierarchically related, there are methods that calculate similarity while taking the class hierarchy into account, such as the information-theoretic measures of Resnik and Lin.
  • String edit distance (Levenshtein distance or Damerau-Levenshtein distance): When instances are represented as text or strings, similarity is evaluated by counting the edit operations needed to transform one string into another (a sketch follows this list).
  • Graph Similarity: If the instance is represented by a graph, there are methods to evaluate node and edge congruence, such as Graph Edit Distance and Subgraph Isomorphism.
  • Embedded Representation using Deep Learning: Sometimes a deep learning model is used to treat a class or instance as an embedded representation to compute similarity.
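
As a sketch of the string edit distance mentioned above, the following is the standard dynamic-programming formulation of the Levenshtein distance (an illustrative implementation, not taken from the source).

```python
# A minimal sketch of the Levenshtein distance: the minimum number of
# insertions, deletions, and substitutions needed to turn one string
# into the other, computed row by row.
def levenshtein(s: str, t: str) -> int:
    # prev[j] holds the distance between s[:i-1] and t[:j] from the previous row.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (cs != ct),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```
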
Metrics for calculating text similarity

Various metrics and algorithms exist to calculate textual similarity. Textual similarity is an important concept for evaluating the degree of agreement in meaning and content between texts. Some common text similarity metrics and algorithms are described below.

  • Cosine Similarity: Text documents are represented as vector space models, and similarity is evaluated by calculating the cosine of the angle between the vectors. The vector space model represents a document as a vector with each word as a dimension, and the Cosine similarity is the normalized inner product of the vectors, which takes a value between 0 and 1.
  • Jaccard Similarity: Treats text as a set of words or characters and evaluates similarity by dividing the size of the common part of the set by the size of the whole. It is primarily used to evaluate document overlap.
  • Levenshtein Distance: Calculates the minimum number of editing operations (insert, delete, replace) required between two text strings. The smaller this value is, the higher the similarity between the strings.
  • TF-IDF (Term Frequency-Inverse Document Frequency): This weighting combines the frequency of a word within a document with its rarity across the whole document collection. The TF-IDF vectors of documents can then be compared to calculate similarity (see the sketch after this list).
  • Word embedding, such as Word2Vec or BERT: This method represents words as multidimensional vectors. This allows the semantic relationship between words to be captured and the similarity of documents to be calculated from the word level. For more information on BERT, see “BERT Overview, Algorithms, and Example Implementations“.
  • Similarity calculations for Word2Vec and BERT vectors using Cosine similarity: Cosine similarity between vectors obtained using word embedding can be calculated to obtain the similarity of the entire document.
  • Thesaurus and ontology-based methods: Evaluate the semantic similarity of documents using information such as synonyms and superordinate/subordinate (hypernym/hyponym) relations.
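
As a sketch of the TF-IDF approach above, the following uses scikit-learn (assumed available) to vectorize a few hypothetical documents and compute their pairwise cosine similarities.

```python
# A minimal sketch of TF-IDF document similarity with scikit-learn:
# vectorize documents, then compute pairwise cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "machine learning measures similarity between objects",
    "similarity between documents is measured with tf-idf",
    "completely unrelated sentence about cooking pasta",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # one row per document
sims = cosine_similarity(tfidf)                # 3 x 3 similarity matrix

print(sims.round(2))  # diagonal is 1.0; docs 0 and 1 are more similar than 0 and 2
```
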
Metrics for calculating image similarity

Various metrics and algorithms also exist to calculate image similarity. The evaluation of image similarity is an important task for assessing the degree of agreement between image features and content. Some common image similarity metrics and algorithms are described below.

  • Euclidean and Manhattan distances: Similarity is evaluated by treating images as vectors of pixel-by-pixel values and calculating the distance between the vectors. Since these distances are calculated as the difference between pixel values, they evaluate similarity by pixel brightness or color.
  • Cosine Similarity: Evaluates similarity by treating images as feature vectors and calculating the cosine of the angle between the vectors. It is mainly used to evaluate the degree of agreement between image features.
  • Histogram Comparison: This method represents the color information of an image as a histogram and evaluates the similarity of the histograms. The similarity of images is evaluated by comparing the distribution of colors.
  • Structural Similarity Index (SSIM): Evaluates the similarity of images by considering factors such as luminance, contrast, and structure. Because it correlates well with human visual perception, it is also used to assess image quality (a small sketch follows this list).
  • Similarity Calculation Using Feature Extraction Models: Uses neural network-based feature extraction models (e.g., VGG16, ResNet as described in “About ResNet (Residual Network)“, EfficientNet as described in “About EfficientNet“, etc.) to extract image features and calculate similarity based on those features. In particular, transfer learning can provide high performance.
  • Siamese Network or Triplet Network: Evaluates image similarity using a network that learns combinations of two or three images. Used primarily as a supervised learning approach.
  • Embedded spaces for computer vision tasks: models such as FaceNet for face recognition or YOLO for object detection, for example, provide embedded representations of images, which can then be used to compute similarity.
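
The following minimal sketch illustrates pixel-based distances and SSIM; NumPy and scikit-image are assumed dependencies, and the images are randomly generated stand-ins for real data.

```python
# A minimal sketch of simple image similarity measures: pixel-wise Euclidean
# distance and cosine similarity with NumPy, plus SSIM via scikit-image.
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
img1 = rng.random((64, 64))  # synthetic grayscale image in [0, 1]
img2 = np.clip(img1 + 0.05 * rng.standard_normal((64, 64)), 0, 1)  # noisy copy

v1, v2 = img1.ravel(), img2.ravel()
eucl = np.linalg.norm(v1 - v2)                                 # pixel-wise distance
cos_sim = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))  # angle of pixel vectors
ssim = structural_similarity(img1, img2, data_range=1.0)       # closer to 1 = more similar

print(eucl, cos_sim, ssim)
```
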
Metrics for calculating voice similarity

The evaluation of speech similarity is an important task for assessing the degree of agreement in acoustic characteristics and waveform shape. The following describes some common speech similarity metrics and approaches.

  • Dynamic Time Warping (DTW): DTW is a method for finding the best alignment between two time series data (speech waveforms). It evaluates the similarity by taking into account differences in speech velocity and timing.
  • Mel-Frequency Cepstral Coefficients (MFCC): MFCC is a method for representing speech as a feature vector and is designed around human auditory perception. Speech spectrograms can be converted to MFCCs, and the distance or similarity between these feature vectors can then be calculated (a sketch combining MFCC and DTW follows this list).
  • Cosine Similarity: Evaluates the similarity by representing speech as feature vectors (e.g., MFCC) and calculating the cosine of the angle between the vectors. It is mainly used to evaluate the degree of agreement between speech features.
  • Spectrogram Cross-Correlation: Evaluates the similarity of speech by calculating spectrograms and their cross-correlation. Spectrograms contain information on frequency components and help to capture similar acoustic patterns.
  • Embedded representations using Deep Learning: Neural network models for handling speech can be used to extract embedded representations of speech. These embedded representations can be used to compute the similarity of speech.
  • Models using WaveNet and Transformer: Speech can be converted into feature representations using speech synthesis and speech recognition models, and these representations can then be used to evaluate similarity. See “Overview of Transformer Models, Algorithms, and Examples of Implementations“ for details.
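
As a sketch combining the MFCC and DTW items above, the following uses librosa (an assumed dependency); the two signals are synthetic sine tones standing in for real recordings.

```python
# A minimal sketch of speech similarity via MFCC features and DTW alignment,
# using librosa on two synthetic tones.
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y1 = np.sin(2 * np.pi * 440 * t).astype(np.float32)  # 440 Hz tone
y2 = np.sin(2 * np.pi * 466 * t).astype(np.float32)  # slightly different tone

# MFCC feature matrices, shape (n_mfcc, n_frames).
m1 = librosa.feature.mfcc(y=y1, sr=sr, n_mfcc=13)
m2 = librosa.feature.mfcc(y=y2, sr=sr, n_mfcc=13)

# DTW finds the best alignment; the accumulated cost acts as a dissimilarity.
D, wp = librosa.sequence.dtw(X=m1, Y=m2)
print("DTW alignment cost:", D[-1, -1])
```
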
Metrics for calculating similarity of sensor data in time series such as IoT

Similarity assessment of time-series data, such as IoT, is an important task for evaluating the consistency of data patterns and trends. Below we describe some common time-series data similarity indices and methods.

  • Dice coefficient or Jaccard coefficient: These indices treat two time series as sets and calculate the proportion of common elements. They are used to evaluate the degree of agreement between sensor data.
  • Dynamic Time Warping (DTW): A method for finding the best alignment between time series. It evaluates similarity while accounting for time warping between sensor data samples (a sketch follows this list).
  • Cross-correlation coefficient: This method evaluates the similarity of sensor data by calculating their cross-correlation. The similarity is evaluated by examining the correlation between the data.
  • Wavelet Transform: This method uses the wavelet transform to decompose time series data in the frequency domain and evaluates the similarity of each frequency component. It is possible to evaluate the degree of agreement in different frequency bands.
  • Feature-based approach: This approach extracts features from time-series data and calculates similarity using these features. For example, statistical features and Fourier transform features can be considered.
  • Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN): Neural networks can learn feature representations of time-series data, which can then be used to evaluate similarity. CNNs, described in “Overview of CNN and examples of algorithms and implementations“, are mainly suited to extracting local features, while RNNs, described in “Overview of RNN and examples of algorithms and implementations“, are suited to capturing patterns across the series.
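
As a sketch of DTW for sensor data, the following is the standard dynamic program implemented directly with NumPy (an illustrative implementation, not a specific library's API).

```python
# A minimal sketch of dynamic time warping (DTW) between two 1-D sensor
# series, using an absolute-difference cost and the classic DP recurrence.
import numpy as np

def dtw_distance(x, y):
    """Accumulated DTW cost between 1-D sequences x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of: match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 2.0, 1.0])  # same pattern, shifted in time
print(dtw_distance(a, b))  # small cost despite the time shift (here 0.0)
```
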
Libraries and platforms used to calculate similarity

The libraries and platforms used to calculate similarity vary depending on the type of data and application. Some common libraries and platforms are described below.

  • Python libraries:
    • NumPy: A library for numerical computations, allowing efficient matrix manipulation and vector operations.
    • SciPy: An extended library for scientific and technical computing, which implements various similarity calculation algorithms.
    • scikit-learn: A machine learning library that includes similarity calculations related to tasks such as clustering and dimensionality reduction.
    • pandas: A library for data analysis, capable of handling matrices as data frames.
  • MATLAB: MATLAB is a platform widely used for scientific and technical computing, with extensive libraries for matrix operations and signal processing.
  • R: R is a language used for statistical analysis and data mining, and provides packages for similarity computation.
  • Deep learning frameworks: Deep learning frameworks (TensorFlow, PyTorch, etc.) are used as tools for computing similarity through embedded representations and feature extraction.
  • Search engines and databases: Search engines such as Elasticsearch and Solr, or database systems may also have built-in similarity computation capabilities.
  • Dedicated libraries and tools: Libraries and tools dedicated to specific similarity calculations also exist. Examples include Kaldi for speech recognition and OpenCV for image similarity.
  • Online platforms: Cloud platforms such as Google Cloud and Microsoft Azure also provide APIs that perform similarity calculations.
Application Examples of Similarity Calculation

Similarity calculations have a wide range of applications in various fields. Some of the applications are described below.

  • Information Retrieval and Recommendation Systems: Search engines and recommendation systems use similarity computation to find content and items that are similar to a user’s query or past behavior.
  • Image Recognition and Computer Vision: Image features are extracted and used to calculate image similarity for tasks such as image retrieval and object recognition.
  • Speech Recognition and Music Analysis: Similarity calculations are used to analyze speech and music waveform data to find similar acoustic patterns and songs, and are used to suggest similar songs in music and for voice search.
  • Natural Language Processing: Similarity calculations on textual data are used to evaluate the similarity of sentences, cluster documents, and analyze meaning.
  • Clustering and classification: Similarity calculations are used in clustering and classification algorithms that group similar data using feature vectors of the data.
  • Time-series data analysis: Similarity calculations on time-series data, such as sensor data or financial data, are used to assess the consistency of trends and patterns.
  • Bioinformatics: Similarity calculations for bio-data such as DNA, RNA, and protein sequences are used for genetic analysis and evaluation of biological relatedness.
  • Network Analysis: Similarity calculations for nodes and edges of graphs and networks are used to evaluate the similarity of network structures.
  • Anomaly Detection: Data similarity calculations are used to detect anomalies in data based on patterns in normal data.
Reference Information and Reference Books

Data matching is also discussed in “Ontology Matching Techniques” and related articles; please refer to those as well. For concrete implementations, see “Machine Learning Technology” and “Artificial Intelligence Technology”.

Reference books include “Similarity-Based Pattern Recognition”, “Machine Learning and Data Mining in Pattern Recognition”, “Similarity-Based Pattern Analysis and Recognition”, and “Ontology Matching”.
