An overview of natural language similarities that are important for DX, AI, and ML


In tasks that handle natural language data, the similarity between pieces of data is often evaluated and used. From the viewpoint of ontology matching, this article summarizes chapters 5 through 7 of the text on ontology matching techniques, which describe similarity at the level of individual instances and of collections of graph data.

First, the definition of similarity is as follows.

Definition 1 (Similarity)
A similarity σ : o × o → R is a function from a pair of entities
to a real number expressing the similarity between two objects,
such that
∀x, y ∈ o, σ(x, y) ≥ 0 (positivity)
∀x ∈ o, ∀y, z ∈ o, σ(x, x) ≥ σ(y, z) (maximality)
∀x, y ∈ o, σ(x, y) = σ(y, x) (symmetry)
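
The three conditions of Definition 1 can be checked mechanically on a finite set of entities. The following is a minimal sketch in Python; the helper `is_similarity` and the toy measure `shared_letters` (the Jaccard overlap of the letter sets of two labels) are hypothetical names introduced only for illustration and are not taken from the source text.

```python
from itertools import product


def is_similarity(entities, sigma):
    """Check positivity, maximality and symmetry of sigma on a finite set."""
    pairs = list(product(entities, repeat=2))
    positivity = all(sigma(x, y) >= 0 for x, y in pairs)
    # Maximality: every self-similarity is at least as large as any other value.
    largest = max(sigma(y, z) for y, z in pairs)
    maximality = all(sigma(x, x) >= largest for x in entities)
    symmetry = all(sigma(x, y) == sigma(y, x) for x, y in pairs)
    return positivity and maximality and symmetry


def shared_letters(x, y):
    """Hypothetical toy similarity: Jaccard overlap of the letter sets.

    Assumes both labels are non-empty.
    """
    a, b = set(x), set(y)
    return len(a & b) / len(a | b)


labels = ["article", "paper", "book"]
print(is_similarity(labels, shared_letters))                     # True
print(is_similarity(labels, lambda x, y: len(x) - len(y)))       # False
```

The second call fails because a difference of lengths can be negative and is not symmetric, so it is not a similarity in the sense of Definition 1.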

A similarity function σ is therefore non-negative, the similarity of an entity with itself is maximal, and the similarity between x and y is the same as the similarity between y and x. Dissimilarity, the counterpart of similarity, can now be defined as follows.

Definition 2 (Dissimilarity)
Given a set o of entities, a dissimilarity δ : o × o → R is a function
from a pair of entities to a real number such that
∀x, y ∈ o, δ(x, y) ≥ 0 (positivity)
∀x ∈ o, δ(x, x) = 0 (minimality)
∀x, y ∈ o, δ(x, y) = δ(y, x) (symmetry)

A dissimilarity function has the same positivity and symmetry as a similarity function, but the dissimilarity of an entity with itself is minimal (zero).
Some authors consider “asymmetric (dis)similarity” (Tversky 1977), which does not require symmetry. In that case, the term asymmetric measure or pre-similarity is used. Notions that are more restrictive than dissimilarity also exist, namely distances and ultrametrics.
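
To make the idea of an asymmetric measure concrete, here is a small sketch of the ratio model usually associated with Tversky (1977), applied to the letter sets of two labels. The function name `tversky`, the choice of letter sets as features, and the default weights are assumptions made for this illustration only.

```python
def tversky(x, y, alpha=0.8, beta=0.2):
    """Tversky-style ratio model on letter sets (illustrative sketch).

    alpha weights features found only in x, beta features found only in y.
    Assumes at least one of the two labels is non-empty.
    """
    a, b = set(x), set(y)
    common = len(a & b)
    return common / (common + alpha * len(a - b) + beta * len(b - a))


short, long_ = "peer", "peer-reviewed"
print(tversky(short, long_))   # differs from the value below when alpha != beta
print(tversky(long_, short))
```

With alpha = beta = 1 the measure reduces to the symmetric Jaccard overlap, so the asymmetry comes entirely from weighting the two set differences unequally.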

In addition, the “equivalence relation” covered in the basics of computer mathematics is a relation ~ such that (1) a ~ a (reflexive law), (2) if a ~ b then b ~ a (symmetric law), and (3) if a ~ b and b ~ c then a ~ c (transitive law). Compared with the definition of similarity, two entities can be interpreted as equivalent (identical) if the transitive law (3) also holds, and as merely similar if only the reflexive law (1) and the symmetric law (2) hold.
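
The difference between "similar" and "equivalent" can be seen on a small example. The sketch below checks the three laws on a finite set; the relations `close` and `same_parity` and the helper names are hypothetical and chosen only to illustrate the point.

```python
from itertools import product


def is_reflexive(items, rel):
    return all(rel(a, a) for a in items)


def is_symmetric(items, rel):
    return all(rel(b, a) for a, b in product(items, repeat=2) if rel(a, b))


def is_transitive(items, rel):
    return all(
        rel(a, c)
        for a, b, c in product(items, repeat=3)
        if rel(a, b) and rel(b, c)
    )


def close(a, b):
    """Toy 'similar' relation: numbers that differ by at most 1."""
    return abs(a - b) <= 1


def same_parity(a, b):
    """Toy equivalence relation: both even or both odd."""
    return a % 2 == b % 2


items = [1, 2, 3]
for rel in (close, same_parity):
    print(rel.__name__,
          is_reflexive(items, rel),
          is_symmetric(items, rel),
          is_transitive(items, rel))
```

Here `close` is reflexive and symmetric but not transitive (1 is close to 2 and 2 is close to 3, yet 1 is not close to 3), whereas `same_parity` satisfies all three laws.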

Definition 3 (Distance)
A distance (or metric) δ : o × o → R is a dissimilarity function
satisfying definiteness and the triangular inequality:
∀x, y ∈ o, δ(x, y) = 0 if and only if x = y (definiteness)
∀x, y, z ∈ o, δ(x, y) + δ(y, z) ≥ δ(x, z) (triangular inequality)

That is, δ is zero exactly when x = y, and the sum of the distance from x to y and the distance from y to z is greater than or equal to the distance from x to z.
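
As an illustration, the conditions of Definitions 2 and 3 can be tested on a finite set of points. This is only a sketch: `is_distance` is a hypothetical helper, and the two candidate functions (absolute difference and squared difference of integers) were picked because one satisfies the triangular inequality and the other does not.

```python
from itertools import product


def is_distance(entities, delta):
    """Check the dissimilarity axioms plus definiteness and the triangular inequality."""
    pairs = list(product(entities, repeat=2))
    positivity = all(delta(x, y) >= 0 for x, y in pairs)
    symmetry = all(delta(x, y) == delta(y, x) for x, y in pairs)
    # Definiteness (which also implies minimality): zero exactly on identical entities.
    definiteness = all((delta(x, y) == 0) == (x == y) for x, y in pairs)
    triangle = all(
        delta(x, y) + delta(y, z) >= delta(x, z)
        for x, y, z in product(entities, repeat=3)
    )
    return positivity and symmetry and definiteness and triangle


points = [0, 1, 2]
print(is_distance(points, lambda x, y: abs(x - y)))      # True
print(is_distance(points, lambda x, y: (x - y) ** 2))    # False
```

The squared difference is still a dissimilarity, but it is not a distance: going from 0 to 2 costs 4, while the detour through 1 costs only 1 + 1 = 2.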

Definition 4 (Ultrametric)
Given a set o of entities, an ultrametric is a metric such that
∀x, y, z ∈ o, δ(x, y) ≤ max(δ(x, z), δ(y, z)) (ultrametric inequality)
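
The inequality of Definition 4 is stronger than the triangular inequality, which a short check makes visible. In this sketch the helper name and the two example functions are assumptions for illustration: the discrete metric (0 for identical entities, 1 otherwise) satisfies the inequality, while the absolute difference of numbers does not.

```python
from itertools import product


def satisfies_ultrametric_inequality(entities, delta):
    """Check delta(x, y) <= max(delta(x, z), delta(y, z)) for all triples."""
    return all(
        delta(x, y) <= max(delta(x, z), delta(y, z))
        for x, y, z in product(entities, repeat=3)
    )


def discrete(x, y):
    """Discrete metric: 0 for identical entities, 1 otherwise."""
    return 0 if x == y else 1


def absolute_difference(x, y):
    return abs(x - y)


points = [0, 1, 2]
print(satisfies_ultrametric_inequality(points, discrete))             # True
print(satisfies_ultrametric_inequality(points, absolute_difference))  # False
```

For the absolute difference, the triple x = 0, y = 2, z = 1 gives a distance of 2 between x and y, but both distances to z are only 1.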

Especially when the similarity of different types of entities has to be compared, the measurements need to be normalized. A common method of normalization is to reduce each value to the same measure in proportion to the size of the space under consideration.

Definition 5 (Normalized (dis)similarity)
A (dis)similarity is said to be normalized if it ranges over
the unit interval of real numbers [0, 1].
In what follows, σ (δ) denotes the normalized (dis)similarity.

The normalized similarity σ corresponds to the normalized dissimilarity δ = 1-σ, and vice versa. In the following, we will assume that most of the measurements are normalized and that the dissimilarity function between two entities returns a real number between 0 and 1.
There are two ways to normalize. One would be to (i) use the maximum possible value and the other would be to (ii) use the actual maximum value. Since the possible maximums do not always exist, all future normalizations will be based on the actual maximums.
From the above definitions, we can see that similarity and dissimilarity are total functions that map every pair of entities to a real number. For a finite set of entities, such a function can alternatively be represented as a matrix. Matrices have the advantage of being finite data structures that can be exchanged between programs.
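
The matrix representation and the normalization by the actual maximum can be sketched as follows. The function names and the toy dissimilarity (difference in label length) are hypothetical and serve only to show the mechanics; any dissimilarity function could be substituted.

```python
def dissimilarity_matrix(entities, delta):
    """Tabulate delta over all pairs of a finite list of entities."""
    return [[delta(x, y) for y in entities] for x in entities]


def normalize_by_actual_max(matrix):
    """Divide every value by the largest value actually observed."""
    largest = max(max(row) for row in matrix)
    if largest == 0:
        # All entities are indistinguishable; nothing to rescale.
        return [row[:] for row in matrix]
    return [[value / largest for value in row] for row in matrix]


def to_similarity(normalized):
    """Convert a normalized dissimilarity matrix with sigma = 1 - delta."""
    return [[1 - value for value in row] for row in normalized]


labels = ["article", "paper", "book"]
raw = dissimilarity_matrix(labels, lambda s, t: abs(len(s) - len(t)))
normalized = normalize_by_actual_max(raw)
similarity = to_similarity(normalized)

print(raw)   # [[0, 2, 3], [2, 0, 1], [3, 1, 0]]
for row in similarity:
    print([round(value, 2) for value in row])
```

Because the scaling uses the largest observed value, the most dissimilar pair always receives a normalized dissimilarity of exactly 1, whatever the original scale was.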

Name-based techniques

Name-based (terminological) methods compare strings. They can be applied to the names, labels, or comments of entities to find those that are similar, and can also be used to compare class names or URIs.
In this section, S denotes the set of strings, i.e., the sequences of characters of arbitrary length over an alphabet L: S = L∗. The empty string is denoted ε, and ∀s, t ∈ S, s + t is the concatenation of the strings s and t. |s| denotes the length of s, and s[i] for i ∈ [1, |s|] denotes the character at the i-th position of s.

Example 5.6 (String) The string ‘article’ consists of the letters a, r, t, i, c, l, and e, and has a length of 7 characters. ‘peer-reviewed’ and ‘ ’ are two other strings (‘-’ and ‘ ’ are also characters of the alphabet). Their concatenation ‘peer-reviewed’ + ‘ ’ + ‘article’ is the string ‘peer-reviewed article’, which has a length of 21.
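
The notation maps directly onto Python strings, with one caveat: the text indexes characters from 1, whereas Python indexes from 0. The sketch below reproduces Example 5.6; `EMPTY` and `char_at` are hypothetical names introduced here to mirror ε and s[i].

```python
EMPTY = ""  # plays the role of the empty string ε


def char_at(s, i):
    """Return the character at 1-based position i, mirroring s[i] for i in [1, |s|]."""
    if not 1 <= i <= len(s):
        raise IndexError(f"position {i} is outside [1, {len(s)}]")
    return s[i - 1]


label = "peer-reviewed" + " " + "article"  # concatenation s + t
print(label)                  # peer-reviewed article
print(len("article"))         # 7
print(len(label))             # 21
print(char_at("article", 1))  # a
print(EMPTY + "article" == "article")  # True
```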

A string s is a substring of another string t if there exist two strings s′ and s″ such that s′ + s + s″ = t (denoted by s ∈ t). Two strings are equal (s = t) if and only if s ∈ t and t ∈ s. The number of occurrences of s in t (denoted by s#t) is the number of distinct pairs s′, s″ such that s′ + s + s″ = t.

Example 5.7 (Substring) The string ‘peer-reviewed article’ has the string ‘review’ as a substring because ‘peer-’ + ‘review’ + ‘ed article’ = ‘peer-reviewed article’. The string ‘homonymous’ contains three occurrences of the string ‘o’, two occurrences of the string ‘mo’, and one occurrence of the string ‘nym’.
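
The substring test and the occurrence count s#t can be sketched in a few lines. The function names are hypothetical. Note that the definition counts every position at which s starts in t, including overlapping ones, so the sketch scans all start positions instead of relying on Python's built-in `str.count`, which counts non-overlapping occurrences only.

```python
def is_substring(s, t):
    """s ∈ t: there exist s′ and s″ with s′ + s + s″ = t."""
    return s in t


def occurrences(s, t):
    """s#t: number of distinct pairs (s′, s″) such that s′ + s + s″ = t."""
    return sum(1 for i in range(len(t) - len(s) + 1) if t.startswith(s, i))


print(is_substring("review", "peer-reviewed article"))  # True
print(occurrences("o", "homonymous"))    # 3
print(occurrences("mo", "homonymous"))   # 2
print(occurrences("nym", "homonymous"))  # 1
print(occurrences("aa", "aaa"), "aaa".count("aa"))  # 2 1
```

The last line shows the difference between the two counting conventions on overlapping matches.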

The main problem in comparing entities in an ontology based on their labels is caused by the presence of synonyms and homonyms.
Synonyms are different words that are used to name the same entity. For example, Article and Paper are synonyms in some contexts.
Homonyms are words that are used to name different entities. For example, peer as a noun can mean “equal” or “member of the nobility”. The fact that one word can have multiple meanings is also known as polysemy.

Therefore, it cannot be inferred with certainty that two entities are the same because they have the same name, or different because they have different names. Synonymy and homonymy are not the only reasons for this. In particular:

– Words from different languages, such as English, French, Italian, Spanish, German, and Greek, are used to name the same entity. For example, Book in English is Livre in French and kniga in Russian.
– Syntactic variations of the same word can occur because of different acceptable spellings, abbreviations, and the use of optional prefixes and suffixes. For example, Compact disc, CD, C.D. and CD-ROM are considered equivalent in some contexts. However, in other contexts, CD may mean Corps diplomatique, or the change directory command.

Such variations can occur within a single ontology, but they occur even more frequently between ontologies. Nevertheless, the way things are named is very important in day-to-day communication, and names are good indicators of similarity and dissimilarity. Furthermore, various methods have been devised to evaluate the similarity of two terms, regardless of whether the strings representing them are similar or dissimilar.
There are two main types of methods: those that consider only the string, and those that use some linguistic knowledge to interpret the string.

In the next article, we will discuss string-based similarity.
