Strategies in similarity matching methods (6) Alignment extraction approach


Continuing the series on ontology matching: the previous article discussed approaches to tuning the alignment, and this article discusses the alignment extraction approach.

Extraction of Alignments
The goal of matching is to identify a sufficient set of correspondences between ontologies. A (dis)similarity measure between the entities of both ontologies yields a large set of candidate correspondences, and the ones that will be part of the resulting alignment must be extracted on the basis of their similarity. This can be accomplished by a dedicated extraction method acting on the similarity matrix or on already extracted prior alignments. Here we distinguish between the extractor proper, which converts the (dis)similarity matrix into an alignment, and filters, which reduce the candidate correspondences in either of these forms (similarity matrix or alignment). This is depicted in Figure 7.14.

The similarity filter transforms the (dis)similarity matrix, for example by setting cells below a certain threshold to zero or cells above it to one. The alignment extractor generates an alignment from the similarity matrix; this is the main topic of this section. The alignment filter can further manipulate the resulting alignment using the same kinds of operations as the similarity filter.
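As a rough illustration of this pipeline, not taken from any particular system, the three stages could be composed as follows; all function names and the toy similarity values are my own.

```python
# Hypothetical sketch of the similarity filter / extractor / alignment filter pipeline.
# All names and the toy similarity values are illustrative.
from typing import Dict, List, Tuple

Sim = Dict[Tuple[str, str], float]          # (entity, entity') -> similarity in [0, 1]
Alignment = List[Tuple[str, str, float]]    # extracted correspondences with their scores

def similarity_filter(sim: Sim, threshold: float) -> Sim:
    """One possible similarity filter: set cells below the threshold to zero."""
    return {pair: (v if v >= threshold else 0.0) for pair, v in sim.items()}

def extractor(sim: Sim) -> Alignment:
    """A trivial extractor: turn every remaining non-zero cell into a correspondence."""
    return [(e, e2, v) for (e, e2), v in sim.items() if v > 0.0]

def alignment_filter(al: Alignment, threshold: float) -> Alignment:
    """An alignment filter: prune the alignment, here again by a simple threshold."""
    return [c for c in al if c[2] >= threshold]

sim: Sim = {("Provider", "Publisher"): 0.75, ("Creator", "Writer"): 0.69,
            ("Product", "Book"): 0.25, ("Product", "Writer"): 0.05}
print(alignment_filter(extractor(similarity_filter(sim, 0.2)), 0.5))
```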
The user can also act as an alignment filter: the entity pairs are displayed together with their similarity scores and ranks, and the selection of the appropriate pairs is left to the user. This user input can be taken as the definitive result in a support environment, as the definition of anchors that help the system, or as relevance feedback in a learning algorithm.
Going a step further, it is also possible to define an algorithm to automate the extraction of alignments from similarity scores. A variety of strategies can be applied to this task, depending on the characteristics of the alignment in question.
This problem can be defined as follows.

Definition 7.25 (Alignment extraction problem)
Given two sets of entities o and o′ and a similarity function σ : o × o′ → [0, 1], extract an alignment A ⊆ o × o′.

This problem statement is under-constrained, because o × o′ itself is a solution to it. Therefore, in this section we consider ways to further constrain the alignment extraction problem. As one guideline for doing so, we introduced the totality and injectivity constraints on alignments in the previous section.
We present two main strategies, based on trimming the (dis)similarity at a certain threshold and on computing an alignment that optimises the overall (dis)similarity. Along the way, we also introduce a kind of filter that has proved useful in matching algorithms.

Thresholds
When neither ontology needs to be completely covered by the alignment, threshold-based filtering keeps only the most similar entity pairs. If there are no injectivity constraints, pairs with scores above the threshold represent sensible alignments.

Applying a threshold therefore requires that the extracted correspondences be of sufficient quality. The simplest method is to select correspondences above a certain threshold n. Several such methods have been described in the literature (Do and Rahm 2002; Ehrig and Sure 2004); a small sketch of these rules is given after the following list.

  • Hard threshold: retains all correspondences whose similarity is above the threshold n.
  • Delta threshold: uses as threshold the highest similarity value from which a constant n has been subtracted.
  • Gap threshold: keeps the correspondences, ordered by decreasing similarity, until the difference between two consecutive correspondences becomes larger than n.
  • Proportional threshold: uses a percentage n of the highest similarity value as the threshold.
  • Percentage: retains the top n% of correspondences.
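As a minimal sketch, assuming a plain list of similarity scores rather than full correspondences, the five rules above could be written as follows (the list of values is illustrative, not the table of the example below).

```python
# Minimal sketch of the five threshold rules above, over a plain list of similarity
# scores; the score values are illustrative.
def hard(scores, n):
    return [s for s in scores if s >= n]

def delta(scores, d):
    return hard(scores, max(scores) - d)

def gap(scores, g):
    kept, ranked = [], sorted(scores, reverse=True)
    for current, following in zip(ranked, ranked[1:] + [0.0]):
        kept.append(current)
        if current - following > g:   # stop at the first large drop between consecutive scores
            break
    return kept

def proportional(scores, p):
    return hard(scores, max(scores) * p)

def percentage(scores, p):
    ranked = sorted(scores, reverse=True)
    return ranked[:round(len(ranked) * p)]

scores = [0.75, 0.69, 0.25, 0.23, 0.17, 0.05]
for rule in (hard, delta, gap, proportional, percentage):
    print(rule.__name__, rule(scores, 0.23))
```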

The Rondo system provides its own alignment extraction method, SelectThreshold, which normalises the similarity of each node by the best similarity involving that node (after which the result is no longer symmetric). A pair of nodes is then selected for the alignment when its normalised similarity is greater than or equal to a given threshold in both directions.
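A rough sketch of this normalisation, under my reading of the description above (not actual Rondo code, and with illustrative similarity values):

```python
# Rough sketch of the SelectThreshold idea as described above (my own reading).
def select_threshold(sim, threshold):
    """sim: dict (x, y) -> similarity. Keep pairs whose similarity, normalised by the best
    similarity of x and by the best similarity of y, reaches the threshold on both sides."""
    best_left, best_right = {}, {}
    for (x, y), v in sim.items():
        best_left[x] = max(best_left.get(x, 0.0), v)
        best_right[y] = max(best_right.get(y, 0.0), v)
    return [(x, y) for (x, y), v in sim.items()
            if best_left[x] > 0.0 and best_right[y] > 0.0
            and v / best_left[x] >= threshold and v / best_right[y] >= threshold]

sim = {("Provider", "Publisher"): 0.75, ("Creator", "Writer"): 0.69,
       ("Creator", "Publisher"): 0.30, ("Product", "Book"): 0.25}
print(select_threshold(sim, 0.5))
```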

Example (thresholds). Start from the weighted sum distance with weights 1/4 and 3/4 obtained in the earlier example on multidimensional distances and weighted sums.

This distance is converted to a similarity as shown in the following table.

  • With a hard threshold of 0.23, ⟨Provider, Publisher, =⟩, ⟨Creator, Writer, =⟩, and ⟨Creator, Translator, =⟩ are selected as correspondences.
  • With a delta threshold of the same value 0.23, only the first two are selected, since the corresponding hard threshold is 0.75 − 0.23 = 0.52.
  • If the gap threshold is set to 0.23, the same two correspondences are selected, since 0.69 − 0.25 > 0.23.
  • On the other hand, if the proportional threshold is set to 0.23, the hard threshold is 0.75 × 0.23 ≈ 0.17, and ⟨Product, Book, =⟩ is selected in addition to the above three.
  • A percentage threshold of 0.23 would select the top 12 × 23% ≈ 3 correspondences.
  • The SelectThreshold method with a threshold of 0.23 also yields the set of four correspondences described above.

Strengthening and weakening
Another approach (Ehrig and Sure 2004) is to apply a sigmoid function between 0 and 1, sig_a(x) = 1/(1 + e^(−a(x−0.5))), where a is a slope parameter. This strengthens values above 0.5 and weakens values below 0.5, with the intention of clearly separating the zones of positive and negative correspondences.

Other functions, such as 1 − sin(arccos(x)), have the opposite effect: they discard inconclusive measures and spread the highest measures over the unit interval. This is well justified when very similar entities are indeed similar, while loosely similar ones yield inconclusive results. Of course, these functions can be shifted so as to choose a threshold other than 0.5.
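A small sketch of the two shaping functions; the slope value and the sample similarities are illustrative.

```python
# Sketch of the strengthening (sigmoid) and weakening (1 - sin(arccos(x))) filters;
# the slope value and the sample similarities are illustrative.
import math

def strengthen(x, slope=12.0):
    """Sigmoid centred at 0.5: pushes values above 0.5 towards 1 and below 0.5 towards 0."""
    return 1.0 / (1.0 + math.exp(-slope * (x - 0.5)))

def weaken(x):
    """1 - sin(arccos(x)) = 1 - sqrt(1 - x^2): stays low for inconclusive values and
    only rises sharply as the similarity approaches 1."""
    return 1.0 - math.sin(math.acos(x))

for s in (0.1, 0.4, 0.6, 0.75, 0.9):
    print(f"{s:.2f} -> strengthened {strengthen(s):.2f}, weakened {weaken(s):.2f}")
```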
Example (strengthening and weakening). The following two tables show the similarity table of the previous example filtered by the two functions above.

The sigmoid function gives high values to the best matches and low values to the bad ones, while the other function reduces all values and does not discriminate between correspondences unless their similarity is higher than 0.75.

Optimizing the Results
When an injective alignment is required, choices must be made in order to maximise the quality of the alignment. This quality is usually measured by the total similarity of the matched entity pairs, so the matching algorithm must optimise a global criterion instead of maximising the local similarity of each entity pair.
In summary, the alignment computation can be viewed as a less constrained version of the MSim set of similarity equations. Its target properties are the same with respect to (1) maximal global similarity, (2) exclusivity, and (3) maximal cardinality (everything that can be put in correspondence is). However, (2) and (3) are not mandatory; they depend on the injectivity and totality requirements, respectively. Extracting an alignment from a similarity table is commonly called graph matching (Berge 1970; Lovász and Plummer 1986), or more precisely weighted bipartite graph matching (for injective alignments) or covering (for total alignments).

The greedy alignment extraction algorithm builds the alignment stepwise: at each step it selects the most similar pair and removes its members from the table, stopping when no remaining pair has a similarity above a threshold. The greedy strategy is not optimal; however, since the rationale for discarding a high-similarity pair in favour of lower ones is questionable, the greedy algorithm may be preferable in some situations.
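A hedged sketch of this greedy strategy; the dictionary of similarity values is illustrative and only loosely modelled on the worked example further below.

```python
# Sketch of the greedy extraction strategy described above; the similarity values are illustrative.
def greedy_extract(sim, threshold=0.0):
    """sim: dict (x, y) -> similarity. Repeatedly pick the best remaining pair, remove
    both of its members, and stop when no pair exceeds the threshold."""
    remaining = dict(sim)
    alignment = []
    while remaining:
        (x, y), v = max(remaining.items(), key=lambda item: item[1])
        if v <= threshold:
            break
        alignment.append((x, y, v))
        remaining = {(a, b): w for (a, b), w in remaining.items() if a != x and b != y}
    return alignment

sim = {("Product", "Book"): 0.80, ("Product", "Publisher"): 0.90, ("Product", "Writer"): 0.05,
       ("Provider", "Book"): 0.22, ("Provider", "Publisher"): 0.78, ("Provider", "Writer"): 0.66,
       ("Creator", "Book"): 0.60, ("Creator", "Publisher"): 0.10, ("Creator", "Writer"): 0.84}
print(greedy_extract(sim, threshold=0.2))
# -> [('Product', 'Publisher', 0.9), ('Creator', 'Writer', 0.84), ('Provider', 'Book', 0.22)]
```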

In this context, there are two notions of optimality for matching two sets: a local optimum, called a stable marriage, and a global optimum, called maximum weight matching. A stable marriage is an assignment (matching each object of the first set with at most one object of the second set) such that no two entities assigned to different correspondences would both prefer to be matched to each other, i.e., there is no pair whose mutual similarity is higher than the similarity each of its members has with the entity it is currently matched to. Algorithms for computing stable marriages include the Gale-Shapley algorithm (Gale and Shapley 1962) and the greedy algorithm introduced above.

Definition 7.28 (Stable marriage problem)
Given two sets of entities o and o′ and a similarity function σ : o × o′ → [0, 1], extract a one-to-one alignment M ⊆ o × o′ such that for any ⟨p, q⟩ ∈ M and ⟨r, s⟩ ∈ M, σ(p, q) ≥ σ(p, s) or σ(r, s) ≥ σ(p, s).

The solution found by the greedy algorithm is guaranteed to be a stable marriage. A slightly different problem requires instead, as a constraint, that any permutation of two assignments yields a worse result.

Definition 7.29 (Pairwise maximal matching problem)
Given two sets of entities o and o′ and a similarity function σ : o × o′ → [0, 1], extract a one-to-one alignment M ⊆ o × o′ such that for any ⟨p, q⟩ ∈ M and ⟨r, s⟩ ∈ M, σ(p, q) + σ(r, s) ≥ σ(p, s) + σ(r, q).

A pairwise maximal assignment is not necessarily a stable marriage, and vice versa.
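As a sketch, the two local conditions of Definitions 7.28 and 7.29 can be checked directly over a one-to-one alignment; the code and the toy similarity values are my own.

```python
# Sketch of direct checks for the two local conditions above; the similarity values are illustrative.
def is_stable_marriage(alignment, sim):
    """Definition 7.28: no two matched pairs <p,q>, <r,s> are such that the crossed
    pair <p,s> is strictly preferred by both p and s."""
    return not any(sim.get((p, s), 0.0) > sim.get((p, q), 0.0)
                   and sim.get((p, s), 0.0) > sim.get((r, s), 0.0)
                   for (p, q) in alignment for (r, s) in alignment)

def is_pairwise_maximal(alignment, sim):
    """Definition 7.29: swapping the partners of any two matched pairs never increases
    the total similarity."""
    return all(sim.get((p, q), 0.0) + sim.get((r, s), 0.0)
               >= sim.get((p, s), 0.0) + sim.get((r, q), 0.0)
               for (p, q) in alignment for (r, s) in alignment)

sim = {("Product", "Book"): 0.80, ("Product", "Publisher"): 0.90, ("Product", "Writer"): 0.05,
       ("Provider", "Book"): 0.22, ("Provider", "Publisher"): 0.78, ("Provider", "Writer"): 0.66,
       ("Creator", "Book"): 0.60, ("Creator", "Publisher"): 0.10, ("Creator", "Writer"): 0.84}
greedy = [("Product", "Publisher"), ("Creator", "Writer"), ("Provider", "Book")]
print(is_stable_marriage(greedy, sim), is_pairwise_maximal(greedy, sim))  # True False
```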
The second notion is the global optimum, or maximum weight matching: an assignment for which no other assignment has a higher total weight. It can be computed with the Hungarian method (Munkres 1957).

Definition 7.30 (Maximum weight graph matching problem)
Given two sets of entities o and o′ and a similarity function σ : o × o′ → [0, 1], extract a one-to-one alignment M ⊆ o × o′ such that for any one-to-one alignment M′ ⊆ o × o′:

\[ \displaystyle\sum_{\langle p,q\rangle\in M}\sigma(p,q) \geq \sum_{\langle p,q\rangle\in M'}\sigma(p,q) \]

If the weights represent dissimilarity rather than similarity, the problem to be solved is bipartite minimum weight graph matching.
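For injective alignments, this global optimum can be computed with an off-the-shelf Hungarian-style solver. Below is a minimal sketch using SciPy's linear_sum_assignment; the similarity matrix is illustrative, not the exact table of the example below.

```python
# Maximum weight matching with the Hungarian-style solver shipped in SciPy
# (scipy.optimize.linear_sum_assignment); the similarity values are illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment

rows = ["Product", "Provider", "Creator"]
cols = ["Book", "Publisher", "Writer"]
sim = np.array([
    [0.80, 0.90, 0.05],   # Product
    [0.22, 0.78, 0.66],   # Provider
    [0.60, 0.10, 0.84],   # Creator
])

# maximize=True turns the usual minimum-cost assignment into maximum weight matching.
row_ind, col_ind = linear_sum_assignment(sim, maximize=True)
alignment = [(rows[i], cols[j], sim[i, j]) for i, j in zip(row_ind, col_ind)]
print(alignment, "total weight:", sim[row_ind, col_ind].sum())
```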
Example (stable marriage, pairwise maximal and maximum weight matching). Consider the following similarity table over the concepts shown below.


We want to extract a one-to-one alignment from it.

The greedy algorithm first selects the correspondence ⟨Product, Publisher⟩ with the highest score (0.90) and discards the corresponding row and column. It then selects the next highest score (0.84), ⟨Creator, Writer⟩, and finally the best remaining pair, ⟨Provider, Book⟩. The result is clearly a stable marriage (this requires checking that none of the selected pairs is defeated by a symmetric individual preference), but it is not a pairwise maximal matching. The score of the alignment made of these three correspondences is 1.96.
However, there are better alignments. For example, replacing the last two correspondences with ⟨Creator, Book⟩ and ⟨Provider, Writer⟩ yields an alignment with a score of 2.1. This alignment is pairwise maximal, but it is not a stable marriage.

The maximum weight matching consists of ⟨Product, Book⟩, ⟨Provider, Publisher⟩, and ⟨Creator, Writer⟩, with a score of 2.52. This assignment is pairwise maximal, but it is not a stable marriage.

Summary of Alignment Extraction
Alignment extraction is usually a necessary step in similarity-based matching. The methods presented so far can be applied to any similarity or dissimilarity measure. They can be as simple as applying a threshold or as sophisticated as solving an optimization problem. There are no absolute criteria for selecting these methods. An important criterion will be the form of the expected end result.

In the next article, we will discuss alignment disambiguation.
