Introduction to reference books on search technology: “Basics of Information Retrieval” and “New Horizons in Search”


I would like to introduce some books that will be helpful in understanding search technology.

The first is “Basics of Information Retrieval”, published by Kyoritsu Publishing Co. This book comprehensively covers the “how” of search technology and is a very useful reference when actually building search modules.

The chapters are organized as follows, covering basic search technology from the basic structure of a search system through to evaluation and machine learning:

1. Boolean retrieval
2. The term vocabulary and postings lists
3. Dictionaries and tolerant retrieval
4. Index construction
5. Index compression
6. Scoring, term weighting, and the vector space model
7. Computing scores in a complete search system
8. Evaluation of information retrieval
9. Relevance feedback and query expansion
10. XML retrieval
11. Probabilistic information retrieval
12. Language models for information retrieval
13. Text classification and Naive Bayes
14. Vector space classification
15. Support vector machines and machine learning on documents
16. Flat clustering
17. Hierarchical clustering
18. Matrix decomposition and latent semantic indexing
19. Fundamentals of web search
20. Web crawling and indexing
21. Link analysis

When constructing a search system, it is important to first understand the purpose of the search, that is, what is being searched for and why. To build a successful system, the purpose of the search and the axes along which the finished system will be evaluated should be considered from the start. The first topic to look at is therefore the evaluation of information retrieval, covered in Chapter 8.

According to Chapter 8, three components are needed to evaluate an information retrieval system: (1) a collection of target documents, (2) a set of test information needs expressed as queries, and (3) a set of relevance judgments, in which the relevance or irrelevance of each query-document pair is indicated as a binary value.

Of these, (1) defines what is to be searched. The target documents (not necessarily text) are an inventory of the information that appears in the problem area the system is meant to address; for each item, the inventory should record the actual file name, where it is stored, the type of information (Word, Excel, PowerPoint, PDF, etc.), and the amount of data in the file.
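
As a rough illustration, the inventory for (1) could be kept as simple records like the following (a minimal sketch; the field names, file types, and example entries are my own assumptions, not something the book prescribes):

```python
from dataclasses import dataclass

@dataclass
class TargetDocument:
    """One entry in the inventory of documents to be searched."""
    file_name: str      # actual file name
    location: str       # where the file is stored (file share, URL, repository, ...)
    file_type: str      # "word", "excel", "powerpoint", "pdf", ...
    size_bytes: int     # amount of data in the file

# Hypothetical example entries for one problem area
collection = [
    TargetDocument("design_review_2023.pdf", r"\\fileserver\projects\alpha", "pdf", 1_204_000),
    TargetDocument("test_results.xlsx", r"\\fileserver\projects\alpha\qa", "excel", 88_500),
]
```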

At this point, it is also worth writing a use case describing where each piece of information appears in the workflow and what issues arise there (e.g., a simple search returns too many results and filtering is difficult, or a search method other than character matching is needed). This makes it easier to narrow down the evaluation points.

Next, (2) defines what problem the system is meant to solve, and is directly related to KPIs. If there are top-level KPIs you want to address, describe and analyze the workflow in which they appear, as above, to clarify the point of the problem. Since these issues are to be solved through search, you need to describe what kind of answers you want to obtain for what kind of information needs. Here the information need is not simply a list of keywords; it is defined in two stages: first a concrete description of what you want to find in the workflow, and then the query (keywords plus any refinements) needed to search for it.
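
Following this two-stage definition, one way to record an information need is to keep the concrete description separate from the query actually issued to the system (a minimal sketch; the structure, field names, and example values are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class InformationNeed:
    """A test information need, defined in two stages."""
    need_id: str
    description: str                 # what the user wants to find, in the words of the workflow
    keywords: list[str]              # the query actually issued to the search system
    refinements: dict[str, str] = field(default_factory=dict)  # optional filters (file type, date range, ...)

need = InformationNeed(
    need_id="Q001",
    description="Find the review documents that explain why the alpha project schedule slipped.",
    keywords=["alpha", "schedule", "delay", "review"],
    refinements={"file_type": "pdf"},
)
```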

As a practical example of (1) and (2), in NTCIR, a well-known Japanese retrieval test collection, the target collection for (1) is defined as, for example, “Mainichi Newspaper articles (2002-2005)”, and an information need for (2) is defined as “What is Fatah? (The analyst is especially interested in major characteristics of the organization called Fatah.)”, with “What is Fatah?” as the input query.

Finally, (3) is preferably created as pairs of inputs and answers by the actual users who appear in the usage scenario; if cooperation from the field is not possible, the system designer creates a hypothesis based on the KPIs. As a concrete example, in the NTCIR case above, relevant answers to “What kind of organization is Fatah?” include “Fatah is the mainstream faction of the Palestine Liberation Organization (PLO)”, “Fatah is the main body of the anti-Israel resistance struggle”, “Fatah is the largest support base of Chairman Arafat”, “Fatah has been in power for 13 days”, and so on.
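
Component (3) can then be kept as binary judgments per query-document pair, in the style of the qrels files used by shared test collections (a minimal sketch; the identifiers are invented for illustration):

```python
# Binary relevance judgments: (query_id, doc_id) -> 1 if relevant, 0 if not.
qrels = {
    ("Q001", "design_review_2023.pdf"): 1,
    ("Q001", "test_results.xlsx"): 0,
    ("Q002", "design_review_2023.pdf"): 0,
}

def is_relevant(query_id: str, doc_id: str) -> bool:
    """Pairs that were never judged are treated as not relevant."""
    return qrels.get((query_id, doc_id), 0) == 1
```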

Since an absolute, quantitative evaluation of a search system is difficult (it is practically impossible to prepare every input/output pair that might appear in the usage scenario), a common approach is to evaluate against a test data set agreed with the users. One way to build such a set is to divide the search data into areas corresponding to technical issues and evaluate each area separately.

For example, matching cases can be divided into (1) string matching, (2) matching using dictionaries of synonyms and similar words, and (3) more complex matching, and the proportion of (1), (2), and (3) appearing in the usage scenario can be estimated from actual examples. In general, the proportion of (1) and (2) is high, and cases requiring the advanced matching of (3) are rare. Since (1) and (2) can be matched mechanically, their results can be estimated quantitatively relatively easily. For (3), on the other hand, since many different cases are conceivable, a practical method is to first check the behavior on specific cases derived from the use cases, as sketched below.
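
As a rough sketch of how cases (1) and (2) can be checked mechanically (the synonym dictionary and the example document are invented for illustration), one can classify whether a query term reaches a document by plain string matching or only via a dictionary of similar words:

```python
# Hypothetical similar-word dictionary for category (2) matching.
SYNONYMS = {
    "delay": {"slip", "slippage", "behind schedule"},
    "bug": {"defect", "fault"},
}

def match_category(query_term: str, document: str) -> str:
    """Classify how a query term matches a document: 'string', 'dictionary', or 'none'."""
    text = document.lower()
    if query_term.lower() in text:
        return "string"            # category (1): plain string matching
    for alt in SYNONYMS.get(query_term.lower(), set()):
        if alt in text:
            return "dictionary"    # category (2): matched via the similar-word dictionary
    return "none"                  # would need category (3): more complex matching

print(match_category("delay", "The schedule slip was reported in the review."))  # -> "dictionary"
```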

It is also possible to use shared test collections that are widely recognized in the research community (the Cranfield collection, GOV2, NTCIR, CLEF, Reuters, 20 Newsgroups, etc.). However, good results on these do not guarantee that a system will be practical, so it is reasonable to use them only as a guide when you cannot prepare your own test set.

As for the evaluation of actual search results, the basic measures for an unranked result set are precision and recall. Precision is the proportion of returned results that are truly relevant and is a measure of accuracy, while recall is the proportion of the relevant documents that are actually found and is a measure of comprehensiveness. In general there is a trade-off between the two: precision tends to fall as recall rises, and recall tends to fall as precision rises. Therefore, the F-measure, a weighted harmonic mean (the reciprocal of the mean of the reciprocals) of precision and recall, is used as a measure that is high only when both are high in a balanced way.
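
In code, precision, recall, and the F-measure for an unranked result set can be computed as follows (a minimal sketch over sets of document identifiers; beta controls the weighting, with beta = 1 giving the usual balanced F1):

```python
def precision_recall_f(retrieved: set[str], relevant: set[str], beta: float = 1.0):
    """Precision, recall, and weighted-harmonic-mean F-measure for an unranked result set."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        f_measure = 0.0
    else:
        b2 = beta * beta
        f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure

# Example: 3 of the 4 returned documents are relevant, out of 6 relevant documents in total.
print(precision_recall_f({"d1", "d2", "d3", "d9"}, {"d1", "d2", "d3", "d4", "d5", "d6"}))
# -> (0.75, 0.5, 0.6)
```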

For a ranked result set, the top k results are treated as the returned set, and various evaluation methods have been proposed: the precision-recall curve, interpolated precision, 11-point interpolated average precision, mean average precision (MAP), precision at k, R-precision, the break-even point, the ROC curve, sensitivity and specificity, cumulative gain, and the Dice coefficient.
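
Two of these measures are easy to illustrate: precision at k looks only at the top k results, and MAP averages, over all queries, the mean of the precision values at each rank where a relevant document is returned (a minimal sketch; unjudged documents are simply treated as non-relevant):

```python
def precision_at_k(ranking: list[str], relevant: set[str], k: int) -> float:
    """Precision computed over the top k results of a ranked list."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking: list[str], relevant: set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """MAP: average precision, averaged over all queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranking = ["d3", "d7", "d1", "d5"]
relevant = {"d1", "d3"}
print(precision_at_k(ranking, relevant, 2))   # -> 0.5
print(average_precision(ranking, relevant))   # -> (1/1 + 2/3) / 2 ≈ 0.833
```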

Apart from evaluating the results themselves, there are also methods for judging whether an answer document really is relevant to an information need, such as pooling, the kappa statistic, and the expected agreement computed from the marginals.
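
For instance, the kappa statistic compares the observed agreement between two assessors with the agreement expected by chance from the marginal proportions of their judgments (a minimal sketch for binary relevant/non-relevant judgments; the example judgments are invented):

```python
def cohens_kappa(judge_a: list[int], judge_b: list[int]) -> float:
    """kappa = (P(agree) - P(expected agreement)) / (1 - P(expected agreement)),
    for two assessors giving binary (0/1) relevance judgments on the same documents."""
    n = len(judge_a)
    observed = sum(1 for a, b in zip(judge_a, judge_b) if a == b) / n
    # Expected agreement from the marginals: both say relevant, or both say non-relevant.
    p_a, p_b = sum(judge_a) / n, sum(judge_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected) if expected < 1.0 else 1.0

# Two assessors judging the same eight query-document pairs.
print(cohens_kappa([1, 1, 0, 1, 0, 0, 1, 0],
                   [1, 1, 0, 0, 0, 0, 1, 1]))   # -> 0.5
```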

In addition to these retrieval measures, it is necessary to consider non-functional requirements (measures of system quality), such as how fast the system searches, how expressive its query language is, how large a collection it can handle, and how quickly it can build its index, as well as the improvement in productivity and user satisfaction when the system is finally put to use.

As mentioned above, the evaluation items of a search system vary widely, so it is necessary to define them carefully when conducting verification experiments.

The second book I would like to introduce is “Kadokawa Internet Lecture 8: New Horizons of Search”, a volume in the Kadokawa Internet Lecture series. It is supervised by Akihiko Takano, and each of the seven chapters is written by a different author.

The chapter structure is as follows.

Part 1: Diversified Search Today
   Introduction: What is Search?
   Chapter 1: Searching Text: Search Engines
   Chapter 2: The Evolution of Image Search
   Chapter 3: Search Connected to the Real World: Searching in Time and Space
   Chapter 4: Various Kinds of Search and the Use of Materials
Part 2: The Future of Search
   Chapter 5: Writing and Searching for Knowledge: The Ideal Web and the Road to the Semantic Web
   Chapter 6: Searching as a Memory Technique: From Search to Association

The first half of the book starts from conventional text search and then covers the evolution of image and video search, the evolution of spatial and temporal search, and navigation over them. As directions in which text search will evolve, search over unstructured data such as images and video, which is currently drawing attention from the standpoint of digital transformation (DX), and search over space and time are what one would expect; what is particularly interesting is that the book also points out the need to search via spatial and temporal connections grounded in the content of the information itself (i.e., natural language processing).

Dealing with information along the time axis means dealing with changes in information. For example, if information that was true at a certain point in time becomes false after some period (e.g., because something that was not known before is revealed), how to handle that information becomes a major technical issue. An object-oriented approach that attaches some mutable state to the information is one way to handle this, but it can be expected to become exponentially more complicated and break down once the correlations between the states have to be considered. On the other hand, an approach that makes the data immutable and assigns a version to each fact (a model like the Datomic database) can be expected to handle these issues more smoothly.
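
A rough sketch of the second approach (my own illustration, not taken from the book, and far simpler than what Datomic actually does): facts are never overwritten; each change appends a new version, and a query “as of” a point in time picks the latest version visible then:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """An immutable, versioned statement: (entity, attribute, value) plus when it was asserted."""
    entity: str
    attribute: str
    value: str
    version: int
    asserted_at: int   # logical timestamp

log: list[Fact] = []   # append-only log; nothing is ever updated in place

def assert_fact(entity: str, attribute: str, value: str, at: int) -> None:
    version = sum(1 for f in log if f.entity == entity and f.attribute == attribute) + 1
    log.append(Fact(entity, attribute, value, version, at))

def value_as_of(entity: str, attribute: str, at: int) -> str | None:
    """Return the value visible at time `at`, i.e. the latest version asserted up to then."""
    visible = [f for f in log if f.entity == entity and f.attribute == attribute and f.asserted_at <= at]
    return max(visible, key=lambda f: f.asserted_at).value if visible else None

assert_fact("doc-42", "status", "believed accurate", at=1)
assert_fact("doc-42", "status", "superseded by new findings", at=5)
print(value_as_of("doc-42", "status", at=3))   # -> "believed accurate"
print(value_as_of("doc-42", "status", at=7))   # -> "superseded by new findings"
```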

The second half of the book outlines the Semantic Web technology introduced earlier and its development into an “associative informatics” grounded in human memory. The relationship between knowledge and memory, and the retrieval of related information as represented by Semantic Web technology, are very interesting topics, but I will discuss them at another time.
