Search Technology


About Search Technology

Information is the basis of computer technology. Simply collecting information is meaningless; to carry out creative work with the collected information, it is necessary to go through a cycle of “collecting,” “searching,” “finding,” “looking,” and “noticing.” For each of these steps there are corresponding technologies and ideas. In this article, I will discuss search technology.

The following is a description of search technology in this blog.

Implementations

A ranking algorithm is a method for sorting a given set of items in order of relevance to the user and is widely used in fields such as search engines, online shopping, and recommendation systems. This section provides an overview of common ranking algorithms.
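
As a minimal illustration, the sketch below ranks a list of documents by a toy relevance score (query-term overlap); the scoring function is a hypothetical stand-in for whatever relevance model a real system would use.

```python
# Minimal ranking sketch: order items by a relevance score for a query.
# score() is a toy stand-in (query-term overlap) for a real relevance model.

def score(query: str, item: str) -> float:
    """Fraction of query terms that appear in the item."""
    q_terms = set(query.lower().split())
    i_terms = set(item.lower().split())
    return len(q_terms & i_terms) / max(len(q_terms), 1)

def rank(query: str, items: list[str]) -> list[str]:
    """Return items sorted by descending relevance to the query."""
    return sorted(items, key=lambda it: score(query, it), reverse=True)

docs = ["fast open source search engine",
        "recipe for apple pie",
        "distributed search and analytics"]
print(rank("open source search", docs))
```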

Random Forest is a very popular ensemble learning method in the field of machine learning (an approach that combines multiple machine learning models to obtain better performance than the individual models). It combines multiple decision trees to build a more powerful model, and there are many variations of ranking methods that use random forests.
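
As one hedged example of this family, the sketch below uses a pointwise learning-to-rank setup with scikit-learn’s RandomForestRegressor on synthetic feature vectors and relevance labels; a real system would use engineered query–document features and graded relevance judgments.

```python
# Pointwise learning-to-rank sketch with a random forest (scikit-learn assumed).
# Each item is a feature vector; the forest regresses a relevance label and
# candidate items are then ranked by the predicted score.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))                                 # synthetic item features
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(float)   # synthetic relevance labels

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

X_candidates = rng.normal(size=(10, 5))   # items to rank for one query
scores = model.predict(X_candidates)
ranking = np.argsort(-scores)             # indices ordered by predicted relevance
print(ranking)
```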

Diversity-Promoting Ranking is a method that plays an important role in information retrieval and recommendation systems, aiming to make users’ search results and lists of recommended items more diverse and balanced. Usually, the purpose of ranking is to display items that match the user’s interests at the top, but multiple items with similar content and characteristics may then cluster at the top. For example, in a product recommendation system, similar items or items in the same category often dominate the top of the list. Because these items are so alike, they may not adequately cover the user’s interests, leading to information bias and limited choices; diversity-promoting ranking is used to address these issues.

Exploratory Ranking is a technique for identifying items that are likely to be of interest to users in ranking tasks such as information retrieval and recommendation systems. This technique aims to find the items of most interest to the user among ranked items based on the feedback given by the user.

Maximum Marginal Relevance (MMR) is a ranking method for information retrieval and information filtering that aims to optimize the ranking of documents provided to users by an information retrieval system. MMR was developed as a method for selecting documents that are relevant to the user’s interests from among multiple documents. The method ranks documents based on both relevance and diversity, specifically emphasizing the selection of documents that are highly relevant but have low similarity to the documents already selected.
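
A minimal sketch of MMR, assuming the query and documents are already represented as vectors; lambda_ trades off relevance against redundancy with the documents selected so far.

```python
# Maximum Marginal Relevance (MMR) sketch: greedily pick documents that are
# relevant to the query but dissimilar to documents already selected.
# lambda_ = 1.0 means pure relevance, lambda_ = 0.0 means pure diversity.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mmr(query_vec, doc_vecs, k=3, lambda_=0.7):
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(1)
docs = rng.normal(size=(20, 8))
query = rng.normal(size=8)
print(mmr(query, docs, k=5))   # indices of the 5 selected documents, in order
```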

A search system is a system that searches databases and other information sources based on a given query and returns relevant results; it can target various types of data, such as text, images, and voice. The implementation of a search system involves elements such as database management, search algorithms, indexing, ranking models, and user interfaces, and a variety of technologies and algorithms are used, with the appropriate approach selected according to the specific requirements and data types.
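
As a small illustration of the indexing and search elements mentioned above, the following sketch builds an inverted index (term → documents containing it) and answers a query by intersecting posting lists; real systems add tokenization, scoring, and persistence on top of this.

```python
# Minimal inverted-index sketch: indexing maps each term to the documents
# that contain it, and a query is answered by intersecting posting lists.
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index: dict[str, set[int]], query: str) -> set[int]:
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "open source search engine",
        2: "distributed search and analytics",
        3: "image retrieval with local features"}
index = build_index(docs)
print(search(index, "search engine"))   # -> {1}
```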

This section discusses specific implementation examples, focusing on Elasticsearch.

Elasticsearch is an open source distributed search engine that provides many features to enable fast text search and data analysis. Various plug-ins are also available to extend the functionality of Elasticsearch. This section describes these plug-ins and their specific implementations.

Multimodal search integrates multiple information sources and data modalities (e.g., text, images, audio) so that users can search for and retrieve information across them. This approach effectively combines information from multiple sources to provide more multifaceted and richer search results. This section provides an overview of multimodal search and two implementations, one using Elasticsearch and the other using machine learning techniques.
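
A loose sketch of the machine-learning-style approach, assuming hypothetical embed_text and embed_image placeholder encoders (a real system would use trained encoders that share an embedding space); scores from the two modalities are simply combined with a weighted sum.

```python
# Multimodal search sketch: combine per-modality similarity scores with a
# weighted sum. embed_text / embed_image are hypothetical placeholders for
# real text and image encoders.
import numpy as np

def embed_text(text: str) -> np.ndarray:
    # placeholder: a real system would call a trained text encoder
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=16)

def embed_image(image_id: str) -> np.ndarray:
    # placeholder: a real system would call a trained image encoder
    rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
    return rng.normal(size=16)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def multimodal_score(query: str, item: dict, w_text=0.6, w_image=0.4) -> float:
    q = embed_text(query)
    return (w_text * cosine(q, embed_text(item["caption"]))
            + w_image * cosine(q, embed_image(item["image"])))

items = [{"caption": "red sports car", "image": "img_001"},
         {"caption": "mountain landscape", "image": "img_002"}]
ranked = sorted(items, key=lambda it: multimodal_score("fast car", it), reverse=True)
print([it["image"] for it in ranked])
```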

Elasticsearch is an open source distributed search engine for search, analysis, and data visualization that also integrates machine learning (ML) technology and can be leveraged as a platform for data-driven insights and predictions. This section describes various uses and specific implementations of machine learning technology in Elasticsearch.

An Elasticsearch-based search engine that can be up and running in a short period of time (crawling, automatic indexing, word registration, etc.)

Elasticsearch is a full-text search engine developed by Elastic under the open-core business model; its basic functions, such as full-text search and clustering (ultra-fast distributed search based on Apache Lucene), are provided as open source under the Apache license.

Elasticsearch, the search module, can be connected to the group of core products called the Elastic Stack, including Logstash (a data collection module), Kibana (a data visualization tool), and Beats (lightweight data shippers), via a JSON-based RESTful interface, to form a system that can perform not only search but also collection, analysis, and visualization.
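
As a small example of that JSON-based RESTful interface, the sketch below sends a match query to the _search endpoint, assuming a local Elasticsearch node on port 9200 and a hypothetical index named "articles" with "title" and "body" fields.

```python
# Sending a full-text query to Elasticsearch's JSON/REST _search endpoint.
# Assumes a local node on http://localhost:9200 and a hypothetical index
# named "articles" with "title" and "body" fields.
import requests

query = {
    "query": {"match": {"body": "distributed search"}},  # full-text match on the "body" field
    "size": 5,
}

resp = requests.post("http://localhost:9200/articles/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```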

Introduction to “High Speed Scalable Engine Elastic Search Server”, a reference book on Elasticsearch technology

Procedure for setting up the environment for launching Elasticsearch

A UI for Elasticsearch built on Node.js and React (launching ReactiveSearch)

Continuing from the previous article, we discuss applications of the ReactiveSearch UI.

From “Microservices with Clojure”. In this article, we discuss the use of the Elastic Stack for monitoring microservice systems. The monitoring system described here can be widely applied to systems other than microservices; please refer to “Search Tool Elasticsearch – Startup Procedure” for details.

About the PageRank algorithm, which determines search rankings and is one of the search engine algorithms that dramatically improved Google.

Unity is an integrated development environment (IDE) for game and application development developed by Unity Technologies and widely used in fields such as games, VR, AR, and simulations. This section describes the integration of Unity with artificial intelligence systems such as CMS, chatbots, Elasticsearch, machine learning, and natural language processing.

PHP (Hypertext Preprocessor) is a scripting language for web development that runs mainly on the server side and is used to create dynamic web pages and develop web applications, with capabilities such as embedding in HTML, accessing databases, and processing forms. Laravel is the most popular PHP framework in this field.

This section describes specific implementations using Laravel (integration with MediaWiki, chatbots, and Elasticsearch).

RAG (Retrieval-Augmented Generation) is one of the technologies attracting attention in the field of natural language processing (NLP); it constructs models with richer context by combining information retrieval (Retrieval) and generation (Generation). The main goal of RAG is to generate higher-quality results by utilizing retrieved information in generative tasks (sentence generation, question answering, etc.), and it is characterized by its ability to draw on external knowledge and context.

The basic structure of RAG is to vectorize the input query with a Query Encoder, find documents with similar vectors, and generate a response using those documents. A vector DB is used to store the vectorized documents and to search for similar ones. For the generative side, ChatGPT’s API or LangChain is generally used, as described in “Overview of ChatGPT and LangChain and their use,” while the database side is covered in “Overview of Vector Databases.” In this article, we describe a concrete implementation using these components.
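
A minimal RAG sketch under simplified assumptions: embed() and generate() are hypothetical placeholders for a real embedding model and a real LLM call (e.g., via LangChain or ChatGPT’s API), and the "vector DB" is just an in-memory list searched by cosine similarity.

```python
# RAG sketch: embed query and documents, retrieve the most similar documents,
# and pass them as context to a generator. embed() and generate() are
# placeholders for real models; the vector store is a plain Python list.
import numpy as np

def embed(text: str) -> np.ndarray:
    # placeholder embedding: a real system would call an embedding model
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=32)

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    sims = []
    for d in docs:
        v = embed(d)
        sims.append(float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-12)))
    top = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in top]

def generate(query: str, context: list[str]) -> str:
    # placeholder: a real system would send a prompt containing the context to an LLM
    return f"Answer to '{query}' grounded in: {context}"

docs = ["Elasticsearch is a distributed search engine.",
        "PageRank scores pages by link structure.",
        "MMR balances relevance and diversity."]
print(generate("What is Elasticsearch?", retrieve("What is Elasticsearch?", docs)))
```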

Overview

Introduction of reference books on search technology. One is “Fundamentals of Information Retrieval,” which comprehensively summarizes the how of search technology and is a very useful reference when actually building a search module.

The other is “Horizons of Search,” which begins with an explanation of conventional text retrieval and then describes the evolution of image and video retrieval, of spatial and temporal retrieval, and of their navigation. As text search evolves, retrieval of unstructured data such as images and video, which is currently attracting attention from the viewpoint of digital transformation, and retrieval over space and time are expected to follow. The book also discusses the need to search via spatial and temporal connections based on the content of the information (natural language processing), rather than merely searching for information with time stamps.

This book is a translation of Search User Interfaces (Cambridge University Press, 2009), which systematically discusses user interface technologies that allow users to obtain the information they need appropriately and without stress when using Web search engines and information retrieval systems such as Google and Yahoo!. The author, Marti A. Hearst, is a professor at the School of Information, University of California, Berkeley, and has made outstanding contributions in the field of user interfaces. She has also conducted research at Xerox PARC (Palo Alto Research Center) in Silicon Valley, which is famous for inventing many of the basic technologies of computing. In this sense, the original work is an epoch-making masterpiece of the era of Web information retrieval.

Algorithms

Introduction to string matching algorithms for search engines (n-grams, meta word indexing, etc.)

About the PageRank algorithm, which determines search rankings and is one of the search engine algorithms that dramatically improved Google.
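
A minimal PageRank sketch using power iteration on a small link graph; d is the damping factor, and dangling pages redistribute their rank evenly.

```python
# PageRank via power iteration: ranks converge toward the principal
# eigenvector of the damped link matrix.
import numpy as np

def pagerank(links: dict[int, list[int]], d: float = 0.85, iters: int = 50) -> np.ndarray:
    n = len(links)
    ranks = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.full(n, (1.0 - d) / n)          # teleportation term
        for page, outlinks in links.items():
            if outlinks:
                share = d * ranks[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
            else:                                 # dangling page: spread evenly
                new += d * ranks[page] / n
        ranks = new
    return ranks

# toy link graph: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0 ; 3 -> 2
print(pagerank({0: [1, 2], 1: [2], 2: [0], 3: [2]}))
```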

About n-grams, a natural language processing method used for pattern matching in search engines.
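
As a small illustration, the sketch below generates character n-grams, the unit used for partial-match indexing in many search engines.

```python
# Character n-gram sketch: split a string into overlapping n-character pieces.
def char_ngrams(text: str, n: int = 2) -> list[str]:
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("search", 2))   # ['se', 'ea', 'ar', 'rc', 'ch']
print(char_ngrams("search", 3))   # ['sea', 'ear', 'arc', 'rch']
```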

Object detection aims to find a rectangular region in an image that surrounds an object such as a person or a car. Many object detection methods propose multiple candidate object regions and use object class recognition methods to determine which object these regions are classified as. Since the number of candidate object regions proposed from images is often huge, methods with low computational cost are often used for object class recognition.

The sliding window method, the selective search method, and the branch-and-bound method are methods for proposing candidate object regions from images. There are also several methods for classifying the proposed regions, such as Exemplar-SVM, Random Forest, and R-CNN (Regions with CNN features).

While class recognition involves predicting the class to which a target object belongs, instance recognition is the task of identifying the target object itself. The central task of instance recognition is the image retrieval problem, which is to quickly find an image in a database from an input image. Instance recognition is the task of identifying the object itself, such that when we see the Tokyo Tower, we do not recognize it as a radio tower, but as the Tokyo Tower. This can be achieved by searching the database for images that show the same object as the one in the input image.

The implementation of instance recognition is as follows: 1) extract local features from a set of stored images and create an image database, 2) extract local features from the query image, 3) take each local feature of the query image, compare it with all local features in the image database, and cast one vote for the database image that has the most similar local feature. The database image with the most votes is recognized as showing the same object as the query image.
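
A hedged sketch of this voting scheme, assuming local features have already been extracted (e.g., with SIFT-like descriptors) and are given as plain NumPy arrays.

```python
# Voting-based instance recognition sketch: each local feature of the query
# votes for the database image holding its nearest local feature; the image
# with the most votes wins. Feature extraction is assumed to happen elsewhere.
import numpy as np
from collections import Counter

def recognize(query_feats: np.ndarray, db: dict[str, np.ndarray]) -> str:
    votes = Counter()
    for f in query_feats:
        best_img, best_dist = None, np.inf
        for img_id, feats in db.items():
            dist = np.linalg.norm(feats - f, axis=1).min()
            if dist < best_dist:
                best_img, best_dist = img_id, dist
        votes[best_img] += 1
    return votes.most_common(1)[0][0]

rng = np.random.default_rng(0)
db = {"tokyo_tower": rng.normal(0.0, 1.0, size=(50, 16)),
      "eiffel_tower": rng.normal(3.0, 1.0, size=(50, 16))}
query = rng.normal(0.0, 1.0, size=(20, 16))   # features resembling "tokyo_tower"
print(recognize(query, db))
```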

The problem of finding images in the database that are similar to the image represented by the feature vector x is called similar image search or image retrieval, and is one of the central problems in instance recognition.

The simplest way to achieve image retrieval is to measure the distance between the query image and every image in the database and sort the results in ascending order. However, when the number of images in the database becomes huge, this method becomes impractical because it takes too much computation time. In this article, we discuss efficient search methods using tree structures, binary code conversion, and product quantization.
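
As a hedged illustration of the binary-code idea, the sketch below hashes vectors to short binary codes with random projections (a simplification of the learned hashing methods discussed) and ranks candidates by Hamming distance, which is far cheaper than exhaustive Euclidean comparison on the original vectors.

```python
# Binary-code search sketch: hash feature vectors with random hyperplanes and
# shortlist candidates by Hamming distance to the query's code.
import numpy as np

rng = np.random.default_rng(0)
dim, bits = 64, 16
projection = rng.normal(size=(dim, bits))      # fixed random hyperplanes

def to_code(x: np.ndarray) -> np.ndarray:
    return (x @ projection > 0).astype(np.uint8)

database = rng.normal(size=(10000, dim))
codes = to_code(database)                      # precomputed binary codes

query = rng.normal(size=dim)
q_code = to_code(query)
hamming = (codes != q_code).sum(axis=1)        # Hamming distances to all codes
top10 = np.argsort(hamming)[:10]               # candidate shortlist for re-ranking
print(top10)
```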

Application

The Semantic Web realization depends on the availability of critical mass of metadata for the web content, linked to formal knowledge about the world. This paper presents our vision about a holistic system allowing annotation, indexing, and retrieval of documents with respect to real-world entities. A system (called KIM), partially implementing this concept is shortly presented and used for evaluation and demonstration. Our understanding is that a system for semantic annotation should be based upon specific knowledge about the world, rather than indifferent to any ontological commitments and general knowledge. To assure efficiency and reusability of the metadata we introduce a simplistic upper-level ontology which starts with some basic philosophic distinctions and goes down to the most popular entity types (people, companies, cities, etc.), thus providing many of the inter-domain common sense concepts and allowing easy domain-specific extensions. Based on the ontology, an extensive knowledge base of entities descriptions is maintained. Semantically enhanced information extraction system providing automatic annotation with references to classes in the ontology and instances in the knowledge base is presented. Based on these annotations, we perform IR-like indexing and retrieval, further extended using the ontology and knowledge about the specific entities.

Resources on the Semantic Web are described by metadata related to some formal or informal ontology. It is a common situation that a casual user does not know domain ontology in detail. This makes it difficult to formulate queries in this ontology to find the relevant resources. Users consider the resources in their specific context, so the most straightforward solution is to formulate queries in an ontology that corresponds to a user-specific view. We present an approach based on multiple views, expressed in simple ontologies. This allows a user to query heterogeneous data repositories in terms of multiple, relatively simple view ontologies. We present how ontology developers can define such views on ontologies and the corresponding mapping rules. These ontologies are represented in Semantic Web ontology languages, like RDFS, DAML+OIL or OWL. We present our approach with examples from the e-learning domain using the Semantic Web query and transformation language TRIPLE.

Although RDF/XML has been widely recognized as the standard vehicle for representing semantic information on the Web, an enormous amount of semantic data is still being encoded in HTML documents that are designed primarily for human consumption and not directly amenable to machine processing. This paper seeks to bridge this semantic gap by addressing the fundamental problem of automatically annotating HTML documents with semantic labels. Exploiting a key observation that semantically related items exhibit consistency in presentation style as well as spatial locality in template-based content-rich HTML documents, we have developed a novel framework for automatically partitioning such documents into semantic structures. Our framework tightly couples structural analysis of documents with semantic analysis incorporating domain ontologies and lexical databases such as WordNet. We present experimental evidence of the effectiveness of our techniques on a large collection of HTML documents from various news portals.

Building user-friendly GUIs for browsing and filtering RDF/S description bases while exploiting in a transparent way the expressiveness of declarative query/view languages is vital for various Semantic Web applications (e.g., e-learning, e-science). In this paper we present a novel interface, called GRQL, which relies on the full power of the RDF/S data model for constructing on the fly queries expressed in RQL. More precisely, a user can navigate graphically through the individual RDF/S class and property definitions and generate transparently the RQL path expressions required to access the resources of interest. These expressions capture accurately the meaning of its navigation steps through the class (or property) subsumption and/or associations. Additionally, users can enrich the generated queries with filtering conditions on the attributes of the currently visited class while they can easily specify the resource’s class(es) appearing in the query result. To the best of our knowledge, GRQL is the first application-independent GUI able to generate a unique RQL query which captures the cumulative effect of an entire user navigation session.

Faceted search and querying are two well-known paradigms to search the Semantic Web. Querying languages, such as SPARQL, offer expressive means for searching RDF datasets, but they are difficult to use. Query assistants help users to write well-formed queries, but they do not prevent empty results. Faceted search supports exploratory search, i.e., guided navigation that returns rich feedbacks to users, and prevents them to fall in dead-ends (empty results). However, faceted search systems do not offer the same expressiveness as query languages. We introduce Query-based Faceted Search (QFS), the combination of an expressive query language and faceted search, to reconcile the two paradigms. In this paper, the LISQL query language generalizes existing semantic faceted search systems, and covers most features of SPARQL. A prototype, Sewelis (aka. Camelis 2), has been implemented, and a usability evaluation demonstrated that QFS retains the ease-of-use of faceted search, and enables users to build complex queries with little training.

Effective techniques for keyword search over RDF databases incorporate an explicit interpretation phase that maps keywords in a keyword query to structured query constructs. Because of the ambiguity of keyword queries, it is often not possible to generate a unique interpretation for a keyword query. Consequently, heuristics geared toward generating the top-K likeliest user-intended interpretations have been proposed. However, heuristics currently proposed fail to capture any user-dependent characteristics, but rather depend on database-dependent properties such as occurrence frequency of subgraph pattern connecting keywords. This leads to the problem of generating top-K interpretations that are not aligned with user intentions. In this paper, we propose a context-aware approach for keyword query interpretation that personalizes the interpretation process based on a user’s query context. Our approach addresses the novel problem of using a sequence of structured queries corresponding to interpretations of keyword queries in the query history as contextual information for biasing the interpretation of a new query. Experimental results presented over the DBpedia dataset show that our approach outperforms the state-of-the-art technique on both efficiency and effectiveness, particularly for ambiguous queries.

Knowledge interaction in Web context is a challenging problem. For instance, it requires to deal with complex structures able to filter knowledge by drawing a meaningful context boundary around data. We assume that these complex structures can be formalized as Knowledge Patterns (KPs), aka frames. This Ph.D. work is aimed at developing methods for extracting KPs from the Web and at applying KPs to exploratory search tasks. We want to extract KPs by analyzing the structure of Web links from rich resources, such as Wikipedia.
