Overview of Vector Databases
A vector database is a database designed primarily to store vector data and to run queries and searches in vector space.
Since 2023, vector databases have been heavily marketed and a large number of vendors have emerged. This was driven largely by the rise of ChatGPT, which uses the GPT models described in “Overview of GPT and examples of algorithms and implementations”.
Vector databases can be used in a configuration called RAG (retrieval-augmented generation) to compensate for weaknesses of ChatGPT, such as handling the latest news and unpublished information.
Vector databases are designed to search for data by vector similarity and to retrieve relevant data efficiently. Many use k-NN (k-nearest neighbor) search, often in an approximate form, to retrieve high-dimensional data, and apply techniques such as quantization and partitioning to optimize retrieval performance.
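The partitioning idea mentioned above can be illustrated with a minimal, standard-library sketch. It is not how any particular product implements its index: the function names (`build_partitions`, `partitioned_search`), the sample data, and the fixed centroids are all assumptions made for illustration. Real systems usually learn the centroids with a clustering algorithm such as k-means.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_partitions(vectors, centroids):
    """Assign every vector id to the partition of its nearest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        partitions[nearest].append(vec_id)
    return partitions

def partitioned_search(query, vectors, centroids, partitions, k=1, nprobe=1):
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in partitions[c]]
    ranked = sorted(candidates, key=lambda i: l2(query, vectors[i]))
    return ranked[:k]

# Hypothetical sample data: three small clusters in two dimensions.
vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.8, 0.1], [10.0, 0.3]]
# Assumed fixed centroids; normally these would be learned from the data.
centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]

partitions = build_partitions(vectors, centroids)
print(partitions)
print(partitioned_search([5.05, 5.1], vectors, centroids, partitions, k=2, nprobe=1))
```

Raising `nprobe` scans more partitions, which trades speed for a lower chance of missing a true neighbor that sits just across a partition boundary.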
In this article, we will discuss these vector databases.
First, vector databases have the following characteristics:
1. Vector data storage:
Unlike a general-purpose RDBMS, a vector database is not built around relational operations between tables. Instead, it stores vector data such as numeric vectors and feature vectors, which enables searches and queries based on vector similarity.
2. Faster similarity searches:
Vector databases evaluate similarity by computing inner products and distances between vectors, and they index the vectors so that similarity search runs at high speed. This is useful for image search, voice search, natural language processing, and other fields.
3. Processing of high-dimensional data:
Vector databases are designed to efficiently process high-dimensional vector data, for example, vector representations of feature-extracted images or text.
4. Integration with machine learning:
Vector databases are used in conjunction with machine learning models and clustering algorithms. This enables grouping and classification of vector data.
5. Diverse fields of use:
Vector databases are used in various fields, such as image databases, sound databases, text databases, and embedding stores for machine learning.
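The similarity measures named in the list above (inner product and distance) can give different rankings for the same data. The following is a minimal sketch in plain Python, with hypothetical helper names and sample vectors, that scores every stored vector exhaustively. It is an exact search for illustration only, not the indexed search a vector database performs.

```python
import math

def inner_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Inner product of the two vectors after scaling both to unit length."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

def top_k(query, vectors, k, score, higher_is_better):
    """Exhaustive (exact) k-NN: score every stored vector, then sort."""
    order = sorted(range(len(vectors)),
                   key=lambda i: score(query, vectors[i]),
                   reverse=higher_is_better)
    return order[:k]

# Hypothetical data: vector 1 points in a different direction but is much longer.
stored = [[1.0, 0.0], [10.0, 10.0], [0.9, 0.5]]
query = [1.0, 0.1]

print(top_k(query, stored, 1, euclidean_distance, higher_is_better=False))
print(top_k(query, stored, 1, inner_product, higher_is_better=True))
print(top_k(query, stored, 1, cosine_similarity, higher_is_better=True))
```

With this data, Euclidean distance and cosine similarity both pick vector 0, while the raw inner product picks the long vector 1. This is why the choice of metric, or normalizing vectors before insertion, matters when configuring a collection.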
Types of Vector Databases
The following describes the main vector databases and vector search libraries currently available.
1. Faiss: Features include open source, GPU acceleration via CUDA, support for billions of vectors, and a wide range of algorithm options including IVFADC, PQ, and HNSW.
2. Annoy (Approximate Nearest Neighbors Oh Yeah): Features include open source, use of a forest of trees for vector space partitioning, and support for memory-mapped files for large data sets.
3. NMSLIB (Non-Metric Space Library): Features include open source, support for a variety of distance measures including Cosine, Jaccard, and Levenshtein, use of Hierarchical Navigable Small World (HNSW) graphs for efficient search, and optimization for both dense and sparse vectors.
4. Milvus: Features include open source, the ability to process up to 100 billion vectors with sub-second latency, support for distance measures such as Euclidean, Cosine, and Jaccard, and support for index types such as IVF_FLAT, IVF_PQ, and HNSW.
5. Pinecone: Features include pay-as-you-go pricing, a fully managed service, built-in data versioning and rollback capabilities, and multi-tenancy support.
6. Zilliz: Features include custom pricing, a REST API, support for basic attribute search operations, and a cloud-based, scalable service without operational overhead.
7. Qdrant: Open source, with both local and cloud-based deployment options. It can also operate in in-memory mode and is written in Rust.
8. Chroma: Open source, running in embedded mode by default with tightly integrated database and application layers. It offers a convenient Python/JavaScript interface.
9. LanceDB: It is designed to perform natively distributed indexing and searching on multimodal data (images, audio, text) and is based on the Lance data format, a new and innovative columnar data format for ML.
10. Vespa: Combines proven keyword search with custom vector search on HNSW to provide the most “enterprise-ready” hybrid search capability.
11. Vald: It is designed to handle multimodal data storage through a highly distributed architecture and includes useful features such as index backup. It uses a very fast ANN search algorithm, NGT (Neighborhood Graph & Tree), which is one of the fastest ANN algorithms when used in conjunction with a highly distributed vector index.
12. Elasticsearch, Redis, PostgreSQL: Each has its own vector search options, but these existing databases are designed to be general-purpose and do not store or index vector data in the most optimal way, which can result in poor performance when searching data containing more than a million vectors.
Of these, LanceDB and Chroma are the most recent to appear, and even the oldest, Vespa, appeared in 2017.
Example Implementation of a Vector Database
As an example of a vector database implementation, we use Milvus, a database that supports vector data storage, retrieval, and similarity search and is widely used in fields such as machine learning and data mining.
Milvus Installation: To use Milvus, you must first install it. The following is an example of installation using Docker. The single-container command shown here follows the legacy 0.x releases; current 2.x releases are started differently, so consult the official installation guide for the version you use.
docker pull milvusdb/milvus:latest
docker run -d --name milvus_cpu_0.11.0 -p 19530:19530 -p 19121:19121 milvusdb/milvus:latest
This starts Milvus and exposes the gRPC endpoint on port 19530 and the REST API on port 19121.
Importing data into Milvus: Before data can be stored in Milvus, it must be vectorized; Milvus treats vectors as arrays of type float32.
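# Assumes the legacy pymilvus client (Milvus 0.10.x / 1.x style API); newer releases use a different API
from milvus import Milvus, MetricType
# Connect to the Milvus server over gRPC
client = Milvus(host='localhost', port='19530')
# Prepare the vector data (each vector is a list of floats)
vectors = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
# Create a 3-dimensional collection that compares vectors by Euclidean (L2) distance
collection_name = 'example_collection'
client.create_collection({'collection_name': collection_name, 'dimension': 3, 'index_file_size': 1024, 'metric_type': MetricType.L2})
# Insert the vectors, then flush so that they become searchable
status, ids = client.insert(collection_name=collection_name, records=vectors)
client.flush([collection_name])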
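Because Milvus treats vectors as float32 arrays, values prepared as ordinary Python floats (64-bit) are rounded when they are stored. The following standard-library sketch shows the size of that rounding. The helper name `to_float32` and the sample values are assumptions for illustration and are not part of the Milvus client.

```python
import struct

def to_float32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack('f', struct.pack('f', value))[0]

vector = [0.1, 0.5, 2.0]
as_float32 = [to_float32(x) for x in vector]

for original, stored in zip(vector, as_float32):
    # Values such as 0.5 and 2.0 are exactly representable; 0.1 is not.
    print(original, stored, abs(original - stored))
```

The rounding error is tiny for typical embedding values, but it means that distances returned by the database may differ slightly from distances computed locally in float64.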
Performing vector similarity searches: Milvus can perform similarity searches on vector data.
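# Prepare the query (a list containing one 3-dimensional vector)
query_vector = [[2.0, 3.0, 4.0]]
# Run the similarity search; nprobe only affects IVF-type indexes
# (keyword arguments follow the legacy pymilvus 0.10.x / 1.x style API)
search_params = {'nprobe': 16}
status, results = client.search(collection_name=collection_name, query_records=query_vector, top_k=1, params=search_params)
print(results)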
In this example, the single stored vector most similar to the query vector is returned (top_k=1). Under Euclidean distance, that is [1.0, 2.0, 3.0].
This is a basic use case for Milvus; more complex searches and index manipulations require more detailed configuration. They are described in the official documentation and in the GitHub repository.
Challenges and Countermeasures for Vector Databases
Vector databases also face some challenges, and measures are being considered to address them. They are described below.
1. Curse of dimensionality:
Challenge: As the dimensionality of data increases, the distances between vectors become nearly equal, making it difficult to tell near neighbors from distant ones.
Solution: Apply dimensionality reduction techniques or reduce the number of dimensions through feature selection.
2. Scalability problem:
Challenge: Query processing can become slow for large data sets and high-dimensional data.
Solution: Improve scalability by optimizing indexes, distributing the database across nodes, and making query execution more efficient.
3. Index efficiency:
Challenge: If indexes are not efficient for large data sets, search performance will suffer.
Solution: Improve search performance by optimizing index structure and search algorithms, and by setting appropriate parameters.
4. Updating and deleting data:
Challenge: Vectors that are already indexed are difficult to update and delete, which makes it hard to respond to data changes.
Solution: Handle dynamic data changes through periodic index rebuilds or through index structures that support incremental updates.
5. Security and privacy:
Challenge: Vector data may contain sensitive information, which may raise security and privacy concerns.
Solution: Ensure data security by implementing appropriate security measures such as encryption and access control.
6. Domain dependencies:
Challenge: The performance and effectiveness of a vector database depend on the characteristics of the data and the domain in which it is used.
Solution: Adapt the vector database to its domain through domain-specific configuration and optimization.
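The curse of dimensionality described in the first challenge can be observed with a small experiment. The sketch below is an illustration under stated assumptions (uniformly random points, a hypothetical helper named `relative_contrast`, a fixed seed) and is not a benchmark. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(distances) - min(distances)) / min(distances)

# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

In low dimensions the nearest point is many times closer than the farthest one. In high dimensions the ratio approaches zero, which is why dimensionality reduction or feature selection can make similarity search more meaningful.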