使用weaviate实现向量存储_weaviate集群

作者：我家自动化 | 2024-05-26 03:09:53

踩

weaviate集群

使用weaviate实现向量存储

1 介绍

当下开源的向量数据库比较多，主要分嵌入式向量数据库（数据库和应用在一起）和客户端-服务器模型向量数据库（客户端和服务器分离），比较出名数据库如下：嵌入式包括Chroma、lancedb，客户端和服务器分离的包括：Milvus、Faiss、Qdrant和Weaviate等。本文选择Weaviate。

Weaviate是一个人工智能原生矢量数据库，强调与知识图的灵活高效交互，支持单节点和集群部署，支持关键词检索和语义检索。

Chroma在易用性上占优，适合在没用向量数据库服务器的场景下使用，是音频和视频搜索的理想选择。

Milvus强调在存储效率和数据查询性能上的平衡，适用于向量嵌入、高效相似搜索和AI应用；

Faiss可进行GPU加速的高速检索，擅长处理庞大数据集的快速最近邻搜索，适应用于快速、密集向量相似性搜索和分组。

嵌入式向量数据库

（1）Chroma

# Github地址
https://github.com/chroma-core/chroma
1
2

（2）lancedb

https://github.com/lancedb/lancedb
1

客户端和服务器分离的向量数据库

（1）Milvus

# Github地址
https://github.com/milvus-io/milvus
1
2

（2）Faiss

# Github地址
https://github.com/facebookresearch/faiss
1
2

（3）weaviate

# Github地址
https://github.com/weaviate/weaviate

# 官网地址
https://weaviate.io/developers/weaviate

# 使用的参考地址
https://weaviate.io/developers/weaviate/quickstart
1
2
3
4
5
6
7
8

2 使用Docker安装

参考官网：

https://weaviate.io/developers/weaviate/installation/docker-compose
1

参数说明

# 设置环境变量：

# 1 开启验证
# 开启连接数据库时验证
AUTHENTICATION_APIKEY_ENABLED: true

# 配置登录数据库的密钥，密钥列表如下，多个密钥对应多个用户
AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'jane-secret-key,ian-secret-key'

# 配置登录数据库的用户，用户列表与上述密钥一一对应
AUTHENTICATION_APIKEY_USERS: 'jane@doe.com,ian-smith'

# 2 开启权限
# 配置登录数据库的用户权限
AUTHORIZATION_ADMINLIST_ENABLED: 'true'

# 具有管理员权限，可读、可写的用户列表，多个用户用英语逗号隔开
AUTHORIZATION_ADMINLIST_USERS: 'jane@doe.com'

# 仅仅具有可读权限的用户列表，多个用户用英文逗号隔开
AUTHORIZATION_ADMINLIST_READONLY_USERS: 'ian-smith'
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

docker命令安装weaviate

docker run -itd --name=weaviate \
--restart=always \
-p 8080:8080 \
-p 50051:50051 \
-e "AUTHENTICATION_APIKEY_ENABLED=true" \
-e "AUTHENTICATION_APIKEY_ALLOWED_KEYS=jane-secret-key,ian-secret-key" \
-e "AUTHENTICATION_APIKEY_USERS=jane@doe.com,ian-smith" \
-e "AUTHORIZATION_ADMINLIST_ENABLED=true" \
-e "AUTHORIZATION_ADMINLIST_USERS=jane@doe.com" \
-e "AUTHORIZATION_ADMINLIST_READONLY_USERS=ian-smith" \
cr.weaviate.io/semitechnologies/weaviate:1.25.0
1
2
3
4
5
6
7
8
9
10
11

对应关系

序号	密钥	用户	权限
1	jane-secret-key	jane@doe.com	管理员，可读可写
2	ian-secret-key	ian-smith	可读

3 使用Weaviate

3.1 安装环境

⚠️ 下面安装方式适应用客户端版本为4，Weaviate版本是1.23 or higher.

# 安装weaviate
pip install -U weaviate-client -i https://pypi.tuna.tsinghua.edu.cn/simple

# 安装SentenceTransformer
pip install -U sentence-transformers -i https://pypi.tuna.tsinghua.edu.cn/simple
1
2
3
4
5

3.2 Python使用Weaviate


# 词嵌入模型
from sentence_transformers import SentenceTransformer


# 使用 weaviate
import weaviate
from weaviate.auth import AuthApiKey
from weaviate.classes.config import Property, DataType
from weaviate.collections.classes.config import Configure
from weaviate.collections.classes.grpc import MetadataQuery


# 使用SentenceTransformer模型
model = SentenceTransformer("E:/model/all-MiniLM-L6-v2")


ip_addr = "192.168.108.200"

# 自定义连接
client = weaviate.connect_to_custom(
    http_host=ip_addr,
    http_port=8080,
    http_secure=False,
    grpc_host=ip_addr,
    grpc_port=50051,
    grpc_secure=False,
    # 对应AUTHENTICATION_APIKEY_ALLOWED_KEYS中的密钥
    # 注意：此处只需要密钥即可，不需要用户名称
    auth_credentials=AuthApiKey("jane-secret-key")
)


collection_name = "article"


def create_collection():
    """
    创建索引
    :return:
    """
    client.collections.create(
        name="Article",
        # 定义文本属性
        properties=[
            Property(name="title", data_type=DataType.TEXT),
            Property(name="summary", data_type=DataType.TEXT)
        ],
        # 定义向量属性
        vectorizer_config=[
            # 注意使用text2vec_openai、text2vec_cohere等需要设置连接API，并且官网提示后期将删除
            # 例如：vectorizer_config=Configure.Vectorizer.text2vec_openai(),
            # 设置不用模型，因为我用的时SentenceTransformer，所以使用none
            Configure.NamedVectors.none(name="title")
        ]
    )


def delete_collection():
    """
    删除索引
    :return:
    """
    # 删除索引
    client.collections.delete(collection_name)


def query_collection_info():
    # 查看读的状态
    print(client.is_ready())

    # 查看连接状态
    print(client.is_connected())

    # 查看索引
    print(client.collections.list_all())

    # 获取索引
    article_collection = client.collections.get(collection_name)

    # 获取全部数据
    # include_vector=True表示显示向量
    print(article_collection.query.fetch_objects(include_vector=True))

    # 迭代获取数据
    for item in article_collection.iterator(include_vector=True):
        print(item.properties)
        print(item.vector)


def insert_collection():
    """
    插入1条数据
    :return:
    """
    # 获取索引
    article_collection = client.collections.get(collection_name)

    # 插入1条数据，返回唯一标识uuid
    uuid = article_collection.data.insert(
        properties={
            "title": "A delicious Riesling",
            "summary": "This wine is a delicious Riesling which pairs well with seafood."
        },
        # 添加向量数据
        vector={
            "title": model.encode("A delicious Riesling"),
            "summary": model.encode("This wine is a delicious Riesling which pairs well with seafood.")
        }
    )
    print(uuid)


def batch_collection():
    """
    批量插入数据
    :return:
    """
    data_list = [
        {
            "title": "A man is eating food.",
            "summary": ""
        },
        {
            "title": "A man is eating a piece of bread.",
            "summary": ""
        },
        {
            "title": "The girl is carrying a baby.",
            "summary": ""
        },
        {
            "title": "A man is riding a horse.",
            "summary": ""
        },
        {
            "title": "A woman is playing violin.",
            "summary": ""
        },
        {
            "title": "Two men pushed carts through the woods.",
            "summary": ""
        },
        {
            "title": "A man is riding a white horse on an enclosed ground.",
            "summary": ""
        },
        {
            "title": "A monkey is playing drums.",
            "summary": ""
        },
        {
            "title": "A cheetah is running behind its prey.",
            "summary": ""
        }
    ]

    # 嵌入模型
    title_embedding_list = list()
    for item in data_list:
        # 文本转化为嵌入
        title_embedding_list.append(model.encode(item.get("title")))

    # 获取索引
    collection = client.collections.get(collection_name)

    # 添加到数据库中
    with collection.batch.dynamic() as batch:

        # 循环插入数据
        for i, data_row in enumerate(data_list):
            batch.add_object(
                properties=data_row,
                vector= {
                    "title": title_embedding_list[i]
                }
            )


def query_vector_collection():
    """
    使用向量查询数据
    :return:
    """
    article_collection = client.collections.get(collection_name)

    # 如果使用向量查询中的query.near_text，需要配置转化向量的模型，例如：连接互联网的API。
    # 使用向量查询，使用自定义的模型生成向量
    response = article_collection.query.near_vector(
        # 文本向量
        near_vector=model.encode("A man is eating pasta.").tolist(),
        # 限制输出的数量
        limit=10,
        # 设置返回的数据是否显示距离
        return_metadata=MetadataQuery(distance=True)
    )

    # 展示数据
    for o in response.objects:
        print(o.properties)
        print(o.metadata.distance)


def query_keyword_collection():
    """
    使用关键词查询数据，此处的关键词查询也叫"BM25 (Best match 25)"或者"sparse vector"。
    :return:
    """
    article_collection = client.collections.get(collection_name)

    # 使用关键词查询
    response = article_collection.query.bm25(
        query="food",
        return_metadata=MetadataQuery(score=True),
        limit=3
    )

    for o in response.objects:
        print(o.properties)
        print(o.metadata.distance)


if __name__ == '__main__':
    # delete_collection()

    # 1 创建索引
    # create_collection()

    # 2 插入数据
    # insert_collection()

    # 3 批量插入数据
    # batch_collection()

    # 4 查询索引
    # query_collection_info()

    # 5 使用向量查询
    query_vector_collection()

    # 6 使用关键词查询
    query_keyword_collection()

    # 关闭客户端
    client.close()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245

执行截图

在这里插入图片描述

声明：本文内容由网友自发贡献，不代表【wpsshop博客】立场，版权归原作者所有，本站不承担相应法律责任。如您发现有侵权的内容，请联系我们。转载请注明出处：https://www.wpsshop.cn/w/我家自动化/article/detail/624920