Machine Learning | Faiss Implementation

Author: weixin_40725706 | 2024-06-17 19:49:58

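1 Preface

Our company's project recently adopted Faiss to compute vector similarity. Suppose there are 100 million users and, for each one, we need the 50 or 100 most similar users. The brute-force approach scans all 100 million users for every single user to pick the top 50/100, which would take practically forever. A framework for fast similarity computation is therefore essential, and that is where Faiss, the star of today's post, comes in.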

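2 What is Faiss

The main purpose of Faiss is similarity search, especially on large-scale data.

2.1 Why was Faiss created?

Faiss is a similarity search project released by Facebook in 2017. It targets the fast retrieval of similar items among multimedia documents, a challenge that traditional query-based search engines cannot handle.
First, the flood of multimedia content produces billions of vectors. Second, and more importantly, finding similar entities means finding similar high-dimensional vectors, which is inefficient and difficult with a standard query language alone.
In short: the data volume is too large, and traditional SQL queries are inefficient and awkward for similarity retrieval, hence Faiss.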

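2.2 Advantages of Faiss

It provides several similarity search methods, supporting a wide range of usages and feature sets.
It is specifically optimized for memory usage and speed.
It offers state-of-the-art GPU implementations for the most relevant indexing methods.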

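2.3 Faiss components

2.3.1 Index

Faiss provides Index classes that wrap different algorithms for different scenarios. For details, see: Faiss indexes (listed under References).

Our project uses the second type: IndexFlatIP (exact search for inner product), which also covers cosine similarity if the vectors are normalized beforehand. Since our goal is precisely the cosine similarity between vectors, this index fits well.

In more detail, choosing this index means building an IndexFlatIP index over the vector set. Together with IndexFlatL2 it is the simplest index type: it stores the raw vectors and runs an exhaustive search by inner product, which equals cosine similarity once the vectors are L2-normalized.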
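
To see why an inner-product index can serve cosine similarity, here is a minimal NumPy sketch of an exhaustive inner-product search over L2-normalized vectors. It is not Faiss itself, only an illustration of the computation described above; the helper names `l2_normalize` and `exact_ip_search` and the toy sizes are assumptions made for this example.

```python
import numpy as np

def l2_normalize(x):
    """Return a copy of x whose rows have unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

def exact_ip_search(xb, xq, k):
    """Exhaustive inner-product search: top-k scores and row ids per query."""
    scores = xq @ xb.T                           # (nq, nb) inner products
    ids = np.argsort(-scores, axis=1)[:, :k]     # largest scores first
    return np.take_along_axis(scores, ids, axis=1), ids

rng = np.random.default_rng(0)
xb = rng.random((1000, 16)).astype('float32')    # toy database (assumed sizes)
xq = xb[:3].copy()                               # query with three known vectors

S, I = exact_ip_search(l2_normalize(xb), l2_normalize(xq), k=5)

# Cosine similarity computed directly from the raw vectors, for comparison
cos = (xq @ xb.T) / (np.linalg.norm(xq, axis=1)[:, None]
                     * np.linalg.norm(xb, axis=1)[None, :])
same = np.allclose(S, np.take_along_axis(cos, I, axis=1), atol=1e-4)
print(I[:, 0])   # each query should find itself first
print(same)      # inner product of normalized vectors matches cosine
```

In a Faiss setting the same idea would apply: normalize both the database and the query vectors before adding them to, or searching, an inner-product index.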

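2.3.2 Principles for choosing an Index

If exact results are required: IndexFlatL2 (or IndexFlatIP for inner product).
If the dataset has fewer than 1 million vectors: partition the vectors with k-means clustering (an IVF index).
If memory is a concern: a family of compression-based methods, such as the product quantization used in section 3.6.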

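2.4 Optimization: the cell-probe method

A typical way to speed up the search, at the cost of losing the guarantee of finding the nearest neighbor, is to use a partitioning technique such as k-means. The corresponding algorithms are sometimes called cell-probe methods:
We use a partition-based method with multi-probing (it can be seen as a variant of best-bin KD-trees).
The feature space is partitioned into ncells cells.
Database vectors are assigned to one of these cells by a hashing function (in the case of k-means, assignment to the nearest centroid) and stored in an inverted file structure made of ncells inverted lists.
At query time, a set of nprobe inverted lists is selected, and the query is compared with every database vector assigned to those lists. This way only a fraction of the database is compared with the query: as a first approximation this fraction is nprobe / ncells, but this is usually an underestimate because the inverted lists do not have equal lengths. The search fails when the cell containing the query's nearest neighbor is not among the selected ones.
In C++, the corresponding index is IndexIVFFlat.
Its constructor takes an index as a parameter (the quantizer), which is used to assign vectors to the inverted lists. The query is searched in this quantizer index, and the returned vector id(s) identify the inverted lists to visit.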
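
The steps above can be sketched in plain NumPy. This is a toy illustration under stated assumptions, not the Faiss implementation: the functions `nearest`, `kmeans`, `build_inverted_lists` and `ivf_search` are hypothetical helpers, the k-means is a basic Lloyd loop, and the sizes are chosen only so the example runs quickly.

```python
import numpy as np

def nearest(xb, xq, k):
    """Ids of the k rows of xb closest (squared L2) to each row of xq."""
    d2 = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(d2, axis=1)[:, :k]

def kmeans(x, ncells, niter=10, seed=0):
    """Plain Lloyd k-means; returns centroids of shape (ncells, d)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), ncells, replace=False)].copy()
    for _ in range(niter):
        assign = nearest(centroids, x, 1)[:, 0]
        for c in range(ncells):
            members = x[assign == c]
            if len(members):                  # keep the old centroid if a cell is empty
                centroids[c] = members.mean(axis=0)
    return centroids

def build_inverted_lists(x, centroids):
    """Assign every database vector to its nearest centroid (its cell)."""
    assign = nearest(centroids, x, 1)[:, 0]
    return [np.flatnonzero(assign == c) for c in range(len(centroids))]

def ivf_search(x, centroids, lists, q, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to q."""
    cells = nearest(centroids, q[None, :], nprobe)[0]
    cand = np.concatenate([lists[c] for c in cells])
    d2 = ((x[cand] - q) ** 2).sum(axis=1)
    return cand[np.argsort(d2)[:k]], len(cand) / len(x)  # ids, fraction scanned

rng = np.random.default_rng(1234)
xb = rng.random((2000, 8)).astype('float32')
xq = rng.random((20, 8)).astype('float32')
ncells, k = 16, 4

centroids = kmeans(xb, ncells)
lists = build_inverted_lists(xb, centroids)
exact = nearest(xb, xq, k)                    # brute-force ground truth

recall, scanned_frac = {}, {}
for nprobe in (1, 4, ncells):
    hits, scanned = 0, 0.0
    for i, q in enumerate(xq):
        ids, frac = ivf_search(xb, centroids, lists, q, k, nprobe)
        hits += len(set(ids.tolist()) & set(exact[i].tolist()))
        scanned += frac
    recall[nprobe] = hits / (len(xq) * k)
    scanned_frac[nprobe] = scanned / len(xq)
    print(nprobe, recall[nprobe], scanned_frac[nprobe])
```

Probing every cell (nprobe equal to ncells) scans the whole database and recovers the exact neighbors; smaller nprobe values scan less and may miss some of them, which is the trade-off described above.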

3 Faiss in Python

3.1 Import the library

import faiss

3.2 Prepare the data

import numpy as np
d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32') # database vectors (also used for training)
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32') # query vectors
xq[:, 0] += np.arange(nq) / 1000.

print(xb.shape)
xb

(100000, 64)





array([[1.91519454e-01, 6.22108757e-01, 4.37727749e-01, ...,
        6.24916732e-01, 4.78093803e-01, 1.95675179e-01],
       [3.83317441e-01, 5.38736843e-02, 4.51648414e-01, ...,
        1.51395261e-01, 3.35174650e-01, 6.57551765e-01],
       [7.53425434e-02, 5.50063960e-02, 3.23194802e-01, ...,
        3.44416976e-01, 6.40880406e-01, 1.26205325e-01],
       ...,
       [1.00811470e+02, 5.90245306e-01, 7.98893511e-01, ...,
        3.39859009e-01, 3.01949501e-01, 8.53854537e-01],
       [1.00669464e+02, 9.16068792e-01, 9.55078781e-01, ...,
        5.95364332e-01, 3.84918079e-02, 1.05637990e-01],
       [1.00855637e+02, 5.91134131e-01, 6.78907931e-01, ...,
        2.18976989e-01, 6.53015897e-02, 2.17538327e-01]], dtype=float32)

print(xq.shape)
xq

(10000, 64)





array([[ 0.81432974,  0.7409969 ,  0.8915324 , ...,  0.72459674,
         0.893881  ,  0.6574571 ],
       [ 0.5844774 ,  0.797842  ,  0.74140453, ...,  0.6768835 ,
         0.05907924,  0.6396156 ],
       [ 0.75040764,  0.02659794,  0.5495097 , ...,  0.69562465,
         0.16268532,  0.76653737],
       ...,
       [10.96773   ,  0.05037309,  0.7342035 , ...,  0.89510185,
         0.6490696 ,  0.86151606],
       [10.831193  ,  0.70606154,  0.1922274 , ...,  0.8026039 ,
         0.6854174 ,  0.60209423],
       [10.078484  ,  0.39106598,  0.01359335, ...,  0.63193923,
         0.12561724,  0.78384215]], dtype=float32)

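3.3 Create the index

Faiss builds an index that preprocesses the vectors to make queries more efficient.
Faiss offers many index types. Here we pick the simplest one, IndexFlatL2, which does a brute-force search with L2 distance.
The vector dimension d must be given when the index is created. Most indexes need a training step; IndexFlatL2 skips it.
Once the index is created and trained (if required), we can call add and search.
- add inserts the database vectors into the index, typically the same samples used for training
- search finds the vectors most similar to the queries
Some indexes can store integer IDs: each vector can be given an ID, and a search then returns the IDs of the similar vectors together with their similarity (or distance). If no IDs are given, vectors are numbered in insertion order starting from 0. IndexFlatL2 itself does not support custom IDs; it has to be wrapped in an IndexIDMap for that.

index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)
index.add(xb)                  # add the database vectors to the index
print(index.ntotal)            # total number of indexed vectors (one per row)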

True
100000

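3.4 Search for similar vectors

With the vectors in the index, we can pass in query vectors to find their nearest neighbors.
D holds the distances to the neighbors and I holds their IDs; both have shape (nq, k). For IndexFlatL2 the distances are squared Euclidean distances.

k = 4                          # we want to see 4 nearest neighbors
D, I = index.search(xq, k)     # actual search
print(I[:5])                   # neighbor IDs for the first 5 queries
print(D[-5:])                  # neighbor distances for the last 5 queries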

[[ 381  207  210  477]
 [ 526  911  142   72]
 [ 838  527 1290  425]
 [ 196  184  164  359]
 [ 526  377  120  425]]
[[6.5315704 6.97876   7.0039215 7.013794 ]
 [4.335266  5.2369385 5.3194275 5.7032776]
 [6.072693  6.5767517 6.6139526 6.7323   ]
 [6.637512  6.6487427 6.8578796 7.0096436]
 [6.2183685 6.4525146 6.548767  6.581299 ]]
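
To make the meaning of D and I concrete, here is a hedged NumPy sketch of an exhaustive squared-L2 search. It mimics the kind of result a flat L2 index returns but is not Faiss code; the function name `flat_l2_search` and the reduced database size are assumptions for illustration.

```python
import numpy as np

def flat_l2_search(xb, xq, k):
    """Exhaustive search returning squared L2 distances and neighbor ids."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, computed for all pairs at once
    d2 = (xq ** 2).sum(1)[:, None] - 2.0 * xq @ xb.T + (xb ** 2).sum(1)[None, :]
    d2 = np.maximum(d2, 0.0)                  # clip tiny negative rounding errors
    I = np.argsort(d2, axis=1)[:, :k]         # ids of the k closest rows of xb
    D = np.take_along_axis(d2, I, axis=1)     # their distances, ascending
    return D, I

d, nb, nq = 64, 2000, 5                       # smaller than the post's data
rng = np.random.default_rng(1234)
xb = rng.random((nb, d)).astype('float32')
xq = rng.random((nq, d)).astype('float32')

D, I = flat_l2_search(xb, xq, k=4)
print(D.shape, I.shape)                       # (5, 4) (5, 4)
```

Each row corresponds to one query, and the columns are ordered from the closest neighbor to the k-th closest.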

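3.5 Speed up the search

When there are too many vectors to store, brute-force search with IndexFlatL2 becomes slow.
The way to speed up the search is IndexIVFFlat, an inverted file index. It uses k-means to build cluster centroids, finds the centroids nearest to the query, and then compares the query against all vectors in those clusters to obtain the similar vectors.
- Creating an IndexIVFFlat requires another index that serves as the quantizer and computes distances or similarities to the centroids.
- The index must be trained before calling add.
The parameters involved are:
- faiss.METRIC_L2: Faiss defines two similarity metrics, faiss.METRIC_L2 and faiss.METRIC_INNER_PRODUCT. The first is Euclidean distance, the second is the inner product (equivalent to cosine similarity when the vectors are L2-normalized).
- nlist: the number of cluster centroids
- k: the number of most similar vectors to return (an argument of search, not of the constructor)
- index.nprobe: the number of clusters to visit at search time, 1 by default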

import time

nlist = 100                       # number of cluster centroids (cells)
k = 4
quantizer = faiss.IndexFlatL2(d)  # the other index
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
# METRIC_L2 (Euclidean distance) is passed explicitly here
assert not index.is_trained

t0 = time.time()
index.train(xb)                # train the k-means quantizer on the database vectors
t1 = time.time()
print('Training time: %.2f ' % (t1-t0))
assert index.is_trained

t0 = time.time()
index.add(xb)                  # add may be a bit slower as well
t1 = time.time()
print('Time to add vectors: %.2f ' % (t1-t0))

t0 = time.time()
D, I = index.search(xq, k)     # actual search (result is overwritten by the next line)
D, I = index.search(xb[:5], k) # sanity check: search with the first 5 database vectors
print('Distances when searching the index with its own vectors: ', D)
print('IDs of the nearest neighbors of the first 5 database vectors: ', I)
print('Distances to the nearest neighbors of the first 5 database vectors: ', D)

t1 = time.time()               # note: this timing covers both searches and the prints
print('Search time with the default nprobe = 1: %.2f ' % (t1-t0))

print(I[-5:])                  # I still holds the sanity-check result for xb[:5]
index.nprobe = 10              # default nprobe is 1, try a few more

t0 = time.time()
D, I = index.search(xq, k)
t1 = time.time()
print('Search time with nprobe = 10: %.2f ' % (t1-t0))

print(I[-5:])                  # neighbors of the 5 last queries

Training time: 0.13 
Time to add vectors: 0.10 
Distances when searching the index with its own vectors:  [[0.        7.1751738 7.20763   7.2511625]
 [0.        6.3235645 6.684581  6.799946 ]
 [0.        5.7964087 6.391736  7.2815123]
 [0.        7.2779055 7.527987  7.6628466]
 [0.        6.7638035 7.2951202 7.3688145]]
IDs of the nearest neighbors of the first 5 database vectors:  [[  0 393 363  78]
 [  1 555 277 364]
 [  2 304 101  13]
 [  3 173  18 182]
 [  4 288 370 531]]
Distances to the nearest neighbors of the first 5 database vectors:  [[0.        7.1751738 7.20763   7.2511625]
 [0.        6.3235645 6.684581  6.799946 ]
 [0.        5.7964087 6.391736  7.2815123]
 [0.        7.2779055 7.527987  7.6628466]
 [0.        6.7638035 7.2951202 7.3688145]]
Search time with the default nprobe = 1: 0.10 
[[  0 393 363  78]
 [  1 555 277 364]
 [  2 304 101  13]
 [  3 173  18 182]
 [  4 288 370 531]]
Search time with nprobe = 10: 1.07 
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]
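
When comparing nprobe settings, speed is only half of the picture; a common companion measure is recall against an exact search. Below is a small sketch of such a measure. The helper `recall_at_k` is a hypothetical name, and the ID arrays are made-up toy values, not results from this post.

```python
import numpy as np

def recall_at_k(I_exact, I_approx):
    """Fraction of exact top-k neighbors that the approximate search also returned."""
    hits = sum(len(set(e) & set(a))
               for e, a in zip(I_exact.tolist(), I_approx.tolist()))
    return hits / I_exact.size

# Toy ID arrays (made up for illustration): 2 queries, k = 4
I_exact = np.array([[0, 5, 9, 2],
                    [1, 7, 3, 8]])
I_approx = np.array([[0, 5, 4, 2],    # misses neighbor 9
                     [1, 7, 3, 8]])   # matches the exact result
r = recall_at_k(I_exact, I_approx)
print(r)                              # 7 of 8 neighbors match -> 0.875
```

In practice one could pass the I returned by an exact flat index as I_exact and the I returned by the IVF index at a given nprobe as I_approx, then raise nprobe until the recall is acceptable.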

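3.6 Reduce memory usage

Both IndexFlatL2 and IndexIVFFlat keep all vectors uncompressed in memory.
To handle large datasets, Faiss provides a compression scheme based on the product quantizer (PQ) that encodes each vector into a specified number of bytes. The stored vectors are then compressed, and the distances returned by a search are approximate.
This is done with IndexIVFPQ: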

nlist = 100
m = 8                             # number of sub-quantizers (= bytes per vector with 8-bit codes)
k = 4
quantizer = faiss.IndexFlatL2(d)  # this remains the same
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
# 8 specifies that each sub-vector is encoded as 8 bits
index.train(xb)
index.add(xb)
D, I = index.search(xb[:5], k) # sanity check
print('IDs of the nearest neighbors of the first 5 database vectors: ', I)
print('Distances to the nearest neighbors of the first 5 database vectors: ', D)
index.nprobe = 10              # make comparable with experiment above
D, I = index.search(xq, k)     # search
print(I[-5:])

IDs of the nearest neighbors of the first 5 database vectors:  [[   0   78  608  159]
 [   1 1063  555  380]
 [   2  304  134   46]
 [   3   64  773  265]
 [   4  288  827  531]]
Distances to the nearest neighbors of the first 5 database vectors:  [[1.6157446 6.1152263 6.4348035 6.564185 ]
 [1.389575  5.6771317 5.9956017 6.486294 ]
 [1.7025063 6.121688  6.189084  6.489888 ]
 [1.8057697 6.544031  6.6684756 6.8593984]
 [1.4920276 5.79976   6.190908  6.3791513]]
[[ 9900  8746  9853 10437]
 [10494 10507 11373  9014]
 [10719 11291 10424 10138]
 [10122  9638 11113 10630]
 [ 9229 10304  9644 10370]]
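
For intuition about what a product quantizer stores, here is a simplified NumPy sketch of the encode and decode steps. It is an illustration under stated assumptions rather than the Faiss algorithm: it quantizes the raw vectors directly, uses a basic Lloyd k-means, and uses 32 centroids per sub-vector instead of the 256 implied by 8-bit codes so that it runs quickly. All function names are hypothetical.

```python
import numpy as np

def assign_nearest(x, cent):
    """Index of the nearest centroid (squared L2) for each row of x."""
    d2 = ((x[:, None, :] - cent[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(x, ncent, niter=10, seed=0):
    """Plain Lloyd k-means; returns centroids of shape (ncent, dsub)."""
    rng = np.random.default_rng(seed)
    cent = x[rng.choice(len(x), ncent, replace=False)].copy()
    for _ in range(niter):
        assign = assign_nearest(x, cent)
        for c in range(ncent):
            members = x[assign == c]
            if len(members):                   # keep the old centroid if empty
                cent[c] = members.mean(axis=0)
    return cent

def pq_train(x, m, ksub):
    """Learn one codebook of ksub centroids for each of the m sub-vector blocks."""
    return [kmeans(block, ksub) for block in np.split(x, m, axis=1)]

def pq_encode(x, codebooks):
    """Replace each sub-vector by the index of its nearest centroid (one byte each)."""
    blocks = np.split(x, len(codebooks), axis=1)
    codes = [assign_nearest(b, cb) for b, cb in zip(blocks, codebooks)]
    return np.stack(codes, axis=1).astype('uint8')

def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the selected centroids."""
    return np.hstack([cb[codes[:, j]] for j, cb in enumerate(codebooks)])

rng = np.random.default_rng(1234)
x = rng.random((2000, 64)).astype('float32')
m, ksub = 8, 32                                # toy setting (assumption)

codebooks = pq_train(x, m, ksub)
codes = pq_encode(x, codebooks)
x_hat = pq_decode(codes, codebooks)
err = float(((x - x_hat) ** 2).sum(axis=1).mean())

print(codes.shape, codes.dtype)                # (2000, 8) uint8
print(err)                                     # positive: the reconstruction is lossy
```

Because only the codes are kept, a stored vector is no longer identical to the original, which is consistent with a vector's distance to itself being non-zero in a PQ-based index.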

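We defined the dimension as d = 64 with float32 vectors, so each raw vector takes 64 * 32 / 8 = 256 bytes.
Here each vector is compressed to 8 bytes, so the compression ratio is (64*32/8) / 8 = 32.
As the results above show, the distance from the first vector to itself is 1.6157446 rather than the 0 obtained with IndexIVFFlat, because the returned distances are approximate. Each vector is still returned as its own nearest neighbor, but the remaining top-k IDs only partly match the IndexIVFFlat results.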
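
The arithmetic above can be written out as a short calculation. The function name `pq_compression` is hypothetical, and the memory figures count only the vector payload, ignoring IDs, codebooks, and other index overhead.

```python
def pq_compression(d, m, nbits=8, float_bits=32):
    """Bytes per raw vector, bytes per PQ code, and their ratio."""
    raw_bytes = d * float_bits // 8       # d float32 components
    code_bytes = m * nbits // 8           # m sub-quantizer codes of nbits each
    return raw_bytes, code_bytes, raw_bytes / code_bytes

raw, code, ratio = pq_compression(d=64, m=8)
print(raw, code, ratio)                   # 256 8 32.0

nb = 100000                               # database size used in this post
print(nb * raw / 1e6, nb * code / 1e6)    # 25.6 0.8 (MB, vector payload only)
```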

3.7 Using the GPU

ngpus = faiss.get_num_gpus()
print("number of GPUs:", ngpus)

number of GPUs: 0

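3.7.1 Using a single GPU

res = faiss.StandardGpuResources()  # GPU resources object required by index_cpu_to_gpu
# build a flat (CPU) index
index_flat = faiss.IndexFlatL2(d)
# make it into a GPU index on device 0
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)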

3.7.2 Using all GPUs

cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index) # build the index

gpu_index.add(xb)              # add vectors to the index
print(gpu_index.ntotal)

k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index.search(xq, k) # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries

References

Faiss index: https://waltyou.github.io/Faiss-Indexs/#挑一个合适的-index
Faiss in a project: https://waltyou.github.io/Faiss-In-Project/
Faiss introduction: https://waltyou.github.io/Faiss-Introduce/
https://www.infoq.cn/article/2017/11/Faiss-Facebook
Faiss index wiki: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
https://www.cnblogs.com/yhzhou/p/10568728.html
https://cloud.tencent.com/developer/article/1077741