
Text Semantic Similarity Computation and Text Matching Search with similarities

Finding the most semantically similar texts in a collection.

similarities implements a variety of similarity computation and matching-search algorithms, supports both text and images, and is written in Python 3.

Installation

pip3 install torch # conda install pytorch
pip3 install -U similarities

git clone https://github.com/shibing624/similarities.git
cd similarities
python3 setup.py install

Errors

ChineseCLIPProcessor


Traceback (most recent call last):
  File "xx\similarity_test1.py", line 9, in <module>
    from similarities import BertSimilarity
  File "xx\lib\site-packages\similarities\__init__.py", line 28, in <module>
    from similarities.clip_similarity import ClipSimilarity
  File "xx\lib\site-packages\similarities\clip_similarity.py", line 16, in <module>
    from similarities.clip_module import ClipModule
  File "xx\lib\site-packages\similarities\clip_module.py", line 18, in <module>
    from transformers import ChineseCLIPProcessor, ChineseCLIPModel, CLIPProcessor, CLIPModel
ImportError: cannot import name 'ChineseCLIPProcessor' from 'transformers' (xx\lib\site-packages\transformers\__init__.py)

This error means the installed transformers version is too old; upgrading it fixes the import:

pip install --upgrade transformers

pydantic

pydantic may also be missing; install it as well:

pip install pydantic

Text semantic similarity computation

from similarities import Similarity

m = Similarity()
r = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')
print(f"similarity score: {float(r)}")  # similarity score: 0.855146050453186

Output:

2024-03-07 20:35:07.000 | DEBUG    | text2vec.sentence_model:__init__:80 - Use device: cpu
similarity score: 0.8551465272903442

Return value: the cosine score ranges over [-1, 1]; the larger the value, the more similar the two texts.
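For intuition, the cosine score is just the normalized dot product of the two sentence embeddings. A minimal pure-Python sketch with toy 3-dimensional vectors (not real model embeddings):

```python
import math

def cosine(a, b):
    # cosine similarity: dot(a, b) / (|a| * |b|), always in [-1, 1]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # identical vectors -> 1.0
print(cosine([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]))  # opposite vectors -> -1.0
```

The library computes the same quantity, only over high-dimensional embeddings produced by the matching model.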

corpus is the document collection to search; it is only needed for search. Input docs may be either a list of sentences or a dict in {corpus_id: sentence} form.

model_name_or_path selects the model. The default is the Chinese representation-based matching model shibing624/text2vec-base-chinese; it can be replaced with the multilingual representation model sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2.

max_seq_length is the maximum input sentence length, capped at the maximum the matching model supports (512 for the BERT family).

Text semantic matching search

This finds the texts in a candidate document collection that are most similar to a query; it is commonly used for question matching in QA scenarios, similar-text retrieval, and related tasks.

# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description: Text semantic similarity computation and text matching search
"""
import sys

sys.path.append('..')
from similarities import BertSimilarity

# 1.Compute cosine similarity between two sentences.
sentences = ['如何更换花呗绑定银行卡',
             '花呗更改绑定银行卡']
corpus = [
    '花呗更改绑定银行卡',
    '我什么时候开通了花呗',
    '俄罗斯警告乌克兰反对欧盟协议',
    '暴风雨掩埋了东北部;新泽西16英寸的降雪',
    '中央情报局局长访问以色列叙利亚会谈',
    '人在巴基斯坦基地的炸弹袭击中丧生',
]
model = BertSimilarity(model_name_or_path="shibing624/text2vec-base-chinese")
print(model)
similarity_score = model.similarity(sentences[0], sentences[1])
print(f"{sentences[0]} vs {sentences[1]}, score: {float(similarity_score):.4f}")

print('-' * 50 + '\n')
# 2.Compute similarity between two list
similarity_scores = model.similarity(sentences, corpus)
print(similarity_scores.numpy())
for i in range(len(sentences)):
    for j in range(len(corpus)):
        print(f"{sentences[i]} vs {corpus[j]}, score: {similarity_scores.numpy()[i][j]:.4f}")

print('-' * 50 + '\n')
# 3.Semantic Search
model.add_corpus(corpus)
res = model.most_similar(queries=sentences, topn=3)
print(res)
for q_id, id_score_dict in res.items():
    print('query:', sentences[q_id])
    print("search top 3:")
    for corpus_id, s in id_score_dict.items():
        print(f'\t{model.corpus[corpus_id]}: {s:.4f}')

print('-' * 50 + '\n')
print(model.search(sentences[0], topn=3))

Result:

Similarity: BertSimilarity, matching_model: <SentenceModel: shibing624/text2vec-base-chinese, encoder_type: MEAN, max_seq_length: 256, emb_dim: 768>
2024-03-07 20:12:46.481 | DEBUG    | text2vec.sentence_model:__init__:80 - Use device: cpu
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.8551
--------------------------------------------------

[[0.8551465  0.72119546 0.14502521 0.21666759 0.25171342 0.08089039]
 [0.9999997  0.6807433  0.17136583 0.21621695 0.27282682 0.12791349]]
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.8551
如何更换花呗绑定银行卡 vs 我什么时候开通了花呗, score: 0.7212
如何更换花呗绑定银行卡 vs 俄罗斯警告乌克兰反对欧盟协议, score: 0.1450
如何更换花呗绑定银行卡 vs 暴风雨掩埋了东北部;新泽西16英寸的降雪, score: 0.2167
如何更换花呗绑定银行卡 vs 中央情报局局长访问以色列叙利亚会谈, score: 0.2517
如何更换花呗绑定银行卡 vs 人在巴基斯坦基地的炸弹袭击中丧生, score: 0.0809
花呗更改绑定银行卡 vs 花呗更改绑定银行卡, score: 1.0000
花呗更改绑定银行卡 vs 我什么时候开通了花呗, score: 0.6807
花呗更改绑定银行卡 vs 俄罗斯警告乌克兰反对欧盟协议, score: 0.1714
花呗更改绑定银行卡 vs 暴风雨掩埋了东北部;新泽西16英寸的降雪, score: 0.2162
花呗更改绑定银行卡 vs 中央情报局局长访问以色列叙利亚会谈, score: 0.2728
花呗更改绑定银行卡 vs 人在巴基斯坦基地的炸弹袭击中丧生, score: 0.1279
--------------------------------------------------

2024-03-07 20:13:03.429 | INFO     | similarities.bert_similarity:add_corpus:108 - Start computing corpus embeddings, new docs: 6
Batches: 100%|██████████| 1/1 [00:10<00:00, 10.45s/it]
2024-03-07 20:13:13.889 | INFO     | similarities.bert_similarity:add_corpus:120 - Add 6 docs, total: 6, emb len: 6
{0: {0: 0.8551465272903442, 1: 0.7211954593658447, 4: 0.25171342492103577}, 1: {0: 0.9999997019767761, 1: 0.6807432770729065, 4: 0.27282682061195374}}
query: 如何更换花呗绑定银行卡
search top 3:
	花呗更改绑定银行卡: 0.8551
	我什么时候开通了花呗: 0.7212
	中央情报局局长访问以色列叙利亚会谈: 0.2517
query: 花呗更改绑定银行卡
search top 3:
	花呗更改绑定银行卡: 1.0000
	我什么时候开通了花呗: 0.6807
	中央情报局局长访问以色列叙利亚会谈: 0.2728
--------------------------------------------------

{0: {0: 0.8551465272903442, 1: 0.7211954593658447, 4: 0.25171342492103577}}
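The dict returned by most_similar maps each query index to an inner {corpus_id: score} dict. A minimal sketch of unpacking such a result back into readable text, using the ids and sentences from the run above (scores copied from the output and truncated for brevity):

```python
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
corpus = {0: '花呗更改绑定银行卡',
          1: '我什么时候开通了花呗',
          4: '中央情报局局长访问以色列叙利亚会谈'}
# outer key: query index; inner dict: corpus_id -> cosine score
res = {0: {0: 0.8551, 1: 0.7212, 4: 0.2517},
       1: {0: 1.0000, 1: 0.6807, 4: 0.2728}}

for q_id, hits in res.items():
    print('query:', sentences[q_id])
    for corpus_id, score in hits.items():
        print(f'\t{corpus[corpus_id]}: {score:.4f}')
```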


The cosine score ranges over [-1, 1]; the larger the value, the more similar the query is to the corpus text.

Lexical text similarity computation and matching search

Supports similarity computation and lexical matching search with algorithms such as the Cilin synonym thesaurus, HowNet, word embeddings (WordEmbedding), TF-IDF, SimHash, and BM25; these are commonly used for cold-start text matching.

# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description: 
"""
import sys
from loguru import logger

sys.path.append('..')

from similarities import (
    SimHashSimilarity,
    TfidfSimilarity,
    BM25Similarity,
    WordEmbeddingSimilarity,
    CilinSimilarity,
    HownetSimilarity,
    SameCharsSimilarity,
    SequenceMatcherSimilarity,
)

logger.remove()
logger.add(sys.stderr, level="INFO")


def sim_and_search(m):
    print(m)
    if 'BM25' not in str(m):
        sim_scores = m.similarity(text1, text2)
        print('sim scores: ', sim_scores)
        for (idx, i), j in zip(enumerate(text1), text2):
            s = sim_scores[idx] if isinstance(sim_scores, list) else sim_scores[idx][idx]
            print(f"{i} vs {j}, score: {s:.4f}")
    m.add_corpus(corpus)
    res = m.most_similar(queries, topn=3)
    print('sim search: ', res)
    for q_id, c in res.items():
        print('query:', queries[q_id])
        print("search top 3:")
        for corpus_id, s in c.items():
            print(f'\t{m.corpus[corpus_id]}: {s:.4f}')
    print('-' * 50 + '\n')


if __name__ == '__main__':
    text1 = [
        '如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡'
    ]
    text2 = [
        '花呗更改绑定银行卡',
        '我什么时候开通了花呗',
    ]
    corpus = [
        '花呗更改绑定银行卡',
        '我什么时候开通了花呗',
        '俄罗斯警告乌克兰反对欧盟协议',
        '暴风雨掩埋了东北部;新泽西16英寸的降雪',
        '中央情报局局长访问以色列叙利亚会谈',
        '人在巴基斯坦基地的炸弹袭击中丧生',
    ]

    queries = [
        '我的花呗开通了?',
        '乌克兰被俄罗斯警告',
        '更改绑定银行卡',
    ]
    print('text1: ', text1)
    print('text2: ', text2)
    print('query: ', queries)
    sim_and_search(SimHashSimilarity())
    sim_and_search(TfidfSimilarity())
    sim_and_search(BM25Similarity())
    sim_and_search(WordEmbeddingSimilarity())
    sim_and_search(CilinSimilarity())
    sim_and_search(HownetSimilarity())
    sim_and_search(SameCharsSimilarity())
    sim_and_search(SequenceMatcherSimilarity())
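Among the lexical methods above, SequenceMatcherSimilarity is presumably built on Python's standard difflib.SequenceMatcher; a stdlib-only sketch of that underlying character-overlap ratio (an illustration of the idea, not the library's exact implementation):

```python
from difflib import SequenceMatcher

def char_overlap_ratio(text1: str, text2: str) -> float:
    # ratio of matching character blocks, in [0, 1]; 1.0 means identical strings
    return SequenceMatcher(None, text1, text2).ratio()

# shares the characters 更/换(改)/花呗/绑定银行卡, so the ratio is high
print(char_overlap_ratio('如何更换花呗绑定银行卡', '花呗更改绑定银行卡'))
print(char_overlap_ratio('更改绑定银行卡', '更改绑定银行卡'))  # identical -> 1.0
```

Unlike the semantic scores earlier, this is purely surface-level: paraphrases with no shared characters score near 0.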


Image similarity computation and image search

Supports image similarity computation and matching search with algorithms such as CLIP, pHash, and SIFT. The Chinese CLIP model supports image-to-image and text-to-image search, as well as cross-lingual Chinese/English image-text search.
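For intuition on the hash-based methods: pHash reduces each image to a short bit string, and two images are then compared by the Hamming distance between their hashes (fewer differing bits means more similar). A stdlib-only sketch of that comparison step, with hypothetical hash values standing in for real image hashes:

```python
def hamming_similarity(hash1: int, hash2: int, bits: int = 64) -> float:
    # XOR leaves a 1 in every position where the hashes differ;
    # count those bits and map the distance into a [0, 1] similarity
    distance = bin(hash1 ^ hash2).count('1')
    return 1.0 - distance / bits

h1 = 0b1011_0110  # hypothetical hash of image A (8 bits shown for brevity)
h2 = 0b1011_0111  # hypothetical hash of a near-duplicate: 1 bit differs
print(hamming_similarity(h1, h2, bits=8))  # -> 0.875
```

Computing the perceptual hash itself (downscale, grayscale, DCT, threshold) is the part the library handles; this sketch only covers the matching step.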

# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description: 
"""
import glob
import sys

from PIL import Image

sys.path.append('..')
from similarities import ImageHashSimilarity, SiftSimilarity, ClipSimilarity


def sim_and_search(m):
    print(m)
    # similarity
    sim_scores = m.similarity(imgs1, imgs2)
    print('sim scores: ', sim_scores)
    for (idx, i), j in zip(enumerate(image_fps1), image_fps2):
        s = sim_scores[idx] if isinstance(sim_scores, list) else sim_scores[idx][idx]
        print(f"{i} vs {j}, score: {s:.4f}")
    # search
    m.add_corpus(corpus_imgs)
    queries = imgs1
    res = m.most_similar(queries, topn=3)
    print('sim search: ', res)
    for q_id, c in res.items():
        print('query:', image_fps1[q_id])
        print("search top 3:")
        for corpus_id, s in c.items():
            print(f'\t{m.corpus[corpus_id].filename}: {s:.4f}')
    print('-' * 50 + '\n')


def clip_demo():
    m = ClipSimilarity(model_name_or_path="OFA-Sys/chinese-clip-vit-base-patch16")
    # english model name: openai/clip-vit-base-patch32
    print(m)
    # similarity score between text and image
    image_fps = [
        'data/image3.png',  # yellow flower image
        'data/image1.png',  # tiger image
    ]
    texts = ['a yellow flower', '老虎', '一头狮子', '玩具车']
    imgs = [Image.open(i) for i in image_fps]
    sim_scores = m.similarity(imgs, texts)
    print('sim scores: ', sim_scores)
    for idx, i in enumerate(image_fps):
        for idy, j in enumerate(texts):
            s = sim_scores[idx][idy]
            print(f"{i} vs {j}, score: {s:.4f}")
    print('-' * 50 + '\n')


if __name__ == "__main__":
    image_fps1 = ['data/image1.png', 'data/image3.png']
    image_fps2 = ['data/image12-like-image1.png', 'data/image10.png']
    imgs1 = [Image.open(i) for i in image_fps1]
    imgs2 = [Image.open(i) for i in image_fps2]
    corpus_fps = glob.glob('data/*.jpg') + glob.glob('data/*.png')
    corpus_imgs = [Image.open(i) for i in corpus_fps]
    # 1. image and text similarity
    clip_demo()

    # 2. image and image similarity score
    sim_and_search(ClipSimilarity())  # the best result
    sim_and_search(ImageHashSimilarity(hash_function='phash'))
    sim_and_search(SiftSimilarity())


Related links

https://github.com/shibing624/similarities
https://huggingface.co/shibing624/text2vec-base-chinese
Compute similarity score Demo
Semantic Search Demo
