要使用 en_core_web_md，请使用 python -m spacy download en_core_web_md 进行下载。要使用 en_core_web_lg，请使用 python -m spacy download en_core_web_lg。大型模型大约为 830mb 左右，而且速度很慢，因此中型模型是一个不错的选择。

python -m spacy download en_core_web_lg

代码


import spacy
nlp = spacy.load("en_core_web_lg")
 
doc1 = nlp(u'the person wear red T-shirt')
doc2 = nlp(u'this person is walking')
doc3 = nlp(u'the boy wear red T-shirt')
 
print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
print(doc2.similarity(doc3))

结果


0.7003971105290047
0.9671912343259517
0.6121211244876517

2.2. 方法之`Sentence Transformers`

GitHub - UKPLab/sentence-transformers: Multilingual Sentence & Image Embeddings with BERT

Semantic Textual Similarity — Sentence-Transformers documentation

安装包sentence-transformers

pip install -U sentence-transformers

所需的所有模型可以在Models - Hugging Face查看,

也可以通过Pretrained Models — Sentence-Transformers documentation查看。

2.2.1. 得到句子的Embedding向量

1. 是利用sentence_transformers包。

代码:


from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
 
sentences = [
    'the person wear red T-shirt',
    'this person is walking',
    'the boy wear red T-shirt'
    ]
sentence_embeddings = model.encode(sentences)
 
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

输出:


Sentence: the person wear red T-shirt
Embedding: [ 1.31643847e-01 -4.20616418e-01 ... 8.13076794e-01 -4.64620918e-01]
 
Sentence: this person is walking
Embedding: [-3.52878094e-01 -5.04286848e-02 ... -2.36091137e-01 -6.77282438e-02]
 
Sentence: the boy wear red T-shirt
Embedding: [-2.36365378e-01 -8.49713564e-01 ... 1.06414437e+00 -2.70157874e-01]

2. 利用transformers 和Torch

可以看出sentence_transformers包其实是transformers包和torch包的高级方法。它做了一个封装而已，更容易使用。从下面代码可以看出，所谓的句子嵌入就是Bert或其它预训练模型的输出后面加一池化层而已。

代码


from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
 
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
 
 
# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']
 
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
 
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
 
# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
 
# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
 
print("Sentence embeddings:")
print(sentence_embeddings)

输出


Sentence embeddings:
tensor([[ 0.0077, -0.0245,  0.0131,  ..., -0.0114,  0.0473, -0.0270],
        [-0.0200, -0.0029,  0.0531,  ...,  0.0568, -0.0134, -0.0038],
        [-0.0139, -0.0500,  0.0053,  ..., -0.0067,  0.0627, -0.0159]])

2.2.2. 根据句子的embeding向量，计算句子的距离，有三种方法。

1. 利用 SentenceTransformer 中的util

其实它里面的实现是利用Torch来实现


print(util.cos_sim(passage_embedding[0], passage_embedding[1]))
print(util.cos_sim(passage_embedding[0], passage_embedding[2]))
print(util.cos_sim(passage_embedding[1], passage_embedding[2]))

输出


tensor([[0.4644]])
tensor([[0.9070]])
tensor([[0.3276]])

2. 利用 scipy

代码


from scipy.spatial import distance
print(1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[1]))
print(1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[2]))
print(1 - distance.cosine(sentence_embeddings[1], sentence_embeddings[2]))

输出


0.4643629193305969
0.9069876074790955
0.3275738060474396

3. 利用 torch

代码


import torch.nn
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
b = torch.from_numpy(sentence_embeddings)
print(cos(b[0], b[1]))
print(cos(b[0], b[2]))
print(cos(b[1], b[2]))

输出


tensor(0.4644)
tensor(0.9070)
tensor(0.3276)

2.3. 方法之`TFHub Universal Sentence Encoder`

https://tfhub.dev/google/universal-sentence-encoder/4

https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb

这个大约 1GB 的模型非常大，而且看起来比其他模型慢。这也会生成句子的嵌入

代码


import tensorflow_hub as hub
 
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed([
    "the person wear red T-shirt",
    "this person is walking",
    "the boy wear red T-shirt"
    ])
 
print(embeddings)
 
from scipy.spatial import distance
print(1 - distance.cosine(embeddings[0], embeddings[1]))
print(1 - distance.cosine(embeddings[0], embeddings[2]))
print(1 - distance.cosine(embeddings[1], embeddings[2]))

输出


tf.Tensor(
[[ 0.063188    0.07063895 -0.05998802 ... -0.01409875  0.01863449
   0.01505797]
 [-0.06786212  0.01993554  0.03236153 ...  0.05772103  0.01787272
   0.01740014]
 [ 0.05379306  0.07613157 -0.05256693 ... -0.01256405  0.0213196
  -0.00262441]], shape=(3, 512), dtype=float32)
 
0.15320375561714172
0.8592830896377563
0.09080004692077637

其它嵌入

https://github.com/facebookresearch/InferSent

GitHub - Tiiiger/bert_score: BERT score for text generation

2.4. 方法之TF-IDF

TF-IDF原理

TF（Term Frequency) 表示词频，即一个词在在一篇文章中出现的次数，但在实际应用时会有一个漏洞，就是篇幅长的文章给定词出现的次数会更多一点。因此我们需要对次数进行归一化，通常用给定词的次数除以文章的总词数。

这其中还有一个漏洞，就是 ”的“ ”是“ ”啊“ 等类似的词在文章中出现的此时是非常多的，但是这些大多都是没有意义词，对于判断文章的关键词几乎没有什么用处，我们称这些词为”停用词“，也就是说，在度量相关性的时候不应该考虑这些词的频率。

IDF（Inverse Document Frequency）逆文本频率指数，如果包含关键词w的文档越少，则说明关键词w具有很好的类别区分能力。某一关键词的IDF，可以用总的文章数量除以包含该关键词的文章的数量，然后对结果取对数得到

注：分母加1是为了避免没有包含关键词的文章时分母是0的情况

一个词预测主题的能力越强，权重就越大，反之，权重越小，因此一个词的TF-IDF就是：

$TF-IDF=TF\cdot IDF$

代码实践

安装sklearn包

pip install scikit-learn

代码


from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["I 'd like an apple",
             "An apple a day keeps the doctor away",
            "Never compare an apple to an orange",
            "I prefer scikit-learn to Orange",
             "The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
print(vect.vocabulary_)
print('tfidf:',tfidf)
pairwise_similarity = tfidf * tfidf.T
print("pairwise_similarity:",pairwise_similarity)
print(pairwise_similarity.toarray() )
 
import numpy as np
 
arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)
input_doc = "The scikit-learn docs are Orange and Blue"
input_idx = corpus.index(input_doc)
print(input_idx)
 
result_idx = np.nanargmax(arr[input_idx])
 
print(corpus[result_idx])

输出结果


{'like': 9, 'apple': 0, 'day': 4, 'keeps': 7, 'doctor': 6, 'away': 1, 'compare': 3, 'orange': 10, 'prefer': 11, 'scikit': 12, 'learn': 8, 'docs': 5, 'blue': 2}
tfidf:   (0, 0)	0.5564505207186616
  (0, 9)	0.830880748357988
  (1, 1)	0.4741246485558491
  (1, 6)	0.4741246485558491
  (1, 7)	0.4741246485558491
  (1, 4)	0.4741246485558491
  (1, 0)	0.31752680284846835
  (2, 10)	0.48624041659157047
  (2, 3)	0.7260444301457811
  (2, 0)	0.48624041659157047
  (3, 8)	0.4864843177105593
  (3, 12)	0.4864843177105593
  (3, 11)	0.6029847724484662
  (3, 10)	0.40382592962643526
  (4, 2)	0.516373967614865
  (4, 5)	0.516373967614865
  (4, 8)	0.4166072657167829
  (4, 12)	0.4166072657167829
  (4, 10)	0.3458216642191991
pairwise_similarity:   (0, 2)	0.27056873300683837
  (0, 1)	0.17668795478716204
  (0, 0)	0.9999999999999998
  (1, 2)	0.1543943648960287
  (1, 0)	0.17668795478716204
  (1, 1)	0.9999999999999999
  (2, 1)	0.1543943648960287
  (2, 0)	0.27056873300683837
  (2, 4)	0.16815247007633355
  (2, 3)	0.1963564882520361
  (2, 2)	1.0
  (3, 2)	0.1963564882520361
  (3, 4)	0.5449975578692606
  (3, 3)	0.9999999999999999
  (4, 2)	0.16815247007633355
  (4, 3)	0.5449975578692606
  (4, 4)	1.0
[[1.         0.17668795 0.27056873 0.         0.        ]
 [0.17668795 1.         0.15439436 0.         0.        ]
 [0.27056873 0.15439436 1.         0.19635649 0.16815247]
 [0.         0.         0.19635649 1.         0.54499756]
 [0.         0.         0.16815247 0.54499756 1.        ]]
4

参考资料

python - How to compute the similarity between two text documents? - Stack Overflow

优势是：速度快

2.5. CountVectorizer

下面将结合代码实现基于词袋模型的文本匹配任务。

1 数据预处理

首先需要对数据进行预处理，将文本转化成向量表示。可以使用sklearn库中的CountVectorizer


from sklearn.feature_extraction.text import CountVectorizer
 
# 定义文本数据
text1 = "Hello, how are you?"
text2 = "I am doing well, thank you."
text3 = "What is your name?"
text4 = "My name is ChatGPT."
 
# 定义CountVectorizer
vectorizer = CountVectorizer()
 
# 将文本数据转化成向量表示
X = vectorizer.fit_transform([text1, text2, text3, text4])

2 计算文本相似度

接下来可以使用余弦相似度来计算文本之间的相似度，可以使用scipy库中的spatial库中的distance函数。


from scipy.spatial import distance
 
# 计算第1个和第2个文本的相似度
similarity_12 = 1 - distance.cosine(X[0].toarray(), X[1].toarray())
print("Text 1 and Text 2 similarity: ", similarity_12)
 
# 计算第1个和第3个文本的相似度
similarity_13 = 1 - distance.cosine(X[0].toarray(), X[2].toarray())
print("Text 1 and Text 3 similarity: ", similarity_13)
 
# 计算第2个和第4个文本的相似度
similarity_24 = 1 - distance.cosine(X[1].toarray(), X[3].toarray())
print("Text 2 and Text 4 similarity: ", similarity_24)

以上代码中，使用的是余弦相似度计算文本之间的相似度。可以看到，文本1和文本2的相似度较高，文本1和文本3的相似度较低，文本2和文本4的相似度为0。

2.6. Jaccard Index--杰卡德系数计算

杰卡德系数，英文叫做 Jaccard index, 又称为 Jaccard 相似系数，用于比较有限样本集之间的相似性与差异性。Jaccard 系数值越大，样本相似度越高。

实际上它的计算方式非常简单，就是两个样本的交集除以并集得到的数值，当两个样本完全一致时，结果为 1，当两个样本完全不同时，结果为 0。

具体的可以参考Jaccard Index / Similarity Coefficient - Statistics How To

1. 代码-只使用于英语


import numpy as np
def Jaccard_Similarity(doc1, doc2):
    # List the unique words in a document
    words_doc1 = set(doc1.lower().split())
    words_doc2 = set(doc2.lower().split())
 
    # Find the intersection of words list of doc1 & doc2
    intersection = words_doc1.intersection(words_doc2)
 
    # Find the union of words list of doc1 & doc2
    union = words_doc1.union(words_doc2)
 
    # Calculate Jaccard similarity score
    # using length of intersection set divided by length of union set
    return float(len(intersection)) / len(union)
 
 
if __name__ == '__main__':
 
    doc_1 = "Data is the new oil of the digital economy"
    doc_2 = "Data is a new oil"
 
    possiblity = Jaccard_Similarity(doc_1, doc_2)
    print(possiblity)

输出

0.4444444444444444

代码二--适用于中英文


 
import numpy as np
def jaccard_similarity_cn(s1, s2):
    from sklearn.feature_extraction.text import CountVectorizer
    def add_space(s):
        return ' '.join(list(s))
 
    # 将字中间加入空格
    s1, s2 = add_space(s1), add_space(s2)
    # 转化为TF矩阵
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # 获取词表内容
    ret = cv.get_feature_names()
    print(ret)
    # 求交集
    numerator = np.sum(np.min(vectors, axis=0))
    # 求并集
    denominator = np.sum(np.max(vectors, axis=0))
    # 计算杰卡德系数
    return 1.0 * numerator / denominator
 
 
if __name__ == '__main__':
    s1 = '你在干嘛呢'
    s2 = '你在干什么呢'
    print(jaccard_similarity_cn(s1, s2))

输出


['么', '什', '你', '呢', '嘛', '在', '干']
0.5714285714285714

总结

本文介绍了NLP中的文本匹配任务，包括任务的定义、原理以及代码实现。文本匹配任务的核心是将文本表示成向量，并计算文本向量之间的相似度。在实现中，可以使用不同的方法进行文本表示和相似度计算。通过文本匹配任务，可以实现很多NLP应用，比如问答系统、对话系统等。

参考资料

How to compute the similarity between two text documents?

https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity

https://towardsdatascience.com/word-distance-between-word-embeddings-cc3e9cf1d632

scipy.spatial.distance.cosine — SciPy v0.14.0 Reference Guide

https://www.tensorflow.org/api_docs/python/tf/keras/losses/CosineSimilarity

deep learning - is there a way to check similarity between two full sentences in python? - Stack Overflow

NLP Town

https://welearnnlp.blog.csdn.net/article/details/129904773

声明：本文内容由网友自发贡献，不代表【wpsshop博客】立场，版权归原作者所有，本站不承担相应法律责任。如您发现有侵权的内容，请联系我们。转载请注明出处：https://www.wpsshop.cn/w/从前慢现在也慢/article/detail/367618