当前位置:   article > 正文

C1W4.Assignment.Naive Machine Translation and LSH

C1W4.Assignment.Naive Machine Translation and LSH

理论课:C1W4.Machine Translation and Document Search


理论课: C1W4.Machine Translation and Document Search
本次课需要NLTK的推特数据集和停用词表。

先导入包

import pdb
import pickle
import string

import time

import nltk
import numpy as np
from nltk.corpus import stopwords, twitter_samples

from utils import (cosine_similarity, get_dict, process_tweet)
from os import getcwd

import w4_unittest
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14

1. The word embeddings data for English and French words

编写一个将英语翻译成法语的程序。

1.1The data

The subset of data

完整的英语词向量数据集约为 3.64G,法语词向量数据集约为 629 G。减少工作量,这里只需加载要用到的单词的嵌入词子集。
en_embeddings.p
fr_embeddings.p

en_embeddings_subset = pickle.load(open("./data/en_embeddings.p", "rb"))
fr_embeddings_subset = pickle.load(open("./data/fr_embeddings.p", "rb"))
  • 1
  • 2

en_embeddings_subset:键是一个英文单词,值是一个 300 维数组,是该单词的词向量,例如:
‘the’: array([ 0.08007812, 0.10498047, 0.04980469, 0.0534668 , -0.06738281, …
fr_embeddings_subset:键是一个法文单词,值是一个 300 维数组,是该单词的词向量,例如:
‘la’: array([-6.18250e-03, -9.43867e-04, -8.82648e-03, 3.24623e-02,…

Load two dictionaries

加载两个词典,将英语词汇映射为法语词汇

  • 一个训练词典
  • 一个测试词典。
# loading the english to french dictionaries
en_fr_train = get_dict('./data/en-fr.train.txt')
print('The length of the English to French training dictionary is', len(en_fr_train))
en_fr_test = get_dict('./data/en-fr.test.txt')
print('The length of the English to French test dictionary is', len(en_fr_test))
  • 1
  • 2
  • 3
  • 4
  • 5

结果:
The length of the English to French training dictionary is 5000
The length of the English to French test dictionary is 1500

  • en_fr_train 是一个字典,键是英文单词,值是该英文单词的法文翻译。例如:
{'the': 'la',
 'and': 'et',
 'was': 'était',
 'for': 'pour',
  • 1
  • 2
  • 3
  • 4
  • en_fr_testen_fr_train 类似,但它是一个测试集。

1.2 Generate embedding and transform matrices

Exercise 1: Translating English dictionary to French by using embeddings

实现一个函数 get_matrices,该函数接收加载的数据,并返回矩阵 XY

输入:

  • en_fr : 英法词典
  • en_embeddings : 英语词向量词典
  • fr_embeddings : 法文词向量词典

返回:

  • 矩阵 X 和矩阵 Y,其中 X 中的每一行是一个英语单词的词嵌入/向量,而 Y 中的同一行是该英语单词的法语版本的词嵌入。

在这里插入图片描述
使用 en_fr 字典,确保 X 矩阵中的第 i 行与 Y 矩阵中的第 i 行相对应。

# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_matrices(en_fr, french_vecs, english_vecs):
    """
    Input:
        en_fr: English to French dictionary
        french_vecs: French words to their corresponding word embeddings.
        english_vecs: English words to their corresponding word embeddings.
    Output: 
        X: a matrix where the columns are the English embeddings.
        Y: a matrix where the columns correspong to the French embeddings.
        R: the projection matrix that minimizes the F norm ||X R -Y||^2.
    """

    ### START CODE HERE ###

    # X_l and Y_l are lists of the english and french word embeddings
    X_l = list()
    Y_l = list()

    # get the english words (the keys in the dictionary) and store in a set()
    english_set = set(english_vecs.keys())

    # get the french words (keys in the dictionary) and store in a set()
    french_set = set(french_vecs.keys())

    # store the french words that are part of the english-french dictionary (these are the values of the dictionary)
    french_words = set(en_fr.values())

    # loop through all english, french word pairs in the english french dictionary
    for en_word, fr_word in en_fr.items():

        # check that the french word has an embedding and that the english word has an embedding
        if fr_word in french_set and en_word in english_set:

            # get the english embedding
            en_vec = english_vecs[en_word]

            # get the french embedding
            fr_vec = french_vecs[fr_word]

            # add the english embedding to the list
            X_l.append(en_vec)

            # add the french embedding to the list
            Y_l.append(fr_vec)

    # stack the vectors of X_l into a matrix X
    X = np.vstack(X_l)

    # stack the vectors of Y_l into a matrix Y
    Y = np.vstack(Y_l)
    ### END CODE HERE ###

    return X, Y
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54

使用get_matrices()函数获得将英语和法语单词嵌入到相应向量空间模型中的集合 X_trainY_train

# getting the training set:
X_train, Y_train = get_matrices(
    en_fr_train, fr_embeddings_subset, en_embeddings_subset)
  • 1
  • 2
  • 3

大白话:
1.所有英文和法文都有词向量表达,但是太大,我们不需要这么多,先切出我们需要用的英文词向量表达en_embeddings_subset和法文词向量表达fr_embeddings_subset。
2.加载英文法文单词对翻译训练集en_fr_train和测试集en_fr_test
3.单词没有办法训练,于是使用get_matrices函数将en_fr_train中的英文法文单词对一一拿出来,分别在英文词向量表达en_embeddings_subset和法文词向量表达fr_embeddings_subset中找到对应的词向量表达,分别放到X和Y中。

2. Translations

2.1 Translation as linear transformation of embeddings

给定英语和法语词嵌入词典,创建一个转换矩阵 R,完成:
使用英语词汇嵌入 e \mathbf{e} e,乘以 转换矩阵 后得 e R \mathbf{eR} eR ,也就是得到一个新的词嵌入 f \mathbf{f} f
e \mathbf{e} e f \mathbf{f} f 都是行向量。
然后计算 f \mathbf{f} f在法语词向量表达中的近邻,并得到与转换后的词嵌入式最相似的词。
我们希望转化矩阵R能够最小化以下公式:
arg ⁡ min ⁡ R ∥ X R − Y ∥ F (1) \arg \min _{\mathbf{R}}\| \mathbf{X R} - \mathbf{Y}\|_{F}\tag{1} argRminXRYF(1)
上式中的下标F表示Frobenius范数,具体定义看本节课程配套实验,简单说就是矩阵 A A A(维度为 m , n m,n m,n),其Frobenius范数为:
∥ A ∥ F ≡ ∑ i = 1 m ∑ j = 1 n ∣ a i j ∣ 2 (2) \|\mathbf{A}\|_{F} \equiv \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n}\left|a_{i j}\right|^{2}}\tag{2} AFi=1mj=1naij2 (2)

Actual loss function

在实操过程中,为了简化计算:
L F = ∥ X R − Y ∥ F L_F=\| \mathbf{XR} - \mathbf{Y}\|_{F} LF=XRYF
有时会使用Frobenius范数的平方,(去掉了Frobenius的根号,梯度下降求偏导更简便)即:
L F 2 = ∥ X R − Y ∥ F 2 L_{F^2}=\| \mathbf{XR} - \mathbf{Y}\|_{F}^2 LF2=XRYF2
然后,为了进一步简化,可以将这个值除以样本数量 m m m(即矩阵 X X X 的行数),得到:
L F 2 / m = ∥ X R − Y ∥ F 2 m L_{F^2}/m=\cfrac{\| \mathbf{XR} - \mathbf{Y}\|_{F}^2}{m} LF2/m=mXRYF2
这里的 m m m是为了规范化损失函数,使得损失值与样本数量无关,这样在不同规模的数据集上比较模型性能时更为公平。
扩展知识:

Exercise 2: Implementing translation mechanism described in this section

计算损失值,损失函数公式为:
L ( X , Y , R ) = 1 m ∑ i = 1 m ∑ j = 1 n ( a i j ) 2 L(X, Y, R)=\frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n}\left( a_{i j} \right)^{2} L(X,Y,R)=m1i=1mj=1n(aij)2
其中 a i j a_{i j} aij 是矩阵 X R − Y \mathbf{XR}-\mathbf{Y} XRY i i i 行和第 j j j 列的值。
compute_loss() 函数主要步骤:

  • 通过矩阵乘以 XR 计算 Y 的近似值
  • 计算差值 XR - Y
  • 根据Frobenius 公式计算差值的平方并除以 m m m
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_loss(X, Y, R):
    '''
    Inputs: 
        X: a matrix of dimension (m,n) where the columns are the English embeddings.
        Y: a matrix of dimension (m,n) where the columns correspong to the French embeddings.
        R: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.
    Outputs:
        L: a matrix of dimension (m,n) - the value of the loss function for given X, Y and R.
    '''
    ### START CODE HERE ###
    # m is the number of rows in X
    m = X.shape[0]
        
    # diff is XR - Y    
    diff = np.dot(X,R)-Y

    # diff_squared is the element-wise square of the difference    
    diff_squared = np.square(diff)

    # sum_diff_squared is the sum of the squared elements
    sum_diff_squared = np.sum(diff_squared)

    # loss i is the sum_diff_squard divided by the number of examples (m)
    loss = sum_diff_squared/m
    ### END CODE HERE ###
    return loss
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27

测试:

# Testing your implementation.
np.random.seed(123)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = np.random.rand(n, n)
print(f"Expected loss for an experiment with random matrices: {compute_loss(X, Y, R):.4f}" ) 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8

Exercise 3

Computing the gradient of loss in respect to transform matrix R

梯度计算要点:

  • 计算变换矩阵 R 的梯度损失。
  • 梯度是一个矩阵,它表示 R 的微小变化对损失函数变化的影响程度。
  • 梯度为我们提供了减少 R以最小化损失 的方向。
  • m m m 是训练实例的数量( X X X 中的行数)。
  • 损失函数 L ( X , Y , R ) L(X, Y, R) L(X,Y,R)的梯度公式(对 R R R求偏导)为:
    d d R L ( X , Y , R ) = d d R ( 1 m ∥ X R − Y ∥ F 2 ) = 2 m X T ( X R − Y ) \frac{d}{dR}L(X, Y, R)=\frac{d}{dR}\Big(\frac{1}{m}\| X R -Y\|_{F}^{2}\Big) = \frac{2}{m}X^{T} (X R - Y) dRdL(X,Y,R)=dRd(m1XRYF2)=m2XT(XRY)
# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_gradient(X, Y, R):
    '''
    Inputs: 
        X: a matrix of dimension (m,n) where the columns are the English embeddings.
        Y: a matrix of dimension (m,n) where the columns correspong to the French embeddings.
        R: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.
    Outputs:
        g: a scalar value - gradient of the loss function L for given X, Y and R.
    '''
    ### START CODE HERE ###
    # m is the number of rows in X
    m = X.shape[0]

    # gradient is X^T(XR - Y) * 2/m    
    gradient = np.dot(np.transpose(X),(np.dot(X,R)-Y))* 2/m    
    
    ### END CODE HERE ###
    return gradient
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19

测试:

# Testing your implementation.
np.random.seed(123)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = np.random.rand(n, n)
gradient = compute_gradient(X, Y, R)
print(f"First row of the gradient matrix: {gradient[0]}")
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
Finding the optimal R with gradient descent algorithm

梯度下降 是一种迭代算法,用于寻找函数的最优值。

  • 前面我们提到,损失相对于矩阵的梯度表示矩阵中某个坐标的微小变化对损失函数变化的影响程度。
  • 梯度下降利用这一信息来迭代更新矩阵 R,直到达到损失函数最小值。

梯度下降需要迭代次数新手可以设置一个固定值,而不是迭代到损失低于阈值为止。以下是相关解释:
1.训练集损失值降低不是我们的目标,我们想要的是低验证集损失值降,或者提高验证集准确率。通常会设置"early stopping"临界值,提前停止训练迭代,防止过拟合现象。但是。。。请看第二条
2.在更大的数据集上,regularization做得好的模型训练损失永远不会停止下降。尤其是在 NLP 领域,可以持续训练几个月,模型训练损失会慢慢下降。这是很难在某个临界值上停下来的原因。
3.使用固定迭代次数的一个好处是:你可以减少训练时间。另一个好处是,可以对超参数例如:尝试不同学习率,看看调整学习率会不会带来性能提升。

伪代码:
1.根据矩阵 R R R计算梯度损失 g g g
2.以学习率 α \alpha α更新矩阵 R R R
R new = R old − α g R_{\text{new}}= R_{\text{old}}-\alpha g Rnew=Roldαg
学习率大小的对梯度下降的影响这里不赘述,这里取learning_rate = 0.0003 =0.0003 =0.0003

Exercise 4

使用align_embeddings()函数完成align_embeddings()

# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def align_embeddings(X, Y, train_steps=100, learning_rate=0.0003, verbose=True, compute_loss=compute_loss, compute_gradient=compute_gradient):
    '''
    Inputs:
        X: a matrix of dimension (m,n) where the columns are the English embeddings.
        Y: a matrix of dimension (m,n) where the columns correspong to the French embeddings.
        train_steps: positive int - describes how many steps will gradient descent algorithm do.
        learning_rate: positive float - describes how big steps will  gradient descent algorithm do.
    Outputs:
        R: a matrix of dimension (n,n) - the projection matrix that minimizes the F norm ||X R -Y||^2
    '''
    np.random.seed(129)

    # the number of columns in X is the number of dimensions for a word vector (e.g. 300)
    # R is a square matrix with length equal to the number of dimensions in th  word embedding
    R = np.random.rand(X.shape[1], X.shape[1])

    for i in range(train_steps):
        if verbose and i % 25 == 0:
            print(f"loss at iteration {i} is: {compute_loss(X, Y, R):.4f}")
        ### START CODE HERE ###
        # use the function that you defined to compute the gradient
        gradient = compute_gradient(X, Y, R)

        # update R by subtracting the learning rate times gradient
        R -= learning_rate * gradient
        ### END CODE HERE ###
    return R
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28

测试:

# Testing your implementation.
np.random.seed(129)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = align_embeddings(X, Y)
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7

Calculate transformation matrix R

R_train = align_embeddings(X_train, Y_train, train_steps=400, learning_rate=0.8)
  • 1

结果:
loss at iteration 0 is: 963.0146
loss at iteration 25 is: 97.8292
loss at iteration 50 is: 26.8329
loss at iteration 75 is: 9.7893
loss at iteration 100 is: 4.3776
loss at iteration 125 is: 2.3281
loss at iteration 150 is: 1.4480
loss at iteration 175 is: 1.0338
loss at iteration 200 is: 0.8251
loss at iteration 225 is: 0.7145
loss at iteration 250 is: 0.6534
loss at iteration 275 is: 0.6185
loss at iteration 300 is: 0.5981
loss at iteration 325 is: 0.5858
loss at iteration 350 is: 0.5782
loss at iteration 375 is: 0.5735

2.2 Testing the translation

k-Nearest neighbors algorithm

k-NN简介

  • k-NN 将一个向量作为输入,并找出数据集中与之最接近的其他向量。
  • k “是要找到的 ”最近邻 "的数量(例如,k=2 可以找到最近的两个邻居)。

由于我们是通过线性变换矩阵 R \mathbf{R} R 来近似从英语词向量到法语词向量的转换,因此当我们将某个特定英语单词的向量 e \mathbf{e} e 转换到法语向量空间时,大多数情况下我们不会得到法语单词的精确向量表达,需要我们使用1-NN算法,以 e R \mathbf{eR} eR为输入,从矩阵 Y \mathbf{Y} Y中找到一个最相近的词向量 f \mathbf{f} f

Cosine similarity

判断向量距离使用之前就学过的余弦相似度:
cos ⁡ ( u , v ) = u ⋅ v ∥ u ∥ ∥ v ∥ \cos(u,v)=\frac{u\cdot v}{\left\|u\right\|\left\|v\right\|} cos(u,v)=uvuv
u u u v v v是两个向量。

  • u u u v v v 位于同一条直线且方向相同时, cos ⁡ ( u , v ) \cos(u,v) cos(u,v) = 1 1 1
  • u u u v v v 位于同一条直线上且方向相同时, c o s ( u , v ) cos(u,v) cos(u,v) = 1 1 1
  • 当矢量彼此正交(垂直)时, cos ⁡ ( u , v ) \cos(u,v) cos(u,v) 为 0。

注意:距离和相似度是截然相反的两个概念。

  • 我们可以从余弦相似度中得到距离度量,但余弦相似度不能直接用作距离度量。
  • 当余弦相似度增加时(接近 1 ),两个向量之间的 “距离 ”会减小(接近 0 )。
  • 我们可以将 u u u v v v 之间的余弦距离定义为:
    d cos ( u , v ) = 1 − cos ⁡ ( u , v ) d_{\text{cos}}(u,v)=1-\cos(u,v) dcos(u,v)=1cos(u,v)

Exercise 5

完成函数 nearest_neighbor()
输入:

  • 向量 v
  • 一组可能的近邻候选向量
  • 要找到的 k 个最近的邻居,这里默认为1。
  • 距离度量默认基于余弦相似度。

输出:
最近的 k 个向量

  • 余弦相似度cosine_similarity函数已经实现并导入。它的参数是两个向量,并返回它们之间角度的余弦值。
  • 遍历 candidates 中的行,并将当前行与向量 v 的相似度结果保存在一个 python 列表中。注意相似度的顺序与 candidates 行向量的顺序相同。
# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def nearest_neighbor(v, candidates, k=1, cosine_similarity=cosine_similarity):
    """
    Input:
      - v, the vector you are going find the nearest neighbor for
      - candidates: a set of vectors where we will find the neighbors
      - k: top k nearest neighbors to find
    Output:
      - k_idx: the indices of the top k closest vectors in sorted form
    """
    ### START CODE HERE ###
    similarity_l = []

    # for each candidate vector...
    for row in candidates:
        # get the cosine similarity
        cos_similarity = cosine_similarity(v,row)

        # append the similarity to the list
        similarity_l.append(cos_similarity)

    # sort the similarity list and get the indices of the sorted list    
    sorted_ids = np.argsort(similarity_l)
    
    # Reverse the order of the sorted_ids array
    sorted_ids = np.flip(sorted_ids)
    
    # get the indices of the k most similar candidate vectors
    k_idx = sorted_ids[0:k]
    ### END CODE HERE ###
    return k_idx
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31

测试:

# UNQ_C9 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything

# Test your implementation:
v = np.array([1, 0, 1])
candidates = np.array([[1, 0, 5], [-2, 5, 3], [2, 0, 1], [6, -9, 5], [9, 9, 9]])
print(candidates[nearest_neighbor(v, candidates, 3)])
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7

结果:
[[2 0 1]
[1 0 5]
[9 9 9]]

Exercise 6: Test your translation and compute its accuracy

完成函数 test_vocabulary,该函数吃英文词向量矩阵 X X X、法语词向量矩阵 Y Y Y R R R 矩阵,并通过 R R R 计算从 X X X Y Y Y 的翻译准确度。

  • 迭代转换后的英文单词向量,并检查 最接近的法语单词向量是否属于实际翻译的法语单词。
  • 使用 nearest_neighbor(参数为 ”k=1")获取最接近的法文词向量索引,并将其与刚才获取的英文嵌入索引进行比较。
  • 记录获得正确翻译的次数。

accuracy = # ( correct predictions ) # ( total predictions ) \text{accuracy}=\frac{\#(\text{correct predictions})}{\#(\text{total predictions})} accuracy=#(total predictions)#(correct predictions)

# UNQ_C10 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def test_vocabulary(X, Y, R, nearest_neighbor=nearest_neighbor):
    '''
    Input:
        X: a matrix where the columns are the English embeddings.
        Y: a matrix where the columns correspong to the French embeddings.
        R: the transform matrix which translates word embeddings from
        English to French word vector space.
    Output:
        accuracy: for the English to French capitals
    '''

    ### START CODE HERE ###
    # The prediction is X times R
    pred = np.dot(X,R)

    # initialize the number correct to zero
    num_correct = 0

    # loop through each row in pred (each transformed embedding)
    for i in range(len(pred)):
        # get the index of the nearest neighbor of pred at row 'i'; also pass in the candidates in Y
        pred_idx = nearest_neighbor(pred[i], Y)

        # if the index of the nearest neighbor equals the row of i... \
        if pred_idx == i:
            # increment the number correct by 1.
            num_correct += 1

    # accuracy is the number correct divided by the number of rows in 'pred' (also number of rows in X)
    accuracy = num_correct/len(pred)

    ### END CODE HERE ###

    return accuracy
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35

在测试集上计算正确率:

X_val, Y_val = get_matrices(en_fr_test, fr_embeddings_subset, en_embeddings_subset)
acc = test_vocabulary(X_val, Y_val, R_train)  # this might take a minute or two
print(f"accuracy on test set is {acc:.3f}")
  • 1
  • 2
  • 3

结果:
accuracy on test set is 0.557

3. LSH and document search

本节使用位置敏感哈希算法实现更高效的 K 近邻算法,并将其应用于文档搜索。

  • 处理推文,并将每条推文表示为一个向量(用向量嵌入表示文档)。
  • 使用定位敏感哈希算法和 k 近邻法查找与给定推文相似的推文。

先加载推特数据集

# get the positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
all_tweets = all_positive_tweets + all_negative_tweets
  • 1
  • 2
  • 3
  • 4

3.1 Getting the document embeddings

Bag-of-words (BOW) document models

文本文档是单词序列。

  • 词语的排序会产生不同的效果。例如,“苹果派比意大利辣香肠披萨好吃 ”和 “意大利辣香肠披萨比苹果派好吃 ”这两个句子由于单词顺序的不同而具有相反的含义。
  • 在某些应用中,忽略词序可以让我们训练出一个高效且仍然有效的模型,这种方法被称为词袋文档模型。

Document embeddings

  • 文档向量是通过对文档中所有单词的嵌入进行求和而创建的。
  • 如果我们不知道某个词的嵌入,就可以忽略该词。

Exercise 7

完成 get_document_embedding() 函数。

  • 函数 get_document_embedding() 将整个文档编码为 “文档 向量”。
  • 它接收一个文档(字符串)和一个字典en_embeddings
  • 它处理文档,并查找每个单词的相应词向量。
  • 然后将它们相加,并返回处理过的推特的所有单词向量的总和。
# UNQ_C12 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_document_embedding(tweet, en_embeddings, process_tweet=process_tweet):
    '''
    Input:
        - tweet: a string
        - en_embeddings: a dictionary of word embeddings
    Output:
        - doc_embedding: sum of all word embeddings in the tweet
    '''
    doc_embedding = np.zeros(300)

    ### START CODE HERE ###
    # process the document into a list of words (process the tweet)
    processed_doc = process_tweet(tweet)
    for word in processed_doc:
        # add the word embedding to the running total for the document embedding
        doc_embedding +=en_embeddings.get(word,0)
    ### END CODE HERE ###
    return doc_embedding
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19

代码中使用了get函数来省略未知单词,将其词向量值设置为0。
测试:

# testing your function
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
tweet_embedding = get_doc
  • 1
  • 2
  • 3

结果:
array([-0.00268555, -0.15378189, -0.55761719, -0.07216644, -0.32263184])

Exercise 8: Store all document vectors into a dictionary

将所有文档(推文)的向量表达放入词典中:

# UNQ_C14 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_document_vecs(all_docs, en_embeddings, get_document_embedding=get_document_embedding):
    '''
    Input:
        - all_docs: list of strings - all tweets in our dataset.
        - en_embeddings: dictionary with words as the keys and their embeddings as the values.
    Output:
        - document_vec_matrix: matrix of tweet embeddings.
        - ind2Doc_dict: dictionary with indices of tweets in vecs as keys and their embeddings as the values.
    '''

    # the dictionary's key is an index (integer) that identifies a specific tweet
    # the value is the document embedding for that document
    ind2Doc_dict = {}

    # this is list that will store the document vectors
    document_vec_l = []

    for i, doc in enumerate(all_docs):

        ### START CODE HERE ###
        # get the document embedding of the tweet
        doc_embedding = get_document_embedding(doc, en_embeddings)

        # save the document embedding into the ind2Tweet dictionary at index i
        ind2Doc_dict[i] = doc_embedding

        # append the document embedding to the list of document vectors
        document_vec_l.append(doc_embedding)

        ### END CODE HERE ###

    # convert the list of document vectors into a 2D array (each row is a document vector)
    document_vec_matrix = np.vstack(document_vec_l)

    return document_vec_matrix, ind2Doc_dict
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36

测试:

document_vecs, ind2Tweet = get_document_vecs(all_tweets, en_embeddings_subset)

print(f"length of dictionary {len(ind2Tweet)}")
print(f"shape of document_vecs {document_vecs.shape}")
  • 1
  • 2
  • 3
  • 4

结果:
length of dictionary 10000
shape of document_vecs (10000, 300)

3.2 Looking up the tweets

现在得到了一个维数为 ( m , d ) (m,d) (m,d) 的向量,其中 m 是推文的数量(10,000),d 是嵌入的维数(300)。 现在,可输入一条推文,然后使用余弦相似度来查看语料库中哪条推文与您的推文相似。

my_tweet = 'i am sad'
process_tweet(my_tweet)
tweet_embedding = get_document_embedding(my_tweet, en_embeddings_subset)

# this gives you a similar tweet as your input.
# this implementation is vectorized...
idx = np.argmax(cosine_similarity(document_vecs, tweet_embedding))
print(all_tweets[idx])
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8

结果:
@zoeeylim sad sad sad kid

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/IT小白/article/detail/845411
推荐阅读
相关标签