Theory: C1W4. Machine Translation and Document Search
First, import the packages:
import pdb
import pickle
import string
import time
import nltk
import numpy as np
from nltk.corpus import stopwords, twitter_samples
from utils import (cosine_similarity, get_dict, process_tweet)
from os import getcwd
import w4_unittest
Goal: write a program that translates English into French.
The full English word-embedding dataset is about 3.64 GB, and the French one is about 629 MB. To keep the workload small, we only load subsets (en_embeddings.p and fr_embeddings.p) containing the embeddings of the words we actually need:
en_embeddings_subset = pickle.load(open("./data/en_embeddings.p", "rb"))
fr_embeddings_subset = pickle.load(open("./data/fr_embeddings.p", "rb"))
en_embeddings_subset: the keys are English words and the values are 300-dimensional arrays holding each word's embedding, for example:
‘the’: array([ 0.08007812, 0.10498047, 0.04980469, 0.0534668 , -0.06738281, …
fr_embeddings_subset: the keys are French words and the values are 300-dimensional arrays holding each word's embedding, for example:
‘la’: array([-6.18250e-03, -9.43867e-04, -8.82648e-03, 3.24623e-02,…
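As a quick sanity check (a small sketch, assuming the two pickle subsets loaded above), we can confirm the sizes of the subsets and the 300-dimensional shape of an embedding:
# Assumes en_embeddings_subset and fr_embeddings_subset were loaded above.
print(len(en_embeddings_subset), len(fr_embeddings_subset))  # number of words in each subset
print(en_embeddings_subset['the'].shape)  # (300,)
print(fr_embeddings_subset['la'].shape)   # (300,)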
Load the two dictionaries that map English words to French words:
# loading the english to french dictionaries
en_fr_train = get_dict('./data/en-fr.train.txt')
print('The length of the English to French training dictionary is', len(en_fr_train))
en_fr_test = get_dict('./data/en-fr.test.txt')
print('The length of the English to French test dictionary is', len(en_fr_test))
Output:
The length of the English to French training dictionary is 5000
The length of the English to French test dictionary is 1500
en_fr_train is a dictionary whose keys are English words and whose values are the French translations of those words, e.g. {'the': 'la', 'and': 'et', 'was': 'était', 'for': 'pour', ...}. en_fr_test is similar, but it is the test set.
Implement a function get_matrices that takes the loaded data and returns the matrices X and Y.
Input:
en_fr: the English-to-French dictionary
en_embeddings: the dictionary of English word embeddings
fr_embeddings: the dictionary of French word embeddings
Returns:
the matrices X and Y, where each row of X is the embedding of an English word and the same row of Y is the embedding of that word's French translation.
Use the en_fr dictionary to make sure that row i of the X matrix corresponds to row i of the Y matrix.
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_matrices(en_fr, french_vecs, english_vecs):
    """
    Input:
        en_fr: English to French dictionary
        french_vecs: French words to their corresponding word embeddings.
        english_vecs: English words to their corresponding word embeddings.
    Output:
        X: a matrix where each row is an English word embedding.
        Y: a matrix where each row is the embedding of the corresponding French word.
    """
    ### START CODE HERE ###
    # X_l and Y_l are lists of the english and french word embeddings
    X_l = list()
    Y_l = list()

    # get the english words (the keys in the dictionary) and store in a set()
    english_set = set(english_vecs.keys())

    # get the french words (keys in the dictionary) and store in a set()
    french_set = set(french_vecs.keys())

    # store the french words that are part of the english-french dictionary (these are the values of the dictionary)
    french_words = set(en_fr.values())

    # loop through all english, french word pairs in the english french dictionary
    for en_word, fr_word in en_fr.items():

        # check that the french word has an embedding and that the english word has an embedding
        if fr_word in french_set and en_word in english_set:

            # get the english embedding
            en_vec = english_vecs[en_word]

            # get the french embedding
            fr_vec = french_vecs[fr_word]

            # add the english embedding to the list
            X_l.append(en_vec)

            # add the french embedding to the list
            Y_l.append(fr_vec)

    # stack the vectors of X_l into a matrix X
    X = np.vstack(X_l)

    # stack the vectors of Y_l into a matrix Y
    Y = np.vstack(Y_l)
    ### END CODE HERE ###

    return X, Y
Use the get_matrices() function to obtain X_train and Y_train, the sets of English and French word embeddings in their respective vector spaces.
# getting the training set:
X_train, Y_train = get_matrices(
en_fr_train, fr_embeddings_subset, en_embeddings_subset)
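As a quick sanity check (a sketch that assumes X_train and Y_train from the cell above), the two matrices should have the same shape, one row per usable word pair and 300 columns, with row i of X_train aligned to row i of Y_train:
# Assumes X_train and Y_train were built by get_matrices above.
print(X_train.shape, Y_train.shape)    # (number of usable pairs, 300) for both
assert X_train.shape == Y_train.shape  # rows are aligned pair-wise by construction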
In plain language:
1. Every English and French word has an embedding, but the full files are too large and we do not need most of them, so we first cut out the subsets we will actually use: en_embeddings_subset and fr_embeddings_subset.
2. Load the English-French word-pair training set en_fr_train and test set en_fr_test.
3. Words themselves cannot be trained on directly, so get_matrices takes each English-French pair in en_fr_train, looks up the corresponding embeddings in en_embeddings_subset and fr_embeddings_subset, and puts them into X and Y respectively.
Given the English and French embedding dictionaries, create a transformation matrix $\mathbf{R}$ that does the following:
multiply an English word embedding $\mathbf{e}$ by the transformation matrix to get $\mathbf{eR}$, i.e. a new embedding $\mathbf{f}$ (both $\mathbf{e}$ and $\mathbf{f}$ are row vectors);
then find the nearest neighbors of $\mathbf{f}$ among the French word embeddings, which gives the words most similar to the transformed embedding.
We want the transformation matrix $\mathbf{R}$ to minimize:
$$\arg \min_{\mathbf{R}} \| \mathbf{X R} - \mathbf{Y}\|_{F} \tag{1}$$
The subscript F above denotes the Frobenius norm (see the lab accompanying this lesson for details). In short, for a matrix $\mathbf{A}$ of dimension $(m, n)$, its Frobenius norm is:
$$\|\mathbf{A}\|_{F} \equiv \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n}\left|a_{i j}\right|^{2}} \tag{2}$$
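As a small illustration (a sketch using only NumPy, which is already imported), the definition in (2) agrees with what np.linalg.norm computes when asked for the Frobenius norm:
A = np.array([[2.0, 2.0], [2.0, 2.0]])
fro_manual = np.sqrt(np.sum(np.square(A)))  # square every entry, sum, take the square root
fro_numpy = np.linalg.norm(A, 'fro')        # NumPy's built-in Frobenius norm
print(fro_manual, fro_numpy)                # both print 4.0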
In practice, to simplify the computation of
$$L_F=\| \mathbf{XR} - \mathbf{Y}\|_{F}$$
we often use the square of the Frobenius norm instead (dropping the square root makes the partial derivatives for gradient descent simpler):
$$L_{F^2}=\| \mathbf{XR} - \mathbf{Y}\|_{F}^2$$
To simplify further, this value can be divided by the number of examples $m$ (the number of rows of $\mathbf{X}$):
$$L_{F^2}/m=\cfrac{\| \mathbf{XR} - \mathbf{Y}\|_{F}^2}{m}$$
Dividing by $m$ normalizes the loss so that its value does not depend on the number of examples, which makes it fairer to compare model performance across datasets of different sizes.
Additional notes:
To compute the loss value, the loss function is:
$$L(X, Y, R)=\frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n}\left( a_{i j} \right)^{2}$$
where $a_{ij}$ is the entry in row $i$ and column $j$ of the matrix $\mathbf{XR}-\mathbf{Y}$.
Main steps of the compute_loss() function:
1. Use X and R to compute the approximation of Y, and take the difference XR - Y.
2. Square every element of the difference and sum the squares.
3. Divide the sum by the number of examples m to get the loss.
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_loss(X, Y, R):
    '''
    Inputs:
        X: a matrix of dimension (m,n) where each row is an English embedding.
        Y: a matrix of dimension (m,n) where each row is the corresponding French embedding.
        R: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.
    Outputs:
        L: a scalar - the value of the loss function for given X, Y and R.
    '''
    ### START CODE HERE ###
    # m is the number of rows in X
    m = X.shape[0]

    # diff is XR - Y
    diff = np.dot(X, R) - Y

    # diff_squared is the element-wise square of the difference
    diff_squared = np.square(diff)

    # sum_diff_squared is the sum of the squared elements
    sum_diff_squared = np.sum(diff_squared)

    # loss is the sum_diff_squared divided by the number of examples (m)
    loss = sum_diff_squared / m
    ### END CODE HERE ###
    return loss
Test:
# Testing your implementation.
np.random.seed(123)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = np.random.rand(n, n)
print(f"Expected loss for an experiment with random matrices: {compute_loss(X, Y, R):.4f}" )
Key points for the gradient computation:
Compute the gradient of the loss with respect to the transformation matrix R.
The gradient measures how much a small change in R changes the loss function.
It gives the direction in which to update R so as to minimize the loss; the gradient works out to $\frac{2}{m}\mathbf{X}^{T}(\mathbf{XR}-\mathbf{Y})$.
# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_gradient(X, Y, R):
    '''
    Inputs:
        X: a matrix of dimension (m,n) where each row is an English embedding.
        Y: a matrix of dimension (m,n) where each row is the corresponding French embedding.
        R: a matrix of dimension (n,n) - transformation matrix from English to French vector space embeddings.
    Outputs:
        gradient: a matrix of dimension (n,n) - gradient of the loss function L with respect to R.
    '''
    ### START CODE HERE ###
    # m is the number of rows in X
    m = X.shape[0]

    # gradient is X^T(XR - Y) * 2/m
    gradient = np.dot(np.transpose(X), (np.dot(X, R) - Y)) * 2 / m
    ### END CODE HERE ###
    return gradient
Test:
# Testing your implementation.
np.random.seed(123)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = np.random.rand(n, n)
gradient = compute_gradient(X, Y, R)
print(f"First row of the gradient matrix: {gradient[0]}")
Gradient descent is an iterative algorithm used to find the optimum of a function. It repeatedly updates R until the loss is minimized. For the number of iterations, beginners can simply use a fixed value instead of iterating until the loss drops below a threshold. Some context:
1. A low training loss is not the real goal; what we want is a low validation loss, or a high validation accuracy. Usually an "early stopping" criterion is used to stop training before overfitting sets in. However... see point 2.
2. On larger datasets, the training loss of a well-regularized model may never stop decreasing. Especially in NLP, a model can keep training for months with the training loss slowly going down, which is why it is hard to pick a stopping threshold.
3. One benefit of a fixed number of iterations is that it bounds the training time. Another is that hyperparameters can be compared fairly, e.g. trying different learning rates to see whether changing the learning rate improves performance.
Pseudocode:
1. Compute the gradient $g$ of the loss with respect to the matrix $R$.
2. Update $R$ using the learning rate $\alpha$:
$$R_{\text{new}}= R_{\text{old}}-\alpha g$$
The effect of the learning-rate size on gradient descent is not discussed here; we use learning_rate $= 0.0003$.
Complete the align_embeddings() function, which runs gradient descent:
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def align_embeddings(X, Y, train_steps=100, learning_rate=0.0003, verbose=True, compute_loss=compute_loss, compute_gradient=compute_gradient):
    '''
    Inputs:
        X: a matrix of dimension (m,n) where each row is an English embedding.
        Y: a matrix of dimension (m,n) where each row is the corresponding French embedding.
        train_steps: positive int - describes how many steps will gradient descent algorithm do.
        learning_rate: positive float - describes how big steps will gradient descent algorithm do.
    Outputs:
        R: a matrix of dimension (n,n) - the projection matrix that minimizes the F norm ||X R - Y||^2
    '''
    np.random.seed(129)

    # the number of columns in X is the number of dimensions for a word vector (e.g. 300)
    # R is a square matrix with length equal to the number of dimensions in the word embedding
    R = np.random.rand(X.shape[1], X.shape[1])

    for i in range(train_steps):
        if verbose and i % 25 == 0:
            print(f"loss at iteration {i} is: {compute_loss(X, Y, R):.4f}")
        ### START CODE HERE ###
        # use the function that you defined to compute the gradient
        gradient = compute_gradient(X, Y, R)

        # update R by subtracting the learning rate times gradient
        R -= learning_rate * gradient
        ### END CODE HERE ###
    return R
Test:
# Testing your implementation.
np.random.seed(129)
m = 10
n = 5
X = np.random.rand(m, n)
Y = np.random.rand(m, n) * .1
R = align_embeddings(X, Y)
R_train = align_embeddings(X_train, Y_train, train_steps=400, learning_rate=0.8)
Output:
loss at iteration 0 is: 963.0146
loss at iteration 25 is: 97.8292
loss at iteration 50 is: 26.8329
loss at iteration 75 is: 9.7893
loss at iteration 100 is: 4.3776
loss at iteration 125 is: 2.3281
loss at iteration 150 is: 1.4480
loss at iteration 175 is: 1.0338
loss at iteration 200 is: 0.8251
loss at iteration 225 is: 0.7145
loss at iteration 250 is: 0.6534
loss at iteration 275 is: 0.6185
loss at iteration 300 is: 0.5981
loss at iteration 325 is: 0.5858
loss at iteration 350 is: 0.5782
loss at iteration 375 is: 0.5735
Because we only approximate the English-to-French mapping with a linear transformation matrix $\mathbf{R}$, transforming a particular English word vector $\mathbf{e}$ into the French vector space will usually not land exactly on any French word vector. We therefore use a 1-NN search: taking $\mathbf{eR}$ as the query, find the closest word vector $\mathbf{f}$ among the rows of the matrix $\mathbf{Y}$.
To measure how close two vectors are, we use the cosine similarity introduced earlier:
$$\cos(u,v)=\frac{u\cdot v}{\left\|u\right\|\left\|v\right\|}$$
where $u$ and $v$ are two vectors.
Note: distance and similarity are opposite notions; a higher similarity corresponds to a smaller distance.
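The assignment imports cosine_similarity from utils, so it does not need to be re-implemented; still, a minimal NumPy sketch of what such a function might look like (hypothetical, not the course's exact code) makes the formula concrete:
def cosine_similarity_sketch(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity_sketch(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # about 0.7071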
Complete the function nearest_neighbor().
Inputs: the query vector v, a set of candidate vectors, and k, the number of nearest neighbors to return (default 1).
Output: the indices of the k nearest vectors.
Hints: the cosine_similarity function is already implemented and imported; it takes two vectors and returns the cosine of the angle between them. Iterate over the rows of candidates, compute the similarity between each row and v, and store the results in a Python list; the order of the similarities matches the order of the row vectors in candidates.
# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def nearest_neighbor(v, candidates, k=1, cosine_similarity=cosine_similarity):
    """
    Input:
      - v, the vector you are going find the nearest neighbor for
      - candidates: a set of vectors where we will find the neighbors
      - k: top k nearest neighbors to find
    Output:
      - k_idx: the indices of the top k closest vectors in sorted form
    """
    ### START CODE HERE ###
    similarity_l = []

    # for each candidate vector...
    for row in candidates:
        # get the cosine similarity
        cos_similarity = cosine_similarity(v, row)

        # append the similarity to the list
        similarity_l.append(cos_similarity)

    # sort the similarity list and get the indices of the sorted list
    sorted_ids = np.argsort(similarity_l)

    # Reverse the order of the sorted_ids array
    sorted_ids = np.flip(sorted_ids)

    # get the indices of the k most similar candidate vectors
    k_idx = sorted_ids[0:k]
    ### END CODE HERE ###
    return k_idx
Test:
# UNQ_C9 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything
# Test your implementation:
v = np.array([1, 0, 1])
candidates = np.array([[1, 0, 5], [-2, 5, 3], [2, 0, 1], [6, -9, 5], [9, 9, 9]])
print(candidates[nearest_neighbor(v, candidates, 3)])
Output:
[[2 0 1]
[1 0 5]
[9 9 9]]
Complete the function test_vocabulary, which takes the English embedding matrix $X$, the French embedding matrix $Y$ and the matrix $R$, and computes the accuracy of translating from $X$ to $Y$ through $R$.
For each row of $XR$, use nearest_neighbor (with k=1) to get the index of the closest French word vector and compare it with the index of that row's English embedding. The accuracy is:
$$\text{accuracy}=\frac{\#(\text{correct predictions})}{\#(\text{total predictions})}$$
# UNQ_C10 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def test_vocabulary(X, Y, R, nearest_neighbor=nearest_neighbor):
    '''
    Input:
        X: a matrix where each row is an English embedding.
        Y: a matrix where each row is the corresponding French embedding.
        R: the transform matrix which translates word embeddings from English to French word vector space.
    Output:
        accuracy: for the English to French translation
    '''
    ### START CODE HERE ###
    # The prediction is X times R
    pred = np.dot(X, R)

    # initialize the number correct to zero
    num_correct = 0

    # loop through each row in pred (each transformed embedding)
    for i in range(len(pred)):
        # get the index of the nearest neighbor of pred at row 'i'; also pass in the candidates in Y
        pred_idx = nearest_neighbor(pred[i], Y)

        # if the index of the nearest neighbor equals the row index i...
        if pred_idx == i:
            # increment the number correct by 1.
            num_correct += 1

    # accuracy is the number correct divided by the number of rows in 'pred' (also number of rows in X)
    accuracy = num_correct / len(pred)
    ### END CODE HERE ###
    return accuracy
Compute the accuracy on the test set:
X_val, Y_val = get_matrices(en_fr_test, fr_embeddings_subset, en_embeddings_subset)
acc = test_vocabulary(X_val, Y_val, R_train) # this might take a minute or two
print(f"accuracy on test set is {acc:.3f}")
Output:
accuracy on test set is 0.557
This section uses locality-sensitive hashing to implement a more efficient k-nearest-neighbor search and applies it to document search.
First, load the Twitter dataset:
# get the positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
all_tweets = all_positive_tweets + all_negative_tweets
A text document is a sequence of words.
Complete the get_document_embedding() function.
get_document_embedding() encodes an entire document (a tweet) as a single "document vector": it processes the tweet, looks up each word in en_embeddings, and sums the word embeddings.
# UNQ_C12 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_document_embedding(tweet, en_embeddings, process_tweet=process_tweet):
    '''
    Input:
        - tweet: a string
        - en_embeddings: a dictionary of word embeddings
    Output:
        - doc_embedding: sum of all word embeddings in the tweet
    '''
    doc_embedding = np.zeros(300)
    ### START CODE HERE ###
    # process the document into a list of words (process the tweet)
    processed_doc = process_tweet(tweet)
    for word in processed_doc:
        # add the word embedding to the running total for the document embedding
        doc_embedding += en_embeddings.get(word, 0)
    ### END CODE HERE ###
    return doc_embedding
The code uses the dictionary's get method to skip unknown words by treating their embedding value as 0.
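A small illustration of that behaviour (the second word is a made-up, hypothetical out-of-vocabulary token):
# A word in the dictionary returns its 300-d vector; an unknown word returns the
# default 0, which leaves the running sum in get_document_embedding unchanged.
print(en_embeddings_subset.get('the', 0)[:3])           # first 3 components of a known word's vector
print(en_embeddings_subset.get('notarealword123', 0))   # 0 (the default)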
Test:
# testing your function
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
tweet_embedding = get_document_embedding(custom_tweet, en_embeddings_subset)
tweet_embedding[-5:]
Output:
array([-0.00268555, -0.15378189, -0.55761719, -0.07216644, -0.32263184])
Put the vector representations of all documents (tweets) into a matrix and a dictionary:
# UNQ_C14 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_document_vecs(all_docs, en_embeddings, get_document_embedding=get_document_embedding):
    '''
    Input:
        - all_docs: list of strings - all tweets in our dataset.
        - en_embeddings: dictionary with words as the keys and their embeddings as the values.
    Output:
        - document_vec_matrix: matrix of tweet embeddings.
        - ind2Doc_dict: dictionary with indices of tweets in vecs as keys and their embeddings as the values.
    '''
    # the dictionary's key is an index (integer) that identifies a specific tweet
    # the value is the document embedding for that document
    ind2Doc_dict = {}

    # this is list that will store the document vectors
    document_vec_l = []

    for i, doc in enumerate(all_docs):
        ### START CODE HERE ###
        # get the document embedding of the tweet
        doc_embedding = get_document_embedding(doc, en_embeddings)

        # save the document embedding into the ind2Tweet dictionary at index i
        ind2Doc_dict[i] = doc_embedding

        # append the document embedding to the list of document vectors
        document_vec_l.append(doc_embedding)
        ### END CODE HERE ###

    # convert the list of document vectors into a 2D array (each row is a document vector)
    document_vec_matrix = np.vstack(document_vec_l)

    return document_vec_matrix, ind2Doc_dict
Test:
document_vecs, ind2Tweet = get_document_vecs(all_tweets, en_embeddings_subset)
print(f"length of dictionary {len(ind2Tweet)}")
print(f"shape of document_vecs {document_vecs.shape}")
Output:
length of dictionary 10000
shape of document_vecs (10000, 300)
We now have a matrix of dimension $(m, d)$, where $m$ is the number of tweets (10,000) and $d$ is the embedding dimension (300). We can now take a tweet as input and use cosine similarity to find the most similar tweet in the corpus.
my_tweet = 'i am sad'
process_tweet(my_tweet)
tweet_embedding = get_document_embedding(my_tweet, en_embeddings_subset)
# this gives you a similar tweet as your input.
# this implementation is vectorized...
idx = np.argmax(cosine_similarity(document_vecs, tweet_embedding))
print(all_tweets[idx])
Output:
@zoeeylim sad sad sad kid