Purpose of this assignment:
Predict the Countries from Capitals — task description: use word analogies between capital cities and their countries; write a program that, given a capital city, predicts the corresponding country.
Import the packages; install any missing ones with pip install.
# Run this cell to import packages.
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import w3_unittest
from utils import get_vectors
The dataset is loaded as a Pandas DataFrame. If the dataset is large, this may take a few minutes.
data = pd.read_csv('./data/capitals.txt', delimiter=' ')
data.columns = ['city1', 'country1', 'city2', 'country2']
# print first five elements in the DataFrame
data.head(5)
The original Google News word embedding dataset is about 3.64 GB. If you have the resources, you can download it yourself, extract the sample of words that will be analyzed in this assignment, and save them in a pickle file named word_embeddings_subset.p.
To extract the word vectors yourself, see below:
import nltk
import pickle
from gensim.models import KeyedVectors

embeddings = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
f = open('capitals.txt', 'r').read()
set_words = set(nltk.word_tokenize(f))
select_words = words = ['king', 'queen', 'oil', 'gas', 'happy', 'sad', 'city', 'town', 'village', 'country', 'continent', 'petroleum', 'joyful']
for w in select_words:
    set_words.add(w)

def get_word_embeddings(embeddings):
    word_embeddings = {}
    for word in embeddings.vocab:
        if word in set_words:
            word_embeddings[word] = embeddings[word]
    return word_embeddings

# Testing your function
word_embeddings = get_word_embeddings(embeddings)
print(len(word_embeddings))
pickle.dump(word_embeddings, open("word_embeddings_subset.p", "wb"))
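Note: if you are running gensim 4.x, KeyedVectors no longer exposes .vocab. A minimal adaptation of the extraction loop, assuming gensim >= 4.0, uses key_to_index instead:
# Variant for gensim >= 4.0, where KeyedVectors.vocab was removed.
# Assumes `embeddings` and `set_words` are already defined as in the snippet above.
def get_word_embeddings(embeddings):
    word_embeddings = {}
    for word in embeddings.key_to_index:  # replaces the old embeddings.vocab
        if word in set_words:
            word_embeddings[word] = embeddings[word]
    return word_embeddings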
Alternatively, download word_embeddings_subset.p directly from the bundled resources, save it to the data directory, and load it as a dictionary:
word_embeddings = pickle.load(open("./data/word_embeddings_subset.p", "rb"))
len(word_embeddings) # there should be 243 words that will be used in this assignment
Result:
243
Each word vector is 300-dimensional:
print("dimension: {}".format(word_embeddings['Spain'].shape[0]))
Result:
dimension: 300
Next, write a function that uses word embeddings to predict relationships between words.
The cosine similarity formula is:
$$\cos(\theta)=\frac{\mathbf{A}\cdot\mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|}=\frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}}\,\sqrt{\sum_{i=1}^{n} B_{i}^{2}}}\tag{1}$$
where $A$ and $B$ are word vectors, and $A_i$ or $B_i$ denotes the $i$-th element of the corresponding vector.
# UNQ_C1 GRADED FUNCTION: cosine_similarity
def cosine_similarity(A, B):
    '''
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        cos: numerical number representing the cosine similarity between A and B.
    '''
    ### START CODE HERE ###
    dot = np.dot(A, B)
    norma = np.sqrt(np.dot(A, A))
    normb = np.sqrt(np.dot(B, B))
    cos = dot / (norma * normb)
    ### END CODE HERE ###
    return cos
Test:
# feel free to try different words
king = word_embeddings['king']
queen = word_embeddings['queen']
cosine_similarity(king, queen)
Result:
0.6510957
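As a quick sanity check of the formula, you can also compute the similarity of small toy vectors that are easy to verify by hand (these vectors are made up for illustration and are not taken from the embeddings):
# Toy check: cos = (1*1 + 0*1) / (1 * sqrt(2)) = 1/sqrt(2) ≈ 0.7071
A = np.array([1.0, 0.0])
B = np.array([1.0, 1.0])
print(cosine_similarity(A, B))  # expected: about 0.7071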
The Euclidean distance formula is as follows:
$$d(\mathbf{A},\mathbf{B})=d(\mathbf{B},\mathbf{A})=\sqrt{(A_1-B_1)^2+(A_2-B_2)^2+\cdots+(A_n-B_n)^2}=\sqrt{\sum_{i=1}^{n}(A_i-B_i)^2}$$
where $n$ is the number of elements in each vector.
# UNQ_C2 GRADED FUNCTION: euclidean
def euclidean(A, B):
    """
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        d: numerical number representing the Euclidean distance between A and B.
    """
    ### START CODE HERE ###
    # euclidean distance
    d = np.sqrt(np.sum((A - B) ** 2))
    ### END CODE HERE ###
    return d
Test:
# Test your function
euclidean(king, queen)
Result:
2.4796925
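Again, a small hand-checkable example (the toy vectors below are made up for illustration):
# Toy check: d = sqrt((1-4)^2 + (2-6)^2) = sqrt(9 + 16) = 5.0
A = np.array([1.0, 2.0])
B = np.array([4.0, 6.0])
print(euclidean(A, B))  # expected: 5.0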
Now use the functions implemented above to compute similarities between vectors, and use those similarities to find the country that corresponds to a capital city. You will write a function that takes three words and the word embedding dictionary; its task is to predict the missing country. For example, given the words:
1: Athens 2: Greece 3: Baghdad,
the prediction should be: 4: Iraq
When writing the function:
1. You may want to refer to the King - Man + Woman = Queen analogy and turn that scheme into a mathematical expression using the word embeddings and the similarity function.
2. Iterate through the word embedding dictionary and compute the cosine similarity between the candidate vector and each word embedding.
3. Make sure the returned word is not one of the words passed into the function.
4. Return the word with the highest similarity score.
# UNQ_C3 GRADED FUNCTION: get_country
def get_country(city1, country1, city2, embeddings, cosine_similarity=cosine_similarity):
    """
    Input:
        city1: a string (the capital city of country1)
        country1: a string (the country of capital1)
        city2: a string (the capital city of country2)
        embeddings: a dictionary where the keys are words and the values are their embeddings
    Output:
        country: a tuple with the most likely country and its similarity score
    """
    ### START CODE HERE ###
    # store the city1, country 1, and city 2 in a set called group
    group = set((city1, country1, city2))

    # get embedding of city 1
    city1_emb = embeddings[city1]

    # get embedding of country 1
    country1_emb = embeddings[country1]

    # get embedding of city 2
    city2_emb = embeddings[city2]

    # get embedding of country 2 (it's a combination of the embeddings of country 1, city 1 and city 2)
    # Remember: King - Man + Woman = Queen
    vec = country1_emb - city1_emb + city2_emb

    # Initialize the similarity to -1 (it will be replaced by similarities that are closer to +1)
    similarity = -1

    # initialize country to an empty string
    country = ''

    # loop through all words in the embeddings dictionary
    for word in embeddings.keys():

        # first check that the word is not already in the 'group'
        if word not in group:

            # get the word embedding
            word_emb = embeddings[word]

            # calculate cosine similarity between embedding of country 2 and the word in the embeddings dictionary
            cur_similarity = cosine_similarity(vec, word_emb)

            # if the cosine similarity is more similar than the previously best similarity...
            if cur_similarity > similarity:

                # update the similarity to the new, better similarity
                similarity = cur_similarity

                # store the country as a tuple, which contains the word and the similarity
                country = (word, similarity)

    ### END CODE HERE ###
    return country
Test:
# Testing your function, note to make it more robust you can return the 5 most similar words.
get_country('Athens', 'Greece', 'Cairo', word_embeddings)
Result:
('Egypt', 0.7626821)
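As the comment in the test cell suggests, the prediction can be made more robust by returning the several most similar words instead of only the best one. Below is a sketch of such a variant; get_country_topk is a hypothetical helper, not part of the graded assignment, and it reuses the same analogy vector and cosine_similarity function as get_country:
# Hypothetical helper: return the top-k candidate countries, sorted by similarity.
def get_country_topk(city1, country1, city2, embeddings, k=5):
    group = set((city1, country1, city2))
    vec = embeddings[country1] - embeddings[city1] + embeddings[city2]
    scores = []
    for word, word_emb in embeddings.items():
        if word not in group:
            scores.append((word, cosine_similarity(vec, word_emb)))
    # sort by similarity, highest first, and keep the top k
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Example usage:
# get_country_topk('Athens', 'Greece', 'Cairo', word_embeddings, k=5)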
The accuracy formula is:
$$\text{Accuracy}=\frac{\text{Correct \# of predictions}}{\text{Total \# of predictions}}$$
Using the get_country function above, iterate over every row of the DataFrame and compute the accuracy.
# UNQ_C4 GRADED FUNCTION: get_accuracy
def get_accuracy(word_embeddings, data, get_country=get_country):
    '''
    Input:
        word_embeddings: a dictionary where the key is a word and the value is its embedding
        data: a pandas DataFrame containing the country and capital city pairs
    Output:
        accuracy: the accuracy of the model
    '''
    ### START CODE HERE ###
    # initialize num correct to zero
    num_correct = 0

    # loop through the rows of the dataframe
    for i, row in data.iterrows():
        # get city1
        city1 = row['city1']
        # get country1
        country1 = row['country1']
        # get city2
        city2 = row['city2']
        # get country2
        country2 = row['country2']

        # use get_country to find the predicted country2
        predicted_country2, _ = get_country(city1, country1, city2, word_embeddings)

        # if the predicted country2 is the same as the actual country2...
        if predicted_country2 == country2:
            # increment the number of correct by 1
            num_correct += 1

    # get the number of rows in the data dataframe (length of dataframe)
    m = len(data)

    # calculate the accuracy by dividing the number correct by m
    accuracy = num_correct / m
    ### END CODE HERE ###
    return accuracy
Test:
accuracy = get_accuracy(word_embeddings, data)
print(f"Accuracy is {accuracy:.2f}")
Result:
Accuracy is 0.92
Next, use principal component analysis (PCA) to explore the distances between word vectors after reducing their dimensionality. The word vectors are 300-dimensional, which makes them hard to visualize directly, so we use PCA to project them into a lower-dimensional space while preserving as much of the original information as possible. After visualization, similar words should cluster together. For example, 'sad', 'happy', and 'joyful' all describe emotions and should be plotted close to each other; 'oil', 'gas', and 'petroleum' all describe natural resources; and 'city', 'village', and 'town' can be treated as near-synonyms describing similar things.
The rough steps are: mean-center the data, compute the covariance matrix, compute its eigenvalues and eigenvectors, sort them in decreasing order of eigenvalue, keep the first n_components eigenvectors, and project the de-meaned data onto them:
# UNQ_C5 GRADED FUNCTION: compute_pca
def compute_pca(X, n_components=2):
    """
    Input:
        X: of dimension (m,n) where each row corresponds to a word vector
        n_components: Number of components you want to keep.
    Output:
        X_reduced: data transformed into n_components dims/columns
    """
    ### START CODE HERE ###
    # mean center the data
    X_demeaned = X - np.mean(X, axis=0)

    # calculate the covariance matrix
    covariance_matrix = np.cov(X_demeaned, rowvar=False)

    # calculate eigenvectors & eigenvalues of the covariance matrix
    eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix, UPLO='L')

    # sort eigenvalues in increasing order (get the indices from the sort)
    idx_sorted = np.argsort(eigen_vals)

    # reverse the order so that it's from highest to lowest.
    idx_sorted_decreasing = idx_sorted[::-1]

    # sort the eigenvalues by idx_sorted_decreasing
    eigen_vals_sorted = eigen_vals[idx_sorted_decreasing]

    # sort eigenvectors using the idx_sorted_decreasing indices
    eigen_vecs_sorted = eigen_vecs[:, idx_sorted_decreasing]

    # select the first n eigenvectors (n is the desired dimension
    # of the rescaled data array, or dims_rescaled_data)
    eigen_vecs_subset = eigen_vecs_sorted[:, 0:n_components]

    # transform the data by multiplying the transpose of the eigenvectors
    # with the transpose of the de-meaned data, then take the transpose of that product.
    X_reduced = np.dot(eigen_vecs_subset.transpose(), X_demeaned.transpose()).transpose()
    ### END CODE HERE ###
    return X_reduced
Test:
# Testing your function
np.random.seed(1)
X = np.random.rand(3, 10)
X_reduced = compute_pca(X, n_components=2)
print("Your original matrix was " + str(X.shape) + " and it became:")
print(X_reduced)
Result:
Your original matrix was (3, 10) and it became:
[[ 0.43437323 0.49820384]
[ 0.42077249 -0.50351448]
[-0.85514571 0.00531064]]
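If scikit-learn is available, you can cross-check compute_pca against sklearn.decomposition.PCA. This is only a verification sketch (it assumes scikit-learn is installed), and the sign of each column may differ because eigenvector directions are not unique:
# Optional cross-check with scikit-learn.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_reduced_sklearn = pca.fit_transform(X)  # X is the same (3, 10) random matrix as above
print(X_reduced_sklearn)  # should match compute_pca(X, 2) up to the sign of each column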
Next, pick 11 words and look at their positions after PCA dimensionality reduction:
words = ['oil', 'gas', 'happy', 'sad', 'city', 'town',
'village', 'country', 'continent', 'petroleum', 'joyful']
# given a list of words and the embeddings, it returns a matrix with all the embeddings
X = get_vectors(word_embeddings, words)
print('You have 11 words each of 300 dimensions thus X.shape is:', X.shape)
Result:
You have 11 words each of 300 dimensions thus X.shape is: (11, 300)
Reduce the dimensionality and plot the result:
# We have done the plotting for you. Just run this cell.
result = compute_pca(X, 2)
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0] - 0.05, result[i, 1] + 0.1))
plt.show()
Result: a scatter plot of the 11 words projected onto their first two principal components.