C1W3.Assignment: Hello Vectors

Theory lecture: C1W3.Vector Space Models


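  • Predict analogies between words, for example: man is to woman as king is to ??
  • Use PCA to reduce the dimensionality of the word embeddings and plot them in two dimensions.
  • Compare word embeddings using a similarity measure (cosine similarity).
  • Understand how these vector space models work.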

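Predict the Countries from Capitals

Task: the lecture illustrated word analogies by finding the capital of a given country. Here the direction is reversed: write a program that, given a capital city, returns the corresponding country.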

Importing the data

Import the packages; install any that are missing with pip install.

# Run this cell to import packages.
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import w3_unittest

from utils import get_vectors
The dataset is loaded as a Pandas DataFrame. If the data is large, this may take a few minutes.

data = pd.read_csv('./data/capitals.txt', delimiter=' ')
data.columns = ['city1', 'country1', 'city2', 'country2']

# print first five elements in the DataFrame
data.head(5)
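Because the original Google News word embedding dataset is about 3.64 GB, readers with enough bandwidth and disk space can download it themselves, extract the sample of words analyzed in this assignment, and save it in a pickle file named word_embeddings_subset.p.

  • Download the pretrained Google News word embedding dataset from its download page.
  • Search the page for "GoogleNews-vectors-negative300.bin.gz" and click the link to download it.
  • Decompress the file.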
import nltk
from gensim.models import KeyedVectors

embeddings = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary = True)
f = open('capitals.txt', 'r').read()
set_words = set(nltk.word_tokenize(f))
select_words = words = ['king', 'queen', 'oil', 'gas', 'happy', 'sad', 'city', 'town', 'village', 'country', 'continent', 'petroleum', 'joyful']
for w in select_words:
    set_words.add(w)

def get_word_embeddings(embeddings):

    word_embeddings = {}
    # Note: gensim >= 4.0 removed `vocab`; use `embeddings.key_to_index` there instead
    for word in embeddings.vocab:
        if word in set_words:
            word_embeddings[word] = embeddings[word]
    return word_embeddings

# Testing your function
word_embeddings = get_word_embeddings(embeddings)
pickle.dump( word_embeddings, open( "word_embeddings_subset.p", "wb" ) )
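
The loop over `embeddings.vocab` above depends on the gensim version: to my knowledge, gensim 4.0 removed `KeyedVectors.vocab` in favor of `key_to_index`. Below is a minimal sketch of a version-tolerant variant. `iter_vocab` and `subset_embeddings` are hypothetical helper names, not part of the assignment or of gensim, and the stand-in objects only mimic the two attributes so the snippet runs without the 3.64 GB file.

```python
from types import SimpleNamespace


def iter_vocab(kv):
    """Return the vocabulary words of a KeyedVectors-like object.

    Assumption: gensim >= 4.0 exposes `key_to_index`, older versions `vocab`.
    """
    vocab = getattr(kv, "key_to_index", None)
    if vocab is None:
        vocab = kv.vocab
    return list(vocab)


def subset_embeddings(kv, wanted, lookup=None):
    """Keep only the words in `wanted`; `lookup` defaults to kv[word]."""
    lookup = lookup or (lambda w: kv[w])
    return {w: lookup(w) for w in iter_vocab(kv) if w in wanted}


# Stand-ins for the two API generations (no gensim needed to try this).
old_style = SimpleNamespace(vocab={"king": 0, "queen": 1, "the": 2})
new_style = SimpleNamespace(key_to_index={"king": 0, "queen": 1, "the": 2})
vectors = {"king": [1.0], "queen": [2.0], "the": [3.0]}

subset_old = subset_embeddings(old_style, {"king", "queen"}, vectors.get)
subset_new = subset_embeddings(new_style, {"king", "queen"}, vectors.get)
print(sorted(subset_old), sorted(subset_new))
```

With a real model, `subset_embeddings(embeddings, set_words)` would play the role of `get_word_embeddings(embeddings)`.
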
Alternatively, download word_embeddings_subset.p directly from the attached resources, save it in the data directory, and load it as a dictionary.

word_embeddings = pickle.load(open("./data/word_embeddings_subset.p", "rb"))
len(word_embeddings)  # there should be 243 words that will be used in this assignment
print("dimension: {}".format(word_embeddings['Spain'].shape[0]))
dimension: 300

Predict relationships among words


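  • The function takes three words as input.
  • The first two words are related to each other.
  • It predicts a fourth word whose relationship to the third word is similar to the relationship between the first two.
  • For example, "Athens is to Greece as Bangkok is to ______"?
  • Write a program that can find the fourth word.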
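
The idea behind these bullets can be illustrated with a toy example. The 2-D vectors below are made up for illustration (the real embeddings have 300 dimensions), and `toy` and `predict_fourth` are hypothetical names. The fourth word is taken to be the one nearest, in plain Euclidean distance, to `word2 - word1 + word3`.

```python
import numpy as np

# Made-up 2-D "embeddings": first axis ~ region, second axis ~ city vs. country.
toy = {
    "Athens":   np.array([1.0, 0.0]),
    "Greece":   np.array([1.0, 1.0]),
    "Bangkok":  np.array([3.0, 0.0]),
    "Thailand": np.array([3.0, 1.0]),
    "Cairo":    np.array([5.0, 0.0]),
    "Egypt":    np.array([5.0, 1.0]),
}


def predict_fourth(word1, word2, word3, embeddings):
    """Return the word nearest (Euclidean) to word2 - word1 + word3."""
    target = embeddings[word2] - embeddings[word1] + embeddings[word3]
    # The three input words are excluded from the candidates.
    candidates = [w for w in embeddings if w not in (word1, word2, word3)]
    return min(candidates, key=lambda w: np.linalg.norm(embeddings[w] - target))


print(predict_fourth("Athens", "Greece", "Bangkok", toy))  # Thailand
```

In this toy space the "capital to country" offset is the same for every pair, which is why the arithmetic lands exactly on the right word; real embeddings only approximate this.
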

Cosine Similarity

$$\cos (\theta)=\frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}=\frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}\tag{1}$$
$A$ and $B$ are word vectors, and $A_i$ and $B_i$ are the $i$-th elements of those vectors.

  • If $A$ and $B$ are identical, then $\cos(\theta)=1$.
  • If they are exact opposites, that is $A=-B$, then $\cos(\theta)=-1$.
  • If $\cos(\theta)=0$, they are orthogonal (perpendicular).
  • Values between 0 and 1 indicate a similarity score.
  • Values between -1 and 0 indicate a dissimilarity score.
# UNQ_C1 GRADED FUNCTION: cosine_similarity

def cosine_similarity(A, B):
    '''
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        cos: numerical number representing the cosine similarity between A and B.
    '''

    ### START CODE HERE ###
    dot = np.dot(A,B)
    norma = np.sqrt(np.dot(A,A))
    normb = np.sqrt(np.dot(B,B))
    cos = dot/(norma*normb)

    ### END CODE HERE ###
    return cos
# feel free to try different words
king = word_embeddings['king']
queen = word_embeddings['queen']

cosine_similarity(king, queen)
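
The properties listed for cosine similarity can be checked on small hand-made vectors. This is only a sanity-check sketch: `cos_sim` is a local re-definition so the snippet runs on its own, and the vectors are arbitrary rather than taken from the embeddings.

```python
import numpy as np


def cos_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


a = np.array([1.0, 2.0, 3.0])
orthogonal_to_a = np.array([3.0, 0.0, -1.0])  # dot product with a is 0

print(cos_sim(a, a))                # same direction     -> 1 (up to rounding)
print(cos_sim(a, -a))               # opposite direction -> -1 (up to rounding)
print(cos_sim(a, orthogonal_to_a))  # orthogonal         -> 0
print(cos_sim(a, 10 * a))           # rescaling keeps the angle -> 1 (up to rounding)
```
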
Euclidean distance

$$d(\mathbf{A}, \mathbf{B})=d(\mathbf{B}, \mathbf{A})=\sqrt{\left(A_{1}-B_{1}\right)^{2}+\left(A_{2}-B_{2}\right)^{2}+\cdots+\left(A_{n}-B_{n}\right)^{2}}=\sqrt{\sum_{i=1}^{n}\left(A_{i}-B_{i}\right)^{2}}$$

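  • $n$ is the number of elements in the vector.
  • $A$ and $B$ are the corresponding word vectors.
  • The more similar the words, the more likely the Euclidean distance is close to 0.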

def euclidean(A, B):
    """
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        d: numerical number representing the Euclidean distance between A and B.
    """

    ### START CODE HERE ###

    # euclidean distance    
    d = np.sqrt(np.sum((A-B)**2))

    ### END CODE HERE ###

    return d
# Test your function
euclidean(king, queen)
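
One practical difference between the two measures: Euclidean distance is sensitive to vector length, whereas cosine similarity only depends on direction. A small sketch with arbitrary vectors (not taken from the embeddings) makes this visible:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = 2 * a  # same direction as a, twice as long

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
distance = np.linalg.norm(a - b)  # equivalent to np.sqrt(np.sum((a - b) ** 2))

print(cosine)    # 1.0: the directions coincide
print(distance)  # 5.0: the end points are still far apart
```
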
Finding the country of each capital

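For example, given the inputs
1: Athens 2: Greece 3: Baghdad,
the predicted result should be 4: Iraq.

You may want to refer to the King - Man + Woman = Queen example and use the word embeddings and a similarity function to turn that scheme into a mathematical function.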

# UNQ_C3 GRADED FUNCTION: get_country

def get_country(city1, country1, city2, embeddings, cosine_similarity=cosine_similarity):
    """
    Input:
        city1: a string (the capital city of country1)
        country1: a string (the country of capital1)
        city2: a string (the capital city of country2)
        embeddings: a dictionary where the keys are words and the values are their embeddings
    Output:
        country: a tuple (word, similarity) with the most likely country and its similarity score
    """
    ### START CODE HERE ###

    # store the city1, country 1, and city 2 in a set called group
    group = set((city1, country1, city2))

    # get embeddings of city 1
    city1_emb = embeddings[city1]

    # get embedding of country 1
    country1_emb = embeddings[country1]

    # get embedding of city 2
    city2_emb = embeddings[city2]

    # get embedding of country 2 (it's a combination of the embeddings of country 1, city 1 and city 2)
    # Remember: King - Man + Woman = Queen
    vec = country1_emb-city1_emb+city2_emb

    # Initialize the similarity to -1 (it will be replaced by similarities that are closer to +1)
    similarity = -1

    # initialize country to an empty string
    country = ''

    # loop through all words in the embeddings dictionary
    for word in embeddings.keys():

        # first check that the word is not already in the 'group'
        if word not in group:

            # get the word embedding
            word_emb = embeddings[word]

            # calculate cosine similarity between embedding of country 2 and the word in the embeddings dictionary
            cur_similarity = cosine_similarity(vec,word_emb)

            # if the cosine similarity is more similar than the previously best similarity...
            if cur_similarity > similarity:

                # update the similarity to the new, better similarity
                similarity = cur_similarity

                # store the country as a tuple, which contains the word and the similarity
                country = (word,similarity)

    ### END CODE HERE ###

    return country
# Testing your function, note to make it more robust you can return the 5 most similar words.
get_country('Athens', 'Greece', 'Cairo', word_embeddings)
('Egypt', 0.7626821)
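
As the comment in the test cell suggests, returning several candidates is more robust than returning a single word. A minimal sketch of that idea follows. `get_top_k_countries` is a hypothetical name, not part of the graded assignment, and the tiny embedding dictionary is made up so the snippet runs without the pickle file.

```python
import numpy as np


def get_top_k_countries(city1, country1, city2, embeddings, k=5):
    """Return up to k (word, cosine similarity) pairs, best first."""
    group = {city1, country1, city2}
    target = embeddings[country1] - embeddings[city1] + embeddings[city2]

    # Stack all candidate vectors so the similarities are computed in one shot.
    words = [w for w in embeddings if w not in group]
    matrix = np.stack([embeddings[w] for w in words])
    sims = matrix @ target / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(target))

    best = np.argsort(sims)[::-1][:k]  # indices of the k largest similarities
    return [(words[i], float(sims[i])) for i in best]


# Made-up 3-D vectors, for illustration only.
toy = {
    "Athens": np.array([1.0, 0.0, 0.2]),
    "Greece": np.array([1.0, 1.0, 0.2]),
    "Cairo":  np.array([0.0, 0.0, 1.0]),
    "Egypt":  np.array([0.0, 1.0, 1.0]),
    "Paris":  np.array([-1.0, 0.0, 0.1]),
    "France": np.array([-1.0, 1.0, 0.1]),
}
top2 = get_top_k_countries("Athens", "Greece", "Cairo", toy, k=2)
print(top2)  # Egypt first, then France
```

Inspecting the runners-up is a quick way to see whether a wrong prediction was a near miss or a completely unrelated word.
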

Model Accuracy

$$\text{Accuracy}=\frac{\text{Correct \# of predictions}}{\text{Total \# of predictions}}$$

# UNQ_C4 GRADED FUNCTION: get_accuracy

def get_accuracy(word_embeddings, data, get_country=get_country):
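    '''
    Input:
        word_embeddings: a dictionary where the key is a word and the value is its embedding
        data: a pandas DataFrame with the columns city1, country1, city2 and country2
    Output:
        accuracy: the fraction of rows for which the predicted country2 is correct
    '''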


    ### START CODE HERE ###
    # initialize num correct to zero
    num_correct = 0

    # loop through the rows of the dataframe
    for i, row in data.iterrows():

        # get city1
        city1 = row['city1']

        # get country1
        country1 = row['country1']

        # get city2
        city2 = row['city2']
        # get country2
        country2 = row['country2']

        # use get_country to find the predicted country2
        predicted_country2, _ = get_country(city1,country1,city2,word_embeddings)

        # if the predicted country2 is the same as the actual country2...
        if predicted_country2 == country2:
            # increment the number of correct by 1
            num_correct += 1

    # get the number of rows in the data dataframe (length of dataframe)
    m = len(data)

    # calculate the accuracy by dividing the number correct by m
    accuracy = num_correct /m

    ### END CODE HERE ###
    return accuracy

accuracy = get_accuracy(word_embeddings, data)
print(f"Accuracy is {accuracy:.2f}")
Accuracy is 0.92
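
The same ratio can be checked by hand on a tiny table. In the sketch below, the DataFrame rows and `stub_get_country` (a lookup table standing in for `get_country`) are made up for illustration; one of the three rows is deliberately answered wrongly, so the accuracy comes out as 2/3.

```python
import pandas as pd

toy_data = pd.DataFrame(
    [["Athens", "Greece", "Cairo", "Egypt"],
     ["Athens", "Greece", "Bangkok", "Thailand"],
     ["Athens", "Greece", "Bern", "Switzerland"]],
    columns=["city1", "country1", "city2", "country2"],
)

# Hypothetical predictor: the last answer is wrong on purpose.
answers = {"Cairo": "Egypt", "Bangkok": "Thailand", "Bern": "Austria"}


def stub_get_country(city1, country1, city2, embeddings=None):
    """Mimic get_country's return shape: (word, similarity)."""
    return answers[city2], 1.0


num_correct = sum(
    stub_get_country(r.city1, r.country1, r.city2)[0] == r.country2
    for r in toy_data.itertuples(index=False)
)
accuracy = num_correct / len(toy_data)
print(f"Accuracy is {accuracy:.2f}")  # 2 correct out of 3 -> 0.67
```
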

Plotting the vectors using PCA

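Next we use principal component analysis (PCA) to explore the distances between word vectors after their dimensionality has been reduced. The word vectors are 300-dimensional, which makes them hard to visualize, so we use PCA to project them onto a lower-dimensional space while losing as little of the original information as possible. In the resulting plot, similar words should cluster together. For example, "sad", "happy", and "joyful" all describe emotions and should lie close to each other. The words "oil", "gas", and "petroleum" all describe natural resources. Words such as "city", "village", and "town" can be regarded as near-synonyms that describe similar things.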

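  1. Mean-normalize the data.
  2. Compute the covariance matrix ($\Sigma$) of the data.
  3. Compute the eigenvectors and eigenvalues of the covariance matrix.
  4. Multiply the normalized data by the first K eigenvectors (those with the largest eigenvalues) to obtain the projected data.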


# UNQ_C5 GRADED FUNCTION: compute_pca

def compute_pca(X, n_components=2):
    """
    Input:
        X: a 2D NumPy array of dimension (m,n) where each row corresponds to a word vector
        n_components: Number of components you want to keep.
    Output:
        X_reduced: the data projected onto the first n_components principal components,
                   of dimension (m, n_components)
    """

    ### START CODE HERE ###
    # mean center the data
    X_demeaned = X - np.mean(X,axis=0)

    # calculate the covariance matrix
    covariance_matrix = np.cov(X_demeaned, rowvar=False)

    # calculate eigenvectors & eigenvalues of the covariance matrix
    eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix, UPLO='L')

    # sort eigenvalue in increasing order (get the indices from the sort)
    idx_sorted = np.argsort(eigen_vals)
    # reverse the order so that it's from highest to lowest.
    idx_sorted_decreasing = idx_sorted[::-1]

    # sort the eigen values by idx_sorted_decreasing
    eigen_vals_sorted = eigen_vals[idx_sorted_decreasing]

    # sort eigenvectors using the idx_sorted_decreasing indices
    eigen_vecs_sorted = eigen_vecs[:,idx_sorted_decreasing]

    # select the first n eigenvectors (n is desired dimension
    # of rescaled data array, or dims_rescaled_data)
    eigen_vecs_subset = eigen_vecs_sorted[:,0:n_components]

    # transform the data by multiplying the transpose of the eigenvectors with the transpose of the de-meaned data
    # Then take the transpose of that product.
    X_reduced = np.dot(eigen_vecs_subset.transpose(),X_demeaned.transpose()).transpose()

    ### END CODE HERE ###

    return X_reduced
# Testing your function
X = np.random.rand(3, 10)
X_reduced = compute_pca(X, n_components=2)
print("Your original matrix was " + str(X.shape) + " and it became:")
print(X_reduced)
Your original matrix was (3, 10) and it became:
[[ 0.43437323 0.49820384]
[ 0.42077249 -0.50351448]
[-0.85514571 0.00531064]]
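
One way to sanity-check an eigendecomposition-based PCA is to compare it with an SVD of the centered data: both should give the same projection, up to the sign of each component. The sketch below re-implements the steps compactly so it runs on its own; `pca_eigh` and `pca_svd` are hypothetical names, and the random matrix is arbitrary test data.

```python
import numpy as np


def pca_eigh(X, n_components=2):
    """PCA via eigendecomposition of the covariance matrix."""
    X_demeaned = X - X.mean(axis=0)
    eigen_vals, eigen_vecs = np.linalg.eigh(np.cov(X_demeaned, rowvar=False))
    order = np.argsort(eigen_vals)[::-1][:n_components]  # largest eigenvalues first
    return X_demeaned @ eigen_vecs[:, order]


def pca_svd(X, n_components=2):
    """PCA via singular value decomposition of the centered data."""
    X_demeaned = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_demeaned, full_matrices=False)
    return X_demeaned @ Vt[:n_components].T


rng = np.random.default_rng(0)
X_check = rng.random((20, 5))

A = pca_eigh(X_check)
B = pca_svd(X_check)
# Eigenvectors are only defined up to sign, so compare absolute values.
print(np.allclose(np.abs(A), np.abs(B)))
```

The sign ambiguity is also why two correct PCA implementations can produce plots that are mirror images of each other.
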


words = ['oil', 'gas', 'happy', 'sad', 'city', 'town',
         'village', 'country', 'continent', 'petroleum', 'joyful']

# given a list of words and the embeddings, it returns a matrix with all the embeddings
X = get_vectors(word_embeddings, words)

print('You have 11 words each of 300 dimensions thus X.shape is:', X.shape)
You have 11 words each of 300 dimensions thus X.shape is: (11, 300)

# We have done the plotting for you. Just run this cell.
result = compute_pca(X, 2)
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0] - 0.05, result[i, 1] + 0.1))

plt.show()


