I have been working through the NLP course on GitHub (https://github.com/yandexdataschool/nlp_course). Below is my implementation of the seminar notebook `seminar.ipynb` from the first lecture on embeddings: https://github.com/yandexdataschool/nlp_course/blob/master/week01_embeddings/seminar.ipynb
Once you have worked through that seminar, you should have a good grasp of how to use Word2Vec.
Notes:
1. After finishing each task I comment out the corresponding code. For example, once Word2Vec is trained and the model is saved, I comment out the training code and from then on just load the locally saved model to continue with the remaining tasks.
2. You do not have to keep every assert from the tutorial. Some of the parameter checks are overly strict; they can be relaxed or commented out without affecting the experiment.
3. Train the model on a Linux server if you can. On my laptop training ran for hours without finishing, and gensim kept warning with something like "no C compiler, training slow"; I could not find a fix on any blog or on Stack Overflow. On a server the training finished within 10 seconds. Save the trained model to a file on the server and download it to your laptop for the remaining tasks, so you do not retrain every time (a minimal save/load round-trip is sketched right after this list).
4. Related background and reference links are collected at the end of this post.
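For note 3, here is a minimal sketch of that save/download/load round-trip, assuming the gensim 3.x API that the code below uses (the toy corpus and the file name are only illustrative):

```python
import gensim.models.word2vec as w2v

# Toy tokenized corpus; the real code builds data_tok from quora.txt.
data_tok = [["how", "do", "i", "learn", "nlp"], ["what", "is", "word2vec"]] * 100

# On the server: train once and save the model to disk.
model = w2v.Word2Vec(data_tok, sg=1, size=32, min_count=5, window=5)
model.save("model_quora.w2v")

# On the laptop: after downloading the file, load it instead of retraining.
model = w2v.Word2Vec.load("model_quora.w2v")
print("word2vec" in model.wv.vocab)  # True
```

The full implementation follows.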
```python
import re
import numpy as np
import multiprocessing
from nltk.tokenize import WordPunctTokenizer
import gensim.models.word2vec as w2v
import gensim.downloader as api
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook, output_file

def check_data_tok(data_tok):
    # The asserts below verify that data_tok has the expected format:
    # every row must be a list/tuple of lowercase string tokens, otherwise the assert fails with the given message.
    assert all(isinstance(row, (list, tuple)) for row in data_tok), "please convert each line into a list of tokens (strings)"
    assert all(all(isinstance(tok, str) for tok in row) for row in data_tok), "please convert each line into a list of tokens (strings)"
    is_latin = lambda tok: all('a' <= x.lower() <= 'z' for x in tok)
    assert all(map(lambda l: not is_latin(l) or l.islower(), map(' '.join, data_tok))), "please make sure to lowercase the data"

    print([' '.join(row) for row in data_tok[:2]])
    # data_tok[:2] holds the first two tokenized sentences; join each with spaces and print them.

def Data_Process():
    raw = list(open("lecture1_Embedding/quora.txt", 'rb'))  # must open in binary mode, otherwise it raises an error
    # lightly clean the corpus: drop every character outside a-z / A-Z
    sentences = []
    for line in raw:
        sentences.append(re.sub("[^a-zA-Z]", " ", line.decode('utf-8')))

    print(sentences[0])  # first sentence

    # a typical first step for an nlp task is to split raw data into words
    tokenizer = WordPunctTokenizer()
    print(tokenizer.tokenize(sentences[1].lower()))  # split the second sentence into words

    data_tok = []  # data_tok should be a list of lists of tokens for each line in data. + lowercase

    cnt = 0
    for sentence in sentences:
        word_list_of_a_sentence = tokenizer.tokenize(sentence.lower())
        data_tok.append(word_list_of_a_sentence)
        cnt += len(word_list_of_a_sentence)
    print("cnt=", cnt)  # total number of tokens in the corpus

    # check_data_tok(data_tok)  # verify that the format is correct
    return data_tok

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue', width=600, height=400, show=True, **kwargs):
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({'x': x, 'y': y, 'color': color, **kwargs})

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    output_file("patch.html")  # showing the plot inside the notebook failed, so render it to an HTML page instead
    pl.show(fig)

def Use_pre_train():
    model = api.load('glove-twitter-100')
    # sanity-check the pre-trained vectors
    print(model.most_similar(positive=["coder", "money"], negative=["brain"]))
    print(model.most_similar(['girl', 'father'], ['boy'], topn=3))
    print(model.most_similar(positive=['woman', 'king'], negative=['man']))
    return model

def Train_by_myself(data_tok):
    # model = w2v.Word2Vec(
    #     sg=1,
    #     seed=1,
    #     size=32,
    #     min_count=5,
    #     window=5,
    #     workers=multiprocessing.cpu_count(),
    #     sample=1e-3
    # )
    # model.build_vocab(data_tok)  # build the vocabulary from corpus word frequencies (with hierarchical softmax this yields a Huffman tree, so frequent words are reached faster during training)
    # print("model vocabulary length:", len(model.wv.vocab))
    # model.train(data_tok, total_examples=model.corpus_count, epochs=2)
    # model.save("model_quora.w2v")  # save the trained model

    model = w2v.Word2Vec.load("lecture1_Embedding/model_quora.w2v")  # load the previously trained model

    # sanity-check the model
    print(model.wv.get_vector('anything'))
    if 'abcdefg' in model:  # membership test on the model itself works in gensim 3.x
        print("in model")
    else:
        print("not in model!!")
    print(model.most_similar('father'))
    print(model.most_similar('bread'))

    return model

def Visualize_Word(model):
    # plot 1000 most frequent words
    words = sorted(model.wv.vocab.keys(),
                   key=lambda word: model.wv.vocab[word].count,
                   reverse=True  # reverse=True sorts in descending order, reverse=False in ascending order
                   )[:1000]
    print('100 most frequent words:', words[:100])  # first 100 entries of the frequency-sorted list

    word_vectors_1000 = []
    for word in words:
        word_vectors_1000.append(model.wv.get_vector(word))

    # assert isinstance(word_vectors_1000, np.ndarray)
    # assert word_vectors_1000.shape == (len(words), 100)
    # assert np.isfinite(word_vectors_1000).all()

    # reduce to 2 dimensions with PCA
    pca = PCA(n_components=2)
    word_vectors_pca_1000 = pca.fit_transform(word_vectors_1000)
    print(word_vectors_pca_1000[:10])

    # reduce to 2 dimensions with t-SNE
    tsne = TSNE()
    word_tsne = tsne.fit_transform(word_vectors_1000)

    # standardize to zero mean and unit variance
    std = StandardScaler()
    word_vectors_pca_1000 = std.fit_transform(word_vectors_pca_1000)
    word_tsne = std.fit_transform(word_tsne)

    assert word_vectors_pca_1000.shape == (len(word_vectors_pca_1000), 2), "there must be a 2d vector for each word"
    assert max(abs(word_vectors_pca_1000.mean(0))) < 1e-5, "points must be zero-centered"
    assert max(abs(1.0 - word_vectors_pca_1000.std(0))) < 1e-2, "points must have unit variance"

    # draw_vectors(word_vectors_pca_1000[:, 0], word_vectors_pca_1000[:, 1], token=words)
    draw_vectors(word_tsne[:, 0], word_tsne[:, 1], token=words)

def get_phrase_embedding(model, phrase):
    # Convert a phrase to a vector by averaging its word embeddings (see the notebook description).
    vector = np.zeros([model.vector_size], dtype='float32')
    phrase = re.sub("[^a-zA-Z]", " ", phrase)
    tokenizer = WordPunctTokenizer()
    word_list_of_phrase = tokenizer.tokenize(phrase.lower())
    word_cnt = 0
    for word in word_list_of_phrase:
        if word in model:
            word_cnt += 1
            vector += model.wv.get_vector(word)
    if word_cnt != 0:
        vector /= word_cnt

    return vector

def Get_Phrases(number):
    raw = list(open("lecture1_Embedding/quora.txt", 'rb'))  # must open in binary mode, otherwise it raises an error
    # lightly clean the corpus: drop every character outside a-z / A-Z
    phrases = []
    phrase_cnt = 0
    for line in raw:
        phrases.append(re.sub("[^a-zA-Z]", " ", line.decode('utf-8')))
        phrase_cnt += 1
        if phrase_cnt >= number:
            break

    return phrases

def cosine_similarity(vector1, vector2):
    dot_product = 0.0
    normA = 0.0
    normB = 0.0
    for a, b in zip(vector1, vector2):
        dot_product += a * b
        normA += a ** 2
        normB += b ** 2
    if normA == 0.0 or normB == 0.0:
        return 0
    else:
        return dot_product / ((normA * normB) ** 0.5)

def Visualize_Phrases(model):
    vector_test = get_phrase_embedding(model, "I'm very sure. This never happened to me before...")  # quick test

    chosen_phrases = Get_Phrases(1000)
    phrase_vectors = []
    for phrase in chosen_phrases:
        phrase_vectors.append(get_phrase_embedding(model, phrase))

    tsne = TSNE()  # reduce to 2 dimensions with t-SNE
    phrase_tsne = tsne.fit_transform(phrase_vectors)
    std = StandardScaler()  # standardize to zero mean and unit variance
    phrase_tsne = std.fit_transform(phrase_tsne)

    draw_vectors(phrase_tsne[:, 0], phrase_tsne[:, 1],
                 phrase=[phrase[:50] for phrase in chosen_phrases],
                 radius=20)

def find_nearest(model, phrases_all, phrases_all_vectors, query, k=10):
    query_vector = get_phrase_embedding(model, query)
    cosine_dis = []
    for phrase_vector in phrases_all_vectors:
        cosine_dis.append(-1 * cosine_similarity(phrase_vector, query_vector))
    sorted_index = np.argsort(cosine_dis)  # negate so that argsort yields descending similarity (a cosine close to 1 means very similar)
    topk_phrases = []
    for i in range(k):
        topk_phrases.append(phrases_all[sorted_index[i]])

    return topk_phrases

def SimilarQuestion(model):
    phrases_all = Get_Phrases(100000)
    phrases_all_vectors = np.array([get_phrase_embedding(model, phrase) for phrase in phrases_all])
    print(find_nearest(model, phrases_all, phrases_all_vectors, query="How do i enter the matrix?", k=10))
    print(find_nearest(model, phrases_all, phrases_all_vectors, query="How does Trump?", k=10))
    print(find_nearest(model, phrases_all, phrases_all_vectors, query="Why don't i ask a question myself?", k=10))

if __name__ == "__main__":
    data_tok = []
    # data_tok = Data_Process()
    # model = Use_pre_train()
    model = Train_by_myself(data_tok)

    # Visualize_Word(model)
    # Visualize_Phrases(model)

    SimilarQuestion(model)
```
Related background and reference links:
Seminar notebook:
https://github.com/yandexdataschool/nlp_course/blob/master/week01_embeddings/seminar.ipynb
Word2Vec overview:
https://blog.csdn.net/weixin_41519463/article/details/89716312
Analyzing Game of Thrones with word embeddings; how to use gensim Word2Vec:
https://www.jianshu.com/p/b996e7e0d0b0
https://blog.csdn.net/sinat_26917383/article/details/69803018
https://www.jianshu.com/p/0702495e21de
sklearn PCA:
https://www.jianshu.com/p/8642d5ea5389
Sorting with sorted and a lambda key:
https://www.cnblogs.com/zle1992/p/6271105.html
Zero-mean normalization (also called standardization) rescales the raw input data so that each feature has zero mean and unit variance.
Normalizing and standardizing features with scikit-learn:
https://blog.csdn.net/sinat_29957455/article/details/79490165
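A minimal sketch of that standardization step (the same pattern used in Visualize_Word above), assuming only numpy and scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

points = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = StandardScaler().fit_transform(points)  # subtract each column's mean, divide by its std

print(scaled.mean(axis=0))  # approximately [0, 0]
print(scaled.std(axis=0))   # approximately [1, 1]
```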
**kwargs: any extra keyword arguments passed to a function are collected into a dict named kwargs.
def foo(x, **kwargs):
    print(x)
    print(kwargs)
foo(1, y=1, a=2, b=3, c=4)  # y=1, a=2, b=3, c=4 end up in the kwargs dict
Plotting with bokeh:
https://blog.csdn.net/tankloverainbow/article/details/80442289
Python slicing:
https://www.jianshu.com/p/15715d6f4dad
t-SNE dimensionality reduction:
https://www.cnblogs.com/cygalaxy/p/6934197.html
After training word2vec with Python gensim, suppose the resulting model is stored in the variable word2vecModel; you can check whether a word is in its vocabulary with: if word in word2vecModel (newer gensim versions prefer if word in word2vecModel.wv).
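A small sketch of that membership check, assuming a model saved like the one above (the file name is only illustrative):

```python
from gensim.models import Word2Vec

word2vecModel = Word2Vec.load("model_quora.w2v")

for word in ["father", "zzzzzz"]:
    # gensim 3.x also allows `word in word2vecModel`; checking the KeyedVectors is more portable
    print(word, word in word2vecModel.wv)
```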
numpy's allclose compares two arrays element-wise and returns True only if every pair of elements is equal within a tolerance (the default relative tolerance is 1e-05).
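A quick illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.0000001])

print(np.allclose(a, b))        # True: the difference is within the default tolerance
print(np.allclose(a, b + 0.1))  # False: the difference exceeds the tolerance
```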