Word vectors are a fundamental building block of natural language processing and support the analysis of text, sentiment, word meaning, and more. The core idea is to map each word to a dense vector so that similar words end up with similar vectors.
1. Word Vector Representations
Word vectors are usually represented in one of two ways: discrete or distributed. The discrete form is commonly called the one-hot representation. Its drawback is that it cannot express relationships between words; its advantage is that in such a high-dimensional space many tasks become linearly separable.
The distributed form is usually called the distributed representation: each word is mapped to a continuous, fixed-length dense vector. Its advantage is that it can express distance (similarity) relations between words, with each dimension loosely encoding some latent feature.
One practical difference is that with one-hot features individual dimensions can be pruned, since each corresponds to a single word, whereas with distributed representations they cannot, because every dimension contributes to every word's encoding. A small illustration of how similarity behaves under the two representations follows below.
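As a quick illustration, here is a minimal sketch with a made-up three-word vocabulary; the dense vectors are invented purely for demonstration and are not trained values:

import numpy as np

# Hypothetical toy vocabulary; the dense vectors below are made up for illustration.
vocab = ['king', 'queen', 'apple']

# One-hot: one dimension per word, so any two distinct words have cosine similarity 0.
one_hot = np.eye(len(vocab))

# Distributed: low-dimensional dense vectors in which similar words end up close together.
dense = np.array([[0.8, 0.1, 0.6],   # king
                  [0.7, 0.2, 0.6],   # queen
                  [0.0, 0.9, 0.1]])  # apple

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(one_hot[0], one_hot[1]))  # 0.0 -- one-hot cannot express relatedness
print(cosine(dense[0], dense[1]))      # close to 1 -- 'king' and 'queen' are similar
print(cosine(dense[0], dense[2]))      # noticeably smaller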
2. Training Word Vectors
2.1 Count-based methods
2.1.1 Co-occurrence matrix
Count how often words co-occur within a fixed window, and use the counts of the words co-occurring around a word as that word's vector. Such a matrix alleviates, to some degree, the problem that any two one-hot vectors have similarity 0, but it does not solve the sparsity or the high dimensionality of the data.
2.1.2 Singular value decomposition (SVD)
To address the problems of the co-occurrence matrix, the raw high-dimensional vectors are reduced in dimension to obtain dense, continuous word vectors. Applying SVD to the co-occurrence matrix yields orthogonal factor matrices; the rows of the left singular matrix, after normalization, are used as the word vectors.
The advantage of this method is that it reflects, to some extent, which words are semantically close and captures some linear relationships between words. However, because many word pairs never co-occur, the matrix is extremely sparse, word frequencies need extra processing (e.g. reweighting) to get good results, and the matrix itself is very large and high-dimensional, making the decomposition costly.
Code for building word vectors from a co-occurrence matrix:
# Build a word-word co-occurrence matrix and extract word vectors from it
import collections

file_path = "D:\workspace\project\\NLPcase\\word2vec\\data\\data.txt"
model_path = "D:\workspace\project\\NLPcase\\word2vec\\model\\skipgram_word2vec.txt"
min_count = 5        # minimum word frequency
word_demension = 200 # vector dimension
window_size = 5      # context window size

def load_data(file_path=file_path):
    dataset = []
    for line in open(file_path, encoding='utf-8'):
        line = line.strip().split(',')
        dataset.append([word for word in line[1].split(' ')
                        if 'nbsp' not in word and len(word) >= 1])
    return dataset

dataset = load_data()

# Count all words and keep those above the frequency threshold
def build_word_dict():
    words = []
    for data in dataset:
        words.extend(data)
    reserved_words = [item for item in collections.Counter(words).most_common()
                      if item[1] > min_count]
    word_dict = {item[0]: item[1] for item in reserved_words}
    return word_dict

# Build co-occurrence counts over context windows
def build_word2word_dict():
    word2word_dict = {}
    for data in dataset:
        for index in range(len(data)):
            left = data[max(index - window_size, 0):index]
            right = data[index + 1:index + window_size + 1]
            context = left + [data[index]] + right  # the window around the current word
            word = data[index]
            if word not in word2word_dict:
                word2word_dict[word] = {}
            for co_word in context:
                if co_word == word:
                    continue
                word2word_dict[word][co_word] = word2word_dict[word].get(co_word, 0) + 1
    return word2word_dict

# Build the co-occurrence matrix (each row is a word vector)
def build_word2word_matrix():
    word2word_dict = build_word2word_dict()
    word_dict = build_word_dict()
    word_list = list(word_dict)  # only the keys, i.e. the vocabulary
    word2word_matrix = []
    for word1 in word_list:
        temp = []
        sumtf = sum(word2word_dict[word1].values())
        for word2 in word_list:
            weight = word2word_dict[word1].get(word2, 0) / sumtf  # normalized co-occurrence
            temp.append(weight)
        word2word_matrix.append(temp)
    return word2word_matrix
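To connect this with section 2.1.2, the short sketch below (an addition, not part of the original code) applies SVD to the matrix returned by build_word2word_matrix and keeps the first word_demension singular directions as dense word vectors:

import numpy as np

# A minimal sketch: reduce the co-occurrence matrix with SVD.
# Assumes build_word2word_matrix() from the code above has been run.
X = np.array(build_word2word_matrix())   # shape (vocab, vocab), sparse and high-dimensional
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = min(word_demension, len(S))           # number of dimensions to keep
word_vectors = U[:, :k] * S[:k]           # dense k-dimensional vectors, one row per word
# L2-normalize so cosine similarity reduces to a dot product
word_vectors /= np.linalg.norm(word_vectors, axis=1, keepdims=True)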
2.2 Language-model-based methods
With language models, the word vectors are a by-product of training a neural network, usually a three-layer architecture: an input layer, a hidden (projection) layer, and an output layer. The best-known example is word2vec, which comes in two variants: CBOW and skip-gram.
Word2vec has two main training optimizations: one based on hierarchical softmax and one based on negative sampling.
The first optimization in word2vec replaces the hidden-to-output neurons with a Huffman tree. The problem it solves is the cost of the softmax at the output layer: a probability has to be computed for every word in the vocabulary before the most likely one can be found, which is very expensive. In hierarchical softmax the leaves of the Huffman tree play the role of the output neurons. After the tree is built, each leaf receives a code: high-frequency words carry large weights and sit close to the root, so they get short codes, while low-frequency words sit far from the root and get long codes. This matches information theory, i.e. the most common words get the shortest codes. At each internal node a sigmoid decides whether to branch left or right, so training reduces to estimating the hierarchical-softmax parameters at these nodes by computing gradients along each word's path. A small Huffman-coding sketch follows below.
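To make the Huffman-tree idea concrete, here is a small self-contained sketch (not taken from any word2vec implementation) that builds a Huffman code from word frequencies; frequent words receive shorter codes, which is exactly the property hierarchical softmax exploits:

import heapq
import itertools

def huffman_codes(freqs):
    """Build Huffman codes from a {word: frequency} dict."""
    counter = itertools.count()  # tie-breaker so heapq never has to compare dicts
    heap = [(f, next(counter), {w: ''}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {w: '0' + c for w, c in left.items()}
        merged.update({w: '1' + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

# Made-up frequencies for illustration
print(huffman_codes({'the': 5000, 'of': 3000, 'learning': 40, 'embedding': 10}))
# 'the' ends up with a 1-bit code, 'embedding' with the longest code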
Negative sampling solves word2vec without the Huffman tree at all, because when the centre word of a sample is a rare word, its path through the Huffman tree is long and training is slow. Take one training sample: a centre word w surrounded by 2c context words, denoted context(w). Since w genuinely co-occurs with context(w), this pair is a positive example. Negative sampling then draws neg centre words wi (i = 1, 2, ..., neg) different from w, and each pair (context(w), wi) becomes a negative example. Using the one positive example and the neg negative examples we perform binary logistic regression, which yields a parameter vector theta_i for every sampled word wi as well as the word vectors themselves.
A brief summary of negative sampling:
Assume again a vocabulary of 10,000 words and 300-dimensional vectors; each layer of the network then has 3 million parameters, and the output layer amounts to a 10,000-way classification problem, which is an enormous amount of computation per step. The idea behind sampling is almost absurdly simple: the softmax output is a vector in which only one entry, the one for the correct word, should be large; every other entry corresponds to a negative sample. If we keep only 5 negative samples, the output that has to be scored is effectively a 6-dimensional vector, and the number of output-layer parameters touched per step drops from 3 million to about 1,800 (6 × 300). It looks lazy, but it works remarkably well in practice and dramatically improves training efficiency.
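The sketch below illustrates the sampling step itself; it is a simplified stand-alone example, not the author's code. In the original word2vec, negatives are drawn from the unigram distribution raised to the 3/4 power, which boosts the chance of picking less frequent words:

import numpy as np

def sample_negatives(word_counts, target_id, num_neg=5, power=0.75):
    """Draw num_neg negative word ids, excluding the target word.

    word_counts: raw unigram counts, indexed by word id.
    The 0.75 exponent is the smoothing used by word2vec.
    """
    probs = np.array(word_counts, dtype=np.float64) ** power
    probs[target_id] = 0.0  # never sample the positive word as a negative
    probs /= probs.sum()
    return np.random.choice(len(word_counts), size=num_neg, replace=False, p=probs)

# Made-up counts for a 10-word vocabulary; word 3 is the positive (centre) word
counts = [500, 300, 120, 80, 60, 40, 30, 20, 10, 5]
print(sample_negatives(counts, target_id=3))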
2.2.1 CBOW (Continuous Bag-of-Words)
CBOW predicts the probability of the current word given a known context, where the context is taken from a fixed window around it. The TensorFlow code below trains CBOW word vectors with negative sampling (via sampled softmax):
# Continuous bag-of-words model: predict the current word from its context
import math
import collections
import numpy as np
import tensorflow as tf

file_path = "D:\workspace\project\\NLPcase\\word2vec\\data\\data.txt"
model_path = "D:\workspace\project\\NLPcase\\word2vec\\model\\skipgram_word2vec.txt"
min_count = 5        # minimum word frequency
batch_size = 200     # samples per training step
embedding_size = 200 # dimension of the word vectors
window_size = 5      # context window size
num_sampled = 100    # number of negative samples
num_steps = 10000    # maximum number of training steps

def load_data(file_path=file_path):
    dataset = []
    for line in open(file_path, encoding='utf-8'):
        line = line.strip().split(',')
        dataset.append([word for word in line[1].split(' ')
                        if 'nbsp' not in word and len(word) >= 1])
    return dataset

dataset = load_data()

# Flatten the corpus into a single word list
def read_data(dataset):
    words = []
    for data in dataset:
        words.extend(data)
    return words

# Build the vocabulary and map every word to an integer id
def build_dataset(words, min_count):
    count = [['unk', -1]]
    reserved_words = [item for item in collections.Counter(words).most_common()
                      if item[1] > min_count]
    count.extend(reserved_words)
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

# Generate training batches: the context words are the input, the centre word is the label
data_index = 0
def generate_batch(batch_size, window_size, data):  # data is the list of word ids
    global data_index
    span = 2 * window_size + 1
    batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)  # index into data, wrapping around
    for i in range(batch_size):
        target = window_size  # the centre word is the label
        col_idx = 0
        for j in range(span):
            if j == span // 2:
                continue  # skip the centre word itself
            batch[i, col_idx] = buffer[j]
            col_idx += 1
        labels[i, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

# Build the graph and train
def train_word2vec(vocabulary_size, batch_size, embedding_size, window_size,
                   num_sampled, num_steps, data):
    graph = tf.Graph()
    with graph.as_default(), tf.device('/cpu:0'):
        train_dataset = tf.placeholder(tf.int32, shape=[batch_size, 2 * window_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        embedding = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        # Unlike skip-gram, the CBOW input is the average of the context word vectors
        context_embedding = []
        for i in range(2 * window_size):  # look up each context column, then average
            context_embedding.append(tf.nn.embedding_lookup(embedding, train_dataset[:, i]))
        ave_embed = tf.reduce_mean(tf.stack(axis=0, values=context_embedding), 0, keep_dims=False)
        softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                                          stddev=1.0 / math.sqrt(embedding_size)))
        softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
        # Sampled-softmax loss (the negative-sampling-style objective)
        loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
            weights=softmax_weights,
            biases=softmax_biases,
            inputs=ave_embed,
            labels=train_labels,
            num_sampled=num_sampled,
            num_classes=vocabulary_size))
        opt = tf.train.AdamOptimizer(1.0).minimize(loss)
        # L2-normalize the embeddings for later similarity queries
        norm = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keep_dims=True))
        normalized_embeddings = embedding / norm
    with tf.Session(graph=graph) as session:
        tf.global_variables_initializer().run()
        average_loss = 0
        for step in range(num_steps):
            batch_data, batch_labels = generate_batch(batch_size, window_size, data)
            feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
            _, l = session.run([opt, loss], feed_dict=feed_dict)
            average_loss += l
            if step % 200 == 0:
                if step > 0:
                    average_loss = average_loss / 200
                print('average loss at step', step, ':', average_loss)
                average_loss = 0
        final_embedding = normalized_embeddings.eval()
    return final_embedding
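For completeness, a hypothetical driver for the functions above might look like the following (it simply reuses the helpers and globals defined in the code; nothing here is from the original post):

# Hypothetical usage of the CBOW helpers defined above
words = read_data(dataset)
data, count, dictionary, reverse_dictionary = build_dataset(words, min_count)
vocabulary_size = len(dictionary)
embeddings = train_word2vec(vocabulary_size, batch_size, embedding_size,
                            window_size, num_sampled, num_steps, data)
print(embeddings.shape)  # (vocabulary_size, embedding_size)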
2.2.2 Skip-gram
The principle is largely the same as CBOW, except that the input is the centre word and the outputs are the vectors of the surrounding words.
TensorFlow code for training skip-gram word vectors with negative sampling (via the NCE loss):
# Train word vectors with skip-gram: predict the context from the current word
import collections
import math
import random
import numpy as np
import tensorflow as tf

file_path = "D:\workspace\project\\NLPcase\\word2vec\\data\\data.txt"
model_path = "D:\workspace\project\\NLPcase\\word2vec\\model\\skipgram_word2vec.txt"
min_count = 5        # minimum word frequency
batch_size = 200     # samples per training step
embedding_size = 200 # dimension of the word vectors
window_size = 5      # context window size
num_sampled = 100    # number of negative samples
num_steps = 10000    # maximum number of training steps

def load_data(file_path=file_path):
    dataset = []
    for line in open(file_path, encoding='utf-8'):
        line = line.strip().split(',')
        dataset.append([word for word in line[1].split(' ')
                        if 'nbsp' not in word and len(word) >= 1])
    return dataset

dataset = load_data()

# Flatten the corpus into a single word list
def read_data(dataset):
    words = []
    for data in dataset:
        words.extend(data)
    return words

# Build the vocabulary: filter out low-frequency words and number the rest by frequency
def build_dataset(words, min_count):
    count = [['UNK', -1]]  # bucket for unknown / filtered-out words
    count.extend([item for item in collections.Counter(words).most_common()
                  if item[1] > min_count])
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)  # assign ids
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))  # id -> word
    return data, dictionary, reverse_dictionary

# Generate training samples: the centre word is the input, a random context word is the label
# (e.g. with window_size=1, each centre word is paired with its left and right neighbour)
data_index = 0
def generate_batch(batch_size, window_size, data):  # data is the list of word ids
    global data_index
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)      # centre words
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)  # one context word per centre word
    span = 2 * window_size + 1  # size of one window
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    # batch_size must be a multiple of 2*window_size so every window is fully used
    for i in range(batch_size // (window_size * 2)):
        target = window_size          # index of the centre word
        target2avoid = [window_size]  # the centre word itself is never a label
        for j in range(window_size * 2):  # pick each context position once, in random order
            while target in target2avoid:
                target = random.randint(0, span - 1)
            target2avoid.append(target)
            batch[i * window_size * 2 + j] = buffer[window_size]
            labels[i * window_size * 2 + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

# Build the network and train
def train_wordvec(vocabulary_size, batch_size, embedding_size, window_size,
                  num_sampled, num_steps, data):
    graph = tf.Graph()
    with graph.as_default():
        # Input data
        train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        # Train on the CPU
        with tf.device('/cpu:0'):
            # Initialize the embedding matrix
            embedding = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
            # Look up the vectors of the input words
            embed = tf.nn.embedding_lookup(embedding, train_inputs)
            # Output-layer (NCE) parameters
            nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                                          stddev=1.0 / math.sqrt(embedding_size)))
            nce_bias = tf.Variable(tf.zeros([vocabulary_size]))
        # NCE loss (negative sampling)
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                             biases=nce_bias,
                                             labels=train_labels,
                                             inputs=embed,
                                             num_sampled=num_sampled,
                                             num_classes=vocabulary_size))
        # Optimizer
        opt = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
        # Norm of each word vector, used for normalization
        norm = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keep_dims=True))
        normalized = embedding / norm
        # Variable initializer
        init = tf.global_variables_initializer()
    # Train on the constructed graph
    with tf.Session(graph=graph) as session:
        init.run()
        average_loss = 0
        for step in range(num_steps):
            batch_inputs, batch_labels = generate_batch(batch_size, window_size, data)
            feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
            # Run one optimization step and collect the loss
            _, loss_val = session.run([opt, loss], feed_dict=feed_dict)
            average_loss += loss_val
            # Report the average loss every 200 steps
            if step % 200 == 0:
                if step > 0:
                    average_loss /= 200
                print('average loss at step', step, ':', average_loss)
                average_loss = 0
        final_embedding = normalized.eval()
    return final_embedding
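As with CBOW, a hypothetical driver tying the pieces together (not in the original) could be written as follows; the plain text format used for model_path is an assumption:

# Hypothetical usage of the skip-gram helpers defined above
words = read_data(dataset)
data, dictionary, reverse_dictionary = build_dataset(words, min_count)
vocabulary_size = len(dictionary)
embeddings = train_wordvec(vocabulary_size, batch_size, embedding_size,
                           window_size, num_sampled, num_steps, data)
# Persist the vectors in a simple "word dim1 dim2 ..." text format at model_path (assumed format)
with open(model_path, 'w', encoding='utf-8') as f:
    for idx, word in reverse_dictionary.items():
        f.write(word + ' ' + ' '.join(str(x) for x in embeddings[idx]) + '\n')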