This project needs the third-party TensorFlow library, specifically a 1.x version.

My previous article gave a brief treatment of the basic theory behind LSTM; here we deepen that understanding through a hands-on project. Because the project works on text, it also draws on some natural language processing (NLP) theory. The focus is on practice, and the relevant NLP concepts are only briefly summarized.
1: Collect Tang poems into a local file, which can be done with a crawler.
2: Preprocess the data so that each poem is stored on its own line, and build a stop-word list to remove unneeded content.
Each sentence in the text is tokenized; available tools include jieba and pkuseg, with jieba being the most commonly used. Tokenization is needed because a computer cannot compute on raw text: the text has to be converted into a form the machine can recognize (numbers). Each word is therefore mapped to a number that serves as its unique ID within the corpus, and those IDs are then turned into embedding vectors that can be computed on.
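
As a minimal sketch of this tokenize-then-map pipeline (it assumes jieba is installed; `corpus`, `word_int_map` and the sample lines are purely illustrative):

```python
# Minimal sketch: tokenize text with jieba and map tokens to integer IDs.
# Assumes `pip install jieba`; the variable names are only for demonstration.
import collections

import jieba

corpus = ["床前明月光", "疑是地上霜"]

# Tokenize every line into words.
tokenized = [list(jieba.cut(line)) for line in corpus]

# Count token frequencies and assign IDs, most frequent token -> smallest ID.
counter = collections.Counter(tok for line in tokenized for tok in line)
words = [w for w, _ in sorted(counter.items(), key=lambda x: x[1], reverse=True)]
word_int_map = {w: i for i, w in enumerate(words)}

# Convert each tokenized line into a list of IDs.
ids = [[word_int_map[tok] for tok in line] for line in tokenized]
print(tokenized, ids)
```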
One-hot encoding is a common representation for discrete data: one slot per category, as in Table 1 below (a small code sketch follows the table). It lends itself to Euclidean distance and similarity computations and is widely used for classification, regression and clustering in machine learning. It has drawbacks, though: after deduplication every distinct word gets its own vector, so for categorical variables with many distinct values the resulting vectors become extremely high-dimensional and sparse, and since the mappings are completely independent of one another they cannot express any relationship between categories.
| 我 | 在 | 上 | 海 |
| --- | --- | --- | --- |
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |

Table 1
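
A minimal one-hot sketch for the four characters in Table 1, using plain NumPy (the names are illustrative):

```python
# One-hot encode the characters "我", "在", "上", "海" as in Table 1.
import numpy as np

chars = ["我", "在", "上", "海"]
char_to_id = {c: i for i, c in enumerate(chars)}

# An identity matrix gives one row per character: a 1 in its own column, 0 elsewhere.
one_hot = np.eye(len(chars), dtype=np.int32)
print(one_hot[char_to_id["上"]])  # -> [0 0 1 0]
```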
Read the local file, drop anything that appears in the stop-word list, and limit the length of each poem. Each poem gets a start and an end marker, is split into tokens, the tokens are mapped to IDs, and token frequencies are counted and sorted.
```python
import collections

# Markers wrapped around every poem: 'G' marks the start, 'E' the end.
start_token = 'G'
end_token = 'E'


def process_poems(file_name):
    poems = []
    with open(file_name, 'r', encoding='utf-8') as f:
        for line in f.readlines():
            try:
                title, content = line.strip().split(':')
                content = str(content).replace(' ', '')

                # Skip poems that contain special characters or the markers themselves.
                if '_' in content or '(' in content or '(' in content or '《' in content or '[' in content or \
                        start_token in content or end_token in content:
                    continue
                # Skip poems that are too short or too long.
                if len(content) < 5 or len(content) > 79:
                    continue
                content = start_token + content + end_token
                poems.append(content)
            except ValueError:
                pass
    # Sort the poems by length.
    poems = sorted(poems, key=lambda l: len(l))

    # Collect every single character of every poem.
    all_word = []
    for poem in poems:
        all_word += [word for word in poem]
    counter = collections.Counter(all_word)

    # Sort characters by frequency, from high to low.
    count_pairs = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    words, _ = zip(*count_pairs)

    # Keep the vocabulary (it could be truncated to only the most frequent characters)
    # and append a blank character that is later used for padding.
    words = words[:len(words)] + (' ',)

    # Map every character to an integer ID; unknown characters fall back to len(words).
    word_int_map = dict(zip(words, range(len(words))))
    poems_vector = [list(map(lambda word: word_int_map.get(word, len(words)), poem)) for poem in poems]

    return poems_vector, word_int_map, words
```
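
A hypothetical quick check of the preprocessing output, assuming the corpus file is called `poems.txt` as in the training script further down:

```python
# Quick check of the preprocessing step; 'poems.txt' is the corpus file used below.
poems_vector, word_int_map, vocabularies = process_poems('poems.txt')
print(len(poems_vector))       # number of poems kept
print(poems_vector[0][:10])    # the first poem as a list of character IDs
print(vocabularies[:10])       # the ten most frequent characters
```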

Each call to the batch generator feeds in 64 poems at a time. Because the poems have different lengths, padding is used to fill them out to a common length, and finally the data is split into features and labels for the model to learn from (a quick shape check follows the function).
```python
import numpy as np


def generate_batch(batch_size, poems_vec, word_to_int):
    # Number of full batches in the corpus; for the author's corpus this is about 540
    # chunks, each made up of the sentences of 64 poems.
    n_chunk = len(poems_vec) // batch_size
    x_batches = []
    y_batches = []
    for i in range(n_chunk):
        # Start and end index of this batch, e.g. [0:64], [64:128], ...
        start_index = i * batch_size
        end_index = start_index + batch_size

        # Find the longest poem within this batch of 64.
        batches = poems_vec[start_index:end_index]
        length = max(map(len, batches))

        # Pre-fill a (batch_size x length) array with the ID of the blank padding
        # character; shorter poems are padded with it.
        x_data = np.full((batch_size, length), word_to_int[' '], np.int32)
        for row in range(batch_size):
            # Copy each poem into its row; the remaining positions keep the padding ID.
            x_data[row, :len(batches[row])] = batches[row]
        y_data = np.copy(x_data)

        # The label is the input shifted one position to the left, so the model learns
        # to predict the next character:
        # x: [[ 1  2  1 12]      y: [[ 2  1 12 12]
        #     [ 1  2  1 12]]         [ 2  1 12 12]]
        y_data[:, :-1] = x_data[:, 1:]

        x_batches.append(x_data)
        y_batches.append(y_data)
    return x_batches, y_batches
```
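
A hypothetical sanity check of the batch shapes, again assuming `poems.txt`:

```python
# Shape check for the batching step.
poems_vector, word_int_map, vocabularies = process_poems('poems.txt')
x_batches, y_batches = generate_batch(64, poems_vector, word_int_map)

print(len(x_batches))        # number of batches
print(x_batches[0].shape)    # (64, length of the longest poem in the first batch)
print(y_batches[0].shape)    # same shape as x, shifted one character to the left
```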

We use an LSTM model. The model code is kept relatively simple and easy to follow, with almost every step annotated. First an LSTM cell object is created. An LSTM is still a neural network, so it has the usual structure (input layer, hidden layer, output layer); what it adds are some gating variables that control the flow of information (e.g. the input gate and forget gate). If, for example, one input maps to a hidden layer of 5 cells, each cell is independent and produces its own result, so for a 3-dimensional input the input-to-hidden weight matrix is 3×5.

Hidden layer: usually a linear transformation followed by a non-linear activation function. The hidden layer's job is to transform the input into something the output layer can use, while the output layer turns the hidden-layer activations into whatever kind of output we want.
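
As a toy NumPy-only illustration of that point (all names here are made up for the example): with a 3-dimensional input and 5 hidden cells, the input-to-hidden weight matrix is 3×5, and the hidden layer is a linear transform followed by a non-linearity:

```python
# Toy hidden layer: linear transformation plus non-linear activation.
import numpy as np

x = np.random.randn(1, 3)   # one input vector with 3 features
W = np.random.randn(3, 5)   # input-to-hidden weights: 3 x 5
b = np.zeros(5)             # hidden biases

h = np.tanh(x @ W + b)      # linear transform + tanh activation
print(h.shape)              # (1, 5): one activation per hidden cell
```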
```python
import tensorflow as tf


def rnn_model(model, input_data, output_data, vocab_size, rnn_size=128, num_layers=2, batch_size=64,
              learning_rate=0.01):
    end_points = {}
    # Pick the cell type according to the `model` argument.
    if model == 'rnn':
        cell_fun = tf.contrib.rnn.BasicRNNCell
    elif model == 'gru':
        cell_fun = tf.contrib.rnn.GRUCell
    elif model == 'lstm':
        cell_fun = tf.contrib.rnn.BasicLSTMCell

    # Stack num_layers hidden layers, each with rnn_size (128) cells.
    # Each layer gets its own cell instance; sharing one instance via [cell] * num_layers
    # raises an error on newer TF 1.x releases.
    cell = tf.contrib.rnn.MultiRNNCell(
        [cell_fun(rnn_size, state_is_tuple=True) for _ in range(num_layers)], state_is_tuple=True)

    # Initial state: a full batch when training, a single sequence when generating.
    if output_data is not None:
        initial_state = cell.zero_state(batch_size, tf.float32)
    else:
        initial_state = cell.zero_state(1, tf.float32)

    # Turn character IDs into embedding vectors the network can compute on.
    with tf.device('/cpu:0'):
        embedding = tf.get_variable('embedding',
                                    initializer=tf.random_uniform([vocab_size + 1, rnn_size], -1.0, 1.0))
        inputs = tf.nn.embedding_lookup(embedding, input_data)

    # Unroll the chosen RNN cell over the input sequence with tf.nn.dynamic_rnn;
    # outputs holds the hidden output at every step, last_state the final state of every layer.
    outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state)
    output = tf.reshape(outputs, [-1, rnn_size])

    # Project the hidden outputs onto the vocabulary to get the logits.
    weights = tf.Variable(tf.truncated_normal([rnn_size, vocab_size + 1]))
    bias = tf.Variable(tf.zeros(shape=[vocab_size + 1]))
    logits = tf.nn.bias_add(tf.matmul(output, weights), bias=bias)

    if output_data is not None:
        # Characters cannot be used in the loss directly, so the true outputs are one-hot encoded.
        labels = tf.one_hot(tf.reshape(output_data, [-1]), depth=vocab_size + 1)
        # Cross-entropy loss between the predictions and the true next characters.
        loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
        total_loss = tf.reduce_mean(loss)
        # Optimize with Adam.
        train_op = tf.train.AdamOptimizer(learning_rate).minimize(total_loss)

        # Tensors the training loop needs.
        end_points['initial_state'] = initial_state
        end_points['output'] = output
        end_points['train_op'] = train_op
        end_points['total_loss'] = total_loss
        end_points['loss'] = loss
        end_points['last_state'] = last_state
    else:
        # At generation time, turn the logits into a probability distribution.
        prediction = tf.nn.softmax(logits)

        end_points['initial_state'] = initial_state
        end_points['last_state'] = last_state
        end_points['prediction'] = prediction

    return end_points
```
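
Note the design choice here: when `output_data` is `None` the graph is built for generation, so the initial state uses a batch size of 1 and the function exposes a softmax `prediction` over the vocabulary instead of the loss and `train_op`.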

Everything is driven from the main() function: is_train=True trains the model, is_train=False tests it (generates a poem).
```python
import os

import numpy as np
import tensorflow as tf

# process_poems, generate_batch and rnn_model are the functions defined above,
# collected in the module chai_epoch.
from chai_epoch import process_poems, generate_batch, rnn_model

start_token = 'G'
end_token = 'E'

tf.compat.v1.disable_eager_execution()


def run_training():
    # poems_vector: poems as ID sequences, word_to_int: char -> ID map, vocabularies: characters.
    poems_vector, word_to_int, vocabularies = process_poems('poems.txt')

    batches_inputs, batches_outputs = generate_batch(64, poems_vector, word_to_int)

    # Placeholders for one batch of 64 poems of variable length.
    input_data = tf.compat.v1.placeholder(tf.int32, [64, None])
    output_targets = tf.compat.v1.placeholder(tf.int32, [64, None])
    end_points = rnn_model(model='lstm', input_data=input_data, output_data=output_targets,
                           vocab_size=len(vocabularies), rnn_size=128, num_layers=2,
                           batch_size=64, learning_rate=0.01)
    # Saver object used to checkpoint the model.
    saver = tf.train.Saver(tf.global_variables())
    init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())

    # Create the session.
    with tf.Session() as sess:
        sess.run(init_op)
        start_epoch = 0
        checkpoint = tf.train.latest_checkpoint('./checkpoints/poems/')
        if checkpoint:
            saver.restore(sess, checkpoint)
            print("[INFO] restore from the checkpoint {0}".format(checkpoint))
            start_epoch += int(checkpoint.split('-')[-1])
        print('[INFO] start training...')
        try:
            for epoch in range(start_epoch, 50):
                n = 0
                n_chunk = len(poems_vector) // 64
                for batch in range(n_chunk):
                    loss, _, _ = sess.run([
                        end_points['total_loss'],
                        end_points['last_state'],
                        end_points['train_op']
                    ], feed_dict={input_data: batches_inputs[n], output_targets: batches_outputs[n]})
                    n += 1
                    print('epoch: %d , batch: %d , training loss: %.6f' % (epoch, batch, loss))

                if epoch % 6 == 0:
                    saver.save(sess, './model/', global_step=epoch)
        except KeyboardInterrupt:
            print('[INFO] interrupted, saving checkpoint...')
            saver.save(sess, os.path.join('./checkpoints/poems/', 'poems'), global_step=epoch)
            print('[INFO] saved at epoch {}.'.format(epoch))


def to_word(predict, vocabs):
    # Sample a character index from the predicted probability distribution.
    t = np.cumsum(predict)
    s = np.sum(predict)
    sample = int(np.searchsorted(t, np.random.rand(1) * s))
    if sample >= len(vocabs):
        sample = len(vocabs) - 1
    return vocabs[sample]


def pretty_print_poem(poem):
    # Print one sentence per line, split on the Chinese full stop.
    poem_sentences = poem.split('。')
    for s in poem_sentences:
        if s != '' and len(s) > 10:
            print(s + '。')


def gen_poem(begin_word):
    batch_size = 1
    # Preprocess the corpus to recover the vocabulary and ID map.
    poems_vector, word_int_map, vocabularies = process_poems('poems.txt')
    input_data = tf.compat.v1.placeholder(tf.int32, [batch_size, None])

    # Build the model in generation mode (output_data=None).
    end_points = rnn_model(model='lstm', input_data=input_data, output_data=None,
                           vocab_size=len(vocabularies), rnn_size=128, num_layers=2,
                           batch_size=64, learning_rate=0.01)
    saver = tf.train.Saver(tf.global_variables())
    init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
    with tf.Session() as sess:
        sess.run(init_op)
        # Restore the trained weights.
        saver.restore(sess, './model/-24')
        # Feed the start token to get the first prediction and the initial state.
        x = np.array([list(map(word_int_map.get, start_token))])
        [predict, last_state] = sess.run([end_points['prediction'], end_points['last_state']],
                                         feed_dict={input_data: x})
        if begin_word:
            word = begin_word
        else:
            word = to_word(predict, vocabularies)
        poem = ''

        # Keep generating characters until the end token is produced.
        while word != end_token:
            poem += word
            x = np.zeros((1, 1), dtype=np.int32)
            x[0, 0] = word_int_map[word]
            [predict, last_state] = sess.run([end_points['prediction'], end_points['last_state']],
                                             feed_dict={input_data: x,
                                                        end_points['initial_state']: last_state})
            word = to_word(predict, vocabularies)
        return poem


def main(is_train):
    if is_train:
        print('[INFO] training on Tang poems...')
        run_training()
    else:
        print('[INFO] generating a poem...')
        begin_word = input('Enter the first character: ')
        poem2 = gen_poem(begin_word)
        pretty_print_poem(poem2)


if __name__ == '__main__':
    is_train = True
    main(is_train)
```
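
To generate poems after training, set `is_train = False`; the restore path in `gen_poem` ('./model/-24') assumes a checkpoint saved at epoch 24, so adjust it to whichever checkpoint was actually written.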

Since this post focuses on showing the code, some of the related NLP topics were not written up in detail; I plan to supplement the NLP material gradually in later posts and will revise and expand this one as well. Below is a fairly detailed write-up on word2vec that I came across while studying and found useful as a reference:

word2vec: https://blog.csdn.net/qq_43477218/article/details/113097380