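This post is a small demo of generating classical Chinese poems with an RNN, written mainly to complete a course assignment. Because training is slow, the number of epochs is set to only 3; a larger value would probably give better results. Feel free to build on it. I was not able to upload the dataset, so send me a private message if you need it.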
LSTM:
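As shown in the figure above, an LSTM cell has two state vectors, h(t) and c(t); h(t) can be thought of as the short-term state and c(t) as the long-term state. First, the current input vector x(t) and the previous short-term state h(t-1) are fed into four different fully connected (FC) layers, each with its own purpose:
The main layer is the one that outputs g(t): its usual role is to analyze the current input x(t) and the previous (short-term) state h(t-1) and produce the information for the current time step.
The forget gate (controlled by f(t)) decides which parts of the long-term state should be erased.
The input gate (controlled by i(t)) decides which parts of g(t) should be added to the long-term state.
The output gate (controlled by o(t)) decides which parts of the long-term state should be read at this time step and output to h(t) and y(t).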
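The four layers described above can be written out as equations. This is the usual textbook formulation; the weight matrices W and bias vectors b are notation introduced here for illustration, and σ denotes the sigmoid function.

```latex
\begin{aligned}
g_{(t)} &= \tanh\left(W_{xg}\, x_{(t)} + W_{hg}\, h_{(t-1)} + b_g\right) \\
f_{(t)} &= \sigma\left(W_{xf}\, x_{(t)} + W_{hf}\, h_{(t-1)} + b_f\right) \\
i_{(t)} &= \sigma\left(W_{xi}\, x_{(t)} + W_{hi}\, h_{(t-1)} + b_i\right) \\
o_{(t)} &= \sigma\left(W_{xo}\, x_{(t)} + W_{ho}\, h_{(t-1)} + b_o\right)
\end{aligned}
```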
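As Figure 1 shows, the four layers of an LSTM cell use three sigmoid activations (for the gates) and one tanh activation (for the main layer).
Tanh helps regulate the values flowing through the network, keeping them between -1 and 1.
The sigmoid activation is similar to tanh, except that it squashes values into the range 0 to 1. This is useful for updating or forgetting information, because the output can be read as a proportion: anything multiplied by 0 is 0, so that information is discarded, and anything multiplied by 1 is itself, so that information is kept intact. Memory capacity is limited, so the cell keeps what matters and forgets what does not.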
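To make the two ranges concrete, here is a small sketch using only the standard library. The input values and the `stored` number are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


values = [-5.0, -1.0, 0.0, 1.0, 5.0]
tanh_out = [math.tanh(v) for v in values]    # always inside (-1, 1)
sigmoid_out = [sigmoid(v) for v in values]   # always inside (0, 1)

for v, t, s in zip(values, tanh_out, sigmoid_out):
    print(f"z={v:+.1f}  tanh={t:+.4f}  sigmoid={s:.4f}")

# Using a sigmoid output as a gate: it scales a stored value.
stored = 0.8
almost_closed = sigmoid(-5.0) * stored   # gate near 0: the value is mostly forgotten
almost_open = sigmoid(5.0) * stored      # gate near 1: the value is mostly kept
print(almost_closed, almost_open)
```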
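Example: take the input gate. The input x(t) and the previous (short-term) state h(t-1) first yield the information vector for this time step, g(t) = (g1(t), g2(t), g3(t), ..., gn(t)), where n is the number of neurons and each gk(t) lies in (-1, 1). This vector is multiplied element by element with i(t) = (i1(t), i2(t), i3(t), ..., in(t)), where each ik(t) lies in (0, 1), giving (g1(t)*i1(t), g2(t)*i2(t), ..., gn(t)*in(t)), which is the useful information from this time step. That result is then added to the long-term memory c(t-1) (after the forget gate has been applied) so that it is stored.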
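The same arithmetic with made-up numbers for n = 3 neurons. The values below are hypothetical, not taken from a trained model, and the forget gate is left out (treated as fully open) to keep the focus on the input gate.

```python
# Hypothetical values for n = 3 neurons.
g = [0.9, -0.5, 0.2]        # candidate information g(t), each entry in (-1, 1)
i = [0.9, 0.1, 0.5]         # input gate i(t), each entry in (0, 1)
c_prev = [0.3, 0.4, -0.6]   # long-term state c(t-1)

# Element-wise product: how much of each candidate value is let through.
selected = [gk * ik for gk, ik in zip(g, i)]

# The selected information is added to the long-term state.
c_new = [ck + sk for ck, sk in zip(c_prev, selected)]

print("selected:", selected)
print("c_new:   ", c_new)
```

The first neuron has a gate close to 1, so most of its candidate value reaches the long-term state; the second has a gate close to 0, so its candidate value is almost entirely dropped.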
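The key idea of the LSTM is that the network can learn what to store in the long-term state, what to throw away, and what to read from it. As the long-term state c(t-1) passes through the network from left to right, it first goes through a forget gate, which drops some memories, and then an addition operation adds some new memories (those selected by the input gate). The result c(t) is sent straight out without any further transformation. So at every time step some memories are dropped and some are added. In addition, after the addition, the long-term state is copied and passed through a tanh function, and the result is filtered by the output gate. This produces the short-term state h(t), which is equal to the cell's output y(t) for this time step.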
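The data flow just described can be sketched as one LSTM time step in plain Python. This is an illustrative sketch, not how `nn.LSTM` is implemented internally: the `params` layout, the helper names `affine`, `lstm_step` and `random_params`, and the toy sizes are all assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, W_h, b, x, h):
    """Fully connected layer W_x @ x + W_h @ h + b, written with plain lists."""
    return [
        sum(w * v for w, v in zip(W_x[k], x))
        + sum(w * v for w, v in zip(W_h[k], h))
        + b[k]
        for k in range(len(b))
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One time step: returns the new short-term state h and long-term state c."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(z) for z in affine(*params["g"], x, h_prev)]  # main layer
    # Drop part of the old memory, then add the selected new information.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Copy the long-term state through tanh and filter it with the output gate.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def random_params(input_dim, hidden_dim, rng):
    """Hypothetical initialization: small random weights, zero biases."""
    def layer():
        W_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(hidden_dim)]
        W_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
        return W_x, W_h, [0.0] * hidden_dim
    return {name: layer() for name in ("f", "i", "o", "g")}


rng = random.Random(0)
params = random_params(input_dim=3, hidden_dim=4, rng=rng)
h, c = [0.0] * 4, [0.0] * 4
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(x, h, c, params)
print("h:", h)
print("c:", c)
```

Note that c is updated only by an element-wise multiplication and an addition, which is what lets information persist across many time steps.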
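How it works:
This post uses an LSTM to generate classical poems, so how does an RNN generate text? The idea is simple: embed each character as the input and use the character that follows it as the label, then feed these pairs to the RNN and train until it fits. After training, give the model a seed character and let it predict the next character, append that prediction to the input and predict again, and so on. Repeating this loop is how the RNN generates text.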
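The two halves of that idea, building (input, label) pairs and feeding predictions back in, can be sketched without any neural network. In the sketch below a table of next-character counts stands in for the trained model. That stand-in, the helper names and the toy strings are all hypothetical. Unlike the LSTM, the stand-in looks only at the last character rather than the whole prefix.

```python
from collections import Counter, defaultdict

START, END = "G", "E"  # same start/end markers as in main.py


def make_pairs(poem):
    """Pair every character with the one that follows it (input, label)."""
    text = START + poem + END
    return list(zip(text[:-1], text[1:]))


def fit_counts(poems):
    """Hypothetical stand-in for training: count which character follows which."""
    table = defaultdict(Counter)
    for poem in poems:
        for cur, nxt in make_pairs(poem):
            table[cur][nxt] += 1
    return table


def predict_next(table, text):
    """Most frequent follower of the last character, or END if it was never seen."""
    followers = table.get(text[-1])
    return followers.most_common(1)[0][0] if followers else END


def generate(table, seed, max_len=30):
    """Start from a seed and keep appending the model's own prediction."""
    text = seed
    while len(text) < max_len:
        nxt = predict_next(table, text)
        if nxt == END:
            break
        text += nxt
    return text


table = fit_counts(["abcd", "xbcd"])  # toy "poems"
print(make_pairs("abc"))
print(generate(table, "a"))
```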
main.py
- import numpy as np
- import collections
- import torch
- from torch.autograd import Variable
- import torch.optim as optim
-
- import rnn
-
- start_token = 'G'
- end_token = 'E'
- batch_size = 64
-
- def process_poems1(file_name):
-     """
-     Load poems stored one per line as "title:content".
-
-     :param file_name: path to the poem file
-     :return: (poems_vector, word_int_map, words). poems_vector has two dimensions:
-         the first is the poem, the second is the word index,
-         e.g. [[1,2,3,4,5,6,7,8,9,10],[9,6,3,8,5,2,7,4,1]]
-     """
-     poems = []
-     with open(file_name, "r", encoding='utf-8') as f:
-         for line_no, line in enumerate(f, start=1):
-             try:
-                 title, content = line.strip().split(':')
-             except ValueError:
-                 # The line does not contain exactly one ':' separator: report it and skip it.
-                 print(line)
-                 print(line_no)
-                 print("error")
-                 continue
-             content = content.replace(' ', '')
-             # Skip poems that contain annotation marks (half- or full-width parentheses
-             # are both checked) or characters that clash with the start/end tokens.
-             if any(ch in content for ch in ('_', '(', '(', '《', '[', start_token, end_token)):
-                 continue
-             if len(content) < 5 or len(content) > 80:
-                 continue
-             poems.append(start_token + content + end_token)
-     # Sort the poems by length.
-     poems = sorted(poems, key=len)
-     # Count how often each character occurs.
-     all_words = []
-     for poem in poems:
-         all_words += list(poem)
-     counter = collections.Counter(all_words)
-     # Sort the (character, count) pairs by descending count.
-     count_pairs = sorted(counter.items(), key=lambda x: -x[1])
-     words, _ = zip(*count_pairs)  # "unzip" into a tuple of characters and a tuple of counts
-     words = words + (' ',)  # append a space as the last vocabulary entry (one-element tuple)
-     word_int_map = dict(zip(words, range(len(words))))
-     poems_vector = [list(map(word_int_map.get, poem)) for poem in poems]
-     return poems_vector, word_int_map, words  # index vectors, character-to-index map, vocabulary
-
- def process_poems2(file_name):
-     """
-     Load poems stored one per line (no "title:content" split).
-
-     :param file_name: path to the poem file
-     :return: (poems_vector, word_int_map, words). poems_vector has two dimensions:
-         the first is the poem, the second is the word index,
-         e.g. [[1,2,3,4,5,6,7,8,9,10],[9,6,3,8,5,2,7,4,1]]
-     """
-     poems = []
-     with open(file_name, "r", encoding='utf-8') as f:
-         for line in f:
-             line = line.strip()
-             if not line:
-                 continue
-             # Remove spaces, commas (half- and full-width) and full stops.
-             content = line.replace(' ', '').replace(',', '').replace(',', '').replace('。', '')
-             # Skip poems that contain annotation marks (half- or full-width parentheses
-             # are both checked) or characters that clash with the start/end tokens.
-             if any(ch in content for ch in ('_', '(', '(', '《', '[', start_token, end_token)):
-                 continue
-             if len(content) < 5 or len(content) > 80:
-                 continue
-             poems.append(start_token + content + end_token)
-     # Sort the poems by length.
-     poems = sorted(poems, key=len)
-     # Count how often each character occurs.
-     all_words = []
-     for poem in poems:
-         all_words += list(poem)
-     counter = collections.Counter(all_words)
-     count_pairs = sorted(counter.items(), key=lambda x: -x[1])  # descending count
-     words, _ = zip(*count_pairs)
-     words = words + (' ',)
-     word_int_map = dict(zip(words, range(len(words))))
-     poems_vector = [list(map(word_int_map.get, poem)) for poem in poems]
-     return poems_vector, word_int_map, words
-
- def generate_batch(batch_size, poems_vec, word_to_int):
-     """Split the poem vectors into batches of inputs and next-character labels."""
-     n_chunk = len(poems_vec) // batch_size  # e.g. 34813 // 100 = 348 batches
-     x_batches = []
-     y_batches = []
-     for i in range(n_chunk):
-         start_index = i * batch_size
-         end_index = start_index + batch_size
-         x_data = poems_vec[start_index:end_index]
-         y_data = []
-         for row in x_data:
-             # The label is the input shifted left by one position; the last
-             # element is repeated so that both sequences have the same length.
-             y = row[1:]
-             y.append(row[-1])
-             y_data.append(y)
-         # x_data        y_data
-         # [6,2,4,6,9]   [2,4,6,9,9]
-         # [1,4,2,8,5]   [4,2,8,5,5]
-         x_batches.append(x_data)
-         y_batches.append(y_data)
-     return x_batches, y_batches
-
-
- def run_training():
-     # Load and index the dataset.
-     # poems_vector, word_to_int, vocabularies = process_poems2('./tangshi.txt')
-     poems_vector, word_to_int, vocabularies = process_poems1('./poems.txt')
-     print("finish loading data")
-     BATCH_SIZE = 100
-
-     torch.manual_seed(5)
-     word_embedding = rnn.word_embedding(vocab_length=len(word_to_int) + 1, embedding_dim=100)
-
-     rnn_model = rnn.RNN_model(batch_sz=BATCH_SIZE, vocab_len=len(word_to_int) + 1,
-                               word_embedding=word_embedding, embedding_dim=100, lstm_hidden_dim=128)
-     # optimizer = optim.Adam(rnn_model.parameters(), lr=0.001)
-     optimizer = optim.RMSprop(rnn_model.parameters(), lr=0.01)
-
-     loss_fun = torch.nn.NLLLoss()
-     # If you have already trained a model, you can load it with the next line.
-     # rnn_model.load_state_dict(torch.load('./poem_generator_rnn'))
-
-     for epoch in range(3):
-         # Lists of batches; each batch holds BATCH_SIZE poems.
-         batches_inputs, batches_outputs = generate_batch(BATCH_SIZE, poems_vector, word_to_int)
-         n_chunk = len(batches_inputs)
-         for batch in range(n_chunk):
-             batch_x = batches_inputs[batch]
-             batch_y = batches_outputs[batch]  # (batch, time_step)
-
-             loss = 0
-             for index in range(BATCH_SIZE):
-                 x = np.array(batch_x[index], dtype=np.int64)
-                 y = np.array(batch_y[index], dtype=np.int64)
-
-                 # Each poem is fed through the model on its own.
-                 x = torch.from_numpy(np.expand_dims(x, axis=1))  # (seq_len, 1)
-                 y = torch.from_numpy(y)  # (seq_len,)
-                 pre = rnn_model(x)  # (seq_len, vocab_len) log-probabilities
-                 loss += loss_fun(pre, y)
-                 if index == 0:
-                     # Print the prediction and the label for the first poem of the batch
-                     # (useful for the homework screenshot).
-                     _, pre = torch.max(pre, dim=1)
-                     print('prediction', pre.tolist())
-                     print('b_y       ', y.tolist())
-                     print('*' * 30)
-             loss = loss / BATCH_SIZE
-             print("epoch ", epoch, 'batch number', batch, "loss is: ", loss.item())
-             optimizer.zero_grad()
-             loss.backward()
-             # Gradient clipping: rescale the gradients when their overall norm exceeds 1,
-             # which guards against exploding gradients.
-             torch.nn.utils.clip_grad_norm_(rnn_model.parameters(), 1)
-             optimizer.step()
-
-             if batch % 20 == 0:
-                 torch.save(rnn_model.state_dict(), './poem_generator_rnn')
-                 print("finish save model")
-
-
-
- def to_word(predict, vocabs):
-     """Convert a prediction (scores over the vocabulary) into a character."""
-     sample = np.argmax(predict)
-
-     if sample >= len(vocabs):
-         sample = len(vocabs) - 1
-     return vocabs[sample]
-
-
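- def pretty_print_poem(poem):
-     """Print the poem one sentence per line, without the start and end tokens."""
-     shige = []
-     for w in poem:
-         if w == start_token:
-             continue
-         if w == end_token:
-             break
-         shige.append(w)
-     # Split the cleaned text (not the raw poem) so the end token is not printed.
-     poem_sentences = ''.join(shige).split('。')
-     for s in poem_sentences:
-         if s != '' and len(s) > 2:
-             print(s + '。')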
-
-
- def gen_poem(begin_word):
-     # poems_vector, word_int_map, vocabularies = process_poems2('./tangshi.txt')  # the other dataset
-     poems_vector, word_int_map, vocabularies = process_poems1('./poems.txt')
-     word_embedding = rnn.word_embedding(vocab_length=len(word_int_map) + 1, embedding_dim=100)
-     rnn_model = rnn.RNN_model(batch_sz=64, vocab_len=len(word_int_map) + 1, word_embedding=word_embedding,
-                               embedding_dim=100, lstm_hidden_dim=128)
-
-     rnn_model.load_state_dict(torch.load('./poem_generator_rnn'))
-
-     # Start from the given character and keep appending the predicted next character.
-     poem = begin_word
-     word = begin_word
-     while word != end_token:
-         input_ids = np.array([word_int_map[w] for w in poem], dtype=np.int64)
-         input_ids = torch.from_numpy(input_ids)
-         output = rnn_model(input_ids, is_test=True)
-         word = to_word(output.detach().tolist(), vocabularies)
-         poem += word
-         if len(poem) > 30:
-             break
-     return poem
-
-
-
- # run_training()  # Uncomment to train; keep it commented out otherwise, because training takes a long time.
-
-
- pretty_print_poem(gen_poem("日"))
- pretty_print_poem(gen_poem("红"))
- pretty_print_poem(gen_poem("山"))
- pretty_print_poem(gen_poem("夜"))
- pretty_print_poem(gen_poem("湖"))
- pretty_print_poem(gen_poem("湖"))
- pretty_print_poem(gen_poem("湖"))
- pretty_print_poem(gen_poem("君"))
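Because `to_word` always takes the argmax, `gen_poem` should return the same poem every time it is called with the same starting character and the same saved weights. One common alternative, which is not part of the original code, is to sample the next character from the predicted distribution. The sketch below assumes the model output is a list of log-probabilities (as `LogSoftmax` produces). The function name `sample_index` and the `temperature` parameter are hypothetical.

```python
import math
import random


def sample_index(log_probs, temperature=1.0, rng=random):
    """Draw an index from log-probabilities instead of always taking the largest."""
    scaled = [lp / temperature for lp in log_probs]
    top = max(scaled)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]
rng = random.Random(0)
draws = [sample_index(log_probs, rng=rng) for _ in range(1000)]
counts = [draws.count(k) for k in range(3)]
print(counts)  # roughly proportional to 0.7 / 0.2 / 0.1
```

A temperature below 1 sharpens the distribution towards the argmax, while a temperature above 1 flattens it and gives more varied output.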
rnn.py
- import torch.nn as nn
- import torch
- from torch.autograd import Variable
- import torch.nn.functional as F
-
- import numpy as np
-
- def weights_init(m):
-     classname = m.__class__.__name__  # obtain the class name
-     if classname.find('Linear') != -1:
-         # Xavier (Glorot) uniform initialization for linear layers.
-         weight_shape = list(m.weight.data.size())  # (out_features, in_features), i.e. vocab_len x 128
-         fan_in = weight_shape[1]
-         fan_out = weight_shape[0]
-         w_bound = np.sqrt(6. / (fan_in + fan_out))
-         m.weight.data.uniform_(-w_bound, w_bound)
-         m.bias.data.fill_(0)
-         print("initial linear weight")
-
-
- class word_embedding(nn.Module):
-     def __init__(self, vocab_length, embedding_dim):
-         super(word_embedding, self).__init__()
-         # Initialize the embedding table with values drawn uniformly from [-1, 1).
-         w_embeding_random_intial = np.random.uniform(-1, 1, size=(vocab_length, embedding_dim))
-         self.word_embedding = nn.Embedding(vocab_length, embedding_dim)  # create the embedding layer
-         self.word_embedding.weight.data.copy_(torch.from_numpy(w_embeding_random_intial))
-
-     def forward(self, input_sentence):
-         """
-         :param input_sentence: a tensor containing several word indices.
-         :return: a tensor containing the corresponding word embeddings.
-         """
-         sen_embed = self.word_embedding(input_sentence)
-         return sen_embed
-
-
- class RNN_model(nn.Module):
-     def __init__(self, batch_sz, vocab_len, word_embedding, embedding_dim, lstm_hidden_dim):
-         super(RNN_model, self).__init__()
-
-         self.word_embedding_lookup = word_embedding
-         self.batch_size = batch_sz
-         self.vocab_length = vocab_len
-         self.word_embedding_dim = embedding_dim
-         self.lstm_dim = lstm_hidden_dim
-         # Two-layer LSTM: the input size is embedding_dim, the output size is lstm_hidden_dim,
-         # and input and output tensors are laid out as (batch, seq, feature).
-         self.rnn_lstm = nn.LSTM(input_size=embedding_dim, hidden_size=lstm_hidden_dim,
-                                 num_layers=2, batch_first=True)
-         self.fc = nn.Linear(lstm_hidden_dim, vocab_len)
-         self.apply(weights_init)  # call the weight initialization function
-         self.softmax = nn.LogSoftmax(dim=1)  # log-probabilities over the vocabulary
-
-     def forward(self, sentence, is_test=False):
-         # sentence (seq_len, 1) -> embeddings (seq_len, 1, embedding_dim)
-         # -> batch_input (1, seq_len, embedding_dim)
-         batch_input = self.word_embedding_lookup(sentence).view(1, -1, self.word_embedding_dim)
-         # No initial state is passed, so the hidden and cell states start at zero.
-         output, _ = self.rnn_lstm(batch_input)  # (1, seq_len, lstm_hidden_dim)
-         out = output.contiguous().view(-1, self.lstm_dim)  # (seq_len, lstm_hidden_dim)
-         out = F.relu(self.fc(out))  # (seq_len, vocab_len)
-         out = self.softmax(out)
-
-         if is_test:
-             # Keep only the prediction for the last time step.
-             prediction = out[-1, :].view(1, -1)  # (1, vocab_len)
-             output = prediction
-         else:
-             output = out
-
-         return output
-
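To see how the tensor shapes change inside `forward`, the same sequence of layers can be run on random indices. This is a stand-alone sketch that does not import rnn.py. The vocabulary size of 50 and the sequence length of 7 are arbitrary toy values, while the embedding and hidden sizes match the ones used above.

```python
import torch
import torch.nn as nn

vocab_len, embedding_dim, hidden_dim, seq_len = 50, 100, 128, 7  # toy sizes

embedding = nn.Embedding(vocab_len, embedding_dim)
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim,
               num_layers=2, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_len)

sentence = torch.randint(0, vocab_len, (seq_len, 1))   # (7, 1), as passed in during training
embedded = embedding(sentence)                         # (7, 1, 100)
batch_input = embedded.view(1, -1, embedding_dim)      # (1, 7, 100): a batch holding one poem
output, (h_n, c_n) = lstm(batch_input)                 # output: (1, 7, 128)
flat = output.contiguous().view(-1, hidden_dim)        # (7, 128)
log_probs = torch.log_softmax(torch.relu(fc(flat)), dim=1)  # (7, 50)

for name, t in [("sentence", sentence), ("embedded", embedded),
                ("batch_input", batch_input), ("output", output),
                ("h_n", h_n), ("flat", flat), ("log_probs", log_probs)]:
    print(f"{name:12s}", tuple(t.shape))
```

Each row of `log_probs` is a distribution over the vocabulary for one time step, which is why `NLLLoss` in main.py can compare it directly with the shifted label sequence.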