The recurrent neural network (RNN) grew out of the idea of a memory model: the network should remember features that appeared earlier in the input and use them to infer what comes later. Because the overall network structure loops back on itself, it is called a recurrent neural network.

The defining feature of an RNN is this loop (or cycle). The loop lets data circulate through the network, so the RNN remembers past data while continually updating with the newest data.

An RNN (Recurrent Neural Network) generally takes sequence data as input, captures the relationships between elements of the sequence through its internal structure, and usually produces its output in sequence form as well.

The basic structure is very simple: the network's output is stored in a memory cell, and that memory cell enters the network together with the next input. Changing the order of the input sequence changes the network's output.
The recurrence mechanism of the RNN lets the hidden-layer result $h_t$ produced at one time step become part of the input to the next time step. The output at the current time step is therefore influenced by two things: the ordinary input $x_t$ and the hidden output of the previous step, $h_{t-1}$.
Classified by the relationship between input and output, RNNs come in several structures:

- N to N: the input and output sequences are of equal length; this restriction makes the range of applications fairly narrow.
- N to 1: the input is a sequence, the output is a single value.
- 1 to N: the input is not a sequence, the output is a sequence.
- N to M: the model consists of an encoder and a decoder, each internally some kind of RNN; this is also called the seq-to-seq structure. The encoder finally outputs a hidden (context) variable $c$ (an N-to-1 structure), and $c$ is then applied at every decoding step of the decoder (a 1-to-N structure) to make sure the input information is used effectively; see the sketch after this list.

Common variants of this family are the vanilla RNN, LSTM, Bi-LSTM, GRU, and Bi-GRU.
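To make the encoder-decoder idea concrete, here is a minimal numpy sketch (an illustrative assumption, not the implementation used later in this article): the encoder compresses the input sequence into a context vector `c`, and the decoder reuses `c` at every output step. The weight names and sizes are made up for the demo.

```python
import numpy as np

# Minimal encoder-decoder (seq-to-seq) sketch with tanh RNN cells.
D, H = 4, 8                                       # input size, hidden size (assumed)
rng = np.random.default_rng(0)
Wx_enc, Wh_enc = rng.normal(size=(D, H)) * 0.1, rng.normal(size=(H, H)) * 0.1
Wx_dec, Wh_dec = rng.normal(size=(H, H)) * 0.1, rng.normal(size=(H, H)) * 0.1

def encode(xs):
    """N-to-1: run the encoder over the sequence, return the last hidden state c."""
    h = np.zeros(H)
    for x in xs:                                  # xs: list of input vectors x_t
        h = np.tanh(x @ Wx_enc + h @ Wh_enc)
    return h                                      # context vector c

def decode(c, steps):
    """1-to-N: feed the context vector c into every decoding step."""
    h, outputs = np.zeros(H), []
    for _ in range(steps):
        h = np.tanh(c @ Wx_dec + h @ Wh_dec)      # c is reused at each step
        outputs.append(h)
    return outputs

xs = [rng.normal(size=D) for _ in range(5)]       # a length-5 input sequence
ys = decode(encode(xs), steps=3)                  # a length-3 output sequence
print(len(ys), ys[0].shape)                       # 3 (8,)
```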
Parameter | Meaning |
---|---|
$t$ | Time step $t$, i.e. the index into the time-series data |
$h_t$ | Memory cell at step $t$; $h_t$ is computed from the previous output $h_{t-1}$ and is called the hidden state (hidden-state vector) |
$h_{t-1}$ | Memory cell at step $t-1$ |
$f_W$ | Non-linear activation function, here tanh |
$W_{hh}$ | Weight applied to the memory cell $h_t$ at every step |
$W_{xh}$ | Weight applied to the input $x$ at every step |
$y_t$ | Output at step $t$ |
$A$ | Feature fusion, $\tanh(W_h h_{t-1} + W_x x_t)$ |
$X_t$ | Sequence feature with index $t$, i.e. the input data at step $t$ |
$h_0$ | Initial memory cell; there is no data before step 0, so $h_0$ is initialized as an all-zero matrix |
$$h_t = f_W(h_{t-1}, x_t) = \tanh(W_h h_{t-1} + W_x x_t)$$

$$y_t = W_{hy} h_t$$
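A minimal numpy sketch of this forward pass, unrolled over a short sequence (the sizes `D`, `H`, `O` and the weights are illustrative assumptions, not values from the article's later code):

```python
import numpy as np

D, H, O = 3, 5, 2                      # input size, hidden size, output size (assumed)
rng = np.random.default_rng(0)
W_x = rng.normal(size=(D, H)) * 0.1    # input-to-hidden weights (W_xh)
W_h = rng.normal(size=(H, H)) * 0.1    # hidden-to-hidden weights (W_hh)
W_hy = rng.normal(size=(H, O)) * 0.1   # hidden-to-output weights

xs = rng.normal(size=(4, D))           # a length-4 input sequence x_1..x_4
h = np.zeros(H)                        # h_0: all-zero initial hidden state
for x_t in xs:
    h = np.tanh(h @ W_h + x_t @ W_x)   # h_t = tanh(W_h h_{t-1} + W_x x_t)
    y_t = h @ W_hy                     # y_t = W_hy h_t
    print(y_t)
```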
During backpropagation the values of $W$ and $X$ are updated, and the shapes of all quantities must stay unchanged. The shapes involved are:

Parameter | Shape | Meaning |
---|---|---|
$X$ | (N, 2) | (number of samples, input vector length) |
$W$ | (2, 3) | (input vector length, number of neurons) |
$B$ | (3,) | (number of neurons,) |
$X \cdot W$ | (N, 3) | (number of samples, number of neurons) |
$Y$ | (N, 3) | (number of samples, number of neurons) |
$\frac{\partial L}{\partial Y}$ | (N, 3) | (number of samples, number of neurons) |

- The gradient of $X$ must have the same shape as $X$, i.e. (N, 2). The factor that contains the dimension N, namely $\frac{\partial L}{\partial Y}$, goes first; its column dimension is 3, so $W$ has to be transposed before the two can be multiplied, which gives
$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\, W^{T}$$
- The gradient of $W$ must have the same shape as $W$, i.e. (2, 3). The factor that contains the dimension 2 is $X$, so $X$ goes first and must be transposed; its column dimension is then N, so it can be multiplied directly with $\frac{\partial L}{\partial Y}$, which gives
$$\frac{\partial L}{\partial W} = X^{T}\, \frac{\partial L}{\partial Y}$$
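A quick numpy check of these shape rules, using the sizes from the table above:

```python
import numpy as np

N = 10
rng = np.random.default_rng(0)
X = rng.normal(size=(N, 2))          # (number of samples, input vector length)
W = rng.normal(size=(2, 3))          # (input vector length, number of neurons)
B = np.zeros(3)                      # (number of neurons,)
Y = X @ W + B                        # (N, 3)

dY = np.ones_like(Y)                 # stand-in for dL/dY, shape (N, 3)
dX = dY @ W.T                        # (N, 3) @ (3, 2) -> (N, 2), same shape as X
dW = X.T @ dY                        # (2, N) @ (N, 3) -> (2, 3), same shape as W
dB = dY.sum(axis=0)                  # (3,), same shape as B
print(dX.shape, dW.shape, dB.shape)  # (10, 2) (2, 3) (3,)
```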
Parameter | Meaning |
---|---|
$h_{next}$ | Output of the RNN layer at the current step (input to the RNN layer at the next step) |
$h_{prev}$ | Input of the RNN layer at the current step (output of the RNN layer at the previous step), i.e. $h_{t-1}$ |
$dh_{next}$ | Gradient flowing back from the RNN layer at the next step |
$d_{current}$ | Derivative at the current step |
$d_t$ | Gradient at the current step |
$x$ | Input at the current step, i.e. $x_t$ |
Matrix differentiation rules (the shape-based convention used here):

$$\frac{\partial AB}{\partial A} = B^{T}, \qquad \frac{\partial AB}{\partial B} = A^{T}$$
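These rules are shorthand for where the transposed factor ends up in the shape-consistent gradient. A tiny numpy check against a numerical gradient, with a toy loss $L = \sum AB$ chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))
B = rng.normal(size=(2, 3))
dOut = np.ones((4, 3))        # dL/d(AB) for the toy loss L = sum(A @ B)

dA = dOut @ B.T               # "dAB/dA = B^T": B^T multiplies on the right
dB = A.T @ dOut               # "dAB/dB = A^T": A^T multiplies on the left

# Compare dA against a numerical gradient of L = sum(A @ B)
eps = 1e-6
num_dA = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        Ap = A.copy()
        Ap[i, j] += eps
        num_dA[i, j] = ((Ap @ B).sum() - (A @ B).sum()) / eps
print(np.allclose(dA, num_dA, atol=1e-4))   # True
```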
Relevant formulas:

$$h_{next} = h_{t-1} W_h + x_t W_x + b$$

$$h_{current} = \tanh(h_{next}) = \tanh(h_{t-1} W_h + x_t W_x + b)$$
Gradient derivation at the current step, for each quantity in turn (a numpy sketch follows this list):

- Derivative with respect to the bias term $b$
- Derivative with respect to $W_h$
- Derivative with respect to $W_x$
- Derivative with respect to $x$
- Derivative with respect to $h_{prev}$ (i.e. $h_{t-1}$)
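One consistent set of these gradients for $h_{next} = \tanh(h_{prev} W_h + x W_x + b)$, written as a short sketch; it matches what the `RNN.backward` method in the code further below computes. Arguments are numpy arrays; `dh_next` is the gradient arriving from the next time step.

```python
def rnn_step_backward(dh_next, x, h_prev, h_next, Wx, Wh):
    dt = dh_next * (1 - h_next ** 2)   # through tanh: tanh'(a) = 1 - tanh(a)^2
    db = dt.sum(axis=0)                # gradient of the bias b
    dWh = h_prev.T @ dt                # gradient of W_h  (X^T-style rule)
    dWx = x.T @ dt                     # gradient of W_x
    dx = dt @ Wx.T                     # gradient w.r.t. the input x_t
    dh_prev = dt @ Wh.T                # gradient w.r.t. h_{t-1}
    return dx, dh_prev, dWx, dWh, db
```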
Backpropagation through time (BPTT)

Relevant formulas:

$$h_i = W_h h_{i-1} + W_x x_i$$

$$h_t = \tanh(W_h h_{t-1} + W_x x_t)$$

$$y_t = W_{yh} h_t$$

$$Loss_t = \frac{1}{2}(y - y_t)^2$$
Chain rule:

$$\frac{\partial Loss(t)}{\partial W_h} = \frac{\partial Loss(t)}{\partial y_t} \cdot \frac{\partial y_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_i} \cdot \frac{\partial h_i}{\partial W_h}$$
Results of the derivation:

$$\frac{\partial Loss(t)}{\partial y_t} = \frac{\partial}{\partial y_t}\left[\frac{1}{2}\left(y_{true} - y_t\right)^2\right] = y_t - y_{true}$$

$$\frac{\partial y_t}{\partial h_t} = \frac{\partial \left(W_{yh} h_t\right)}{\partial h_t} = W_{yh}$$

$$\frac{\partial h_i}{\partial W_h} = \frac{\partial \left(W_h h_{i-1} + W_x x_i\right)}{\partial W_h} = h_{i-1}$$

$$\frac{\partial h_t}{\partial h_i} = \frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdot \frac{\partial h_{t-2}}{\partial h_{t-3}} \cdots \frac{\partial h_{i+1}}{\partial h_i} = \prod_{k=i}^{t-1}\frac{\partial h_{k+1}}{\partial h_k}$$

$$\frac{\partial h_{k+1}}{\partial h_k} = \mathrm{diag}\!\left(f'\!\left(W_h h_k + W_x x_{k+1}\right)\right) W_h$$

$$\frac{\partial h_k}{\partial h_1} = \prod_{j=1}^{k-1}\mathrm{diag}\!\left(f'\!\left(W_h h_j + W_x x_{j+1}\right)\right) W_h$$
These formulas show that computing $\frac{\partial h_k}{\partial h_1}$ involves an accumulated product in which $W_h$ appears raised to roughly the $k$-th power; depending on its magnitude (and on the activation derivatives), the gradient can vanish or explode over long sequences, as the small numeric check below illustrates.
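A small numpy illustration of that accumulated product; the weight scale (0.3) and the sequence length are arbitrary demo choices, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 8, 50
x = rng.normal(size=(T, H))
Wh = rng.normal(size=(H, H)) * 0.3       # try 0.3 (small) vs. 1.5 (large)
Wx = np.eye(H)

h = np.zeros(H)
jac_prod = np.eye(H)                     # running product of dh_{k+1}/dh_k
for t in range(T):
    a = Wh @ h + Wx @ x[t]
    h = np.tanh(a)
    jac = np.diag(1 - h ** 2) @ Wh       # diag(f'(.)) * W_h
    jac_prod = jac @ jac_prod
    if (t + 1) % 10 == 0:
        print(t + 1, np.linalg.norm(jac_prod))   # norm shrinks (or explodes)
```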
Data preprocessing and batch construction for the RNNLM:

```python
def preprocess_rnnlm(sentences_list, lis=[]):
    """
    Corpus preprocessing.
    :param sentences_list: list of sentences
    :return: word_list    list of words
             word_dict    word -> word-ID dictionary
             number_dict  word-ID -> word dictionary
             n_class      number of words
    """
    for i in sentences_list:
        text = i.split('.')[0].split(' ')  # split on spaces and collect the tokens of the sentence
        word_list = list({}.fromkeys(text).keys())  # de-duplicate to get the vocabulary
        lis = lis + word_list
        word_list = list({}.fromkeys(lis).keys())
    corpus = [i for i, w in enumerate(word_list)]
    word_dict = {w: i for i, w in enumerate(word_list)}
    number_dict = {i: w for i, w in enumerate(word_list)}
    n_class = len(word_dict)  # vocabulary size, also the number of softmax classes
    return word_list, word_dict, number_dict, n_class, corpus


def make_batch_rnnlm(sentences_list, word_dict, windows_size=1):
    """
    Word-vector encoding function.
    :param sentences_list: list of sentences
    :param word_dict: dictionary {'You': 0, ...}, key: word, value: index
    :param windows_size: window size
    :return: input_batch: dataset vectors
             target_batch: labels
    """
    input_batch, target_batch = [], []
    for sen in sentences_list:
        word_repeat_list = sen.split(' ')  # split on spaces
        for i in range(windows_size, len(word_repeat_list)):  # iterate over target-word positions
            target = word_repeat_list[i]  # target word
            input_index = [word_dict[word_repeat_list[j]]
                           for j in range((i - windows_size), i)][0]  # input word for this target
            target_index = word_dict[target]  # target-word index
            input_batch.append(input_index)
            target_batch.append(target_index)
    return input_batch, target_batch


if __name__ == '__main__':
    # build the dataset
    sentences_list = ['After learning his achievement in science in a speech Tom has admired teddy much '
                      'for his concentration on what he studies']  # training data
    word_list, word_to_id, id_to_word, n_class, corpus = preprocess_rnnlm(sentences_list)
    print('word_to_id:', word_to_id)
    print('id_to_word:', id_to_word)
    print('corpus: ', corpus)  # word-index sequence of the text
    xs, ts = make_batch_rnnlm(sentences_list, word_to_id, windows_size=1)  # build inputs and target labels

    vocab_size = int(max(corpus) + 1)  # number of distinct words
    corpus_size = int(max(corpus))     # largest word id
    data_size = len(xs)                # number of samples
    print('xs', xs)  # input text
    print('ts', ts)  # outputs (supervision labels)
```
The complete RNNLM training script (time layers, the model, and the training loop):

```python
import numpy as np
import matplotlib.pyplot as plt

GPU = False  # switch for the GPU (cupy scatter_add) branch in Embedding.backward


def preprocess_rnnlm(sentences_list, lis=[]):
    """
    Corpus preprocessing.
    :param sentences_list: list of sentences
    :return: word_list    list of words
             word_dict    word -> word-ID dictionary
             number_dict  word-ID -> word dictionary
             n_class      number of words
    """
    for i in sentences_list:
        text = i.split('.')[0].split(' ')  # split on spaces and collect the tokens of the sentence
        word_list = list({}.fromkeys(text).keys())  # de-duplicate to get the vocabulary
        lis = lis + word_list
        word_list = list({}.fromkeys(lis).keys())
    corpus = [i for i, w in enumerate(word_list)]
    word_dict = {w: i for i, w in enumerate(word_list)}
    number_dict = {i: w for i, w in enumerate(word_list)}
    n_class = len(word_dict)  # vocabulary size, also the number of softmax classes
    return word_list, word_dict, number_dict, n_class, corpus


def make_batch_rnnlm(sentences_list, word_dict, windows_size=1):
    """
    Word-vector encoding function.
    :param sentences_list: list of sentences
    :param word_dict: dictionary {'You': 0, ...}, key: word, value: index
    :param windows_size: window size
    :return: input_batch: dataset vectors
             target_batch: labels
    """
    input_batch, target_batch = [], []
    for sen in sentences_list:
        word_repeat_list = sen.split(' ')  # split on spaces
        for i in range(windows_size, len(word_repeat_list)):  # iterate over target-word positions
            target = word_repeat_list[i]  # target word
            input_index = [word_dict[word_repeat_list[j]]
                           for j in range((i - windows_size), i)][0]  # input word for this target
            target_index = word_dict[target]  # target-word index
            input_batch.append(input_index)
            target_batch.append(target_index)
    return input_batch, target_batch


class SGD:
    '''Stochastic Gradient Descent.'''
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]


def softmax(x):
    if x.ndim == 2:
        x = x - x.max(axis=1, keepdims=True)
        x = np.exp(x)
        x /= x.sum(axis=1, keepdims=True)
    elif x.ndim == 1:
        x = x - np.max(x)
        x = np.exp(x) / np.sum(np.exp(x))
    return x


def cross_entropy_error(y, t):
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
    # if the labels are one-hot vectors, convert them to class indices
    if t.size == y.size:
        t = t.argmax(axis=1)
    batch_size = y.shape[0]
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size


class TimeSoftmaxWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.cache = None
        self.ignore_label = -1

    def forward(self, xs, ts):
        N, T, V = xs.shape
        if ts.ndim == 3:  # if the labels are one-hot vectors
            ts = ts.argmax(axis=2)
        mask = (ts != self.ignore_label)
        # flatten the batch and time dimensions (reshape)
        xs = xs.reshape(N * T, V)
        ts = ts.reshape(N * T)
        mask = mask.reshape(N * T)
        ys = softmax(xs)
        ls = np.log(ys[np.arange(N * T), ts])
        ls *= mask  # set the loss of entries equal to ignore_label to 0
        loss = -np.sum(ls)
        loss /= mask.sum()
        self.cache = (ts, ys, mask, (N, T, V))
        return loss

    def backward(self, dout=1):
        ts, ys, mask, (N, T, V) = self.cache
        dx = ys
        dx[np.arange(N * T), ts] -= 1
        dx *= dout
        dx /= mask.sum()
        dx *= mask[:, np.newaxis]  # set the gradient of entries equal to ignore_label to 0
        dx = dx.reshape((N, T, V))
        return dx


class SoftmaxWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.y = None  # softmax output
        self.t = None  # labels

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        # if the labels are one-hot vectors, convert them to class indices
        if self.t.size == self.y.size:
            self.t = self.t.argmax(axis=1)
        loss = cross_entropy_error(self.y, self.t)
        return loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]
        dx = self.y.copy()
        dx[np.arange(batch_size), self.t] -= 1
        dx *= dout
        dx = dx / batch_size
        return dx


class SimpleRnnlm:
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        V, D, H = vocab_size, wordvec_size, hidden_size
        rn = np.random.randn
        # initialize the weights
        embed_W = (rn(V, D) / 100).astype('f')
        rnn_Wx = (rn(D, H) / np.sqrt(D)).astype('f')
        rnn_Wh = (rn(H, H) / np.sqrt(H)).astype('f')
        rnn_b = np.zeros(H).astype('f')
        affine_W = (rn(H, V) / np.sqrt(H)).astype('f')
        affine_b = np.zeros(V).astype('f')
        # build the layers
        self.layers = [
            TimeEmbedding(embed_W),
            TimeRNN(rnn_Wx, rnn_Wh, rnn_b, stateful=True),
            TimeAffine(affine_W, affine_b)
        ]
        self.loss_layer = TimeSoftmaxWithLoss()
        self.rnn_layer = self.layers[1]
        # collect all weights and gradients into lists
        self.params, self.grads = [], []
        for layer in self.layers:
            self.params += layer.params
            self.grads += layer.grads

    def forward(self, xs, ts):
        for layer in self.layers:
            xs = layer.forward(xs)
        loss = self.loss_layer.forward(xs, ts)
        return loss

    def backward(self, dout=1):
        dout = self.loss_layer.backward(dout)
        for layer in reversed(self.layers):
            dout = layer.backward(dout)
        return dout

    def reset_state(self):
        self.rnn_layer.reset_state()


class TimeRNN:
    def __init__(self, Wx, Wh, b, stateful=False):
        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.layers = None
        self.h, self.dh = None, None
        self.stateful = stateful

    def forward(self, xs):
        Wx, Wh, b = self.params
        N, T, D = xs.shape
        D, H = Wx.shape
        self.layers = []
        hs = np.empty((N, T, H), dtype='f')
        if not self.stateful or self.h is None:
            self.h = np.zeros((N, H), dtype='f')
        for t in range(T):
            layer = RNN(*self.params)
            self.h = layer.forward(xs[:, t, :], self.h)
            hs[:, t, :] = self.h
            self.layers.append(layer)
        return hs

    def backward(self, dhs):
        Wx, Wh, b = self.params
        N, T, H = dhs.shape
        D, H = Wx.shape
        dxs = np.empty((N, T, D), dtype='f')
        dh = 0
        grads = [0, 0, 0]
        for t in reversed(range(T)):
            layer = self.layers[t]
            dx, dh = layer.backward(dhs[:, t, :] + dh)
            dxs[:, t, :] = dx
            for i, grad in enumerate(layer.grads):
                grads[i] += grad
        for i, grad in enumerate(grads):
            self.grads[i][...] = grad
        self.dh = dh
        return dxs

    def set_state(self, h):
        self.h = h

    def reset_state(self):
        self.h = None


class RNN:
    def __init__(self, Wx, Wh, b):
        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.cache = None

    def forward(self, x, h_prev):
        Wx, Wh, b = self.params
        t = np.dot(h_prev, Wh) + np.dot(x, Wx) + b
        h_next = np.tanh(t)
        self.cache = (x, h_prev, h_next)
        return h_next

    def backward(self, dh_next):
        Wx, Wh, b = self.params
        x, h_prev, h_next = self.cache
        dt = dh_next * (1 - h_next ** 2)
        db = np.sum(dt, axis=0)
        dWh = np.dot(h_prev.T, dt)
        dh_prev = np.dot(dt, Wh.T)
        dWx = np.dot(x.T, dt)
        dx = np.dot(dt, Wx.T)
        self.grads[0][...] = dWx
        self.grads[1][...] = dWh
        self.grads[2][...] = db
        return dx, dh_prev


class TimeEmbedding:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.layers = None
        self.W = W

    def forward(self, xs):
        N, T = xs.shape
        V, D = self.W.shape
        out = np.empty((N, T, D), dtype='f')
        self.layers = []
        for t in range(T):
            layer = Embedding(self.W)
            out[:, t, :] = layer.forward(xs[:, t])
            self.layers.append(layer)
        return out

    def backward(self, dout):
        N, T, D = dout.shape
        grad = 0
        for t in range(T):
            layer = self.layers[t]
            layer.backward(dout[:, t, :])
            grad += layer.grads[0]
        self.grads[0][...] = grad
        return None


class Embedding:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.idx = None

    def forward(self, idx):
        W, = self.params
        self.idx = idx
        out = W[idx]
        return out

    def backward(self, dout):
        dW, = self.grads
        dW[...] = 0
        if GPU:
            np.scatter_add(dW, self.idx, dout)  # cupy's scatter_add when np is replaced by cupy
        else:
            np.add.at(dW, self.idx, dout)
        return None


class TimeAffine:
    def __init__(self, W, b):
        self.params = [W, b]
        self.grads = [np.zeros_like(W), np.zeros_like(b)]
        self.x = None

    def forward(self, x):
        N, T, D = x.shape
        W, b = self.params
        rx = x.reshape(N * T, -1)
        out = np.dot(rx, W) + b
        self.x = x
        return out.reshape(N, T, -1)

    def backward(self, dout):
        x = self.x
        N, T, D = x.shape
        W, b = self.params
        dout = dout.reshape(N * T, -1)
        rx = x.reshape(N * T, -1)
        db = np.sum(dout, axis=0)
        dW = np.dot(rx.T, dout)
        dx = np.dot(dout, W.T)
        dx = dx.reshape(*x.shape)
        self.grads[0][...] = dW
        self.grads[1][...] = db
        return dx


if __name__ == '__main__':
    # build the dataset
    sentences_list = ['After learning his achievement in science in a speech Tom has admired teddy much '
                      'for his concentration on what he studies']  # training data
    word_list, word_to_id, id_to_word, n_class, corpus = preprocess_rnnlm(sentences_list)
    print('word_to_id:', word_to_id)
    print('id_to_word:', id_to_word)
    print('corpus: ', corpus)  # word-index sequence of the text
    xs, ts = make_batch_rnnlm(sentences_list, word_to_id, windows_size=1)  # build inputs and target labels

    vocab_size = int(max(corpus) + 1)  # number of distinct words
    corpus_size = int(max(corpus))     # largest word id
    data_size = len(xs)                # number of samples
    print('xs', xs)  # input text
    print('ts', ts)  # outputs (supervision labels)

    # hyperparameters
    batch_size = 5      # number of sequences fed in at once
    wordvec_size = 100  # word-vector length
    hidden_size = 100   # number of hidden units
    time_size = 2       # time span of truncated BPTT (backpropagation length)
    lr = 0.1            # learning rate
    max_epoch = 100     # number of training epochs

    print('corpus size: %d, vocab size: %d' % (corpus_size, vocab_size))
    print('data_size,batch_size', data_size, batch_size)  # batch_size * time_size = samples per iteration
    max_iters = data_size // (batch_size * time_size)
    print('max_iters', max_iters)
    time_idx, total_loss, loss_count, ppl_list = 0, 0, 0, []

    # build the model
    model = SimpleRnnlm(vocab_size, wordvec_size, hidden_size)
    optimizer = SGD(lr)

    # compute the start position of each mini-batch row;
    # each row reads the data with its own offset
    jump = corpus_size // batch_size
    offsets = [i * jump for i in range(batch_size)]  # e.g. [0, 3, 6, 9, 12]
    print('jump', jump)
    print('offsets', offsets)
    print('batch_size', batch_size)
    print('time_size', time_size)

    for epoch in range(max_epoch):
        for iter in range(max_iters):
            # fetch a mini-batch (np.empty is uninitialized; the values are filled in below)
            batch_x = np.empty((batch_size, time_size), dtype='i')  # shape (5, 2)
            batch_t = np.empty((batch_size, time_size), dtype='i')  # shape (5, 2)
            for t in range(time_size):  # fill column t of every batch row
                for i, offset in enumerate(offsets):  # i: batch row
                    batch_x[i, t] = xs[(offset + time_idx) % data_size]
                    batch_t[i, t] = ts[(offset + time_idx) % data_size]
                time_idx += 1

            # compute the gradients and update the parameters
            loss = model.forward(batch_x, batch_t)
            model.backward()
            optimizer.update(model.params, model.grads)
            total_loss += loss
            loss_count += 1

        # perplexity evaluation for each epoch
        ppl = np.exp(total_loss / loss_count)
        print('| epoch %d | perplexity %.2f' % (epoch + 1, ppl))
        ppl_list.append(float(ppl))
        total_loss, loss_count = 0, 0

    # plot the training curve
    x = np.arange(len(ppl_list))
    plt.plot(x, ppl_list, label='train')
    plt.xlabel('epochs')
    plt.ylabel('perplexity')
    plt.show()
```