This post mainly covers the overall process of forward propagation and backpropagation.

The raw output of a network layer is denoted $Z^{[l]}$, i.e. the output before activation; the output after the activation function is denoted $A^{[l]}$.
The first figure shows the forward pass and backpropagation through the network, the second figure explains the relationships between the intermediate variables, and the third figure is the computation graph of the forward and backward passes, which makes the derivation easier to follow. Note that the $A^{[l-2]}$ in the lower-left corner of the third figure is a mistake; it should be $A^{[l-1]}$.
To make the derivation easier to follow, we first introduce the notation:

| Symbol | Meaning |
|---|---|
| $n_l$ | number of neurons in layer $l$ |
| $f_l(\cdot)$ | activation function of layer $l$ |
| $\mathbf{W}^l \in \mathbb{R}^{n_{l-1} \times n_l}$ | weight matrix from layer $l-1$ to layer $l$ |
| $\mathbf{b}^l \in \mathbb{R}^{n_l}$ | bias from layer $l-1$ to layer $l$ |
| $\mathbf{Z}^l \in \mathbb{R}^{n_l}$ | net output of layer $l$, before activation |
| $\mathbf{A}^l \in \mathbb{R}^{n_l}$ | output of layer $l$ after the activation function, with $A^0 = X$ |
A deep neural network is just a stack of single-layer networks, so we can write down the most basic building block and then stack copies of it to obtain a deep network.

We can therefore start coding by describing a single network layer with a `Layer` class:
```python
class Layer:
    def __init__(self, input_dim, output_dim):
        # Initialize parameters
        self.W = np.random.randn(input_dim, output_dim) * 0.01
        self.b = np.zeros((1, output_dim))

    def forward(self, X):
        # Forward pass
        self.Z = np.dot(X, self.W) + self.b
        self.A = self.activation(self.Z)
        return self.A

    def backward(self, dA, A_prev, activation_derivative):
        # Backward pass; see the derivation below
        m = A_prev.shape[0]
        self.dZ = dA * activation_derivative(self.Z)
        self.dW = np.dot(A_prev.T, self.dZ) / m
        self.db = np.sum(self.dZ, axis=0, keepdims=True) / m
        dA_prev = np.dot(self.dZ, self.W.T)
        return dA_prev

    def update_parameters(self, learning_rate):
        # Parameter update
        self.W -= learning_rate * self.dW
        self.b -= learning_rate * self.db


# Layer with a ReLU activation
class ReLULayer(Layer):
    def activation(self, Z):
        return np.maximum(0, Z)

    def activation_derivative(self, Z):
        return (Z > 0).astype(float)


# Layer with a Softmax activation (mainly used for classification)
class SoftmaxLayer(Layer):
    def activation(self, Z):
        exp_z = np.exp(Z - np.max(Z, axis=1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)

    def activation_derivative(self, Z):
        # The softmax Jacobian is more complex and is not used directly here;
        # combined with cross-entropy, the gradient simplifies to A - y,
        # so a derivative of 1 is returned.
        return np.ones_like(Z)
```
The core of the weight update is computing `self.dW` and `self.db`; at the same time, so that the gradient keeps flowing backwards, the `backward` function must return the gradient `dA_prev` for the previous layer.
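As a quick sanity check (not part of the original post), here is a minimal sketch of how the `ReLULayer` defined above can be driven by hand; the layer sizes and batch size are made up for illustration, and it assumes `numpy` is imported as `np` alongside the class definitions:

```python
import numpy as np

np.random.seed(0)

layer = ReLULayer(input_dim=4, output_dim=3)   # hypothetical sizes, just for illustration
X = np.random.randn(8, 4)                      # a batch of 8 samples plays the role of A^{l-1}

A = layer.forward(X)                           # activated output, shape (8, 3)
dA = np.random.randn(*A.shape)                 # pretend gradient coming from the next layer
dA_prev = layer.backward(dA, X, layer.activation_derivative)

# Shapes match the notation table: W is (4, 3), b is (1, 3)
print(A.shape, dA_prev.shape, layer.dW.shape, layer.db.shape)
# expected: (8, 3) (8, 4) (4, 3) (1, 3)
```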
The formulas we will need:
Z
l
=
W
l
A
l
−
1
+
b
l
A
l
=
f
(
Z
l
)
d
Z
d
W
=
(
A
l
−
1
)
T
d
Z
d
b
=
1
Z^l = W^l A^{l-1} +b^l \\A^l = f(Z^l)\\\frac{dZ}{dW} = (A^{l-1})^T \\\frac{dZ}{db} = 1
Zl=WlAl−1+blAl=f(Zl)dWdZ=(Al−1)TdbdZ=1
Explanation:

From the backward pass on the right-hand side of the computation graph above, the gradient `dA` coming back from layer $l+1$ flows through `dZ` directly into `db`, and through `dU` into `dW`. This gives the following formulas for `dW` and `db`:
$$
dW = dA \cdot \frac{dA}{dZ} \cdot \frac{dZ}{dW} = \left(A_{prev}\right)^T \left( dA \odot f'(Z^l) \right)
$$
where $f(\cdot)$ is the activation function, $f'(\cdot)$ is its derivative, $\odot$ denotes element-wise multiplication, and $A_{prev}^T$ is the transpose of the activation output of the previous layer.
Similarly, we obtain

$$
db = dA \cdot \frac{dA}{dZ} \cdot \frac{dZ}{db} = dA \odot f'(Z^l)
$$

(in the code, `db` is additionally summed over the batch).
We also need the gradient that keeps propagating to the previous layer:

$$
dA_{prev} = dA \cdot \frac{dA}{dZ} \cdot \frac{dZ}{dA_{prev}} = \left( dA \odot f'(Z^l) \right) W^T
$$
With the derivation above, we can therefore pass the gradient backwards, from the output layer all the way to the first layer.
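To convince ourselves that the formulas match the code, a small finite-difference gradient check can be run on a single layer. This sketch is not from the original post; it uses a made-up scalar loss (the sum of the activated outputs) purely to have something to differentiate:

```python
import numpy as np

np.random.seed(0)

# A toy single layer: 4 inputs -> 3 outputs, ReLU activation
A_prev = np.random.randn(5, 4)            # batch of 5 samples from the previous layer
W = np.random.randn(4, 3) * 0.1
b = np.zeros((1, 3))

def relu(Z):
    return np.maximum(0, Z)

def relu_grad(Z):
    return (Z > 0).astype(float)

def loss(W, b):
    # Made-up scalar loss: the sum of the activated outputs
    Z = A_prev @ W + b
    return np.sum(relu(Z))

# Analytic gradients following the derivation above (here dA = dL/dA = 1)
Z = A_prev @ W + b
dA = np.ones_like(Z)
dZ = dA * relu_grad(Z)
dW = A_prev.T @ dZ                         # matches (A_prev)^T (dA ⊙ f'(Z))
db = np.sum(dZ, axis=0, keepdims=True)

# Numerical gradient for a single entry of W via central differences
eps = 1e-6
i, j = 2, 1
W_plus, W_minus = W.copy(), W.copy()
W_plus[i, j] += eps
W_minus[i, j] -= eps
num_dW_ij = (loss(W_plus, b) - loss(W_minus, b)) / (2 * eps)

print(dW[i, j], num_dW_ij)   # the two numbers should agree to several decimal places
```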
The most common loss function for classification is the cross-entropy loss, which measures the discrepancy between the model's output distribution and the true labels. Its formula is:
$$
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,\log(\hat{y}_{ij})
$$
where $N$ is the number of samples, $C$ is the number of classes, and $y_{ij}$ is the $j$-th entry of the $i$-th sample's label. Because the labels are one-hot encoded, each row has exactly one entry equal to 1 and the rest are 0, so the cross-entropy loss takes the log of the predicted probability at the non-zero position of each sample $i$ and then returns the negative of the average over all samples.
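A tiny worked example (not from the original post) may help: with one-hot labels, only the predicted probability at the true class of each sample enters the loss.

```python
import numpy as np

# Two samples, three classes, one-hot labels
y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])

# Only the log-probability at the true class contributes for each sample
L = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
print(L)  # ( -log(0.7) - log(0.6) ) / 2 ≈ 0.4338
```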
Taking the gradient with respect to the pre-softmax logits $z$, the cross-entropy loss has a very compact form:
$$
\nabla_{z} L = \hat{\mathbf{y}} - \mathbf{y}
$$
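A quick numerical check of this identity for a single sample (a sketch of mine, not part of the original code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([0.5, -1.0, 2.0])        # logits for a single sample
y = np.array([0.0, 0.0, 1.0])         # one-hot label

def ce_loss(z):
    return -np.sum(y * np.log(softmax(z)))

# Analytic gradient: y_hat - y
analytic = softmax(z) - y

# Central-difference numerical gradient
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    numeric[k] = (ce_loss(zp) - ce_loss(zm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # ~1e-9, the two gradients match
```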
Thanks to its good properties, the mean squared error (MSE) loss is widely used for regression problems. Its formula is:
$$
L = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2
$$
In vector form:
$$
L = \frac{1}{N}\,\lVert \mathbf{y} - \hat{\mathbf{y}} \rVert_2^2
$$
Gradient:
$$
\nabla_{\hat{\mathbf{y}}} L = \frac{2}{N}\left(\hat{\mathbf{y}} - \mathbf{y}\right)
$$
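A short numerical check of the $\frac{2}{N}$ factor (again a sketch, not part of the original code). Note that the regression script below uses `dA = -(y - A)`, i.e. it drops the constant factor of 2 (the $\frac{1}{N}$ is applied inside `backward`); this only rescales the gradient and is absorbed by the learning rate.

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 1.8, 2.4])
N = len(y)

mse = np.mean((y - y_hat) ** 2)
grad = 2.0 / N * (y_hat - y)          # gradient of the MSE w.r.t. y_hat

# Quick numerical check on the first component
eps = 1e-6
yp = y_hat.copy(); yp[0] += eps
ym = y_hat.copy(); ym[0] -= eps
num = (np.mean((y - yp) ** 2) - np.mean((y - ym) ** 2)) / (2 * eps)
print(grad[0], num)                   # both ≈ 0.3333
```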
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt


class Layer:
    def __init__(self, input_dim, output_dim):
        self.W = np.random.randn(input_dim, output_dim) * 0.01
        self.b = np.zeros((1, output_dim))

    def forward(self, X):
        self.Z = np.dot(X, self.W) + self.b  # output before activation
        self.A = self.activation(self.Z)     # output after activation
        return self.A

    def backward(self, dA, A_prev, activation_derivative):
        # Note: the gradient flows backwards: l+1 --> l --> l-1
        # A_prev is the output of layer l-1, i.e. A^{l-1}
        # dA is the gradient passed back from layer l+1
        # activation_derivative is the derivative of the activation function
        # dA_prev is the gradient passed on to layer l-1
        m = A_prev.shape[0]
        self.dZ = dA * activation_derivative(self.Z)
        self.dW = np.dot(A_prev.T, self.dZ) / m
        self.db = np.sum(self.dZ, axis=0, keepdims=True) / m
        dA_prev = np.dot(self.dZ, self.W.T)  # gradient passed back to the previous layer
        return dA_prev

    def update_parameters(self, learning_rate):
        self.W -= learning_rate * self.dW
        self.b -= learning_rate * self.db


class ReLULayer(Layer):
    def activation(self, Z):
        return np.maximum(0, Z)

    def activation_derivative(self, Z):
        return (Z > 0).astype(float)


class SoftmaxLayer(Layer):
    def activation(self, Z):
        exp_z = np.exp(Z - np.max(Z, axis=1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)

    def activation_derivative(self, Z):
        # The softmax Jacobian is not used directly; combined with
        # cross-entropy the gradient simplifies to A - y.
        return np.ones_like(Z)


class NeuralNetwork:
    def __init__(self, layer_dims, learning_rate=0.01):
        self.layers = []
        self.learning_rate = learning_rate
        for i in range(len(layer_dims) - 2):
            self.layers.append(ReLULayer(layer_dims[i], layer_dims[i + 1]))
        self.layers.append(SoftmaxLayer(layer_dims[-2], layer_dims[-1]))

    def cross_entropy_loss(self, y_true, y_pred):
        n_samples = y_true.shape[0]
        y_pred_clipped = np.clip(y_pred, 1e-12, 1 - 1e-12)
        return -np.sum(y_true * np.log(y_pred_clipped)) / n_samples

    def accuracy(self, y_true, y_pred):
        y_true_labels = np.argmax(y_true, axis=1)
        y_pred_labels = np.argmax(y_pred, axis=1)
        return np.mean(y_true_labels == y_pred_labels)

    def train(self, X, y, epochs):
        loss_history = []
        for epoch in range(epochs):
            A = X
            # Forward propagation
            cache = [A]
            for layer in self.layers:
                A = layer.forward(A)
                cache.append(A)
            loss = self.cross_entropy_loss(y, A)
            loss_history.append(loss)
            # Backward propagation
            # derivative of the loss function
            dA = A - y
            for i in reversed(range(len(self.layers))):
                layer = self.layers[i]
                A_prev = cache[i]
                dA = layer.backward(dA, A_prev, layer.activation_derivative)
            # Update parameters
            for layer in self.layers:
                layer.update_parameters(self.learning_rate)
            if (epoch + 1) % 100 == 0:
                print(f'Epoch {epoch + 1}/{epochs}, Loss: {loss:.4f}')
        return loss_history

    def predict(self, X):
        A = X
        for layer in self.layers:
            A = layer.forward(A)
        return A


# Load the data
iris = load_iris()
X = iris.data
y = iris.target.reshape(-1, 1)

# One-hot encoding
encoder = OneHotEncoder(sparse_output=False)
y = encoder.fit_transform(y)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define and train the neural network
layer_dims = [X_train.shape[1], 100, 20, y_train.shape[1]]  # Example with 2 hidden layers
learning_rate = 0.01
epochs = 1000
nn = NeuralNetwork(layer_dims, learning_rate)
loss_history = nn.train(X_train, y_train, epochs)

# Predict and evaluate
train_predictions = nn.predict(X_train)
test_predictions = nn.predict(X_test)
train_acc = nn.accuracy(y_train, train_predictions)
test_acc = nn.accuracy(y_test, test_predictions)
print(f'Training Accuracy: {train_acc:.4f}')
print(f'Test Accuracy: {test_acc:.4f}')

# Plot the loss curve
plt.plot(loss_history)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Curve')
plt.show()
```
```
Epoch 100/1000, Loss: 1.0983
Epoch 200/1000, Loss: 1.0980
Epoch 300/1000, Loss: 1.0975
Epoch 400/1000, Loss: 1.0960
Epoch 500/1000, Loss: 1.0891
Epoch 600/1000, Loss: 1.0119
Epoch 700/1000, Loss: 0.6284
Epoch 800/1000, Loss: 0.3711
Epoch 900/1000, Loss: 0.2117
Epoch 1000/1000, Loss: 0.1290
Training Accuracy: 0.9833
Test Accuracy: 1.0000
```
We can see that after 1000 training epochs, the final accuracy on the test set reaches 100%.
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing


class Layer:
    def __init__(self, input_dim, output_dim):
        self.W = np.random.randn(input_dim, output_dim) * 0.01
        self.b = np.zeros((1, output_dim))

    def forward(self, X):
        self.Z = np.dot(X, self.W) + self.b
        self.A = self.activation(self.Z)
        return self.A

    def backward(self, dA, X, activation_derivative):
        m = X.shape[0]
        self.dZ = dA * activation_derivative(self.Z)
        self.dW = np.dot(X.T, self.dZ) / m
        self.db = np.sum(self.dZ, axis=0, keepdims=True) / m
        dA_prev = np.dot(self.dZ, self.W.T)
        return dA_prev

    def update_parameters(self, learning_rate):
        self.W -= learning_rate * self.dW
        self.b -= learning_rate * self.db


class ReLULayer(Layer):
    def activation(self, Z):
        return np.maximum(0, Z)

    def activation_derivative(self, Z):
        return (Z > 0).astype(float)


class LinearLayer(Layer):
    def activation(self, Z):
        return Z

    def activation_derivative(self, Z):
        return np.ones_like(Z)


class NeuralNetwork:
    def __init__(self, layer_dims, learning_rate=0.01):
        self.layers = []
        self.learning_rate = learning_rate
        for i in range(len(layer_dims) - 2):
            self.layers.append(ReLULayer(layer_dims[i], layer_dims[i + 1]))
        self.layers.append(LinearLayer(layer_dims[-2], layer_dims[-1]))

    def mean_squared_error(self, y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)

    def train(self, X, y, epochs):
        loss_history = []
        for epoch in range(epochs):
            A = X
            # Forward propagation
            cache = [A]
            for layer in self.layers:
                A = layer.forward(A)
                cache.append(A)
            loss = self.mean_squared_error(y, A)
            loss_history.append(loss)
            # Backward propagation
            # derivative of the loss function
            dA = -(y - A)
            for i in reversed(range(len(self.layers))):
                layer = self.layers[i]
                A_prev = cache[i]
                dA = layer.backward(dA, A_prev, layer.activation_derivative)
            # Update parameters
            for layer in self.layers:
                layer.update_parameters(self.learning_rate)
            if (epoch + 1) % 100 == 0:
                print(f'Epoch {epoch + 1}/{epochs}, Loss: {loss:.4f}')
        return loss_history

    def predict(self, X):
        A = X
        for layer in self.layers:
            A = layer.forward(A)
        return A


# Load the data
housing = fetch_california_housing()
X = housing.data
y = housing.target.reshape(-1, 1)

# Standardize
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X = scaler_X.fit_transform(X)
y = scaler_y.fit_transform(y)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define and train the neural network
layer_dims = [X_train.shape[1], 50, 5, 1]  # Example with 2 hidden layers
learning_rate = 0.8
epochs = 1000
nn = NeuralNetwork(layer_dims, learning_rate)
loss_history = nn.train(X_train, y_train, epochs)

# Predict and evaluate
train_predictions = nn.predict(X_train)
test_predictions = nn.predict(X_test)
train_mse = nn.mean_squared_error(y_train, train_predictions)
test_mse = nn.mean_squared_error(y_test, test_predictions)
print(f'Training MSE: {train_mse:.4f}')
print(f'Test MSE: {test_mse:.4f}')

# Plot the loss curve
plt.plot(loss_history)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Curve')
plt.show()
```
```
Epoch 100/1000, Loss: 1.0038
Epoch 200/1000, Loss: 0.9943
Epoch 300/1000, Loss: 0.3497
Epoch 400/1000, Loss: 0.3306
Epoch 500/1000, Loss: 0.3326
Epoch 600/1000, Loss: 0.3206
Epoch 700/1000, Loss: 0.3125
Epoch 800/1000, Loss: 0.3057
Epoch 900/1000, Loss: 0.2999
Epoch 1000/1000, Loss: 0.2958
Training MSE: 0.2992
Test MSE: 0.3071
```
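Because the targets were standardized, the reported MSE is in standardized units. If the error in the original units of the target is wanted, one possible extension (my assumption, not part of the original script) is to undo the scaling with `scaler_y.inverse_transform`:

```python
# Assumes the variables from the regression script above are still in scope.
# Convert predictions and targets back to the original scale of the target variable.
y_test_orig = scaler_y.inverse_transform(y_test)
test_pred_orig = scaler_y.inverse_transform(test_predictions)

test_mse_orig = np.mean((y_test_orig - test_pred_orig) ** 2)
print(f'Test MSE (original units): {test_mse_orig:.4f}')
```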