When building deep neural networks, we noted that how the weights are initialized (itself a hyperparameter choice) is crucial. In this post we look at why it matters and how to choose appropriate values for initializing the weights.
1. The importance of weight initialization
A good set of initial weights has the following advantages:
- It speeds up the convergence of gradient descent
- It increases the chance that gradient descent reaches a low training error
2. Writing the code
To understand the points above, we compare different initialization schemes below.
2.1 Preparing the data
```python
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
from init_utils import sigmoid, relu, compute_loss, forward_propagation, backward_propagation
from init_utils import update_parameters, predict, load_dataset, plot_decision_boundary, predict_dec

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# load image dataset: blue/red dots in circles
train_X, train_Y, test_X, test_Y = load_dataset()
```
Running this loads and plots the dataset of blue/red dots arranged in circles:
2.2 Writing the initialization functions
Initialize all weights to zero:
```python
# GRADED FUNCTION: initialize_parameters_zeros

def initialize_parameters_zeros(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    parameters = {}
    L = len(layers_dims)  # number of layers in the network

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###
    return parameters
```
Initialize the weights to relatively large random values:
```python
# GRADED FUNCTION: initialize_parameters_random

def initialize_parameters_random(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    np.random.seed(3)  # This seed makes sure your "random" numbers will be the same as ours
    parameters = {}
    L = len(layers_dims)  # integer representing the number of layers

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###
    return parameters
```
Initialize the weights to relatively small values (He initialization):
```python
# GRADED FUNCTION: initialize_parameters_he

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1  # integer representing the number of layers

    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###
    return parameters
```
2.3 Writing the deep neural network model
```python
def model(X, Y, learning_rate=0.01, num_iterations=15000, print_cost=True, initialization="he"):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (containing 0 for red dots; 1 for blue dots), of shape (1, number of examples)
    learning_rate -- learning rate for gradient descent
    num_iterations -- number of iterations to run gradient descent
    print_cost -- if True, print the cost every 1000 iterations
    initialization -- flag to choose which initialization to use ("zeros", "random" or "he")

    Returns:
    parameters -- parameters learnt by the model
    """
    grads = {}
    costs = []          # to keep track of the loss
    m = X.shape[1]      # number of examples
    layers_dims = [X.shape[0], 10, 5, 1]

    # Initialize parameters dictionary.
    if initialization == "zeros":
        parameters = initialize_parameters_zeros(layers_dims)
    elif initialization == "random":
        parameters = initialize_parameters_random(layers_dims)
    elif initialization == "he":
        parameters = initialize_parameters_he(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):
        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        a3, cache = forward_propagation(X, parameters)

        # Loss
        cost = compute_loss(a3, Y)

        # Backward propagation.
        grads = backward_propagation(X, Y, cache)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
            costs.append(cost)

    # plot the loss
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters
```
3. Experimental comparison
3.1 Scheme 1: zero initialization
```python
parameters = model(train_X, train_Y, initialization="zeros")
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
```
Execution result:
If all the weights are initialized to zero, the cost function does not decrease, and both training and test performance are poor. This is because a network whose weights are all zero is symmetric: every unit in a given layer computes the same output and receives the same gradient, so they all learn identical weights. The network therefore behaves like a linear model and is no more effective than a single linear classifier.
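To make the symmetry argument concrete, here is a minimal sketch (not part of the assignment code; the toy layer sizes and array names are made up) showing that with all-zero weights every hidden unit receives exactly the same gradient, so the units can never become different from one another:

```python
import numpy as np

# Toy two-layer ReLU/sigmoid network with all-zero weights.
np.random.seed(0)
X = np.random.randn(2, 5)               # 2 features, 5 made-up examples
Y = (np.random.rand(1, 5) > 0.5) * 1.0  # made-up binary labels

W1 = np.zeros((3, 2)); b1 = np.zeros((3, 1))  # hidden layer of 3 units
W2 = np.zeros((1, 3)); b2 = np.zeros((1, 1))  # output unit

A1 = np.maximum(0, W1 @ X + b1)               # ReLU hidden activations
A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))        # sigmoid output

# Backward pass for the cross-entropy loss.
m = X.shape[1]
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
dA1 = W2.T @ dZ2
dZ1 = dA1 * (A1 > 0)
dW1 = dZ1 @ X.T / m

# Every row of dW1 is identical (here all zeros), and dW2 is zero as well,
# so gradient descent leaves all hidden units interchangeable forever.
print(dW1)
print(dW2)
```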
3.2 Scheme 2: large random initialization
```python
parameters = model(train_X, train_Y, initialization="random")
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
```
Execution result:
At the start of training the cost is extremely large. This is because the randomly generated weights are large, so the sigmoid activation pushes y_hat for some examples very close to 0 or 1; when such a saturated example is classified wrongly, the resulting log(0) term makes the cost blow up. In addition, initializing the weights to large values slows down optimization. Weights that are too large or too small can lead to exploding or vanishing gradients.
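The following minimal sketch (not from the assignment; the pre-activation values and labels are made up) shows how saturated sigmoid outputs produce a log(0) term and hence an infinite cost:

```python
import numpy as np

# In float64, sigmoid(z) rounds to exactly 1.0 once z exceeds roughly 37,
# so a confidently wrong prediction makes the cross-entropy cost infinite.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = np.array([[1.0, 0.0]])              # made-up labels

z_moderate = np.array([[2.0, -2.0]])    # small weights -> moderate pre-activations
z_saturated = np.array([[-50.0, 50.0]]) # large weights -> huge pre-activations

for name, z in [("moderate", z_moderate), ("saturated", z_saturated)]:
    a = sigmoid(z)
    cost = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
    print(name, "a =", a, "cost =", cost)

# moderate: a ≈ [[0.88, 0.12]], both predictions correct, cost ≈ 0.13
# saturated: both predictions are confidently wrong; the y=0 example has
# a = 1.0 exactly, so log(1 - a) = log(0) and the cost prints as inf
# (NumPy emits a divide-by-zero warning).
```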
3.3 Scheme 3: He initialization
```python
parameters = model(train_X, train_Y, initialization="he")
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
```
Execution result:
From the results it is clear that He initialization trains this network very well.
4. Summary
- Different initializations lead to different training results
- Random initialization is used to break symmetry, so that each unit can learn something different.
- Do not set the initial values too large
- He initialization works best for networks with ReLU activations (see the sketch below)
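As a rough illustration of the last point, the sketch below (not part of the assignment; the layer width, depth, and scale factors are made up) compares how the scale of ReLU activations evolves across layers under He initialization versus large random initialization:

```python
import numpy as np

# He initialization, W ~ N(0, 2/n_in), roughly preserves the scale of ReLU
# activations from layer to layer, while large random weights make it explode.
np.random.seed(0)
n = 500                       # units per layer (made-up width)
x = np.random.randn(n, 1000)  # 1000 made-up input examples

def forward(a, scale_fn, layers=5):
    for _ in range(layers):
        W = np.random.randn(n, n) * scale_fn(n)
        a = np.maximum(0, W @ a)          # ReLU layer
    return a

a_he = forward(x, lambda n_in: np.sqrt(2.0 / n_in))  # He scaling
a_big = forward(x, lambda n_in: 10.0)                 # fixed large scaling

print("std after 5 ReLU layers, He init :", a_he.std())   # stays on the order of 1
print("std after 5 ReLU layers, scale 10:", a_big.std())  # many orders of magnitude larger
```

This is why the factor np.sqrt(2. / layers_dims[l-1]) appears in initialize_parameters_he above: it keeps the forward (and backward) signals at a workable scale in deep ReLU networks.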