2020/01/15: added a bonus section on implementing BN1D and hswish
2020/05/18: added a bonus section on BN and track_running_stats
This post builds the network with the TensorRT Python API; the C++ API works in essentially the same way. The BN layer used here is the PyTorch one; Caffe and TF should not differ much. The TRT version used is 6.0.1.5, and lower or higher versions should also work. See the links in the references for the TRT documentation.
Take a look at the definition of the BN layer provided by PyTorch, torch.nn.BatchNorm2d. The formula is given in the source comments, or you can read it straight from the docs:
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta
Simply put, $\mathrm{E}[x]$ is the batch mean, $\mathrm{Var}[x]$ is the batch variance, $\epsilon$ guards against division by zero, $\gamma$ is the learned per-channel weight, and $\beta$ is the bias. Correspondingly, in PyTorch any BN layer named in has the following entries:
import torch

weights = torch.load(your_model_dict_state_path)
in_gamma = weights['in.weight'].numpy()       # BN gamma
in_beta = weights['in.bias'].numpy()          # BN beta
in_mean = weights['in.running_mean'].numpy()  # BN running mean
in_var = weights['in.running_var'].numpy()    # BN running variance
The weights above is obtained from torch.load(), and in is the BN layer you defined in your own model.
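As a quick sanity check (my own sketch, not part of the original post; the layer size and input shape are arbitrary), the formula can be reproduced from these four tensors and compared against nn.BatchNorm2d in eval mode:

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(4)
# give the layer non-trivial parameters and running statistics
bn.train()
with torch.no_grad():
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-0.5, 0.5)
    _ = bn(torch.randn(16, 4, 8, 8))   # one forward pass updates running_mean / running_var
bn.eval()                              # eval(): normalize with the running statistics

x = torch.randn(2, 4, 8, 8)
gamma, beta = bn.weight, bn.bias
mean, var, eps = bn.running_mean, bn.running_var, bn.eps

# y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta, broadcast over the channel dimension
y_manual = ((x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + eps)
            * gamma[None, :, None, None] + beta[None, :, None, None])

print(torch.allclose(bn(x), y_manual, atol=1e-5))   # expected: True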
Now that we know the BN formula, we can simply implement it as written. Since the input x here is the output of a convolution, it is usually a 4-D tensor, and the multiplications in the BN layer are applied per channel, so we use the IScaleLayer provided by the TRT API. The official docs also mention building BN from IElementWiseLayer, but that is overly complicated and not recommended.
See the linked documentation for IScaleLayer. It computes

output = (input \times scale + shift)^{power}

and supports three modes; the one we need is trt.ScaleMode.CHANNEL. The code is as follows:
import tensorrt as trt
import torch
import numpy as np

weights = torch.load(your_model_dict_state_path)
in_gamma = weights['in.weight'].numpy()       # BN gamma
in_beta = weights['in.bias'].numpy()          # BN beta
in_mean = weights['in.running_mean'].numpy()  # BN running mean
in_var = weights['in.running_var'].numpy()    # BN running variance

eps = 1e-05
in_var = np.sqrt(in_var + eps)

in_scale = in_gamma / in_var
in_shift = -in_mean / in_var * in_gamma + in_beta
# 'in' is a Python keyword, so the scale layer is named bn here
bn = network.add_scale(input=last_layer.get_output(0), mode=trt.ScaleMode.CHANNEL, shift=in_shift, scale=in_scale)
Here power is left unset and defaults to 1.
Going further, the convolution and BN layers can actually be fused for inference. Briefly, the convolution computes:
z = w * x + b
Substituting this $z$ for the $x$ in the BN formula gives:
y = (\frac{w}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma) * x + (\frac{b - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta)
These are matrix operations too, of course. $\frac{w}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma$ is the new $w$, and $\frac{b - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta$ is the new $b$.
The code is as follows:
import tensorrt as trt
import torch
import numpy as np

weights = torch.load(your_model_dict_state_path)
conv_w = weights['conv.weight'].numpy()       # conv weight
conv_b = weights['conv.bias'].numpy()         # conv bias
in_gamma = weights['in.weight'].numpy()       # BN gamma
in_beta = weights['in.bias'].numpy()          # BN beta
in_mean = weights['in.running_mean'].numpy()  # BN running mean
in_var = weights['in.running_var'].numpy()    # BN running variance

eps = 1e-05
in_var = np.sqrt(in_var + eps)

fused_conv_w = conv_w * (in_gamma / in_var).reshape([conv_w.shape[0], 1, 1, 1])
fused_conv_b = (conv_b - in_mean) / in_var * in_gamma + in_beta

fused_conv = network.add_convolution(input=last_layer.get_output(0), num_output_maps=your_conv_out, kernel_shape=(your_conv_kernel, your_conv_kernel), kernel=fused_conv_w, bias=fused_conv_b)
fused_conv.padding = (your_conv_pad, your_conv_pad)
fused_conv.stride = (your_conv_stride, your_conv_stride)
Here conv is the convolution layer to be fused and fused_conv is the convolution layer after fusion with in; fused_conv must be given the same parameters as conv (padding, stride, kernel_shape, num_output_maps).
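If you want to convince yourself the fusion math is right before building the TRT layer, a hedged PyTorch sketch like the following (the layer sizes are arbitrary, not from the original model) compares conv followed by BN against the fused convolution:

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
bn = nn.BatchNorm2d(8)

# populate non-trivial BN statistics, then switch to inference mode
bn.train()
with torch.no_grad():
    _ = bn(conv(torch.randn(16, 3, 32, 32)))
conv.eval(); bn.eval()

# build the fused convolution from the formulas above
fused = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
with torch.no_grad():
    std = torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1))
    fused.bias.copy_((conv.bias - bn.running_mean) / std * bn.weight + bn.bias)

x = torch.randn(2, 3, 32, 32)
print(torch.allclose(bn(conv(x)), fused(x), atol=1e-4))   # expected: True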
[1] TensorRT API docs: https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/index.html
[2] TensorRT developer guide: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html
[3] PyTorch docs: https://pytorch.org/docs/stable/
[4] Batch Normalization in PyTorch: https://www.cnblogs.com/yongjieShi/p/9332655.html
Similarly, following the BatchNorm2d approach above, we add a tensorrt.IShuffleLayer to reshape the 1-D tensor to 2-D, apply BN in 2-D, and finally reshape back to 1-D. You have to specify the input tensor size here, because TRT needs it for the shuffle. The implementation looks roughly like this:
import tensorrt as trt
import torch
import numpy as np

weights = torch.load(your_model_dict_state_path)
in_gamma = weights['in.weight'].numpy()       # BN gamma
in_beta = weights['in.bias'].numpy()          # BN beta
in_mean = weights['in.running_mean'].numpy()  # BN running mean
in_var = weights['in.running_var'].numpy()    # BN running variance

eps = 1e-05
in_var = np.sqrt(in_var + eps)

in_scale = in_gamma / in_var
in_shift = -in_mean / in_var * in_gamma + in_beta

# reshape to 2D (append a dummy trailing dimension)
shuffle1 = network.add_shuffle(last_layer.get_output(0))
shuffle1.reshape_dims = (your_input_shape, your_input_shape, 1)

# do bn1d
bn = network.add_scale(input=shuffle1.get_output(0), mode=trt.ScaleMode.CHANNEL, shift=in_shift, scale=in_scale)

# reshape back to 1D (drop the dummy dimension, restoring the original shape)
shuffle2 = network.add_shuffle(bn.get_output(0))
shuffle2.reshape_dims = (your_input_shape, your_input_shape)
See PyTorch's implementation of hswish:
import torch.nn as nn
import torch.nn.functional as F

class hswish(nn.Module):
    def forward(self, x):
        out = x * F.relu6(x + 3, inplace=True) / 6
        return out
So how is relu6 implemented? See the relu6 formula:
\text{ReLU6}(x) = \min(\max(0, x), 6)
From this we get the following TRT implementation:
import tensorrt as trt
import numpy as np

# x + 3
shape = (1, ) * len(your_input_shape)
trt_3 = network.add_constant(shape, np.full(shape, 3.0, dtype=np.float32))
tmp = network.add_elementwise(last_layer.get_output(0), trt_3.get_output(0), trt.ElementWiseOperation.SUM)

# relu6(x + 3) = min(relu(x + 3), 6)
relu = network.add_activation(input=tmp.get_output(0), type=trt.ActivationType.RELU)
trt_6 = network.add_constant(shape, np.full(shape, 6.0, dtype=np.float32))
relu_6 = network.add_elementwise(relu.get_output(0), trt_6.get_output(0), trt.ElementWiseOperation.MIN)

# x * relu6(x + 3)
tmp = network.add_elementwise(last_layer.get_output(0), relu_6.get_output(0), trt.ElementWiseOperation.PROD)

# x * relu6(x + 3) / 6
out = network.add_elementwise(tmp.get_output(0), trt_6.get_output(0), trt.ElementWiseOperation.DIV)
To be honest I had never thought about track_running_stats: during training I call net.train(), during testing net.eval(), plus at most a with torch.no_grad():. Thanks to @fyp_1995 and @下大禹了 for raising the question and making me notice that PyTorch has this training-time parameter.
So what is track_running_stats? It is discussed on Zhihu; I'll quote part of the answer here:
Author: 李韶华
Link: https://www.zhihu.com/question/282672547/answer/529154567
Source: Zhihu
Copyright belongs to the author. For commercial reuse please contact the author for authorization; for non-commercial reuse please credit the source.
PyTorch's batchnorm needs to be used with care: training and track_running_stats combine into three behaviors, and it is easy to fall into a trap (I just realized my own understanding of track_running_stats was wrong).
- training=True, track_running_stats=True: the behavior you normally want during training; running_mean and running_var track the statistics across batches, but normalization still uses each batch's own mean and variance.
- training=True, track_running_stats=False: running_mean and running_var no longer track cross-batch statistics, but normalization still uses each batch's own mean and variance.
- training=False, track_running_stats=True: the behavior you want at test time, i.e. use the running_mean and running_var estimated during training.
- training=False, track_running_stats=False: same as case 2 (!!!).
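To make the combinations above concrete, here is a small PyTorch sketch (my own, not part of the quoted answer):

import torch
import torch.nn as nn

x = torch.randn(8, 3, 4, 4)

# training=True, track_running_stats=True: batch stats are used for the output,
# and running_mean / running_var are updated as a side effect
bn = nn.BatchNorm2d(3)
bn.train()
_ = bn(x)
print(bn.running_mean)        # no longer all zeros

# training=True, track_running_stats=False: batch stats are still used,
# but there are no running statistics at all
bn_nts = nn.BatchNorm2d(3, track_running_stats=False)
bn_nts.train()
_ = bn_nts(x)
print(bn_nts.running_mean)    # None

# training=False, track_running_stats=True: the usual eval() behavior,
# normalization uses the stored running_mean / running_var
bn.eval()
_ = bn(x)
print(bn.running_mean)        # unchanged by the eval() forward pass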
So, for @fyp_1995's question, I think there may be a mix-up: during training, when track_running_stats=True the running var and mean are updated from the input, and when track_running_stats=False there simply is no running var and mean. This has little to do with how weight gets updated; note that this weight is the BN layer's $\gamma$, which is controlled by a different parameter, affine=True.
As for @下大禹了's question: when testing the model with model.eval(), you are in case 3 above, i.e. you should have track_running_stats=True. num_batches_tracked records how many times this BN layer was updated during training.
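A tiny sketch (mine, for illustration) of how num_batches_tracked behaves:

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
bn.train()
for _ in range(5):
    _ = bn(torch.randn(8, 3, 4, 4))
print(bn.num_batches_tracked)   # tensor(5): incremented once per training forward pass

bn.eval()
_ = bn(torch.randn(8, 3, 4, 4))
print(bn.num_batches_tracked)   # still tensor(5): eval() does not update it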
Think about what BN is actually for. See the original Batch Normalization paper: during training the input comes in batches, each with a different distribution, so BN pulls the batches toward a common distribution, which also helps against vanishing gradients. To get this behavior during training we of course use track_running_stats=True, which keeps the running var and mean for the input data. And at test time, shouldn't we follow the same strategy as training? So track_running_stats=True there as well; the difference is that at test time the running var and mean are not modified.
Looking at the source code, affine=True and track_running_stats=True are already the defaults, which is why I never ran into problems so far.
Next, when affine=False, the corresponding source code is:
# affine - a boolean value that when set to true, gives the layer learnable affine parameters. Default: True
if self.affine:
    self.weight = Parameter(torch.Tensor(num_features))
    self.bias = Parameter(torch.Tensor(num_features))
else:
    self.register_parameter('weight', None)
    self.register_parameter('bias', None)
So $\gamma$ and $\beta$ are gone, which is equivalent to $\gamma = 1$ and $\beta = 0$; no matter how you backpropagate, the BN layer's weight and bias will never be updated.
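A quick way to see this (my own sketch):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3, affine=False)
print(bn.weight, bn.bias)            # None None: no learnable gamma / beta

x = torch.randn(8, 3, 4, 4)
bn.train()
y = bn(x)
# per channel, y is just (x - batch_mean) / sqrt(batch_var + eps),
# i.e. gamma is effectively 1 and beta is effectively 0
print(y.mean(dim=(0, 2, 3)))         # roughly zero for every channel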
When track_running_stats=False, the corresponding source code is:
# a boolean value that when set to ``True``, this module tracks the running mean and variance, and when set to ``False``, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default: ``True``
if self.track_running_stats:
    self.register_buffer('running_mean', torch.zeros(num_features))
    self.register_buffer('running_var', torch.ones(num_features))
    self.register_buffer('num_batches_tracked', torch.tensor(0, dtype=torch.long))
else:
    self.register_parameter('running_mean', None)
    self.register_parameter('running_var', None)
    self.register_parameter('num_batches_tracked', None)
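And a matching sketch (again mine) for track_running_stats=False:

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3, track_running_stats=False)
print(bn.running_mean, bn.running_var)   # None None: nothing is tracked

x = torch.randn(8, 3, 4, 4) * 5 + 2
bn.eval()
y = bn(x)                                # even in eval(), the stats of x itself are used
print(y.mean(dim=(0, 2, 3)))             # roughly zero for every channel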
The BN layer then keeps no running var or mean at all, so what good is such a BN layer?
According to @fyp_1995, the goal is to recompute var and mean from the input at every inference call. That idea corresponds to Instance Normalization, which belongs to the style-transfer world; I won't go into it, since I'm not familiar with it either.
The way to implement it is simply to compute the statistics from the input inside the network. What is implemented here is torch.nn.InstanceNorm1d(track_running_stats=False). I sketched the code below; it is untested, so no guarantees:
import tensorrt as trt
import torch
import numpy as np

weights = torch.load(your_model_dict_state_path)
in_gamma = weights['in.weight'].numpy()  # IN gamma
in_beta = weights['in.bias'].numpy()     # IN beta

# mean computed from the input itself
in_mean = network.add_reduce(input=your_conv_out, op=trt.ReduceOperation.AVG, axes=your_conv_out_dims, keep_dims=True).get_output(0)

# standard deviation computed from the input itself: sqrt(mean((x - mean)^2) + eps)
in_delta = network.add_elementwise(input1=your_conv_out, input2=in_mean, op=trt.ElementWiseOperation.SUB).get_output(0)
eps = np.array([1e-05], dtype=np.float32)
zero = np.zeros_like(eps)
one = np.ones_like(eps)
in_var = network.add_scale(in_delta, trt.ScaleMode.UNIFORM, zero, one, 2 * one).get_output(0)   # (x - mean)^2
in_var = network.add_reduce(in_var, trt.ReduceOperation.AVG, axes=your_conv_out_dims, keep_dims=True).get_output(0)
in_var = network.add_scale(in_var, trt.ScaleMode.UNIFORM, eps, one, 0.5 * one).get_output(0)    # sqrt(var + eps)

# do in1d: (x - mean) / sqrt(var + eps)
in_out = network.add_elementwise(input1=in_delta, input2=in_var, op=trt.ElementWiseOperation.DIV).get_output(0)
in_shape = in_out.shape                     # shape before adding the dummy dimension

# reshape to 2D so the channel-wise affine scale can be applied
shuffle1 = network.add_shuffle(in_out)
shuffle1.reshape_dims = (in_shape[0], in_shape[1], 1)

# affine part: gamma * x_hat + beta
in_out = network.add_scale(input=shuffle1.get_output(0), mode=trt.ScaleMode.CHANNEL, shift=in_beta, scale=in_gamma, power=np.ones_like(in_beta))

# reshape back to 1D (restore the original shape)
shuffle2 = network.add_shuffle(in_out.get_output(0))
shuffle2.reshape_dims = (in_shape[0], in_shape[1])
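For reference, here is a minimal PyTorch version of what this TRT subgraph is meant to reproduce, assuming the input is shaped (N, C, L) (my own sketch, also untested against the TRT code above):

import torch
import torch.nn as nn

inorm = nn.InstanceNorm1d(8, affine=True, track_running_stats=False)
with torch.no_grad():
    inorm.weight.uniform_(0.5, 1.5)
    inorm.bias.uniform_(-0.5, 0.5)

x = torch.randn(2, 8, 32)                       # (N, C, L)

# per-sample, per-channel statistics computed from the input itself
eps = 1e-05
mean = x.mean(dim=2, keepdim=True)
var = x.var(dim=2, unbiased=False, keepdim=True)
y_manual = ((x - mean) / torch.sqrt(var + eps)
            * inorm.weight[None, :, None] + inorm.bias[None, :, None])

print(torch.allclose(inorm(x), y_manual, atol=1e-5))   # expected: True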
That's all. Discussion is welcome; let's learn from each other~