
Forward Propagation in the Transformer Encoder Layer

This function implements the forward pass of a Transformer encoder layer: it defines how the input data propagates through the layer, covering the self-attention and feedforward-network steps as well as the application of layer normalization.

The forward function performs the following operations:

  1. The input src (the input sequence) is saved in the variable x so that it can be reused in the computations that follow.

  2. The order of operations depends on the norm_first parameter. If norm_first is True, layer normalization (self.norm1) is applied first and the result is passed through the self-attention block (self._sa_block), whose output is added back to x as a residual; the second layer normalization (self.norm2) and the feedforward block (self._ff_block) are then applied in the same residual fashion.

  3. If norm_first is False, the self-attention block (self._sa_block) runs first, its output is added to x, and the first layer normalization (self.norm1) is applied to the sum; the feedforward block (self._ff_block) then runs with another residual addition, followed by the second layer normalization (self.norm2). A minimal sketch contrasting the two orderings is shown right after this list.
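
The difference between the two branches is easiest to see in isolation. The sketch below is a minimal, simplified version of the two residual orderings; sa_block and ff_block are hypothetical stand-ins for self._sa_block and self._ff_block, and masks and dropout are omitted, so this is not the full implementation shown further down.

```python
import torch
import torch.nn as nn

d_model = 512
norm1 = nn.LayerNorm(d_model)
norm2 = nn.LayerNorm(d_model)

# Hypothetical stand-ins for the self-attention and feedforward blocks.
sa_block = nn.Identity()   # placeholder for self._sa_block
ff_block = nn.Identity()   # placeholder for self._ff_block

def encoder_layer_forward(src: torch.Tensor, norm_first: bool) -> torch.Tensor:
    x = src
    if norm_first:
        # Pre-norm: normalize first, run the block, then add the residual.
        x = x + sa_block(norm1(x))
        x = x + ff_block(norm2(x))
    else:
        # Post-norm: run the block, add the residual, then normalize.
        x = norm1(x + sa_block(x))
        x = norm2(x + ff_block(x))
    return x

out = encoder_layer_forward(torch.rand(10, 32, d_model), norm_first=True)
print(out.shape)  # torch.Size([10, 32, 512])
```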

```python
class TransformerEncoderLayer(Module):
    r"""TransformerEncoderLayer is made up of self-attn and feedforward network.
    This standard encoder layer is based on the paper "Attention Is All You Need".
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
    Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
    Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
    in a different way during application.

    Args:
        d_model: the number of expected features in the input (required).
        nhead: the number of heads in the multiheadattention models (required).
        dim_feedforward: the dimension of the feedforward network model (default=2048).
        dropout: the dropout value (default=0.1).
        activation: the activation function of the intermediate layer, can be a string
            ("relu" or "gelu") or a unary callable. Default: relu
        layer_norm_eps: the eps value in layer normalization components (default=1e-5).
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature). Default: ``False``.
        norm_first: if ``True``, layer norm is done prior to attention and feedforward
            operations, respectively. Otherwise it's done after. Default: ``False`` (after).

    Examples::
        >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
        >>> src = torch.rand(10, 32, 512)
        >>> out = encoder_layer(src)

    Alternatively, when ``batch_first`` is ``True``:
        >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
        >>> src = torch.rand(32, 10, 512)
        >>> out = encoder_layer(src)
    """
    __constants__ = ['batch_first', 'norm_first']

    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation=F.relu,
                 layer_norm_eps=1e-5, batch_first=False, norm_first=False,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(TransformerEncoderLayer, self).__init__()
        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first,
                                            **factory_kwargs)
        # Implementation of Feedforward model
        self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)
        self.dropout = Dropout(dropout)
        self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)

        self.norm_first = norm_first
        self.norm1 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
        self.norm2 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
        self.dropout1 = Dropout(dropout)
        self.dropout2 = Dropout(dropout)

        # Legacy string support for activation function.
        if isinstance(activation, str):
            self.activation = _get_activation_fn(activation)
        else:
            self.activation = activation

    def __setstate__(self, state):
        if 'activation' not in state:
            state['activation'] = F.relu
        super(TransformerEncoderLayer, self).__setstate__(state)

    def forward(self, src: Tensor, src_mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        r"""Pass the input through the encoder layer.

        Args:
            src: the sequence to the encoder layer (required).
            src_mask: the mask for the src sequence (optional).
            src_key_padding_mask: the mask for the src keys per batch (optional).

        Shape:
            see the docs in Transformer class.
        """
        # see Fig. 1 of https://arxiv.org/pdf/2002.04745v1.pdf
        x = src
        if self.norm_first:
            x = x + self._sa_block(self.norm1(x), src_mask, src_key_padding_mask)
            x = x + self._ff_block(self.norm2(x))
        else:
            x = self.norm1(x + self._sa_block(x, src_mask, src_key_padding_mask))
            x = self.norm2(x + self._ff_block(x))
        return x

    # self-attention block
    def _sa_block(self, x: Tensor,
                  attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
        x = self.self_attn(x, x, x,
                           attn_mask=attn_mask,
                           key_padding_mask=key_padding_mask,
                           need_weights=False)[0]
        return self.dropout1(x)

    # feed forward block
    def _ff_block(self, x: Tensor) -> Tensor:
        x = self.linear2(self.dropout(self.activation(self.linear1(x))))
        return self.dropout2(x)
```
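
For reference, here is a small usage example of the layer above, assuming the standard torch.nn.TransformerEncoderLayer API: src_mask is an (S, S) attention mask applied to every sequence, and src_key_padding_mask is a boolean (N, S) mask in which True marks padded positions that attention should ignore.

```python
import torch
import torch.nn as nn

# Pre-norm encoder layer with batch-first tensors (batch, seq, feature).
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, batch_first=True, norm_first=True,
)

batch, seq = 4, 10
src = torch.rand(batch, seq, 512)

# Boolean key-padding mask: True marks padded positions to be ignored.
src_key_padding_mask = torch.zeros(batch, seq, dtype=torch.bool)
src_key_padding_mask[:, 8:] = True  # last two positions of each sequence are padding

out = encoder_layer(src, src_key_padding_mask=src_key_padding_mask)
print(out.shape)  # torch.Size([4, 10, 512])
```

With norm_first=True this runs the pre-norm branch of forward shown above; dropping the flag (the default) exercises the post-norm branch instead.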
