The figure below shows the overall structure of the Transformer: the left half is a Transformer Encoder Block and the right half is a Transformer Decoder Block.
In practice, both blocks are repeated several times: the output vector of one block becomes the input vector of the next.
As introduced in Part (1), the Transformer is composed of a TransformerEncoder and a TransformerDecoder, and these are in turn built from multiple TransformerEncoderLayers and TransformerDecoderLayers respectively (i.e., from multiple stacked blocks).
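To make this composition concrete, here is a small sketch (assuming a standard PyTorch install with torch.nn) that builds an nn.Transformer and inspects how the encoder/decoder stacks nest their layers; the layer counts are chosen only for illustration.
- import torch.nn as nn
-
- model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=2, num_decoder_layers=2)
- print(type(model.encoder).__name__, len(model.encoder.layers))  # TransformerEncoder 2
- print(type(model.decoder).__name__, len(model.decoder.layers))  # TransformerDecoder 2
- print(type(model.encoder.layers[0]).__name__)                   # TransformerEncoderLayer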
The code in the following sections is easier to follow if you read it against the internal structure shown in the figure above.
1.1) TransformerEncoderLayer code:
- # Transformer Encoder Layer
- class TransformerEncoderLayer(Module):
- r"""TransformerEncoderLayer is made up of self-attn and feedforward network.
- This standard encoder layer is based on the paper "Attention Is All You Need".
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
- Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
- Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
- in a different way during application.
- Args:
- d_model: the number of expected features in the input (required).
- nhead: the number of heads in the multiheadattention models (required).
- dim_feedforward: the dimension of the feedforward network model (default=2048).
- dropout: the dropout value (default=0.1).
- activation: the activation function of intermediate layer, relu or gelu (default=relu).
- Examples::
- >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
- >>> src = torch.rand(10, 32, 512)
- >>> out = encoder_layer(src)
- """
-
- def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu"):
- super(TransformerEncoderLayer, self).__init__()
- self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
- # Implementation of Feedforward model
- self.linear1 = Linear(d_model, dim_feedforward)
- self.dropout = Dropout(dropout)
- self.linear2 = Linear(dim_feedforward, d_model)
-
- self.norm1 = LayerNorm(d_model)
- self.norm2 = LayerNorm(d_model)
- self.dropout1 = Dropout(dropout)
- self.dropout2 = Dropout(dropout)
-
- self.activation = _get_activation_fn(activation)
-
- def __setstate__(self, state):
- if 'activation' not in state:
- state['activation'] = F.relu
- super(TransformerEncoderLayer, self).__setstate__(state)
-
- def forward(self, src: Tensor, src_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:
- r"""Pass the input through the encoder layer.
- Args:
- src: the sequence to the encoder layer (required).
- src_mask: the mask for the src sequence (optional).
- src_key_padding_mask: the mask for the src keys per batch (optional).
- Shape:
- see the docs in Transformer class.
- """
- # Compare with the Transformer encoder block in the diagram
- # Norm(src+Dropout(self_attention(src)))
- src2 = self.self_attn(src, src, src, attn_mask=src_mask,
- key_padding_mask=src_key_padding_mask)[0]
- src = src + self.dropout1(src2)
- src = self.norm1(src)
-
- # Norm(src+Dropout(Feedforward()))
- src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
- src = src + self.dropout2(src2)
- src = self.norm2(src)
- return src
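As a minimal sketch (assuming the post-norm implementation shown above and a standard PyTorch install), the forward pass can be reproduced with the layer's own sub-modules, which makes the two residual "Add & Norm" sub-blocks explicit; newer PyTorch releases may take a fused fast path in eval mode, so the comparison uses a small tolerance.
- import torch
- import torch.nn as nn
-
- layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
- layer.eval()                       # turn dropout off so both paths match
- src = torch.rand(10, 32, 512)      # (S, N, E)
-
- with torch.no_grad():
-     # sub-block 1: self-attention + residual + LayerNorm
-     x = layer.norm1(src + layer.self_attn(src, src, src)[0])
-     # sub-block 2: feed-forward + residual + LayerNorm
-     x = layer.norm2(x + layer.linear2(layer.activation(layer.linear1(x))))
-     print(torch.allclose(x, layer(src), atol=1e-5))   # expected: True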
1.2) TransformerEncoder code (runs the TransformerEncoderLayer repeatedly):
- # A stack of N encoder layers
- class TransformerEncoder(Module):
- r"""TransformerEncoder is a stack of N encoder layers
- Args:
- encoder_layer: an instance of the TransformerEncoderLayer() class (required).
- num_layers: the number of sub-encoder-layers in the encoder (required).
- norm: the layer normalization component (optional).
- Examples::
- >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
- >>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
- >>> src = torch.rand(10, 32, 512)
- >>> out = transformer_encoder(src)
- """
- __constants__ = ['norm']
-
- def __init__(self, encoder_layer, num_layers, norm=None):
- super(TransformerEncoder, self).__init__()
- self.layers = _get_clones(encoder_layer, num_layers)
- self.num_layers = num_layers
- self.norm = norm
-
- def forward(self, src: Tensor, mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:
- r"""Pass the input through the encoder layers in turn.
- Args:
- src: the sequence to the encoder (required).
- mask: the mask for the src sequence (optional).
- src_key_padding_mask: the mask for the src keys per batch (optional).
- Shape:
- see the docs in Transformer class.
- """
- output = src
-
- for mod in self.layers:
- output = mod(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)
-
- if self.norm is not None:
- output = self.norm(output)
-
- return output
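A quick sketch (assuming a standard PyTorch install) confirming what _get_clones does: the stack holds num_layers independent deep copies of the layer, each with its own parameters, rather than num_layers references to one shared module.
- import torch.nn as nn
-
- encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
- encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
-
- print(len(encoder.layers))                      # 6
- print(encoder.layers[0] is encoder.layers[1])   # False — deepcopy, not a shared reference
- same_storage = (encoder.layers[0].linear1.weight.data_ptr()
-                 == encoder.layers[1].linear1.weight.data_ptr())
- print(same_storage)                             # False — separate parameter tensors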
2.1) TransformerDecoderLayer code:
- # Decoder layer: used for reconstructing the target (language) sequence
- class TransformerDecoderLayer(Module):
- r"""TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.
- This standard decoder layer is based on the paper "Attention Is All You Need".
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
- Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
- Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
- in a different way during application.
- Args:
- d_model: the number of expected features in the input (required).
- nhead: the number of heads in the multiheadattention models (required).
- dim_feedforward: the dimension of the feedforward network model (default=2048).
- dropout: the dropout value (default=0.1).
- activation: the activation function of intermediate layer, relu or gelu (default=relu).
- Examples::
- >>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
- >>> memory = torch.rand(10, 32, 512)
- >>> tgt = torch.rand(20, 32, 512)
- >>> out = decoder_layer(tgt, memory)
- """
- # d_model = 768, nhead = 8
- def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu"):
- super(TransformerDecoderLayer, self).__init__()
- self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
- self.multihead_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
- # Implementation of Feedforward model
- self.linear1 = Linear(d_model, dim_feedforward)
- self.dropout = Dropout(dropout)
- self.linear2 = Linear(dim_feedforward, d_model)
-
- self.norm1 = LayerNorm(d_model)
- self.norm2 = LayerNorm(d_model)
- self.norm3 = LayerNorm(d_model)
- self.dropout1 = Dropout(dropout)
- self.dropout2 = Dropout(dropout)
- self.dropout3 = Dropout(dropout)
-
- self.activation = _get_activation_fn(activation)
-
- def __setstate__(self, state):
- if 'activation' not in state:
- state['activation'] = F.relu
- super(TransformerDecoderLayer, self).__setstate__(state)
-
- # tgt: the sequence to the decoder layer (required). (20,1,768)
- # memory: the sequence from the last layer of the encoder (required). (3600,1,768)
- def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: Optional[Tensor] = None, memory_mask: Optional[Tensor] = None,
- tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
- r"""Pass the inputs (and mask) through the decoder layer.
- Args:
- tgt: the sequence to the decoder layer (required).
- memory: the sequence from the last layer of the encoder (required).
- tgt_mask: the mask for the tgt sequence (optional).
- memory_mask: the mask for the memory sequence (optional).
- tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
- memory_key_padding_mask: the mask for the memory keys per batch (optional).
- Shape:
- see the docs in Transformer class.
- """
-
- # Compare with the structure of the Transformer decoder block
- # tgt = Norm(Dropout(attention(tgt,tgt,tgt))+tgt)
- tgt2 = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask, # Multihead self-attention tgt2 (20,1,768)
- key_padding_mask=tgt_key_padding_mask)[0]
- tgt = tgt + self.dropout1(tgt2) # tgt = tgt + dropout1(0.1,tgt2) (20,1,768)
- tgt = self.norm1(tgt) # LayerNorm (20,1,768)
-
- # tgt = Norm(Dropout(attention(tgt,memory,memory))+tgt)
- tgt2 = self.multihead_attn(tgt, memory, memory, attn_mask=memory_mask, # Multi-head cross-attention (q=tgt, k/v=memory) tgt2 (20,1,768)
- key_padding_mask=memory_key_padding_mask)[0]
- tgt = tgt + self.dropout2(tgt2) # (20,1,768)
- tgt = self.norm2(tgt) # (20,1,768)
- tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt)))) # tgt2 (20,1,768)
- tgt = tgt + self.dropout3(tgt2)
- tgt = self.norm3(tgt)
- return tgt # (20,1,768)
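A minimal usage sketch (assuming a standard PyTorch install); the shapes mirror the docstring example rather than the (20,1,768)/(3600,1,768) annotations above. tgt is the decoder-side input, memory is the encoder output, and a causal tgt_mask keeps each target position from attending to later positions.
- import torch
- import torch.nn as nn
-
- decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
- memory = torch.rand(10, 32, 512)   # (S, N, E) — encoder output
- tgt = torch.rand(20, 32, 512)      # (T, N, E) — decoder input
-
- # additive causal mask: 0 where attention is allowed, -inf above the diagonal
- tgt_mask = torch.triu(torch.full((20, 20), float('-inf')), diagonal=1)
-
- out = decoder_layer(tgt, memory, tgt_mask=tgt_mask)
- print(out.shape)                   # torch.Size([20, 32, 512]) — same shape as tgt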
2.2) TransformerDecoder code (runs the TransformerDecoderLayer repeatedly):
- # A stack of N decoder layers
- class TransformerDecoder(Module):
- r"""TransformerDecoder is a stack of N decoder layers
- Args:
- decoder_layer: an instance of the TransformerDecoderLayer() class (required).
- num_layers: the number of sub-decoder-layers in the decoder (required).
- norm: the layer normalization component (optional).
- Examples::
- >>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
- >>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
- >>> memory = torch.rand(10, 32, 512)
- >>> tgt = torch.rand(20, 32, 512)
- >>> out = transformer_decoder(tgt, memory)
- """
- __constants__ = ['norm']
-
- def __init__(self, decoder_layer, num_layers, norm=None):
- super(TransformerDecoder, self).__init__()
- self.layers = _get_clones(decoder_layer, num_layers)
- self.num_layers = num_layers
- self.norm = norm
-
- def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: Optional[Tensor] = None,
- memory_mask: Optional[Tensor] = None, tgt_key_padding_mask: Optional[Tensor] = None,
- memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
- r"""Pass the inputs (and mask) through the decoder layer in turn.
- Args:
- tgt: the sequence to the decoder (required).
- memory: the sequence from the last layer of the encoder (required).
- tgt_mask: the mask for the tgt sequence (optional).
- memory_mask: the mask for the memory sequence (optional).
- tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
- memory_key_padding_mask: the mask for the memory keys per batch (optional).
- Shape:
- see the docs in Transformer class.
- """
- output = tgt
-
- for mod in self.layers:
- output = mod(output, memory, tgt_mask=tgt_mask,
- memory_mask=memory_mask,
- tgt_key_padding_mask=tgt_key_padding_mask,
- memory_key_padding_mask=memory_key_padding_mask)
-
- if self.norm is not None:
- output = self.norm(output)
-
- return output
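The stacked version, again as a sketch assuming a standard PyTorch install: forward threads output through the cloned layers one by one, while every layer receives the same memory tensor from the encoder.
- import torch
- import torch.nn as nn
-
- decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
- decoder = nn.TransformerDecoder(decoder_layer, num_layers=6, norm=nn.LayerNorm(512))
- memory = torch.rand(10, 32, 512)   # the same encoder output is fed to all 6 layers
- tgt = torch.rand(20, 32, 512)
-
- out = decoder(tgt, memory)
- print(out.shape)                   # torch.Size([20, 32, 512])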
2.3) Transformer code:
a. First initialize the TransformerEncoder and the TransformerDecoder.
b. Then call each of them in turn inside forward().
- # High Architecture of Transformer encoder and Transformer decoder
- class Transformer(Module):
- r"""A transformer model. User is able to modify the attributes as needed. The architecture
- is based on the paper "Attention Is All You Need". Ashish Vaswani, Noam Shazeer,
- Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and
- Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information
- Processing Systems, pages 6000-6010. Users can build the BERT(https://arxiv.org/abs/1810.04805)
- model with corresponding parameters.
- Args:
- d_model: the number of expected features in the encoder/decoder inputs (default=512).
- nhead: the number of heads in the multiheadattention models (default=8).
- num_encoder_layers: the number of sub-encoder-layers in the encoder (default=6).
- num_decoder_layers: the number of sub-decoder-layers in the decoder (default=6).
- dim_feedforward: the dimension of the feedforward network model (default=2048).
- dropout: the dropout value (default=0.1).
- activation: the activation function of encoder/decoder intermediate layer, relu or gelu (default=relu).
- custom_encoder: custom encoder (default=None).
- custom_decoder: custom decoder (default=None).
- Examples::
- >>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
- >>> src = torch.rand((10, 32, 512))
- >>> tgt = torch.rand((20, 32, 512))
- >>> out = transformer_model(src, tgt)
- Note: A full example to apply nn.Transformer module for the word language model is available in
- https://github.com/pytorch/examples/tree/master/word_language_model
- """
-
- def __init__(self, d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6,
- num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1,
- activation: str = "relu", custom_encoder: Optional[Any] = None, custom_decoder: Optional[Any] = None) -> None:
- super(Transformer, self).__init__()
-
- if custom_encoder is not None:
- self.encoder = custom_encoder
- else:
- encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, activation)
- encoder_norm = LayerNorm(d_model)
- self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
-
- if custom_decoder is not None:
- self.decoder = custom_decoder
- else:
- decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout, activation)
- decoder_norm = LayerNorm(d_model)
- self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm)
-
- self._reset_parameters()
-
- self.d_model = d_model
- self.nhead = nhead
-
- def forward(self, src: Tensor, tgt: Tensor, src_mask: Optional[Tensor] = None, tgt_mask: Optional[Tensor] = None,
- memory_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None,
- tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
- r"""Take in and process masked source/target sequences.
- Args:
- src: the sequence to the encoder (required).
- tgt: the sequence to the decoder (required).
- src_mask: the additive mask for the src sequence (optional).
- tgt_mask: the additive mask for the tgt sequence (optional).
- memory_mask: the additive mask for the encoder output (optional).
- src_key_padding_mask: the ByteTensor mask for src keys per batch (optional).
- tgt_key_padding_mask: the ByteTensor mask for tgt keys per batch (optional).
- memory_key_padding_mask: the ByteTensor mask for memory keys per batch (optional).
- Shape:
- - src: :math:`(S, N, E)`.
- - tgt: :math:`(T, N, E)`.
- - src_mask: :math:`(S, S)`.
- - tgt_mask: :math:`(T, T)`.
- - memory_mask: :math:`(T, S)`.
- - src_key_padding_mask: :math:`(N, S)`.
- - tgt_key_padding_mask: :math:`(N, T)`.
- - memory_key_padding_mask: :math:`(N, S)`.
- Note: [src/tgt/memory]_mask ensures that position i is allowed to attend the unmasked
- positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend
- while the zero positions will be unchanged. If a BoolTensor is provided, positions with ``True``
- are not allowed to attend while ``False`` values will be unchanged. If a FloatTensor
- is provided, it will be added to the attention weight.
- [src/tgt/memory]_key_padding_mask provides specified elements in the key to be ignored by
- the attention. If a ByteTensor is provided, the non-zero positions will be ignored while the zero
- positions will be unchanged. If a BoolTensor is provided, the positions with the
- value of ``True`` will be ignored while the position with the value of ``False`` will be unchanged.
- - output: :math:`(T, N, E)`.
- Note: Due to the multi-head attention architecture in the transformer model,
- the output sequence length of a transformer is the same as the input sequence
- (i.e. target) length of the decoder.
- where S is the source sequence length, T is the target sequence length, N is the
- batch size, E is the feature number
- Examples:
- >>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
- """
-
- if src.size(1) != tgt.size(1):
- raise RuntimeError("the batch number of src and tgt must be equal")
-
- if src.size(2) != self.d_model or tgt.size(2) != self.d_model:
- raise RuntimeError("the feature number of src and tgt must be equal to d_model")
-
- memory = self.encoder(src, mask=src_mask, src_key_padding_mask=src_key_padding_mask)
- output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask,
- tgt_key_padding_mask=tgt_key_padding_mask,
- memory_key_padding_mask=memory_key_padding_mask)
- return output
-
- def generate_square_subsequent_mask(self, sz: int) -> Tensor:
- r"""Generate a square mask for the sequence. The masked positions are filled with float('-inf').
- Unmasked positions are filled with float(0.0).
- """
- mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
- mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
- return mask
-
- def _reset_parameters(self):
- r"""Initiate parameters in the transformer model."""
-
- for p in self.parameters():
- if p.dim() > 1:
- xavier_uniform_(p)
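Putting a and b together, here is a short end-to-end sketch (assuming a standard PyTorch install): construct nn.Transformer, build a causal mask for the target with generate_square_subsequent_mask, and run one forward pass.
- import torch
- import torch.nn as nn
-
- model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6)
- src = torch.rand(10, 32, 512)   # (S, N, E)
- tgt = torch.rand(20, 32, 512)   # (T, N, E)
-
- tgt_mask = model.generate_square_subsequent_mask(tgt.size(0))   # (T, T) causal mask
- out = model(src, tgt, tgt_mask=tgt_mask)
- print(out.shape)                # torch.Size([20, 32, 512]) — same length as tgt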
Below is the complete transformer.py file:
- import copy
- from typing import Optional, Any
-
- import torch
- from torch import Tensor
- from .. import functional as F
- from .module import Module
- from .activation import MultiheadAttention
- from .container import ModuleList
- from ..init import xavier_uniform_
- from .dropout import Dropout
- from .linear import Linear
- from .normalization import LayerNorm
-
- # High Architecture of Transformer encoder and Transformer decoder
- class Transformer(Module):
- r"""A transformer model. User is able to modify the attributes as needed. The architecture
- is based on the paper "Attention Is All You Need". Ashish Vaswani, Noam Shazeer,
- Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and
- Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information
- Processing Systems, pages 6000-6010. Users can build the BERT(https://arxiv.org/abs/1810.04805)
- model with corresponding parameters.
- Args:
- d_model: the number of expected features in the encoder/decoder inputs (default=512).
- nhead: the number of heads in the multiheadattention models (default=8).
- num_encoder_layers: the number of sub-encoder-layers in the encoder (default=6).
- num_decoder_layers: the number of sub-decoder-layers in the decoder (default=6).
- dim_feedforward: the dimension of the feedforward network model (default=2048).
- dropout: the dropout value (default=0.1).
- activation: the activation function of encoder/decoder intermediate layer, relu or gelu (default=relu).
- custom_encoder: custom encoder (default=None).
- custom_decoder: custom decoder (default=None).
- Examples::
- >>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
- >>> src = torch.rand((10, 32, 512))
- >>> tgt = torch.rand((20, 32, 512))
- >>> out = transformer_model(src, tgt)
- Note: A full example to apply nn.Transformer module for the word language model is available in
- https://github.com/pytorch/examples/tree/master/word_language_model
- """
-
- def __init__(self, d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6,
- num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1,
- activation: str = "relu", custom_encoder: Optional[Any] = None, custom_decoder: Optional[Any] = None) -> None:
- super(Transformer, self).__init__()
-
- if custom_encoder is not None:
- self.encoder = custom_encoder
- else:
- encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, activation)
- encoder_norm = LayerNorm(d_model)
- self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
-
- if custom_decoder is not None:
- self.decoder = custom_decoder
- else:
- decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout, activation)
- decoder_norm = LayerNorm(d_model)
- self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm)
-
- self._reset_parameters()
-
- self.d_model = d_model
- self.nhead = nhead
-
- def forward(self, src: Tensor, tgt: Tensor, src_mask: Optional[Tensor] = None, tgt_mask: Optional[Tensor] = None,
- memory_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None,
- tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
- r"""Take in and process masked source/target sequences.
- Args:
- src: the sequence to the encoder (required).
- tgt: the sequence to the decoder (required).
- src_mask: the additive mask for the src sequence (optional).
- tgt_mask: the additive mask for the tgt sequence (optional).
- memory_mask: the additive mask for the encoder output (optional).
- src_key_padding_mask: the ByteTensor mask for src keys per batch (optional).
- tgt_key_padding_mask: the ByteTensor mask for tgt keys per batch (optional).
- memory_key_padding_mask: the ByteTensor mask for memory keys per batch (optional).
- Shape:
- - src: :math:`(S, N, E)`.
- - tgt: :math:`(T, N, E)`.
- - src_mask: :math:`(S, S)`.
- - tgt_mask: :math:`(T, T)`.
- - memory_mask: :math:`(T, S)`.
- - src_key_padding_mask: :math:`(N, S)`.
- - tgt_key_padding_mask: :math:`(N, T)`.
- - memory_key_padding_mask: :math:`(N, S)`.
- Note: [src/tgt/memory]_mask ensures that position i is allowed to attend the unmasked
- positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend
- while the zero positions will be unchanged. If a BoolTensor is provided, positions with ``True``
- are not allowed to attend while ``False`` values will be unchanged. If a FloatTensor
- is provided, it will be added to the attention weight.
- [src/tgt/memory]_key_padding_mask provides specified elements in the key to be ignored by
- the attention. If a ByteTensor is provided, the non-zero positions will be ignored while the zero
- positions will be unchanged. If a BoolTensor is provided, the positions with the
- value of ``True`` will be ignored while the position with the value of ``False`` will be unchanged.
- - output: :math:`(T, N, E)`.
- Note: Due to the multi-head attention architecture in the transformer model,
- the output sequence length of a transformer is the same as the input sequence
- (i.e. target) length of the decoder.
- where S is the source sequence length, T is the target sequence length, N is the
- batch size, E is the feature number
- Examples:
- >>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
- """
-
- if src.size(1) != tgt.size(1):
- raise RuntimeError("the batch number of src and tgt must be equal")
-
- if src.size(2) != self.d_model or tgt.size(2) != self.d_model:
- raise RuntimeError("the feature number of src and tgt must be equal to d_model")
-
- memory = self.encoder(src, mask=src_mask, src_key_padding_mask=src_key_padding_mask)
- output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask,
- tgt_key_padding_mask=tgt_key_padding_mask,
- memory_key_padding_mask=memory_key_padding_mask)
- return output
-
- def generate_square_subsequent_mask(self, sz: int) -> Tensor:
- r"""Generate a square mask for the sequence. The masked positions are filled with float('-inf').
- Unmasked positions are filled with float(0.0).
- """
- mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
- mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
- return mask
-
- def _reset_parameters(self):
- r"""Initiate parameters in the transformer model."""
-
- for p in self.parameters():
- if p.dim() > 1:
- xavier_uniform_(p)
-
- # A stack of N encoder layers
- class TransformerEncoder(Module):
- r"""TransformerEncoder is a stack of N encoder layers
- Args:
- encoder_layer: an instance of the TransformerEncoderLayer() class (required).
- num_layers: the number of sub-encoder-layers in the encoder (required).
- norm: the layer normalization component (optional).
- Examples::
- >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
- >>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
- >>> src = torch.rand(10, 32, 512)
- >>> out = transformer_encoder(src)
- """
- __constants__ = ['norm']
-
- def __init__(self, encoder_layer, num_layers, norm=None):
- super(TransformerEncoder, self).__init__()
- self.layers = _get_clones(encoder_layer, num_layers)
- self.num_layers = num_layers
- self.norm = norm
-
- def forward(self, src: Tensor, mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:
- r"""Pass the input through the encoder layers in turn.
- Args:
- src: the sequence to the encoder (required).
- mask: the mask for the src sequence (optional).
- src_key_padding_mask: the mask for the src keys per batch (optional).
- Shape:
- see the docs in Transformer class.
- """
- output = src
-
- for mod in self.layers:
- output = mod(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)
-
- if self.norm is not None:
- output = self.norm(output)
-
- return output
-
- # A stack of N decoder layers
- class TransformerDecoder(Module):
- r"""TransformerDecoder is a stack of N decoder layers
- Args:
- decoder_layer: an instance of the TransformerDecoderLayer() class (required).
- num_layers: the number of sub-decoder-layers in the decoder (required).
- norm: the layer normalization component (optional).
- Examples::
- >>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
- >>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
- >>> memory = torch.rand(10, 32, 512)
- >>> tgt = torch.rand(20, 32, 512)
- >>> out = transformer_decoder(tgt, memory)
- """
- __constants__ = ['norm']
-
- def __init__(self, decoder_layer, num_layers, norm=None):
- super(TransformerDecoder, self).__init__()
- self.layers = _get_clones(decoder_layer, num_layers)
- self.num_layers = num_layers
- self.norm = norm
-
- def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: Optional[Tensor] = None,
- memory_mask: Optional[Tensor] = None, tgt_key_padding_mask: Optional[Tensor] = None,
- memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
- r"""Pass the inputs (and mask) through the decoder layer in turn.
- Args:
- tgt: the sequence to the decoder (required).
- memory: the sequence from the last layer of the encoder (required).
- tgt_mask: the mask for the tgt sequence (optional).
- memory_mask: the mask for the memory sequence (optional).
- tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
- memory_key_padding_mask: the mask for the memory keys per batch (optional).
- Shape:
- see the docs in Transformer class.
- """
- output = tgt
-
- for mod in self.layers:
- output = mod(output, memory, tgt_mask=tgt_mask,
- memory_mask=memory_mask,
- tgt_key_padding_mask=tgt_key_padding_mask,
- memory_key_padding_mask=memory_key_padding_mask)
-
- if self.norm is not None:
- output = self.norm(output)
-
- return output
-
- # Transformer Encoder Layer
- class TransformerEncoderLayer(Module):
- r"""TransformerEncoderLayer is made up of self-attn and feedforward network.
- This standard encoder layer is based on the paper "Attention Is All You Need".
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
- Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
- Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
- in a different way during application.
- Args:
- d_model: the number of expected features in the input (required).
- nhead: the number of heads in the multiheadattention models (required).
- dim_feedforward: the dimension of the feedforward network model (default=2048).
- dropout: the dropout value (default=0.1).
- activation: the activation function of intermediate layer, relu or gelu (default=relu).
- Examples::
- >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
- >>> src = torch.rand(10, 32, 512)
- >>> out = encoder_layer(src)
- """
-
- def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu"):
- super(TransformerEncoderLayer, self).__init__()
- self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
- # Implementation of Feedforward model
- self.linear1 = Linear(d_model, dim_feedforward)
- self.dropout = Dropout(dropout)
- self.linear2 = Linear(dim_feedforward, d_model)
-
- self.norm1 = LayerNorm(d_model)
- self.norm2 = LayerNorm(d_model)
- self.dropout1 = Dropout(dropout)
- self.dropout2 = Dropout(dropout)
-
- self.activation = _get_activation_fn(activation)
-
- def __setstate__(self, state):
- if 'activation' not in state:
- state['activation'] = F.relu
- super(TransformerEncoderLayer, self).__setstate__(state)
-
- def forward(self, src: Tensor, src_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:
- r"""Pass the input through the encoder layer.
- Args:
- src: the sequence to the encoder layer (required).
- src_mask: the mask for the src sequence (optional).
- src_key_padding_mask: the mask for the src keys per batch (optional).
- Shape:
- see the docs in Transformer class.
- """
- # Compare with the Transformer encoder block in the diagram
- # Norm(src+Dropout(self_attention(src)))
- src2 = self.self_attn(src, src, src, attn_mask=src_mask,
- key_padding_mask=src_key_padding_mask)[0]
- src = src + self.dropout1(src2)
- src = self.norm1(src)
-
- # Norm(src+Dropout(Feedforward()))
- src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
- src = src + self.dropout2(src2)
- src = self.norm2(src)
- return src
-
- # Decoder layer: used for reconstructing the target (language) sequence
- class TransformerDecoderLayer(Module):
- r"""TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.
- This standard decoder layer is based on the paper "Attention Is All You Need".
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
- Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
- Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
- in a different way during application.
- Args:
- d_model: the number of expected features in the input (required).
- nhead: the number of heads in the multiheadattention models (required).
- dim_feedforward: the dimension of the feedforward network model (default=2048).
- dropout: the dropout value (default=0.1).
- activation: the activation function of intermediate layer, relu or gelu (default=relu).
- Examples::
- >>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
- >>> memory = torch.rand(10, 32, 512)
- >>> tgt = torch.rand(20, 32, 512)
- >>> out = decoder_layer(tgt, memory)
- """
- # d_model = 768, nhead = 8
- def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu"):
- super(TransformerDecoderLayer, self).__init__()
- self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
- self.multihead_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
- # Implementation of Feedforward model
- self.linear1 = Linear(d_model, dim_feedforward)
- self.dropout = Dropout(dropout)
- self.linear2 = Linear(dim_feedforward, d_model)
-
- self.norm1 = LayerNorm(d_model)
- self.norm2 = LayerNorm(d_model)
- self.norm3 = LayerNorm(d_model)
- self.dropout1 = Dropout(dropout)
- self.dropout2 = Dropout(dropout)
- self.dropout3 = Dropout(dropout)
-
- self.activation = _get_activation_fn(activation)
-
- def __setstate__(self, state):
- if 'activation' not in state:
- state['activation'] = F.relu
- super(TransformerDecoderLayer, self).__setstate__(state)
-
- # tgt: the sequence to the decoder layer (required). (20,1,768)
- # memory: the sequence from the last layer of the encoder (required). (3600,1,768)
- def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: Optional[Tensor] = None, memory_mask: Optional[Tensor] = None,
- tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
- r"""Pass the inputs (and mask) through the decoder layer.
- Args:
- tgt: the sequence to the decoder layer (required).
- memory: the sequence from the last layer of the encoder (required).
- tgt_mask: the mask for the tgt sequence (optional).
- memory_mask: the mask for the memory sequence (optional).
- tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
- memory_key_padding_mask: the mask for the memory keys per batch (optional).
- Shape:
- see the docs in Transformer class.
- """
-
- # Compare with the structure of the Transformer decoder block
- # tgt = Norm(Dropout(attention(tgt,tgt,tgt))+tgt)
- tgt2 = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask, # Multihead self-attention tgt2 (20,1,768)
- key_padding_mask=tgt_key_padding_mask)[0]
- tgt = tgt + self.dropout1(tgt2) # tgt = tgt + dropout1(0.1,tgt2) (20,1,768)
- tgt = self.norm1(tgt) # LayerNorm (20,1,768)
-
- # tgt = Norm(Dropout(attention(tgt,memory,memory))+tgt)
- tgt2 = self.multihead_attn(tgt, memory, memory, attn_mask=memory_mask, # Multi-head cross-attention (q=tgt, k/v=memory) tgt2 (20,1,768)
- key_padding_mask=memory_key_padding_mask)[0]
- tgt = tgt + self.dropout2(tgt2) # (20,1,768)
- tgt = self.norm2(tgt) # (20,1,768)
- tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt)))) # tgt2 (20,1,768)
- tgt = tgt + self.dropout3(tgt2)
- tgt = self.norm3(tgt)
- return tgt # (20,1,768)
-
-
- def _get_clones(module, N):
- return ModuleList([copy.deepcopy(module) for i in range(N)])
-
-
- def _get_activation_fn(activation):
- if activation == "relu":
- return F.relu
- elif activation == "gelu":
- return F.gelu
-
- raise RuntimeError("activation should be relu/gelu, not {}".format(activation))
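As a small check (a sketch, not part of the file above), generate_square_subsequent_mask builds a lower-triangular additive mask: 0.0 where position i may attend and -inf above the diagonal, so adding it to the attention scores before softmax blocks all future positions.
- import torch
-
- sz = 4
- mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
- mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
- print(mask)
- # tensor([[0., -inf, -inf, -inf],
- #         [0., 0., -inf, -inf],
- #         [0., 0., 0., -inf],
- #         [0., 0., 0., 0.]])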
Multi-head attention is used several times in the Transformer.
In the EncoderLayer, multi-head self-attention is used once.
In the DecoderLayer, multi-head self-attention is applied first, immediately followed by multi-head cross-attention (non-self-attention), in which q is tgt while k and v are memory, the output of the last encoder layer.
The attention mechanism is implemented as follows:
- # Multi-head attention
- class MultiheadAttention(Module):
- r"""Allows the model to jointly attend to information
- from different representation subspaces.
- See reference: Attention Is All You Need
- .. math::
- \text{MultiHead}(Q, K, V) = \text{Concat}(head_1,\dots,head_h)W^O
- \text{where} head_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
- Args:
- embed_dim: total dimension of the model.
- num_heads: parallel attention heads.
- dropout: a Dropout layer on attn_output_weights. Default: 0.0.
- bias: add bias as module parameter. Default: True.
- add_bias_kv: add bias to the key and value sequences at dim=0.
- add_zero_attn: add a new batch of zeros to the key and
- value sequences at dim=1.
- kdim: total number of features in key. Default: None.
- vdim: total number of features in value. Default: None.
- Note: if kdim and vdim are None, they will be set to embed_dim such that
- query, key, and value have the same number of features.
- Examples::
- >>> multihead_attn = nn.MultiheadAttention(embed_dim, num_heads)
- >>> attn_output, attn_output_weights = multihead_attn(query, key, value)
- """
- bias_k: Optional[torch.Tensor]
- bias_v: Optional[torch.Tensor]
-
- def __init__(self, embed_dim, num_heads, dropout=0., bias=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None):
- super(MultiheadAttention, self).__init__()
- self.embed_dim = embed_dim
- self.kdim = kdim if kdim is not None else embed_dim
- self.vdim = vdim if vdim is not None else embed_dim
- self._qkv_same_embed_dim = self.kdim == embed_dim and self.vdim == embed_dim
-
- self.num_heads = num_heads
- self.dropout = dropout
- self.head_dim = embed_dim // num_heads
- assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
-
- if self._qkv_same_embed_dim is False:
- self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
- self.k_proj_weight = Parameter(torch.Tensor(embed_dim, self.kdim))
- self.v_proj_weight = Parameter(torch.Tensor(embed_dim, self.vdim))
- self.register_parameter('in_proj_weight', None)
- else:
- self.in_proj_weight = Parameter(torch.empty(3 * embed_dim, embed_dim))
- self.register_parameter('q_proj_weight', None)
- self.register_parameter('k_proj_weight', None)
- self.register_parameter('v_proj_weight', None)
-
- if bias:
- self.in_proj_bias = Parameter(torch.empty(3 * embed_dim))
- else:
- self.register_parameter('in_proj_bias', None)
- self.out_proj = _LinearWithBias(embed_dim, embed_dim)
-
- if add_bias_kv:
- self.bias_k = Parameter(torch.empty(1, 1, embed_dim))
- self.bias_v = Parameter(torch.empty(1, 1, embed_dim))
- else:
- self.bias_k = self.bias_v = None
-
- self.add_zero_attn = add_zero_attn
-
- self._reset_parameters()
-
- def _reset_parameters(self):
- if self._qkv_same_embed_dim:
- xavier_uniform_(self.in_proj_weight)
- else:
- xavier_uniform_(self.q_proj_weight)
- xavier_uniform_(self.k_proj_weight)
- xavier_uniform_(self.v_proj_weight)
-
- if self.in_proj_bias is not None:
- constant_(self.in_proj_bias, 0.)
- constant_(self.out_proj.bias, 0.)
- if self.bias_k is not None:
- xavier_normal_(self.bias_k)
- if self.bias_v is not None:
- xavier_normal_(self.bias_v)
-
- def __setstate__(self, state):
- # Support loading old MultiheadAttention checkpoints generated by v1.1.0
- if '_qkv_same_embed_dim' not in state:
- state['_qkv_same_embed_dim'] = True
-
- super(MultiheadAttention, self).__setstate__(state)
-
- def forward(self, query, key, value, key_padding_mask=None,
- need_weights=True, attn_mask=None):
- # type: (Tensor, Tensor, Tensor, Optional[Tensor], bool, Optional[Tensor]) -> Tuple[Tensor, Optional[Tensor]]
- r"""
- Args:
- query, key, value: map a query and a set of key-value pairs to an output.
- See "Attention Is All You Need" for more details.
- key_padding_mask: if provided, specified padding elements in the key will
- be ignored by the attention. When given a binary mask and a value is True,
- the corresponding value on the attention layer will be ignored. When given
- a byte mask and a value is non-zero, the corresponding value on the attention
- layer will be ignored
- need_weights: output attn_output_weights.
- attn_mask: 2D or 3D mask that prevents attention to certain positions. A 2D mask will be broadcasted for all
- the batches while a 3D mask allows to specify a different mask for the entries of each batch.
- Shape:
- - Inputs:
- - query: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, E is
- the embedding dimension.
- - key: :math:`(S, N, E)`, where S is the source sequence length, N is the batch size, E is
- the embedding dimension.
- - value: :math:`(S, N, E)` where S is the source sequence length, N is the batch size, E is
- the embedding dimension.
- - key_padding_mask: :math:`(N, S)` where N is the batch size, S is the source sequence length.
- If a ByteTensor is provided, the non-zero positions will be ignored while the position
- with the zero positions will be unchanged. If a BoolTensor is provided, the positions with the
- value of ``True`` will be ignored while the position with the value of ``False`` will be unchanged.
- - attn_mask: 2D mask :math:`(L, S)` where L is the target sequence length, S is the source sequence length.
- 3D mask :math:`(N*num_heads, L, S)` where N is the batch size, L is the target sequence length,
- S is the source sequence length. attn_mask ensure that position i is allowed to attend the unmasked
- positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend
- while the zero positions will be unchanged. If a BoolTensor is provided, positions with ``True``
- is not allowed to attend while ``False`` values will be unchanged. If a FloatTensor
- is provided, it will be added to the attention weight.
- - Outputs:
- - attn_output: :math:`(L, N, E)` where L is the target sequence length, N is the batch size,
- E is the embedding dimension.
- - attn_output_weights: :math:`(N, L, S)` where N is the batch size,
- L is the target sequence length, S is the source sequence length.
- """
- if not self._qkv_same_embed_dim:
- return F.multi_head_attention_forward(
- query, key, value, self.embed_dim, self.num_heads,
- self.in_proj_weight, self.in_proj_bias,
- self.bias_k, self.bias_v, self.add_zero_attn,
- self.dropout, self.out_proj.weight, self.out_proj.bias,
- training=self.training,
- key_padding_mask=key_padding_mask, need_weights=need_weights,
- attn_mask=attn_mask, use_separate_proj_weight=True,
- q_proj_weight=self.q_proj_weight, k_proj_weight=self.k_proj_weight,
- v_proj_weight=self.v_proj_weight)
- else:
- return F.multi_head_attention_forward(
- query, key, value, self.embed_dim, self.num_heads,
- self.in_proj_weight, self.in_proj_bias,
- self.bias_k, self.bias_v, self.add_zero_attn,
- self.dropout, self.out_proj.weight, self.out_proj.bias,
- training=self.training,
- key_padding_mask=key_padding_mask, need_weights=need_weights,
- attn_mask=attn_mask)
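Finally, a short sketch (assuming a standard PyTorch install) of calling MultiheadAttention directly with a key_padding_mask: the output keeps the query's (L, N, E) shape, the returned weights are averaged over heads with shape (N, L, S), and the padded key positions receive zero attention.
- import torch
- import torch.nn as nn
-
- embed_dim, num_heads = 512, 8
- mha = nn.MultiheadAttention(embed_dim, num_heads)
- query = torch.rand(20, 2, embed_dim)   # (L, N, E)
- key = torch.rand(10, 2, embed_dim)     # (S, N, E)
- value = torch.rand(10, 2, embed_dim)   # (S, N, E)
-
- # mark the last 3 key positions of the second batch element as padding
- key_padding_mask = torch.zeros(2, 10, dtype=torch.bool)
- key_padding_mask[1, -3:] = True
-
- attn_output, attn_weights = mha(query, key, value, key_padding_mask=key_padding_mask)
- print(attn_output.shape)                     # torch.Size([20, 2, 512])
- print(attn_weights.shape)                    # torch.Size([2, 20, 10])
- print(attn_weights[1, :, -3:].abs().sum())   # tensor(0.) — padded keys get no weight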
This post is a personal study note; if you spot any mistakes, corrections are very welcome. Thanks, everyone!