繁依Fanyi0

这个屌丝很懒，什么也没留下！

热门标签

Attention的原理和实现_attention如何减少计算量

作者：繁依Fanyi0 | 2024-03-26 10:17:16

踩

attention如何减少计算量

Attention的原理和实现

1. Attention的介绍【让模型更加关注句子中的重点】

在普通的RNN结构中，Encoder需要把一个句子转化为一个向量，然后在Decoder中使用，这就要求Encoder把源句子中所有的信息都包含进去，但是当句子长度过长的时候，这个要求就很难达到，或者说会产生瓶颈（比如，输入一篇文章等场长内容），当然我们可以使用更深的RNN和大多的单元来解决这个问题，但是这样的代价也很大。那么有没有什么方法能够优化现有的RNN结构呢？

为此，Bahdanau等人在2015年提出了Attenion机制，Attention翻译成为中文叫做注意力，把这种模型称为Attention based model。就像我们自己看到一副画，我们能够很快的说出画的主要内容，而忽略画中的背景，因为我们注意的，更关注的往往是其中的主要内容。

通过这种方式，在我们的RNN中，我们有通过LSTM或者是GRU得到的所有信息，那么这些信息中只去关注重点，而不需要在Decoder的每个time step使用全部的encoder的信息，这样就可以解决第一段所说的问题了

2. Attenion的实现机制

假设我们现在有一个文本翻译的需求，即机器学习翻译为machine learning。那么这个过程通过前面的Seq2Seq就可以实现

上图的左边是Encoder，能够得到hidden_state在右边使用

Deocder中蓝色方框中的内容，是为了提高模型的训练速度而使用teacher forcing手段，否则的话会把前一次的输出作为下一次的输入（但是在Attention模型中不再是这样了）

那么整个过程中如果使用Attention应该怎么做呢？

在之前我们把encoder的最后一个输出，作为decoder的初始的隐藏状态，现在我们不再这样做

2.1 Attention的实现过程

1.初始化一个Decoder的隐藏状态 $z^{0}$

2.这个 $z^{0}$ 会和encoder第一个time step的output进行match操作（或者是socre操作），得到 $\alpha_0^1$ ，这里的match可以使很多中操作，比如：

z和h的余弦值
是一个神经网络，输入为z和h
或者 $\alpha = h^T W z$ 等

3.encoder中的每个output都和 $z^{0}$ 进行计算之后，得到的结果进行softmax，让他们的和为1(可以理解为权重)

4.之后把所有的softmax之后的结果和原来encoder的输出 $h^{i}$ 进行相加求和得到 $c^{0}$

即： $c^0 = \sum\hat{\alpha}_0^ih^i$

5.得到 $c^{0}$ 之后，把它作为decoder的input，同和传入初始化的 $z^{0}$ ，得到第一个time step的输出和hidden_state（ $z^{1}$ ）

6.把 $z^{1}$ 再和所有的encoder的output进行match操作，得到的结果进行softmax之后作为权重和encoder的每个timestep的结果相乘求和得到 $c^{1}$

7.再把 $c^{1}$ 作为decoder的input，和 $z^{1}$ 作为输入得到下一个输出，如此循环,只到最终decoder的output为终止符

8.上述参考：http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

9.整个过程写成数学公式如下：

$\begin{align*} \alpha_{ij} &= \frac{exp(score(h_i,\overline{h}_j))}{\sum exp(score(h_n,\overline{h}_m))} & [attention \quad weight]\\ c_i &=\sum \alpha_{ij}\overline{h}_s & [context\quad vector] \\ \alpha_i &= f(c_i,h_i) = tanh(W_c[c_i;h_i]) &[attenton \quad result] \end{align*}$

先计算attention权重
再计算上下文向量，图中的 $c_{i}$
最后计算结果，往往会把当前的output([batch_size,1,hidden_size])和上下文向量进行拼接然后使用


attention 原理实现过程
a. 初始化decoder的隐藏状态 z0
b. 计算attention权重：隐藏状态和和encoder的outputs计算（余弦，DNN，矩阵）出的结果进行sofamax之后得到attention weight
c. 得到context vector：attention weight 和 encoder outputs 计算得到
d. attention的最终结果：前一次的输出和context vector 进行concat，经过形状变换和tanh的处理后作为当前时间步的输入
e. decoder当前时间步会把attention的最终结果作为输入，还会把前一次的输出zi作为隐藏状态输入

2.2 不同Attention的介绍

在上述过程中，使用decoder的状态和encoder的状态的计算后的结果作为权重，乘上encoder每个时间步的输出，这需要我们去训练一个合适的match函数，得到的结果就能够在不同的时间步上使用不同的encoder的相关信息，从而达到只关注某一个局部的效果，也就是注意力的效果

2.2.1 `Soft-Attention 和 Hard-Attention`

最开始Bahdanau等人提出的Attention机制通常被称为soft-attention,所谓的soft-attention指的是encoder中输入的每个词语都会计算得到一个注意力的概率。【会关注每一个位置，每一个位置都会有概率】

在进行图像捕捉的时候，提出了一种hard-attenion的方法，希望直接从input中找到一个和输出的某个词对应的那一个词。但是由于NLP中词语和词语之间往往存在联系，不会只关注某一个词语，所以都会使用soft-attention，所以这里的就不多介绍hard-attention 【会关注某一个位置，只有一个位置会有概率】


soft attention: encoder中每一个输出都会计算一个概率
hard attention: encoder中只寻找某一个计算概率

2.2.3 `Global-Attention 和Local Attention`

Bahdanau等人提出的Bahdanau Attention 被称为local attention,后来Luong等人提出的Luong Attention是一种全局的attenion。

所谓全局的attenion指的是：使用的全部的encoder端的输入的attenion的权重

local-attenion就是使用了部分的encoder端的输入的权重(当前时间步上的encoder的hidden state)，这样可以减少计算量，特别是当句子的长度比较长的时候。


golbal attention：使用全部的encoder 的输出来计算attention权重
local attention：使用部分的encoder的输出来计算attention权重

2.2.4 `Bahdanau Attention和 Luong Attenion`的区别


Bahdanau Attention计算过程
1. 前一次的隐藏状态和encoder的output match计算得到attention weight
2. attention weight 和 encoder output计算得到context vector
3. context vector作为当前时间步的输入，同时当前时间步的输入还有前一次的隐藏状态
4. 得到当前时间步的输出和隐藏状态
Luong Attenion计算过程
1. GRU计算得到的decoder的hidden_state(若没有则可以初始化一个)
2. hidden state 和encoder的hidden计算得到a_t(attention weight)
3. attention weight 和 encoder 的output计算得到context vector
4. context vector 和 GRU的当前时间步的output合并计算得到最终的输出

区别在于两个地方：

一、attention的计算数据和位置

1.Bahdanau Attention会使用前一次的隐藏状态来计算attention weight，所以我们会在代码中的GRU之前使用attention的操作，同时会把attention的结果和word embedding的结果进行concat，作为GRU的输出(参考的是pytorch Toritul)。Bahdanau使用的是双向的GRU，会使用正反的encoder的output的concat的结果作为encoder output,如下图所示

2.Luong Attenion使用的是当前一次的decoder的output来计算得到attention weight，所以在代码中会在GRU的后面进行attention的操作，同时会把context vector和gru的结果进行concat的操作，最终的output。Luong使用的是多层GRU，只会使用最后一层的输出(encoder output)

二、计算attention weights的方法不同

1.Bahdanau Attention的match函数， $a_i^j= v^T_a tanh (W_aZ{i-1},+U_ah_j)$ ，计算出所有的 $a_i^j$ 之后，在计算softmax，得到 $\hat{a}i^j$ ，即 $\hat{a}_i^j = \frac{exp(a_i^j)}{\sum exp(a_i^j)}$

其中

$v_a^T$ 是一个参数矩阵，需要被训练， $W_a$ 是实现对 $Z_{i-1}$ 的形状变化，
$U_a$ 实现对 $h_j$ 的形状变化（矩阵乘法，理解为线性回归，实现数据形状的对齐），
$Z_{i-1}$ 是decoder端前一次的隐藏状态， $h_j$ 是encoder的output

2.Luong Attenion整体比Bahdanau Attention更加简单，他使用了三种方法来计算得到权重

矩阵乘法：general
- 直接对decoder的隐藏状态进行一个矩阵变换（线性回归），然后和encoder outputs进行矩阵乘法
dot
- 直接对decoder的隐藏状态和encoder outputs进行矩阵乘法
concat
- 把decoder的隐藏状态和encoder的output进行concat，把这个结果使用tanh进行处理后的结果进行对齐(进行矩阵乘法，变换为需要的形状)计算之后，和encoder outputs进行矩阵乘法

$h_{t}$ 是当前的decoder hidden state, $h_{s}$ 是所有的encoder 的hidden state(encoder outputs)

最终两个attention的结果区别并不太大，所以以后我们可以考虑使用Luong attention完成代码


Bahdanau Attention和 Luong Attenion 的区别
encoder:
 	1.  Bahdanau Attention采用双向GRU进行解码，正向和反向的output进行concat之后的结果作为encoder的结果
 	2. Luong attention采用单向多层GRU，把最后一层的输出作为encoder的输出
decoder:
	1. Bahdanau Attention使用之前的hidden_state 和encoder的output计算，得到attention weight和context vector，作为gru输入
	2. Luong attention使用当前时间步的输出和encoder的output计算得到attention weight，在和encoder out 计算得到context vector ，和decoder的output进行concat，作为输出

3. Attention的代码实现

完成代码之前，我们需要确定我们的思路，通过attention的代码，需要实现计算的是attention weight

通过前面的学习，我们知道attention_weight = f(hidden,encoder_outputs)，主要就是实现Luong attention中的三种操作

示例代码：


import torch
import torch.nn as nn
import torch.nn.functional as F
import config
 
 
class Attention(nn.Module):
    def __init__(self, method, batch_size, hidden_size):
        """
        attention 机制
        :param method:三种attention_weights 计算方法general, dot, concat
        :param method:
        :param batch_size:
        :param hidden_size:
        """
        super().__init__()
        self.method = method
        self.hidden_size = hidden_size
 
        assert self.method in ["dot", "general", "concat"], "methode只能是dot、general、concat,当前是{}".format(self.method)
 
        # dot 为decoder_hidden 和encoder_outputs 直接进行矩阵乘法
        if self.method == 'dot':
            pass
        #  general为对decoder_hidden 进行矩阵变换后，与encoder_outputs相乘
        elif self.method == 'general':
            self.wa = nn.Linear(hidden_size, hidden_size, bias=False)
        elif self.method == "concat":
            self.wa = nn.Linear(hidden_size * 2, hidden_size, bias=False)
            self.va = nn.Parameter(torch.FloatTensor(batch_size, hidden_size))
 
    def forward(self, hidden, encoder_outputs):
        """
        进行三种运算得到attn_weights
        :param hidden:decoder每个时间步的隐藏状态[1,batch_size,hidden_size]
        :param encoder_outputs:[batch_size,seq_len,hidden_size]
        :return:
        """
        batch_size, seq_len, hidden_size = encoder_outputs.size()
        hidden = hidden.squeeze(0)  # [batch_size,hidden_size]
 
        if self.method == "dot":
            return self.dot_score(hidden, encoder_outputs)
        elif self.method == 'general':
            return self.general_score(hidden, encoder_outputs)
        elif self.method == "concat":
            return self.concat_score(hidden, encoder_outputs)
 
    def _score(self, batch_size, seq_len, hidden, encoder_outputs):
        #  速度太慢
        #  [batch_size, seq_len]
        attn_energies = torch.zeros(batch_size, seq_len).to(config.device)
        for b in range(batch_size):
            for i in range(seq_len):
                # encoder_output : [batch_size,seq_len,hidden_size]
                # deocder_hidden :[batch_size,hidden_size]
                # torch.Size([256, 128]) torch.Size([128]) torch.Size([256, 24, 128]) torch.Size([128])
                # print("attn size:",hidden.size(),hidden[b,:].size(),encoder_output.size(),encoder_output[b,i].size())
                attn_energies[b, i] = hidden[b, :].dot(encoder_outputs[b, i])
        return F.softmax(attn_energies).unsqueeze(1)  # [batch_size,1,seq_len]
 
    def dot_score(self, hidden, encoder_outputs):
        """
        dot attention : 直接对decoder_hidden 和 encoder_outputs进行矩阵乘法
        :param hidden: [batch_size,hidden_size] --->[batch_size,hidden_size,1]
        :param encoder_outputs:[batch_size,seq_len,hidden_size]
        :return:
        """
        # hiiden :[hidden_size] -->[hidden_size,1] ，encoder_output:[seq_len,hidden_size]
        hidden = hidden.unsqueeze(-1)
        attn_energies = torch.bmm(encoder_outputs, hidden)
        attn_energies = attn_energies.squeeze(-1)  # [batch_size,seq_len,1] ==>[batch_size,seq_len]
 
        return F.softmax(attn_energies).unsqueeze(1)  # [batch_size,1,seq_len]
 
    def general_score(self, hidden, encoder_outputs):
        """
        general attention:对decoder_hidden进行线性变换后与encoder_outputs进行矩阵乘法
        :param hidden: int
        :param encoder_outputs:[batch_size,seq_len,hidden_size]
        :return:
        """
        x = self.wa(hidden)  # [batch_size,hidden_size]
        x = x.unqueeze(-1)  # [batch_size,hidden_size,1]
        attn_energies = torch.bmm(encoder_outputs, x).squeeze(-1)  # [batch_size,seq_len,1]
        return F.softmax(attn_energies, dim=-1).unsqueeze(1)  # [batch_size,1,seq_len]
 
    def concat_score(self, hidden, encoder_outputs):
        """
        concat attention:decoder_hidden和encoder_outputs拼接，
        把这个结果使用tanh进行处理后的结果进行对齐(进行矩阵乘法，变换为需要的形状)计算之后，
        和encoder outputs进行矩阵乘法
        :param hidden: int
        :param encoder_outputs:[batch_size,seq_len,hidden_size]
        :return:
        """
        # 需要先进行repeat操作，变成和encoder_outputs相同的形状,让每个batch有seq_len个hidden_size
        x = hidden.repeat(1, encoder_outputs.size(1), 1)  # [batch_size,seq_len,hidden_size]
        x = torch.tanh(self.wa(torch.cat([x, encoder_outputs],
                                         dim=-1)))  # [batch_size,seq_len,hidden_size*2] --> [batch_size,seq_len,hidden_size]
        # va [batch_size,hidden_size] ---> [batch_size,hidden_size,1]
        attn_energies = torch.bmm(x, self.va.unsqueeze(2))  # [batch_size,seq_len,1]
        attn_energies = attn_energies.squeeze(-1)
        # print("concat attention:",attn_energis.size(),encoder_outputs.size())
        return F.softmax(attn_energies, dim=-1).unsqueeze(1)  # [batch_size,1,seq_len]

完成了attention weight的计算之后，需要再对代码中forward_step的内容进行修改


    def forward_step(self, decoder_input, decoder_hidden, encoder_outputs):
        """
        :param decoder_input:[batch_size,1]
        :param decoder_hidden:[1,batch_size,hidden_size]
        :param encoder_outputs:encoder中所有的输出，[batch_size,seq_len,hidden_size]
        :return:out:[batch_size,vocab_size],decoder_hidden:[1,batch_size,didden_size]
        """
        embeded = self.embedding(decoder_input)
 
        # TODO 可以把embeded的结果和前一次的context（初始值为全0tensor） concate之后作为结果
        # rnn_input = torch.cat((embeded, last_context.unsqueeze(0)), 2)
 
        # gru_out:[256,1, 128]  decoder_hidden: [1, batch_size, hidden_size]
        gru_out, decoder_hidden = self.gru(embeded, decoder_hidden)
        gru_out = gru_out.squeeze(1)
 
        # TODO 注意：如果是单层，这里使用decoder_hidden没问题（output和hidden相同）
        # 如果是多层，可以使用GRU的output作为attention的输入
        # 开始使用attention
        attn_weights = self.attn(decoder_hidden, encoder_outputs)
        # attn_weights [batch_size,1,seq_len] * [batch_size,seq_len,hidden_size]
        context = attn_weights.bmm(encoder_outputs)  # [batch_size,1,hidden_size]
 
        gru_out = gru_out.squeeze(0)  # [batch_size,hidden_size]
        context = context.squeeze(1)  # [batch_size,hidden_size]
        # 把output和attention的结果合并到一起
        concat_input = torch.cat((gru_out, context), 1)  # [batch_size,hidden_size*2]
 
        concat_output = torch.tanh(self.concat(concat_input))  # [batch_size,hidden_size]
 
        output = F.log_softmax(self.fc(concat_output), dim=-1)  # [batch_Size, vocab_size]
        # out = out.squeeze(1)
        return output, decoder_hidden, attn_weights

attetnion的Bahdanau实现可以参考：https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb

声明：本文内容由网友自发贡献，不代表【wpsshop博客】立场，版权归原作者所有，本站不承担相应法律责任。如您发现有侵权的内容，请联系我们。转载请注明出处：https://www.wpsshop.cn/w/繁依Fanyi0/article/detail/316403