
[LLM] Natural Language Processing --- Self-Attention (Part 2): Animation and Code Walkthrough


This part is easier to follow alongside the reference below:

Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch (sebastianraschka.com)

Part 1: Self-Attention Animation Walkthrough

Step 1: Prepare inputs

For this tutorial, we start with 3 inputs, each with dimension 4.

  1. Input 1: [1, 0, 1, 0]
  2. Input 2: [0, 2, 0, 2]
  3. Input 3: [1, 1, 1, 1]

Step 2: Initialise weights

Every input must have three representations (see diagram below). These representations are called key (orange), query (red), and value (purple). For this example, let's say we want these representations to have a dimension of 3. Because every input has a dimension of 4, each set of weights must have a shape of 4×3.

(the dimension of value is also the dimension of the output.)

In order to obtain these representations, every input (green) is multiplied with a set of weights for keys, a set of weights for queries, and a set of weights for values. In our example, we 'initialise' the three sets of weights as follows.

Weights for key:

    [[0, 0, 1],
     [1, 1, 0],
     [0, 1, 0],
     [1, 1, 0]]

Weights for query:

    [[1, 0, 1],
     [1, 0, 0],
     [0, 0, 1],
     [0, 1, 1]]

Weights for value:

    [[0, 2, 0],
     [0, 3, 0],
     [1, 0, 3],
     [1, 1, 0]]

PS: In a neural network setting, these weights are usually small numbers, initialised randomly using an appropriate scheme such as Gaussian, Xavier, or Kaiming initialisation.

Step 3: Derive key, query and value

Now that we have the three sets of weights, let's actually obtain the key, query and value representations for every input.

Key representation for Input 1:

                   [0, 0, 1]
    [1, 0, 1, 0] x [1, 1, 0] = [0, 1, 1]
                   [0, 1, 0]
                   [1, 1, 0]

Use the same set of weights to get the key representation for Input 2:

                   [0, 0, 1]
    [0, 2, 0, 2] x [1, 1, 0] = [4, 4, 0]
                   [0, 1, 0]
                   [1, 1, 0]

Use the same set of weights to get the key representation for Input 3:

                   [0, 0, 1]
    [1, 1, 1, 1] x [1, 1, 0] = [2, 3, 1]
                   [0, 1, 0]
                   [1, 1, 0]

A faster way is to vectorise the above key operations:

                   [0, 0, 1]
    [1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
    [0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
    [1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]

2.   Let’s do the same to obtain the value representations for every input:

  1. [0, 2, 0]
  2. [1, 0, 1, 0] [0, 3, 0] [1, 2, 3]
  3. [0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
  4. [1, 1, 1, 1] [1, 1, 0] [2, 6, 3]

And finally the query representations:

                   [1, 0, 1]
    [1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
    [0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
    [1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]

PS: In practice, a bias vector may be added to the product of matrix multiplication.
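As a minimal TensorFlow sketch of that variant, reusing the key weights above together with a hypothetical zero bias vector b_key (the bias is not part of this tutorial's numbers):

    import tensorflow as tf

    x = tf.constant([[1., 0., 1., 0.],
                     [0., 2., 0., 2.],
                     [1., 1., 1., 1.]])
    w_key = tf.constant([[0., 0., 1.],
                         [1., 1., 0.],
                         [0., 1., 0.],
                         [1., 1., 0.]])
    b_key = tf.zeros(3)          # hypothetical bias, one entry per key dimension
    keys = x @ w_key + b_key     # same keys as above, since the bias is zero here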

Step 4: Calculate attention scores for Input 1

To obtain attention scores, we start off with taking a dot product between Input 1’s query (red) with all keys (orange), including itself. Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention scores (blue).

                [0, 4, 2]
    [1, 0, 2] x [1, 4, 3] = [2, 4, 4]
                [1, 0, 1]

Note that we only use the query from Input 1 here. Later we'll repeat this same step for the other queries.

PS: The above operation is known as dot product attention, one of the several score functions. Other score functions include scaled dot product and additive/concat.
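For reference, the scaled dot-product variant used in the Transformer normalises the same dot product by the square root of the key dimension before the softmax:

    Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V

With d_k = 3 as in this example, that means dividing every score by √3 ≈ 1.7, which is what the scaled version in Part 2 below does.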

Step 5: Calculate softmax

Take the softmax across these attention scores (blue).

softmax([2, 4, 4]) ≈ [0.0, 0.5, 0.5]
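A quick sketch to check that rounding (the exact softmax values before approximation):

    import tensorflow as tf

    print(tf.nn.softmax(tf.constant([2., 4., 4.])))
    # ~[0.0634, 0.4683, 0.4683], which rounds to the [0.0, 0.5, 0.5] used here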

Step 6: Multiply scores with values

The softmaxed attention score for each input (blue) is multiplied with its corresponding value (purple). This results in 3 alignment vectors (yellow). In this tutorial, we'll refer to them as weighted values.

    0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
    0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
    0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]

Step 7: Sum weighted values to get Output 1

Take all the weighted values (yellow) and sum them element-wise:

      [0.0, 0.0, 0.0]
    + [1.0, 4.0, 0.0]
    + [1.0, 3.0, 1.5]
    -----------------
    = [2.0, 7.0, 1.5]

The resulting vector [2.0, 7.0, 1.5] (dark green) is Output 1, which is based on the query representation from Input 1 interacting with all other keys, including itself.

Step 8: Repeat for Input 2 & Input 3

The query and key must have the same dimension, because the attention scores are computed as dot products between them. The value, however, can have a different dimension from Q and K.

The resulting output will consequently follow the dimension of value.
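A small sketch to illustrate this (the random 4×2 value weights here are made up for illustration, not part of the tutorial's numbers):

    import tensorflow as tf

    x = tf.constant([[1., 0., 1., 0.],
                     [0., 2., 0., 2.],
                     [1., 1., 1., 1.]])      # 3 inputs of dimension 4
    w_query = tf.random.normal((4, 3))       # query dimension d_k = 3
    w_key   = tf.random.normal((4, 3))       # key dimension must also be 3
    w_value = tf.random.normal((4, 2))       # value dimension d_v = 2 may differ

    scores  = (x @ w_query) @ tf.transpose(x @ w_key)   # (3, 3) attention scores
    weights = tf.nn.softmax(scores, axis=-1)
    output  = weights @ (x @ w_value)
    print(output.shape)                      # (3, 2): the output follows d_v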

Part 2: Self-Attention Code Walkthrough

Step 1: Prepare the input X

    import tensorflow as tf

    x = [
        [1, 0, 1, 0],  # Input 1
        [0, 2, 0, 2],  # Input 2
        [1, 1, 1, 1]   # Input 3
    ]
    x = tf.Variable(x, dtype=tf.float32)

Step 2: Initialise the weights W

In practice these weights are initialised randomly, e.g. with a Gaussian, Xavier, or Kaiming scheme. This initialisation is done once, before training begins.

    w_key = [
        [0, 0, 1],
        [1, 1, 0],
        [0, 1, 0],
        [1, 1, 0]
    ]
    w_query = [
        [1, 0, 1],
        [1, 0, 0],
        [0, 0, 1],
        [0, 1, 1]
    ]
    w_value = [
        [0, 2, 0],
        [0, 3, 0],
        [1, 0, 3],
        [1, 1, 0]
    ]
    w_key = tf.Variable(w_key, dtype=tf.float32)
    w_query = tf.Variable(w_query, dtype=tf.float32)
    w_value = tf.Variable(w_value, dtype=tf.float32)
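As a hedged sketch of what a random initialisation could look like instead of the hand-picked integers above (GlorotUniform is TensorFlow's Xavier initialiser; HeNormal would be the Kaiming variant). The rest of this walkthrough keeps the hand-picked weights so the numbers stay reproducible:

    import tensorflow as tf

    init = tf.keras.initializers.GlorotUniform(seed=0)   # Xavier/Glorot; use HeNormal() for Kaiming
    w_key   = tf.Variable(init(shape=(4, 3)))
    w_query = tf.Variable(init(shape=(4, 3)))
    w_value = tf.Variable(init(shape=(4, 3)))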

Step 3: Compute K, Q and V

    keys = x @ w_key
    querys = x @ w_query
    values = x @ w_value

    print(keys)
    # [[0., 1., 1.],
    #  [4., 4., 0.],
    #  [2., 3., 1.]]
    print(querys)
    # [[1., 0., 2.],
    #  [2., 2., 2.],
    #  [2., 1., 3.]]
    print(values)
    # [[1., 2., 3.],
    #  [2., 8., 0.],
    #  [2., 6., 3.]]

Step 4: Compute the attention scores

The attention scores are obtained by multiplying the queries Q with the transpose of the keys K, i.e. a dot product between every query and every key.

    attn_scores = querys @ tf.transpose(keys, perm=[1, 0])
    print(attn_scores)
    # [[ 2.,  4.,  4.],    # attention scores from Query 1
    #  [ 4., 16., 12.],    # attention scores from Query 2
    #  [ 4., 12., 10.]]    # attention scores from Query 3
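As a usage note, the same scores can also be obtained by letting tf.matmul transpose the keys internally:

    attn_scores = tf.matmul(querys, keys, transpose_b=True)   # identical to querys @ tf.transpose(keys)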

Step 5: Compute the softmax

This example does not divide the scores by √d_k:

    attn_scores_softmax = tf.nn.softmax(attn_scores)
    print(attn_scores_softmax)
    # [[6.3379e-02, 4.6831e-01, 4.6831e-01],
    #  [6.0337e-06, 9.8201e-01, 1.7986e-02],
    #  [2.9539e-04, 8.8054e-01, 1.1917e-01]]

    # For readability, approximate the above as follows:
    attn_scores_softmax = [
        [0.0, 0.5, 0.5],
        [0.0, 1.0, 0.0],
        [0.0, 0.9, 0.1]
    ]
    attn_scores_softmax = tf.Variable(attn_scores_softmax)
    print(attn_scores_softmax)

The example below divides the scores by √d_k (here √3 ≈ 1.7):

    attn_scores = attn_scores / 1.7
    print(attn_scores)
    # Approximated for readability:
    attn_scores = [
        [1.2, 2.4, 2.4],
        [2.4, 9.4, 7.1],
        [2.4, 7.1, 5.9],
    ]
    attn_scores = tf.Variable(attn_scores, dtype=tf.float32)
    print(attn_scores)

    attn_scores_softmax = tf.nn.softmax(attn_scores)
    print(attn_scores_softmax)
    # Approximated for readability:
    attn_scores_softmax = [
        [0.1, 0.4, 0.4],
        [0.0, 0.9, 0.0],
        [0.0, 0.7, 0.2],
    ]
    attn_scores_softmax = tf.Variable(attn_scores_softmax, dtype=tf.float32)
    print(attn_scores_softmax)
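Rather than hard-coding 1.7, the scaling factor can also be derived from the key dimension (a small sketch; with d_k = 3, √d_k ≈ 1.73):

    d_k = tf.cast(tf.shape(keys)[-1], tf.float32)                  # key dimension, here 3.0
    scaled_scores = (querys @ tf.transpose(keys, perm=[1, 0])) / tf.sqrt(d_k)
    attn_scores_softmax = tf.nn.softmax(scaled_scores, axis=-1)
    print(attn_scores_softmax)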

Steps 6 and 7 can be computed together with a single matrix multiplication. (The attention weights printed below are the un-scaled approximations from the first Step 5 example.)

    print(attn_scores_softmax)
    print(values)
    outputs = tf.matmul(attn_scores_softmax, values)
    print(outputs)

which prints:

    <tf.Variable 'Variable:0' shape=(3, 3) dtype=float32, numpy=
    array([[0. , 0.5, 0.5],
           [0. , 1. , 0. ],
           [0. , 0.9, 0.1]], dtype=float32)>
    tf.Tensor(
    [[1. 2. 3.]
     [2. 8. 0.]
     [2. 6. 3.]], shape=(3, 3), dtype=float32)
    tf.Tensor(
    [[2.        7.        1.5      ]
     [2.        8.        0.       ]
     [2.        7.7999997 0.3      ]], shape=(3, 3), dtype=float32)

The version below computes Steps 6 and 7 explicitly, broadcasting the attention weights over the values. (The commented results correspond to the un-scaled softmax approximation [[0.0, 0.5, 0.5], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]; with the √d_k-scaled weights the numbers would differ slightly.)

Step 6: Multiply scores with values

    weighted_values = values[:, None] * tf.transpose(attn_scores_softmax, perm=[1, 0])[:, :, None]
    print(weighted_values)
    # [[[0.0000, 0.0000, 0.0000],
    #   [0.0000, 0.0000, 0.0000],
    #   [0.0000, 0.0000, 0.0000]],
    #
    #  [[1.0000, 4.0000, 0.0000],
    #   [2.0000, 8.0000, 0.0000],
    #   [1.8000, 7.2000, 0.0000]],
    #
    #  [[1.0000, 3.0000, 1.5000],
    #   [0.0000, 0.0000, 0.0000],
    #   [0.2000, 0.6000, 0.3000]]]

Step 7: Sum weighted values

    outputs = tf.reduce_sum(weighted_values, axis=0)
    print(outputs)
    # [[2.0000, 7.0000, 1.5000],   # Output 1
    #  [2.0000, 8.0000, 0.0000],   # Output 2
    #  [2.0000, 7.8000, 0.3000]]   # Output 3
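Putting the steps together, a minimal single-head self-attention helper might look like the sketch below (the function name and the scale flag are my own; it assumes the x and weight variables defined in Steps 1 and 2):

    import tensorflow as tf

    def self_attention(x, w_query, w_key, w_value, scale=False):
        """Single-head self-attention: Steps 3-7 in one place."""
        querys = x @ w_query                                   # Step 3: project inputs to Q, K, V
        keys   = x @ w_key
        values = x @ w_value
        scores = tf.matmul(querys, keys, transpose_b=True)     # Step 4: Q @ K^T
        if scale:                                              # optional /sqrt(d_k) scaling
            scores = scores / tf.sqrt(tf.cast(tf.shape(keys)[-1], tf.float32))
        weights = tf.nn.softmax(scores, axis=-1)               # Step 5
        return weights @ values                                # Steps 6-7

    print(self_attention(x, w_query, w_key, w_value))
    # close to the approximated [[2.0, 7.0, 1.5], [2.0, 8.0, 0.0], [2.0, 7.8, 0.3]] above,
    # since here the exact (un-rounded) softmax weights are used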
