Coursera Natural Language Processing Specialization, Course 4: Natural Language Processing with Attention Models, Notes for Week 01

Natural Language Processing with Attention Models

Course Certificate

These are my study notes for the course Natural Language Processing with Attention Models. If any content infringes on your rights, please contact me and I will remove it.

Week 01: Neural Machine Translation

Discover some of the shortcomings of a traditional seq2seq model and how to solve for them by adding an attention mechanism, then build a Neural Machine Translation model with Attention that translates English sentences into German.

Learning Objectives


  • Explain how an Encoder/Decoder model works
  • Apply word alignment for machine translation
  • Train a Neural Machine Translation model with Attention
  • Develop intuition for how teacher forcing helps a translation model check its predictions
  • Use BLEU score and ROUGE score to evaluate machine-generated text quality
  • Describe several decoding methods including MBR and Beam search

Seq2seq

Good to see you again. You will now learn about
neural machine translation, and you’ll see what
the architecture of this neural
network looks like. You will also learn which words the neural network
is focusing on when translating from
one language to another. Let’s formalize this task. To get started on
this week’s material, I’ll introduce you to neural machine
translation along with the model that was traditionally used for its implementation, the seq2seq model. Then, I’ll talk about some
of this model’s shortcomings and the solutions that
lead into the model that you’ll be using in
this week’s assignments. Exciting stuff. Let’s go.

In neural machine translation, you’re using an
encoder and a decoder to translate from one
language to another. For example, you
could translate “It’s time for tea” from English to French: “C’est l’heure du thé.” To do this, you could use a
machine translation system that has LSTMs for both
encoding and decoding. The traditional seq2seq
model was introduced by Google in 2014 and it was a revelation
at the time. Basically, it works by
taking one sequence of items, such as words, and
outputting another sequence. The way this is
done is by mapping variable length sequences
to a fixed length memory, which in machine translation, encodes the overall
meaning of sentences.

For example, you can have a text whose length varies and you can encode
it into a vector of fixed dimension,
such as 300. This feature is what’s made this model a powerhouse
for machine translation. Additionally, the
inputs and outputs don’t need to have
matching lengths, which is a desirable feature
when translating texts. Then you might recall the
vanishing and exploding gradients problems from
earlier in the specialization. In seq2seq models, LSTMs and GRUs are typically
used to avoid these problems. As I mentioned, in
a seq2seq model, you have an encoder
and a decoder.

The encoder takes
word tokens as input, and it returns its final
hidden state as output. This hidden state is
used by the decoder to generate the translated sentence
in the target language. Before moving on, let’s look closer at the
encoder and decoder. The encoder typically consists
of an embedding layer and an LSTM module with
one or more layers. The embedding layer
transforms each word token into a vector for
input to the LSTM module. At each step in the
input sequence, the LSTM module receives inputs
from the embedding layer, as well as the hidden state
from the previous step. The encoder returns the hidden
state of the final step, shown here as h_4. This final hidden
state has information from the whole sentence and it encodes its
overall meaning.
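
The following is a minimal sketch of the encoder just described, not the course's implementation. It swaps the LSTM for a plain tanh RNN cell to stay short, and the sizes, the random weights, and the `encode` helper are all assumptions made for illustration. It shows that inputs of different lengths end up in a final hidden state of the same fixed size.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, hidden_dim = 20, 8, 6   # illustrative sizes

# Randomly initialised parameters; in a real model these are learned
embedding = rng.normal(size=(vocab_size, embed_dim))
W_xh = rng.normal(size=(embed_dim, hidden_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

def encode(token_ids):
    """Return every hidden state and the final one for a token sequence."""
    h = np.zeros(hidden_dim)
    states = []
    for t in token_ids:
        x = embedding[t]                          # embedding lookup
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)    # simple RNN step (stand-in for an LSTM)
        states.append(h)
    return np.stack(states), h

short_states, short_final = encode([3, 7, 1, 4])
long_states, long_final = encode(list(range(15)))

# The final state has the same size regardless of input length
print(short_final.shape, long_final.shape)
```

Both calls print a shape of `(6,)`, which is the fixed-length memory the transcript refers to.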

The decoder is constructed similarly, with an embedding
layer and an LSTM layer. You use the output
word of a step as the input word
for the next step. You also pass the LSTM hidden
state to the next step. You start the input sequence with a start-of-sequence token, denoted as SOS here. The first step outputs C’est as the most probable next word. Then you use C’est as the
input word for the next step and repeat to generate the rest of the sentence,
l’heure du thé.
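
The decoding loop described above can be sketched roughly as follows. This is an illustrative toy, assuming a tanh RNN cell instead of an LSTM, a six-word hypothetical vocabulary, and untrained random weights, so the words it emits are arbitrary. The point is only the control flow: start from SOS, pick the most probable word, and feed it back in.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["<sos>", "<eos>", "c'est", "l'heure", "du", "thé"]   # hypothetical vocabulary
vocab_size, embed_dim, hidden_dim = len(vocab), 8, 6

# Untrained, random parameters: the generated words are therefore arbitrary
embedding = rng.normal(size=(vocab_size, embed_dim))
W_xh = rng.normal(size=(embed_dim, hidden_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_out = rng.normal(size=(hidden_dim, vocab_size))

def greedy_decode(encoder_final_state, max_len=10):
    """Generate token ids one step at a time, feeding each output back in."""
    h = encoder_final_state          # decoder starts from the encoder's state
    token = vocab.index("<sos>")     # first input is the start-of-sequence token
    output = []
    for _ in range(max_len):
        h = np.tanh(embedding[token] @ W_xh + h @ W_hh)
        logits = h @ W_out
        token = int(np.argmax(logits))   # most probable next word
        if vocab[token] == "<eos>":
            break
        output.append(token)
    return output

generated = greedy_decode(rng.normal(size=hidden_dim))
print([vocab[i] for i in generated])
```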

One major limitation of the
traditional seq2seq model is what’s referred to as
the information bottleneck. Since seq2seq uses a
fixed length memory for the hidden states, long sequences
become problematic. This is due to the fact that in traditional
seq2seq models, only a fixed amount of
information can be passed from the encoder to
the decoder no matter how much information is
contained in the input sequence. The power of seq2seq, which allows for inputs and outputs to be different sizes, becomes less effective when
the input sequence is long. The result is lower
model performance as sequence size increases,
and that’s no good.

The issue with having one fixed-size encoder hidden state is that it struggles to compress longer sequences, and it
ends up throttling itself and punishing the decoder, which only wants to make
a good prediction. One workaround is to use
the encoder hidden states for each word instead of trying to smash it all into
one big vector. But this model would have flaws
with memory and context. How could you build a time
and memory efficient model that predicts accurately
from a long sequence? This becomes possible if the
model has a way to select and focus on the most important
words at each time step. You can think of this as giving the model a new layer to
process this information, which in the slide
is called attention. If you provide the information specific to each input word, you can give the
model a way to focus its attention in
the right place at each step of the
decoding process. That is good progress.

Up next, you’ll get
a conceptual idea of what this new layer
is doing and why. You now have an overview of
neural machine translation, and you have a rough idea of what attention
looks like. You know which words the
model is focusing on when translating from one
language to another language.

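Seq2seq is a sequence-to-sequence model commonly used for natural language processing tasks such as machine translation and text summarization. It consists of two main parts: an encoder and a decoder.

  1. Encoder: takes the input sequence and converts it into a hidden state vector. The encoder usually uses a recurrent neural network (RNN) or a variant such as a long short-term memory network (LSTM) or gated recurrent unit (GRU) to process the input sequence and capture the information in it.

  2. Decoder: takes the hidden state vector produced by the encoder and uses it to generate the output sequence. The decoder is usually also a recurrent neural network, which predicts the next token from the incoming hidden state and the previously generated tokens. During training, the decoder is fed the correct target token as the input to the next time step (teacher forcing). During inference, the decoder uses the token it generated at the previous time step to generate the next one, until it produces a special end token or reaches the maximum output length.

Seq2seq models have been widely used for many tasks, and their flexibility and power have made them an important tool in natural language processing.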
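
To make the difference between the decoder's training and inference inputs concrete, here is a small sketch of how the training inputs and labels line up when the correct previous token is always fed in. The target sentence and the `<sos>`/`<eos>` token names are assumptions chosen for the example.

```python
# Hypothetical target sentence, already tokenized
target = ["c'est", "l'heure", "du", "thé"]

# Training: the decoder input at each step is the correct previous
# target token, regardless of what the model actually predicted.
decoder_inputs = ["<sos>"] + target
decoder_labels = target + ["<eos>"]

for step, (inp, label) in enumerate(zip(decoder_inputs, decoder_labels)):
    print(f"step {step}: input={inp!r:12} expected output={label!r}")
```

At inference time no labels exist, so the input at each step is instead whatever token the decoder produced at the previous step.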

Seq2seq Model with Attention

Welcome. Attention is a
very important concept that allows you to focus
where the model is looking
when making a prediction. For example, when translating one paragraph from
English to French, you can focus on translating one sentence at a
time or, even more finely, a couple of words at a time. Let’s dive into this concept. What we call attention
now was introduced in a landmark paper from
Dzmitry Bahdanau, KyungHyun Cho, and
Yoshua Bengio. The authors developed a method to fix the seq2seq model’s inability to translate
longer sentences. As you can see, attention was originally developed for
machine translation, but it has since been used in many other domains
with great success. Before we move forward, I want to skip ahead
a bit and show you how well attention works. It’s surprising.

https://arxiv.org/abs/1409.0473

Title: Neural Machine Translation by Jointly Learning to Align and Translate

Here’s a comparison of
the performance between different models from
the Bahdanau paper using the BLEU score, a performance metric that
you’ll learn about later. In brief, higher
scores are better, indicating more
correct translations. The dashed lines show the scores for
bidirectional seq2seq models as the length of the input
sentence is increased. The 30 and 50 denote the maximum sequence length
used to train the models. As you can see, the seq2seq models perform well for sentences with
about 10-20 words, but they fall off beyond that. This is what you should expect, since a seq2seq model
must store the meaning of the entire input sequence
in a single vector. The models developed
in this paper, RNNsearch-30 and RNNsearch-50, use bidirectional encoders and decoders, but with attention. First, these models
perform better than the traditional seq2seq models across all
sentence lengths. The RNNsearch-50 model has basically no fall-off in performance as sentence
lengths increase. As you will see, this is because the models are able to focus on specific inputs to predict words in the output translation, instead of having to memorize
the entire input sentence.

Now I’ll show you the motivation behind attention
and how it works. Traditional seq2seq models use the final hidden state of the encoder as the initial
hidden state of the decoder. This forces the encoder
to store the meaning of the entire input sequence
in this one hidden state.

Instead of using only
the final hidden state, you can pass all the hidden
states to the decoder. However, this quickly
becomes inefficient as you must retain the
hidden states for each input step in memory. To solve this, you can combine the hidden
states into one vector, typically called
the context vector. The simplest operation here
is point-wise addition. Since the hidden vectors
are all the same size, you can just add up
these vectors element by element to produce another
vector of the same size. Now the decoder is getting information
about each step, but it really only
needs information from the first few input steps to predict the first word. This isn’t that much
different from using the last hidden state
from an LSTM or GRU.
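
As a quick illustration of the point-wise addition described above, the sketch below sums a set of made-up encoder hidden states into one context vector. The array sizes and random values are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(4, 6))   # 4 input steps, hidden size 6 (illustrative)

# Unweighted context vector: element-wise sum of all hidden states
context = encoder_states.sum(axis=0)

# Every step contributes equally, so the decoder would receive the same
# context at every output step, whichever word it is about to predict.
print(context.shape)
```

Because every state gets the same weight, nothing in this vector tells the decoder which input step matters for the current output, which is the limitation the transcript points out.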

The solution here is to weight certain encoder vectors more than others before the
point-wise addition. Vectors that are
more important for the next decoder output
would have larger weights. This way, the
context vector holds more information about the most important words and less information
about other words. But how are these
weights calculated to determine which input words
are important at each step? The decoder’s previous
hidden state, denoted as s_{i-1}, contains information
about the previous words in the output translation. This means you can compare
the decoder state with each encoder state to determine the most
important inputs. Intuitively, the decoder can
set the weights such that it focuses on only the
most important input words for the next prediction; it decides which parts of the input sequence
to pay attention to.

Now step into the
attention layer to examine how the weights and context
vector are calculated. The goal of the attention
layer is to return a context vector that contains the relevant information
from the encoder states.

The first step is to
calculate the alignments, e_ij, which is a
score of how well the inputs around j match
the expected output at i. The better the match, the higher the score we would expect. This is done using a
feedforward neural network with the encoder and decoder
hidden states as inputs, where the weights for the
feedforward network are learned along with the rest
of the seq2seq model. The scores are then
turned into weights, which range from zero to one,
using the softmax function. This means the weights
can be thought of as a probability distribution
that sums to one.

Finally, each encoder
state is multiplied by its respective weight and summed together into one
context vector. Since the weights are a
probability distribution, this is equivalent
to calculating an expected value
across word alignments.

Next up, you’ll get a better understanding
of how all this works by implementing a simple version of the attention operation
from the Bahdanau paper. I have now shown
you how attention works and why it is important. In the next video, I will define what keys, queries, and values are, and show you how to use
them in attention.

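An improved version of the seq2seq model is the seq2seq model with an attention mechanism. In the traditional seq2seq model, the encoder encodes the entire input sequence into one fixed-length vector, which the decoder then uses to generate the output sequence. However, this fixed-length representation can lose important information from the input sequence, especially when the input is long.

Introducing an attention mechanism addresses this problem. Attention lets the decoder "attend to" different parts of the input sequence each time it generates an output token, assigning different attention weights as needed. In this way, the decoder can dynamically and selectively focus on different parts of the input depending on the token it is currently generating, and so better capture the important information in the input sequence.

Specifically, a seq2seq model with attention includes the following key components:

  1. Encoder: as in the traditional seq2seq model, encodes the input sequence into a series of hidden state vectors.

  2. Decoder: as in the traditional seq2seq model, uses the encoder’s final hidden state vector as its initial hidden state and generates the output sequence.

  3. Attention mechanism: at each decoder time step, computes attention weights that are used to weight the encoder’s hidden state vectors and produce a context vector. This context vector is combined with the decoder’s current hidden state vector to generate the output for the current time step.

Seq2seq models with attention usually perform better at handling long sequences and capturing local dependencies within a sequence, because they can dynamically focus on different parts of the input as needed when generating each output token. This has made them one of the preferred models for many sequence-to-sequence tasks, such as machine translation and text summarization.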

Ungraded Lab: Basic Attention

Basic Attention Operation: Ungraded Lab

As you’ve learned, attention allows a seq2seq decoder to use information from each encoder step instead of just the final encoder hidden state. In the attention operation, the encoder outputs are weighted based on the decoder hidden state, then combined into one context vector. This vector is then used as input to the decoder to predict the next output step.

In this ungraded lab, you’ll implement a basic attention operation as described in Bahdanau, et al. (2014) using Numpy.

This is a practice notebook, where you can practice writing your code. All of the solutions are provided at the end of the notebook.

# Import the libraries and define the functions you will need for this lab
import numpy as np

def softmax(x, axis=0):
    """ Calculate softmax function for an array x along specified axis
    
        axis=0 calculates softmax across rows which means each column sums to 1 
        axis=1 calculates softmax across columns which means each row sums to 1
    """
    return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)

1: Calculating alignment scores

The first step is to calculate the alignment scores. This is a measure of similarity between the decoder hidden state and each encoder hidden state. From the paper, this operation looks like

$$e_{ij} = v_a^\top \tanh{\left(W_a s_{i-1} + U_a h_j\right)}$$

where $W_a \in \mathbb{R}^{n\times m}$, $U_a \in \mathbb{R}^{n \times m}$, and $v_a \in \mathbb{R}^m$ are the weight matrices and $n$ is the hidden state size. In practice, this is implemented as a feedforward neural network with two layers, where $m$ is the size of the layers in the alignment network.

Here $h_j$ are the encoder hidden states for each input step $j$ and $s_{i - 1}$ is the decoder hidden state of the previous step. The first layer corresponds to $W_a$ and $U_a$, while the second layer corresponds to $v_a$.

To implement this, first concatenate the encoder and decoder hidden states to produce an array with size $K \times 2n$ where $K$ is the number of encoder states/steps. For this, use np.concatenate. Note that there is only one decoder state so you’ll need to reshape it to successfully concatenate the arrays. The easiest way is to use decoder_state.repeat to match the hidden state array size.

Then, apply the first layer as a matrix multiplication between the weights and the concatenated input. Use the tanh function to get the activations. Finally, compute the matrix multiplication of the second layer weights and the activations. This returns the alignment scores.
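
It may not be obvious why concatenating the states and using a single first-layer matrix matches the equation with separate $W_a$ and $U_a$. The sketch below checks this numerically under a row-vector convention (states as rows, weights applied on the right). The random values and the names `stacked`, `scores_separate`, and `scores_concat` are assumptions for illustration and are not part of the lab.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, K = 16, 10, 5                     # hidden size, alignment size, input length

h = rng.normal(size=(K, n))             # encoder states h_j, one per row
s = rng.normal(size=(1, n))             # previous decoder state s_{i-1}

U_a = rng.normal(size=(n, m))           # applied to the encoder states
W_a = rng.normal(size=(n, m))           # applied to the decoder state
v_a = rng.normal(size=(m, 1))           # second layer

# Form 1: two separate projections, as written in the equation
scores_separate = np.tanh(s @ W_a + h @ U_a) @ v_a

# Form 2: concatenate [h_j, s_{i-1}] and use one stacked weight matrix
stacked = np.vstack([U_a, W_a])                          # shape (2n, m)
inputs = np.concatenate([h, s.repeat(K, axis=0)], axis=1)
scores_concat = np.tanh(inputs @ stacked) @ v_a

print(np.allclose(scores_separate, scores_concat))
```

The order of the blocks in `stacked` has to follow the order of the concatenation (encoder states first, decoder state second).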

hidden_size = 16
attention_size = 10
input_length = 5

np.random.seed(42)

# Synthetic vectors used to test
encoder_states = np.random.randn(input_length, hidden_size)
decoder_state = np.random.randn(1, hidden_size)

#print(decoder_state.repeat(input_length, axis=0))
# Weights for the neural network, these are typically learned through training
# Use these in the alignment function below as the layer weights
layer_1 = np.random.randn(2 * hidden_size, attention_size)
layer_2 = np.random.randn(attention_size, 1)

# Implement this function. Replace None with your code. Solution at the bottom of the notebook
def alignment(encoder_states, decoder_state):
    # First, concatenate the encoder states and the decoder state
    inputs = np.concatenate((encoder_states, decoder_state.repeat(input_length, axis=0)), axis=1)
    assert inputs.shape == (input_length, 2 * hidden_size)
    
    # Matrix multiplication of the concatenated inputs and layer_1, with tanh activation
    activations = np.tanh(np.dot(inputs, layer_1))
    assert activations.shape == (input_length, attention_size)
    
    # Matrix multiplication of the activations with layer_2. Remember that you don't need tanh here
    scores = np.dot(activations, layer_2)
    assert scores.shape == (input_length, 1)
    
    return scores
# Run this to test your alignment function
scores = alignment(encoder_states, decoder_state)
print(scores)

Output

[[4.35790943]
 [5.92373433]
 [4.18673175]
 [2.11437202]
 [0.95767155]]

If you implemented the function correctly, you should get these scores:

[[4.35790943]
 [5.92373433]
 [4.18673175]
 [2.11437202]
 [0.95767155]]

2: Turning alignment into weights

The next step is to calculate the weights from the alignment scores. These weights determine the encoder outputs that are the most important for the decoder output. These weights should be between 0 and 1. You can use the softmax function (which is already implemented above) to get these weights from the attention scores. Pass the attention scores vector to the softmax function to get the weights. Mathematically,

$$\alpha_{ij} = \frac{\exp{\left(e_{ij}\right)}}{\sum_{k=1}^K \exp{\left(e_{ik}\right)}}$$
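
As a small worked example of this formula, the sketch below applies a softmax to the alignment scores printed earlier in this lab. The `stable_softmax` helper is an assumption of this example rather than the lab's function: it subtracts the maximum before exponentiating, a common way to avoid overflow that leaves the result unchanged.

```python
import numpy as np

def stable_softmax(x, axis=0):
    """Softmax that subtracts the maximum first to avoid overflow in exp."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

# Alignment scores printed earlier in this lab
scores = np.array([[4.35790943],
                   [5.92373433],
                   [4.18673175],
                   [2.11437202],
                   [0.95767155]])

weights = stable_softmax(scores, axis=0)
print(weights.ravel())
print(weights.sum())   # sums to 1 up to floating-point rounding
```

The second encoder step has the highest score, so it also receives the largest weight.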

3: Weight the encoder output vectors and sum

The weights tell you the importance of each input word with respect to the decoder state. In this step, you use the weights to modulate the magnitude of the encoder vectors. Words with little importance will be scaled down relative to important words. Multiply each encoder vector by its respective weight to get the alignment vectors, then sum up the weighted alignment vectors to get the context vector. Mathematically,

$$c_i = \sum_{j=1}^K\alpha_{ij} h_{j}$$
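
The sum in this formula can be written either as an explicit loop over input steps or as one matrix product. The sketch below compares the two on made-up data; the weights here are random placeholders rather than real attention weights.

```python
import numpy as np

rng = np.random.default_rng(4)
K, n = 5, 16
encoder_states = rng.normal(size=(K, n))

# Any non-negative weights that sum to 1 will do for this illustration
alpha = rng.random(size=(K, 1))
alpha = alpha / alpha.sum()

# Explicit sum over input steps j, as in the formula
context_loop = np.zeros(n)
for j in range(K):
    context_loop += alpha[j, 0] * encoder_states[j]

# Same result written as a single matrix product
context_matmul = (alpha.T @ encoder_states).ravel()

print(np.allclose(context_loop, context_matmul))
```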

Implement these steps in the attention function below.

# Implement this function. Replace None with your code.
def attention(encoder_states, decoder_state):
    """ Example function that calculates attention, returns the context vector 
    
        Arguments:
        encoder_states: NxM numpy array, where N is the number of encoder states and M is the hidden size
        decoder_state: 1xM numpy array, M is the hidden size and must be the same M as encoder_states
    """ 
    
    # First, calculate the alignment scores
    scores = alignment(encoder_states, decoder_state)
    
    # Then take the softmax of the alignment scores to get a weight distribution
    weights = softmax(scores) # 5x1
    
    # Multiply each encoder state by its respective weight
    weighted_scores = encoder_states * weights # broadcasting: element-wise product of 5x16 and 5x1, the latter is expanded to 5x16
    print(weighted_scores.shape)
    #print(weighted_scores)

    
    # Sum up weighted alignment vectors to get the context vector and return it
    context = np.sum(weighted_scores, axis=0)
    return context

context_vector = attention(encoder_states, decoder_state)
print(context_vector)

Output

(5, 16)
[-0.63514569  0.04917298 -0.43930867 -0.9268003   1.01903919 -0.43181409
  0.13365099 -0.84746874 -0.37572203  0.18279832 -0.90452701  0.17872958
 -0.58015282 -0.58294027 -0.75457577  1.32985756]

If you implemented the attention function correctly, the context vector should be

[-0.63514569  0.04917298 -0.43930867 -0.9268003   1.01903919 -0.43181409
  0.13365099 -0.84746874 -0.37572203  0.18279832 -0.90452701  0.17872958
 -0.58015282 -0.58294027 -0.75457577  1.32985756]

See below for solutions

# Solution
def alignment(encoder_states, decoder_state):
    # First, concatenate the encoder states and the decoder state.
    inputs = np.concatenate((encoder_states, decoder_state.repeat(input_length, axis=0)), axis=1)
    assert inputs.shape == (input_length, 2*hidden_size)
    
    # Matrix multiplication of the concatenated inputs and the first layer, with tanh activation
    activations = np.tanh(np.matmul(inputs, layer_1))
    assert activations.shape == (input_length, attention_size)
    
    # Matrix multiplication of the activations with the second layer. Remember that you don't need tanh here
    scores = np.matmul(activations, layer_2)
    assert scores.shape == (input_length, 1)
    
    return scores

# Run this to test your alignment function
scores = alignment(encoder_states, decoder_state)
print(scores)
# Solution
def attention(encoder_states, decoder_state):
    """ Example function that calculates attention, returns the context vector 
    
        Arguments:
        encoder_states: NxM numpy array, where N is the number of encoder states and M is the hidden size
        decoder_state: 1xM numpy array, M is the hidden size and must be the same M as encoder_states
    """ 
    
    # First, calculate the alignment scores with the feedforward alignment network
    scores = alignment(encoder_states, decoder_state)
    
    # Then take the softmax of those scores to get a weight distribution
    weights = softmax(scores)
    
    # Multiply each encoder state by its respective weight
    weighted_scores = encoder_states * weights
    
    # Sum up the weighted encoder states
    context = np.sum(weighted_scores, axis=0)
    
    return context

context_vector = attention(encoder_states, decoder_state)
print(context_vector)

Background on seq2seq

Recurrent models typically take in a sequence in the order it is written and use that to output a sequence. Each element in the sequence is associated with its step in computation time $t$ (i.e. if a word is the third element, it will be computed at $t_3$). These models generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$.

The sequential nature of models you learned in the previous course (RNNs, LSTMs, GRUs) does not allow for parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. In other words, if you rely on sequences and you need to know the beginning of a text before being able to compute something about the ending of it, then you can not use parallel computing. You would have to wait until the initial computations are complete. This is not good, because if your text is too long, then 1) it will take a long time for you to process it and 2) you will lose a good amount of information mentioned earlier in the text as you approach the end.
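
The contrast between sequential and parallel computation can be sketched as follows. This is an illustrative toy with random inputs and weights and a plain tanh RNN cell, all of which are assumptions: the recurrent loop must run step by step, while the attention-style scores for all positions come from a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(5)
T, d = 6, 4                              # sequence length and vector size (illustrative)
x = rng.normal(size=(T, d))
W_x = rng.normal(size=(d, d)) * 0.5
W_h = rng.normal(size=(d, d)) * 0.5

# Recurrent computation: step t cannot start before step t-1 has finished
h = np.zeros(d)
hidden_states = []
for t in range(T):
    h = np.tanh(x[t] @ W_x + h @ W_h)
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)

# Attention-style computation: all pairwise scores come from one matrix
# product, with no dependency between positions
scores = x @ x.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
attended = weights @ x

print(hidden_states.shape, attended.shape)
```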

Therefore, attention mechanisms have become critical for sequence modeling in various tasks, allowing modeling of dependencies without caring too much about their distance in the input or output sequences.

In this course, you will learn about these attention mechanisms and see how they are implemented. Welcome to Course 4!

Queries, Keys, Values, and Attention

Queries, keys, and values are terms
that you will be using for attention. In this video, I will define them for you and
show you how they can be used. Let’s get started. The original attention paper
was published in 2014. Since then there have been multiple
variations on attention, including some models that don’t rely on
recurrent neural networks. For example, the 2017 paper Attention Is All You Need
introduced the Transformer model and a form of attention based on information
retrieval, using queries, keys, and values. This is an efficient and powerful form
of attention that you’ll be using in this week’s assignment. In this video, I’ll show you how this type of attention
works as well as the concept of alignments between languages. Conceptually, you can think of keys and
values as a lookup table. The query is matched to a key and the value associated with
that key is returned. For example,
if we are translating between French and English, heure matches with time, so we’d like to get the value for
time. In practice, the queries, keys, and
values are all represented by vectors, embedding vectors for example.

Due to this, you don’t get exact matches
but the model can learn which words are the most similar between
the source and target languages. The similarity between
words is called alignment. The query and key vectors are used
to calculate alignment scores that are measures of how well the query and
keys match. These alignment scores are then
turned into weights used for a weighted sum of the value vectors, this weighted sum of the value vectors
is returned as the attention vector.
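
The lookup-table analogy can be made concrete with a small sketch that contrasts an exact dictionary lookup with a "soft" lookup over vectors. The two-dimensional key, value, and query vectors are hand-made toy numbers, not real embeddings.

```python
import numpy as np

# Hard lookup: the query must match a key exactly
table = {"heure": "time", "thé": "tea"}
print(table["heure"])

# Soft lookup: queries, keys, and values are vectors (toy 2-d numbers here)
keys = np.array([[1.0, 0.0],      # key associated with "time"
                 [0.0, 1.0]])     # key associated with "tea"
values = np.array([[10.0, 0.0],   # value for "time"
                   [0.0, 10.0]])  # value for "tea"
query = np.array([0.9, 0.1])      # a query vector close to the "time" key

scores = keys @ query                              # alignment scores
weights = np.exp(scores) / np.exp(scores).sum()    # softmax over the keys
attention_vector = weights @ values                # weighted sum of the values

print(weights)
print(attention_vector)
```

Instead of returning exactly one value, the soft lookup returns a blend of all values, dominated by the one whose key best matches the query.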

This process can be performed
using scaled dot-product attention. The queries for each step are packed
together into a matrix Q, so attention can be computed
simultaneously for each query. The keys and values are also
packed into matrices K and V. These matrices are the inputs for the
attention function, shown as a diagram on the left and mathematically on the right. First, the queries and keys matrices are multiplied together
to get a matrix of alignment scores. These are then scaled by the square
root of the key vector dimension, d_k. The scaling improves
the model performance for larger model sizes and could be
seen as a regularization constant. Next, the scaled scores are converted to
weights using the softmax function, such that the weights for
each query sum to one. Finally, the weights and the value matrices
are multiplied to get the attention vectors for each query. You can think of
the keys and the values as being the same. So when you multiply the softmax
output with V, you are taking a linear combination of your initial input, which
is then being fed to the decoder. Take a minute to make sure
what I just said makes sense.
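
The steps just described (multiply, scale, softmax, multiply) can be sketched in a few lines of NumPy. This is a minimal illustration on random matrices, with the function name `scaled_dot_product_attention` and all sizes chosen for the example; it is not the assignment's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D arrays Q, K, V."""
    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_queries, n_keys)
    scores = scores - scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(6)
Q = rng.normal(size=(3, 8))   # 3 queries (e.g. decoder steps)
K = rng.normal(size=(5, 8))   # 5 keys (e.g. encoder steps)
V = rng.normal(size=(5, 8))   # one value per key

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Each of the 3 queries gets its own row of 5 weights and its own attention vector, all computed at once.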

Note that unlike the original form of
attention, scaled dot-product attention consists of only two matrix
multiplications and no neural networks. Since matrix multiplication is highly
optimized in modern deep learning frameworks, this form of attention is
much faster to compute. But this also means that the alignments
between the source and target languages must
be learned elsewhere. Typically, alignment is learned
in the input embeddings or in other linear layers
before the attention layer.

Before moving on,
I want to look a bit closer at alignment. The alignment weights form a matrix with
queries, or target words, on the rows and keys, or source words, on the columns. Each entry in this matrix is
the weight for the corresponding query-key pair. Word pairs that have similar
meanings, K and T, for example, will have larger weights than
dissimilar words like day and time. Through training, the model learns
which words have similar meanings and encodes that information in
the query and key vectors.
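
To show what such an alignment matrix looks like, here is a toy sketch with target words on the rows and source words on the columns. The word lists and the hand-made two-dimensional vectors are assumptions, picked so that each translation pair points in a similar direction; real query and key vectors are learned.

```python
import numpy as np

source_words = ["heure", "thé"]          # keys (columns)
target_words = ["time", "tea"]           # queries (rows)

# Hand-made toy vectors, chosen so that translations point the same way
key_vectors = np.array([[2.0, 0.0],
                        [0.0, 2.0]])
query_vectors = np.array([[1.8, 0.2],
                          [0.1, 1.9]])

scores = query_vectors @ key_vectors.T / np.sqrt(key_vectors.shape[1])
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# For each target word, report the source word with the largest weight
for i, row in enumerate(weights):
    best = source_words[int(np.argmax(row))]
    print(f"{target_words[i]} -> {best} (weight {row.max():.2f})")
```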

Learning alignment like
this is beneficial for translating between languages with
different grammatical structures. Since attention looks at the entire
input and target sentences at once and calculates alignments based on word pairs, weights are assigned appropriately
regardless of word order. For example, in the sentence “The
agreement on the European Economic Area was signed in August 1992” and its French
translation “L’accord sur la zone économique européenne a été signé en août 1992,”
you can see that zone and Area are at different positions
yet have the same meaning. The model has learned to align them
appropriately, allowing the decoder to focus on the appropriate input
words despite different ordering.
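
One way to see why word order does not get in the way is that the bare attention operation gives the same result when the source positions are shuffled, as long as keys and values are shuffled together. The sketch below checks this on random matrices; the `attend` helper and the sizes are assumptions for the example.

```python
import numpy as np

def attend(Q, K, V):
    """Scaled dot-product attention for 2-D arrays."""
    scores = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V

rng = np.random.default_rng(7)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))

perm = rng.permutation(6)                 # shuffle the source positions
same = np.allclose(attend(Q, K, V), attend(Q, K[perm], V[perm]))
print(same)
```

This holds for the attention operation in isolation. In a full model the key and value vectors themselves usually carry some positional information, so the sketch should be read as an intuition rather than a statement about complete translation systems.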

Congrats on absorbing
all these new concepts. I introduced you to the purpose
of an attention layer. You saw how it is related to
information retrieval, and I showed you how well it works even for
languages with very different structures. In the next video, I’ll be talking
about neural machine translation and show you what the setup looks like for
the system. I’ll show you what the data set looks
like and the steps required for preprocessing your data sets. You have now seen what keys, queries, and
values are. These are important because if
you read a research paper you might come across these terms and
you will understand them. In the next video, I will talk about the setup for
machine translation.

Ungraded Lab: Scaled Dot-Product Attention

Scaled Dot-Product Attention: Ungraded Lab

The 2017 paper Attention Is All You Need introduced the Transformer model and scaled dot-product attention, sometimes also called QKV (Queries, Keys, Values) attention. Since then, Transformers have come to dominate large-scale natural language applications. Scaled dot-product attention can be used to improve seq2seq models as well. In this ungraded lab, you’ll implement a simplified version of scaled dot-product attention and replicate word alignment between English and French, as shown in Bahdanau, et al. (2014).

The Transformer model learns how to align words in different languages. You won’t be training any weights here, so instead you will use pre-trained aligned word embeddings. Run the cell below to load the embeddings and set up the rest of the notebook.

This is a practice notebook, where you can practice writing your code. All of the solutions are provided at the end of the notebook.

# Import the libraries
import pickle
import matplotlib.pyplot as plt
import numpy as np

# Load the word2int dictionaries
with open("./data/word2int_en.pkl", "rb") as f:
    en_words = pickle.load(f)
    
with open("./data/word2int_fr.pkl", "rb") as f:
    fr_words = pickle.load(f)

# Load the word embeddings
en_embeddings = np.load("./data/embeddings_en.npz")["embeddings"]
fr_embeddings = np.load("./data/embeddings_fr.npz")["embeddings"]
# Define some helper functions

def tokenize(sentence, token_mapping):
    tokenized = []
    
    for word in sentence.lower().split(" "):
        try:
            tokenized.append(token_mapping[word])
        except KeyError:
            # Using -1 to indicate an unknown word
            tokenized.append(-1)
        
    return tokenized


def embed(tokens, embeddings):
    embed_size = embeddings.shape[1]
    
    output = np.zeros((len(tokens), embed_size))
    for i, token in enumerate(tokens):
        if token == -1:
            output[i] = np.zeros((1, embed_size))
        else:
            output[i] = embeddings[token]
            
    return output

The scaled-dot product attention consists of two matrix multiplications and a softmax scaling as shown in the diagram below from Vaswani, et al. (2017). It takes three input matrices, the queries, keys, and values.

在这里插入图片描述

Mathematically, this is expressed as

$$\large \mathrm{Attention}\left(Q, K, V\right) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the queries, keys, and values matrices respectively, and $d_k$ is the dimension of the keys. In practice, Q, K, and V all have the same dimensions. This form of attention is faster and more space-efficient than what you implemented before since it consists of only matrix multiplications instead of a learned feed-forward layer.

Conceptually, the first matrix multiplication is a measure of the similarity between the queries and the keys. This is transformed into weights using the softmax function. These weights are then applied to the values with the second matrix multiplication resulting in output attention vectors. Typically, decoder states are used as the queries while encoder states are the keys and values.

Exercise 1

Implement the softmax function with Numpy and use it to calculate the weights from the queries and keys. Assume the queries and keys are 2D arrays (matrices). Note that since the dot-product of Q and K will be a matrix, you’ll need to calculate softmax over a specific axis. See the end of the notebook for solutions.

def softmax(x, axis=0):    
    """ Calculate softmax function for an array x

        axis=0 calculates softmax across rows which means each column sums to 1 
        axis=1 calculates softmax across columns which means each row sums to 1
    """
    y = np.exp(x)
    return y / np.expand_dims(np.sum(y, axis=axis), axis)

def calculate_weights(queries, keys):
    """ Calculate the weights for scaled dot-product attention"""
    # Similarity scores between queries and keys, scaled by sqrt(d_k)
    dot = np.dot(queries, keys.T) / np.sqrt(keys.shape[1])
    # Normalize each row so the weights over the keys sum to 1
    weights = softmax(dot, axis=1)
    
    # Compare with a tolerance: exact float equality can fail due to rounding
    assert np.allclose(weights.sum(axis=1), 1), "Each row in weights must sum to 1"
    
    return weights

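In this code, np.sum(y, axis=axis) computes the sum of the array y along the given axis. np.expand_dims() then adds a dimension of length 1 back at that axis, so the result has the same number of dimensions as y.

Concretely, suppose y is a 2D array and axis=1. np.sum(y, axis=1) sums each row and returns a 1D array of shape (y.shape[0],). np.expand_dims(np.sum(y, axis=1), axis=1) then expands this array along the second axis, giving a 2D array of shape (y.shape[0], 1) whose entries are still the row sums.

This is the usual way to divide every element by the total of its row (or column) when computing softmax, so that each row (or column) sums to 1. Softmax outputs are normally interpreted as a probability distribution, which is why each row (or column) must sum to 1.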
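The same normalization can also be written with `keepdims=True`, which keeps the summed axis as a length-1 dimension so no separate `np.expand_dims` call is needed. The sketch below is an illustration, not part of the lab: `stable_softmax` is a hypothetical name, and subtracting the per-row maximum before exponentiating is a common addition (assumed here, not in the lab code) that avoids overflow for large scores without changing the result.

```python
import numpy as np

def stable_softmax(x, axis=1):
    """Softmax along `axis`, written with keepdims instead of expand_dims."""
    # Subtracting the max along the axis leaves the softmax unchanged
    # but keeps np.exp from overflowing on large inputs.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    y = np.exp(shifted)
    # keepdims=True yields shape (rows, 1) for axis=1, so broadcasting works.
    return y / np.sum(y, axis=axis, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])
weights = stable_softmax(scores, axis=1)

# Both ways of restoring the summed axis give the same shape.
summed_keepdims = np.sum(scores, axis=1, keepdims=True)
summed_expanded = np.expand_dims(np.sum(scores, axis=1), 1)
print(summed_keepdims.shape, summed_expanded.shape)
print(weights.sum(axis=1))
```

With the unshifted version, `np.exp(1000.0)` overflows to infinity and the second row becomes NaN, which is the main reason for the shift.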

# Tokenize example sentences in English and French, then get their embeddings
sentence_en = "The agreement on the European Economic Area was signed in August 1992 ."
tokenized_en = tokenize(sentence_en, en_words)
embedded_en = embed(tokenized_en, en_embeddings)

sentence_fr = "L accord sur la zone économique européenne a été signé en août 1992 ."
tokenized_fr = tokenize(sentence_fr, fr_words)
embedded_fr = embed(tokenized_fr, fr_embeddings)

# These weights indicate alignment between words in English and French
alignment = calculate_weights(embedded_fr, embedded_en)

# Visualize weights to check for alignment
fig, ax = plt.subplots(figsize=(7,7))
ax.imshow(alignment, cmap='gray')
ax.xaxis.tick_top()
ax.set_xticks(np.arange(alignment.shape[1]))
ax.set_xticklabels(sentence_en.split(" "), rotation=90, size=16);
ax.set_yticks(np.arange(alignment.shape[0]));
ax.set_yticklabels(sentence_fr.split(" "), size=16);

If you implemented the weights calculations correctly, the alignment matrix should look like this:

在这里插入图片描述

This is a demonstration of alignment where the model has learned which words in English correspond to words in French. For example, the words signed and signé have a large weight because they have the same meaning. Typically, these alignments are learned using linear layers in the model, but you’ve used pre-trained embeddings here.

Exercise 2

Complete the implementation of scaled dot-product attention using your calculate_weights function (ignore the mask).

def attention_qkv(queries, keys, values):
    """ Calculate scaled dot-product attention from queries, keys, and values matrices """
    
    # Replace pass with your code.
    attention = np.dot(calculate_weights(queries, keys), values)
    return attention


attention_qkv_result = attention_qkv(embedded_fr, embedded_en, embedded_en)

print(f"The shape of the attention_qkv function is {attention_qkv_result.shape}")
print(f"Some elements of the attention_qkv function are \n{attention_qkv_result[0:2,:10]}")

Output

The shape of the attention_qkv function is (14, 300)
Some elements of the attention_qkv function are 
[[-0.04039161 -0.00275749  0.00389873  0.04842744 -0.02472726  0.01435613
  -0.00370253 -0.0619686  -0.00206159  0.01615228]
 [-0.04083253 -0.00245985  0.00409068  0.04830341 -0.02479128  0.01447497
  -0.00355203 -0.06196036 -0.00241327  0.01582606]]

Expected output

The shape of the attention_qkv function is (14, 300)

Some elements of the attention_qkv function are

[[-0.04039161 -0.00275749  0.00389873  0.04842744 -0.02472726  0.01435613
  -0.00370253 -0.0619686  -0.00206159  0.01615228]
 [-0.04083253 -0.00245985  0.00409068  0.04830341 -0.02479128  0.01447497
  -0.00355203 -0.06196036 -0.00241327  0.01582606]]

Solutions

def softmax(x, axis=0):
    """ Calculate softmax function for an array x
    
        axis=0 calculates softmax across rows which means each column sums to 1 
        axis=1 calculates softmax across columns which means each row sums to 1
    """
    y = np.exp(x) 
    return y / np.expand_dims(np.sum(y, axis=axis), axis)

def calculate_weights(queries, keys):
    """ Calculate the weights for scaled dot-product attention"""
    dot = np.matmul(queries, keys.T)/np.sqrt(keys.shape[1])
    weights = softmax(dot, axis=1)
    
    assert np.allclose(weights.sum(axis=1), 1), "Each row in weights must sum to 1"
    
    return weights

def attention_qkv(queries, keys, values):
    """ Calculate scaled dot-product attention from queries, keys, and values matrices """
    weights = calculate_weights(queries, keys)
    return np.matmul(weights, values)

Setup for Machine Translation

You will now learn about
how words are being represented in the neural
machine translation setting. You will also see what
the dataset looks like. When implementing
the systems I’ll show you that you need to
keep track of a few things. For example, which words
correspond to what vectors. With that said, let’s dive in. This is an example
of the type of input data that you will have for your
assignments this week. Over here you have the
sequence, I’m hungry, and on the right you have the corresponding
French equivalent. Further down, I watch the soccer game and the
corresponding French equivalent. You’re going to have a
great many of these inputs. You should know
that the state of the art models use
pretrained vectors. But otherwise, the first
thing you’ll do is to use a one-hot vector
to represent the words. Usually you’ll keep track of your mappings with
the word to index, and index to word dictionary. Given any input, you
transform it into indices and then vice versa when you make
the predictions. You’ll also normally use
an end of sequence token. You will pad your token vectors with zeros to match the length of the longest sequence.

在这里插入图片描述

Here’s an example. This is an English sentence and the tokenized version of
the English sentence. You can see that
it has an index of 4,546 for the word both. After the initial tokenization, just add the EOS token,
shown here as one, and pad with zeros to match the length of
the longest sequence. Now let’s go to the
French translation of that sequence along with the tokenized version of
the French translation. Notice that one is the end
of sentence token here too. It’s also followed by a
series of padding zeros. Given now that you know
how to represent words, how to initialize your model, and how to structure
your dataset, you can go ahead and start
training your model. In the next video, I’ll show
you how you can do this.
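The bookkeeping described above (word-to-index and index-to-word dictionaries, an end-of-sequence token, and zero padding) can be sketched in plain Python. This is a minimal illustration under assumptions: the helper names `build_vocab`, `encode_batch`, and `decode` are hypothetical, whitespace splitting stands in for a real tokenizer, and the ids 0 for padding and 1 for EOS follow the lecture example.

```python
PAD_ID = 0  # padding index, as in the lecture example
EOS_ID = 1  # end-of-sequence index, as in the lecture example

def build_vocab(sentences):
    """Build word-to-index and index-to-word dictionaries."""
    word2ind = {"<pad>": PAD_ID, "<eos>": EOS_ID}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word2ind:
                word2ind[word] = len(word2ind)
    ind2word = {index: word for word, index in word2ind.items()}
    return word2ind, ind2word

def encode_batch(sentences, word2ind):
    """Map words to indices, append EOS, and pad to the longest sequence."""
    sequences = [
        [word2ind[word] for word in sentence.lower().split()] + [EOS_ID]
        for sentence in sentences
    ]
    max_len = max(len(seq) for seq in sequences)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in sequences]

def decode(indices, ind2word):
    """Map indices back to words, stopping at the EOS token."""
    words = []
    for index in indices:
        if index == EOS_ID:
            break
        words.append(ind2word[index])
    return " ".join(words)

sentences = ["I am hungry", "I watched the soccer game"]
word2ind, ind2word = build_vocab(sentences)
batch = encode_batch(sentences, word2ind)
print(batch)
print(decode(batch[0], ind2word))
```

Every row of `batch` has the same length, which is what lets the sequences be stacked into a single array for training.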

Teacher Forcing

Hello. You’ll now learn how to train your neural machine
translation system. You will learn about
certain concepts like teacher forcing, and you’ll see some of its
advantages. Let’s dive in. In this section, you’ll see how to train your neural
machine translation, NMT for short, model
with attention. I’ll introduce you to the
concept of teacher forcing. As you learned before, seq2seq models generate
translations by feeding the output of the decoder
back in as the next inputs. This way there is no set
length on the output sequence. When training the
model, intuitively, you would compare the
decoder output sequence with the target sequence
to calculate the loss. That is, you would calculate the cross entropy
loss for each step, then sum the steps together
for the total loss. However, in practice, this
doesn’t work too well. The problem is that in the
early stages of training, the model is naive. It’ll make wrong predictions
early in the sequence. This problem compounds as the model keeps making
wrong predictions and the translated sequence gets further and further from
the target sequence.

在这里插入图片描述

The problem is illustrated
in this slide, where the final
output word, duveteux, corresponds to the
word fluffy in English, which has a very different
meaning from the word tea. To avoid this problem, you can use the
ground truth words as decoder inputs instead
of the decoder outputs. Even if the model makes
a wrong prediction, it pretends as if it’s made the correct one and
this can continue. This method makes training much faster and has a special
name, teacher forcing. There are some
variations on this, too. For example, you can slowly start using decoder
outputs over time, so that late in training, you are no longer feeding
in the target words. This is known as
curriculum learning. You are now familiar
with teacher forcing, and you can add this
technique to your toolbox, to help you with
training your model, and to help you get
a better accuracy.

在这里插入图片描述

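Teacher forcing is a technique for training sequence models such as recurrent neural networks (RNNs). During training, it feeds the ground-truth output of the previous step as the input to the current step, instead of the model's own prediction from the previous step. This speeds up training and improves convergence, especially early in training.

With teacher forcing, the model learns the mapping between input and output sequences more quickly because it directly observes the correct outputs. The method has a drawback, however. At inference time, when teacher forcing is not available, the model has to condition on its own predictions, which it never did during training, so it may perform poorly. This is known as the exposure bias problem.

To address this, you can feed the model's own previous output as the current input with some probability during training, which better simulates what happens at inference time. This is called Scheduled Sampling. Gradually increasing the probability of using the model's own outputs reduces the mismatch between training and inference and improves performance at inference time.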
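The difference between the two input schemes can be sketched as follows. This is a simplified illustration, not the course implementation: the function names and `SOS_ID` are hypothetical, and `predicted_ids` is passed in as a fixed list, whereas in a real decoder each prediction would be generated step by step from the inputs chosen so far.

```python
import numpy as np

SOS_ID = 2  # hypothetical start-of-sentence index

def teacher_forcing_inputs(target_ids, sos_id=SOS_ID):
    """Shift the target right: the decoder always sees the ground truth."""
    return [sos_id] + list(target_ids[:-1])

def scheduled_sampling_inputs(target_ids, predicted_ids, sampling_prob, rng,
                              sos_id=SOS_ID):
    """With probability `sampling_prob`, feed the model's previous prediction
    instead of the ground-truth token."""
    inputs = [sos_id]
    for t in range(1, len(target_ids)):
        if rng.random() < sampling_prob:
            inputs.append(predicted_ids[t - 1])  # model output
        else:
            inputs.append(target_ids[t - 1])     # ground truth
    return inputs

target = [11, 12, 13, 1]     # made-up token ids ending in EOS
predicted = [11, 99, 98, 1]  # made-up model predictions
rng = np.random.default_rng(0)

print(teacher_forcing_inputs(target))
print(scheduled_sampling_inputs(target, predicted, 0.5, rng))
```

Setting `sampling_prob` to 0 reproduces pure teacher forcing, and raising it over the course of training moves the decoder toward the conditions it faces at inference time.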

NMT Model with Attention

Welcome. I will now
show you how to train a neural machine
translation system from scratch. I’ll go through every step
slowly so you can understand what is going on behind the
scenes. Let’s get started. In this video, I’ll show you how everything you have
seen this week fits together into the
model architecture you will implement in
this week’s assignments. First, I’ll give you
a general overview before I go into the
more intricate details. You will implement
a model similar to the one you have seen
in previous lectures. You will have an encoder that
gets the input sequence, a decoder which is supposed
to do the translation, and an Attention Mechanism
which would help the decoder focus on the important parts of
the input sequence. Recall that the decoder
is supposed to pass hidden states to the
Attention Mechanism to get context vectors. Passing the hidden
states from the decoder to the Attention Mechanism may
not be easy to implement. Instead, you will be
using two decoders, a pre-attention decoder
to provide hidden states, and a post-attention decoder which will provide
the translation.

在这里插入图片描述

A general overview of the modified model
looks as follows. You will have the encoder
and a pre-attention decoder that get the input
and target sequences. Then for the
pre-attention decoder, the target sequence
is shifted right, which is how you’ll be
implementing the teacher forcing. From the encoder and
pre-attention decoder, you will retrieve
the hidden states at each step and use them as inputs for the
Attention Mechanism. You will use the
hidden states from the encoder as the
keys and values, while those from the
decoder are the queries. As you have seen in
previous lectures, the Attention Mechanism will use these values to compute
the context vectors. Finally, the post-attention
decoder will use the context vectors as inputs to provide the
predicted sequence.

在这里插入图片描述

Now, let’s take a closer look at each piece of the model. The initial step is
to make two copies of the input tokens and
the target tokens because you will need them in different places of the model. One copy of the input tokens
is fed into the encoder, which is used to transform them into the key
and value vectors, while a copy of
the target tokens goes into the
pre-attention decoder. Note that the
computations done in the encoder and
pre-attention decoder could be done in parallel, since they don’t
depend on each other. Within the
pre-attention decoder, you shift each
sequence to the right and add a start of
sentence token. In the encoder and
pre-attention decoder, the inputs and
targets go through an embedding layer
before going to LSTMs. After getting the query,
key and value vectors, you have to prepare them
for the attention layer. You’ll use a function
to help you get a padding mask to help the attention layer determine
the padding tokens. This step is where you will use the copy of
the input tokens. Now, everything is
ready for attention. You pass the queries,
keys, values, and the mask to the
attention layer that outputs the context
vector and the mask. Before going through the
decoder, you drop the mask. You then pass the
context vectors through the decoder composed of an LSTM, a dense layer, and a LogSoftmax. In the end, your model returns log probabilities and the copy of the target tokens that
you made at the beginning. There you have it,
the model you’ll be building and the intuition
behind all the steps. Take a break and just
let all that sink in. You now have an overview
of how NMT is implemented. If you did not
understand everything, do not worry about it. We will go in more detail in this week’s programming
assignments. In the next video, I will talk about how to
evaluate your system.
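The attention step in this architecture (decoder states as queries, encoder states as keys and values, plus a padding mask built from the input tokens) can be sketched with NumPy. This is an assumption-laden illustration, not the assignment's code: `masked_attention` and `input_ids` are hypothetical names, random arrays stand in for LSTM hidden states, and padding is assumed to use id 0.

```python
import numpy as np

def masked_attention(queries, keys, values, input_ids, pad_id=0):
    """Scaled dot-product attention that ignores padded encoder positions."""
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    # True where the encoder position holds a real token, False for padding.
    mask = np.asarray(input_ids) != pad_id
    # A very negative score makes the softmax weight of padded positions 0.
    scores = np.where(mask[None, :], scores, -1e9)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # One context vector per decoder position.
    return weights @ values, weights

rng = np.random.default_rng(0)
input_ids = [7, 12, 1, 0, 0]             # made-up input: tokens, EOS, padding
encoder_states = rng.normal(size=(5, 4)) # stand-in for encoder hidden states
decoder_states = rng.normal(size=(3, 4)) # stand-in for pre-attention decoder states

context, weights = masked_attention(
    decoder_states, encoder_states, encoder_states, input_ids
)
print(context.shape)
print(weights.round(3))
```

The context vectors would then go to the post-attention decoder, and the last two columns of `weights` are zero because those encoder positions are padding.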

在这里插入图片描述

BLEU Score

After building and
training your model, it is essential to assess
how well it performs. For machine translation, you have different metrics that were engineered
just for this task. In this lecture, I will
show you the BLEU score and some of its issues
for evaluating machine translation models. The BLEU score, or bilingual
evaluation understudy, is an algorithm designed
to evaluate some of the most challenging problems in NLP, including
machine translation. It evaluates the quality of
machine-translated text by comparing a candidate
translation to one or more references, which are often
human translations. The closer the BLEU
score is to one, the better your model is, the closer to zero,
the worse it is.

在这里插入图片描述

With that said, what is the BLEU score and why is
this an important metric? To get the BLEU score, you have to compute the
precision of the candidates by comparing its n-grams
with reference translations. To demonstrate, I’ll use
unigrams as an example. Let’s say that you have a
candidate sequence that you got from your model
composed of I, I, am, I. You also have one
reference translation which contains the words, Younes said I am hungry, and a second reference translation
that contains the words, he said I am hungry. To get the BLEU score, you count how many words from the candidate appear in any of the references and
divide that count by the total number of words in
the candidate translation. You can view it as
a precision metric.

在这里插入图片描述

You have to go
through all the words in the candidate translation. First, you have the word I, which appears in both
reference translations. You add one to your count. Then you have again the word I, which you already know
appears in both references, and you add one to your count. After that, you have the word am, which also appears
in both references. You add one to your count. At the end, you have
the word I again, which appears in
both references. You can add one to your count. Finally, you can get the
BLEU score by dividing your count by the number of words in the candidate
translation, which in this case
is equal to 4. The whole process gives you
a BLEU score equal to 1. Weird? This translation that is far from being equal to the references got
a perfect score. With this vanilla BLEU score, a model that always outputs
common words will do great.

在这里插入图片描述

Let’s try a modified
version that will give you a better estimate of
your model’s performance. For the modified version
of the BLEU score, after you find a word from the candidates in one or
more of the references, you stop considering
that word from the reference for the following
words in the candidates. In other words, you
exhaust the words in the references after you match them with a word
in the candidates. Let’s start from the beginning of the candidate translation. You have the word I that
appears in both references. You add one to your count and exhaust the word I
from both references. Then you have the word I again, but you don’t have that word
in the references because it was taken out for the
previous word in the candidate. You don’t add anything
to your count. Then you have the word am, which appears in
both references. You add one to your count and eliminate the word am
from both references. After that, you have
the word I again, but no occurrences left
in the references. You don’t add anything
to your counts. Finally, you divide your count
by the number of words in the candidate translation
to get BLEU score of 2/4 or 0.5. As you can note, this version of the BLEU score makes more sense than the vanilla implementation.
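The two counting schemes from the lecture can be compared directly in a few lines. This is a minimal sketch for unigrams only: `unigram_precision` is a hypothetical helper, tokens are lowercased and split on whitespace, and clipping uses the maximum count of each word across the references.

```python
from collections import Counter

def unigram_precision(candidate, references, clip=True):
    """Unigram precision of a candidate against one or more references."""
    cand_counts = Counter(candidate)
    # For each word, the largest number of times it occurs in any reference.
    max_ref_counts = Counter()
    for reference in references:
        for word, count in Counter(reference).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)

    matched = 0
    for word, count in cand_counts.items():
        if clip:
            # Modified version: a reference word can be matched only as
            # many times as it occurs in the references.
            matched += min(count, max_ref_counts[word])
        elif word in max_ref_counts:
            # Vanilla version: every occurrence in the candidate counts.
            matched += count
    return matched / len(candidate)

candidate = "i i am i".split()
references = ["younes said i am hungry".split(), "he said i am hungry".split()]

vanilla = unigram_precision(candidate, references, clip=False)
modified = unigram_precision(candidate, references, clip=True)
print(vanilla, modified)
```

The vanilla count gives 4/4 and the clipped count gives 2/4, matching the values worked out in the lecture.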

在这里插入图片描述

However, like anything in life, using the BLEU score as an evaluation metric
has some caveats. For one, it doesn’t consider the semantic
meaning of the words. It also doesn’t consider the
structure of the sentence. Imagine getting
this translation. Ate I was hungry because. If the reference sentence is
I ate because I was hungry, this would get a
perfect BLEU score. BLEU score is the most widely
adopted evaluation metric for machine translation. But you should be aware of these drawbacks before using it.
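The word-order caveat is easiest to see with unigrams. As a hedged illustration (the helper names below are hypothetical, and this is clipped n-gram precision only, without the brevity penalty), the scrambled candidate from the example keeps a perfect unigram precision, while bigram precision already drops because only some adjacent word pairs survive.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision against a single reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return overlap / sum(cand.values())

candidate = "ate i was hungry because".split()
reference = "i ate because i was hungry".split()

unigram = clipped_precision(candidate, reference, 1)
bigram = clipped_precision(candidate, reference, 2)
print(unigram, bigram)
```

Here the unigram precision is 1.0 and the bigram precision is 0.5, which is why BLEU is normally computed over several n-gram lengths rather than unigrams alone.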

在这里插入图片描述

You now know how to evaluate your machine translation
model using the BLEU score. I also showed you that this
metric has some issues because it doesn’t care about semantics and
sentence structure. In the following video, you’ll see another metric
for machine translation. That metric could be used to better estimate your
model performance.

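BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) are both metrics for evaluating the quality of generated text in natural language processing tasks, but they differ in how they are applied and computed.

  1. Purpose

    • BLEU is mainly used for machine translation, to measure how similar the output of a translation system is to the reference translations.
    • ROUGE is mainly used for text summarization, to measure how similar a generated summary is to the reference summaries.
  2. Computation

    • BLEU scores a candidate by matching its n-grams against the n-grams of the reference translations. It computes n-gram precision and applies a penalty term to overly short translations.
    • ROUGE uses a similar approach, but usually reports recall, because recovering the key information matters more in summarization.
  3. Reported values

    • BLEU is a value between 0 and 1; values closer to 1 mean the candidate is more similar to the references.
    • ROUGE usually comes as several metrics, such as ROUGE-N (n-gram recall) and ROUGE-L (based on the longest common subsequence), each between 0 and 1; higher values mean the generated summary is more similar to the reference summary.

In short, BLEU and ROUGE are both important metrics for evaluating generated text, but they target different tasks and differ in how they are computed and reported.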
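The precision/recall distinction can be made concrete with a small sketch. This is not a full ROUGE implementation (real toolkits add options such as stemming and F-measures); `overlap_scores` is a hypothetical helper that computes clipped n-gram overlap and divides it by the candidate size for a BLEU-style precision and by the reference size for a ROUGE-N-style recall.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def overlap_scores(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram overlap."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)  # BLEU-style
    recall = overlap / max(sum(ref.values()), 1)      # ROUGE-N-style
    return precision, recall

reference = "the cat sat on the mat".split()
candidate = "the cat on the mat".split()

precision, recall = overlap_scores(candidate, reference, n=1)
print(precision, recall)
```

The candidate drops one reference word, so every candidate word is matched (precision 1.0) while recall falls to 5/6. That gap is what a recall-oriented metric is designed to expose.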

Ungraded Lab: BLEU Score

Calculating the Bilingual Evaluation Understudy (BLEU) score: Ungraded Lab

In this ungraded lab, you will implement a popular metric for evaluating the quality of machine-translated text: the BLEU score proposed by Kishore Papineni, et al. in their 2002 paper “BLEU: a Method for Automatic Evaluation of Machine Translation”. The BLEU score works by comparing a “candidate” text to one or more “reference” texts. The score is higher the better the result. In the following sections you will calculate this value using your own implementation as well as using functions from a library.

1. Importing the Libraries

You will start by importing the Python libraries. First, you will implement your own version of the BLEU Score using NumPy. To verify that your implementation is correct, you will compare the results with those generated by the SacreBLEU library. This package provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. It also knows all the standard test sets and handles downloading, processing, and tokenization.

import numpy as np                  # import numpy to make numerical computations.
import nltk                         # import NLTK to handle simple NL tasks like tokenization.
nltk.download("punkt")
from nltk.util import ngrams
from collections import Counter     # import a counter.
!pip3 install 'sacrebleu'           # install the sacrebleu package.
import sacrebleu                    # import sacrebleu in order compute the BLEU score.
import matplotlib.pyplot as plt     # import pyplot in order to make some illustrations.

2. BLEU score

2.1 Definitions and formulas

You have seen how to calculate the BLEU score in this week’s lectures. Formally, you can express the BLEU score as:

$$BLEU = BP\times\Bigl(\prod_{i=1}^{n}precision_i\Bigr)^{(1/n)}.\tag{1}$$

The BLEU score depends on the $BP$, which stands for Brevity Penalty, and the weighted geometric mean precision for different lengths of n-grams, both of which are described below. The product runs from $i=1$ to $i=n$ to account for 1-grams to n-grams and the exponent of $1/n$ is there to calculate the geometrical average. In this notebook, you will use $n=4$.

The Brevity Penalty is defined as an exponential decay:

$$BP = min\Bigl(1, e^{(1-({len(ref)}/{len(cand)}))}\Bigr),\tag{2}$$

where ${len(ref)}$ and ${len(cand)}$ refer to the length or count of words in the reference and candidate translations. The brevity penalty helps to handle very short translations.

The precision is defined as :

$$precision_i = \frac {\sum_{s_i \in{cand}}min\Bigl(C(s_i, cand), C(s_i, ref)\Bigr)}{\sum_{s_i \in{cand}} C(s_i, cand)}.\tag{3}$$

The sum goes over all the i-grams $s_i$ in the candidate sentence $cand$. $C(s_i, cand)$ and $C(s_i, ref)$ are the counts of the i-grams in the candidate and reference sentences respectively. So the sum counts all the n-grams in the candidate sentence that also appear in the reference sentence, but only counts them as many times as they appear in the reference sentence and not more. This is then divided by the total number of i-grams in the candidate sentence.
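To see how equations (1) and (2) combine, here is a small numeric sketch. The precision values and sentence lengths are made up for illustration, and the helper names are hypothetical; the geometric mean is computed as the exponential of the mean log precision, which equals the n-th root of the product in equation (1).

```python
import numpy as np

def brevity_penalty_from_lengths(ref_len, cand_len):
    """Equation (2): penalize candidates shorter than the reference."""
    return min(1.0, float(np.exp(1 - ref_len / cand_len)))

def bleu_from_precisions(precisions, ref_len, cand_len):
    """Equation (1): brevity penalty times the geometric mean precision."""
    geometric_mean = float(np.exp(np.mean(np.log(precisions))))
    return brevity_penalty_from_lengths(ref_len, cand_len) * geometric_mean

precisions = [0.8, 0.7, 0.6, 0.5]  # made-up 1-gram to 4-gram precisions

short_candidate = bleu_from_precisions(precisions, ref_len=14, cand_len=12)
long_candidate = bleu_from_precisions(precisions, ref_len=14, cand_len=16)
print(round(short_candidate, 3), round(long_candidate, 3))
```

A candidate longer than the reference gets a penalty of exactly 1, so its score is just the geometric mean, while the shorter candidate is scaled down by $e^{1-14/12}$.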

2.2 Visualizing the BLEU score
Brevity Penalty:

The brevity penalty penalizes generated translations that are shorter than the reference sentence. It compensates for the fact that the BLEU score has no recall term.

reference_length = 1
candidate_length = np.linspace(1.5, 0.5, 100)

length_ratio = reference_length / candidate_length
BP = np.minimum(1, np.exp(1 - length_ratio))

# Plot the data
fig, ax = plt.subplots(1)
lines = ax.plot(length_ratio, BP)
ax.set(
    xlabel="Ratio of the length of the reference to the candidate text",
    ylabel="Brevity Penalty",
)
plt.show()

Output

在这里插入图片描述

N-Gram Precision:

The n-gram precision counts how many n-grams (in your case unigrams, bigrams, trigrams, and four-grams for $i = 1, \ldots, 4$) match their n-gram counterpart in the reference translations. This term acts as a precision metric. Unigrams account for adequacy while longer n-grams account for fluency of the translation. To avoid overcounting, the n-gram counts are clipped to the maximal n-gram count occurring in the reference ($m_{n}^{ref}$). Typically precision shows exponential decay with the degree of the n-gram.

# Mocked dataset showing the precision for different n-grams
data = {"1-gram": 0.8, "2-gram": 0.7, "3-gram": 0.6, "4-gram": 0.5}

# Plot the datapoints defined above
fig, ax = plt.subplots(1)
bars = ax.bar(*zip(*data.items()))
ax.set(ylabel="N-gram precision")
plt.show()

Output

在这里插入图片描述

N-gram BLEU score:

When the n-gram precision is normalized by the brevity penalty (BP), then the exponential decay of n-grams is almost fully compensated. The BLEU score corresponds to a geometric average of this modified n-gram precision.

# Mocked dataset showing the precision multiplied by the BP for different n-grams
data = {"1-gram": 0.8, "2-gram": 0.77, "3-gram": 0.74, "4-gram": 0.71}

# Plot the datapoints defined above
fig, ax = plt.subplots(1)
bars = ax.bar(*zip(*data.items()))
ax.set(ylabel="Modified N-gram precision")
plt.show()

Output

在这里插入图片描述

3. Example Calculations of the BLEU score

In this example you will have a reference sentence and 2 candidate sentences. You will tokenize all sentences using the NLTK package. Then you will compare the two candidates to the reference using BLEU score.

First you define and tokenize the sentences.

reference = "The NASA Opportunity rover is battling a massive dust storm on planet Mars."
candidate_1 = "The Opportunity rover is combating a big sandstorm on planet Mars."
candidate_2 = "A NASA rover is fighting a massive storm on planet Mars."

tokenized_ref = nltk.word_tokenize(reference.lower())
tokenized_cand_1 = nltk.word_tokenize(candidate_1.lower())
tokenized_cand_2 = nltk.word_tokenize(candidate_2.lower())

print(f"{reference} -> {tokenized_ref}")
print("\n")
print(f"{candidate_1} -> {tokenized_cand_1}")
print("\n")
print(f"{candidate_2} -> {tokenized_cand_2}")

Output

The NASA Opportunity rover is battling a massive dust storm on planet Mars. -> ['the', 'nasa', 'opportunity', 'rover', 'is', 'battling', 'a', 'massive', 'dust', 'storm', 'on', 'planet', 'mars', '.']


The Opportunity rover is combating a big sandstorm on planet Mars. -> ['the', 'opportunity', 'rover', 'is', 'combating', 'a', 'big', 'sandstorm', 'on', 'planet', 'mars', '.']


A NASA rover is fighting a massive storm on planet Mars. -> ['a', 'nasa', 'rover', 'is', 'fighting', 'a', 'massive', 'storm', 'on', 'planet', 'mars', '.']
3.1 Define the functions to calculate the BLEU score

Computing the Brevity Penalty

You will start by defining the function for brevity penalty according to the equation (2) in section 2.1.

def brevity_penalty(candidate, reference):
    """
    Calculates the brevity penalty given the candidate and reference sentences.
    """
    reference_length = len(reference)
    candidate_length = len(candidate)

    if reference_length < candidate_length:
        BP = 1
    else:
        penalty = 1 - (reference_length / candidate_length)
        BP = np.exp(penalty)

    return BP

Computing the clipped Precision

Next, you need to define a function to calculate the geometrically averaged clipped precision. This function calculates how many of the n-grams in the candidate sentence actually appear in the reference sentence. The clipping takes care of overcounting. For example if a certain n-gram appears five times in the candidate sentence, but only twice in the reference, the value is clipped to two.

def average_clipped_precision(candidate, reference):
    """
    Calculates the precision given the candidate and reference sentences.
    """

    clipped_precision_score = []
    
    # Loop through values 1, 2, 3, 4. This is the length of n-grams
    for n_gram_length in range(1, 5):
        reference_n_gram_counts = Counter(ngrams(reference, n_gram_length))        
        candidate_n_gram_counts = Counter(ngrams(candidate, n_gram_length))                

        total_candidate_ngrams = sum(candidate_n_gram_counts.values())       
        
        for ngram in candidate_n_gram_counts: 
            # check if it is in the reference n-gram
            if ngram in reference_n_gram_counts:
                # if the count of the candidate n-gram is bigger than the corresponding
                # count in the reference n-gram, then set the count of the candidate n-gram 
                # to be equal to the reference n-gram
                
                if candidate_n_gram_counts[ngram] > reference_n_gram_counts[ngram]: 
                    candidate_n_gram_counts[ngram] = reference_n_gram_counts[ngram]
                                                   
            else:
                candidate_n_gram_counts[ngram] = 0 # else set the candidate n-gram equal to zero

        clipped_candidate_ngrams = sum(candidate_n_gram_counts.values())
        
        clipped_precision_score.append(clipped_candidate_ngrams / total_candidate_ngrams)
    
    # Calculate the geometric average: take the mean of elementwise log, then exponentiate
    # This is equivalent to taking the n-th root of the product as shown in equation (1) above
    s = np.exp(np.mean(np.log(clipped_precision_score)))
    
    return s


Explanation of reference_n_gram_counts = Counter(ngrams(reference, n_gram_length))

This code uses NLTK (Natural Language Toolkit) and Python's collections module to count the n-grams in the reference text. Each line is explained below:

  1. from nltk.util import ngrams: imports the ngrams function from NLTK, which generates the sequence of n-grams of a text.

  2. from collections import Counter: imports the Counter class from Python's collections module, which counts how often hashable objects occur.

  3. reference_n_gram_counts = Counter(ngrams(reference, n_gram_length)): counts the n-grams in the reference text and stores the result in reference_n_gram_counts. Specifically, ngrams generates all n-grams in reference, and Counter counts them. reference_n_gram_counts is therefore a dictionary-like object that maps every n-gram in the reference text to its number of occurrences.
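A small example makes the resulting counts and the clipping step concrete. To stay self-contained it uses a hypothetical `simple_ngrams` helper in place of `nltk.util.ngrams`; for a token list and no padding options, both are assumed to produce the same consecutive n-token tuples. The sentences are made up.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Return consecutive n-token tuples from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

reference = ["the", "cat", "saw", "the", "cat", "."]
candidate = ["the", "cat", "the", "cat", "the", "cat"]

reference_counts = Counter(simple_ngrams(reference, 2))
candidate_counts = Counter(simple_ngrams(candidate, 2))

# A Counter returns 0 for n-grams it has never seen, so min() covers both
# branches of the lab's if/else: clip to the reference count, or zero out.
clipped = {
    ngram: min(count, reference_counts[ngram])
    for ngram, count in candidate_counts.items()
}
precision = sum(clipped.values()) / sum(candidate_counts.values())

print(reference_counts)
print(clipped)
print(precision)
```

The bigram ("the", "cat") occurs three times in the candidate but is clipped to its two occurrences in the reference, and ("cat", "the") is zeroed out, giving a clipped bigram precision of 2/5.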

Computing the BLEU score
Finally, you can compute the BLEU score using the above two functions.

def bleu_score(candidate, reference):
    BP = brevity_penalty(candidate, reference)    
    geometric_average_precision = average_clipped_precision(candidate, reference)    
    return BP * geometric_average_precision
3.2 Testing the functions

Now you can test the functions with your Example Reference and Candidates Sentences.

result_candidate_1 = round(bleu_score(tokenized_cand_1, tokenized_ref) * 100, 1)
print(f"BLEU score of reference versus candidate 1: {result_candidate_1}")
result_candidate_2 = round(bleu_score(tokenized_cand_2, tokenized_ref) * 100, 1)
print(f"BLEU score of reference versus candidate 2: {result_candidate_2}")

Output

BLEU score of reference versus candidate 1: 27.6
BLEU score of reference versus candidate 2: 35.3
3.3 Comparing the Results from your Code with the Sacrebleu Library

Below you will do the same calculation, but using the sacrebleu library. Compare them with your implementation above.

result_candidate_1 = round(sacrebleu.sentence_bleu(candidate_1, [reference]).score, 1)
print(f"BLEU score of reference versus candidate 1: {result_candidate_1}")
result_candidate_2 = round(sacrebleu.sentence_bleu(candidate_2, [reference]).score, 1)
print(f"BLEU score of reference versus candidate 2: {result_candidate_2}")

Output

BLEU score of reference versus candidate 1: 27.6
BLEU score of reference versus candidate 2: 35.3

4. BLEU computation on a corpus

4.1 Loading Datasets for Evaluation Using the BLEU Score

In this section, you will use a simple pipeline for evaluating machine translated text. You will use English to German translations generated by Google Translate. There are three files you will need:

  1. A source text in English. In this lab, you will use the first 1671 words of the wmt19 evaluation dataset downloaded via SacreBLEU.
  2. A reference translation to German of the corresponding first 1671 words from the original English text. This is also provided by SacreBLEU.
  3. A candidate machine translation to German from the same 1671 words. This is generated by Google Translate.

With that, you can now compare the reference and candidate translation to get the BLEU Score.

# Loading the raw data
wmt19_src = open("data/wmt19_src.txt", "r")
wmt19_src_1 = wmt19_src.read()
wmt19_src.close()

wmt19_ref = open("data/wmt19_ref.txt", "r")
wmt19_ref_1 = wmt19_ref.read()
wmt19_ref.close()

wmt19_can = open("data/wmt19_can.txt", "r")
wmt19_can_1 = wmt19_can.read()
wmt19_can.close()

tokenized_corpus_src = nltk.word_tokenize(wmt19_src_1.lower())
tokenized_corpus_ref = nltk.word_tokenize(wmt19_ref_1.lower())
tokenized_corpus_cand = nltk.word_tokenize(wmt19_can_1.lower())

Now that you have your data loaded, you can inspect the first sentence of each dataset.

print("English source text:\n")
print(f"{wmt19_src_1[0:170]} -> {tokenized_corpus_src[0:30]}\n\n")
print("German reference translation:\n")
print(f"{wmt19_ref_1[0:219]} -> {tokenized_corpus_ref[0:35]}\n\n")
print("German machine translation:\n")
print(f"{wmt19_can_1[0:199]} -> {tokenized_corpus_cand[0:29]}")

Output

English source text:

Welsh AMs worried about 'looking like muppets'
There is consternation among some AMs at a suggestion their title should change to MWPs (Member of the Welsh Parliament).
 -> ['\ufeffwelsh', 'ams', 'worried', 'about', "'looking", 'like', "muppets'", 'there', 'is', 'consternation', 'among', 'some', 'ams', 'at', 'a', 'suggestion', 'their', 'title', 'should', 'change', 'to', 'mwps', '(', 'member', 'of', 'the', 'welsh', 'parliament', ')', '.']


German reference translation:

Walisische Ageordnete sorgen sich "wie Dödel auszusehen"
Es herrscht Bestürzung unter einigen Mitgliedern der Versammlung über einen Vorschlag, der ihren Titel zu MWPs (Mitglied der walisischen Parlament) ändern soll.
 -> ['\ufeffwalisische', 'ageordnete', 'sorgen', 'sich', '``', 'wie', 'dödel', 'auszusehen', "''", 'es', 'herrscht', 'bestürzung', 'unter', 'einigen', 'mitgliedern', 'der', 'versammlung', 'über', 'einen', 'vorschlag', ',', 'der', 'ihren', 'titel', 'zu', 'mwps', '(', 'mitglied', 'der', 'walisischen', 'parlament', ')', 'ändern', 'soll', '.']


German machine translation:

Walisische AMs machten sich Sorgen, dass sie wie Muppets aussehen könnten
Einige AMs sind bestürzt über den Vorschlag, ihren Titel in MWPs (Mitglied des walisischen Parlaments) zu ändern.
Es ist aufg -> ['walisische', 'ams', 'machten', 'sich', 'sorgen', ',', 'dass', 'sie', 'wie', 'muppets', 'aussehen', 'könnten', 'einige', 'ams', 'sind', 'bestürzt', 'über', 'den', 'vorschlag', ',', 'ihren', 'titel', 'in', 'mwps', '(', 'mitglied', 'des', 'walisischen', 'parlaments']

And lastly, you can calculate the BLEU score of the translation.

result =  round(sacrebleu.sentence_bleu(wmt19_can_1, [wmt19_ref_1]).score, 1)
print(f"BLEU score of the reference versus candidate translation: {result}")

Output

BLEU score of the reference versus candidate translation: 43.2
4.2 BLEU Score Interpretation on a Corpus

The table below (taken from here) shows the typical values of BLEU score. You can see that the translation above is of high quality according to this table and in comparison to the given reference sentence. (if you see “Hard to get the gist”, please open your workspace, delete wmt19_can.txt and get the latest version via the Lab Help button)

| Score | Interpretation |
| --- | --- |
| < 10 | Almost useless |
| 10 - 19 | Hard to get the gist |
| 20 - 29 | The gist is clear, but has significant grammatical errors |
| 30 - 40 | Understandable to good translations |
| 40 - 50 | High quality translations |
| 50 - 60 | Very high quality, adequate, and fluent translations |
| > 60 | Quality often better than human |

ROUGE-N Score

Previously, I introduced you to the BLEU score evaluation metric and its modified version. I used it to assess the performance of machine translation models. I also showed you some drawbacks that arise because that metric ignores semantics and sentence structure. In this video, I'll talk about the ROUGE score, another performance metric that is used to estimate the quality of machine translation systems.

I'll now introduce you to a family of metrics called ROUGE. It stands for Recall-Oriented Understudy for Gisting Evaluation, which is a mouthful, but it lets you know right off the bat that it's recall-oriented by default. That means that ROUGE cares about how much of the human-created references appears in the candidate translation. In contrast, BLEU is precision-oriented, since you have to determine how many words from the candidates appear in the references. ROUGE was initially developed to evaluate the quality of machine-summarized texts, but it is also helpful in assessing the quality of machine translation. It works by comparing the machine candidates against reference translations provided by humans. There are many versions of the ROUGE score, but the one covered here is called ROUGE-N. For the ROUGE-N score, you have to get the counts of the n-gram overlaps between the candidates and the reference translations, which is somewhat similar to what you have to do for the BLEU score.

To see the difference between the two metrics, I'll show you an example of how ROUGE-N works with unigrams. To get the basic version of the ROUGE-N score, which is based only on recall, you count word matches between the reference and the candidate, and divide by the number of words in the reference. If you have multiple references, you need to get a ROUGE-N score using each reference and take the maximum.

Now, let's go through the example that you already solved for the BLEU score. Your candidate has the word "I" two times, the word "am", and the word "I" again, for a total of four words. You also have a reference translation, "Younes said I am hungry", and another slightly different reference, "He said I am hungry". Each reference has five words in total. You have to count matches between the references and the candidate translation, similar to what you did for the BLEU score.

Let's start with the first reference. The word "Younes" doesn't match any of the unigrams in the candidate, so you don't add anything to the count. The word "said" doesn't match any word in the candidate either. The word "I" has multiple matches, but you count only the first one, so you add just one to your count. The word "am" has a match in the candidate, so you increment your count. The final word of the first reference, "hungry", doesn't match any of the words from the candidate, so you don't add anything to your count. If you repeat this process for the second reference, you get a count equal to 2. Finally, you divide these counts by the number of words in each reference and take the maximum value, which for this example is equal to 0.4.

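The counting procedure from the lecture example can be sketched in a few lines of Python. This is a minimal illustration rather than the course's implementation: the function names `ngram_counts` and `rouge_n_recall` are made up here, tokenization is a plain whitespace split, and the second reference is assumed to be tokenized as "He said I am hungry" so that it has five words.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=1):
    """Recall-only ROUGE-N: best overlap ratio over all references."""
    cand_counts = ngram_counts(candidate, n)
    best = 0.0
    for reference in references:
        ref_counts = ngram_counts(reference, n)
        total = sum(ref_counts.values())
        if total == 0:
            continue  # reference shorter than n tokens
        # Each reference n-gram is matched at most as often as it occurs in the candidate
        overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
        best = max(best, overlap / total)
    return best


candidate = "I I am I".split()
references = ["Younes said I am hungry".split(), "He said I am hungry".split()]
score = rouge_n_recall(candidate, references)
print(score)  # 2 matches out of 5 reference words
```

Both references yield 2 matching unigrams out of 5 words, which reproduces the 0.4 from the lecture.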

This basic version of the ROUGE-N score is based on recall, while the BLEU score you saw in the previous lectures is based on precision. But why not combine both to get a metric like an F1 score? Recall, pun intended, from your introductory machine learning courses that the F1 score is given by this formula: two times the product of precision and recall, divided by the sum of both metrics. You get the following formula if you replace precision with the modified version of the BLEU score and recall with the ROUGE-N score. For this example, you have a BLEU score equal to 0.5, which you got in previous lectures, and a ROUGE-N score equal to 0.4, which you calculated before. With these values, you have an F1 score equal to 4 over 9, close to 0.44.

You have now seen how to compute the modified BLEU and the simple ROUGE-N scores to evaluate your model. You can view these metrics like precision and recall. Therefore, you can use both to get an F1 score that could better assess the performance of your machine translation model. In many applications, you will see an F-score reported along with the BLEU and ROUGE-N metrics. However, you must note that all the evaluation metrics you have seen so far don't consider sentence structure or semantics. They only account for matching n-grams between the candidates and the reference translations.

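The F1 arithmetic from the lecture can be verified directly. The helper name `f1_score` below is chosen for this sketch; the inputs 0.5 (modified BLEU, used as precision) and 0.4 (ROUGE-N, used as recall) are the values quoted in the transcript.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision=0.5, recall=0.4)
print(round(f1, 2))  # 4/9, close to 0.44
```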


Sampling and Decoding

Hello. You will now learn about two ways to construct a sentence. The first approach is known as greedy decoding and the second approach is known as random sampling. You'll also see the pros and cons of each method. For example, choosing the word with the highest probability at every time step does not necessarily generate the best sequence. With that said, let's dive in and explore these two methods.

By now you have reached the final part of this week's lectures. That's awesome. I'll show you a few methods for sampling and decoding, and discuss an important sampling parameter called temperature. First, a quick reminder on how a seq2seq model predicts words. The output of the decoder is produced by a dense layer and a softmax or log-softmax operation. The output at each step is then a probability distribution over all the words and symbols in the target vocabulary. The final output of the model depends on how you choose words from these probability distributions at each step.

Greedy decoding is the simplest way to decode the model's predictions, as it selects the most probable word at every step. However, this approach has limitations. When you take the highest-probability word for each prediction and concatenate all predicted tokens into the output sequence, as the greedy decoder does, you can end up with a situation where the output, instead of "I am hungry," is "I am, am, am" and so forth. You can see how this could be a problem, though not in all cases. For shorter sequences, it's going to be fine. But if you have many other words to consider, then knowing what's coming up next might help you better predict the rest of the sequence.

Another option is known as random sampling. Random sampling provides probabilities for each word, and the next output is sampled according to those probabilities. One of the problems with this is that it can be a little bit too random. A solution is to assign more weight to the words with higher probabilities and less weight to the others. You will see a method for doing this in just a few moments.

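The difference between the two strategies at a single decoding step can be sketched as follows. The vocabulary and the probability vector are hypothetical stand-ins for the decoder's softmax output, not values from the course.

```python
import numpy as np

vocab = ["I", "am", "hungry", "<eos>"]
# Hypothetical softmax output of the decoder for one time step
probs = np.array([0.1, 0.6, 0.25, 0.05])

# Greedy decoding: always pick the most probable token
greedy_token = vocab[int(np.argmax(probs))]

# Random sampling: draw a token according to the probabilities
rng = np.random.default_rng(seed=0)
sampled_token = vocab[rng.choice(len(vocab), p=probs)]

print(greedy_token)   # always "am" for this distribution
print(sampled_token)  # varies with the seed; "am" is just the most likely draw
```

Greedy decoding is deterministic, while sampling can return any token with non-zero probability, which is where the "too random" behavior comes from.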

In sampling, temperature is a parameter you can adjust to allow for more or less randomness in your predictions. It's measured on a scale of 0 to 1, indicating low to high randomness. Let's say you need your model to make careful, safe decisions about what to output. Then set your parameter lower and get the prediction equivalent of a very confident but rather boring person seated next to you at a dinner table. If you feel like taking more of a gamble, set your temperature a bit higher. This has the effect of making your network more excited. You may get some pretty fun predictions. On the other hand, there will probably be a lot more mistakes.

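One common way to implement temperature is to divide the log-probabilities by the temperature and renormalize with a softmax. The lecture does not spell out the formula, so treat the sketch below as an assumption; the function name `apply_temperature` and the example distribution are made up. The lecture describes temperature on a 0 to 1 scale; this formula also accepts values above 1, which flatten the distribution further.

```python
import numpy as np


def apply_temperature(log_probs, temperature):
    """Rescale log-probabilities by 1/temperature and renormalize."""
    scaled = np.asarray(log_probs, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()


log_probs = np.log([0.5, 0.3, 0.15, 0.05])  # hypothetical decoder output

cold = apply_temperature(log_probs, temperature=0.2)     # sharper: close to greedy
neutral = apply_temperature(log_probs, temperature=1.0)  # unchanged distribution

print(np.round(cold, 3))
print(np.round(neutral, 3))
```

With a low temperature most of the probability mass moves to the top word, so sampling behaves almost like greedy decoding; with temperature 1 the original distribution is recovered.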

You have seen in this video methods for sampling and decoding that allow you to construct sentences using your model. However, these methods don't always produce the most convincing outputs. Instead, you'll often get a very random set of words. In the next videos, you'll see two methods for sampling and decoding that tend to produce better results.

Beam Search

Beam search is a technique that allows you to find the best sequences over a fixed window size known as the beam width. Since taking the output with the highest probability at each time step is not ideal, I will show you how to use beam search instead. So far, the methods I've presented only consider word probabilities one at a time. These ways of sampling might not result in the highest-probability sequences, and the sentences produced using these methods might not make sense. The overall most probable translation given an input sentence is not necessarily made up of the most probable word at each step. For example, it's possible that choosing the most probable word at the start of the sequence leads to a worse translation than choosing another word. If you had infinite computational power, you could calculate the probabilities of every possible output sentence and choose the best one. In the real world, we use beam search. This method attempts to find the most likely output sentence by keeping some number of best sequences based on conditional probabilities at each time step.

Now, at each time step with beam search, you have to calculate the probability of potential sequences given the outputs of the previous time step. To avoid computing the probability of every possible sequence, you have a parameter B called the beam width. At each step, you keep only the B most probable sequences and drop all others. You keep generating new words until all B most probable sequences end with the EOS token. Beam search consumes a lot of memory and can be computationally costly, depending on your choice of beam width. As an interesting side note, greedy decoding is just a particular case of beam search where you set the beam width B equal to 1.

To illustrate this method, consider a small vocabulary consisting of the words "I", "am", "hungry", and an end-of-sentence token, and let's say the beam width B is equal to 2. As with the other methods, beam search starts with the start-of-sentence token and gets the probabilities for the first word in the sequence. Here "I" has a probability of 0.5, "am" has a probability of 0.4, "hungry" has a probability of 0.1, and the EOS token has a probability of 0. Since the beam width is 2, you keep the two highest probabilities, "I" and "am".

Then you calculate the conditional probability of all other words given the two sequences that you have kept so far, "I" and "am". Given the word "I", the model returns "am" with a probability of 0.5 and "hungry" with 0.3. Given the word "am", the model returns the word "I" with a probability of 0.7 and "hungry" with a probability of 0.2. These are the conditional probabilities of word 2 given word 1 for the two sequences you kept. Next, you calculate the probability of each sequence by multiplying the conditional probability of word 2 given word 1 by the probability of word 1. For instance, the sequence "I I" has a probability of 0.5 times 0.1, which is equal to 0.05. The sequence "I am" has a probability of 0.5 times 0.5, which is equal to 0.25. You have to do the same with the sequences starting with the word "am". Here the sequence "am I" has a probability of 0.4 times 0.7, which is equal to 0.28.

At this point there are several possible sequences, but with the beam width B = 2 you only keep the two sequences with the highest probabilities. So you keep the sequences "I am" and "am I" and drop all others. Then, for the next step, you use "am" as the input word to get the conditional probabilities for the "I am" sequence, and "I" for the "am I" sequence. You repeat the process to calculate the joint probabilities and again choose the best sequences. This process stops when the model predicts an end-of-sentence token for all B most probable sequences. At the end, the sequence with the largest probability is chosen as the output.

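The worked example can be reproduced with a small, self-contained beam search. This is a sketch under stated assumptions: `TOY_MODEL` is a hypothetical lookup table standing in for the decoder. The first-step probabilities and the values 0.5, 0.3, 0.1 (given "I") and 0.7, 0.2 (given "am") come from the lecture; every other number, and the rule that unknown prefixes end the sentence, is made up so that the search terminates.

```python
EOS = "<eos>"

# Toy stand-in for the decoder: maps the prefix generated so far to
# next-token probabilities. Values not quoted in the lecture are invented.
TOY_MODEL = {
    (): {"I": 0.5, "am": 0.4, "hungry": 0.1, EOS: 0.0},
    ("I",): {"I": 0.1, "am": 0.5, "hungry": 0.3, EOS: 0.1},
    ("am",): {"I": 0.7, "am": 0.0, "hungry": 0.2, EOS: 0.1},
    ("I", "am"): {"I": 0.1, "am": 0.0, "hungry": 0.8, EOS: 0.1},
    ("am", "I"): {"I": 0.1, "am": 0.1, "hungry": 0.5, EOS: 0.3},
}


def next_token_probs(prefix):
    # Assumption of this toy model: any other prefix ends the sentence.
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})


def beam_search(beam_width, max_len=10):
    beams = [([], 1.0)]  # (tokens, joint probability); start token is implicit
    history = []
    for _ in range(max_len):
        candidates = []
        for tokens, prob in beams:
            if tokens and tokens[-1] == EOS:
                candidates.append((tokens, prob))  # carry finished sequences over
                continue
            for token, p in next_token_probs(tokens).items():
                # Joint probability = previous probability * conditional probability
                candidates.append((tokens + [token], prob * p))
        # Keep only the beam_width most probable sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        history.append(beams)
        if all(tokens[-1] == EOS for tokens, _ in beams):
            break
    return beams[0], history


(best_tokens, best_prob), history = beam_search(beam_width=2)
print(history[1])  # after step 2: "am I" (about 0.28) and "I am" (0.25)
print(best_tokens, best_prob)
```

Calling `beam_search(beam_width=1)` turns the same loop into greedy decoding, which matches the side note in the lecture.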

To get the conditional probabilities at each step, you have to use your model. For now, let's focus on the decoder part of the model, ignoring the encoder and attention mechanism. At the start of the sequence, you use your decoder to get a vector of probabilities for each of the words in the vocabulary. This is equivalent to computing the conditional probability of every word given the start-of-sequence token. Then you select the B most probable outputs from the model and discard all other possibilities. After that, you run your model B times to find the conditional probabilities of all the words in the vocabulary given the sequences that you kept in the previous step. Then you compute the sequence probabilities and keep the B most probable sequences. You keep doing this until the B most probable sequences reach the end-of-sequence token. Note that at each step after the one where your sequence is composed only of the SOS token, you have to run your model B times.

The vanilla version of beam search has some disadvantages. For instance, it penalizes the choice of longer sequences, because the probability of a sequence is computed as the product of multiple conditional probabilities. However, you can normalize the probability of each sequence by its number of words to avoid this problem. Beam search also requires you to store the B most probable sequences and compute conditional probabilities given all of those sequences. Therefore, this method can be computationally expensive and consume a lot of memory. You now have a new tool that you can add to your toolkit. Beam search has been widely used for quite a long time and is still used a lot. In the next video, I will show you another technique that you can use, known as MBR, or minimum Bayes risk.

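The length penalty is easy to see with numbers. The sketch below assumes one common form of normalization, dividing the sequence log-probability by the number of tokens; the lecture only says to normalize by the number of words, so the exact formula and the per-token probabilities are assumptions.

```python
import math


def sequence_log_prob(token_probs):
    """Log of the product of per-token conditional probabilities."""
    return sum(math.log(p) for p in token_probs)


def length_normalized_score(token_probs):
    """Average log-probability per token."""
    return sequence_log_prob(token_probs) / len(token_probs)


# Hypothetical per-token probabilities for a short and a long hypothesis
short_hyp = [0.5, 0.5]
long_hyp = [0.6, 0.6, 0.6, 0.6]

# Without normalization the short hypothesis wins, although each of its
# tokens is individually less likely than the tokens of the long one.
print(sequence_log_prob(short_hyp), sequence_log_prob(long_hyp))

# After normalization the long hypothesis scores higher.
print(length_normalized_score(short_hyp), length_normalized_score(long_hyp))
```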

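Beam search is a search algorithm used in generative models (such as machine translation and language models) to look for the best output sequence. At each time step it keeps the k highest-probability candidates (k is called the beam width or beam size), continues the search from those candidates at the next time step, and finally returns the best complete output sequence it has found.

Beam search works as follows:

  1. Initialization: start from the initial state, with the initial probability score set to 1.0 and the initial sequence set to the empty sequence.
  2. Candidate generation: at each time step, for every current candidate, use the model to produce the probability distribution of the next word and select the k most probable words as candidates for the next step.
  3. Score update: for each candidate, multiply the current probability score by the probability of the next word to get the new probability score.
  4. Pruning: keep the k candidates with the highest probability scores and discard the rest.
  5. Termination: stop the search when the generated sequences reach the maximum length or when all candidates end with the end-of-sequence symbol.
  6. Backtracking: choose the sequence with the highest probability among the final candidates as the output sequence.

The advantage of beam search is that it reduces the search space to some extent, speeds up the search, and can find relatively good solutions. However, because the beam width is limited, beam search may end up in a local optimum and fail to find the global optimum. Therefore, in some cases beam search needs to be combined with other techniques (such as a length penalty or a repetition penalty) to improve the results.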

Minimum Bayes Risk

This is the last technique I will show you to evaluate your NMT systems. It is very simple and works surprisingly well when compared to the other techniques. Let's get started.

Earlier, you encountered random sampling to choose a probable token, and the issues with that decoding method. But if you go a little further with that, say by generating 30 samples and comparing them all against one another, you'll see quite a bit of improvement in your decoding. You'll have to compare multiple candidate translations for the minimum Bayes risk decoding method, MBR for short. Implementing MBR is pretty straightforward. Begin by generating several random samples, then compare each sample against every other sample using a similarity score or a loss function. ROUGE, which you may recall from a bit earlier, would be a good choice. Finally, choose the sample with the highest average similarity or the lowest loss.

The translation that you get using this method is the one closest to all candidate translations. This process can be viewed, as some authors suggest, as finding a consensus between all candidate translations.

If you decide to use the ROUGE score as a similarity metric for comparing every pair of candidate translations, MBR is summarized by the formula presented here. Your goal is to find the candidate translation E that maximizes the average ROUGE score with every other candidate E prime. So MBR is relatively easy to implement. You need to have multiple candidate translations and select a way to compare them.

But for the sake of clarity, let's go through an implementation process in more detail. Here are the steps for implementing MBR with ROUGE on a small set of four candidate translations. First, calculate the ROUGE score between the first candidate translation, C subscript 1, and the second, C subscript 2. Do the same for the first and third, and for the first and fourth. Then compute the average, R subscript 1, using those three ROUGE scores. Then you repeat this process for the other three candidates in your set to get an average ROUGE score for each one of them. Finally, you select the candidate with the highest average ROUGE score. That's it for MBR. You'll be implementing this method in the assignment, along with a greedy decoder.

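The four-candidate procedure can be sketched as follows. This is not the assignment's implementation: the similarity used here is a simple unigram-overlap F1 standing in for ROUGE, and the names `unigram_f1` and `mbr_select`, as well as the candidate sentences, are made up for illustration.

```python
from collections import Counter


def unigram_f1(a, b):
    """Unigram-overlap F1 between two token lists (a stand-in for ROUGE)."""
    overlap = sum((Counter(a) & Counter(b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, similarity=unigram_f1):
    """Return the candidate with the highest average similarity to the others."""
    if len(candidates) < 2:
        raise ValueError("MBR needs at least two candidates to compare")
    best_index, best_score = 0, float("-inf")
    for i, candidate in enumerate(candidates):
        scores = [similarity(candidate, other)
                  for j, other in enumerate(candidates) if j != i]
        average = sum(scores) / len(scores)
        if average > best_score:
            best_index, best_score = i, average
    return candidates[best_index], best_score


# Four hypothetical sampled translations
candidates = [
    "I am hungry".split(),
    "I am very hungry".split(),
    "I am starving".split(),
    "he is hungry".split(),
]
best, avg_score = mbr_select(candidates)
print(" ".join(best), round(avg_score, 3))
```

The first candidate shares the most words with the other three, so it is selected as the consensus translation.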

In summary, MBR takes several translation candidates and compares them against each other, then chooses the one with the highest average similarity. As in the beam search case, this method can give you a more contextually accurate translation than random sampling and greedy decoding.

Congratulations on finishing this week. You now know how to implement your own machine translation system, and you also know how to evaluate it. Next week, I'll talk about one of the state-of-the-art models, known as the transformer, which also makes use of an encoder-decoder architecture.

Quiz

[Quiz screenshot]

Correction for question 1:

[Quiz screenshots]

Incorrect attempt for question 3. The correct answer to question 3 is: Keys, Values, Queries, Softmax

[Quiz screenshots]

Precision is defined as:

(Sum of overlapping unigrams in model and reference)/(total # of words in model)

Recall is defined as:

(Sum of overlapping unigrams in model and reference)/(total # of words in reference)

In ROUGE, precision measures how many of the words in the generated summary are also present in the reference summary, while recall measures how many of the words in the reference summary are also present in the generated summary.


Programming Assignment: NMT with Attention (Tensorflow)

Assignment 1: Neural Machine Translation

Welcome to the first assignment of Course 4. Here, you will build an English-to-Portuguese neural machine translation (NMT) model using Long Short-Term Memory (LSTM) networks with attention. Machine translation is an important task in natural language processing and could be useful not only for translating one language to another but also for word sense disambiguation (e.g. determining whether the word “bank” refers to the financial bank, or the land alongside a river). Implementing this using just a Recurrent Neural Network (RNN) with LSTMs can work for short to medium length sentences but can result in vanishing gradients for very long sequences. To help with this, you will be adding an attention mechanism to allow the decoder to access all relevant parts of the input sentence regardless of its length. By completing this assignment, you will:

  • Implement an encoder-decoder system with attention
  • Build the NMT model from scratch using Tensorflow
  • Generate translations using greedy and Minimum Bayes Risk (MBR) decoding


import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Setting this env variable prevents TF warnings from showing up

import numpy as np
import tensorflow as tf
from collections import Counter
from utils import (sentences, train_data, val_data, english_vectorizer, portuguese_vectorizer, 
                   masked_loss, masked_acc, tokens_to_text)

import w1_unittest

The utils.py file is as follows:

import numpy as np
import tensorflow as tf
import tensorflow_text as tf_text
import pathlib

path_to_file = pathlib.Path("por-eng/por.txt")

np.random.seed(1234)
tf.random.set_seed(1234)

def load_data(path):
    text = path.read_text(encoding="utf-8")

    lines = text.splitlines()
    pairs = [line.split("\t") for line in lines]

    context = np.array([context for target, context, _ in pairs])
    target = np.array([target for target, context, _ in pairs])

    return context, target


portuguese_sentences, english_sentences = load_data(path_to_file)

sentences = (portuguese_sentences, english_sentences)

BUFFER_SIZE = len(english_sentences)
BATCH_SIZE = 64

is_train = np.random.uniform(size=(len(portuguese_sentences),)) < 0.8

train_raw = (
    tf.data.Dataset.from_tensor_slices(
        (english_sentences[is_train], portuguese_sentences[is_train])
    )
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE)
)
val_raw = (
    tf.data.Dataset.from_tensor_slices(
        (english_sentences[~is_train], portuguese_sentences[~is_train])
    )
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE)
)


def tf_lower_and_split_punct(text):
    text = tf_text.normalize_utf8(text, "NFKD")
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, "[^ a-z.?!,¿]", "")
    text = tf.strings.regex_replace(text, "[.?!,¿]", r" \0 ")
    text = tf.strings.strip(text)
    text = tf.strings.join(["[SOS]", text, "[EOS]"], separator=" ")
    return text


max_vocab_size = 12000

english_vectorizer = tf.keras.layers.TextVectorization(
    standardize=tf_lower_and_split_punct, max_tokens=max_vocab_size, ragged=True
)

english_vectorizer.adapt(train_raw.map(lambda context, target: context))

portuguese_vectorizer = tf.keras.layers.TextVectorization(
    standardize=tf_lower_and_split_punct, max_tokens=max_vocab_size, ragged=True
)

portuguese_vectorizer.adapt(train_raw.map(lambda context, target: target))


def process_text(context, target):
    context = english_vectorizer(context).to_tensor()
    target = portuguese_vectorizer(target)
    targ_in = target[:, :-1].to_tensor()
    targ_out = target[:, 1:].to_tensor()
    return (context, targ_in), targ_out


train_data = train_raw.map(process_text, tf.data.AUTOTUNE)
val_data = val_raw.map(process_text, tf.data.AUTOTUNE)

del train_raw
del val_raw


def masked_loss(y_true, y_pred):
    
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
    loss = loss_fn(y_true, y_pred)
    
    # Check which elements of y_true are padding
    mask = tf.cast(y_true != 0, loss.dtype)
    
    loss *= mask
    # Return the mean loss over the non-padding tokens
    return tf.reduce_sum(loss)/tf.reduce_sum(mask)


def masked_acc(y_true, y_pred):
    y_pred = tf.argmax(y_pred, axis=-1)
    y_pred = tf.cast(y_pred, y_true.dtype)
    match = tf.cast(y_true == y_pred, tf.float32)
    mask = tf.cast(y_true != 0, tf.float32)

    # Count matches only at non-padding positions
    match *= mask

    return tf.reduce_sum(match)/tf.reduce_sum(mask)


def tokens_to_text(tokens, id_to_word):
    words = id_to_word(tokens)
    result = tf.strings.reduce_join(words, axis=-1, separator=" ")
    return result


1. Data Preparation

The text pre-processing bits have already been taken care of (if you are interested in this be sure to check the utils.py file). The steps performed can be summarized as:

  • Reading the raw data from the text files
  • Cleaning the data (using lowercase, adding space around punctuation, trimming whitespaces, etc)
  • Splitting it into training and validation sets
  • Adding the start-of-sentence and end-of-sentence tokens to every sentence
  • Tokenizing the sentences
  • Creating a Tensorflow dataset out of the tokenized sentences
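
To make the cleaning step concrete, here is a plain-Python sketch that mirrors the logic of tf_lower_and_split_punct from utils.py using only the standard library. The function name `lower_and_split_punct` and the sample sentence are made up for this illustration; the TensorFlow version remains the one used by the assignment.

```python
import re
import unicodedata


def lower_and_split_punct(text):
    # Decompose accented characters so the accents can be dropped below
    text = unicodedata.normalize("NFKD", text)
    text = text.lower()
    # Keep only spaces, a-z and a few punctuation marks
    text = re.sub(r"[^ a-z.?!,¿]", "", text)
    # Surround punctuation with spaces so it becomes its own token
    text = re.sub(r"[.?!,¿]", r" \g<0> ", text)
    text = text.strip()
    # Add the start-of-sentence and end-of-sentence tokens
    return " ".join(["[SOS]", text, "[EOS]"])


cleaned = lower_and_split_punct("Não importa, Tom!")
print(cleaned.split())
```

The accent on "Não" is removed by the normalization and filtering steps, which is consistent with the vocabulary shown later containing 'nao' rather than 'não'.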

Take a moment to inspect the raw sentences:

portuguese_sentences, english_sentences = sentences

print(f"English (to translate) sentence:\n\n{english_sentences[-5]}\n")
print(f"Portuguese (translation) sentence:\n\n{portuguese_sentences[-5]}")

Output

English (to translate) sentence:

No matter how much you try to convince people that chocolate is vanilla, it'll still be chocolate, even though you may manage to convince yourself and a few others that it's vanilla.

Portuguese (translation) sentence:

Não importa o quanto você tenta convencer os outros de que chocolate é baunilha, ele ainda será chocolate, mesmo que você possa convencer a si mesmo e poucos outros de que é baunilha.

You don’t have much use for the raw sentences so delete them to save memory:

del portuguese_sentences
del english_sentences
del sentences

Notice that you imported an english_vectorizer and a portuguese_vectorizer from utils.py. These were created using tf.keras.layers.TextVectorization and they provide interesting features such as ways to visualize the vocabulary and convert text into tokenized ids and vice versa. In fact, you can inspect the first ten words of the vocabularies for both languages:

print(f"First 10 words of the english vocabulary:\n\n{english_vectorizer.get_vocabulary()[:10]}\n")
print(f"First 10 words of the portuguese vocabulary:\n\n{portuguese_vectorizer.get_vocabulary()[:10]}")

Output

First 10 words of the english vocabulary:

['', '[UNK]', '[SOS]', '[EOS]', '.', 'tom', 'i', 'to', 'you', 'the']

First 10 words of the portuguese vocabulary:

['', '[UNK]', '[SOS]', '[EOS]', '.', 'tom', 'que', 'o', 'nao', 'eu']

Notice that the first 4 words are reserved for special words. In order, these are:

  • the empty string
  • a special token to represent an unknown word
  • a special token to represent the start of a sentence
  • a special token to represent the end of a sentence

You can see how many words are in a vocabulary by using the vocabulary_size method:

# Size of the vocabulary
vocab_size_por = portuguese_vectorizer.vocabulary_size()
vocab_size_eng = english_vectorizer.vocabulary_size()

print(f"Portuguese vocabulary is made up of {vocab_size_por} words")
print(f"English vocabulary is made up of {vocab_size_eng} words")

Output

Portuguese vocabulary is made up of 12000 words
English vocabulary is made up of 12000 words

You can define tf.keras.layers.StringLookup objects that will help you map from words to ids and vice versa. Do this for the portuguese vocabulary since this will be useful later on when you decode the predictions from your model:

# This helps you convert from words to ids
word_to_id = tf.keras.layers.StringLookup(
    vocabulary=portuguese_vectorizer.get_vocabulary(), 
    mask_token="", 
    oov_token="[UNK]"
)

# This helps you convert from ids to words
id_to_word = tf.keras.layers.StringLookup(
    vocabulary=portuguese_vectorizer.get_vocabulary(),
    mask_token="",
    oov_token="[UNK]",
    invert=True,
)

Try it out for the special tokens and a random word:

unk_id = word_to_id("[UNK]")
sos_id = word_to_id("[SOS]")
eos_id = word_to_id("[EOS]")
baunilha_id = word_to_id("baunilha")

print(f"The id for the [UNK] token is {unk_id}")
print(f"The id for the [SOS] token is {sos_id}")
print(f"The id for the [EOS] token is {eos_id}")
print(f"The id for baunilha (vanilla) is {baunilha_id}")

Output

The id for the [UNK] token is 1
The id for the [SOS] token is 2
The id for the [EOS] token is 3
The id for baunilha (vanilla) is 7079

Finally, take a look at what the data that will be fed to the neural network looks like. Both train_data and val_data are of type tf.data.Dataset and are already arranged in batches of 64 examples. To get the first batch out of a tf dataset you can use the take method. To get the first example out of the batch you can slice the tensor and use the numpy method for nicer printing:

for (to_translate, sr_translation), translation in train_data.take(1):
    print(f"Tokenized english sentence:\n{to_translate[0, :].numpy()}\n\n")
    print(f"Tokenized portuguese sentence (shifted to the right):\n{sr_translation[0, :].numpy()}\n\n")
    print(f"Tokenized portuguese sentence:\n{translation[0, :].numpy()}\n\n")

Output

Tokenized english sentence:
[   2  210    9  146  123   38    9 1672    4    3    0    0    0    0]


Tokenized portuguese sentence (shifted to the right):
[   2 1085    7  128   11  389   37 2038    4    0    0    0    0    0
    0]


Tokenized portuguese sentence:
[1085    7  128   11  389   37 2038    4    3    0    0    0    0    0
    0]

There are a couple of important details to notice.

  • Padding has already been applied to the tensors and the value used for this is 0
  • Each example consists of 3 different tensors:
    • The sentence to translate
    • The shifted-to-the-right translation
    • The translation

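The first two can be considered the features, while the third is the target. With this arrangement your model can perform Teacher Forcing, as you saw in the lectures: at every step the decoder reads the correct previous token from the shifted-to-the-right translation and is asked to predict the token at the same position in the translation.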
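The relationship between the shifted-to-the-right translation and the translation can be reproduced with a few lines of plain Python. This is a sketch, assuming the ids 0, 2 and 3 for padding, [SOS] and [EOS] as printed above; the helper names `pad` and `make_decoder_pair` are hypothetical and are not part of the notebook.

```python
PAD_ID, SOS_ID, EOS_ID = 0, 2, 3  # ids as printed for the vocabulary above


def pad(sequence, max_len):
    """Right-pad a list of ids with PAD_ID up to max_len."""
    return sequence + [PAD_ID] * (max_len - len(sequence))


def make_decoder_pair(token_ids, max_len):
    """Build (shifted-to-the-right input, target) for one translation.

    token_ids holds the translation without [SOS] or [EOS].
    """
    decoder_input = [SOS_ID] + token_ids  # what the decoder reads
    target = token_ids + [EOS_ID]         # what the decoder must predict
    return pad(decoder_input, max_len), pad(target, max_len)


# Portuguese example from the printed batch, without special tokens.
translation = [1085, 7, 128, 11, 389, 37, 2038, 4]
decoder_input, target = make_decoder_pair(translation, max_len=15)

print(decoder_input)
print(target)
```

At position t the decoder input holds the token that precedes target[t], so the model always conditions on the ground-truth history instead of its own earlier predictions.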

Now it is time to begin coding!

2. NMT model with attention

The model you will build uses an encoder-decoder architecture. This Recurrent Neural Network (RNN) takes in a tokenized version of a sentence in its encoder, then passes it on to the decoder for translation. As mentioned in the lectures, a regular sequence-to-sequence model with LSTMs works effectively for short to medium sentences but starts to degrade for longer ones. You can picture it like the figure below, where all of the context of the input sentence is compressed into one vector that is passed into the decoder block. You can see how this becomes an issue for very long sentences (e.g. 100 tokens or more), because the context of the first parts of the input has very little effect on the final vector passed to the decoder.

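[Figure: seq2seq model in which the whole input sentence is compressed into a single vector that is passed to the decoder]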

Adding an attention layer to this model avoids this problem by giving the decoder access to all parts of the input sentence. To illustrate, let’s use a 4-word input sentence as shown below. Remember that a hidden state is produced at each timestep of the encoder (represented by the orange rectangles). These are all passed to the attention layer, and each is given a score based on the current activation (i.e. hidden state) of the decoder. For instance, consider the figure below, where the first prediction “como” has already been made. To produce the next prediction, the attention layer first receives all the encoder hidden states (i.e. orange rectangles) as well as the decoder hidden state from when the word “como” was produced (i.e. first green rectangle). Given this information, it scores each of the encoder hidden states to determine which one the decoder should focus on to produce the next word. As a result of training, the model might have learned that it should align to the second encoder hidden state, and it subsequently assigns a high probability to the word “você”. With greedy decoding, that word is output as the next symbol, and the process restarts to produce the following word until an end-of-sentence prediction is reached.

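[Figure: attention layer scoring the encoder hidden states (orange rectangles) against the decoder hidden state (green rectangle) for a 4-word input sentence]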

There are different ways to implement attention and the one we’ll use for this assignment is the Scaled Dot Product Attention which has the form:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

You will dive deeper into this equation next week, but for now you can think of it as computing scores from the queries (Q) and keys (K), then multiplying the resulting weights by the values (V) to get a context vector at a particular timestep of the decoder. This context vector is fed to the decoder RNN to get a set of probabilities for the next predicted word. The division by the square root of the key dimensionality ($\sqrt{d_k}$) improves model performance, and you’ll also learn more about it next week. For our machine translation application, the encoder activations (i.e. encoder hidden states) will be the keys and values, while the decoder activations (i.e. decoder hidden states) will be the queries.
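To make the roles of Q, K and V concrete, here is a minimal sketch of the formula in plain Python, written with nested lists instead of tensors. It is only an illustration under simplifying assumptions: there are no learned projections, no masking and no batching, and the function names and the toy numbers are made up for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    largest = max(row)
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """queries: [T_q][d_k], keys: [T_k][d_k], values: [T_k][d_v]."""
    d_k = len(keys[0])
    contexts, all_weights = [], []
    for q in queries:
        # One score per key: dot product scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Context vector: weighted sum of the value vectors.
        context = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


# Four encoder hidden states act as keys and values.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
# One decoder hidden state acts as the query.
decoder_states = [[0.0, 2.0]]

context, weights = scaled_dot_product_attention(
    decoder_states, encoder_states, encoder_states
)
print([round(w, 3) for w in weights[0]])
print([round(c, 3) for c in context[0]])
```

The weights for each query sum to 1, and the encoder states most similar to the query (the second and third in this toy input) receive the largest share.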

You will see in the upcoming sections that this complex architecture and mechanism can be implemented with just a few lines of code.

First you will define two important global variables:

  • The size of the vocabulary
  • The number of units in the LSTM layers (the same number will be used for all LSTM layers)

In this assignment, the vocabulary sizes for English and Portuguese are the same. Therefore, we use a single constant VOCAB_SIZE throughout the notebook. In other settings the vocabulary sizes could differ, but that is not the case here.

VOCAB_SIZE = 12000
UNITS = 256

Exercise 1 - Encoder

Your first exercise is to code the encoder part of the neural network. For this, complete the Encoder class below. Notice that in the constructor (the __init__ method) you need to define all of the sublayers of the encoder and then use these sublayers during the forward pass (the call method).

The encoder consists of the following layers:

  • Embedding. For this layer you need to define the appropriate input_dim and output_dim and let it know that you are using ‘0’ as padding, which can be done by using the appropriate value for the mask_zero parameter.
  • Bidirectional LSTM. In TF you can implement bidirectional behaviour for RNN-like layers. This part is already taken care of but you will need to specify the appropriate type of layer as well as its parameters. In particular you need to set the appropriate number of units and make sure that the LSTM returns the full sequence and not only the last output, which can be done by using the appropriate value for the return_sequences parameter.
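The effect of treating 0 as padding can be sketched without TensorFlow. With mask_zero=True the embedding layer marks every position whose id is 0 so that later layers can skip it; the snippet below computes the same kind of boolean mask by hand. The helper name `padding_mask` is hypothetical and the sketch ignores how Keras propagates the mask between layers.

```python
PAD_ID = 0


def padding_mask(batch, pad_id=PAD_ID):
    """Return True for real tokens and False for padding, per position."""
    return [[token != pad_id for token in sequence] for sequence in batch]


# First English example from the batch printed earlier.
batch = [[2, 210, 9, 146, 123, 38, 9, 1672, 4, 3, 0, 0, 0, 0]]

mask = padding_mask(batch)
real_lengths = [sum(row) for row in mask]

print(mask[0])
print(real_lengths)  # [10]
```

Only the 10 real tokens of this 14-position example would influence the recurrent layers; the 4 trailing zeros are ignored.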

You need to define the forward pass using the syntax of TF’s functional API. What this means is that you chain function calls together to define your network like this:

encoder_input = keras.Input(shape=(28, 28, 1), name="original_img")
x = layers.Conv2D(16, 3, activation="relu")(encoder_input)
x = layers.MaxPooling2D(3)(x)
x = layers.Conv2D(16, 3, activation="relu")(x)
encoder_output = layers.GlobalMaxPooling2D()(x)
# GRADED CLASS: Encoder
class Encoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, units):
        """Initializes an instance of this class

        Args:
            vocab_size (int): Size of the vocabulary
            units (int): Number of units in the LSTM layer
        """
        super(Encoder, self).__init__()

        ### START CODE HERE ###

        self.embedding = tf.keras.layers.Embedding(  
            input_dim=vocab_size,
            output_dim=units,
            mask_zero=True
        )  

        self.rnn = tf.keras.layers.Bidirectional(  
            merge_mode="sum",  
            layer=tf.keras.layers.LSTM(
                units=units,
                return_sequences=True
            ),  
        )  

        ### END CODE HERE ###

    def call(self, context):
        """Forward pass of this layer

        Args:
            context (tf.Tensor): The sentence to translate

        Returns:
            tf.Tensor: Encoded sentence to translate
        """

        ### START CODE HERE ###

        # Pass the context through the embedding layer
        x = self.embedding(context)

        # Pass the output of the embedding through the RNN
        x = self.rnn(x)

        ### END CODE HERE ###

        return x
# Do a quick check of your implementation

# Create an instance of your class
encoder = Encoder(VOCAB_SIZE, UNITS)

# Pass a batch of sentences to translate from english to portuguese
encoder_output = encoder(to_translate)

print(f'Tensor of sentences in english has shape: {to_translate.shape}\n')
print(f'Encoder output has shape: {encoder_output.shape}')

Output

Tensor of sentences in english has shape: (64, 14)

Encoder output has shape: (64, 14, 256)

Expected Output

Tensor of sentences in english has shape: (64, 14)

Encoder output has shape: (64, 14, 256)
# Test your code!

w1_unittest.test_encoder(Encoder)

Output

 All tests passed!

Exercise 2 - CrossAttention

Your next exercise is to code the layer that will perform cross attention between the original sentences and the translations. For this, complete the CrossAttention class below. Notice that in the constructor (the __init__ method) you need to define all of the sublayers and then use these sublayers during the forward pass (the call method). For this particular case some of these bits are already taken care of.

The cross attention consists of the following layers:

  • MultiHeadAttention. For this layer you need to define the appropriate key_dim, which is the size of the key and query tensors. You will also need to set the number of heads to 1 since you aren’t implementing multi head attention but attention between two tensors. The reason why this layer is preferred over Attention is that it allows simpler code during the forward pass.

A couple of things to notice:

  • You need a way to combine the output of the attention with the shifted-to-the-right translation (since this cross attention happens on the decoder side). For this you will use an Add layer so that the original dimension is preserved, which would not happen if you used something like a Concatenate layer.
  • Layer normalization is also performed for better stability of the network by using a LayerNormalization layer.
  • You don’t need to worry about these last steps as these are already solved.
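The difference between adding and concatenating, and the effect of layer normalization, can be seen on a single feature vector. This is a simplified sketch in plain Python: the function names are made up, and the normalization omits the learnable scale and offset that the Keras LayerNormalization layer also applies.

```python
import math


def add(a, b):
    """Element-wise sum: the result keeps the original dimension."""
    return [x + y for x, y in zip(a, b)]


def concatenate(a, b):
    """Concatenation: the result has twice the original dimension."""
    return a + b


def layer_norm(vector, eps=1e-3):
    """Normalize one feature vector to zero mean and roughly unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


target_vector = [0.5, -1.0, 2.0, 0.0]  # decoder-side features at one timestep
attn_vector = [1.5, 1.0, -2.0, 4.0]    # attention output at the same timestep

summed = add(target_vector, attn_vector)
joined = concatenate(target_vector, attn_vector)
normalized = layer_norm(summed)

print(len(summed), len(joined))  # 4 8
print([round(x, 3) for x in normalized])
```

Because the sum keeps the dimension at 4, the next layer can be sized for the same number of units as the rest of the decoder.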
# GRADED CLASS: CrossAttention
class CrossAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        """Initializes an instance of this class

        Args:
            units (int): Number of units in the LSTM layer
        """
        super().__init__()

        ### START CODE HERE ###

        self.mha = ( 
            tf.keras.layers.MultiHeadAttention(
                key_dim=units,
                num_heads=1
            ) 
        )  

        ### END CODE HERE ###

        self.layernorm = tf.keras.layers.LayerNormalization()
        self.add = tf.keras.layers.Add()

    def call(self, context, target):
        """Forward pass of this layer

        Args:
            context (tf.Tensor): Encoded sentence to translate
            target (tf.Tensor): The embedded shifted-to-the-right translation

        Returns:
            tf.Tensor: Cross attention between context and target
        """
        ### START CODE HERE ###

        # Call the MH attention by passing in the query and value
        # For this case the query should be the translation and the value the encoded sentence to translate
        # Hint: Check the call arguments of MultiHeadAttention in the docs
        attn_output = self.mha(
            query=target,
            value=context
        )  

        ### END CODE HERE ###

        x = self.add([target, attn_output])

        x = self.layernorm(x)

        return x
# Do a quick check of your implementation

# Create an instance of your class
attention_layer = CrossAttention(UNITS)

# The attention layer expects the embedded sr-translation and the context
# The context (encoder_output) is already embedded so you need to do this for sr_translation:
sr_translation_embed = tf.keras.layers.Embedding(VOCAB_SIZE, output_dim=UNITS, mask_zero=True)(sr_translation)

# Compute the cross attention
attention_result = attention_layer(encoder_output, sr_translation_embed)

print(f'Tensor of contexts has shape: {encoder_output.shape}')
print(f'Tensor of translations has shape: {sr_translation_embed.shape}')
print(f'Tensor of attention scores has shape: {attention_result.shape}')

Output

Tensor of contexts has shape: (64, 14, 256)
Tensor of translations has shape: (64, 15, 256)
Tensor of attention scores has shape: (64, 15, 256)

Expected Output

Tensor of contexts has shape: (64, 14, 256)
Tensor of translations has shape: (64, 15, 256)
Tensor of attention scores has shape: (64, 15, 256)
# Test your code!

w1_unittest.test_cross_attention(CrossAttention)

Output

All tests passed!

Exercise 3 - Decoder

Now you will implement the decoder part of the neural network by completing the Decoder class below. Notice that in the constructor (the __init__ method) you need to define all of the sublayers of the decoder and then use these sublayers during the forward pass (the call method).

The decoder consists of the following layers:

  • Embedding. For this layer you need to define the appropriate input_dim and output_dim and let it know that you are using ‘0’ as padding, which can be done by using the appropriate value for the mask_zero parameter.
  • Pre-attention LSTM. Unlike the encoder, in which you used a Bidirectional LSTM, here you will use a vanilla LSTM. Don’t forget to set the appropriate number of units and make sure that the LSTM returns the full sequence and not only the last output, which can be done by using the appropriate value for the return_sequences parameter. It is very important that this layer returns the state, since this will be needed for inference, so make sure to set the return_state parameter accordingly. Notice that LSTM layers return the state as two tensors called memory_state and carry_state; in this notebook they are renamed hidden_state and cell_state, respectively, to better match what you have seen in the lectures.
  • The attention layer that performs cross attention between the sentence to translate and the right-shifted translation. Here you need to use the CrossAttention layer you defined in the previous exercise.
  • Post-attention LSTM. Another LSTM layer. For this one you don’t need it to return the state.
  • Finally a Dense layer. This one should have the same number of units as the size of the vocabulary since you expect it to compute the logits for every possible word in the vocabulary. Make sure to use a logsoftmax activation function for this one, which you can get as tf.nn.log_softmax.
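What the log-softmax activation does to the raw scores of the last layer can be checked on a tiny example. The following is a sketch in plain Python, not the tf.nn.log_softmax implementation; the function name and the input scores are chosen for illustration.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax of a list of raw scores."""
    largest = max(scores)
    log_sum = largest + math.log(sum(math.exp(x - largest) for x in scores))
    return [x - log_sum for x in scores]


scores = [2.0, 1.0, 0.1]  # raw scores for a 3-word toy vocabulary
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

print([round(lp, 4) for lp in log_probs])
print(round(sum(probs), 6))  # 1.0
```

Every output is a log-probability (zero or negative), exponentiating the outputs gives a distribution that sums to 1, and the ranking of the words is unchanged, so taking the argmax of the outputs picks the same word as taking the argmax of the raw scores.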
# GRADED CLASS: Decoder
class Decoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, units):
        """Initializes an instance of this class

        Args:
            vocab_size (int): Size of the vocabulary
            units (int): Number of units in the LSTM layer
        """
        super(Decoder, self).__init__()

        ### START CODE HERE ###

        # The embedding layer
        self.embedding = tf.keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=units,
            mask_zero=True
        )  

        # The RNN before attention
        self.pre_attention_rnn = tf.keras.layers.LSTM(
            units=units,
            return_sequences=True,
            return_state=True
        )  

        # The attention layer
        self.attention = CrossAttention(units)

        # The RNN after attention
        self.post_attention_rnn = tf.keras.layers.LSTM(
            units=units,
            return_sequences=True
        )  

        # The dense layer with logsoftmax activation
        self.output_layer = tf.keras.layers.Dense(
            units=vocab_size,
            activation=tf.nn.log_softmax
        )  

        ### END CODE HERE ###

    def call(self, context, target, state=None, return_state=False):
        """Forward pass of this layer

        Args:
            context (tf.Tensor): Encoded sentence to translate
            target (tf.Tensor): The shifted-to-the-right translation
            state (list[tf.Tensor, tf.Tensor], optional): Hidden state of the pre-attention LSTM. Defaults to None.
            return_state (bool, optional): If set to true return the hidden states of the LSTM. Defaults to False.

        Returns:
            tf.Tensor: The log_softmax probabilities of predicting a particular token
        """
        ### START CODE HERE ###

        # Get the embedding of the input
        x = self.embedding(target)

        # Pass the embedded input into the pre attention LSTM
        # Hints:
        # - The LSTM you defined earlier should return the output alongside the state (made up of two tensors)
        # - Pass in the state to the LSTM (needed for inference)
        x, hidden_state, cell_state = self.pre_attention_rnn(x, initial_state=state)

        # Perform cross attention between the context and the output of the LSTM (in that order)
        x = self.attention(context, x)

        # Do a pass through the post attention LSTM
        x = self.post_attention_rnn(x)

        # Compute the logits
        logits = self.output_layer(x)

        ### END CODE HERE ###

        if return_state:
            return logits, [hidden_state, cell_state]

        return logits
# Do a quick check of your implementation

# Create an instance of your class
decoder = Decoder(VOCAB_SIZE, UNITS)

# Notice that you don't need the embedded version of sr_translation since this is done inside the class
logits = decoder(encoder_output, sr_translation)

print(f'Tensor of contexts has shape: {encoder_output.shape}')
print(f'Tensor of right-shifted translations has shape: {sr_translation.shape}')
print(f'Tensor of logits has shape: {logits.shape}')

Output

Tensor of contexts has shape: (64, 14, 256)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)

Expected Output

Tensor of contexts has shape: (64, 14, 256)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)
# Test your code!

w1_unittest.test_decoder(Decoder, CrossAttention)

Output

All tests passed!

Exercise 4 - Translator

Now you have to put together all of the layers you previously coded into an actual model. For this, complete the Translator class below. Notice that, unlike the Encoder and Decoder classes, which inherit from tf.keras.layers.Layer, the Translator class inherits from tf.keras.Model.

Remember that train_data will yield a tuple with the sentence to translate and the shifted-to-the-right translation, which are the “features” of the model. This means that the inputs of your network will be tuples containing context and targets.

# GRADED CLASS: Translator
class Translator(tf.keras.Model):
    def __init__(self, vocab_size, units):
        """Initializes an instance of this class

        Args:
            vocab_size (int): Size of the vocabulary
            units (int): Number of units in the LSTM layer
        """
        super().__init__()

        ### START CODE HERE ###

        # Define the encoder with the appropriate vocab_size and number of units
        self.encoder = Encoder(vocab_size, units)

        # Define the decoder with the appropriate vocab_size and number of units
        self.decoder = Decoder(vocab_size, units)

        ### END CODE HERE ###

    def call(self, inputs):
        """Forward pass of this layer

        Args:
            inputs (tuple(tf.Tensor, tf.Tensor)): Tuple containing the context (sentence to translate) and the target (shifted-to-the-right translation)

        Returns:
            tf.Tensor: The log_softmax probabilities of predicting a particular token
        """

        ### START CODE HERE ###

        # In this case inputs is a tuple consisting of the context and the target, unpack it into single variables
        context, target = inputs

        # Pass the context through the encoder
        encoded_context = self.encoder(context)

        # Compute the logits by passing the encoded context and the target to the decoder
        logits = self.decoder(encoded_context, target)

        ### END CODE HERE ###

        return logits
# Do a quick check of your implementation

# Create an instance of your class
translator = Translator(VOCAB_SIZE, UNITS)

# Compute the logits for every word in the vocabulary
logits = translator((to_translate, sr_translation))

print(f'Tensor of sentences to translate has shape: {to_translate.shape}')
print(f'Tensor of right-shifted translations has shape: {sr_translation.shape}')
print(f'Tensor of logits has shape: {logits.shape}')

Output

Tensor of sentences to translate has shape: (64, 14)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)

Expected Output

Tensor of sentences to translate has shape: (64, 14)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)
w1_unittest.test_translator(Translator, Encoder, Decoder)

Output

 All tests passed!

3. Training

Now that you have an untrained instance of the NMT model, it is time to train it. You can use the compile_and_train function below to achieve this:

def compile_and_train(model, epochs=20, steps_per_epoch=500):
    model.compile(optimizer="adam", loss=masked_loss, metrics=[masked_acc, masked_loss])

    history = model.fit(
        train_data.repeat(),
        epochs=epochs,
        steps_per_epoch=steps_per_epoch,
        validation_data=val_data,
        validation_steps=50,
        callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)],
    )

    return model, history
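The masked_loss and masked_acc functions passed to compile are imported from the course utilities and are not shown in these notes. The idea they are named after can be sketched as follows, under the assumption that positions whose target id is 0 (padding) are simply left out of the average. This is plain Python on nested lists rather than the notebook's implementation, and the function name `masked_loss_and_acc` is hypothetical.

```python
import math

PAD_ID = 0


def masked_loss_and_acc(targets, log_probs):
    """targets: [T] token ids; log_probs: [T][V] log-probabilities."""
    total_loss, correct, n_real = 0.0, 0, 0
    for target, row in zip(targets, log_probs):
        if target == PAD_ID:
            continue  # padding positions contribute nothing
        total_loss += -row[target]  # negative log-likelihood of the target
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == target)
        n_real += 1
    return total_loss / n_real, correct / n_real


# Toy vocabulary of 4 ids and a sequence of 4 positions; the last two are padding.
targets = [2, 1, 0, 0]
probabilities = [
    [0.1, 0.1, 0.7, 0.1],      # correct: id 2 is the most likely
    [0.1, 0.2, 0.3, 0.4],      # wrong: id 3 is the most likely, target is 1
    [0.7, 0.1, 0.1, 0.1],      # padding, ignored
    [0.25, 0.25, 0.25, 0.25],  # padding, ignored
]
log_probs = [[math.log(p) for p in row] for row in probabilities]

loss, acc = masked_loss_and_acc(targets, log_probs)
print(round(loss, 4), acc)
```

Without the mask, the many padded positions in short sentences would dominate both numbers and make the model look better or worse than it is on real tokens.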
# Train the translator (this takes some minutes so feel free to take a break)

trained_translator, history = compile_and_train(translator)

Output

Epoch 1/20
500/500 [==============================] - 48s 68ms/step - loss: 5.1397 - masked_acc: 0.2225 - masked_loss: 5.1422 - val_loss: 4.2776 - val_masked_acc: 0.3481 - val_masked_loss: 4.2784
Epoch 2/20
500/500 [==============================] - 16s 33ms/step - loss: 3.6741 - masked_acc: 0.4191 - masked_loss: 3.6749 - val_loss: 3.0218 - val_masked_acc: 0.4965 - val_masked_loss: 3.0237
Epoch 3/20
500/500 [==============================] - 16s 32ms/step - loss: 2.7304 - masked_acc: 0.5441 - masked_loss: 2.7319 - val_loss: 2.3823 - val_masked_acc: 0.5844 - val_masked_loss: 2.3826
Epoch 4/20
500/500 [==============================] - 16s 32ms/step - loss: 2.2110 - masked_acc: 0.6180 - masked_loss: 2.2119 - val_loss: 1.9733 - val_masked_acc: 0.6483 - val_masked_loss: 1.9741
Epoch 5/20
500/500 [==============================] - 16s 31ms/step - loss: 1.8569 - masked_acc: 0.6690 - masked_loss: 1.8581 - val_loss: 1.7101 - val_masked_acc: 0.6829 - val_masked_loss: 1.7105
Epoch 6/20
500/500 [==============================] - 15s 30ms/step - loss: 1.5978 - masked_acc: 0.7010 - masked_loss: 1.5987 - val_loss: 1.5794 - val_masked_acc: 0.7062 - val_masked_loss: 1.5785
Epoch 7/20
500/500 [==============================] - 15s 31ms/step - loss: 1.4876 - masked_acc: 0.7169 - masked_loss: 1.4883 - val_loss: 1.4743 - val_masked_acc: 0.7138 - val_masked_loss: 1.4749
Epoch 8/20
500/500 [==============================] - 16s 32ms/step - loss: 1.3877 - masked_acc: 0.7300 - masked_loss: 1.3886 - val_loss: 1.3770 - val_masked_acc: 0.7270 - val_masked_loss: 1.3786
Epoch 9/20
500/500 [==============================] - 16s 32ms/step - loss: 1.3113 - masked_acc: 0.7391 - masked_loss: 1.3126 - val_loss: 1.2842 - val_masked_acc: 0.7379 - val_masked_loss: 1.2841
Epoch 10/20
500/500 [==============================] - 16s 31ms/step - loss: 1.1826 - masked_acc: 0.7548 - masked_loss: 1.1840 - val_loss: 1.2217 - val_masked_acc: 0.7491 - val_masked_loss: 1.2225
Epoch 11/20
500/500 [==============================] - 15s 30ms/step - loss: 1.0788 - masked_acc: 0.7676 - masked_loss: 1.0797 - val_loss: 1.1737 - val_masked_acc: 0.7615 - val_masked_loss: 1.1739
Epoch 12/20
500/500 [==============================] - 15s 30ms/step - loss: 1.0570 - masked_acc: 0.7716 - masked_loss: 1.0579 - val_loss: 1.1306 - val_masked_acc: 0.7603 - val_masked_loss: 1.1307
Epoch 13/20
500/500 [==============================] - 15s 31ms/step - loss: 1.0358 - masked_acc: 0.7733 - masked_loss: 1.0369 - val_loss: 1.1341 - val_masked_acc: 0.7596 - val_masked_loss: 1.1348
Epoch 14/20
500/500 [==============================] - 15s 31ms/step - loss: 1.0092 - masked_acc: 0.7777 - masked_loss: 1.0103 - val_loss: 1.0890 - val_masked_acc: 0.7683 - val_masked_loss: 1.0894
Epoch 15/20
500/500 [==============================] - 16s 31ms/step - loss: 0.9015 - masked_acc: 0.7928 - masked_loss: 0.9026 - val_loss: 1.0708 - val_masked_acc: 0.7712 - val_masked_loss: 1.0722
Epoch 16/20
500/500 [==============================] - 15s 31ms/step - loss: 0.8550 - masked_acc: 0.7989 - masked_loss: 0.8556 - val_loss: 1.0757 - val_masked_acc: 0.7720 - val_masked_loss: 1.0774
Epoch 17/20
500/500 [==============================] - 15s 31ms/step - loss: 0.8635 - masked_acc: 0.7969 - masked_loss: 0.8643 - val_loss: 1.0103 - val_masked_acc: 0.7784 - val_masked_loss: 1.0115
Epoch 18/20
500/500 [==============================] - 16s 32ms/step - loss: 0.8636 - masked_acc: 0.7970 - masked_loss: 0.8643 - val_loss: 1.0002 - val_masked_acc: 0.7817 - val_masked_loss: 1.0015
Epoch 19/20
500/500 [==============================] - 15s 31ms/step - loss: 0.8465 - masked_acc: 0.7990 - masked_loss: 0.8471 - val_loss: 0.9914 - val_masked_acc: 0.7835 - val_masked_loss: 0.9912
Epoch 20/20
500/500 [==============================] - 16s 31ms/step - loss: 0.7365 - masked_acc: 0.8181 - masked_loss: 0.7373 - val_loss: 0.9863 - val_masked_acc: 0.7834 - val_masked_loss: 0.9869
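Training ran for all 20 epochs even though an EarlyStopping callback with patience=3 was attached. A simplified sketch of the patience rule, applied to the val_loss values from the log above, shows why. It assumes the callback's default behaviour of monitoring val_loss and counting consecutive epochs without improvement; the function name `early_stopping_epoch` is made up for this illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # consecutive epochs without a new best value
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# val_loss per epoch, copied from the training log above.
val_loss_log = [
    4.2776, 3.0218, 2.3823, 1.9733, 1.7101, 1.5794, 1.4743, 1.3770,
    1.2842, 1.2217, 1.1737, 1.1306, 1.1341, 1.0890, 1.0708, 1.0757,
    1.0103, 1.0002, 0.9914, 0.9863,
]

print(early_stopping_epoch(val_loss_log))                    # None
print(early_stopping_epoch([1.0, 0.9, 0.95, 0.96, 0.97]))    # 5
```

In the log the validation loss fails to improve only at epochs 13 and 16, each time for a single epoch, so the counter never reaches 3.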

4. Using the model for inference

Now that your model is trained, you can use it for inference. To help you with this, the generate_next_token function is provided. Notice that this function is meant to be used inside a for-loop, so you feed it the information from the previous step to generate the information for the next step. In particular, you need to keep track of the state of the pre-attention LSTM in the decoder and of whether you are done with the translation. Also notice that a temperature variable is introduced, which determines how the next token is selected given the predicted logits:

def generate_next_token(decoder, context, next_token, done, state, temperature=0.0):
    """Generates the next token in the sequence

    Args:
        decoder (Decoder): The decoder
        context (tf.Tensor): Encoded sentence to translate
        next_token (tf.Tensor): The predicted next token
        done (bool): True if the translation is complete
        state (list[tf.Tensor, tf.Tensor]): Hidden states of the pre-attention LSTM layer
        temperature (float, optional): The temperature that controls the randomness of the predicted tokens. Defaults to 0.0.

    Returns:
        tuple(tf.Tensor, np.float32, list[tf.Tensor, tf.Tensor], bool): The next token, log prob of said token, hidden state of LSTM and if translation is done
    """
    # Get the logits and state from the decoder
    logits, state = decoder(context, next_token, state=state, return_state=True)
    
    # Trim the intermediate dimension 
    logits = logits[:, -1, :]
        
    # If temp is 0 then next_token is the argmax of logits
    if temperature == 0.0:
        next_token = tf.argmax(logits, axis=-1)
        
    # If temp is not 0 then next_token is sampled out of logits
    else:
        logits = logits / temperature
        next_token = tf.random.categorical(logits, num_samples=1)
    
    # Trim dimensions of size 1
    logits = tf.squeeze(logits)
    next_token = tf.squeeze(next_token)
    
    # Get the logit of the selected next_token
    logit = logits[next_token].numpy()
    
    # Reshape to (1,1) since this is the expected shape for text encoded as TF tensors
    next_token = tf.reshape(next_token, shape=(1,1))
    
    # If next_token is End-of-Sentence token you are done
    if next_token == eos_id:
        done = True
    
    return next_token, logit, state, done
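The role of the temperature can be isolated from the rest of the function with a small sketch in plain Python. It mirrors the two branches above (argmax when the temperature is 0, sampling from the rescaled scores otherwise) but uses the standard library instead of TensorFlow; the function names and the toy distribution are assumptions made for illustration.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(log_probs, temperature=0.0, rng=random):
    """Greedy choice when temperature is 0, otherwise temperature sampling."""
    if temperature == 0.0:
        return max(range(len(log_probs)), key=log_probs.__getitem__)
    scaled = [lp / temperature for lp in log_probs]
    probs = softmax(scaled)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


log_probs = [math.log(p) for p in [0.5, 0.3, 0.2]]

greedy = pick_next_token(log_probs)
sampled = pick_next_token(log_probs, temperature=0.5, rng=random.Random(0))

# A temperature below 1 sharpens the distribution around the best token.
sharp = softmax([lp / 0.5 for lp in log_probs])
print(greedy, sampled)
print([round(p, 3) for p in sharp])
```

Lower temperatures move the sampling closer to the greedy choice, while temperatures above 1 flatten the distribution and make unlikely tokens more probable.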

See how it works by running the following cell:

# PROCESS SENTENCE TO TRANSLATE AND ENCODE

# A sentence you wish to translate
eng_sentence = "I love languages"

# Convert it to a tensor
texts = tf.convert_to_tensor(eng_sentence)[tf.newaxis]

# Vectorize it and pass it through the encoder
context = english_vectorizer(texts).to_tensor()
context = encoder(context)

# SET STATE OF THE DECODER

# Next token is Start-of-Sentence since you are starting fresh
next_token = tf.fill((1,1), sos_id)

# Hidden and Cell states of the LSTM can be mocked using uniform samples
state = [tf.random.uniform((1, UNITS)), tf.random.uniform((1, UNITS))]

# You are not done until next token is EOS token
done = False

# Generate next token
next_token, logit, state, done = generate_next_token(decoder, context, next_token, done, state, temperature=0.5)
print(f"Next token: {next_token}\nLogit: {logit:.4f}\nDone? {done}")

Output

Next token: [[8841]]
Logit: -18.7456
Done? False

Exercise 5 - translate

Now you can put everything together to translate a given sentence. For this, complete the translate function below. This function will take care of the following steps:

  • Process the sentence to translate and encode it
  • Set the initial state of the decoder
  • Get predictions of the next token (starting with the <SOS> token) for at most max_length iterations (in case the <EOS> token is never returned)
  • Return the translated text (as a string), the logit of the last iteration (this helps measure how certain the model was that the sequence was translated in its entirety) and the translation in token format.

Hints:

  • The previous cell provides a lot of insights on how this function should work, so if you get stuck refer to it.
# GRADED FUNCTION: translate
def translate(model, text, max_length=50, temperature=0.0):
    """Translate a given sentence from English to Portuguese

    Args:
        model (tf.keras.Model): The trained translator
        text (string): The sentence to translate
        max_length (int, optional): The maximum length of the translation. Defaults to 50.
        temperature (float, optional): The temperature that controls the randomness of the predicted tokens. Defaults to 0.0.

    Returns:
        tuple(str, np.float, tf.Tensor): The translation, logit that predicted <EOS> token and the tokenized translation
    """
    # Lists to save tokens and logits
    tokens, logits = [], []

    ### START CODE HERE ###
    
    # PROCESS THE SENTENCE TO TRANSLATE
    
    # Convert the original string into a tensor
    text = tf.convert_to_tensor(text)[tf.newaxis]
    
    # Vectorize the text using the correct vectorizer
    context = english_vectorizer(text).to_tensor()
    
    # Get the encoded context (pass the context through the encoder)
    # Hint: Remember you can get the encoder by using model.encoder
    context = model.encoder(context)
    
    # INITIAL STATE OF THE DECODER
    
    # First token should be SOS token with shape (1,1)
    next_token = tf.fill((1, 1), sos_id)
    
    # Initial hidden and cell states should be tensors of zeros with shape (1, UNITS)
    state = [tf.zeros((1, UNITS)), tf.zeros((1, UNITS))]
    
    # You are done when you draw a EOS token as next token (initial state is False)
    done = False

    # Iterate for max_length iterations
    for i in range(max_length):
        # Generate the next token
        try:
            next_token, logit, state, done = generate_next_token(
                decoder=model.decoder,
                context=context,
                next_token=next_token,
                done=done,
                state=state,
                temperature=temperature
            )
        except Exception as err:
            # Chain the original error so the real cause stays visible
            raise Exception("Problem generating the next token") from err
        
        # If done then break out of the loop
        if done:
            break
        
        # Add next_token to the list of tokens
        tokens.append(next_token)
        
        # Add logit to the list of logits
        logits.append(logit)
    
    ### END CODE HERE ###
    
    # Concatenate all tokens into a tensor
    tokens = tf.concat(tokens, axis=-1)
    
    # Convert the translated tokens into text
    translation = tf.squeeze(tokens_to_text(tokens, id_to_word))
    translation = translation.numpy().decode()
    
    return translation, logits[-1], tokens
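The control flow of translate can be seen in isolation with a toy stand-in for the decoder. In this sketch `decode_loop`, `step_fn` and `count_up` are hypothetical names. `step_fn` plays the role of generate_next_token and simply returns the next token id. The loop stops on the end token or after max_length steps, whichever comes first.

```python
def decode_loop(step_fn, sos_id, eos_id, max_length=50):
    """Collect tokens until EOS is produced or max_length steps have run."""
    tokens = []
    token = sos_id
    for _ in range(max_length):
        token = step_fn(token)
        if token == eos_id:
            break  # EOS itself is not stored, mirroring translate
        tokens.append(token)
    return tokens

def count_up(previous_token):
    # Toy "decoder": always emits the previous id plus one
    return previous_token + 1

print(decode_loop(count_up, sos_id=0, eos_id=4))                 # [1, 2, 3]
print(decode_loop(count_up, sos_id=0, eos_id=-1, max_length=5))  # [1, 2, 3, 4, 5]
```

The second call never sees its end token, so the max_length cap is what terminates it.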

Try your function with a temperature of 0, which yields a deterministic output and is equivalent to greedy decoding:

# Running this cell multiple times should return the same output since temp is 0

temp = 0.0 
original_sentence = "I love languages"

translation, logit, tokens = translate(trained_translator, original_sentence, temperature=temp)

print(f"Temperature: {temp}\n\nOriginal sentence: {original_sentence}\nTranslation: {translation}\nTranslation tokens:{tokens}\nLogit: {logit:.3f}")

Output

Temperature: 0.0

Original sentence: I love languages
Translation: eu adoro idiomas ja .
Translation tokens:[[  9 564 850  69   4]]
Logit: -1.165

Try your function with a temperature of 0.7 (stochastic output):

# Running this cell multiple times should return different outputs since temp is not 0
# You can try different temperatures

temp = 0.7
original_sentence = "I love languages"

translation, logit, tokens = translate(trained_translator, original_sentence, temperature=temp)

print(f"Temperature: {temp}\n\nOriginal sentence: {original_sentence}\nTranslation: {translation}\nTranslation tokens:{tokens}\nLogit: {logit:.3f}")

Output

Temperature: 0.7

Original sentence: I love languages
Translation: eu amo linguas em favor .
Translation tokens:[[   9  522 1032   22   90    4]]
Logit: -1.007
w1_unittest.test_translate(translate, trained_translator)

Output

 All tests passed!

5. Minimum Bayes-Risk Decoding

As mentioned in the lectures, getting the most probable token at each step may not necessarily produce the best results. Another approach is to do Minimum Bayes Risk Decoding or MBR. The general steps to implement this are:

  • Take several random samples
  • Score each sample against all other samples
  • Select the one with the highest score

You will be building helper functions for these steps in the following sections.

Because a nonzero temperature yields a different translation on each run, you can do what you saw in the lectures: generate several candidate translations and then determine which one is the best. You will now do this with the provided generate_samples function, which returns any desired number of candidate translations alongside the log-probability for each one:

def generate_samples(model, text, n_samples=4, temperature=0.6):
    
    samples, log_probs = [], []

    # Iterate for n_samples iterations
    for _ in range(n_samples):
        
        # Save the logit and the translated tensor
        _, logp, sample = translate(model, text, temperature=temperature)
        
        # Save the translated tensors
        samples.append(np.squeeze(sample.numpy()).tolist())
        
        # Save the logits
        log_probs.append(logp)
                
    return samples, log_probs
samples, log_probs = generate_samples(trained_translator, 'I love languages')

for s, l in zip(samples, log_probs):
    print(f"Translated tensor: {s} has logit: {l:.3f}")

Output

Translated tensor: [9, 9, 522, 850, 12, 256, 4] has logit: -0.625
Translated tensor: [9, 9, 522, 850, 69, 4] has logit: -1.636
Translated tensor: [9, 564, 850, 38, 8138, 4] has logit: -0.201
Translated tensor: [9, 522, 850, 12, 256, 4] has logit: -0.694

Comparing overlaps

Now that you can generate multiple translations it is time to come up with a method to measure the goodness of each one. As you saw in the lectures, one way to achieve this is by comparing each sample against the others.

There are several metrics you can use for this purpose, as shown in the lectures and you can try experimenting with any one of these. For this assignment, you will be calculating scores for unigram overlaps.

One of these metrics is the widely used yet simple Jaccard similarity which gets the intersection over union of two sets. The jaccard_similarity function returns this metric for any pair of candidate and reference translations:

def jaccard_similarity(candidate, reference):
        
    # Convert the lists to sets to get the unique tokens
    candidate_set = set(candidate)
    reference_set = set(reference)
    
    # Get the set of tokens common to both candidate and reference
    common_tokens = candidate_set.intersection(reference_set)
    
    # Get the set of all tokens found in either candidate or reference
    all_tokens = candidate_set.union(reference_set)
    
    # Compute the percentage of overlap (divide the number of common tokens by the number of all tokens)
    overlap = len(common_tokens) / len(all_tokens)
        
    return overlap
l1 = [1, 2, 3]
l2 = [1, 2, 3, 4]

js = jaccard_similarity(l1, l2)

print(f"jaccard similarity between lists: {l1} and {l2} is {js:.3f}")

Output

jaccard similarity between lists: [1, 2, 3] and [1, 2, 3, 4] is 0.750

Expected Output

jaccard similarity between tensors: [1, 2, 3] and [1, 2, 3, 4] is 0.750
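Because the function converts both inputs to sets, repeated tokens are ignored. The small sketch below uses two of the sampled translations printed earlier, which differ only by a repeated token 9. The one-line `jaccard` helper restates the function above so the snippet runs on its own.

```python
def jaccard(candidate, reference):
    # Intersection over union of the unique tokens
    a, b = set(candidate), set(reference)
    return len(a & b) / len(a | b)

with_repeat = [9, 9, 522, 850, 12, 256, 4]
without_repeat = [9, 522, 850, 12, 256, 4]

print(jaccard(with_repeat, without_repeat))  # 1.0
```

The two sequences are scored as identical even though one contains an extra token, which is one reason to prefer a count-based metric.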

Exercise 6 - rouge1_similarity

Jaccard similarity is good but a more commonly used metric in machine translation is the ROUGE score. For unigrams, this is called ROUGE-1 and as shown in the lectures, you can output the scores for both precision and recall when comparing two samples. To get the final score, you will want to compute the F1-score as given by:

$$\text{score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

For the implementation of the rouge1_similarity function you want to use the Counter class from the Python standard library:

# GRADED FUNCTION: rouge1_similarity
def rouge1_similarity(candidate, reference):
    """Computes the ROUGE 1 score between two token lists

    Args:
        candidate (list[int]): Tokenized candidate translation
        reference (list[int]): Tokenized reference translation

    Returns:
        float: Overlap between the two token lists
    """
    ### START CODE HERE ###
    
    # Make a frequency table of the candidate and reference tokens
    # Hint: use the Counter class (already imported)
    candidate_word_counts = Counter(candidate)
    reference_word_counts = Counter(reference)
    
    # Initialize overlap at 0
    overlap = 0
    
    # Iterate over the tokens in the candidate frequency table
    # Hint: Counter is a subclass of dict and you can get the keys 
    #       out of a dict using the keys method like this: dict.keys()
    for token in candidate_word_counts.keys():
        
        # Get the count of the current token in the candidate frequency table
        # Hint: You can access the counts of a token as you would access values of a dictionary
        token_count_candidate = candidate_word_counts[token]
        
        # Get the count of the current token in the reference frequency table
        # Hint: You can access the counts of a token as you would access values of a dictionary
        token_count_reference = reference_word_counts.get(token, 0)
        
        # Update the overlap by getting the minimum between the two token counts above
        overlap += np.minimum(token_count_candidate, token_count_reference)
    
    # Compute the precision
    # Hint: precision = overlap / (number of tokens in candidate list) 
    precision = overlap / len(candidate)
    
    # Compute the recall
    # Hint: recall = overlap / (number of tokens in reference list) 
    recall = overlap / len(reference)
    
    if precision + recall != 0:
        # Compute the Rouge1 Score
        # Hint: This is equivalent to the F1 score
        f1_score = 2 * (precision * recall) / (precision + recall)
        
        return f1_score
    
    ### END CODE HERE ###
        
    return 0 # If precision + recall = 0 then return 0
l1 = [1, 2, 3]
l2 = [1, 2, 3, 4]

r1s = rouge1_similarity(l1, l2)

print(f"rouge 1 similarity between lists: {l1} and {l2} is {r1s:.3f}")

Output

rouge 1 similarity between lists: [1, 2, 3] and [1, 2, 3, 4] is 0.857

Expected Output

rouge 1 similarity between lists: [1, 2, 3] and [1, 2, 3, 4] is 0.857
w1_unittest.test_rouge1_similarity(rouge1_similarity)

Output

All tests passed!
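As an aside, the clipped overlap computed in the loop can also be written with Counter intersection: for two Counter objects, `a & b` keeps the minimum count of each shared key. The sketch below (the name `rouge1_f1` is hypothetical and not part of the assignment) should agree with the graded function on the example above. It also shows that, unlike a set-based measure, ROUGE-1 penalizes a repeated token.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    # Counter & Counter keeps min(count_a, count_b) for every shared token
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0  # also covers empty inputs, avoiding a division by zero
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1([1, 2, 3], [1, 2, 3, 4]), 3))  # 0.857

# Two sampled translations that differ only by a repeated token 9
print(round(rouge1_f1([9, 9, 522, 850, 12, 256, 4],
                      [9, 522, 850, 12, 256, 4]), 3))  # 0.923
```

In the second call the overlap is 6, precision is 6/7 and recall is 1, giving 12/13.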

Computing the Overall Score

You will now build a function to generate the overall score for a particular sample. As mentioned in the lectures, you need to compare each sample with all other samples. For instance, if we generated 30 sentences, we will need to compare sentence 1 to sentences 2 through 30. Then, we compare sentence 2 to sentences 1 and 3 through 30, and so forth. At each step, we get the average score of all comparisons to get the overall score for a particular sample. To illustrate, these will be the steps to generate the scores of a 4-sample list.

  • Get similarity score between sample 1 and sample 2
  • Get similarity score between sample 1 and sample 3
  • Get similarity score between sample 1 and sample 4
  • Get average score of the first 3 steps. This will be the overall score of sample 1
  • Iterate and repeat until samples 1 to 4 have overall scores.

The results will be stored in a dictionary for easy lookups.

Exercise 7 - average_overlap

Complete the average_overlap function below which should implement the process described above:

# GRADED FUNCTION: average_overlap
def average_overlap(samples, similarity_fn):
    """Computes the arithmetic mean of each candidate sentence in the samples

    Args:
        samples (list[list[int]]): Tokenized version of translated sentences
        similarity_fn (Function): Similarity function used to compute the overlap

    Returns:
        dict[int, float]: A dictionary mapping the index of each translation to its score
    """
    # Initialize dictionary
    scores = {}
    
    # Iterate through all samples (enumerate helps keep track of indexes)
    for index_candidate, candidate in enumerate(samples):    
        
        ### START CODE HERE ###
                
        # Initially overlap is zero
        overlap = 0
        
        # Iterate through all samples (enumerate helps keep track of indexes)
        for index_sample, sample in enumerate(samples):

            # Skip if the candidate index is the same as the sample index
            if index_candidate == index_sample:
                continue
                
            # Get the overlap between candidate and sample using the similarity function
            sample_overlap = similarity_fn(candidate, sample)
            
            # Add the sample overlap to the total overlap
            overlap += sample_overlap

        ### END CODE HERE ###
        
        # Get the score for the candidate by computing the average
        score = overlap / (len(samples) - 1)

        # Only use 3 decimal points
        score = round(score, 3)
        
        # Save the score in the dictionary. use index as the key.
        scores[index_candidate] = score
        
    return scores
# Test with Jaccard similarity

l1 = [1, 2, 3]
l2 = [1, 2, 4]
l3 = [1, 2, 4, 5]

avg_ovlp = average_overlap([l1, l2, l3], jaccard_similarity)

print(f"average overlap between lists: {l1}, {l2} and {l3} using Jaccard similarity is:\n\n{avg_ovlp}")

Output

average overlap between lists: [1, 2, 3], [1, 2, 4] and [1, 2, 4, 5] using Jaccard similarity is:

{0: 0.45, 1: 0.625, 2: 0.575}

Expected Output

average overlap between lists: [1, 2, 3], [1, 2, 4] and [1, 2, 4, 5] using Jaccard similarity is:

{0: 0.45, 1: 0.625, 2: 0.575}
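These values can be checked by hand. Writing $J$ for the Jaccard similarity and $l_1, l_2, l_3$ for the three lists, each score is the mean of the two pairwise similarities that involve that list:

$$
\begin{aligned}
J(l_1, l_2) &= \frac{|\{1,2\}|}{|\{1,2,3,4\}|} = 0.5, \quad
J(l_1, l_3) = \frac{|\{1,2\}|}{|\{1,2,3,4,5\}|} = 0.4, \quad
J(l_2, l_3) = \frac{|\{1,2,4\}|}{|\{1,2,4,5\}|} = 0.75 \\
\text{score}_0 &= \frac{0.5 + 0.4}{2} = 0.45, \quad
\text{score}_1 = \frac{0.5 + 0.75}{2} = 0.625, \quad
\text{score}_2 = \frac{0.4 + 0.75}{2} = 0.575
\end{aligned}
$$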
# Test with Rouge1 similarity

l1 = [1, 2, 3]
l2 = [1, 4]
l3 = [1, 2, 4, 5]
l4 = [5,6]

avg_ovlp = average_overlap([l1, l2, l3, l4], rouge1_similarity)

print(f"average overlap between lists: {l1}, {l2}, {l3} and {l4} using Rouge1 similarity is:\n\n{avg_ovlp}")

Output

average overlap between lists: [1, 2, 3], [1, 4], [1, 2, 4, 5] and [5, 6] using Rouge1 similarity is:

{0: 0.324, 1: 0.356, 2: 0.524, 3: 0.111}

Expected Output

average overlap between lists: [1, 2, 3], [1, 4], [1, 2, 4, 5] and [5, 6] using Rouge1 similarity is:

{0: 0.324, 1: 0.356, 2: 0.524, 3: 0.111}
w1_unittest.test_average_overlap(average_overlap)

Output

All tests passed!

In practice, it is also common to see the weighted mean being used to calculate the overall score instead of just the arithmetic mean. This is implemented in the weighted_avg_overlap function below and you can use it in your experiments to see which one will give better results:

def weighted_avg_overlap(samples, log_probs, similarity_fn):
    
    # Scores dictionary
    scores = {}
    
    # Iterate over the samples
    for index_candidate, candidate in enumerate(samples):    
        
        # Initialize overlap and weighted sum
        overlap, weight_sum = 0.0, 0.0
        
        # Iterate over all samples and log probabilities
        for index_sample, (sample, logp) in enumerate(zip(samples, log_probs)):

            # Skip if the candidate index is the same as the sample index            
            if index_candidate == index_sample:
                continue
                
            # Convert log probability to linear scale
            sample_p = float(np.exp(logp))

            # Update the weighted sum
            weight_sum += sample_p

            # Get the unigram overlap between candidate and sample
            sample_overlap = similarity_fn(candidate, sample)
            
            # Update the overlap
            overlap += sample_p * sample_overlap
            
        # Compute the score for the candidate
        score = overlap / weight_sum

        # Only use 3 decimal points
        score = round(score, 3)
        
        # Save the score in the dictionary. use index as the key.
        scores[index_candidate] = score
    
    return scores
l1 = [1, 2, 3]
l2 = [1, 2, 4]
l3 = [1, 2, 4, 5]
log_probs = [0.4, 0.2, 0.5]

w_avg_ovlp = weighted_avg_overlap([l1, l2, l3], log_probs, jaccard_similarity)

print(f"weighted average overlap using Jaccard similarity is:\n\n{w_avg_ovlp}")

Output

weighted average overlap using Jaccard similarity is:

{0: 0.443, 1: 0.631, 2: 0.558}
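A quick numeric check of the first score, as a sketch: each weight is the exponential of the corresponding entry in log_probs, and the candidate's own entry is skipped. The Jaccard values 0.5 and 0.4 are those of l1 against l2 and l3. Note that the example values 0.4, 0.2 and 0.5 are positive, so here they act as arbitrary weights rather than true log-probabilities, which are never above 0.

```python
import math

log_probs = [0.4, 0.2, 0.5]
weights = [math.exp(lp) for lp in log_probs]

# Candidate 0 is compared with samples 1 and 2 only
j_01, j_02 = 0.5, 0.4
score_0 = (weights[1] * j_01 + weights[2] * j_02) / (weights[1] + weights[2])

print(round(score_0, 3))  # 0.443
```

Because sample 2 carries the larger weight and is the less similar of the two, the weighted score falls slightly below the unweighted mean of 0.45.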

mbr_decode

You will now put everything together in the mbr_decode function below. This final step is not graded as this function is just a wrapper around all the cool stuff you have coded so far!

You can use it to play around, trying different numbers of samples, temperatures and similarity functions!

def mbr_decode(model, text, n_samples=5, temperature=0.6, similarity_fn=jaccard_similarity):
    
    # Generate samples
    samples, log_probs = generate_samples(model, text, n_samples=n_samples, temperature=temperature)
    
    # Compute the overlap scores
    scores = weighted_avg_overlap(samples, log_probs, similarity_fn)

    # Decode samples
    decoded_translations = [tokens_to_text(s, id_to_word).numpy().decode('utf-8') for s in samples]
    
    # Find the key with the highest score
    max_score_key = max(scores, key=lambda k: scores[k])
    
    # Get the translation 
    translation = decoded_translations[max_score_key]
    
    return translation, decoded_translations
english_sentence = "I love languages"

translation, candidates = mbr_decode(trained_translator, english_sentence, n_samples=10, temperature=0.6)

print("Translation candidates:")
for c in candidates:
    print(c)

print(f"\nSelected translation: {translation}")

Output

Translation candidates:
eu adoro idiomas ja .
eu eu amo idiomas e de vida .
eu adoro idiomas ja a espera .
eu amo idiomas a gente .
eu adoro idiomas em vida .
eu adoro idiomas ja .
eu adoro idiomas ja .
eu adoro idiomas ja .
eu adoro idiomas as vezes .
eu adoro idiomas ja

Selected translation: eu adoro idiomas ja .
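To see why the repeated candidate wins, here is a simplified sketch that re-scores the ten candidate strings printed above. It is not what mbr_decode does internally: it splits on whitespace instead of using token ids, and it uses an unweighted mean of Jaccard similarities because the log-probabilities of this run were not printed. The helper names `jaccard` and `mean_overlap` are hypothetical.

```python
candidates = [
    "eu adoro idiomas ja .",
    "eu eu amo idiomas e de vida .",
    "eu adoro idiomas ja a espera .",
    "eu amo idiomas a gente .",
    "eu adoro idiomas em vida .",
    "eu adoro idiomas ja .",
    "eu adoro idiomas ja .",
    "eu adoro idiomas ja .",
    "eu adoro idiomas as vezes .",
    "eu adoro idiomas ja",
]

def jaccard(a, b):
    # Intersection over union of the unique words
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def mean_overlap(index, samples):
    # Average similarity of one sample against all the others
    others = [s for j, s in enumerate(samples) if j != index]
    return sum(jaccard(samples[index], o) for o in others) / len(others)

tokenized = [c.split() for c in candidates]
scores = [mean_overlap(i, tokenized) for i in range(len(tokenized))]
best = max(range(len(scores)), key=scores.__getitem__)

print(candidates[best])  # eu adoro idiomas ja .
```

A candidate that appears several times scores 1.0 against each of its copies, which pulls its average above the more unusual samples. This is the consensus effect that MBR relies on.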

Congratulations! Next week, you’ll dive deeper into attention models and study the Transformer architecture. You will build another network but without the recurrent part. It will show that attention is all you need! It should be fun!

Keep up the good work!

Grades

[Image: grades screenshot]

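Afterword

I finished the Week 1 lab at 21:29 on March 30, 2024; it took one day. This fourth course begins the study of attention, and next week covers the Transformer, which I am looking forward to.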
