Course Certificate
These are my study notes for the course Natural Language Processing with Attention Models. If anything here infringes copyright, please contact me and I will remove it.
Discover some of the shortcomings of a traditional seq2seq model and how to solve for them by adding an attention mechanism, then build a Neural Machine Translation model with Attention that translates English sentences into German.
Good to see you again. You will now learn about
neural machine translation, and you’ll see what
the architecture of this neural
network looks like. You will also learn which words the neural network
is focusing on when translating from
one language to another. Let’s formalize this task. To get started on
this week’s material, I’ll introduce you to neural machine
translation along with the model that was traditionally used for its implementation: the seq2seq model. Then, I’ll talk about some
of this model’s shortcomings and the solutions, as they
lead into the model that you’ll be using in
this week’s assignments. Exciting stuff. Let’s go.
In neural machine translation, you’re using an
encoder and a decoder to translate from one
language to another. For example, you
could translate “It’s time for tea” from English to French: “C’est l’heure du thé.” To do this, you could use a
machine translation system that has LSTMs for both
encoding and decoding. The traditional seq2seq
model was introduced by Google in 2014 and it was a revelation
at the time. Basically, it works by
taking one sequence of items, such as words, and
outputting another sequence. The way this is
done is by mapping variable length sequences
to a fixed length memory, which in machine translation, encodes the overall
meaning of sentences.
For example, you can have a text of varying length and encode
it into a vector of fixed dimension,
like 300. This feature is what’s made this model a powerhouse
for machine translation. Additionally, the
inputs and outputs don’t need to have
matching lengths, which is a desirable feature
when translating texts. Then you might recall the
vanishing and exploding gradients problems from
earlier in the specialization. In seq2seq models, LSTMs and GRUs are typically
used to avoid these problems. As I mentioned, in
a seq2seq model, you have an encoder
and a decoder.
The encoder takes
word tokens as input, and it returns its final
hidden state as output. This hidden state is
used by the decoder to generate the translated sentence
in the target language. Before moving on, let’s look closer at the
encoder and decoder. The encoder typically consists
of an embedding layer and an LSTM module with
one or more layers. The embedding layer
transforms word tokens into vectors for
input to the LSTM module. At each step in the
input sequence, the LSTM module receives inputs
from the embedding layer, as well as the hidden states
from the previous step. The encoder returns the hidden
states of the final step, shown here as h_4. This final hidden
state has information from the whole sentence and it encodes its
overall meaning.
The decoder is constructed similarly with an embedding
layer and an LSTM layer. You use the output
word of a step as the input word
for the next step. You also pass the LSTM hidden
state to the next step. You start the input sequence with a start-of-sequence token, denoted as SOS here. The first step outputs C’est as the most probable next word. Then you use C’est as the
input word for the next step and repeat to generate the rest of the sentence,
l’heure du thé.
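The decoding loop described above can be sketched in NumPy. This is a minimal sketch rather than the course's implementation: the toy vocabulary, the random untrained weights, and the helper `rnn_step` (a plain tanh RNN cell standing in for the LSTM) are all assumptions made for illustration, so the generated words are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; index 0 is the start-of-sequence token.
vocab = ["<SOS>", "<EOS>", "c'est", "l'heure", "du", "thé"]
vocab_size, embed_size, hidden_size = len(vocab), 8, 16

# Untrained random parameters, for illustration only.
embeddings = rng.normal(size=(vocab_size, embed_size))
W_x = rng.normal(size=(embed_size, hidden_size))
W_h = rng.normal(size=(hidden_size, hidden_size))
W_out = rng.normal(size=(hidden_size, vocab_size))


def rnn_step(x, h):
    """One step of a plain tanh RNN cell (a stand-in for the LSTM)."""
    return np.tanh(x @ W_x + h @ W_h)


def greedy_decode(encoder_final_state, max_len=10):
    """Start from <SOS> and feed each predicted word back in as the next input."""
    h = encoder_final_state          # decoder starts from the encoder's final state
    token = vocab.index("<SOS>")
    output = []
    for _ in range(max_len):
        h = rnn_step(embeddings[token], h)
        token = int(np.argmax(h @ W_out))  # most probable next word
        if vocab[token] == "<EOS>":
            break
        output.append(vocab[token])
    return output


decoded = greedy_decode(rng.normal(size=hidden_size))
print(decoded)
```

Note that the only information the decoder receives about the input sentence is `encoder_final_state`, a single fixed-size vector.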
One major limitation of the
traditional seq2seq model is what’s referred to as
the information bottleneck. Since seq2seq uses a
fixed length memory for the hidden states, long sequences
become problematic. This is due to the fact that in traditional
seq2seq models, only a fixed amount of
information can be passed from the encoder to
the decoder no matter how much information is
contained in the input sequence. The power of seq2seq, which allows inputs and outputs to be different sizes, becomes ineffective when
the input sequence is long. The result is lower
model performance as sequence size increases,
and that’s no good.
The issue with having one fixed-size encoder hidden state is that it struggles to compress longer sequences and it
ends up throttling itself and punishing the decoder who only wants to make
a good prediction. One workaround is to use
the encoder hidden states for each word instead of trying to smash it all into
one big vector. But this model would have flaws
with memory and contexts. How could you build a time
and memory efficient model that predicts accurately
from a long sequence? This becomes possible if the
model has a way to select and focus on the most important
words at each time step. You can think of this as giving the model a new layer to
process this information, which in the slide
is called attention. If you provide the information specific to each input word, you can give the
model a way to focus its attention in
the right place at each step of the
decoding process. That is good progress.
Up next, you’ll get
a conceptual idea of what this new layer
is doing and why. You now have an overview of
neural machine translation, and you have a rough idea of what attention
is looking like. You know which words the
model is focusing on when translating from one
language to another language.
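Seq2Seq is a sequence-to-sequence model commonly used for natural language processing tasks such as machine translation and text summarization. It consists of two main parts: an encoder and a decoder.
Encoder: takes the input sequence and converts it into a hidden state vector. The encoder usually uses a recurrent neural network (RNN) or a variant, such as a long short-term memory network (LSTM) or a gated recurrent unit (GRU), to process the input sequence and capture its information.
Decoder: takes the hidden state vector produced by the encoder and uses it to generate the output sequence. The decoder is usually also a recurrent neural network; it predicts the next token from the incoming hidden state and the previously generated tokens. During training, the decoder generates the sequence by passing the correct target token to the next time step. At inference time, it generates each token from the token produced at the previous time step, until it emits a special end token or reaches the maximum output length.
Seq2Seq models have been widely used for many tasks; their flexibility and power have made them an important tool in natural language processing.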
Welcome. Attention is a
very important concept; it lets the model focus
on the most relevant inputs whenever
making a prediction. For example, when translating one paragraph from
English to French, you can focus on translating one sentence at a
time or even more, a couple of words at a time. Let’s dive into this concept. What we call attention
now was introduced in a landmark paper from
Dzmitry Bahdanau, KyungHyun Cho, and
Yoshua Bengio. The authors developed a method to fix the seq2seq model’s inability to translate
longer sentences. As you can see, attention was originally developed for
machine translation, but it has since been used in many other domains
with great success. Before we move forward, I want to skip ahead
a bit and show you how well attention works. It’s surprising.
https://arxiv.org/abs/1409.0473
Title: Neural Machine Translation by Jointly Learning to Align and Translate
Here’s a comparison of
the performance between different models from
the Bahdanau paper using the BLEU score, a performance metric that
you’ll learn about later. In brief, higher
scores are better, indicating more
correct translations. The dashed lines show the scores for
bidirectional seq2seq models as the length of the input
sentence increases. The 30 and 50 denote the maximum sequence length
used to train the models. As you can see, the seq2seq models perform well for sentences with
about 10-20 words, but they fall off beyond that. This is what you should expect, since a seq2seq model
must store the meaning of the entire input sequence
in a single vector. The models developed
in this paper, RNNsearch-30 and RNNsearch-50, use bidirectional encoders and decoders, but with attention. First, these models
perform better than the traditional seq2seq models across all
sentence lengths. The RNNsearch-50 model has basically no fall-off in performance as sentence
lengths increase. As you will see, this is because the models are able to focus on specific inputs to predict words in the output translation, instead of having to memorize
the entire input sentence.
Now I’ll show you the motivation behind attention
and how it works. Traditional seq2seq models use the final hidden state of the encoder as the initial
hidden state of the decoder. This forces the encoder
to store the meaning of the entire input sequence
in this one hidden state.
Instead of using only
the final hidden states, you can pass all the hidden
states to the decoder. However, this quickly
becomes inefficient as you must retain the
hidden states for each input step in memory. To solve this, you can combine the hidden
states into one vector, typically called
the context vector. The simplest operation here
is point-wise addition. Since the hidden vectors
are all the same size, you can just add up
these vectors element by element to produce another
vector of the same size. But now the decoder is getting information
about each step. But it really only
needs information from the first few input steps to predict the first word. This isn’t that much
different from using the last hidden states
from LSTM or GRU.
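Point-wise addition of the hidden states can be illustrated with a small worked example. The numbers below are made up for illustration; the point is only that the summed context vector has the same size as a single hidden state and that every input step contributes equally.

```python
import numpy as np

# Hypothetical encoder hidden states: 4 input steps, hidden size 3.
hidden_states = np.array([
    [1.0, 0.0, 2.0],
    [0.5, 1.0, 0.0],
    [0.0, 2.0, 1.0],
    [1.5, 1.0, 1.0],
])

# Point-wise addition: sum the vectors element by element over the step axis.
context = hidden_states.sum(axis=0)
print(context)        # [3. 4. 4.]
print(context.shape)  # (3,) -- same size as one hidden state

# Reversing the order of the steps gives the same context vector,
# because an unweighted sum treats every step identically.
context_reversed = hidden_states[::-1].sum(axis=0)
```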
The solution here is to weight certain encoder vectors more than others before the
point-wise addition. Vectors that are
more important for the next decoder output
would have larger weights. This way, the
context vector holds more information about the most important words and less information
about other words. But how are these
weights calculated to determine which input words
are important at each step? The decoder’s previous
hidden state, denoted as s_{i-1}, contains information
about the previous words in the output translation. This means you can compare
the decoder state with each encoder state to determine the most
important inputs. Intuitively, the decoder can
set the weights such that it focuses on only the
most important input words for the next prediction; it decides which parts of the input sequence
to pay attention to.
Now step into the
attention layer to examine how the weights and context
vector are calculated. The goal of the attention
layer is to return a context vector that contains the relevant information
from the encoder states.
The first step is to
calculate the alignments, e_ij, which is a
score of how well the inputs around j match
the expected output at i. The better the match, the higher the score you would expect. This is done using a
feedforward neural network with the encoder and decoder
hidden states as inputs, where the weights for the
feedforward network are learned along with the rest
of the seq to seq model. The scores are then
turned into weights which range from zero to one
using the softmax function. This means the weights
can be thought of as a probability distribution
which sum to one.
Finally, each encoder
state is multiplied by its respective weight and summed together into one
context vector. Since the weights are a
probability distribution, this is equivalent
to calculating an expected value
across word alignments.
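The softmax and weighted-sum steps can be checked on a tiny example. This is a sketch with made-up alignment scores and two-dimensional encoder states (both are assumptions for illustration), showing that the context vector is the expected value of the encoder states under the attention weights.

```python
import numpy as np

# Hypothetical alignment scores for one decoder step over 3 encoder steps.
scores = np.array([2.0, 1.0, 0.1])

# Softmax turns the scores into weights in (0, 1) that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

# Hypothetical encoder hidden states, one row per input step.
encoder_states = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

# Context vector: weighted sum of the encoder states.
context = weights @ encoder_states
print(weights, weights.sum())
print(context)
```

The step with the highest score receives the largest weight, so its hidden state dominates the context vector.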
Next up, you’ll get a better understanding
of how all this works by implementing a simple version of the attention operation
from the Bahdanau paper. I have now shown
you how attention works and why it is important. In the next video, I will define what keys, queries, and values are, and show you how to use
them in attention.
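An improved version of the Seq2Seq model is the Seq2Seq model with an attention mechanism. In a traditional Seq2Seq model, the encoder encodes the entire input sequence into one fixed-length vector, and the decoder uses that vector to generate the output sequence. This fixed-length representation can lose important information from the input sequence, especially when the input is long.
Introducing an attention mechanism addresses this problem. Attention lets the decoder "attend to" different parts of the input sequence each time it generates an output token, assigning different attention weights as needed. The decoder can therefore focus dynamically and selectively on different parts of the input, depending on the token it is about to generate, and so capture the important information in the input sequence more effectively.
Specifically, a Seq2Seq model with attention has the following key components:
Encoder: as in the traditional Seq2Seq model, it encodes the input sequence into a series of hidden state vectors.
Decoder: as in the traditional Seq2Seq model, it uses the encoder's final hidden state vector as its initial hidden state and generates the output sequence.
Attention mechanism: at each decoder time step, it computes attention weights that are used to weight the encoder's hidden state vectors and produce a context vector. This context vector is combined with the decoder's current hidden state vector to generate the output for the current time step.
Seq2Seq models with attention usually perform better on long sequences and at capturing local dependencies within a sequence, because they can attend to different parts of the input as needed while generating each output token. This has made them one of the preferred models for many sequence-to-sequence tasks, such as machine translation and text summarization.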
Basic Attention Operation: Ungraded Lab
As you’ve learned, attention allows a seq2seq decoder to use information from each encoder step instead of just the final encoder hidden state. In the attention operation, the encoder outputs are weighted based on the decoder hidden state, then combined into one context vector. This vector is then used as input to the decoder to predict the next output step.
In this ungraded lab, you’ll implement a basic attention operation as described in Bahdanau, et al. (2014) using NumPy.
This is a practice notebook, where you can practice writing your code. All of the solutions are provided at the end of the notebook.
# Import the libraries and define the functions you will need for this lab
import numpy as np
def softmax(x, axis=0):
    """ Calculate softmax function for an array x along specified axis
        axis=0 calculates softmax across rows which means each column sums to 1
        axis=1 calculates softmax across columns which means each row sums to 1
    """
    return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)
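To make the `axis` argument concrete, here is a small usage sketch of the lab's `softmax` on a made-up 2x3 array. It also includes `stable_softmax`, a hypothetical variant that is not part of the lab: it subtracts the maximum before exponentiating, a common way to avoid overflow for large inputs while giving the same result.

```python
import numpy as np


def softmax(x, axis=0):
    """Softmax as defined in the lab."""
    return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)


def stable_softmax(x, axis=0):
    """Hypothetical variant: shift by the max along `axis` before exponentiating."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)


x = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 1.0]])

col_softmax = softmax(x, axis=0)  # each column sums to 1
row_softmax = softmax(x, axis=1)  # each row sums to 1

print(col_softmax.sum(axis=0))
print(row_softmax.sum(axis=1))

# The shifted version agrees with the lab's version on ordinary inputs
# and stays finite on inputs that would overflow np.exp.
large = stable_softmax(np.array([[1000.0, 1001.0]]), axis=1)
print(large)
```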
The first step is to calculate the alignment scores. This is a measure of similarity between the decoder hidden state and each encoder hidden state. From the paper, this operation looks like
$$e_{ij} = v_a^\top \tanh\left(W_a s_{i-1} + U_a h_j\right)$$
where $W_a \in \mathbb{R}^{n \times m}$, $U_a \in \mathbb{R}^{n \times m}$, and $v_a \in \mathbb{R}^m$ are the weight matrices and $n$ is the hidden state size. In practice, this is implemented as a feedforward neural network with two layers, where $m$ is the size of the layers in the alignment network. It looks something like:
Here $h_j$ are the encoder hidden states for each input step $j$ and $s_{i-1}$ is the decoder hidden state of the previous step. The first layer corresponds to $W_a$ and $U_a$, while the second layer corresponds to $v_a$.
To implement this, first concatenate the encoder and decoder hidden states to produce an array with size $K \times 2n$ where $K$ is the number of encoder states/steps. For this, use np.concatenate. Note that there is only one decoder state so you’ll need to reshape it to successfully concatenate the arrays. The easiest way is to use decoder_state.repeat to match the hidden state array size.
Then, apply the first layer as a matrix multiplication between the weights and the concatenated input. Use the tanh function to get the activations. Finally, compute the matrix multiplication of the second layer weights and the activations. This returns the alignment scores.
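It may help to see why concatenating the states and applying one weight matrix matches the formula from the paper. The sketch below uses small made-up sizes and random weights (all assumptions for illustration), and treats states as row vectors so that the $n \times m$ matrices multiply from the right. Stacking `U_a` on top of `W_a` gives a single first-layer matrix of shape `(2n, m)`.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, m = 3, 4, 5  # hypothetical: 3 encoder steps, hidden size 4, layer size 5

h = rng.normal(size=(K, n))    # encoder states h_j, one per row
s = rng.normal(size=(1, n))    # previous decoder state s_{i-1}
U_a = rng.normal(size=(n, m))
W_a = rng.normal(size=(n, m))
v_a = rng.normal(size=(m, 1))

# Form from the paper: tanh(W_a s + U_a h_j), then project with v_a.
# Broadcasting adds the single decoder row to every encoder row.
scores_paper = np.tanh(s @ W_a + h @ U_a) @ v_a

# Form used in the lab: concatenate, then one matrix multiplication per layer.
inputs = np.concatenate((h, s.repeat(K, axis=0)), axis=1)  # shape (K, 2n)
layer_1 = np.vstack((U_a, W_a))                            # shape (2n, m)
scores_lab = np.tanh(inputs @ layer_1) @ v_a

print(inputs.shape, scores_lab.shape)
print(np.allclose(scores_paper, scores_lab))
```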
hidden_size = 16
attention_size = 10
input_length = 5

np.random.seed(42)

# Synthetic vectors used to test
encoder_states = np.random.randn(input_length, hidden_size)
decoder_state = np.random.randn(1, hidden_size)
#print(decoder_state.repeat(input_length, axis=0))

# Weights for the neural network, these are typically learned through training
# Use these in the alignment function below as the layer weights
layer_1 = np.random.randn(2 * hidden_size, attention_size)
layer_2 = np.random.randn(attention_size, 1)

# Implement this function. Replace None with your code. Solution at the bottom of the notebook
def alignment(encoder_states, decoder_state):
    # First, concatenate the encoder states and the decoder state
    inputs = np.concatenate((encoder_states, decoder_state.repeat(input_length, axis=0)), axis=1)
    assert inputs.shape == (input_length, 2 * hidden_size)

    # Matrix multiplication of the concatenated inputs and layer_1, with tanh activation
    activations = np.tanh(np.dot(inputs, layer_1))
    assert activations.shape == (input_length, attention_size)

    # Matrix multiplication of the activations with layer_2. Remember that you don't need tanh here
    scores = np.dot(activations, layer_2)
    assert scores.shape == (input_length, 1)

    return scores
# Run this to test your alignment function
scores = alignment(encoder_states, decoder_state)
print(scores)
Output
[[4.35790943]
[5.92373433]
[4.18673175]
[2.11437202]
[0.95767155]]
If you implemented the function correctly, you should get these scores:
[[4.35790943]
[5.92373433]
[4.18673175]
[2.11437202]
[0.95767155]]
The next step is to calculate the weights from the alignment scores. These weights determine the encoder outputs that are the most important for the decoder output. These weights should be between 0 and 1. You can use the softmax function (which is already implemented above) to get these weights from the attention scores. Pass the attention scores vector to the softmax function to get the weights. Mathematically,
$$\alpha_{ij} = \frac{\exp\left(e_{ij}\right)}{\sum_{k=1}^{K} \exp\left(e_{ik}\right)}$$
The weights tell you the importance of each input word with respect to the decoder state. In this step, you use the weights to modulate the magnitude of the encoder vectors. Words with little importance will be scaled down relative to important words. Multiply each encoder vector by its respective weight to get the alignment vectors, then sum up the weighted alignment vectors to get the context vector. Mathematically,
$$c_i = \sum_{j=1}^{K} \alpha_{ij} h_j$$
Implement these steps in the attention function below.
# Implement this function. Replace None with your code.
def attention(encoder_states, decoder_state):
    """ Example function that calculates attention, returns the context vector

        Arguments:
        encoder_states: NxM numpy array, where N is the number of vectors and M is the vector length
        decoder_state: 1xM numpy array, M is the vector length, must be the same M as encoder_states
    """
    # First, calculate the alignment scores
    scores = alignment(encoder_states, decoder_state)

    # Then take the softmax of the alignment scores to get a weight distribution
    weights = softmax(scores)  # 5x1

    # Multiply each encoder state by its respective weight
    # Broadcasting: element-wise product of 5x16 and 5x1; the latter is broadcast to 5x16
    weighted_scores = encoder_states * weights
    print(weighted_scores.shape)
    #print(weighted_scores)

    # Sum up weighted alignment vectors to get the context vector and return it
    context = np.sum(weighted_scores, axis=0)
    return context

context_vector = attention(encoder_states, decoder_state)
print(context_vector)
Output
(5, 16)
[-0.63514569 0.04917298 -0.43930867 -0.9268003 1.01903919 -0.43181409
0.13365099 -0.84746874 -0.37572203 0.18279832 -0.90452701 0.17872958
-0.58015282 -0.58294027 -0.75457577 1.32985756]
If you implemented the attention function correctly, the context vector should be
[-0.63514569 0.04917298 -0.43930867 -0.9268003 1.01903919 -0.43181409
0.13365099 -0.84746874 -0.37572203 0.18279832 -0.90452701 0.17872958
-0.58015282 -0.58294027 -0.75457577 1.32985756]
# Solution
def alignment(encoder_states, decoder_state):
    # First, concatenate the encoder states and the decoder state.
    inputs = np.concatenate((encoder_states, decoder_state.repeat(input_length, axis=0)), axis=1)
    assert inputs.shape == (input_length, 2*hidden_size)

    # Matrix multiplication of the concatenated inputs and the first layer, with tanh activation
    activations = np.tanh(np.matmul(inputs, layer_1))
    assert activations.shape == (input_length, attention_size)

    # Matrix multiplication of the activations with the second layer. Remember that you don't need tanh here
    scores = np.matmul(activations, layer_2)
    assert scores.shape == (input_length, 1)

    return scores

# Run this to test your alignment function
scores = alignment(encoder_states, decoder_state)
print(scores)

# Solution
def attention(encoder_states, decoder_state):
    """ Example function that calculates attention, returns the context vector

        Arguments:
        encoder_states: NxM numpy array, where N is the number of vectors and M is the vector length
        decoder_state: 1xM numpy array, M is the vector length, must be the same M as encoder_states
    """
    # First, calculate the alignment scores between the decoder state and each encoder state
    scores = alignment(encoder_states, decoder_state)

    # Then take the softmax of those scores to get a weight distribution
    weights = softmax(scores)

    # Multiply each encoder state by its respective weight
    weighted_scores = encoder_states * weights

    # Sum up the weighted encoder states
    context = np.sum(weighted_scores, axis=0)
    return context

context_vector = attention(encoder_states, decoder_state)
print(context_vector)
Recurrent models typically take in a sequence in the order it is written and use that to output a sequence. Each element in the sequence is associated with its step in computation time $t$ (i.e., if a word is the third element, it will be computed at $t_3$). These models generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$.
The sequential nature of models you learned in the previous course (RNNs, LSTMs, GRUs) does not allow for parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. In other words, if you rely on sequences and you need to know the beginning of a text before being able to compute something about the ending of it, then you can not use parallel computing. You would have to wait until the initial computations are complete. This is not good, because if your text is too long, then 1) it will take a long time for you to process it and 2) you will lose a good amount of information mentioned earlier in the text as you approach the end.
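The contrast between sequential and parallel computation can be sketched in a few lines. The recurrence below is a plain tanh RNN with random untrained weights (an assumption for illustration, not one of the course models): each hidden state needs the previous one, so the loop cannot be skipped. The pairwise similarity scores that attention relies on, by contrast, come out of a single matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, input_size, hidden_size = 6, 4, 5  # hypothetical sizes

inputs = rng.normal(size=(seq_len, input_size))
W_x = rng.normal(size=(input_size, hidden_size))
W_h = rng.normal(size=(hidden_size, hidden_size))

# Recurrent computation: h_t depends on h_{t-1}, so the steps must run in order.
h = np.zeros(hidden_size)
hidden_states = []
for x_t in inputs:
    h = np.tanh(x_t @ W_x + h @ W_h)
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)

# Attention-style computation: similarity between every pair of positions
# is obtained at once, with no dependence between steps.
pairwise_scores = inputs @ inputs.T

print(hidden_states.shape)    # (6, 5)
print(pairwise_scores.shape)  # (6, 6)
```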
Therefore, attention mechanisms have become critical for sequence modeling in various tasks, allowing modeling of dependencies without caring too much about their distance in the input or output sequences.
In this course, you will learn about these attention mechanisms and see how they are implemented. Welcome to Course 4!
Queries, keys and values are terms
that you will be using for attention in this video. I will define them for you and
show you how they could be used. Let’s get started. The original attention paper
was published in 2014. Since then there have been multiple
variations on attention with some models that don’t rely on
recurrent neural networks. For example, the 2017 paper Attention Is All You Need
introduced the Transformer model and a form of attention based on information
retrieval, using queries, keys, and values. This is an efficient and powerful form
of attention that you’ll be using in this week’s assignment. In this video, I’ll show you how this type of attention
works as well as the concept of alignments between languages. Conceptually, you can think of keys and
values as a look up table. The query is matched to a key and the value associated with
that key is returned. For example,
if we are translating between French and English, heure matches with time. So we’d like to get the value for
time. In practice, the queries, keys, and
values are all represented by vectors, embedding vectors for example.
Due to this, you don’t get exact matches
but the model can learn which words are the most similar between
the source and target languages. The similarity between
words is called alignment. The query and key vectors are used
to calculate alignment scores that are measures of how well the query and
keys match. These alignment scores are then
turned into weights used for a weighted sum of the value vectors. This weighted sum of the value vectors
is returned as the attention vector.
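The lookup-table analogy can be made concrete with a toy example. The keys, values, and query below are made up for illustration. Unlike a dictionary lookup, which returns exactly one value, attention returns a blend of all values, dominated by the value whose key best matches the query.

```python
import numpy as np

# Hypothetical toy table with three entries (one key and one value per row).
keys = np.eye(3)
values = np.array([[10.0], [20.0], [30.0]])

# A query that points in the same direction as the first key.
query = np.array([4.0, 0.0, 0.0])

scores = keys @ query                             # alignment score for each key
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the keys
attended = weights @ values                       # weighted sum of the values

print(weights)
print(attended)
```

The result is close to the first value but not equal to it, because the other two entries still receive a small share of the weight.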
This process can be performed
using scaled dot-product attention. The queries for each step are packed
together into a matrix Q, so attention can be computed
simultaneously for each query. The keys and values are also
packed into matrices K and V. These matrices are the inputs for the
attention function, shown as a diagram on the left and mathematically on the right. First, the queries and keys matrices are multiplied together
to get a matrix of alignment scores. These are then scaled by the square
root of the key vector dimension, d_k. The scaling improves
the model performance for larger model sizes and could be
seen as a regularization constant. Next, the scaled scores are converted to
weights using the softmax function, such that the weights for
each query sum to one. Finally, the weights and the value matrices
are multiplied to get the attention vectors for each query. You can think of
the keys and the values as being the same. So when you multiply the softmax
output with V you are taking a linear combination of your initial input which
is then being fed to the decoder. Take a minute to make sure
what I just said makes sense.
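One way to see what the scaling does is to compare the softmax weights with and without it. This is a sketch on random vectors (the sizes and data are assumptions for illustration): when the key dimension is large, raw dot products tend to be large in magnitude, and the softmax then puts almost all of its weight on a single key. Dividing by the square root of the key dimension flattens the distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 256  # hypothetical key dimension
queries = rng.normal(size=(4, d_k))
keys = rng.normal(size=(6, d_k))


def softmax_rows(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


raw_scores = queries @ keys.T
unscaled_weights = softmax_rows(raw_scores)
scaled_weights = softmax_rows(raw_scores / np.sqrt(d_k))

# Largest weight per query: dividing the scores by a constant greater than 1
# can only lower (or keep) the largest softmax weight in each row.
print("unscaled:", unscaled_weights.max(axis=1).round(3))
print("scaled:  ", scaled_weights.max(axis=1).round(3))
```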
Note that unlike the original form of
attention, scaled dot-product attention consists of only two matrix
multiplications and no neural networks. Since matrix multiplication is highly
optimized in modern deep learning frameworks, this form of attention is
much faster to compute but this also means that the alignments
between the source and target languages must
be learned elsewhere. Typically, alignment is learned
in the input embeddings or in other linear layers
before the attention layer.
Before moving on,
I want to look a bit closer at alignment. The alignment weights form a matrix with
queries, or target words, on the rows and keys, or source words, on the columns. Each entry in this matrix is
the weight for the corresponding query-key pair. Word pairs that have similar
meanings, K and T, for example, will have larger weights than
dissimilar words like day and time. Through training, the model learns
which words have similar meanings and encodes that information in
the query and key vectors.
Learning alignment like
this is beneficial for translating between languages with
different grammatical structures. Since attention looks at the entire
input and target sentences at once and calculates alignments based on word pairs, weights are assigned appropriately
regardless of word order. For example, in the sentence “The
agreement on the European Economic Area was signed in August 1992” and its French
counterpart, “L’accord sur la zone économique européenne a été signé en août 1992,”
you can see that zone and Area are at different positions
yet have the same meaning.
appropriately, allowing the decoder to focus on the appropriate inputs
words despite different ordering.
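The independence from word order can be checked directly. In the sketch below, the random matrices and the helper `attention` are assumptions for illustration. Shuffling the source positions (keys and values together) leaves each query's attention vector unchanged, because each weight depends only on the query-key pair and not on where the key sits in the sentence.

```python
import numpy as np

rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 8))  # 3 target positions
keys = rng.normal(size=(5, 8))     # 5 source positions
values = rng.normal(size=(5, 8))


def attention(Q, K, V):
    """Scaled dot-product attention with a row-wise softmax."""
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V


# Reorder the source positions, moving each key together with its value.
perm = rng.permutation(len(keys))
original = attention(queries, keys, values)
shuffled = attention(queries, keys[perm], values[perm])

print(np.allclose(original, shuffled))
```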
Congrats on absorbing
all these new concepts. I introduced you to the purpose
of an attention layer. You saw how it is related with
information retrieval and I showed you how well it works even for
languages with very different structures. In the next video, I’ll be talking
about neural machine translation and show you what the setup looks like for
the system. I’ll show you what the data set looks
like and the steps required for preprocessing your data sets. You have now seen what keys, queries, and
values are. These are important because if
you read a research paper you might come across these terms and
you will understand them. In the next video, I will talk about the setup for
machine translation.
Scaled Dot-Product Attention: Ungraded Lab
The 2017 paper Attention Is All You Need introduced the Transformer model and scaled dot-product attention, sometimes also called QKV (Queries, Keys, Values) attention. Since then, Transformers have come to dominate large-scale natural language applications. Scaled dot-product attention can be used to improve seq2seq models as well. In this ungraded lab, you’ll implement a simplified version of scaled dot-product attention and replicate word alignment between English and French, as shown in Bahdanau, et al. (2014).
The Transformer model learns how to align words in different languages. You won’t be training any weights here, so instead you will use pre-trained aligned word embeddings from here. Run the cell below to load the embeddings and set up the rest of the notebook.
This is a practice notebook, where you can practice writing your code. All of the solutions are provided at the end of the notebook.
# Import the libraries
import pickle
import matplotlib.pyplot as plt
import numpy as np
# Load the word2int dictionaries
with open("./data/word2int_en.pkl", "rb") as f:
    en_words = pickle.load(f)
with open("./data/word2int_fr.pkl", "rb") as f:
    fr_words = pickle.load(f)
# Load the word embeddings
en_embeddings = np.load("./data/embeddings_en.npz")["embeddings"]
fr_embeddings = np.load("./data/embeddings_fr.npz")["embeddings"]
# Define some helper functions
def tokenize(sentence, token_mapping):
    tokenized = []
    for word in sentence.lower().split(" "):
        try:
            tokenized.append(token_mapping[word])
        except KeyError:
            # Using -1 to indicate an unknown word
            tokenized.append(-1)
    return tokenized

def embed(tokens, embeddings):
    embed_size = embeddings.shape[1]
    output = np.zeros((len(tokens), embed_size))
    for i, token in enumerate(tokens):
        if token == -1:
            output[i] = np.zeros((1, embed_size))
        else:
            output[i] = embeddings[token]
    return output
The scaled-dot product attention consists of two matrix multiplications and a softmax scaling as shown in the diagram below from Vaswani, et al. (2017). It takes three input matrices, the queries, keys, and values.
Mathematically, this is expressed as
$$\mathrm{Attention}\left(Q, K, V\right) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
where $Q$, $K$, and $V$ are the queries, keys, and values matrices respectively, and $d_k$ is the dimension of the keys. In practice, Q, K, and V all have the same dimensions. This form of attention is faster and more space-efficient than what you implemented before since it consists of only matrix multiplications instead of a learned feed-forward layer.
Conceptually, the first matrix multiplication is a measure of the similarity between the queries and the keys. This is transformed into weights using the softmax function. These weights are then applied to the values with the second matrix multiplication resulting in output attention vectors. Typically, decoder states are used as the queries while encoder states are the keys and values.
Implement the softmax function with Numpy and use it to calculate the weights from the queries and keys. Assume the queries and keys are 2D arrays (matrices). Note that since the dot-product of Q and K will be a matrix, you’ll need to calculate softmax over a specific axis. See the end of the notebook for solutions.
def softmax(x, axis=0):
    """ Calculate softmax function for an array x

        axis=0 calculates softmax across rows which means each column sums to 1
        axis=1 calculates softmax across columns which means each row sums to 1
    """
    # Replace pass with your code.
    y = np.exp(x)
    return y / np.expand_dims(np.sum(y, axis=axis), axis)

def calculate_weights(queries, keys):
    """ Calculate the weights for scaled dot-product attention"""
    # Replace None with your code.
    dot = np.dot(queries, keys.T) / np.sqrt(keys.shape[1])
    weights = softmax(dot, axis=1)

    # Compare with a tolerance: floating-point row sums are not always exactly 1
    assert np.isclose(weights.sum(axis=1)[0], 1), "Each row in weights must sum to 1"

    # Replace pass with your code.
    return weights
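In this code, np.sum(y, axis=axis) computes the sum of the array y along the specified axis. np.expand_dims() then inserts a dimension of length 1 at that axis, so the result has the same number of dimensions as y.
Concretely, suppose y is a 2D array and axis=1. np.sum(y, axis=1) sums each row and returns a 1D array of shape (y.shape[0],). np.expand_dims(np.sum(y, axis=1), axis=1) then expands that array along the second axis, producing a 2D array of shape (y.shape[0], 1) that holds the same row sums.
This operation is used when computing softmax to divide each element by the sum of its row (or column), so that each row (or column) sums to 1. This matters because the softmax output is usually interpreted as a probability distribution.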
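The shapes involved can be verified on a small made-up array. The sketch also shows two details that are not in the notebook: dividing by the unexpanded row sums fails to broadcast for this shape, and `np.sum(..., keepdims=True)` produces the same array as `np.expand_dims` applied to the sum.

```python
import numpy as np

y = np.exp(np.array([[1.0, 2.0, 3.0],
                     [0.0, 0.0, 0.0]]))

row_sums = np.sum(y, axis=1)                  # shape (2,)
expanded = np.expand_dims(row_sums, axis=1)   # shape (2, 1)
normalized = y / expanded                     # (2, 3) / (2, 1) broadcasts per row

print(row_sums.shape, expanded.shape, normalized.shape)

# Without the extra dimension, (2, 3) / (2,) cannot be broadcast.
try:
    y / row_sums
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# keepdims=True keeps the summed axis as length 1, matching expand_dims.
kept = np.sum(y, axis=1, keepdims=True)
print(np.array_equal(kept, expanded))
```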
# Tokenize example sentences in English and French, then get their embeddings
sentence_en = "The agreement on the European Economic Area was signed in August 1992 ."
tokenized_en = tokenize(sentence_en, en_words)
embedded_en = embed(tokenized_en, en_embeddings)

sentence_fr = "L accord sur la zone économique européenne a été signé en août 1992 ."
tokenized_fr = tokenize(sentence_fr, fr_words)
embedded_fr = embed(tokenized_fr, fr_embeddings)

# These weights indicate alignment between words in English and French
alignment = calculate_weights(embedded_fr, embedded_en)

# Visualize weights to check for alignment
fig, ax = plt.subplots(figsize=(7,7))
ax.imshow(alignment, cmap='gray')
ax.xaxis.tick_top()
ax.set_xticks(np.arange(alignment.shape[1]))
ax.set_xticklabels(sentence_en.split(" "), rotation=90, size=16)
ax.set_yticks(np.arange(alignment.shape[0]))
ax.set_yticklabels(sentence_fr.split(" "), size=16)
If you implemented the weights calculations correctly, the alignment matrix should look like this:
This is a demonstration of alignment where the model has learned which words in English correspond to words in French. For example, the words signed and signé have a large weight because they have the same meaning. Typically, these alignments are learned using linear layers in the model, but you’ve used pre-trained embeddings here.
Complete the implementation of scaled dot-product attention using your calculate_weights function (ignore the mask).
def attention_qkv(queries, keys, values):
    """ Calculate scaled dot-product attention from queries, keys, and values matrices """
    # Replace pass with your code.
    attention = np.dot(calculate_weights(queries, keys), values)
    return attention
attention_qkv_result = attention_qkv(embedded_fr, embedded_en, embedded_en)
print(f"The shape of the attention_qkv function is {attention_qkv_result.shape}")
print(f"Some elements of the attention_qkv function are \n{attention_qkv_result[0:2,:10]}")
Output
The shape of the attention_qkv function is (14, 300)
Some elements of the attention_qkv function are
[[-0.04039161 -0.00275749 0.00389873 0.04842744 -0.02472726 0.01435613
-0.00370253 -0.0619686 -0.00206159 0.01615228]
[-0.04083253 -0.00245985 0.00409068 0.04830341 -0.02479128 0.01447497
-0.00355203 -0.06196036 -0.00241327 0.01582606]]
Expected output
The shape of the attention_qkv function is (14, 300)
Some elements of the attention_qkv function are
[[-0.04039161 -0.00275749 0.00389873 0.04842744 -0.02472726 0.01435613
-0.00370253 -0.0619686 -0.00206159 0.01615228]
[-0.04083253 -0.00245985 0.00409068 0.04830341 -0.02479128 0.01447497
-0.00355203 -0.06196036 -0.00241327 0.01582606]]
def softmax(x, axis=0):
    """ Calculate softmax function for an array x

        axis=0 calculates softmax across rows which means each column sums to 1
        axis=1 calculates softmax across columns which means each row sums to 1
    """
    y = np.exp(x)
    return y / np.expand_dims(np.sum(y, axis=axis), axis)

def calculate_weights(queries, keys):
    """ Calculate the weights for scaled dot-product attention"""
    dot = np.matmul(queries, keys.T) / np.sqrt(keys.shape[1])
    weights = softmax(dot, axis=1)
    # Compare with a tolerance: exact float equality with 1 is fragile
    assert np.isclose(weights.sum(axis=1)[0], 1), "Each row in weights must sum to 1"
    return weights

def attention_qkv(queries, keys, values):
    """ Calculate scaled dot-product attention from queries, keys, and values matrices """
    weights = calculate_weights(queries, keys)
    return np.matmul(weights, values)
You will now learn about
how words are being represented in the neural
machine translation setting. You will also see what
the dataset looks like. When implementing
the systems I’ll show you that you need to
keep track of a few things. For example, which words
correspond to which vectors. With that said, let’s dive in. This is an example
of the type of input data that you will have for your
assignments this week. Over here you have the
sequence, I’m hungry, and on the right you have the corresponding
French equivalent. Further down, I watch the soccer game and the
corresponding French equivalent. You’re going to have a
great many of these inputs. You should know
that the state of the art models use
pretrained vectors. But otherwise, the first
thing you’ll do is to use a one-hot vector
to represent the words. Usually you’ll keep track of your mappings with
the word to index, and index to word dictionary. Given any input, you
transform it into indices and then vice versa when you make
the predictions. You’ll also normally use
an end of sequence token. You will pad your token vectors with zeros to match the length of the longest sequence.
Here’s an example. This is an English sentence and the tokenized version of
the English sentence. You can see that
it has an index of 4,546 for the word both. After the initial tokenization, just add the EOS token,
shown here as 1, and pad with zeros to match the length of
the longest sequence. Now let’s go to the
French translation of that sequence along with the tokenized version of
the French translation. Notice that 1 is the
end-of-sentence token here too. It’s also followed by a
series of padding zeros. Now that you know
how to represent words, how to initialize your model, and how to structure
your dataset, you can go ahead and start
training your model. In the next video, I’ll show
you how you can do this.
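The bookkeeping described in this video (word-to-index and index-to-word dictionaries, an end-of-sequence token, and zero padding) can be sketched as follows. This is a minimal illustration, not the assignment’s tokenizer: the helper names and the whitespace tokenization are assumptions, while the ids 1 for EOS and 0 for padding follow the lecture.

```python
PAD_ID, EOS_ID = 0, 1  # 0 pads, 1 marks end of sequence (as in the lecture)

def build_vocab(sentences):
    """Map each lowercased word to an index, starting after the reserved ids."""
    word2idx = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word2idx:
                word2idx[word] = len(word2idx) + 2
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

def encode(sentence, word2idx, max_len):
    """Convert words to indices, append EOS, then pad with zeros to max_len."""
    ids = [word2idx[w] for w in sentence.lower().split()] + [EOS_ID]
    return ids + [PAD_ID] * (max_len - len(ids))

sentences = ["I am hungry", "I watched the soccer game"]
word2idx, idx2word = build_vocab(sentences)
max_len = max(len(s.split()) for s in sentences) + 1  # +1 for the EOS token
batch = [encode(s, word2idx, max_len) for s in sentences]

for row in batch:
    print(row)
```

Every row of `batch` has the same length, so the batch can be stacked into a single array; `idx2word` is what you would use to turn predicted indices back into words.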
Hello. You’ll now learn how to train your neural machine
translation system. You will learn about
certain concepts like teacher forcing, and you’ll see some of its
advantages. Let’s dive in. In this section, you’ll see how to train your neural
machine translation (NMT for short) model
with attention. I’ll introduce you to the
concept of teacher forcing. As you learned before, seq2seq models generate
translations by feeding the output of the decoder
back in as the next inputs. This way there is no set
length on the output sequence. When training the
model, intuitively, you would compare the
decoder output sequence with the target sequence
to calculate the loss. That is, you would calculate the cross entropy
loss for each step, then sum the steps together
for the total loss. However, in practice, this
doesn’t work too well. The problem is that in the
early stages of training, the model is naive. It’ll make wrong predictions
early in the sequence. This problem compounds as the model keeps making
wrong predictions and the translated sequence gets further and further from
the target sequence.
The problem is illustrated
in this slide, where the final
output word, duveteux, has a similar meaning to the
word fluffy in English, which has a very different
meaning from the word tea. To avoid this problem, you can use the
ground truth words as decoder inputs instead
of the decoder outputs. Even if the model makes
a wrong prediction, it pretends as if it’s made the correct one and
this can continue. This method makes training much faster and has a special
name, teacher forcing. There are some
variations on this technique. For example, you can slowly start using decoder
outputs over time, so that late in training, you are no longer feeding
in the target words. This is known as
curriculum learning. You are now familiar
with teacher forcing, and you can add this
technique to your toolbox, to help you with
training your model, and to help you get
a better accuracy.
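Teacher forcing is a technique for training sequence models such as recurrent neural networks (RNNs). During training, the input at each step is the ground-truth output of the previous step rather than the model’s own previous prediction. This speeds up training and improves convergence, especially in the early stages.
With teacher forcing, the model learns the mapping between input and output sequences faster because it sees the correct output directly. The drawback appears at inference time, when teacher forcing is not available: because the model always relied on the ground-truth previous output during training, it may perform poorly once it has to consume its own predictions. This is known as the exposure bias problem.
To address this, you can feed the model’s own previous output as the current input with some probability during training, which better simulates inference. This is called scheduled sampling. Gradually increasing the probability of using the model’s own outputs narrows the gap between training and inference and improves inference-time performance.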
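The difference between teacher forcing and scheduled sampling comes down to what the decoder is fed at each step. The sketch below illustrates only that input selection. `SOS_ID`, `shift_right`, and `decoder_inputs` are hypothetical names, and `predicted_ids` is assumed to be precomputed for simplicity (a real decoder would produce each prediction step by step).

```python
import random

SOS_ID = 0  # hypothetical start-of-sentence id

def shift_right(target_ids, sos_id=SOS_ID):
    """Teacher forcing: the decoder input at step t is the true token from step t-1."""
    return [sos_id] + target_ids[:-1]

def decoder_inputs(target_ids, predicted_ids, p_model=0.0, rng=random):
    """Scheduled sampling: with probability p_model, feed the model's previous
    prediction instead of the ground-truth token."""
    inputs = [SOS_ID]
    for t in range(1, len(target_ids)):
        use_model = rng.random() < p_model
        inputs.append(predicted_ids[t - 1] if use_model else target_ids[t - 1])
    return inputs

target = [5, 6, 7, 1]      # made-up token ids ending in EOS = 1
predicted = [5, 9, 9, 1]   # made-up (partly wrong) model predictions

print(shift_right(target))                            # pure teacher forcing
print(decoder_inputs(target, predicted, p_model=1.0)) # always the model's own outputs
```

With `p_model=0.0` the function reduces to plain teacher forcing; raising `p_model` over the course of training moves the decoder toward the conditions it faces at inference time.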
Welcome. I will now
show you how to train a neural machine
translation system from scratch. I’ll go through every step
slowly so you can understand what is going on behind the
scenes. Let’s get started. In this video, I’ll show you how everything you have
seen this week fits together into the
model architecture you will implement in
this week’s assignments. First, I’ll give you
a general overview before I go into the
more intricate details. You will implement
a model similar to the one you have seen
in previous lectures. You will have an encoder that
gets the input sequence, a decoder which is supposed
to do the translation, and an Attention Mechanism
which would help the decoder focus on the important parts of
the input sequence. Recall that the decoder
is supposed to pass hidden states to the
Attention Mechanism to get context vectors. Passing the hidden
states from the decoder to the Attention Mechanism may
not be easy to implement. Instead, you will be
using two decoders, a pre-attention decoder
to provide hidden states, and a post-attention decoder which will provide
the translation.
A general overview of the modified model
looks as follows. You will have the encoder
and a pre-attention decoder that get the input
and target sequences. Then for the
pre-attention decoder, the target sequence
is shifted right, which is how you’ll be
implementing the teacher forcing. From the encoder and
pre-attention decoder, you will retrieve
the hidden states at each step and use them as inputs for the
Attention Mechanism. You will use the
hidden states from the encoder as the
keys and values, while those from the
decoder are the queries. As you have seen in
previous lectures, the Attention Mechanism will use these values to compute
the context vectors. Finally, the post-attention
decoder will use the context vectors as inputs to provide the
predicted sequence.
Now, let’s take a closer look at each piece of the model. The initial step is
to make two copies of the input tokens and
the target tokens because you will need them in different places of the model. One copy of the input tokens
is fed into the encoder, which is used to transform them into the key
and value vectors, while a copy of
the target tokens goes into the
pre-attention decoder. Note that the
computations done in the encoder and
pre-attention decoder could be done in parallel, since they don’t
depend on each other. Within the
pre-attention decoder, you shift each
sequence to the right and add a start of
sentence token. In the encoder and
pre-attention decoder, the inputs and
targets go through an embedding layer
before going to LSTMs. After getting the query,
key, and value vectors, you have to prepare them
for the attention layer. You’ll use a function
to help you get a padding mask to help the attention layer determine
the padding tokens. This step is where you will use the copy of
the input tokens. Now, everything is
ready for attention. You pass the queries,
keys, values, and the mask to the
attention layer that outputs the context
vector and the mask. Before going through the
decoder, you drop the mask. You then pass the
context vectors through the decoder composed of an LSTM, a dense layer, and a LogSoftmax. In the end, your model returns log probabilities and the copy of the target tokens that
you made at the beginning. There you have it,
the model you’ll be building and the intuition
behind all the steps. Take a break and just
let all that sink in. You now have an overview
of how NMT is implemented. If you did not
understand everything, do not worry about it. We will go into more detail in this week’s programming
assignments. In the next video, I will talk about how to
evaluate your system.
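To make the role of the padding mask concrete, here is a minimal NumPy sketch of scaled dot-product attention in which padded encoder positions receive zero weight. It stands in for the attention layer only: the random arrays replace the LSTM hidden states, and `masked_attention` and the token ids are hypothetical, not the assignment’s API.

```python
import numpy as np

def masked_attention(queries, keys, values, key_token_ids, pad_id=0):
    """Scaled dot-product attention that ignores padded key positions."""
    scores = queries @ keys.T / np.sqrt(keys.shape[1])   # (n_queries, n_keys)
    is_pad = (key_token_ids == pad_id)                   # (n_keys,)
    scores = np.where(is_pad[None, :], -1e9, scores)     # block padded keys
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
encoder_ids = np.array([12, 7, 1, 0, 0])   # made-up ids: two words, EOS, two pads
encoder_states = rng.normal(size=(5, 4))   # stand-in for encoder hidden states
decoder_states = rng.normal(size=(3, 4))   # stand-in for pre-attention decoder states

# Queries come from the decoder; keys and values come from the encoder.
context, weights = masked_attention(decoder_states, encoder_states,
                                    encoder_states, encoder_ids)
print(context.shape)
print(weights.round(3))
```

The last two columns of `weights` are zero, so the context vectors are built only from real input tokens, which is the purpose of passing the mask alongside the queries, keys, and values.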
After building and
training your model, it is essential to assess
how well it performs. For machine translation, you have different metrics that were engineered
just for this task. In this lecture, I will
show you the BLEU score and some of its issues
for evaluating machine translation models. The BLEU score, short for Bilingual
Evaluation Understudy, is an algorithm designed
to evaluate some of the most challenging problems in NLP, including
machine translation. It evaluates the quality of
machine-translated text by comparing a candidate
translation to one or more references, which are often
human translations. The closer the BLEU
score is to one, the better your model is, the closer to zero,
the worse it is.
With that said, what is the BLEU score and why is
this an important metric? To get the BLEU score, you have to compute the
precision of the candidate by comparing its n-grams
with reference translations. To demonstrate, I’ll use
unigrams as an example. Let’s say that you have a
candidate sequence that you got from your model
composed of I, I, am, I. You also have one
reference translation which contains the words, Younes said, I am hungry. A second reference translation
includes the words, He said, I am hungry. To get the BLEU score, you count how many words from the candidate appear in any of the references and
divide that count by the total number of words in
the candidate translation. You can view it as
a precision metric.
You have to go
through all the words in the candidate translation. First, you have the word I, which appears in both
reference translations. You add one to your count. Then you have again the word I, which you already know
appears on both references, and you add one to your count. After that, you have the word am which also appears
in both references. You add that word to your count. At the end, you have
the word I again, which appears on
both references. You can add one to your count. Finally, you can get the
BLEU score by dividing your count by the number of words in the candidate
translation, which in this case
is equal to 4. The whole process gives you
a BLEU score equal to 1. Weird? This translation that is far from being equal to the references got
a perfect score. With this vanilla BLEU score, a model that always outputs
common words will do great.
Let’s try a modified
version that will give you a better estimate of
your model’s performance. For the modified version
of the BLEU score, after you find a word from the candidates in one or
more of the references, you stop considering
that word from the reference for the following
words in the candidates. In other words, you
exhaust the words in the references after you match them with a word
in the candidates. Let’s start from the beginning of the candidate translation. You have the word I that
appears in both references. You add one to your count and exhaust the word I
from both references. Then you have the word I again, but you don’t have that word
in the references because it was taken out for the
previous word in the candidate. You don’t add anything
to your count. Then you have the word am, which appears in
both references. You add one to your count and eliminate the word am
from both references. After that, you have
the word I again, but there are no occurrences left
in the references. You don’t add anything
to your count. Finally, you divide your count
by the number of words in the candidate translation
to get BLEU score of 2/4 or 0.5. As you can note, this version of the BLEU score makes more sense than the vanilla implementation.
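The two counting schemes from the lecture can be compared directly on its example. This is a minimal sketch for unigrams only; `unigram_precision` is a hypothetical helper, and the clipping uses the maximum count of each word in any single reference, which is the standard BLEU formulation and gives the same result as the "exhaust the word" description here.

```python
from collections import Counter

def unigram_precision(candidate, references, clip=True):
    """Unigram precision of a candidate against one or more references."""
    cand_counts = Counter(candidate)
    # Highest count of each word in any single reference.
    max_ref = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)

    matched = 0
    for word, count in cand_counts.items():
        if clip:
            matched += min(count, max_ref[word])  # modified version
        elif max_ref[word] > 0:
            matched += count                      # vanilla version
    return matched / len(candidate)

candidate = "I I am I".split()
references = ["Younes said I am hungry".split(),
              "He said I am hungry".split()]

vanilla = unigram_precision(candidate, references, clip=False)
modified = unigram_precision(candidate, references, clip=True)
print(vanilla, modified)  # 1.0 0.5
```

The vanilla score rewards repeating a common word, while the clipped score credits each reference word at most as often as it actually occurs.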
However, like anything in life, using the BLEU score as an evaluation metric
has some caveats. For one, it doesn’t consider the semantic
meaning of the words. It also doesn’t consider the
structure of the sentence. Imagine getting
this translation. Ate I was hungry because. If the reference sentence is
I ate because I was hungry, this would get a
perfect BLEU score. BLEU score is the most widely
adopted evaluation metric for machine translation. But you should be aware of these drawbacks before using it.
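The word-order caveat can be checked numerically. In this sketch (hypothetical helper names, lowercased tokens, a single reference), the scrambled candidate gets a perfect clipped unigram precision, while bigram precision already drops, which is why the full BLEU score combines several n-gram orders.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

candidate = "ate i was hungry because".split()
reference = "i ate because i was hungry".split()

unigram = clipped_precision(candidate, reference, 1)
bigram = clipped_precision(candidate, reference, 2)
print(unigram, bigram)  # 1.0 0.5
```

Only the bigrams (i, was) and (was, hungry) survive the scrambling, so longer n-grams capture part of the sentence structure that unigrams miss, though they still say nothing about meaning.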
You now know how to evaluate your machine translation
model using the BLEU score. I also showed you that this
metric has some issues because it doesn’t care about semantics and
sentence structure. In the following video, you’ll see another metric
for machine translation. That metric could be used to better estimate your
model performance.
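BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) are both metrics for evaluating the quality of generated text in natural language processing tasks, but they differ in how they are applied and computed.
Purpose:
Computation:
Evaluation metrics:
In short, BLEU and ROUGE are both important metrics for evaluating generated text, but they suit different tasks and differ in how they are computed and in what they measure.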
Calculating the Bilingual Evaluation Understudy (BLEU) score: Ungraded Lab
In this ungraded lab, you will implement a popular metric for evaluating the quality of machine-translated text: the BLEU score proposed by Kishore Papineni, et al. in their 2002 paper “BLEU: a Method for Automatic Evaluation of Machine Translation”. The BLEU score works by comparing a “candidate” text to one or more “reference” texts. The higher the score, the better the result. In the following sections you will calculate this value using your own implementation as well as using functions from a library.
You will start by importing the Python libraries. First, you will implement your own version of the BLEU Score using NumPy. To verify that your implementation is correct, you will compare the results with those generated by the SacreBLEU library. This package provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. It also knows all the standard test sets and handles downloading, processing, and tokenization.
import numpy as np # import numpy to make numerical computations.
import nltk # import NLTK to handle simple NL tasks like tokenization.
nltk.download("punkt")
from nltk.util import ngrams
from collections import Counter # import a counter.
!pip3 install 'sacrebleu' # install the sacrebleu package.
import sacrebleu # import sacrebleu in order compute the BLEU score.
import matplotlib.pyplot as plt # import pyplot in order to make some illustrations.
You have seen how to calculate the BLEU score in this week’s lectures. Formally, you can express the BLEU score as:
$$BLEU = BP\times\Bigl(\prod_{i=1}^{n}precision_i\Bigr)^{(1/n)}.\tag{1}$$
The BLEU score depends on the $BP$, which stands for Brevity Penalty, and the weighted geometric mean precision for different lengths of n-grams, both of which are described below. The product runs from $i=1$ to $i=n$ to account for 1-grams to n-grams and the exponent of $1/n$ is there to calculate the geometrical average. In this notebook, you will use $n=4$.
The Brevity Penalty is defined as an exponential decay:
$$BP = min\Bigl(1, e^{(1-({len(ref)}/{len(cand)}))}\Bigr),\tag{2}$$
where ${len(ref)}$ and ${len(cand)}$ refer to the length or count of words in the reference and candidate translations. The brevity penalty helps to handle very short translations.
The precision is defined as:
$$precision_i = \frac {\sum_{s_i \in{cand}}min\Bigl(C(s_i, cand), C(s_i, ref)\Bigr)}{\sum_{s_i \in{cand}} C(s_i, cand)}.\tag{3}$$
The sum goes over all the i-grams $s_i$ in the candidate sentence $cand$. $C(s_i, cand)$ and $C(s_i, ref)$ are the counts of the i-grams in the candidate and reference sentences respectively. So the sum counts all the n-grams in the candidate sentence that also appear in the reference sentence, but only counts them as many times as they appear in the reference sentence and not more. This is then divided by the total number of i-grams in the candidate sentence.
The brevity penalty penalizes generated translations that are shorter than the reference sentence. It compensates for the fact that the BLEU score has no recall term.
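As a numeric walk-through of equations (1) and (2), the sketch below combines a set of made-up n-gram precisions with the brevity penalty for two candidate lengths. The precisions, lengths, and helper names are assumptions chosen only for illustration.

```python
import math

def bp_from_lengths(len_ref, len_cand):
    """Brevity penalty, equation (2)."""
    return min(1.0, math.exp(1 - len_ref / len_cand))

def bleu_from_precisions(precisions, len_ref, len_cand):
    """BLEU as in equation (1): BP times the geometric mean of the precisions."""
    n = len(precisions)
    geometric_mean = math.prod(precisions) ** (1 / n)
    return bp_from_lengths(len_ref, len_cand) * geometric_mean

precisions = [0.8, 0.7, 0.6, 0.5]  # made-up 1-gram to 4-gram precisions

# Candidate shorter than the reference: the penalty is below 1.
score_short = bleu_from_precisions(precisions, len_ref=14, len_cand=12)
# Candidate longer than the reference: the penalty saturates at 1.
score_long = bleu_from_precisions(precisions, len_ref=14, len_cand=16)

print(round(bp_from_lengths(14, 12), 3), round(score_short, 3))
print(bp_from_lengths(14, 16), round(score_long, 3))
```

The same precisions yield a lower score for the shorter candidate, which is how the brevity penalty stands in for the missing recall term.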
reference_length = 1
candidate_length = np.linspace(1.5, 0.5, 100)
length_ratio = reference_length / candidate_length
BP = np.minimum(1, np.exp(1 - length_ratio))
# Plot the data
fig, ax = plt.subplots(1)
lines = ax.plot(length_ratio, BP)
ax.set(
xlabel="Ratio of the length of the reference to the candidate text",
ylabel="Brevity Penalty",
)
plt.show()
Output
The n-gram precision counts how many n-grams (in your case unigrams, bigrams, trigrams, and four-grams for $i = 1, \ldots, 4$) match their n-gram counterpart in the reference translations. This term acts as a precision metric. Unigrams account for adequacy while longer n-grams account for fluency of the translation. To avoid overcounting, the n-gram counts are clipped to the maximal n-gram count occurring in the reference ($m_{n}^{ref}$). Typically precision shows exponential decay with the degree of the n-gram.
# Mocked dataset showing the precision for different n-grams
data = {"1-gram": 0.8, "2-gram": 0.7, "3-gram": 0.6, "4-gram": 0.5}
# Plot the datapoints defined above
fig, ax = plt.subplots(1)
bars = ax.bar(*zip(*data.items()))
ax.set(ylabel="N-gram precision")
plt.show()
Output
When the n-gram precision is normalized by the brevity penalty (BP), then the exponential decay of n-grams is almost fully compensated. The BLEU score corresponds to a geometric average of this modified n-gram precision.
# Mocked dataset showing the precision multiplied by the BP for different n-grams
data = {"1-gram": 0.8, "2-gram": 0.77, "3-gram": 0.74, "4-gram": 0.71}
# Plot the datapoints defined above
fig, ax = plt.subplots(1)
bars = ax.bar(*zip(*data.items()))
ax.set(ylabel="Modified N-gram precision")
plt.show()
Output
In this example you will have a reference sentence and 2 candidate sentences. You will tokenize all sentences using the NLTK package. Then you will compare the two candidates to the reference using BLEU score.
First you define and tokenize the sentences.
reference = "The NASA Opportunity rover is battling a massive dust storm on planet Mars."
candidate_1 = "The Opportunity rover is combating a big sandstorm on planet Mars."
candidate_2 = "A NASA rover is fighting a massive storm on planet Mars."
tokenized_ref = nltk.word_tokenize(reference.lower())
tokenized_cand_1 = nltk.word_tokenize(candidate_1.lower())
tokenized_cand_2 = nltk.word_tokenize(candidate_2.lower())
print(f"{reference} -> {tokenized_ref}")
print("\n")
print(f"{candidate_1} -> {tokenized_cand_1}")
print("\n")
print(f"{candidate_2} -> {tokenized_cand_2}")
Output
The NASA Opportunity rover is battling a massive dust storm on planet Mars. -> ['the', 'nasa', 'opportunity', 'rover', 'is', 'battling', 'a', 'massive', 'dust', 'storm', 'on', 'planet', 'mars', '.']
The Opportunity rover is combating a big sandstorm on planet Mars. -> ['the', 'opportunity', 'rover', 'is', 'combating', 'a', 'big', 'sandstorm', 'on', 'planet', 'mars', '.']
A NASA rover is fighting a massive storm on planet Mars. -> ['a', 'nasa', 'rover', 'is', 'fighting', 'a', 'massive', 'storm', 'on', 'planet', 'mars', '.']
Computing the Brevity Penalty
You will start by defining the function for brevity penalty according to the equation (2) in section 2.1.
def brevity_penalty(candidate, reference):
    """
    Calculates the brevity penalty given the candidate and reference sentences.
    """
    reference_length = len(reference)
    candidate_length = len(candidate)

    if reference_length < candidate_length:
        BP = 1
    else:
        penalty = 1 - (reference_length / candidate_length)
        BP = np.exp(penalty)

    return BP
Computing the clipped Precision
Next, you need to define a function to calculate the geometrically averaged clipped precision. This function calculates how many of the n-grams in the candidate sentence actually appear in the reference sentence. The clipping takes care of overcounting. For example if a certain n-gram appears five times in the candidate sentence, but only twice in the reference, the value is clipped to two.
def average_clipped_precision(candidate, reference):
    """
    Calculates the precision given the candidate and reference sentences.
    """
    clipped_precision_score = []

    # Loop through values 1, 2, 3, 4. This is the length of n-grams
    for n_gram_length in range(1, 5):
        reference_n_gram_counts = Counter(ngrams(reference, n_gram_length))
        candidate_n_gram_counts = Counter(ngrams(candidate, n_gram_length))

        total_candidate_ngrams = sum(candidate_n_gram_counts.values())

        for ngram in candidate_n_gram_counts:
            # check if it is in the reference n-gram
            if ngram in reference_n_gram_counts:
                # if the count of the candidate n-gram is bigger than the corresponding
                # count in the reference n-gram, then set the count of the candidate n-gram
                # to be equal to the reference n-gram
                if candidate_n_gram_counts[ngram] > reference_n_gram_counts[ngram]:
                    candidate_n_gram_counts[ngram] = reference_n_gram_counts[ngram]
            else:
                candidate_n_gram_counts[ngram] = 0  # else set the candidate n-gram equal to zero

        clipped_candidate_ngrams = sum(candidate_n_gram_counts.values())

        clipped_precision_score.append(clipped_candidate_ngrams / total_candidate_ngrams)

    # Calculate the geometric average: take the mean of elementwise log, then exponentiate
    # This is equivalent to taking the n-th root of the product as shown in equation (1) above
    s = np.exp(np.mean(np.log(clipped_precision_score)))

    return s
Explanation of reference_n_gram_counts = Counter(ngrams(reference, n_gram_length))
This code uses NLTK (Natural Language Toolkit) and Python’s collections module to count the n-grams in the reference text. Line by line:
from nltk.util import ngrams: imports the ngrams function from NLTK, which generates the sequence of n-grams for a text.
from collections import Counter: imports the Counter class from Python’s collections module, which counts occurrences of hashable objects.
reference_n_gram_counts = Counter(ngrams(reference, n_gram_length)): counts the n-grams in the reference text and stores the result in reference_n_gram_counts. It uses ngrams to generate every n-gram in reference and Counter to count them, so reference_n_gram_counts is a dictionary-like object that maps each n-gram in the reference to its number of occurrences.
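To see what such a counter looks like, here is a small self-contained sketch. To avoid depending on NLTK, it uses a hypothetical pure-Python helper, `simple_ngrams`, as a stand-in for `nltk.util.ngrams` on a token list; the toy sentence is made up.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Yield consecutive n-grams as tuples (stand-in for nltk.util.ngrams)."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = "the cat saw the cat sleep".split()

bigram_counts = Counter(simple_ngrams(tokens, 2))
for gram, count in bigram_counts.items():
    print(gram, count)

# Counter returns 0 for n-grams it has never seen, without raising KeyError.
print(bigram_counts[("the", "dog")])
```

Each key is a tuple of tokens, so it is hashable and can be looked up directly, and the zero default for unseen n-grams is convenient when clipping candidate counts against reference counts.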
Computing the BLEU score
Finally, you can compute the BLEU score using the above two functions.
def bleu_score(candidate, reference):
    BP = brevity_penalty(candidate, reference)
    geometric_average_precision = average_clipped_precision(candidate, reference)
    return BP * geometric_average_precision
Now you can test the functions with your Example Reference and Candidates Sentences.
result_candidate_1 = round(bleu_score(tokenized_cand_1, tokenized_ref) * 100, 1)
print(f"BLEU score of reference versus candidate 1: {result_candidate_1}")
result_candidate_2 = round(bleu_score(tokenized_cand_2, tokenized_ref) * 100, 1)
print(f"BLEU score of reference versus candidate 2: {result_candidate_2}")
Output
BLEU score of reference versus candidate 1: 27.6
BLEU score of reference versus candidate 2: 35.3
Below you will do the same calculation, but using the sacrebleu library. Compare them with your implementation above.
result_candidate_1 = round(sacrebleu.sentence_bleu(candidate_1, [reference]).score, 1)
print(f"BLEU score of reference versus candidate 1: {result_candidate_1}")
result_candidate_2 = round(sacrebleu.sentence_bleu(candidate_2, [reference]).score, 1)
print(f"BLEU score of reference versus candidate 2: {result_candidate_2}")
Output
BLEU score of reference versus candidate 1: 27.6
BLEU score of reference versus candidate 2: 35.3
In this section, you will use a simple pipeline for evaluating machine translated text. You will use English to German translations generated by Google Translate. There are three files you will need:
With that, you can now compare the reference and candidate translation to get the BLEU Score.
# Loading the raw data
wmt19_src = open("data/wmt19_src.txt", "r")
wmt19_src_1 = wmt19_src.read()
wmt19_src.close()

wmt19_ref = open("data/wmt19_ref.txt", "r")
wmt19_ref_1 = wmt19_ref.read()
wmt19_ref.close()

wmt19_can = open("data/wmt19_can.txt", "r")
wmt19_can_1 = wmt19_can.read()
wmt19_can.close()

tokenized_corpus_src = nltk.word_tokenize(wmt19_src_1.lower())
tokenized_corpus_ref = nltk.word_tokenize(wmt19_ref_1.lower())
tokenized_corpus_cand = nltk.word_tokenize(wmt19_can_1.lower())
Now that you have your data loaded, you can inspect the first sentence of each dataset.
print("English source text:\n")
print(f"{wmt19_src_1[0:170]} -> {tokenized_corpus_src[0:30]}\n\n")
print("German reference translation:\n")
print(f"{wmt19_ref_1[0:219]} -> {tokenized_corpus_ref[0:35]}\n\n")
print("German machine translation:\n")
print(f"{wmt19_can_1[0:199]} -> {tokenized_corpus_cand[0:29]}")
Output
English source text:
Welsh AMs worried about 'looking like muppets' There is consternation among some AMs at a suggestion their title should change to MWPs (Member of the Welsh Parliament). -> ['\ufeffwelsh', 'ams', 'worried', 'about', "'looking", 'like', "muppets'", 'there', 'is', 'consternation', 'among', 'some', 'ams', 'at', 'a', 'suggestion', 'their', 'title', 'should', 'change', 'to', 'mwps', '(', 'member', 'of', 'the', 'welsh', 'parliament', ')', '.']
German reference translation:
Walisische Ageordnete sorgen sich "wie Dödel auszusehen" Es herrscht Bestürzung unter einigen Mitgliedern der Versammlung über einen Vorschlag, der ihren Titel zu MWPs (Mitglied der walisischen Parlament) ändern soll. -> ['\ufeffwalisische', 'ageordnete', 'sorgen', 'sich', '``', 'wie', 'dödel', 'auszusehen', "''", 'es', 'herrscht', 'bestürzung', 'unter', 'einigen', 'mitgliedern', 'der', 'versammlung', 'über', 'einen', 'vorschlag', ',', 'der', 'ihren', 'titel', 'zu', 'mwps', '(', 'mitglied', 'der', 'walisischen', 'parlament', ')', 'ändern', 'soll', '.']
German machine translation:
Walisische AMs machten sich Sorgen, dass sie wie Muppets aussehen könnten Einige AMs sind bestürzt über den Vorschlag, ihren Titel in MWPs (Mitglied des walisischen Parlaments) zu ändern. Es ist aufg -> ['walisische', 'ams', 'machten', 'sich', 'sorgen', ',', 'dass', 'sie', 'wie', 'muppets', 'aussehen', 'könnten', 'einige', 'ams', 'sind', 'bestürzt', 'über', 'den', 'vorschlag', ',', 'ihren', 'titel', 'in', 'mwps', '(', 'mitglied', 'des', 'walisischen', 'parlaments']
And lastly, you can calculate the BLEU score of the translation.
result = round(sacrebleu.sentence_bleu(wmt19_can_1, [wmt19_ref_1]).score, 1)
print(f"BLEU score of the reference versus candidate translation: {result}")
Output
BLEU score of the reference versus candidate translation: 43.2
The table below (taken from here) shows the typical values of BLEU score. You can see that the translation above is of high quality according to this table and in comparison to the given reference sentence. (If you see “Hard to get the gist”, please open your workspace, delete wmt19_can.txt and get the latest version via the Lab Help button.)
Score | Interpretation |
---|---|
< 10 | Almost useless |
10 - 19 | Hard to get the gist |
20 - 29 | The gist is clear, but has significant grammatical errors |
30 - 40 | Understandable to good translations |
40 - 50 | High quality translations |
50 - 60 | Very high quality, adequate, and fluent translations |
> 60 | Quality often better than human |
Previously, I introduced you to the BLEU score evaluation metric and its
modified version. I used it to assess the performance of machine
translation models. I also showed you some
drawbacks that arise because that metric ignores semantics
and sentence structure. In this video, I’ll talk
about the ROUGE score, another performance
metric that tends to estimate the quality of
machine translation systems. I’ll introduce you now to a family of metrics
called ROUGE. It stands for
Recall-Oriented Understudy for Gisting Evaluation, which is a mouthful. But it lets you know,
right off the bat, that it’s more
recall-oriented by default. That means that ROUGE cares
about how much of the human created references appear in
the candidate translation. In contrast, BLEU is
precision oriented. Since you have to
determine how many words from the candidates
appear on the references. ROUGE was initially
developed to evaluate the quality of the
machine summarized texts, but is also helpful in assessing the quality
of machine translation. It works by comparing the machine candidates against reference translations
provided by humans. There are many versions
of the ROUGE score, but the focus here is on the one called
ROUGE-N. For the ROUGE-N score, you have to get the counts of the n-gram overlaps between the candidates and the
reference translations, which is somewhat
similar to what you have to do for
the BLEU score.
To see the difference
between the two metrics, I’ll show you an example of how ROUGE-N works with unigrams. To get the basic version of the ROUGE-N score, based only on recall, you must count word matches between the
reference and the candidates, and divide by the number
of words in the reference. If you had multiple references, you would need to get a ROUGE-N score using each
reference and get the maximum. Now, let’s go through
the example that you already solved
for the BLEU score. Your candidate has the
words I two times, the word am, and
the word I again, for a total of four words. You also have a
reference translation, Younes said, “I am hungry,” and another slightly
different reference, He said, “I am hungry.” Each reference has
five words in total. You have to count
matches between the references and the
candidate translations, similar to what you did
for the BLEU score. Let’s start with the
first reference. The word Younes, doesn’t match any of the uni-grams
in the candidates, so you don’t add
anything to the counts. The word said doesn’t match any word and the
candidates either. The word I, has
multiple matches, but you need the first one. For this match, you add
only one to your counts. The word M has a match in the candidates so your
increment your counts. Now, the final word of the
first reference, hungry, doesn’t match any of the
words from the candidates. You don’t add anything
to your counts. If you repeat this process
for the second reference, you get a counts equal to 2. Finally, you divide these
counts by the number of words in each reference
and get the maximum value, which for this example
is equal to 0.4.
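The recall computation walked through above can be sketched in a few lines of Python. This is only an illustration, not the course's implementation: the function names `ngrams` and `rouge_n_recall` are made up for this note, tokenization is a plain lowercase whitespace split, and the second reference is written as "I am" so that it splits into five words.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate, references, n=1):
    """Recall-only ROUGE-N: clipped n-gram matches / n-grams in the reference.

    With several references, the maximum score over the references is returned.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    best = 0.0
    for reference in references:
        ref_counts = Counter(ngrams(reference.lower().split(), n))
        total = sum(ref_counts.values())
        if total == 0:
            continue
        # Each reference n-gram is matched at most as often as it occurs
        # in the candidate (so the repeated "I" only counts once here).
        overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
        best = max(best, overlap / total)
    return best


candidate = "I I am I"
references = ["Younes said I am hungry", "He said I am hungry"]
score = rouge_n_recall(candidate, references)
print(score)  # 0.4
```

Both references share two unigrams with the candidate (I and am) out of five, which gives the 0.4 from the lecture.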
This basic version of the
ROUGE-N score is based on recall, while the BLEU score you saw in the previous
lectures is based on precision. But why not combine both to get a metric like an F1 score? Recall, pun intended, from your introductory
machine learning courses that the F1 score is given
by this formula: two times the product of
precision and recall, divided by the sum
of both metrics. You get the following formula if you replace precision
by the modified version of the BLEU score and recall
by the ROUGE-N score. For this example, you have
a BLEU score equal to 0.5, which you got in
previous lectures. You have a ROUGE-N score
equal to 0.4, which you calculated before. With these values, you will have an F1 score equal to 4
over 9, close to 0.44. You have now seen how to compute the modified BLEU and the simple ROUGE-N scores
to evaluate your model. You can view these metrics
like precision and recall. Therefore, you can use both to get an F1
score that could better assess the performance of your machine
translation model. In many applications, you
will see an F-score reported along with the
BLEU and ROUGE-N metrics. However, you must note that all the evaluation metrics
you have seen so far don’t consider sentence
structure and semantics. They only account for
matching n-grams between the candidates and the
reference translations.
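As a quick check of the arithmetic above, here is a minimal sketch of the F1 combination, with the modified BLEU score standing in for precision and ROUGE-N for recall. The helper name `f1_score` is an assumption made for this note.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the lecture example: BLEU = 0.5 (precision), ROUGE-N = 0.4 (recall)
f1 = f1_score(precision=0.5, recall=0.4)
print(round(f1, 2))  # 0.44
```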
Hello. You will now learn about two ways that will allow you to construct a sentence. The first approach is known as greedy decoding and
the second approach is known as random sampling. You’ll also see the pros and
the cons of each method. For example,
choosing the word with the highest probability
at every time step does not necessarily
generate the best sequence. With that said, let’s dive in and explore
these two methods. By now you have reached
the final parts of this week’s lectures.
That’s awesome. I’ll show you a few methods
for sampling and decoding, as well as a discussion of an important type of parameter in sampling called temperature. First, a quick reminder on how a seq2seq model
predicts words. The output of the
decoder is produced from a dense layer and a softmax
or log softmax operation. The output at each step then is the probability
distribution over all the words and symbols
in the target vocabulary. The final output of the
model depends on how you choose the words using these probability
distributions at each step. Greedy decoding is the
simplest way to decode the model’s predictions
as it selects the most probable
word at every step. However, this approach
has limitations. When you take the
highest-probability token for each prediction and concatenate all predicted tokens into the output sequence, as the greedy decoder does, you can end up with
a situation where the output, instead of “I am hungry,” is “I am, am, am” and so forth. You can see how this
could be a problem, but not in all cases. For shorter sequences,
it’s going to be fine. But if you have many
other words to consider, then knowing what’s
coming up next might help you better
predict the next sequence.
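A minimal sketch of greedy decoding follows. It assumes a hypothetical `next_token_probs` callable that returns a dictionary of next-token probabilities given the tokens so far. The toy table below is invented for illustration and conditions only on the last token, which a real decoder would not do.

```python
def greedy_decode(next_token_probs, max_len=10, sos="[SOS]", eos="[EOS]"):
    """Pick the most probable token at every step until EOS or max_len."""
    tokens = [sos]
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # argmax over the vocabulary
        tokens.append(best)
        if best == eos:
            break
    return tokens[1:]  # drop the start token


# Hypothetical next-token probabilities, keyed by the previous token only
TOY = {
    "[SOS]": {"I": 0.6, "am": 0.3, "hungry": 0.1, "[EOS]": 0.0},
    "I": {"I": 0.05, "am": 0.7, "hungry": 0.2, "[EOS]": 0.05},
    "am": {"I": 0.1, "am": 0.1, "hungry": 0.7, "[EOS]": 0.1},
    "hungry": {"I": 0.05, "am": 0.05, "hungry": 0.1, "[EOS]": 0.8},
}


def toy_model(tokens):
    return TOY[tokens[-1]]


decoded = greedy_decode(toy_model)
print(decoded)  # ['I', 'am', 'hungry', '[EOS]']
```

If the toy table gave am the highest probability after am, the same loop would keep emitting am until `max_len`, which is the repetition problem described above.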
Another option is known
as random sampling. What random sampling
does is take the probabilities for each word and sample accordingly
for the next output. One of the problems with this is that it could be a
little bit too random. A solution for this is to
assign more weight to the words with higher probabilities and
less weight to the others. You will see a method for doing this in just a few moments.
In sampling, temperature
is a parameter you can adjust to allow for more or less randomness
in your predictions. It’s measured on a scale of 0-1, indicating low to
high randomness. Let’s say you need your
model to make careful, safe decisions about
what to output. Then set your parameter lower and get the prediction
equivalent of a very confident but rather boring person seated next to
you at a dinner table. If you feel like taking
more of a gamble, set your temperature
a bit higher. This has the effect of making
your network more excited. You may get some pretty
fun predictions. On the other hand, there will probably be a lot more mistakes.
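One common way to realize temperature, sketched here as an assumption rather than as the course's code, is to divide the log probabilities by the temperature and renormalize before sampling. The names `apply_temperature` and `sample_token` are hypothetical. The sketch accepts any positive temperature: values close to 0 approach greedy decoding, and 1.0 leaves the distribution unchanged.

```python
import math
import random


def apply_temperature(log_probs, temperature):
    """Rescale log probabilities by 1/temperature and renormalize (softmax)."""
    scaled = [lp / temperature for lp in log_probs]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(vocab, log_probs, temperature=1.0, rng=random):
    """Draw one token according to the temperature-adjusted distribution."""
    probs = apply_temperature(log_probs, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


vocab = ["I", "am", "hungry", "[EOS]"]
log_probs = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]

cold = apply_temperature(log_probs, 0.2)  # sharper: mass moves to the top word
warm = apply_temperature(log_probs, 1.0)  # unchanged distribution
print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])
print(sample_token(vocab, log_probs, temperature=0.5, rng=random.Random(0)))
```

Lowering the temperature gives more weight to the already likely words, which is the "careful, safe" behaviour described above.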
You have seen in this video, methods for sampling
and decoding that will allow you to construct
sentences using your model. However, these methods don’t always produce the most
convincing outputs. Instead, you’ll often get a
very random set of words. In the next videos, you’ll see two methods
for sampling and decoding that tend to
produce better results.
Beam search is a technique that allows
you to find the best sequences over a fixed window size
known as the beam width. Since taking the output with the highest
probability at each time step is not ideal, I will show you how to
use beam search instead. So far the methods I’ve presented only
consider word probabilities one at a time. These ways of sampling might not result
in the highest probability sequences, and the sentences produced using
these methods might not make sense. The overall most probable
translation given an input sentence is not necessarily made up of the most
probable word at each step. For example, it’s possible that choosing
the most probable words at the start of the sequence could lead to
a worse translation compared to choosing other words. If you had infinite computational power,
you could calculate the probabilities of every possible
output sentence and choose the best one. In the real world we use beam search. This method attempts to find the most
likely output sentence by choosing some number of best sequences based on
conditional probabilities at each time step.
Now at each time step with beam search
you have to calculate the probability of potential sequences given the outputs
of the previous time step. To avoid computing the probability for every possible sequence, you have
a parameter B called the beam width. At each step, you only keep the B most
probable sequences and drop all others. You keep generating new words until all B most
probable sequences end with the EOS token. Beam search consumes
a lot of memory and could be computationally costly
depending on your choice of beam width. As an interesting side note,
greedy decoding is just a particular case of beam search where you set
the beam width B to be equal to 1.
To illustrate this method,
consider a small vocabulary consisting of the words I, am, hungry, and
an end of sentence token. And let’s say the beam
width B is equal to 2. As with the other methods, beam search
starts with the start of sentence token and gets the probabilities for
the first word in the sequence. Here I has a probability of 0.5,
am has a probability of 0.4, hungry has a probability of 0.1, and
the EOS token is at 0. Since the beam width is 2, you keep
the two highest probabilities, I and am. Then you calculate the conditional
probability of all other words given the two sequences that
you have kept so far, I and am. Given the word I, the model returns
am with a probability of 0.5 and hungry with 0.3. Given the word am,
the model returns the word I with a probability of 0.7 and
hungry with a probability of 0.2. These are the conditional
probabilities P(word 2 | word 1) given the two previous sequences. Next you calculate the probability of each sequence
by multiplying the conditional probability of word 2 given the word I
with the probability of the word I. For instance, the sequence
I I has a probability of 0.5 times 0.1, which is equal to 0.05. The sequence I am has a probability of
0.5 times 0.5, which is equal to 0.25. You have to do the same with
the sequences starting with the word am. Here the sequence am I has a probability
of 0.4 times 0.7, which is equal to 0.28. At this point there are several possible sequences,
but with the beam width B=2 you only keep the two sequences with
the highest probabilities. So you keep the sequences I am and
am I and drop all others. Then for the next step you use am as
the input word to get the conditional probabilities for the I am sequence and
I for the am I sequence. You repeat the process to calculate
the joint probabilities and again choose the best sequences. This process stops when the model
predicts an end of sentence token for all B most probable sequences. At the end, the sequence with the largest
probability is chosen as the output.
To get the conditional probabilities at
each step you have to use your model. For now let’s focus on the decoder part
of the model, ignoring the encoder and attention mechanism. At the start of the sequence
you will use your decoder to output a vector of probabilities for
each of the words in the vocabulary. This is equivalent to computing
the conditional probability of every word given
the start of sequence token. Then you select the most probable
B outputs from the model and discard all other possibilities. After that you use your model B times to
find the conditional probabilities of all the words in the vocabulary
given the sequences that you kept in the previous step. Then you compute the sequence probabilities
and keep the B most probable sequences. And you keep doing this until the B
most probable sequences reach the end of sequence token. Note that at each step after
the one where your sequence is only composed of the SOS token, you
have to run your model B times.
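The procedure can be sketched with the toy numbers from the worked example (I = 0.5, am = 0.4, hungry = 0.1 at the start, and so on). The table entries that the lecture does not state, the bigram-style conditioning on the last word only, and the function names are all assumptions made for illustration. Log probabilities are summed instead of multiplying raw probabilities, which is equivalent and numerically safer.

```python
import math

# Toy conditional probabilities keyed by the previous token (bigram assumption).
# Entries not mentioned in the lecture are filled in hypothetically so rows sum to 1.
BEAM_TOY = {
    "[SOS]": {"I": 0.5, "am": 0.4, "hungry": 0.1, "[EOS]": 0.0},
    "I": {"I": 0.1, "am": 0.5, "hungry": 0.3, "[EOS]": 0.1},
    "am": {"I": 0.7, "am": 0.05, "hungry": 0.2, "[EOS]": 0.05},
    "hungry": {"I": 0.05, "am": 0.05, "hungry": 0.0, "[EOS]": 0.9},
}


def toy_next_probs(sequence):
    return BEAM_TOY[sequence[-1]]


def beam_search_step(beams, next_probs, beam_width, eos="[EOS]"):
    """Expand every kept sequence by one token and keep the best `beam_width`."""
    candidates = []
    for sequence, log_prob in beams:
        if sequence[-1] == eos:
            candidates.append((sequence, log_prob))  # finished: carry over as is
            continue
        for token, prob in next_probs(sequence).items():
            if prob > 0:
                candidates.append((sequence + [token], log_prob + math.log(prob)))
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:beam_width]


def beam_search(next_probs, beam_width, max_len=10, sos="[SOS]", eos="[EOS]"):
    """Run beam search until all kept sequences end with EOS or max_len is hit."""
    beams = [([sos], 0.0)]
    for _ in range(max_len):
        beams = beam_search_step(beams, next_probs, beam_width, eos)
        if all(sequence[-1] == eos for sequence, _ in beams):
            break
    return beams


step1 = beam_search_step([(["[SOS]"], 0.0)], toy_next_probs, beam_width=2)
step2 = beam_search_step(step1, toy_next_probs, beam_width=2)
for sequence, log_prob in step2:
    print(sequence[1:], round(math.exp(log_prob), 2))
# ['am', 'I'] 0.28
# ['I', 'am'] 0.25
```

With `beam_width=1` the step function keeps only the single best continuation, which is the greedy decoding case mentioned earlier.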
The vanilla version of beam
search has some disadvantages. For instance, it penalizes the choice of longer
sequences because the probability of a sequence is computed as the product
of multiple conditional probabilities. However, you could normalize the
probability of each sequence by its number of words to avoid this problem. Beam search also requires you to store
the B most probable sequences and compute conditional probabilities
given all of those sequences. Therefore, this method could be
computationally expensive and consume lots of memory. You have a new tool that you can add to
your toolkit. Beam search has been widely used for quite a long time and
is still used a lot. In the next video, I will show you
another technique that you can use, known as MBR, or minimum Bayes risk.
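To see why longer sequences are penalized and how normalizing by length helps, here is a small sketch. Dividing the log probability by the number of tokens is one common form of length normalization and is an assumption here, since the lecture only says to normalize by the number of words. The per-token probabilities are made up.

```python
import math


def sequence_log_prob(token_probs):
    """Log of the product of the per-token conditional probabilities."""
    return sum(math.log(p) for p in token_probs)


def length_normalized_score(token_probs):
    """Average log probability per token."""
    return sequence_log_prob(token_probs) / len(token_probs)


short_seq = [0.5, 0.4]            # product = 0.2
long_seq = [0.6, 0.6, 0.6, 0.6]   # product = 0.1296, despite likelier tokens

print(sequence_log_prob(short_seq) > sequence_log_prob(long_seq))              # True
print(length_normalized_score(short_seq) > length_normalized_score(long_seq))  # False
```

The raw score prefers the short sequence only because it has fewer factors; the normalized score prefers the sequence whose individual tokens are more probable.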
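Beam search is a search algorithm used in generative models (such as machine translation and language models) to find the best output sequence. At each time step it keeps the k most probable candidates (k is called the beam width or beam size), continues to expand the search from these candidates at the next time step, and finally selects the best overall output sequence.
Beam search works as follows:
The advantage of beam search is that it reduces the search space to some extent, speeds up the search, and can find relatively good solutions. However, because of the beam width limit, beam search may end up at a local optimum and fail to find the global optimum. Therefore, in some cases beam search needs to be combined with other techniques (such as a length penalty or a repetition penalty) to improve the search results.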
This is the last technique I will show
you to evaluate your NMT systems. It is very simple and works surprisingly well when compared to
the other techniques. Let’s get started. Earlier, you encountered random
sampling to choose a probable token and the issues with that decoding method. But if you go a little further with that,
say by generating 30 samples and comparing them all against one another,
you’ll see quite a bit of
improvement in your decoding. You’ll have to compare multiple
candidate translations for the minimum Bayes risk decoding method,
MBR for short. Implementing MBR is
pretty straightforward. Begin by generating several random
samples, then compare each sample against every other sample using a similarity
score or a loss function. ROUGE, which
you may recall from a bit earlier, would be a good choice. Finally, choose the sample with
the highest average similarity or the lowest loss.
The translation that you get using this
method is the closest to all candidate translations. This process can be viewed
as some authors suggest, as finding a consensus between
all candidate translations.
If you decide to use ROUGE score
as a similarity metric for comparing every pair of
candidate translations, you would have MBR summarized
in the formula presented here. Your goal is to find the candidate
translation E that maximizes the average ROUGE score with
every other candidate E prime. So MBR is relatively easy to implement. You need to have multiple
candidate translations and select a way to compare them.
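The formula shown in the lecture is not reproduced in these notes. Based on the description above, it can be reconstructed roughly as follows, where $n$ is the number of candidates being compared against and the exact notation is an assumption:

$$E^{*} = \underset{E}{\arg\max}\; \frac{1}{n} \sum_{E'} \text{ROUGE}(E, E')$$

Here $E$ ranges over the candidate translations, $E'$ ranges over the other candidates, and $E^{*}$ is the candidate that is selected.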
But for the sake of clarity, let’s go through an implementation
process in more detail. Here are the steps for implementing MBR with ROUGE on a small
set of four candidate translations. First, calculate the ROUGE score
between the first candidate translation, C subscript 1, and the second, C subscript 2.
Do the same for the 1st and 3rd and
for the 1st and 4th, then compute the average, R subscript
1, using those three ROUGE scores. Then you repeat this process for the other three candidates in your set to get
average ROUGE scores for each one of them. Finally, you select the candidate with
the highest average ROUGE score, and that’s it for MBR. You’ll be implementing this method in the
assignment along with a greedy decoder.
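The four-candidate procedure above can be sketched as follows. The similarity used here is a simple unigram-overlap F1 standing in for ROUGE; this choice, the function names `unigram_f1` and `mbr_select`, and the candidate sentences are all assumptions for illustration. The assignment's own implementation may differ.

```python
from collections import Counter


def unigram_f1(a, b):
    """Unigram-overlap F1 between two sentences (a stand-in for ROUGE)."""
    a_counts = Counter(a.lower().split())
    b_counts = Counter(b.lower().split())
    overlap = sum((a_counts & b_counts).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a_counts.values())
    recall = overlap / sum(b_counts.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, similarity=unigram_f1):
    """Return the candidate with the highest average similarity to the others.

    Expects at least two candidates.
    """
    best, best_score = None, float("-inf")
    for i, candidate in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        average = sum(similarity(candidate, other) for other in others) / len(others)
        if average > best_score:
            best, best_score = candidate, average
    return best, best_score


samples = ["I am hungry", "I am starving", "I am hungry now", "he is full"]
best, best_score = mbr_select(samples)
print(best)  # I am hungry
```

The selected sentence is the one that agrees most with the rest of the samples, which is the consensus view of MBR mentioned earlier.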
In summary, MBR takes several
translation candidates and compares them against each other. Then it chooses the one with the highest
average similarity. As in the beam search case, this method can give you a more
contextually accurate translation than random sampling and greedy decoding. Congratulations on finishing this week.
You now know how to implement your own machine translation system and
you also know how to evaluate it. Next week, I’ll talk about one of
the state-of-the-art models, known as the transformer, which also makes use
of an encoder-decoder architecture.
Correction for question 1:
Incorrect attempt at question 3: the correct answer to question 3 is Keys, Values, Queries, Softmax
Precision is defined as:
(Sum of overlapping unigrams in model and reference)/(total # of words in model)
Recall is defined as:
(Sum of overlapping unigrams in model and reference)/(total # of words in reference)
In ROUGE, precision measures how many of the words in the generated summary are also present in the reference summary, while recall measures how many of the words in the reference summary are also present in the generated summary.
Assignment 1: Neural Machine Translation
Welcome to the first assignment of Course 4. Here, you will build an English-to-Portuguese neural machine translation (NMT) model using Long Short-Term Memory (LSTM) networks with attention. Machine translation is an important task in natural language processing and could be useful not only for translating one language to another but also for word sense disambiguation (e.g. determining whether the word “bank” refers to the financial bank, or the land alongside a river). Implementing this using just a Recurrent Neural Network (RNN) with LSTMs can work for short to medium length sentences but can result in vanishing gradients for very long sequences. To help with this, you will be adding an attention mechanism to allow the decoder to access all relevant parts of the input sentence regardless of its length. By completing this assignment, you will:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Setting this env variable prevents TF warnings from showing up
import numpy as np
import tensorflow as tf
from collections import Counter
from utils import (sentences, train_data, val_data, english_vectorizer, portuguese_vectorizer,
masked_loss, masked_acc, tokens_to_text)
import w1_unittest
The utils.py file is as follows:
import numpy as np
import tensorflow as tf
import tensorflow_text as tf_text
import pathlib

path_to_file = pathlib.Path("por-eng/por.txt")

np.random.seed(1234)
tf.random.set_seed(1234)


def load_data(path):
    text = path.read_text(encoding="utf-8")

    lines = text.splitlines()
    pairs = [line.split("\t") for line in lines]

    context = np.array([context for target, context, _ in pairs])
    target = np.array([target for target, context, _ in pairs])

    return context, target


portuguese_sentences, english_sentences = load_data(path_to_file)
sentences = (portuguese_sentences, english_sentences)

BUFFER_SIZE = len(english_sentences)
BATCH_SIZE = 64
is_train = np.random.uniform(size=(len(portuguese_sentences),)) < 0.8

train_raw = (
    tf.data.Dataset.from_tensor_slices(
        (english_sentences[is_train], portuguese_sentences[is_train])
    )
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE)
)
val_raw = (
    tf.data.Dataset.from_tensor_slices(
        (english_sentences[~is_train], portuguese_sentences[~is_train])
    )
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE)
)


def tf_lower_and_split_punct(text):
    text = tf_text.normalize_utf8(text, "NFKD")
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, "[^ a-z.?!,¿]", "")
    text = tf.strings.regex_replace(text, "[.?!,¿]", r" \0 ")
    text = tf.strings.strip(text)
    text = tf.strings.join(["[SOS]", text, "[EOS]"], separator=" ")
    return text


max_vocab_size = 12000

english_vectorizer = tf.keras.layers.TextVectorization(
    standardize=tf_lower_and_split_punct, max_tokens=max_vocab_size, ragged=True
)
english_vectorizer.adapt(train_raw.map(lambda context, target: context))

portuguese_vectorizer = tf.keras.layers.TextVectorization(
    standardize=tf_lower_and_split_punct, max_tokens=max_vocab_size, ragged=True
)
portuguese_vectorizer.adapt(train_raw.map(lambda context, target: target))


def process_text(context, target):
    context = english_vectorizer(context).to_tensor()
    target = portuguese_vectorizer(target)
    targ_in = target[:, :-1].to_tensor()
    targ_out = target[:, 1:].to_tensor()
    return (context, targ_in), targ_out


train_data = train_raw.map(process_text, tf.data.AUTOTUNE)
val_data = val_raw.map(process_text, tf.data.AUTOTUNE)

del train_raw
del val_raw


def masked_loss(y_true, y_pred):
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
    loss = loss_fn(y_true, y_pred)

    # Check which elements of y_true are padding
    mask = tf.cast(y_true != 0, loss.dtype)
    loss *= mask

    # Return the total.
    return tf.reduce_sum(loss)/tf.reduce_sum(mask)


def masked_acc(y_true, y_pred):
    y_pred = tf.argmax(y_pred, axis=-1)
    y_pred = tf.cast(y_pred, y_true.dtype)
    match = tf.cast(y_true == y_pred, tf.float32)
    mask = tf.cast(y_true != 0, tf.float32)
    return tf.reduce_sum(match)/tf.reduce_sum(mask)


def tokens_to_text(tokens, id_to_word):
    words = id_to_word(tokens)
    result = tf.strings.reduce_join(words, axis=-1, separator=" ")
    return result
The text pre-processing bits have already been taken care of (if you are interested in this be sure to check the utils.py file). The steps performed can be summarized as:
Take a moment to inspect the raw sentences:
portuguese_sentences, english_sentences = sentences
print(f"English (to translate) sentence:\n\n{english_sentences[-5]}\n")
print(f"Portuguese (translation) sentence:\n\n{portuguese_sentences[-5]}")
Output
English (to translate) sentence:
No matter how much you try to convince people that chocolate is vanilla, it'll still be chocolate, even though you may manage to convince yourself and a few others that it's vanilla.
Portuguese (translation) sentence:
Não importa o quanto você tenta convencer os outros de que chocolate é baunilha, ele ainda será chocolate, mesmo que você possa convencer a si mesmo e poucos outros de que é baunilha.
You don’t have much use for the raw sentences so delete them to save memory:
del portuguese_sentences
del english_sentences
del sentences
Notice that you imported an english_vectorizer and a portuguese_vectorizer from utils.py. These were created using tf.keras.layers.TextVectorization and they provide interesting features such as ways to visualize the vocabulary and convert text into tokenized ids and vice versa. In fact, you can inspect the first ten words of the vocabularies for both languages:
print(f"First 10 words of the english vocabulary:\n\n{english_vectorizer.get_vocabulary()[:10]}\n")
print(f"First 10 words of the portuguese vocabulary:\n\n{portuguese_vectorizer.get_vocabulary()[:10]}")
Output
First 10 words of the english vocabulary:
['', '[UNK]', '[SOS]', '[EOS]', '.', 'tom', 'i', 'to', 'you', 'the']
First 10 words of the portuguese vocabulary:
['', '[UNK]', '[SOS]', '[EOS]', '.', 'tom', 'que', 'o', 'nao', 'eu']
Notice that the first 4 words are reserved for special words. In order, these are:
You can see how many words are in a vocabulary by using the vocabulary_size method:
# Size of the vocabulary
vocab_size_por = portuguese_vectorizer.vocabulary_size()
vocab_size_eng = english_vectorizer.vocabulary_size()
print(f"Portuguese vocabulary is made up of {vocab_size_por} words")
print(f"English vocabulary is made up of {vocab_size_eng} words")
Output
Portuguese vocabulary is made up of 12000 words
English vocabulary is made up of 12000 words
You can define tf.keras.layers.StringLookup objects that will help you map from words to ids and vice versa. Do this for the portuguese vocabulary since this will be useful later on when you decode the predictions from your model:
# This helps you convert from words to ids
word_to_id = tf.keras.layers.StringLookup(
vocabulary=portuguese_vectorizer.get_vocabulary(),
mask_token="",
oov_token="[UNK]"
)
# This helps you convert from ids to words
id_to_word = tf.keras.layers.StringLookup(
vocabulary=portuguese_vectorizer.get_vocabulary(),
mask_token="",
oov_token="[UNK]",
invert=True,
)
Try it out for the special tokens and a random word:
unk_id = word_to_id("[UNK]")
sos_id = word_to_id("[SOS]")
eos_id = word_to_id("[EOS]")
baunilha_id = word_to_id("baunilha")
print(f"The id for the [UNK] token is {unk_id}")
print(f"The id for the [SOS] token is {sos_id}")
print(f"The id for the [EOS] token is {eos_id}")
print(f"The id for baunilha (vanilla) is {baunilha_id}")
Output
The id for the [UNK] token is 1
The id for the [SOS] token is 2
The id for the [EOS] token is 3
The id for baunilha (vanilla) is 7079
Finally, take a look at what the data that is going to be fed to the neural network looks like. Both train_data and val_data are of type tf.data.Dataset and are already arranged in batches of 64 examples. To get the first batch out of a tf dataset you can use the take method. To get the first example out of the batch you can slice the tensor and use the numpy method for nicer printing:
for (to_translate, sr_translation), translation in train_data.take(1):
    print(f"Tokenized english sentence:\n{to_translate[0, :].numpy()}\n\n")
    print(f"Tokenized portuguese sentence (shifted to the right):\n{sr_translation[0, :].numpy()}\n\n")
    print(f"Tokenized portuguese sentence:\n{translation[0, :].numpy()}\n\n")
Output
Tokenized english sentence:
[ 2 210 9 146 123 38 9 1672 4 3 0 0 0 0]
Tokenized portuguese sentence (shifted to the right):
[ 2 1085 7 128 11 389 37 2038 4 0 0 0 0 0
0]
Tokenized portuguese sentence:
[1085 7 128 11 389 37 2038 4 3 0 0 0 0 0
0]
There are a couple of important details to notice.
The first two can be considered as the features, while the third one as the target. By doing this your model can perform Teacher Forcing as you saw in the lectures.
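The relationship between the shifted-to-the-right sentence and the target can be illustrated on a plain list of token ids. This mirrors the slicing in process_text from utils.py (`target[:, :-1]` and `target[:, 1:]`), but on a single unpadded Python list; the function name and the example ids are chosen just for this note.

```python
def shift_for_teacher_forcing(token_ids):
    """Split one tokenized target sentence into decoder input and expected output."""
    decoder_input = token_ids[:-1]   # starts with [SOS], drops the last token
    decoder_target = token_ids[1:]   # drops [SOS], ends with [EOS]
    return decoder_input, decoder_target


# 2 is [SOS] and 3 is [EOS] in the vocabularies shown above
sentence = [2, 1085, 7, 128, 4, 3]
decoder_input, decoder_target = shift_for_teacher_forcing(sentence)
print(decoder_input)   # [2, 1085, 7, 128, 4]
print(decoder_target)  # [1085, 7, 128, 4, 3]
```

At position i the decoder sees the true token i from the input and is trained to predict the token at position i of the target, which is the next word of the reference translation.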
Now it is time to begin coding!
The model you will build uses an encoder-decoder architecture. This Recurrent Neural Network (RNN) takes in a tokenized version of a sentence in its encoder, then passes it on to the decoder for translation. As mentioned in the lectures, just using a regular sequence-to-sequence model with LSTMs will work effectively for short to medium sentences but will start to degrade for longer ones. You can picture it like the figure below where all of the context of the input sentence is compressed into one vector that is passed into the decoder block. You can see how this will be an issue for very long sentences (e.g. 100 tokens or more) because the context of the first parts of the input will have very little effect on the final vector passed to the decoder.
Adding an attention layer to this model avoids this problem by giving the decoder access to all parts of the input sentence. To illustrate, let’s just use a 4-word input sentence as shown below. Remember that a hidden state is produced at each timestep of the encoder (represented by the orange rectangles). These are all passed to the attention layer and each are given a score given the current activation (i.e. hidden state) of the decoder. For instance, let’s consider the figure below where the first prediction “como” is already made. To produce the next prediction, the attention layer will first receive all the encoder hidden states (i.e. orange rectangles) as well as the decoder hidden state when producing the word “como” (i.e. first green rectangle). Given this information, it will score each of the encoder hidden states to know which one the decoder should focus on to produce the next word. As a result of training, the model might have learned that it should align to the second encoder hidden state and subsequently assigns a high probability to the word “você”. If we are using greedy decoding, we will output the said word as the next symbol, then restart the process to produce the next word until we reach an end-of-sentence prediction.
There are different ways to implement attention and the one we’ll use for this assignment is the Scaled Dot Product Attention which has the form:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
You will dive deeper into this equation in the next week but for now, you can think of it as computing scores using queries (Q) and keys (K), followed by a multiplication of values (V) to get a context vector at a particular timestep of the decoder. This context vector is fed to the decoder RNN to get a set of probabilities for the next predicted word. The division by square root of the keys dimensionality ($\sqrt{d_k}$) is for improving model performance and you’ll also learn more about it next week. For our machine translation application, the encoder activations (i.e. encoder hidden states) will be the keys and values, while the decoder activations (i.e. decoder hidden states) will be the queries.
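To make the equation concrete, here is a small pure-Python sketch of scaled dot-product attention on nested lists. It is only an illustration of the formula, not the Keras layer used in the assignment; the function names and the toy vectors are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for lists of vectors.

    Returns (context_vectors, attention_weights), one row per query.
    """
    d_k = len(keys[0])
    contexts, all_weights = [], []
    for q in queries:
        # Score the query against every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Context vector: weighted sum of the value vectors
        context = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


# One decoder state (query) attending over three encoder states (keys/values)
queries = [[0.0, 4.0]]
keys = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

contexts, weights = scaled_dot_product_attention(queries, keys, values)
print([round(w, 3) for w in weights[0]])
print([round(c, 3) for c in contexts[0]])
```

The query is aligned with the second key, so almost all of the attention weight goes to the second encoder state and the context vector is close to the second value vector.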
You will see in the upcoming sections that this complex architecture and mechanism can be implemented with just a few lines of code.
First you will define two important global variables:
In this assignment, the vocabulary sizes for English and Portuguese are the same. Therefore, we use a single constant VOCAB_SIZE throughout the notebook. While in other settings, vocabulary sizes could differ, that is not the case in our assignment.
VOCAB_SIZE = 12000
UNITS = 256
Your first exercise is to code the encoder part of the neural network. For this, complete the Encoder class below. Notice that in the constructor (the __init__ method) you need to define all of the sublayers of the encoder and then use these sublayers during the forward pass (the call method).
The encoder consists of the following layers:
input_dim and output_dim and let it know that you are using ‘0’ as padding, which can be done by using the appropriate value for the mask_zero parameter.
return_sequences parameter.
You need to define the forward pass using the syntax of TF’s functional API. What this means is that you chain function calls together to define your network like this:
encoder_input = keras.Input(shape=(28, 28, 1), name="original_img")
x = layers.Conv2D(16, 3, activation="relu")(encoder_input)
x = layers.MaxPooling2D(3)(x)
x = layers.Conv2D(16, 3, activation="relu")(x)
encoder_output = layers.GlobalMaxPooling2D()(x)
# GRADED CLASS: Encoder
class Encoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, units):
        """Initializes an instance of this class

        Args:
            vocab_size (int): Size of the vocabulary
            units (int): Number of units in the LSTM layer
        """
        super(Encoder, self).__init__()

        ### START CODE HERE ###

        self.embedding = tf.keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=units,
            mask_zero=True
        )

        self.rnn = tf.keras.layers.Bidirectional(
            merge_mode="sum",
            layer=tf.keras.layers.LSTM(
                units=units,
                return_sequences=True
            ),
        )

        ### END CODE HERE ###

    def call(self, context):
        """Forward pass of this layer

        Args:
            context (tf.Tensor): The sentence to translate

        Returns:
            tf.Tensor: Encoded sentence to translate
        """

        ### START CODE HERE ###

        # Pass the context through the embedding layer
        x = self.embedding(context)

        # Pass the output of the embedding through the RNN
        x = self.rnn(x)

        ### END CODE HERE ###

        return x
# Do a quick check of your implementation
# Create an instance of your class
encoder = Encoder(VOCAB_SIZE, UNITS)
# Pass a batch of sentences to translate from english to portuguese
encoder_output = encoder(to_translate)
print(f'Tensor of sentences in english has shape: {to_translate.shape}\n')
print(f'Encoder output has shape: {encoder_output.shape}')
Output
Tensor of sentences in english has shape: (64, 14)
Encoder output has shape: (64, 14, 256)
Expected Output
Tensor of sentences in english has shape: (64, 14)
Encoder output has shape: (64, 14, 256)
# Test your code!
w1_unittest.test_encoder(Encoder)
Output
All tests passed!
Your next exercise is to code the layer that will perform cross attention between the original sentences and the translations. For this, complete the CrossAttention class below. Notice that in the constructor (the __init__ method) you need to define all of the sublayers and then use these sublayers during the forward pass (the call method). For this particular case some of these bits are already taken care of.
The cross attention consists of the following layers:
key_dim, which is the size of the key and query tensors. You will also need to set the number of heads to 1 since you aren’t implementing multi head attention but attention between two tensors. The reason why this layer is preferred over Attention is that it allows simpler code during the forward pass.
A couple of things to notice:
# GRADED CLASS: CrossAttention
class CrossAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        """Initializes an instance of this class

        Args:
            units (int): Number of units in the LSTM layer
        """
        super().__init__()

        ### START CODE HERE ###

        self.mha = (
            tf.keras.layers.MultiHeadAttention(
                key_dim=units,
                num_heads=1
            )
        )

        ### END CODE HERE ###

        self.layernorm = tf.keras.layers.LayerNormalization()
        self.add = tf.keras.layers.Add()

    def call(self, context, target):
        """Forward pass of this layer

        Args:
            context (tf.Tensor): Encoded sentence to translate
            target (tf.Tensor): The embedded shifted-to-the-right translation

        Returns:
            tf.Tensor: Cross attention between context and target
        """
        ### START CODE HERE ###

        # Call the MH attention by passing in the query and value
        # For this case the query should be the translation and the value the encoded sentence to translate
        # Hint: Check the call arguments of MultiHeadAttention in the docs
        attn_output = self.mha(
            query=target,
            value=context
        )

        ### END CODE HERE ###

        x = self.add([target, attn_output])

        x = self.layernorm(x)

        return x
# Do a quick check of your implementation
# Create an instance of your class
attention_layer = CrossAttention(UNITS)
# The attention layer expects the embedded sr-translation and the context
# The context (encoder_output) is already embedded so you need to do this for sr_translation:
sr_translation_embed = tf.keras.layers.Embedding(VOCAB_SIZE, output_dim=UNITS, mask_zero=True)(sr_translation)
# Compute the cross attention
attention_result = attention_layer(encoder_output, sr_translation_embed)
print(f'Tensor of contexts has shape: {encoder_output.shape}')
print(f'Tensor of translations has shape: {sr_translation_embed.shape}')
print(f'Tensor of attention scores has shape: {attention_result.shape}')
Output
Tensor of contexts has shape: (64, 14, 256)
Tensor of translations has shape: (64, 15, 256)
Tensor of attention scores has shape: (64, 15, 256)
Expected Output
Tensor of contexts has shape: (64, 14, 256)
Tensor of translations has shape: (64, 15, 256)
Tensor of attention scores has shape: (64, 15, 256)
# Test your code!
w1_unittest.test_cross_attention(CrossAttention)
Output
All tests passed!
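To make the data flow of this layer more concrete, here is a minimal pure-Python sketch of single-head dot-product attention followed by the residual addition. It is only an illustration under simplifying assumptions and not what MultiHeadAttention does internally: there are no learned projections, no masking and no layer normalization, and the helper names softmax and cross_attention are made up for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(target, context):
    """Single-head scaled dot-product attention plus a residual connection.

    target:  list of query vectors (one per position in the translation)
    context: list of key/value vectors (one per position in the source sentence)
    """
    dim = len(target[0])
    output = []
    for query in target:
        # Similarity of this query with every context vector
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in context]
        weights = softmax(scores)
        # Weighted average of the context vectors
        attended = [sum(w * value[i] for w, value in zip(weights, context)) for i in range(dim)]
        # Residual connection: add the attention output back onto the query
        output.append([q + a for q, a in zip(query, attended)])
    return output

# Toy sizes: 3 target positions, 2 context positions, vectors of size 4
target = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
context = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
result = cross_attention(target, context)
print(len(result), len(result[0]))  # 3 4
```

The result has one row per target position, which is why the attention output above has shape (64, 15, 256) like the translations and not (64, 14, 256) like the contexts.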
Now you will implement the decoder part of the neural network by completing the Decoder class below. Notice that in the constructor (the __init__ method) you need to define all of the sublayers of the decoder and then use these sublayers during the forward pass (the call method).
The decoder consists of the following layers:
- Embedding. For this layer you need to define the appropriate input_dim and output_dim and let it know that you are using ‘0’ as padding, which can be done by using the appropriate value for the mask_zero parameter.
- Pre-attention LSTM. Make sure that it returns the full sequence and not only the last output, which can be done by using the appropriate value for the return_sequences parameter. It is very important that this layer returns the state since this will be needed for inference, so make sure to set the return_state parameter accordingly. Notice that LSTM layers return state as a tuple of two tensors called memory_state and carry_state; however, these names have been changed to hidden_state and cell_state respectively, to better reflect what you have seen in the lectures.
- The attention layer that performs cross attention between the sentence to translate and the right-shifted translation. Here you need to use the CrossAttention layer you defined in the previous exercise.
- Post-attention LSTM. This one also returns the full sequence, but it does not need to return the state.
- Finally a Dense layer with one unit per word in the vocabulary. Make sure to use a logsoftmax activation function for this one, which you can get as tf.nn.log_softmax.
# GRADED CLASS: Decoder
class Decoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, units):
        """Initializes an instance of this class

        Args:
            vocab_size (int): Size of the vocabulary
            units (int): Number of units in the LSTM layer
        """
        super(Decoder, self).__init__()

        ### START CODE HERE ###

        # The embedding layer
        self.embedding = tf.keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=units,
            mask_zero=True
        )

        # The RNN before attention
        self.pre_attention_rnn = tf.keras.layers.LSTM(
            units=units,
            return_sequences=True,
            return_state=True
        )

        # The attention layer
        self.attention = CrossAttention(units)

        # The RNN after attention
        self.post_attention_rnn = tf.keras.layers.LSTM(
            units=units,
            return_sequences=True
        )

        # The dense layer with logsoftmax activation
        self.output_layer = tf.keras.layers.Dense(
            units=vocab_size,
            activation=tf.nn.log_softmax
        )

        ### END CODE HERE ###

    def call(self, context, target, state=None, return_state=False):
        """Forward pass of this layer

        Args:
            context (tf.Tensor): Encoded sentence to translate
            target (tf.Tensor): The shifted-to-the-right translation
            state (list[tf.Tensor, tf.Tensor], optional): Hidden state of the pre-attention LSTM. Defaults to None.
            return_state (bool, optional): If set to true return the hidden states of the LSTM. Defaults to False.

        Returns:
            tf.Tensor: The log_softmax probabilities of predicting a particular token
        """
        ### START CODE HERE ###

        # Get the embedding of the input
        x = self.embedding(target)

        # Pass the embedded input into the pre attention LSTM
        # Hints:
        # - The LSTM you defined earlier should return the output alongside the state (made up of two tensors)
        # - Pass in the state to the LSTM (needed for inference)
        x, hidden_state, cell_state = self.pre_attention_rnn(x, initial_state=state)

        # Perform cross attention between the context and the output of the LSTM (in that order)
        x = self.attention(context, x)

        # Do a pass through the post attention LSTM
        x = self.post_attention_rnn(x)

        # Compute the logits
        logits = self.output_layer(x)

        ### END CODE HERE ###

        if return_state:
            return logits, [hidden_state, cell_state]

        return logits
# Do a quick check of your implementation
# Create an instance of your class
decoder = Decoder(VOCAB_SIZE, UNITS)
# Notice that you don't need the embedded version of sr_translation since this is done inside the class
logits = decoder(encoder_output, sr_translation)
print(f'Tensor of contexts has shape: {encoder_output.shape}')
print(f'Tensor of right-shifted translations has shape: {sr_translation.shape}')
print(f'Tensor of logits has shape: {logits.shape}')
Output
Tensor of contexts has shape: (64, 14, 256)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)
Expected Output
Tensor of contexts has shape: (64, 14, 256)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)
# Test your code!
w1_unittest.test_decoder(Decoder, CrossAttention)
Output
All tests passed!
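The last layer of the decoder outputs log-probabilities rather than raw scores. The sketch below shows, for a made-up 3-word vocabulary, what a log-softmax computes and why exponentiating its output gives a valid probability distribution. It is a plain-Python illustration, not the TensorFlow implementation of tf.nn.log_softmax, and the input scores are hypothetical.

```python
import math

def log_softmax(logits):
    """Log of the softmax, computed stably via the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]

raw_scores = [2.0, 1.0, 0.1]   # hypothetical scores for a 3-word vocabulary
log_probs = log_softmax(raw_scores)
probs = [math.exp(lp) for lp in log_probs]

print([round(lp, 3) for lp in log_probs])  # all values are <= 0
print(round(sum(probs), 6))                # 1.0
```

This also explains why every "logit" printed later in these notes is negative: they are log-probabilities, so the closer to 0 the more confident the prediction.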
Now you have to put together all of the layers you previously coded into an actual model. For this, complete the Translator class below. Notice that, unlike the Encoder and Decoder classes, which inherit from tf.keras.layers.Layer, the Translator class inherits from tf.keras.Model.
Remember that train_data will yield a tuple with the sentence to translate and the shifted-to-the-right translation, which are the “features” of the model. This means that the inputs of your network will be tuples containing context and targets.
# GRADED CLASS: Translator
class Translator(tf.keras.Model):
    def __init__(self, vocab_size, units):
        """Initializes an instance of this class

        Args:
            vocab_size (int): Size of the vocabulary
            units (int): Number of units in the LSTM layer
        """
        super().__init__()

        ### START CODE HERE ###

        # Define the encoder with the appropriate vocab_size and number of units
        self.encoder = Encoder(vocab_size, units)

        # Define the decoder with the appropriate vocab_size and number of units
        self.decoder = Decoder(vocab_size, units)

        ### END CODE HERE ###

    def call(self, inputs):
        """Forward pass of this layer

        Args:
            inputs (tuple(tf.Tensor, tf.Tensor)): Tuple containing the context (sentence to translate) and the target (shifted-to-the-right translation)

        Returns:
            tf.Tensor: The log_softmax probabilities of predicting a particular token
        """
        ### START CODE HERE ###

        # In this case inputs is a tuple consisting of the context and the target, unpack it into single variables
        context, target = inputs

        # Pass the context through the encoder
        encoded_context = self.encoder(context)

        # Compute the logits by passing the encoded context and the target to the decoder
        logits = self.decoder(encoded_context, target)

        ### END CODE HERE ###

        return logits
# Do a quick check of your implementation
# Create an instance of your class
translator = Translator(VOCAB_SIZE, UNITS)
# Compute the logits for every word in the vocabulary
logits = translator((to_translate, sr_translation))
print(f'Tensor of sentences to translate has shape: {to_translate.shape}')
print(f'Tensor of right-shifted translations has shape: {sr_translation.shape}')
print(f'Tensor of logits has shape: {logits.shape}')
Output
Tensor of sentences to translate has shape: (64, 14)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)
Expected Output
Tensor of sentences to translate has shape: (64, 14)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)
w1_unittest.test_translator(Translator, Encoder, Decoder)
Output
All tests passed!
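The "shifted-to-the-right translation" can be easier to picture with a tiny example. The sketch below shows one common way to build the decoder input and the labels from a tokenized translation. The token ids for SOS and EOS and the helper make_decoder_pair are assumptions made for this illustration; the notebook prepares its data elsewhere and uses its own sos_id and eos_id.

```python
SOS_ID, EOS_ID = 2, 3  # hypothetical ids; the notebook defines its own sos_id / eos_id

def make_decoder_pair(translation_ids):
    """Split a tokenized translation into decoder input and prediction labels."""
    decoder_input = [SOS_ID] + translation_ids   # shifted to the right
    labels = translation_ids + [EOS_ID]          # what the decoder should predict
    return decoder_input, labels

decoder_input, labels = make_decoder_pair([9, 564, 850, 4])
for step, (seen, expected) in enumerate(zip(decoder_input, labels)):
    print(f"step {step}: decoder sees {seen:>3} -> should predict {expected}")
```

At every step the decoder sees the previous correct token and is asked to predict the next one, which is why the model takes both the context and the shifted target as inputs during training.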
Now that you have an untrained instance of the NMT model, it is time to train it. You can use the compile_and_train function below to achieve this:
def compile_and_train(model, epochs=20, steps_per_epoch=500):
    model.compile(optimizer="adam", loss=masked_loss, metrics=[masked_acc, masked_loss])

    history = model.fit(
        train_data.repeat(),
        epochs=epochs,
        steps_per_epoch=steps_per_epoch,
        validation_data=val_data,
        validation_steps=50,
        callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)],
    )

    return model, history
# Train the translator (this takes some minutes so feel free to take a break)
trained_translator, history = compile_and_train(translator)
Output
Epoch 1/20
500/500 [==============================] - 48s 68ms/step - loss: 5.1397 - masked_acc: 0.2225 - masked_loss: 5.1422 - val_loss: 4.2776 - val_masked_acc: 0.3481 - val_masked_loss: 4.2784
Epoch 2/20
500/500 [==============================] - 16s 33ms/step - loss: 3.6741 - masked_acc: 0.4191 - masked_loss: 3.6749 - val_loss: 3.0218 - val_masked_acc: 0.4965 - val_masked_loss: 3.0237
Epoch 3/20
500/500 [==============================] - 16s 32ms/step - loss: 2.7304 - masked_acc: 0.5441 - masked_loss: 2.7319 - val_loss: 2.3823 - val_masked_acc: 0.5844 - val_masked_loss: 2.3826
Epoch 4/20
500/500 [==============================] - 16s 32ms/step - loss: 2.2110 - masked_acc: 0.6180 - masked_loss: 2.2119 - val_loss: 1.9733 - val_masked_acc: 0.6483 - val_masked_loss: 1.9741
Epoch 5/20
500/500 [==============================] - 16s 31ms/step - loss: 1.8569 - masked_acc: 0.6690 - masked_loss: 1.8581 - val_loss: 1.7101 - val_masked_acc: 0.6829 - val_masked_loss: 1.7105
Epoch 6/20
500/500 [==============================] - 15s 30ms/step - loss: 1.5978 - masked_acc: 0.7010 - masked_loss: 1.5987 - val_loss: 1.5794 - val_masked_acc: 0.7062 - val_masked_loss: 1.5785
Epoch 7/20
500/500 [==============================] - 15s 31ms/step - loss: 1.4876 - masked_acc: 0.7169 - masked_loss: 1.4883 - val_loss: 1.4743 - val_masked_acc: 0.7138 - val_masked_loss: 1.4749
Epoch 8/20
500/500 [==============================] - 16s 32ms/step - loss: 1.3877 - masked_acc: 0.7300 - masked_loss: 1.3886 - val_loss: 1.3770 - val_masked_acc: 0.7270 - val_masked_loss: 1.3786
Epoch 9/20
500/500 [==============================] - 16s 32ms/step - loss: 1.3113 - masked_acc: 0.7391 - masked_loss: 1.3126 - val_loss: 1.2842 - val_masked_acc: 0.7379 - val_masked_loss: 1.2841
Epoch 10/20
500/500 [==============================] - 16s 31ms/step - loss: 1.1826 - masked_acc: 0.7548 - masked_loss: 1.1840 - val_loss: 1.2217 - val_masked_acc: 0.7491 - val_masked_loss: 1.2225
Epoch 11/20
500/500 [==============================] - 15s 30ms/step - loss: 1.0788 - masked_acc: 0.7676 - masked_loss: 1.0797 - val_loss: 1.1737 - val_masked_acc: 0.7615 - val_masked_loss: 1.1739
Epoch 12/20
500/500 [==============================] - 15s 30ms/step - loss: 1.0570 - masked_acc: 0.7716 - masked_loss: 1.0579 - val_loss: 1.1306 - val_masked_acc: 0.7603 - val_masked_loss: 1.1307
Epoch 13/20
500/500 [==============================] - 15s 31ms/step - loss: 1.0358 - masked_acc: 0.7733 - masked_loss: 1.0369 - val_loss: 1.1341 - val_masked_acc: 0.7596 - val_masked_loss: 1.1348
Epoch 14/20
500/500 [==============================] - 15s 31ms/step - loss: 1.0092 - masked_acc: 0.7777 - masked_loss: 1.0103 - val_loss: 1.0890 - val_masked_acc: 0.7683 - val_masked_loss: 1.0894
Epoch 15/20
500/500 [==============================] - 16s 31ms/step - loss: 0.9015 - masked_acc: 0.7928 - masked_loss: 0.9026 - val_loss: 1.0708 - val_masked_acc: 0.7712 - val_masked_loss: 1.0722
Epoch 16/20
500/500 [==============================] - 15s 31ms/step - loss: 0.8550 - masked_acc: 0.7989 - masked_loss: 0.8556 - val_loss: 1.0757 - val_masked_acc: 0.7720 - val_masked_loss: 1.0774
Epoch 17/20
500/500 [==============================] - 15s 31ms/step - loss: 0.8635 - masked_acc: 0.7969 - masked_loss: 0.8643 - val_loss: 1.0103 - val_masked_acc: 0.7784 - val_masked_loss: 1.0115
Epoch 18/20
500/500 [==============================] - 16s 32ms/step - loss: 0.8636 - masked_acc: 0.7970 - masked_loss: 0.8643 - val_loss: 1.0002 - val_masked_acc: 0.7817 - val_masked_loss: 1.0015
Epoch 19/20
500/500 [==============================] - 15s 31ms/step - loss: 0.8465 - masked_acc: 0.7990 - masked_loss: 0.8471 - val_loss: 0.9914 - val_masked_acc: 0.7835 - val_masked_loss: 0.9912
Epoch 20/20
500/500 [==============================] - 16s 31ms/step - loss: 0.7365 - masked_acc: 0.8181 - masked_loss: 0.7373 - val_loss: 0.9863 - val_masked_acc: 0.7834 - val_masked_loss: 0.9869
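The training call uses EarlyStopping(patience=3), yet all 20 epochs ran. The sketch below is a simplified pure-Python version of the patience rule (stop once the monitored value has failed to improve for `patience` epochs in a row), applied to the val_loss values from the log above. It assumes the callback monitors val_loss with no minimum delta, which is its default behavior; the function epochs_run is made up for this illustration.

```python
def epochs_run(val_losses, patience=3):
    """Return how many epochs run before patience-based early stopping triggers."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0   # improvement: reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch       # stop early
    return len(val_losses)

# val_loss values copied from the training log above
val_losses = [4.2776, 3.0218, 2.3823, 1.9733, 1.7101, 1.5794, 1.4743, 1.3770,
              1.2842, 1.2217, 1.1737, 1.1306, 1.1341, 1.0890, 1.0708, 1.0757,
              1.0103, 1.0002, 0.9914, 0.9863]
print(epochs_run(val_losses))  # 20
```

The validation loss only failed to improve at epochs 13 and 16, and each time it recovered on the next epoch, so the counter never reached 3.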
Now that your model is trained you can use it for inference. To help you with this the generate_next_token function is provided. Notice that this function is meant to be used inside a for-loop, so you feed to it the information of the previous step to generate the information of the next step. In particular you need to keep track of the state of the pre-attention LSTM in the decoder and if you are done with the translation. Also notice that a temperature variable is introduced which determines how to select the next token given the predicted logits:
def generate_next_token(decoder, context, next_token, done, state, temperature=0.0):
    """Generates the next token in the sequence

    Args:
        decoder (Decoder): The decoder
        context (tf.Tensor): Encoded sentence to translate
        next_token (tf.Tensor): The predicted next token
        done (bool): True if the translation is complete
        state (list[tf.Tensor, tf.Tensor]): Hidden states of the pre-attention LSTM layer
        temperature (float, optional): The temperature that controls the randomness of the predicted tokens. Defaults to 0.0.

    Returns:
        tuple(tf.Tensor, np.float, list[tf.Tensor, tf.Tensor], bool): The next token, log prob of said token, hidden state of LSTM and if translation is done
    """
    # Get the logits and state from the decoder
    logits, state = decoder(context, next_token, state=state, return_state=True)

    # Trim the intermediate dimension
    logits = logits[:, -1, :]

    # If temp is 0 then next_token is the argmax of logits
    if temperature == 0.0:
        next_token = tf.argmax(logits, axis=-1)

    # If temp is not 0 then next_token is sampled out of logits
    else:
        logits = logits / temperature
        next_token = tf.random.categorical(logits, num_samples=1)

    # Trim dimensions of size 1
    logits = tf.squeeze(logits)
    next_token = tf.squeeze(next_token)

    # Get the logit of the selected next_token
    logit = logits[next_token].numpy()

    # Reshape to (1,1) since this is the expected shape for text encoded as TF tensors
    next_token = tf.reshape(next_token, shape=(1,1))

    # If next_token is End-of-Sentence token you are done
    if next_token == eos_id:
        done = True

    return next_token, logit, state, done
See how it works by running the following cell:
# PROCESS SENTENCE TO TRANSLATE AND ENCODE

# A sentence you wish to translate
eng_sentence = "I love languages"

# Convert it to a tensor
texts = tf.convert_to_tensor(eng_sentence)[tf.newaxis]

# Vectorize it and pass it through the encoder
context = english_vectorizer(texts).to_tensor()
context = encoder(context)

# SET STATE OF THE DECODER

# Next token is Start-of-Sentence since you are starting fresh
next_token = tf.fill((1,1), sos_id)

# Hidden and Cell states of the LSTM can be mocked using uniform samples
state = [tf.random.uniform((1, UNITS)), tf.random.uniform((1, UNITS))]

# You are not done until next token is EOS token
done = False

# Generate next token
next_token, logit, state, done = generate_next_token(decoder, context, next_token, done, state, temperature=0.5)
print(f"Next token: {next_token}\nLogit: {logit:.4f}\nDone? {done}")
Output
Next token: [[8841]]
Logit: -18.7456
Done? False
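To see what the temperature does to the choice of the next token, here is a small pure-Python sketch of the same idea: a temperature of 0 picks the most likely token, while any other value divides the log-probabilities by the temperature before sampling. The 3-token distribution and the helper pick_token are hypothetical and only mimic the logic of generate_next_token without TensorFlow.

```python
import math
import random

def pick_token(log_probs, temperature=0.0, rng=random):
    """Pick a token index from log-probabilities, greedily or by sampling."""
    if temperature == 0.0:
        # Greedy: index of the largest log-probability
        return max(range(len(log_probs)), key=lambda i: log_probs[i])
    # Scale, then turn the scaled scores back into sampling weights
    scaled = [lp / temperature for lp in log_probs]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(log_probs)), weights=weights, k=1)[0]

log_probs = [math.log(p) for p in (0.6, 0.3, 0.1)]  # hypothetical 3-token vocabulary
rng = random.Random(0)

print(pick_token(log_probs))  # 0 (greedy choice)
for t in (0.2, 1.0, 5.0):
    draws = [pick_token(log_probs, t, rng) for _ in range(1000)]
    print(t, [draws.count(i) / 1000 for i in range(3)])
```

Low temperatures concentrate almost all draws on the most likely token, a temperature of 1.0 samples from the original distribution, and high temperatures flatten it towards uniform.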
Now you can put everything together to translate a given sentence. For this, complete the translate function below. This function will take care of the following steps:
- Process the sentence to translate and encode it.
- Set the initial state of the decoder.
- Get predictions of the next token (starting with the SOS token) for a maximum of max_length iterations, stopping early if the EOS token is drawn.
- Return the translated text (as a string), the logit of the last iteration and the translation in token format.
Hints:
- The encoder and decoder are available as model.encoder and model.decoder.
- The first token should be the SOS token with shape (1, 1).
- The initial hidden and cell states should be tensors of zeros with shape (1, UNITS).
# GRADED FUNCTION: translate
def translate(model, text, max_length=50, temperature=0.0):
    """Translate a given sentence from English to Portuguese

    Args:
        model (tf.keras.Model): The trained translator
        text (string): The sentence to translate
        max_length (int, optional): The maximum length of the translation. Defaults to 50.
        temperature (float, optional): The temperature that controls the randomness of the predicted tokens. Defaults to 0.0.

    Returns:
        tuple(str, np.float, tf.Tensor): The translation, logit that predicted <EOS> token and the tokenized translation
    """
    # Lists to save tokens and logits
    tokens, logits = [], []

    ### START CODE HERE ###

    # PROCESS THE SENTENCE TO TRANSLATE

    # Convert the original string into a tensor
    text = tf.convert_to_tensor(text)[tf.newaxis]

    # Vectorize the text using the correct vectorizer
    context = english_vectorizer(text).to_tensor()

    # Get the encoded context (pass the context through the encoder)
    # Hint: Remember you can get the encoder by using model.encoder
    context = model.encoder(context)

    # INITIAL STATE OF THE DECODER

    # First token should be SOS token with shape (1,1)
    next_token = tf.fill((1, 1), sos_id)

    # Initial hidden and cell states should be tensors of zeros with shape (1, UNITS)
    state = [tf.zeros((1, UNITS)), tf.zeros((1, UNITS))]

    # You are done when you draw a EOS token as next token (initial state is False)
    done = False

    # Iterate for max_length iterations
    for i in range(max_length):
        # Generate the next token
        try:
            next_token, logit, state, done = generate_next_token(
                decoder=model.decoder,
                context=context,
                next_token=next_token,
                done=done,
                state=state,
                temperature=temperature
            )
        except:
            raise Exception("Problem generating the next token")

        # If done then break out of the loop
        if done:
            break

        # Add next_token to the list of tokens
        tokens.append(next_token)

        # Add logit to the list of logits
        logits.append(logit)

    ### END CODE HERE ###

    # Concatenate all tokens into a tensor
    tokens = tf.concat(tokens, axis=-1)

    # Convert the translated tokens into text
    translation = tf.squeeze(tokens_to_text(tokens, id_to_word))
    translation = translation.numpy().decode()

    return translation, logits[-1], tokens
Try your function with temperature of 0, which will yield a deterministic output and is equivalent to a greedy decoding:
# Running this cell multiple times should return the same output since temp is 0
temp = 0.0
original_sentence = "I love languages"
translation, logit, tokens = translate(trained_translator, original_sentence, temperature=temp)
print(f"Temperature: {temp}\n\nOriginal sentence: {original_sentence}\nTranslation: {translation}\nTranslation tokens:{tokens}\nLogit: {logit:.3f}")
Output
Temperature: 0.0
Original sentence: I love languages
Translation: eu adoro idiomas ja .
Translation tokens:[[ 9 564 850 69 4]]
Logit: -1.165
Try your function with temperature of 0.7 (stochastic output):
# Running this cell multiple times should return different outputs since temp is not 0
# You can try different temperatures
temp = 0.7
original_sentence = "I love languages"
translation, logit, tokens = translate(trained_translator, original_sentence, temperature=temp)
print(f"Temperature: {temp}\n\nOriginal sentence: {original_sentence}\nTranslation: {translation}\nTranslation tokens:{tokens}\nLogit: {logit:.3f}")
Output
Temperature: 0.7
Original sentence: I love languages
Translation: eu amo linguas em favor .
Translation tokens:[[ 9 522 1032 22 90 4]]
Logit: -1.007
w1_unittest.test_translate(translate, trained_translator)
Output
All tests passed!
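The control flow of translate can be isolated from TensorFlow with a toy stand-in for the decoder. In the sketch below, TOY_NEXT is a hypothetical lookup table that plays the role of generate_next_token; everything else follows the same loop: start from SOS, ask for the next token, stop at EOS or after max_length steps, and never store the EOS token itself.

```python
SOS, EOS = "<s>", "</s>"

# Hypothetical stand-in for the decoder: maps the previous token to the next one
TOY_NEXT = {SOS: "eu", "eu": "adoro", "adoro": "idiomas", "idiomas": EOS}

def toy_next_token(prev_token):
    return TOY_NEXT.get(prev_token, EOS)

def toy_translate(max_length=50):
    tokens = []
    next_token = SOS
    for _ in range(max_length):
        next_token = toy_next_token(next_token)
        if next_token == EOS:   # stop before storing the end-of-sentence token
            break
        tokens.append(next_token)
    return tokens

print(toy_translate())              # ['eu', 'adoro', 'idiomas']
print(toy_translate(max_length=2))  # ['eu', 'adoro']
```

The second call shows the role of max_length: it caps the translation even when no EOS token has been produced yet.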
As mentioned in the lectures, getting the most probable token at each step may not necessarily produce the best results. Another approach is to do Minimum Bayes Risk Decoding or MBR. The general steps to implement this are:
- Take several random samples.
- Score each sample against all other samples.
- Select the one with the highest score.
You will be building helper functions for these steps in the following sections.
With the ability to generate different translations by setting different temperature values you can do what you saw in the lectures and generate a bunch of translations and then determine which one is the best candidate. You will now do this by using the provided generate_samples function. This function will return any desired number of candidate translations alongside the log-probability for each one:
def generate_samples(model, text, n_samples=4, temperature=0.6):
    samples, log_probs = [], []

    # Iterate for n_samples iterations
    for _ in range(n_samples):
        # Save the logit and the translated tensor
        _, logp, sample = translate(model, text, temperature=temperature)

        # Save the translated tensors
        samples.append(np.squeeze(sample.numpy()).tolist())

        # Save the logits
        log_probs.append(logp)

    return samples, log_probs
samples, log_probs = generate_samples(trained_translator, 'I love languages')
for s, l in zip(samples, log_probs):
    print(f"Translated tensor: {s} has logit: {l:.3f}")
Output
Translated tensor: [9, 9, 522, 850, 12, 256, 4] has logit: -0.625
Translated tensor: [9, 9, 522, 850, 69, 4] has logit: -1.636
Translated tensor: [9, 564, 850, 38, 8138, 4] has logit: -0.201
Translated tensor: [9, 522, 850, 12, 256, 4] has logit: -0.694
Now that you can generate multiple translations it is time to come up with a method to measure the goodness of each one. As you saw in the lectures, one way to achieve this is by comparing each sample against the others.
There are several metrics you can use for this purpose, as shown in the lectures and you can try experimenting with any one of these. For this assignment, you will be calculating scores for unigram overlaps.
One of these metrics is the widely used yet simple Jaccard similarity which gets the intersection over union of two sets. The jaccard_similarity function returns this metric for any pair of candidate and reference translations:
def jaccard_similarity(candidate, reference):
    # Convert the lists to sets to get the unique tokens
    candidate_set = set(candidate)
    reference_set = set(reference)

    # Get the set of tokens common to both candidate and reference
    common_tokens = candidate_set.intersection(reference_set)

    # Get the set of all tokens found in either candidate or reference
    all_tokens = candidate_set.union(reference_set)

    # Compute the percentage of overlap (divide the number of common tokens by the number of all tokens)
    overlap = len(common_tokens) / len(all_tokens)

    return overlap
l1 = [1, 2, 3]
l2 = [1, 2, 3, 4]
js = jaccard_similarity(l1, l2)
print(f"jaccard similarity between lists: {l1} and {l2} is {js:.3f}")
Output
jaccard similarity between lists: [1, 2, 3] and [1, 2, 3, 4] is 0.750
Expected Output
jaccard similarity between tensors: [1, 2, 3] and [1, 2, 3, 4] is 0.750
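One consequence of working with sets is that repeated tokens are invisible to this metric. The short sketch below uses two of the sampled translations printed earlier, which differ only by a repeated token 9, and re-declares a compact jaccard_similarity so that it runs on its own.

```python
def jaccard_similarity(candidate, reference):
    candidate_set, reference_set = set(candidate), set(reference)
    return len(candidate_set & reference_set) / len(candidate_set | reference_set)

# Two of the sampled translations printed earlier; the first repeats token 9
with_repeat = [9, 9, 522, 850, 12, 256, 4]
without_repeat = [9, 522, 850, 12, 256, 4]

print(jaccard_similarity(with_repeat, without_repeat))  # 1.0
```

The two token lists are scored as identical even though one of them stutters, which is one motivation for a count-based metric.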
Jaccard similarity is good but a more commonly used metric in machine translation is the ROUGE score. For unigrams, this is called ROUGE-1 and as shown in the lectures, you can output the scores for both precision and recall when comparing two samples. To get the final score, you will want to compute the F1-score as given by:
$$\text{score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
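As a worked instance of this formula, take the candidate [1, 2, 3] and the reference [1, 2, 3, 4]. All three candidate tokens appear in the reference, so the overlap is 3, the candidate has 3 tokens and the reference has 4:

```latex
\begin{aligned}
\text{precision} &= \frac{3}{3} = 1, \qquad \text{recall} = \frac{3}{4} = 0.75 \\
\text{score} &= 2 \cdot \frac{1 \cdot 0.75}{1 + 0.75} = \frac{1.5}{1.75} \approx 0.857
\end{aligned}
```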
For the implementation of the rouge1_similarity function you want to use the Counter class from the Python standard library:
# GRADED FUNCTION: rouge1_similarity
def rouge1_similarity(candidate, reference):
    """Computes the ROUGE 1 score between two token lists

    Args:
        candidate (list[int]): Tokenized candidate translation
        reference (list[int]): Tokenized reference translation

    Returns:
        float: Overlap between the two token lists
    """
    ### START CODE HERE ###

    # Make a frequency table of the candidate and reference tokens
    # Hint: use the Counter class (already imported)
    candidate_word_counts = Counter(candidate)
    reference_word_counts = Counter(reference)

    # Initialize overlap at 0
    overlap = 0

    # Iterate over the tokens in the candidate frequency table
    # Hint: Counter is a subclass of dict and you can get the keys
    #       out of a dict using the keys method like this: dict.keys()
    for token in candidate_word_counts.keys():

        # Get the count of the current token in the candidate frequency table
        # Hint: You can access the counts of a token as you would access values of a dictionary
        token_count_candidate = candidate_word_counts[token]

        # Get the count of the current token in the reference frequency table
        # Hint: You can access the counts of a token as you would access values of a dictionary
        token_count_reference = reference_word_counts.get(token, 0)

        # Update the overlap by getting the minimum between the two token counts above
        overlap += np.minimum(token_count_candidate, token_count_reference)

    # Compute the precision
    # Hint: precision = overlap / (number of tokens in candidate list)
    precision = overlap / len(candidate)

    # Compute the recall
    # Hint: recall = overlap / (number of tokens in reference list)
    recall = overlap / len(reference)

    if precision + recall != 0:
        # Compute the Rouge1 Score
        # Hint: This is equivalent to the F1 score
        f1_score = 2 * (precision * recall) / (precision + recall)
        return f1_score

    ### END CODE HERE ###

    return 0 # If precision + recall = 0 then return 0
l1 = [1, 2, 3]
l2 = [1, 2, 3, 4]
r1s = rouge1_similarity(l1, l2)
print(f"rouge 1 similarity between lists: {l1} and {l2} is {r1s:.3f}")
Output
rouge 1 similarity between lists: [1, 2, 3] and [1, 2, 3, 4] is 0.857
Expected Output
rouge 1 similarity between lists: [1, 2, 3] and [1, 2, 3, 4] is 0.857
w1_unittest.test_rouge1_similarity(rouge1_similarity)
Output
All tests passed!
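As a side note, the clipped unigram overlap can also be obtained with the `&` operator of Counter, which keeps each token with the minimum of its two counts. The sketch below is an alternative formulation written for these notes, not the graded solution; the name rouge1_f1 is made up, and it should return the same values as rouge1_similarity on non-empty inputs.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1 using Counter intersection for the clipped unigram overlap."""
    if not candidate or not reference:
        return 0.0
    # `&` keeps each token with the minimum of its two counts
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1([1, 2, 3], [1, 2, 3, 4]), 3))  # 0.857
print(round(rouge1_f1([9, 9, 522], [9, 522]), 3))    # 0.8
```

The second call shows that, unlike Jaccard similarity, this metric penalizes the repeated token: only one of the two 9s is matched, so precision drops to 2/3.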
You will now build a function to generate the overall score for a particular sample. As mentioned in the lectures, you need to compare each sample with all other samples. For instance, if we generated 30 sentences, we will need to compare sentence 1 to sentences 2 through 30. Then, we compare sentence 2 to sentences 1 and 3 through 30, and so forth. At each step, we get the average score of all comparisons to get the overall score for a particular sample. To illustrate, these will be the steps to generate the scores of a 4-sample list.
- Get the similarity score between sample 1 and sample 2.
- Get the similarity score between sample 1 and sample 3.
- Get the similarity score between sample 1 and sample 4.
- Get the average of these three scores. This will be the overall score of sample 1.
- Repeat for samples 2, 3 and 4 until every sample has an overall score.
The results will be stored in a dictionary for easy lookups.
Complete the average_overlap function below which should implement the process described above:
# GRADED FUNCTION: average_overlap
def average_overlap(samples, similarity_fn):
    """Computes the arithmetic mean of each candidate sentence in the samples

    Args:
        samples (list[list[int]]): Tokenized version of translated sentences
        similarity_fn (Function): Similarity function used to compute the overlap

    Returns:
        dict[int, float]: A dictionary mapping the index of each translation to its score
    """
    # Initialize dictionary
    scores = {}

    # Iterate through all samples (enumerate helps keep track of indexes)
    for index_candidate, candidate in enumerate(samples):

        ### START CODE HERE ###

        # Initially overlap is zero
        overlap = 0

        # Iterate through all samples (enumerate helps keep track of indexes)
        for index_sample, sample in enumerate(samples):

            # Skip if the candidate index is the same as the sample index
            if index_candidate == index_sample:
                continue

            # Get the overlap between candidate and sample using the similarity function
            sample_overlap = similarity_fn(candidate, sample)

            # Add the sample overlap to the total overlap
            overlap += sample_overlap

        ### END CODE HERE ###

        # Get the score for the candidate by computing the average
        score = overlap / (len(samples) - 1)

        # Only use 3 decimal points
        score = round(score, 3)

        # Save the score in the dictionary. use index as the key.
        scores[index_candidate] = score

    return scores
# Test with Jaccard similarity
l1 = [1, 2, 3]
l2 = [1, 2, 4]
l3 = [1, 2, 4, 5]
avg_ovlp = average_overlap([l1, l2, l3], jaccard_similarity)
print(f"average overlap between lists: {l1}, {l2} and {l3} using Jaccard similarity is:\n\n{avg_ovlp}")
Output
average overlap between lists: [1, 2, 3], [1, 2, 4] and [1, 2, 4, 5] using Jaccard similarity is:
{0: 0.45, 1: 0.625, 2: 0.575}
Expected Output
average overlap between lists: [1, 2, 3], [1, 2, 4] and [1, 2, 4, 5] using Jaccard similarity is:
{0: 0.45, 1: 0.625, 2: 0.575}
# Test with Rouge1 similarity
l1 = [1, 2, 3]
l2 = [1, 4]
l3 = [1, 2, 4, 5]
l4 = [5,6]
avg_ovlp = average_overlap([l1, l2, l3, l4], rouge1_similarity)
print(f"average overlap between lists: {l1}, {l2}, {l3} and {l4} using Rouge1 similarity is:\n\n{avg_ovlp}")
Output
average overlap between lists: [1, 2, 3], [1, 4], [1, 2, 4, 5] and [5, 6] using Rouge1 similarity is:
{0: 0.324, 1: 0.356, 2: 0.524, 3: 0.111}
Expected Output
average overlap between lists: [1, 2, 3], [1, 4], [1, 2, 4, 5] and [5, 6] using Rouge1 similarity is:
{0: 0.324, 1: 0.356, 2: 0.524, 3: 0.111}
w1_unittest.test_average_overlap(average_overlap)
Output
All tests passed!
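Both similarity functions used here are symmetric (comparing a with b gives the same score as comparing b with a), so each pair only needs to be evaluated once. The sketch below is an optional variation under that assumption, not part of the assignment; average_overlap_symmetric is a made-up name, and a compact jaccard_similarity is re-declared so the block runs on its own.

```python
from itertools import combinations

def jaccard_similarity(candidate, reference):
    a, b = set(candidate), set(reference)
    return len(a & b) / len(a | b)

def average_overlap_symmetric(samples, similarity_fn):
    """Average overlap that evaluates each unordered pair only once.

    Assumes similarity_fn(a, b) == similarity_fn(b, a).
    """
    totals = [0.0] * len(samples)
    for (i, first), (j, second) in combinations(enumerate(samples), 2):
        score = similarity_fn(first, second)
        totals[i] += score
        totals[j] += score
    return {i: round(total / (len(samples) - 1), 3) for i, total in enumerate(totals)}

print(average_overlap_symmetric([[1, 2, 3], [1, 2, 4], [1, 2, 4, 5]], jaccard_similarity))
# {0: 0.45, 1: 0.625, 2: 0.575}
```

This halves the number of similarity calls, which starts to matter when the number of samples grows, since the pairwise comparison is quadratic in the number of samples.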
In practice, it is also common to see the weighted mean being used to calculate the overall score instead of just the arithmetic mean. This is implemented in the weighted_avg_overlap function below and you can use it in your experiments to see which one will give better results:
def weighted_avg_overlap(samples, log_probs, similarity_fn):
    # Scores dictionary
    scores = {}

    # Iterate over the samples
    for index_candidate, candidate in enumerate(samples):

        # Initialize overlap and weighted sum
        overlap, weight_sum = 0.0, 0.0

        # Iterate over all samples and log probabilities
        for index_sample, (sample, logp) in enumerate(zip(samples, log_probs)):

            # Skip if the candidate index is the same as the sample index
            if index_candidate == index_sample:
                continue

            # Convert log probability to linear scale
            sample_p = float(np.exp(logp))

            # Update the weighted sum
            weight_sum += sample_p

            # Get the unigram overlap between candidate and sample
            sample_overlap = similarity_fn(candidate, sample)

            # Update the overlap
            overlap += sample_p * sample_overlap

        # Compute the score for the candidate
        score = overlap / weight_sum

        # Only use 3 decimal points
        score = round(score, 3)

        # Save the score in the dictionary. use index as the key.
        scores[index_candidate] = score

    return scores
l1 = [1, 2, 3]
l2 = [1, 2, 4]
l3 = [1, 2, 4, 5]
log_probs = [0.4, 0.2, 0.5]
w_avg_ovlp = weighted_avg_overlap([l1, l2, l3], log_probs, jaccard_similarity)
print(f"weighted average overlap using Jaccard similarity is:\n\n{w_avg_ovlp}")
Output
weighted average overlap using Jaccard similarity is:
{0: 0.443, 1: 0.631, 2: 0.558}
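The first entry of this output can be checked by hand. The sketch below redoes the computation for l1 only: each comparison is weighted by the exponentiated log-probability of the other sample, so l3 (log-probability 0.5) counts more than l2 (log-probability 0.2). The variable names are made up for this illustration.

```python
import math

# Jaccard similarities between l1 = [1, 2, 3] and the other two lists
sim_l1_l2 = 2 / 4   # {1, 2} shared out of {1, 2, 3, 4}
sim_l1_l3 = 2 / 5   # {1, 2} shared out of {1, 2, 3, 4, 5}

# Weights are the exponentiated log-probabilities of the *other* samples
w2, w3 = math.exp(0.2), math.exp(0.5)

score_l1 = (w2 * sim_l1_l2 + w3 * sim_l1_l3) / (w2 + w3)
print(round(score_l1, 3))  # 0.443
```

The plain average for l1 was 0.45; the weighted version is slightly lower because the less similar sample l3 carries the larger weight.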
You will now put everything together in the mbr_decode function below. This final step is not graded as this function is just a wrapper around all the cool stuff you have coded so far!
You can use it to play around, trying different numbers of samples, temperatures and similarity functions!
def mbr_decode(model, text, n_samples=5, temperature=0.6, similarity_fn=jaccard_similarity):
    # Generate samples
    samples, log_probs = generate_samples(model, text, n_samples=n_samples, temperature=temperature)

    # Compute the overlap scores
    scores = weighted_avg_overlap(samples, log_probs, similarity_fn)

    # Decode samples
    decoded_translations = [tokens_to_text(s, id_to_word).numpy().decode('utf-8') for s in samples]

    # Find the key with the highest score
    max_score_key = max(scores, key=lambda k: scores[k])

    # Get the translation
    translation = decoded_translations[max_score_key]

    return translation, decoded_translations
english_sentence = "I love languages"
translation, candidates = mbr_decode(trained_translator, english_sentence, n_samples=10, temperature=0.6)
print("Translation candidates:")
for c in candidates:
    print(c)
print(f"\nSelected translation: {translation}")
Output
Translation candidates:
eu adoro idiomas ja .
eu eu amo idiomas e de vida .
eu adoro idiomas ja a espera .
eu amo idiomas a gente .
eu adoro idiomas em vida .
eu adoro idiomas ja .
eu adoro idiomas ja .
eu adoro idiomas ja .
eu adoro idiomas as vezes .
eu adoro idiomas ja
Selected translation: eu adoro idiomas ja .
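The selection step can be replayed without the model by scoring the candidate strings printed above against each other. The sketch below is a simplification made for these notes: it splits the sentences on whitespace instead of using token ids, and it uses the plain (unweighted) average of Jaccard similarities, since the log-probabilities of these particular samples were not printed.

```python
def jaccard_similarity(candidate, reference):
    a, b = set(candidate), set(reference)
    return len(a & b) / len(a | b)

# Candidate translations copied from the output above
candidates = [
    "eu adoro idiomas ja .",
    "eu eu amo idiomas e de vida .",
    "eu adoro idiomas ja a espera .",
    "eu amo idiomas a gente .",
    "eu adoro idiomas em vida .",
    "eu adoro idiomas ja .",
    "eu adoro idiomas ja .",
    "eu adoro idiomas ja .",
    "eu adoro idiomas as vezes .",
    "eu adoro idiomas ja",
]

tokenized = [c.split() for c in candidates]
scores = {}
for i, cand in enumerate(tokenized):
    others = [s for j, s in enumerate(tokenized) if j != i]
    scores[i] = sum(jaccard_similarity(cand, o) for o in others) / len(others)

best = max(scores, key=scores.get)
print(candidates[best])  # eu adoro idiomas ja .
```

Even without the weights, the sentence that was sampled four times wins, because it agrees perfectly with its own copies and shares most of its words with the remaining candidates.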
Congratulations! Next week, you’ll dive deeper into attention models and study the Transformer architecture. You will build another network but without the recurrent part. It will show that attention is all you need! It should be fun!
Keep up the good work!
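Finished the Week 1 lab at 21:29 on March 30, 2024; it took one day. This fourth course starts the study of attention, and next week covers the Transformer, which I am looking forward to.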