- Speech Recognition
- Machine Translation
- Speech Translation (for languages without a written form)
- Many NLP problems can be solved by seq2seq
- Syntactic Parsing
  the output is a tree, which can be flattened into a sequence, so parsing can also be treated as a seq2seq problem
- Multi-label classification: an object can belong to multiple classes, and seq2seq lets the model itself decide how many class labels to output
1. The block in the network:
(1) residual: add the block's input to the processed data
(2) norm: layer normalization, $x'_i = \frac{x_i - m}{\sigma}$
The final block architecture combines these two steps, as sketched below:
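A minimal NumPy sketch of this add & norm step, assuming a generic `sublayer` callable (self-attention or the FC layer) that maps vectors to vectors; all names here are illustrative, not the lecture's code:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # x'_i = (x_i - m) / sigma, with mean m and std sigma taken over each vector
    m = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - m) / (sigma + eps)

def add_and_norm(x, sublayer):
    # residual connection: add the input x to the sublayer's output, then normalize
    return layer_norm(x + sublayer(x))
```

One encoder block then chains `add_and_norm` once around self-attention and once around the FC layer.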
- Autoregressive (AT)
For example, consider a speech recognition problem:
The encoder's input is the speech waveform, and its output is a sequence of corresponding vectors.
The decoder's inputs are the encoder's output and one-hot token vectors. First, a one-hot BEGIN symbol is fed into the decoder; the decoder's output then passes through a softmax to give a distribution over the vocabulary, from which a token (e.g. 机) is obtained. That token (机) then becomes the next decoder input, and the process continues.
The decoder differs from the encoder in that its initial input first passes through a masked self-attention layer. It differs from ordinary self-attention in that each vector can only attend to the vectors before it, rather than to all vectors before and after it. As the speech recognition example above shows, the decoder's inputs arrive one vector at a time, so each vector cannot yet see the later inputs and therefore cannot attend to them.
Masked Self-attention
As the figure above shows, the inputs arrive one by one, so we cannot feed all the $a^i$ in at once.
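A small NumPy sketch of the masking, assuming pre-computed query/key/value matrices Q, K, V of shape (seq_len, d); this is an illustrative reconstruction, not the lecture's code:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (seq_len, seq_len) attention scores
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                              # position i may only see positions j <= i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V
```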
Another issue: in seq2seq the input and output lengths differ, so how do we decide when the output should stop? In practice we prepare a special end symbol (END); when the decoder outputs END, generation stops, as in the sketch below.
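A hedged sketch of the resulting autoregressive loop, assuming a `decoder` callable that maps (encoder output, tokens so far) to a probability vector over the vocabulary; the BEGIN/END ids are made-up placeholders:

```python
BEGIN, END = 0, 1  # illustrative token ids

def autoregressive_decode(decoder, encoder_out, max_len=50):
    tokens = [BEGIN]
    while len(tokens) < max_len:
        probs = decoder(encoder_out, tokens)  # distribution for the next token
        next_token = int(probs.argmax())
        if next_token == END:                 # special end symbol: stop generating
            break
        tokens.append(next_token)             # feed the new token back in
    return tokens[1:]                         # drop the BEGIN symbol
```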
- Non-autoregressive (NAT)
For NAT, the decoder's input becomes a whole row of BEGIN tokens fed in at once, rather than a token-by-token sequence.
- We do not know the output length:
  (1) train another predictor for the output length
  (2) output a long sequence that includes END, and ignore the tokens after END
- Advantages (see the sketch after this list):
  (1) parallel decoding
  (2) controllable output length
- NAT is usually worse than AT
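For contrast with the autoregressive loop above, a hedged sketch of NAT decoding under the same assumed decoder-style interface: the whole row of BEGIN tokens goes in at once and every position is predicted in one parallel pass:

```python
import numpy as np

BEGIN, END = 0, 1  # illustrative token ids

def nat_decode(nat_decoder, encoder_out, length):
    begins = np.full(length, BEGIN)           # a whole row of BEGIN tokens
    probs = nat_decoder(encoder_out, begins)  # (length, vocab) in a single pass
    tokens = probs.argmax(axis=-1).tolist()   # all positions decoded in parallel
    if END in tokens:                         # strategy (2): ignore tokens after END
        tokens = tokens[:tokens.index(END)]
    return tokens
```

Here `length` would come from strategy (1)'s length predictor, or simply be a fixed upper bound.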
3. Encoder-Decoder
Next comes the connection between the encoder and the decoder:
all of the encoder's outputs, together with one decoder input, are fed into the cross-attention layer
1. Cross attention
This is the cross-attention operation. From the figure above, the inputs of the middle attention layer come from both the encoder's output and the decoder's input. For example, in the figure below, the decoder first takes BEGIN as input and transforms it into a query $q$; then $q$ is matched against the encoder outputs in a cross-attention step, producing attention weights $\alpha'_1, \alpha'_2, \alpha'_3$ whose weighted sum gives a new output $v$, which is what we finally feed into the FC layer. The next token (机) then enters the decoder as input, again undergoes cross attention with the encoder outputs on the left, and so on until we obtain the decoder's final output.
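A minimal NumPy sketch of one such cross-attention step, assuming the encoder outputs are stacked as rows of `enc_out` and that projection matrices Wq, Wk, Wv are given; names are illustrative:

```python
import numpy as np

def cross_attention(dec_in, enc_out, Wq, Wk, Wv):
    q = dec_in @ Wq                        # query from the decoder input (e.g. BEGIN)
    K = enc_out @ Wk                       # keys from the encoder outputs
    V = enc_out @ Wv                       # values from the encoder outputs
    scores = K @ q / np.sqrt(q.shape[-1])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                   # attention weights alpha'_1 ... alpha'_n
    return alpha @ V                       # weighted sum v, the input to the FC layer
```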
- Teacher Forcing
During training, the decoder's input at each step is the ground-truth answer, and its target output is also the ground-truth answer; this scheme is called teacher forcing.
Then, during testing, the decoder's first input is the BEGIN symbol.
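A hedged sketch of one teacher-forcing training step, assuming the same `decoder` interface as in the sketches above plus a `cross_entropy(probs, gold)` placeholder:

```python
BEGIN, END = 0, 1  # illustrative token ids

def teacher_forcing_loss(decoder, encoder_out, target, cross_entropy):
    loss, prefix = 0.0, [BEGIN]
    for gold in target + [END]:
        probs = decoder(encoder_out, prefix)  # predict the next token
        loss += cross_entropy(probs, gold)    # compare with the correct answer
        prefix.append(gold)                   # feed in the ground truth, not the prediction
    return loss / (len(target) + 1)
```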
- Tips:
(1) chat-bot: copy some words from the question into the answer
(2) guided attention
ex: speech recognition
the attention should move monotonically from left to right, so at the beginning we should focus on the first word
solution: monotonic attention, location-aware attention
(3) Beam Search
greedy decoding: at every step pick the single highest-probability token, which may miss a better overall path
solution: use beam search, which keeps the best k partial paths at every step (see the sketch below)
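A hedged sketch of beam search, assuming a `decoder` that returns log-probabilities for the next token; names and ids are illustrative:

```python
import numpy as np

BEGIN, END = 0, 1  # illustrative token ids

def beam_search(decoder, encoder_out, k=3, max_len=50):
    beams = [([BEGIN], 0.0)]                        # (tokens, total log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == END:                   # finished paths are kept as-is
                candidates.append((tokens, score))
                continue
            log_probs = decoder(encoder_out, tokens)
            for tok in np.argsort(log_probs)[-k:]:  # top-k extensions of this path
                candidates.append((tokens + [int(tok)], score + log_probs[tok]))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
        if all(t[-1] == END for t, _ in beams):
            break
    return beams[0][0][1:]                          # best path, BEGIN dropped
```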
(4) optimizing evaluation metrics
In the training process we use cross entropy, but in the end we use the BLEU score to evaluate the result.
BLEU score: compare the output with the correct answer; however, it is not differentiable.
solution: if we want to optimize the BLEU score directly, use an RL method and treat BLEU as the reward.
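To make the non-differentiability concrete, a simplified sketch of the idea behind BLEU (real BLEU clips counts over several n-gram orders and adds a brevity penalty; this shows only the n-gram precision core):

```python
from collections import Counter

def ngram_precision(output, reference, n=2):
    # count how many n-grams of the output also appear in the reference;
    # this discrete match counting is why the score is not differentiable
    out_ngrams = Counter(tuple(output[i:i + n]) for i in range(len(output) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(c, ref_ngrams[g]) for g, c in out_ngrams.items())
    return matched / max(sum(out_ngrams.values()), 1)
```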
(5) Exposure bias
For the decoder, every training input is the ground truth; at test time, once one output token is wrong, the following outputs may all go wrong, since the decoder has never seen its own mistakes during training.