
Understanding LSTM Networks

Original post:

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Recurrent Neural Networks


Humans don’t start their thinking from scratch every second.


As you read this essay, you understand each word based on your understanding of previous words.


You don’t throw everything away and start thinking from scratch again.


Your thoughts have persistence.


Traditional neural networks can’t do this, and it seems like a major shortcoming.


For example, imagine you want to classify what kind of event is happening at every point in a movie. 


It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.


Recurrent neural networks address this issue.


They are networks with loops in them, allowing information to persist.


 Recurrent Neural Networks have loops.


In the above diagram, a chunk of neural network, A, looks at some input x_{t} and outputs a value h_{t}.


 A loop allows information to be passed from one step of the network to the next.


These loops make recurrent neural networks seem kind of mysterious.


However, if you think a bit more, it turns out that they aren’t all that different than a normal neural network. 


A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.


Consider what happens if we unroll the loop:


An unrolled recurrent neural network.

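To make the unrolling concrete, here is a minimal NumPy sketch (not code from the original post) of a vanilla RNN step applied to a short sequence; the weight names and sizes are illustrative, and the update h_{t} = tanh(W_{xh}·x_{t} + W_{hh}·h_{t-1} + b_{h}) is the standard vanilla-RNN formulation.

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
        # One copy of the module A: combine the current input with the
        # previous hidden state and squash through tanh.
        return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

    # Illustrative sizes: 4 time steps, 8-dimensional inputs, 16 hidden units.
    T, input_size, hidden_size = 4, 8, 16
    rng = np.random.default_rng(0)
    W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
    W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    b_h = np.zeros(hidden_size)

    xs = rng.normal(size=(T, input_size))   # the input sequence x_1 ... x_T
    h = np.zeros(hidden_size)               # initial hidden state

    # "Unrolling" the loop: the same weights are applied at every step,
    # and each step passes its hidden state on to its successor.
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
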
This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists.


They’re the natural architecture of neural network to use for such data.


And they certainly are used!


In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on.


I’ll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks.


But they really are pretty amazing.


Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version.


Almost all exciting results based on recurrent neural networks are achieved with them.


It’s these LSTMs that this essay will explore.


The Problem of Long-Term Dependencies


One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames might inform the understanding of the present frame.


If RNNs could do this, they’d be extremely useful. But can they? It depends.


Sometimes, we only need to look at recent information to perform the present task.


For example, consider a language model trying to predict the next word based on the previous ones.


If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky.


In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.


But there are also cases where we need more context.


Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.”


Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back.


It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.


Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.


Neural networks struggle with long term dependencies.

In theory, RNNs are absolutely capable of handling such “long-term dependencies.”


A human could carefully pick parameters for them to solve toy problems of this form.


Sadly, in practice, RNNs don’t seem to be able to learn them.


The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.


Thankfully, LSTMs don’t have this problem!


LSTM Networks


Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies.


They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work.


They work tremendously well on a large variety of problems, and are now widely used.


LSTMs are explicitly designed to avoid the long-term dependency problem.


Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!


All recurrent neural networks have the form of a chain of repeating modules of neural network.


In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.


The repeating module in a standard RNN contains a single layer.


LSTMs also have this chain-like structure, but the repeating module has a different structure.


Instead of having a single neural network layer, there are four, interacting in a very special way.


The repeating module in an LSTM contains four interacting layers.

Don’t worry about the details of what’s going on.


We’ll walk through the LSTM diagram step by step later.


For now, let’s just try to get comfortable with the notation we’ll be using.


In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others.


The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers.


Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.


The Core Idea Behind LSTMs


The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.


The cell state is kind of like a conveyor belt.


It runs straight down the entire chain, with only some minor linear interactions.


It’s very easy for information to just flow along it unchanged.


The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.


Gates are a way to optionally let information through.


They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.


The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through.


A value of zero means “let nothing through,” while a value of one means “let everything through!”


An LSTM has three of these gates, to protect and control the cell state.

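As a rough sketch of that mechanism (not code from the original post; names and shapes are illustrative), a gate is a learned sigmoid layer whose output multiplies, entry by entry, whatever it is guarding:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gate(W, b, h_prev, x_t, value):
        # The sigmoid layer produces a number in (0, 1) for each entry of
        # `value`: 0 means "let nothing through", 1 means "let everything through".
        g = sigmoid(W @ np.concatenate([h_prev, x_t]) + b)
        return g * value   # pointwise multiplication does the actual gating
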

Step-by-Step LSTM Walk Through


The first step in our LSTM is to decide what information we’re going to throw away from the cell state.


This decision is made by a sigmoid layer called the “forget gate layer.”


It looks at  h_{t-1} and x_{t}, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}.


A 1 represents “completely keep this” while a 0 represents “completely get rid of this.”

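In the standard formulation this layer computes f_{t} = σ(W_{f}·[h_{t-1}, x_{t}] + b_{f}). A minimal sketch (weight names are placeholders, not from the original post):

    import numpy as np

    def forget_gate(h_prev, x_t, W_f, b_f):
        # One number in (0, 1) per entry of the cell state C_{t-1}:
        # 1 = completely keep that entry, 0 = completely forget it.
        z = W_f @ np.concatenate([h_prev, x_t]) + b_f
        return 1.0 / (1.0 + np.exp(-z))   # sigmoid
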

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones.


In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used.


When we see a new subject, we want to forget the gender of the old subject.


The next step is to decide what new information we’re going to store in the cell state.


This has two parts.


First, a sigmoid layer called the “input gate layer” decides which values we’ll update.


Next, a tanh layer creates a vector of new candidate values, \widetilde{C}_{t}, that could be added to the state.


In the next step, we’ll combine these two to create an update to the state.


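In the standard formulation these two parts are i_{t} = σ(W_{i}·[h_{t-1}, x_{t}] + b_{i}) and \widetilde{C}_{t} = tanh(W_{C}·[h_{t-1}, x_{t}] + b_{C}); sketched with placeholder names:

    import numpy as np

    def input_gate_and_candidate(h_prev, x_t, W_i, b_i, W_C, b_C):
        hx = np.concatenate([h_prev, x_t])
        i_t = 1.0 / (1.0 + np.exp(-(W_i @ hx + b_i)))   # sigmoid: which entries to update
        C_tilde = np.tanh(W_C @ hx + b_C)                # candidate values to add
        return i_t, C_tilde
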
In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.


It’s now time to update the old cell state, C_{t-1}, into the new cell state C_{t}.


The previous steps already decided what to do, we just need to actually do it.


We multiply the old state by f_{t}, forgetting the things we decided to forget earlier.


Then we add i_{t}*\widetilde{C}_{t}.  


This is the new candidate values, scaled by how much we decided to update each state value.


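In symbols the update is C_{t} = f_{t} * C_{t-1} + i_{t} * \widetilde{C}_{t}, with * taken pointwise. A tiny numeric example (made-up values, just to show the arithmetic):

    import numpy as np

    C_prev  = np.array([0.5, -1.0, 2.0])   # old cell state C_{t-1}
    f_t     = np.array([1.0,  0.0, 1.0])   # forget gate: keep, drop, keep
    i_t     = np.array([0.0,  1.0, 0.5])   # input gate: how much of each candidate to add
    C_tilde = np.array([9.9,  0.3, 0.4])   # candidate values

    C_t = f_t * C_prev + i_t * C_tilde     # -> [0.5, 0.3, 2.2]
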
In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.


Finally, we need to decide what we’re going to output.


This output will be based on our cell state, but will be a filtered version.


First, we run a sigmoid layer which decides what parts of the cell state we’re going to output.


Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.


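In the standard formulation these two steps are o_{t} = σ(W_{o}·[h_{t-1}, x_{t}] + b_{o}) and h_{t} = o_{t} * tanh(C_{t}); a minimal sketch with placeholder names:

    import numpy as np

    def output_step(h_prev, x_t, C_t, W_o, b_o):
        o_t = 1.0 / (1.0 + np.exp(-(W_o @ np.concatenate([h_prev, x_t]) + b_o)))  # sigmoid gate
        h_t = o_t * np.tanh(C_t)   # emit only the filtered parts of the cell state
        return h_t
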
For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next.


For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.


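Putting the four layers together, one pass through the repeating module can be sketched as a single function. This is an illustrative NumPy rendering of the equations walked through above, not an excerpt from any particular library, and all weight names are placeholders; real implementations usually fuse the four weight matrices into one for speed.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
        hx = np.concatenate([h_prev, x_t])
        f_t = sigmoid(W_f @ hx + b_f)         # forget gate
        i_t = sigmoid(W_i @ hx + b_i)         # input gate
        C_tilde = np.tanh(W_C @ hx + b_C)     # candidate cell values
        C_t = f_t * C_prev + i_t * C_tilde    # new cell state
        o_t = sigmoid(W_o @ hx + b_o)         # output gate
        h_t = o_t * np.tanh(C_t)              # new hidden state / output
        return h_t, C_t
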
Variants on Long Short Term Memory


What I’ve described so far is a pretty normal LSTM.

But not all LSTMs are the same as the above.

In fact, it seems like almost every paper involving LSTMs uses a slightly different version.

The differences are minor, but it’s worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.”

This means that we let the gate layers look at the cell state.

The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.

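In that variant the gate layers receive the cell state as an extra input; for example, the forget gate becomes f_{t} = σ(W_{f}·[C_{t-1}, h_{t-1}, x_{t}] + b_{f}). A rough sketch of just that change (placeholder names, not from the original post):

    import numpy as np

    def forget_gate_with_peephole(C_prev, h_prev, x_t, W_f, b_f):
        # The only difference from the plain forget gate: the previous cell
        # state is concatenated into the gate's input as well.
        z = W_f @ np.concatenate([C_prev, h_prev, x_t]) + b_f
        return 1.0 / (1.0 + np.exp(-z))   # sigmoid
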
Another variation is to use coupled forget and input gates.

Instead of separately deciding what to forget and what we should add new information to, we make those decisions together.

We only forget when we’re going to input something in its place.

We only input new values to the state when we forget something older.

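In symbols, the separate input gate is replaced by 1 − f_{t}, so the cell-state update becomes C_{t} = f_{t} * C_{t-1} + (1 − f_{t}) * \widetilde{C}_{t}; as a small sketch:

    def coupled_update(f_t, C_prev, C_tilde):
        # Keep a fraction f_t of the old state and fill the forgotten
        # fraction (1 - f_t) with the new candidate values.
        return f_t * C_prev + (1.0 - f_t) * C_tilde
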
A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014).

It combines the forget and input gates into a single “update gate.”

It also merges the cell state and hidden state, and makes some other changes.

The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

A gated recurrent unit neural network.

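A rough NumPy sketch of one GRU step in the same illustrative style (placeholder weight names, following the standard formulation with an update gate z_{t} and a reset gate r_{t}); note there is no separate cell state, since h_{t} itself carries the memory:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x_t, h_prev, W_z, b_z, W_r, b_r, W_h, b_h):
        hx = np.concatenate([h_prev, x_t])
        z_t = sigmoid(W_z @ hx + b_z)   # update gate (merged forget/input gate)
        r_t = sigmoid(W_r @ hx + b_r)   # reset gate
        h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)  # candidate
        return (1.0 - z_t) * h_prev + z_t * h_tilde   # new hidden state
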
These are only a few of the most notable LSTM variants.

There are lots of others, like Depth Gated RNNs by Yao, et al. (2015).

There’s also some completely different approach to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same.


Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

Conclusion


Earlier, I mentioned the remarkable results people are achieving with RNNs.

Essentially all of these are achieved using LSTMs.

They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating.

Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs.

It’s natural to wonder: is there another big step?

A common opinion among researchers is: “Yes! There is a next step and it’s attention!”

The idea is to let every step of an RNN pick information to look at from some larger collection of information.

For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention!

There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn’t the only exciting thread in RNN research.

For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising.

Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting.

The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!
