Deep Learning --> NLP --> Deep contextualized word representations (ELMo)

Author: 人工智能uu | 2024-07-09 04:26:39

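Posts on this blog appear first on the WeChat official account "跟我一起读论文啦啦", which regularly shares high-quality papers on machine learning, deep learning, data mining, and natural language processing. You are welcome to follow it!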

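This post covers a paper published at NAACL 2018, where it was an outstanding paper. Paper link: ELMo. The paper proposes a new word representation method that outperforms earlier methods such as word2vec and GloVe.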

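Contributions of the paper

It captures more complex syntactic and semantic information.
It is trained as a language model (bidirectional LSTM), so it makes better use of context and gives better representations of polysemous words. (Earlier word representation methods such as word2vec, a simple three-layer linear model, may not handle long-range dependencies and cannot draw on distant context when producing a word vector, so they may not solve this problem well.)
The representation is very easy to integrate into existing models, and it significantly improves the state of the art on a variety of NLP tasks.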

Embeddings from Language Models (ELMo)

The word representation proposed in the paper is based on a language model.

Bidirectional language model

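[Figure: architecture of the bidirectional language model]
As the figure shows, the left side is the forward language model with $L$ layers and the right side is the backward network with $L$ layers. SF denotes the $softmax$ layer.

Here $x_i^{LM}$ is the vector of token $t_i$ after the $char\text{-}CNN$, and $\overrightarrow{h}^{LM}_{i,j}$ and $\overleftarrow{h}^{LM}_{i,j}$ are the hidden states of the $i$-th $LSTM$ cell in layer $j$ of the forward and backward $LSTM$ respectively.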

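The forward language model (predicting from left to right):
$p(t_1,t_2,\dots,t_n) = \prod_{k=1}^{n} p(t_k \mid t_1,t_2,\dots,t_{k-1})$
The backward language model (predicting from right to left):
$p(t_1,t_2,\dots,t_n) = \prod_{k=1}^{n} p(t_k \mid t_{k+1},t_{k+2},\dots,t_n)$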
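
To make the two factorizations concrete, the following minimal sketch lists, for each position $k$, which tokens each direction conditions on. The function name `lm_contexts` and the example sentence are illustrative assumptions, not something from the paper.

```python
def lm_contexts(tokens):
    """For each position k, return (target, forward context, backward context)."""
    rows = []
    for k, target in enumerate(tokens):
        forward_context = tokens[:k]         # t_1 .. t_{k-1}, used by the forward LM
        backward_context = tokens[k + 1:]    # t_{k+1} .. t_n, used by the backward LM
        rows.append((target, forward_context, backward_context))
    return rows


# Hypothetical example sentence
sentence = ["the", "bank", "is", "closed"]
for target, fwd, bwd in lm_contexts(sentence):
    print(f"{target!r}: forward LM sees {fwd}, backward LM sees {bwd}")
```

Each token is predicted once from its left context and once from its right context, which is why the two products above range over the same $k$.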

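Note that the forward and backward language models share some of their parameters (the token representation $\theta_x$ and the softmax $\theta_s$), so the two are not fully independent.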

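The training objective combines the forward and backward networks:
$\sum_{k=1}^{n}\left(\log p(t_k \mid t_1,t_2,\dots,t_{k-1}; \theta_x,\overrightarrow{\theta}_{LSTM}, \theta_s) + \log p(t_k \mid t_{k+1},t_{k+2},\dots,t_n; \theta_x,\overleftarrow{\theta}_{LSTM}, \theta_s)\right)$

Here $\theta_x$ denotes the token representation parameters, $\theta_s$ the $softmax$ parameters, and $\theta_{LSTM}$ the $LSTM$ parameters.
The objective is plain maximum likelihood estimation ($MLE$): the network is trained by maximizing this joint log-likelihood, that is, by minimizing its negative as the loss.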
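
As a small numeric illustration of this objective, the sketch below sums the forward and backward log-probabilities of each token and negates the result to get a loss. The per-token probabilities are made-up values, and `joint_log_likelihood` is a hypothetical helper name.

```python
import math


def joint_log_likelihood(forward_probs, backward_probs):
    """Sum over k of log p(t_k | left context) + log p(t_k | right context)."""
    if len(forward_probs) != len(backward_probs):
        raise ValueError("both directions must score the same tokens")
    return sum(math.log(pf) + math.log(pb)
               for pf, pb in zip(forward_probs, backward_probs))


# Made-up probabilities that each direction assigns to the correct token
forward_probs = [0.5, 0.25, 0.5]
backward_probs = [0.25, 0.5, 0.5]

log_likelihood = joint_log_likelihood(forward_probs, backward_probs)
loss = -log_likelihood  # minimizing this is the same as maximizing the likelihood
print(log_likelihood, loss)
```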

ELMo

For a trained bidirectional language model with $L$ layers, every $token$, for example the $k$-th token $t_k$, can be represented by a set of $2L + 1$ vectors:
$R_k = \{x_k^{LM},\overrightarrow{h}^{LM}_{k,j}, \overleftarrow{h}^{LM}_{k,j} \mid j = 1,\dots,L\}$

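So that downstream models can use this representation easily, the set of $2L + 1$ vectors has to be collapsed into a single vector:
${ELMo}_k=E(R_k;\theta_e)$
For example, we can use only the output of the last layer of the bidirectional language model:
$E(R_k) = h^{LM}_{k,L}$
where $h^{LM}_{k,L} = [\overrightarrow{h}^{LM}_{k,L}; \overleftarrow{h}^{LM}_{k,L}]$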

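The more general approach is:
${ELMo}_{k}^{task}=E(R_k;\theta^{task})=\gamma^{task}\sum_{j=0}^{L}s_j^{task}h_{k,j}^{LM}$

Here $h_{k,0}^{LM}$ is the token layer $x_k^{LM}$ and $h_{k,j}^{LM} = [\overrightarrow{h}^{LM}_{k,j}; \overleftarrow{h}^{LM}_{k,j}]$ for $j \geq 1$. $s_j^{task}$ are task-specific, $softmax$-normalized layer weights, and $\gamma^{task}$ is a scalar that lets the task model scale the whole ELMo vector; in the paper it is learned with the task model rather than tuned by hand. Combining the layers lets the representation capture different kinds of information.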
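
The weighted combination above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: every layer vector $h_{k,j}$ already has the same dimension, the raw weights and `gamma` are given rather than learned, and the names `softmax` and `elmo_vector` are hypothetical.

```python
import math


def softmax(raw):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(raw)
    exps = [math.exp(r - shift) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_vector(layers, raw_weights, gamma=1.0):
    """Combine L+1 layer vectors h_{k,0..L} of one token into a single vector."""
    if len(layers) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    s = softmax(raw_weights)  # s_j: softmax-normalized layer weights
    dim = len(layers[0])
    return [gamma * sum(s[j] * layers[j][d] for j in range(len(layers)))
            for d in range(dim)]


# Toy token with L = 2, so three layer vectors of dimension 2
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec = elmo_vector(layers, raw_weights=[0.0, 0.0, 0.0], gamma=2.0)
print(vec)
```

With equal raw weights the softmax gives each layer a weight of 1/3, so the result is simply `gamma` times the average of the layers.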

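Applying ELMo to supervised tasks

Concatenate the $ELMo$ vector with the ordinary word vector (for example, the word vector produced by the $Char\text{-}CNN$ in this paper), that is: $[x_k; {ELMo}_k^{task}]$.
Concatenate the $ELMo$ vector with the hidden state $h_k$ of the task model, that is: $[h_k; {ELMo}_k^{task}]$. The paper reports that for some tasks this additional concatenation improves results further. The paper also notes that different layers of the network capture different aspects of a word. For example, higher-level LSTM states capture context-dependent semantic information and suit supervised word sense disambiguation, while lower-level states capture syntactic information and suit part-of-speech tagging. Concatenating with states from different layers may therefore act somewhat like an ensemble, which could be why it works better.
Regularize the $ELMo$ weights by adding $\lambda ||w||_2^2$ to the model's $loss$ function, where $w$ are the layer weights used to combine the layers of the bidirectional $LSTM$.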
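
The concatenation and the regularization term described above are simple enough to show directly. The sketch below uses made-up vectors, a made-up task loss, and hypothetical helper names (`concat_input`, `l2_penalty`); it illustrates only the arithmetic, not any particular task model.

```python
def concat_input(x_k, elmo_k):
    """Build the enhanced representation [x_k; ELMo_k] for one token."""
    return list(x_k) + list(elmo_k)


def l2_penalty(weights, lam):
    """Return lambda * ||w||_2^2 for a flat list of weights."""
    return lam * sum(w * w for w in weights)


# Made-up vectors for one token
x_k = [0.1, 0.2, 0.3]
elmo_k = [0.5, -0.5]
enhanced = concat_input(x_k, elmo_k)

# Made-up task loss and weights; lam = 0.001 matches the value used in the experiments
task_loss = 0.8
total_loss = task_loss + l2_penalty([1.0, -2.0, 0.5], lam=0.001)
print(enhanced, total_loss)
```

A larger $\lambda$ pushes the weights toward zero more strongly, so the penalty trades off against the task loss.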

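Experimental analysis

The experiments in the paper show that adding $ELMo$ word vectors gives strong results on a range of NLP tasks. Below are a few parts that I personally found most interesting.

Using all layers, using only the last layer, and the effect of regularization

[Figure: comparison of Baseline, last only and All layers]
$Baseline$ uses no $ELMo$ vectors, $last\ only$ uses only the last layer of the bidirectional language model, and $All\ layers$ uses the combination of all layers. With regularization $\lambda = 0.001$ and all layers combined, results improve substantially, which suggests that combining the layers is effective.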

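Comparing ELMo vectors at the input and at the output of the model

[Figure: comparison of adding ELMo at the input and at the output of the model]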

Key code

Building a bidirectional language model with residual connections

    import tensorflow as tf  # TensorFlow 1.x API

    with tf.variable_scope("elmo_rnn_cell"):
        self.forward_cell = tf.nn.rnn_cell.LSTMCell(self.hidden_size, reuse=tf.AUTO_REUSE)
        self.backward_cell = tf.nn.rnn_cell.LSTMCell(self.hidden_size, reuse=tf.AUTO_REUSE)

    if config.get("use_skip_connection"):  # residual connection
        self.forward_cell = tf.nn.rnn_cell.ResidualWrapper(self.forward_cell)
        self.backward_cell = tf.nn.rnn_cell.ResidualWrapper(self.backward_cell)

    # Output softmax parameters (theta_s in the objective above), one set per direction.
    # These are not the layer-mixing weights s_j^{task}.
    with tf.variable_scope("elmo_softmax"):
        softmax_weight_shape = [config["word_vocab_size"], config["elmo_hidden"]]

        self.forward_softmax_w = tf.get_variable("forward_softmax_w", softmax_weight_shape, dtype=tf.float32)
        self.backward_softmax_w = tf.get_variable("backward_softmax_w", softmax_weight_shape, dtype=tf.float32)

        self.forward_softmax_b = tf.get_variable("forward_softmax_b", [config["word_vocab_size"]])
        self.backward_softmax_b = tf.get_variable("backward_softmax_b", [config["word_vocab_size"]])

def forward(self, data):
        # Run the data through the char-CNN to get ordinary (context-independent) word vectors
        embedding_output = self.embedding.forward(data)
        with tf.variable_scope("elmo_rnn_forward"):
            forward_outputs, forward_states = tf.nn.dynamic_rnn(self.forward_cell,
                                                                inputs=embedding_output,
                                                                sequence_length=data["input_len"],
                                                                dtype=tf.float32)

        # Note: as written, the backward cell reads the same left-to-right input as the
        # forward cell; a true backward language model would read the reversed sequence.
        with tf.variable_scope("elmo_rnn_backward"):
            backward_outputs, backward_states = tf.nn.dynamic_rnn(self.backward_cell,
                                                                  inputs=embedding_output,
                                                                  sequence_length=data["input_len"],
                                                                  dtype=tf.float32)

        # Project the hidden states of each direction onto the vocabulary (softmax logits)
        forward_projection = tf.matmul(forward_outputs, tf.expand_dims(tf.transpose(self.forward_softmax_w), 0))
        forward_projection = tf.nn.bias_add(forward_projection, self.forward_softmax_b)

        backward_projection = tf.matmul(backward_outputs, tf.expand_dims(tf.transpose(self.backward_softmax_w), 0))
        backward_projection = tf.nn.bias_add(backward_projection, self.backward_softmax_b)

        return forward_outputs, backward_outputs, forward_projection, backward_projection

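The code above defines only a single LSTM layer per direction, but the residual connection adds the layer's input to its output. In a loose sense this combines the representation entering the layer with the hidden state it produces.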
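
What the residual wrapper does can be shown without TensorFlow. In the sketch below, `toy_cell` is a made-up stand-in for one LSTM step and `residual` is a hypothetical helper; the point is only that the wrapped output equals the cell output plus the input, which requires both to have the same size.

```python
def residual(cell):
    """Wrap a cell so that its input is added element-wise to its output."""
    def wrapped(x):
        y = cell(x)
        if len(y) != len(x):
            raise ValueError("residual connection needs equal input and output sizes")
        return [yi + xi for yi, xi in zip(y, x)]
    return wrapped


def toy_cell(x):
    """Made-up stand-in for an LSTM step: halves every input value."""
    return [0.5 * v for v in x]


out = residual(toy_cell)([2.0, -4.0])
print(out)  # [3.0, -6.0]
```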

Model training

def train(self, data, global_step_variable=None):
        forward_output, backward_output, forward_projection, backward_projection = self.forward(data)

        # Note: data["target"] is just the input shifted by one step
        forward_target = data["target"]
        # Predict from the vocabulary projection (logits), not from the raw hidden states
        forward_pred = tf.cast(tf.argmax(forward_projection, -1), tf.int32)
        forward_correct = tf.equal(forward_pred, forward_target)
        forward_padding = tf.sequence_mask(data["target_len"], maxlen=self.seq_len, dtype=tf.float32)

        forward_softmax_target = tf.cast(tf.reshape(forward_target, [-1, 1]), tf.int64)
        forward_softmax_input = tf.reshape(forward_output, [-1, self.hidden_size])
        forward_train_loss = tf.nn.sampled_softmax_loss(
            weights=self.forward_softmax_w, biases=self.forward_softmax_b,
            labels=forward_softmax_target, inputs=forward_softmax_input,
            num_sampled=self.config["softmax_sample_size"],
            num_classes=self.config["word_vocab_size"]
        )

        # Zero out the loss at padded positions before averaging
        forward_train_loss = tf.reshape(forward_train_loss, [-1, self.seq_len])
        forward_train_loss = tf.multiply(forward_train_loss, forward_padding)
        forward_train_loss = tf.reduce_mean(forward_train_loss)
        # The backward model predicts from back to front, so its target is reversed
        backward_target = tf.reverse_sequence(data["target"], data["target_len"], seq_axis=1, batch_axis=0)
        backward_pred = tf.cast(tf.argmax(backward_projection, -1), tf.int32)
        backward_correct = tf.equal(backward_pred, backward_target)
        backward_padding = tf.sequence_mask(data["target_len"], maxlen=self.seq_len, dtype=tf.float32)

        backward_softmax_target = tf.cast(tf.reshape(backward_target, [-1, 1]), tf.int64)
        backward_softmax_input = tf.reshape(backward_output, [-1, self.hidden_size])
        backward_train_loss = tf.nn.sampled_softmax_loss(
            weights=self.backward_softmax_w, biases=self.backward_softmax_b,
            labels=backward_softmax_target, inputs=backward_softmax_input,
            num_sampled=self.config["softmax_sample_size"],
            num_classes=self.config["word_vocab_size"]
        )

        backward_train_loss = tf.reshape(backward_train_loss, [-1, self.seq_len])
        backward_train_loss = tf.multiply(backward_train_loss, backward_padding)
        backward_train_loss = tf.reduce_mean(backward_train_loss)

        train_loss = forward_train_loss + backward_train_loss
        train_correct = tf.concat([forward_correct, backward_correct], axis=-1)
        train_acc = tf.reduce_mean(tf.cast(train_correct, tf.float32))

        tf.summary.scalar("train_acc", train_acc)
        tf.summary.scalar("train_loss", train_loss)

        train_ops = tf.train.AdamOptimizer().minimize(train_loss)
        return train_loss, train_acc, train_ops
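
Two steps in the training code are easy to misread: reversing only the real part of a padded sequence, and averaging a masked loss. The plain-Python sketch below mimics both on made-up data with hypothetical helper names. Note that the training code multiplies by the mask and then takes `tf.reduce_mean` over all positions, so padded positions still count in the denominator; `masked_mean` below instead divides by the number of real tokens, which is a common alternative and not what the code above does.

```python
def reverse_by_length(seq, length):
    """Reverse the first `length` items and leave the padding where it is."""
    return seq[:length][::-1] + seq[length:]


def masked_mean(losses, lengths):
    """Average per-token losses over real (non-padded) positions only."""
    total, count = 0.0, 0
    for row, length in zip(losses, lengths):
        total += sum(row[:length])
        count += length
    return total / count if count else 0.0


PAD = 0
reversed_target = reverse_by_length([5, 6, 7, PAD, PAD], 3)
print(reversed_target)  # [7, 6, 5, 0, 0]

# Made-up per-token losses; the 9.0 entries sit on padded positions
losses = [[1.0, 2.0, 3.0, 9.0],
          [4.0, 9.0, 9.0, 9.0]]
print(masked_mean(losses, lengths=[3, 1]))  # 2.5
```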


Obtaining ELMo word vectors

def pred(self, data):
        _, _, forward_projection, backward_projection = self.forward(data)
        eval_output = tf.concat([forward_projection, backward_projection], axis=-1)
        return eval_output

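Here $forward\_projection$ and $backward\_projection$ are the outputs of the forward and backward networks, and concatenating them gives what this implementation treats as a form of $ELMo$ word vector. Note that both are vocabulary-sized softmax projections; in the paper the $ELMo$ vector is built from the hidden states of the layers, which in this code correspond to $forward\_outputs$ and $backward\_outputs$.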
