Going Big with NLP: Exploring Deep Learning for Natural Language Processing by Building a Headline Generator

Author: weixin_40725706 | 2024-05-22 07:23:15


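Introduction

Within deep learning, natural language processing (NLP) is an exciting and fast-moving field. It lets machines understand, interpret, and generate human language. In this post we build a simple headline generator to explore the basics of NLP and to see how a deep learning model handles sequence data.

Sequence Data and Natural Language

Unlike image data, language data is sequential: the order of the words is essential to the meaning of a sentence. Models designed for sequences, such as recurrent neural networks (RNNs), are commonly used for this kind of data.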
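To make the point about word order concrete, here is a minimal sketch in plain Python. The two sentences are made up for illustration. They contain exactly the same words, so an order-free "bag of words" view cannot tell them apart, while the sequence view can.

```python
from collections import Counter

# Two hypothetical sentences built from the same words in a different order.
sentence_a = "dog bites man"
sentence_b = "man bites dog"

tokens_a = sentence_a.split()
tokens_b = sentence_b.split()

# Bag-of-words view: only word counts, no order.
same_bag = Counter(tokens_a) == Counter(tokens_b)

# Sequence view: order is preserved.
same_sequence = tokens_a == tokens_b

print("same bag of words:", same_bag)
print("same sequence:", same_sequence)
```

A model that only sees word counts would treat both sentences as identical, which is why the rest of this post keeps every headline as an ordered list of tokens.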

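Objectives

By the end of this section you will be able to:

Prepare sequence data for a recurrent neural network (RNN).
Build and train a model to perform word prediction.

Building the Headline Generator

We will build a model that predicts a complete headline from a few starting words. The model is trained on article headlines from The New York Times.

Reading and Cleaning the Data

First we read the data from CSV files and store the headlines in a list. We also clean the data by filtering out any headline recorded as "Unknown".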

import os
import pandas as pd

nyt_dir = 'data/nyt_dataset/articles/'
all_headlines = []
for filename in os.listdir(nyt_dir):
    # Skip any file whose name does not contain 'Articles'
    if 'Articles' in filename:
        headlines_df = pd.read_csv(os.path.join(nyt_dir, filename))
        all_headlines.extend(list(headlines_df.headline.values))

# Clean the data: drop headlines recorded as 'Unknown'
all_headlines = [h for h in all_headlines if h != 'Unknown']
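The cleaning step can be tried without the real dataset. The sketch below uses a small in-memory CSV as a stand-in for one article file. The `articleID` column and all rows are hypothetical; only the `headline` column name and the `'Unknown'` placeholder come from the code above.

```python
import csv
import io

# Hypothetical CSV content standing in for one of the article files.
sample_csv = io.StringIO(
    "articleID,headline\n"
    "a1,Senate Votes on Budget\n"
    "a2,Unknown\n"
    "a3,A Quiet Morning in Brooklyn\n"
    "a4,Unknown\n"
)

rows = list(csv.DictReader(sample_csv))
raw_headlines = [row["headline"] for row in rows]

# Same filter as in the post: drop placeholder titles.
clean_headlines = [h for h in raw_headlines if h != "Unknown"]

print(len(raw_headlines), "->", len(clean_headlines))
print(clean_headlines)
```

Comparing the list length before and after the filter is a cheap sanity check that the placeholder really was present and really was removed.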

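Tokenizing and Creating Sequences

Next we use the Keras Tokenizer to turn the text into sequences of integers. Tokenization is the process of converting text into a numeric representation the model can work with.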

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_headlines)
# word_index starts at 1, so add 1 to leave index 0 for padding
total_words = len(tokenizer.word_index) + 1

# Create the sequences: every prefix of each headline with at least two tokens
input_sequences = []
for line in all_headlines:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        partial_sequence = token_list[:i+1]
        input_sequences.append(partial_sequence)
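To see what the prefix loop produces, here is a small sketch that does not need TensorFlow. It uses a simplified stand-in for the tokenizer (lowercase, split on whitespace, frequent words get small indices), which is an assumption made for illustration and not the Keras implementation. The two headlines are made up.

```python
from collections import Counter

# Made-up headlines used only for illustration.
headlines = ["the cat sat on the mat", "the dog sat"]

# Simplified stand-in for a tokenizer: lowercase, split on whitespace, and
# give more frequent words smaller indices. Index 0 is left free for padding.
counts = Counter(word for h in headlines for word in h.lower().split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
total_words = len(word_index) + 1

# For each headline, emit every prefix of length 2 or more.
sequences = []
for h in headlines:
    token_list = [word_index[w] for w in h.lower().split()]
    for i in range(1, len(token_list)):
        sequences.append(token_list[: i + 1])

for seq in sequences:
    print(seq)
```

A headline with n tokens yields n - 1 training sequences, so a six-word headline and a three-word headline together give seven examples. Each one will later be split into "all tokens but the last" as input and "the last token" as the target.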

Padding the Sequences

Because the sequences have different lengths, we use pad_sequences to pad them all to the same length.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad on the left so the last token of every sequence stays in the last column
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
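Pre-padding is easy to picture with a few lines of plain Python. The helper `pad_pre` below is a hypothetical illustration of what `padding='pre'` does to sequences no longer than the maximum length. It is not how Keras implements it.

```python
def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + list(seq) for seq in sequences]


# Toy token sequences of different lengths.
example = [[1, 3], [1, 3, 2], [1, 3, 2, 4]]
padded = pad_pre(example)

for row in padded:
    print(row)
```

Padding on the left rather than the right keeps the final token of every sequence in the last column, which is what makes the later slicing into inputs and targets a single column split.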

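Creating Predictors and Labels

We split each sequence into predictors and a label. The predictors are all the words in the sequence except the last one, and the label is the last word.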

predictors = input_sequences[:, :-1]
labels = input_sequences[:, -1]

# Convert the labels to one-hot encoding
from tensorflow.keras import utils
labels = utils.to_categorical(labels, num_classes=total_words)
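The split and the one-hot step can be sketched on a tiny padded matrix without NumPy. The values and the vocabulary size of 5 are made up, and `one_hot` is a hypothetical helper that mirrors the idea behind `to_categorical` for a single label.

```python
# Toy padded sequences (0 is the padding index).
padded = [
    [0, 0, 1, 3],
    [0, 1, 3, 2],
    [1, 3, 2, 4],
]
total_words = 5  # hypothetical: 4 real words plus index 0 for padding

# Everything except the last column is input, the last column is the target.
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]


def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position index."""
    vector = [0] * num_classes
    vector[index] = 1
    return vector


one_hot_labels = [one_hot(label, total_words) for label in labels]

print(predictors)
print(labels)
print(one_hot_labels)
```

Each one-hot row has one entry per vocabulary word, so with a real vocabulary the label matrix is wide and mostly zeros.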

Building the Model

We build a model made of an embedding layer, a long short-term memory (LSTM) layer, and an output layer.

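from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential

# The input is one token shorter than the padded sequences (the label is removed)
input_len = max_sequence_len - 1
model = Sequential()
# Map each word index to a 10-dimensional vector
model.add(Embedding(total_words, 10, input_length=input_len))
# 100 LSTM units summarize the sequence into a single vector
model.add(LSTM(100))
model.add(Dropout(0.1))
# One output probability per word in the vocabulary
model.add(Dense(total_words, activation='softmax'))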
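It can help to estimate how many trainable parameters this architecture has. The sketch below does the arithmetic by hand, assuming the standard LSTM parameterization with four gates and one bias vector per gate. The vocabulary size of 1000 is a made-up value, since the real one depends on the dataset.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


vocab_size = 1000  # hypothetical vocabulary size
embed_dim = 10     # from the model above
lstm_units = 100   # from the model above

emb = embedding_params(vocab_size, embed_dim)
lstm = lstm_params(embed_dim, lstm_units)
dense = dense_params(lstm_units, vocab_size)

# Dropout has no trainable parameters.
total = emb + lstm + dense
print("embedding:", emb)
print("lstm:", lstm)
print("dense:", dense)
print("total:", total)
```

The LSTM count does not depend on the vocabulary, while the embedding and output layers both grow linearly with it, so for large vocabularies the output layer tends to dominate.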

Compiling and Training the Model

We compile the model with the Adam optimizer and use categorical cross-entropy as the loss function.

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(predictors, labels, epochs=30, verbose=1)
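For readers who want to see what the loss measures, here is a minimal sketch of softmax followed by categorical cross-entropy on a single example, in plain Python. The three logits and the `eps` clipping value are made up, and this is not Keras's internal implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, probs))


logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-word vocabulary
probs = softmax(logits)
target = [1, 0, 0]        # the true next word is word 0

loss = categorical_crossentropy(target, probs)
print("probabilities:", [round(p, 3) for p in probs])
print("loss:", round(loss, 4))
```

Because the target is one-hot, only the probability of the correct word contributes to the loss. The loss shrinks toward 0 as that probability approaches 1.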

Making Predictions

Finally, we can use the trained model to predict new headlines.

import numpy as np

def predict_next_token(seed_text):
    # Encode the seed text and left-pad it to the length the model expects
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # predict_classes was removed in TensorFlow 2.6; take the argmax of the
    # predicted probabilities instead
    prediction = np.argmax(model.predict(token_list, verbose=0), axis=-1)
    return prediction

# Generate a new headline by appending one predicted word at a time
def generate_headline(seed_text, next_words=1):
    for _ in range(next_words):
        prediction = predict_next_token(seed_text)
        next_word = tokenizer.sequences_to_texts([prediction])[0]
        seed_text += " " + next_word
    return seed_text.title()

seed_texts = [
    'washington dc is',
    'today in new york',
    'the school district has',
    'crime has become'
]

for seed in seed_texts:
    print(generate_headline(seed, next_words=5))
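The generation loop itself does not depend on the model, so it can be tried with a stand-in predictor. In the sketch below, `toy_next_word` is a hypothetical lookup table playing the role of the trained network, and `toy_predict_next` and `toy_generate` are illustrative names, not part of the code above.

```python
# Hypothetical lookup table standing in for the trained model:
# for each last word, the single "most likely" next word.
toy_next_word = {
    "crime": "has",
    "has": "become",
    "become": "a",
    "a": "national",
    "national": "debate",
}


def toy_predict_next(seed_text):
    """Pick the next word based only on the last word of the seed."""
    last_word = seed_text.lower().split()[-1]
    return toy_next_word.get(last_word, "")


def toy_generate(seed_text, next_words=3):
    """Greedy generation: append the predicted word and feed the text back in."""
    for _ in range(next_words):
        next_word = toy_predict_next(seed_text)
        if not next_word:
            break  # nothing to predict for an unseen word
        seed_text += " " + next_word
    return seed_text.title()


print(toy_generate("crime has become", next_words=3))
```

Both this toy loop and the real one are greedy: they always take the single most likely word. That is simple, but it can lead to repetitive output, and sampling from the predicted distribution is a common alternative.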

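Conclusion

In this post we explored how deep learning can be applied to natural language data and built a simple headline generator. The model uses an RNN, specifically an LSTM layer, to process sequence data. The example is deliberately simple, but it shows the potential of deep learning in NLP. With further training and tuning, the model may produce more coherent and meaningful headlines.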

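Notice: This article was contributed by a community member and does not represent the position of wpsshop博客. Copyright belongs to the original author, and this site assumes no legal liability for the content. If you find infringing content, please contact us. When reposting, please cite the source: https://www.wpsshop.cn/w/weixin_40725706/article/detail/607136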