
Text Summarization Using Deep Neural Networks

Introduction

The amount of textual data produced every day is increasing rapidly, in terms of both volume and complexity. Social media, news articles, emails, text messages (the list goes on) generate massive amounts of information, and going through lengthy text material becomes cumbersome (and boring too!). Thankfully, with the advances in deep learning, we can build models that shorten long pieces of text and produce a crisp, coherent summary, saving time and conveying the key points effectively.


We can broadly classify text summarization into two types:


1. Extractive Summarization: This technique involves extracting important words and phrases from the input text. The underlying idea is to create a summary by selecting the most important words from the input.

2. Abstractive Summarization: This technique involves generating entirely new phrases that capture the meaning of the input text. The underlying idea is to put a strong emphasis on form, aiming to produce a grammatically correct summary, which requires advanced language modeling techniques.

In this article, we will use PyTorch to build a sequence-to-sequence (encoder-decoder) model with simple dot-product attention using GRUs, and examine its attention scores. We will also look at metrics such as BLEU and ROUGE for evaluating our model.

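To make the idea concrete, here is a minimal sketch of a GRU encoder-decoder with dot-product attention in PyTorch. This is not the exact model built in this article; the class names and sizes (Encoder, AttnDecoder, hidden_size) are illustrative placeholders, and the attention weights returned by the decoder are what we will later visualize as attention scores.

# Minimal sketch of a GRU encoder/decoder with dot-product attention (PyTorch).
# Class names and sizes are illustrative, not the article's final model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                       # src: (batch, src_len)
        embedded = self.embedding(src)            # (batch, src_len, hidden)
        outputs, hidden = self.gru(embedded)      # keep every step's output for attention
        return outputs, hidden

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size * 2, vocab_size)

    def forward(self, token, hidden, encoder_outputs):
        embedded = self.embedding(token)                              # (batch, 1, hidden)
        output, hidden = self.gru(embedded, hidden)                   # (batch, 1, hidden)
        # dot-product attention: score every encoder step against the decoder state
        scores = torch.bmm(output, encoder_outputs.transpose(1, 2))   # (batch, 1, src_len)
        attn_weights = F.softmax(scores, dim=-1)                      # the attention scores we can plot
        context = torch.bmm(attn_weights, encoder_outputs)            # (batch, 1, hidden)
        logits = self.out(torch.cat([output, context], dim=-1))       # (batch, 1, vocab)
        return logits, hidden, attn_weights

And a quick, hedged example of the two evaluation metrics mentioned above, using the nltk and rouge-score packages; the reference and candidate strings here are made up purely for illustration.

# Toy BLEU / ROUGE computation; requires the nltk and rouge-score packages.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "boil the pasta and drain it well".split()
candidate = "boil and drain the pasta".split()
smooth = SmoothingFunction().method1                       # avoids zero counts on short sentences
print("BLEU:", sentence_bleu([reference], candidate, smoothing_function=smooth))

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print("ROUGE:", scorer.score(" ".join(reference), " ".join(candidate)))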

Dataset used: We will work on the WikiHow dataset, which contains around 200,000 long-sequence pairs of articles and their headlines. It is one of the large-scale datasets available for summarization, with article lengths that vary considerably. The articles are quite diverse in writing style, which makes the summarization problem more challenging and interesting.


For more information on the dataset: https://arxiv.org/abs/1810.09305

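As a quick illustration of what the data looks like, the sketch below loads the dataset with pandas. The file name (wikihowAll.csv) and the column names (headline for the summary, text for the article body) are assumptions based on how the WikiHow dump is commonly distributed; adjust them to the copy you download.

# Quick look at the WikiHow data; file and column names are assumptions.
import pandas as pd

df = pd.read_csv("wikihowAll.csv")
df = df.dropna(subset=["headline", "text"])     # drop rows missing the article or the summary
print(df.shape)                                 # roughly 200,000 article/headline pairs
print(df[["headline", "text"]].head(2))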

Data Preprocessing

Pre-processing and cleaning is an important step, because building a model on unclean, messy data will in turn produce messy results. We will apply the following cleaning steps before feeding our data to the model:


1. Converting all text to lower case for further processing
2. Parsing HTML tags
3. Removing text between () and []
4. Contraction mapping: replacing shortened versions of words (e.g. "can't" is replaced with "cannot", and so on)
5. Removing apostrophes
6. Removing punctuation and special characters
7. Removing stop words using the nltk library
8. Retaining only long words, i.e. words with length > 3

We will first define the contractions in the form of a dictionary:


## Find the complete list of contractions on my Github Repo
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

contraction_mapping = {"ain't": "is not", "aren't": "are not"}   # truncated here; full mapping on the repo
stop_words = set(stopwords.words('english'))

def text_cleaner(text, num):                                     # num kept from the original signature
    newString = text.lower()                                     # 1. lower-case
    newString = BeautifulSoup(newString, "lxml").text            # 2. strip HTML tags
    newString = re.sub(r"[\(\[].*?[\)\]]", "", newString)        # 3. remove text between () and []
    newString = ' '.join([contraction_mapping[t] if t in contraction_mapping
                          else t for t in newString.split()])    # 4. expand contractions
    # the original snippet is cut off here; the remaining steps are reconstructed from the list above
    newString = newString.replace("'", "")                       # 5. remove apostrophes
    newString = re.sub("[^a-zA-Z]", " ", newString)              # 6. drop punctuation / special characters
    tokens = [w for w in newString.split() if w not in stop_words]  # 7. remove stop words
    long_words = [w for w in tokens if len(w) > 3]                # 8. keep only words longer than 3 characters
    return " ".join(long_words).strip()
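A minimal usage sketch, assuming the pandas DataFrame and column names from the loading example earlier: apply text_cleaner to the article bodies and the headlines separately.

# Hypothetical usage: clean the article body and the summary columns.
# The second argument mirrors the original text_cleaner(text, num) signature.
cleaned_text = [text_cleaner(t, 0) for t in df["text"]]
cleaned_summary = [text_cleaner(t, 1) for t in df["headline"]]
print(cleaned_text[0][:200])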