A Survey on Automatic Text Summarization

1. Definition of Automatic Text Summarization

Text summarization compresses the source text into a shorter version while preserving its information content and overall meaning.

1.1 Types of Automatic Summarization

Single-document and multi-document summarization.

1.2 Categories of Summarization Methods

Extractive and abstractive summarization.

2. Summarization Process and Extraction Features

Most current automated text summarization systems use extraction methods. The extractive summarization process can be divided into three phases.
The first phase is pre-processing and the second phase is processing.

2.1 Common Pre-processing Methods

(1) Part-of-Speech (POS) Tagging
(2) Stop Word Filtering
Words such as a, an, in, and by can be considered stop words and filtered out of the plain text.
(3) Stemming
Reducing words to their stems, for example by removing -ed or -ing from verbs or using the singular instead of the plural form of a noun.
(4) Feature Calculation
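
Before turning to the individual features, steps (2) and (3) above can be sketched in Python as follows. This is only a minimal illustration: the stop word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer, not the method used by any particular system.

```python
import re

# Illustrative stop word list (assumption); real systems use a much larger one.
STOP_WORDS = {"a", "an", "the", "in", "by", "of", "and", "is", "are", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip -ing, -ed, or a plural -s; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def preprocess(text):
    """Tokenize, filter stop words, then stem what remains."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]

print(preprocess("The dogs walked in a park by the river"))
```

A rule-based stripper like this over-stems and under-stems many words, so in practice an established stemming algorithm would usually be preferred.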

2.1.1. Title Similarity

2.1.2. Sentence Position

2.1.3. Term Weight (Term Frequency)

The total term weight is calculated by computing tf (term frequency) and idf (inverse document frequency) for the document.
The idf indicates whether a term is common or rare across all documents.
The importance score w_i of word i can then be calculated with the traditional tf-idf method.
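
As a rough illustration of the term weight, the sketch below computes one common tf-idf variant, w_i = tf_i * log(N / n_i). The exact formula used in the survey is not reproduced in these notes, so this variant, the function name `tf_idf`, and the toy documents are all assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: w_i = tf_i * log(N / n_i), where tf_i is the raw count
    of term i in the document, N is the number of documents, and n_i is the
    number of documents that contain term i.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term]) for term, count in tf.items()}
        )
    return weights

docs = [["text", "summary", "text"], ["text", "graph"]]
for w in tf_idf(docs):
    print(w)
```

With this variant a term that appears in every document gets weight zero, which matches the idea that common terms carry little information.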

2.1.4.Sentence Length

This feature is useful for eliminating sentences that are too short, such as datelines or author names.
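
One way to realize this feature is sketched below, assuming the score is the sentence's word count divided by the word count of the longest sentence. That normalization, the 0.3 threshold, and the function names are illustrative assumptions, not taken from the survey.

```python
def sentence_length_scores(sentences):
    """Score each tokenized sentence by its length relative to the longest one."""
    longest = max((len(s) for s in sentences), default=0)
    if longest == 0:
        return [0.0 for _ in sentences]
    return [len(s) / longest for s in sentences]

def drop_short_sentences(sentences, min_score=0.3):
    """Keep only sentences whose length score reaches min_score (hypothetical cutoff)."""
    scores = sentence_length_scores(sentences)
    return [s for s, score in zip(sentences, scores) if score >= min_score]

sentences = [
    ["jane", "doe"],  # an author line
    ["text", "summarization", "compresses", "a", "source",
     "document", "into", "a", "shorter", "version"],
]
print(sentence_length_scores(sentences))
print(drop_short_sentences(sentences))
```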

2.1.5. Thematic Word

This feature relates to domain-specific words: words that occur frequently in a document are probably related to its topic.
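
A minimal sketch of this idea follows. It assumes the thematic words are simply the `top_k` most frequent tokens in the (already pre-processed) document and that a sentence's score is the fraction of its tokens that are thematic. Both choices are assumptions for illustration.

```python
from collections import Counter

def thematic_words(sentences, top_k=5):
    """Return the top_k most frequent tokens across all tokenized sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, _ in counts.most_common(top_k)}

def thematic_scores(sentences, top_k=5):
    """Score each sentence by the fraction of its tokens that are thematic."""
    themes = thematic_words(sentences, top_k)
    return [
        sum(1 for word in s if word in themes) / len(s) if s else 0.0
        for s in sentences
    ]

sentences = [
    ["neural", "summary", "model"],
    ["neural", "summary"],
    ["neural", "weather"],
]
print(thematic_words(sentences, top_k=2))
print(thematic_scores(sentences, top_k=2))
```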

2.1.6. Proper Nouns

2.1.7.Sentence to Sentence Similarity

2.1.8. Numerical Data

3. SUMMARIZATION METHODS

3.1.Query Based and Generic Summarization

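In query-based text summarization, the sentences of a given document are scored based on frequency counts of words or phrases. Sentences that contain the query phrases receive higher scores than those that contain only single query words.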
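
The scoring rule above could be sketched as follows. The weights `phrase_weight` and `word_weight` are hypothetical values chosen only so that a full phrase match outranks single-word matches; the survey does not specify them.

```python
def query_score(sentence, query, phrase_weight=3.0, word_weight=1.0):
    """Score a tokenized sentence against a tokenized query.

    Each occurrence of a single query word adds word_weight; each occurrence
    of the whole query as a contiguous phrase adds phrase_weight on top.
    """
    query_words = set(query)
    word_hits = sum(1 for w in sentence if w in query_words)
    n = len(query)
    phrase_hits = 0
    if n > 1:
        phrase_hits = sum(
            1 for i in range(len(sentence) - n + 1) if sentence[i:i + n] == query
        )
    return word_weight * word_hits + phrase_weight * phrase_hits

query = ["text", "summarization"]
print(query_score(["automatic", "text", "summarization", "methods"], query))
print(query_score(["the", "text", "is", "long"], query))
```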

3.1.1. Bayesian Classifier

3.1.2. Hidden Markov Model

The HMM does not assume that the probability that sentence i is in the summary is independent of whether sentence i-1 is in the summary.
The main idea is to use a sequential model to account for local dependencies between sentences. In the HMM, three features were used:
- position of the sentence in the document,
- number of terms in the sentence,
- likelihood of the sentence terms given the document terms.
The maximum-likelihood estimate of each transition probability is then obtained, forming the estimate of the transition matrix.
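
The maximum-likelihood estimate of a transition matrix is the row-normalized count of observed transitions. The sketch below assumes a simplified two-state labeling (0 = sentence not in summary, 1 = sentence in summary) for labeled training documents; the actual state structure of the HMM is not described in these notes, so this is an illustration of the estimation step only.

```python
def estimate_transition_matrix(state_sequences, n_states):
    """Maximum-likelihood transition matrix from labeled state sequences.

    Entry [i][j] is count(i -> j) / count(i -> any state). Rows for states
    that are never left are set to all zeros.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in state_sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical training labels for two documents (one label per sentence).
sequences = [[0, 1, 1, 0], [0, 0, 1]]
for row in estimate_transition_matrix(sequences, n_states=2):
    print(row)
```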

3.1.3. Neural Networks Based Text Summarization

f1 = Paragraph follows title (Paragraph Position)
f2 = Paragraph location in document
f3 = Sentence location in paragraph
f4 = First sentence in paragraph
f5 = Sentence Length
f6 = Number of thematic words in sentence
f7 = Number of title words in sentence
The text summarization process consists of three phases: training, feature fusion, and sentence selection.

3.1.4. Fuzzy Logic Based Text Summarization

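The fuzzy logic method uses fuzzy rules and triangular membership functions. The fuzzy rules take an IF-THEN form. The triangular membership function fuzzifies each score into one of three values: LOW, MEDIUM, or HIGH.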
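
A triangular membership function and the LOW/MEDIUM/HIGH fuzzification can be sketched as below. The breakpoints in `FUZZY_SETS` are hypothetical and assume feature scores normalized to [0, 1]; the survey notes do not give the actual values or rules.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at or outside the feet a and c, 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical (a, b, c) breakpoints for scores normalized to [0, 1].
FUZZY_SETS = {
    "LOW": (-0.5, 0.0, 0.5),
    "MEDIUM": (0.0, 0.5, 1.0),
    "HIGH": (0.5, 1.0, 1.5),
}

def fuzzify(score):
    """Return the label whose membership degree is highest for this score."""
    memberships = {
        label: triangular(score, *abc) for label, abc in FUZZY_SETS.items()
    }
    return max(memberships, key=memberships.get)

for score in (0.1, 0.5, 0.9):
    print(score, fuzzify(score))
```

An IF-THEN rule would then combine such labels, for example treating a sentence as important when several of its features are HIGH; the concrete rule base is left out here because it is not given in the source.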

4. EVALUATION
