How to Build an NLP Model
INSIDE AI NLP365
Project #NLP365 (+1) is where I document my NLP learning journey every single day in 2020. Feel free to check out what I have been learning over the last 262 days here. At the end of this article, you can find previous paper summaries grouped by NLP area :)
Today’s NLP paper is A Simple Theoretical Model of Importance for Summarization. Below are the key takeaways of the research paper.
Objective and Contribution
Proposed a simple theoretical model to capture information importance in summarisation. The model captures redundancy, relevance, and informativeness, all three of which contribute to information importance in summarisation. We showcase how this framework can be used to guide and improve summarisation systems. The contributions are as follows:
- Define three key concepts in summarisation: redundancy, relevance, and informativeness
- Formulate the Importance concept using these three key concepts and show how to interpret the results
- Show that the theoretical model of importance for summarisation correlates well with human summarisation, making it useful for guiding future empirical work
The Overall Framework
A semantic unit is considered a small piece of information, and Ω denotes the set of all possible semantic units. A text input X is considered to be made up of many semantic units and so can be represented by a probability distribution P_X over Ω. P_X can simply be the frequency distribution of semantic units in the overall text. P_X(w_i) can be interpreted as the probability that the semantic unit w_i appears in text X, or as the contribution of w_i to the overall meaning of text X.
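As a toy illustration, here is a minimal sketch of how such a distribution could be computed, assuming words are the semantic units (the framework itself leaves the choice of unit open; the function name is mine, not the paper's):

```python
from collections import Counter

def unit_distribution(text):
    """Approximate P_X: the frequency distribution of semantic units in text X.
    Words are assumed to be the semantic units here."""
    units = text.lower().split()
    counts = Counter(units)
    total = sum(counts.values())
    return {unit: count / total for unit, count in counts.items()}

# P_X over a toy document: "the" gets probability 2/6, every other word 1/6
p_x = unit_distribution("the cat sat on the mat")
```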
Redundancy
The level of information presented in a summary S is measured by its entropy H(S).
Entropy measures coverage: H(S) is maximised when every semantic unit in the summary appears only once. Redundancy is then defined relative to this maximum; both quantities are sketched below.
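Based on the paper's definitions, the entropy of the summary distribution and the redundancy measure can be written as follows (H_max = log |Ω| is the maximum achievable entropy; since it is constant, Red(S) is often simplified to -H(S)):

```latex
H(S) = -\sum_{w_i \in \Omega} \mathbb{P}_S(w_i) \log \mathbb{P}_S(w_i)
\qquad
Red(S) = H_{max} - H(S)
```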
Relevance
A relevant summary should be one that closely approximates the original text. In other words, a relevant summary should have the minimum loss of information. To measure relevance, we compare the probability distributions of the source document and the summary using cross-entropy.
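Following the paper's definitions, relevance is the negative cross-entropy between the summary distribution and the source document distribution:

```latex
Rel(S, D) = -CE(S, D) = \sum_{w_i \in \Omega} \mathbb{P}_S(w_i) \log \mathbb{P}_D(w_i)
```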
The formula can be seen as the average surprise of producing summary S when expecting source document D. A summary S with low cross-entropy (and therefore low surprise) implies low uncertainty about what the original document was. This is only possible if P_S is similar to P_D.
KL divergence measures the loss of information when using source document D to generate summary S. The summary that minimises the KL divergence minimises redundancy and maximises relevance, as it is the least biased (least redundant) summary matching D. The KL divergence thus connects redundancy and relevance.
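Concretely, the connection follows from the standard identity relating KL divergence, cross-entropy, and entropy:

```latex
KL(S \| D) = CE(S, D) - H(S)
```

Minimising KL(S‖D) therefore simultaneously maximises H(S) (low redundancy) and minimises CE(S, D) (high relevance).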
Informativeness
Informativeness introduces background knowledge K to capture the use of prior knowledge in summarisation. K is represented by a distribution P_K over all semantic units. The amount of new information in summary S is measured by the cross-entropy between the summary and the background knowledge.
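Following the same construction as relevance, informativeness can be written as:

```latex
Inf(S, K) = CE(S, K) = -\sum_{w_i \in \Omega} \mathbb{P}_S(w_i) \log \mathbb{P}_K(w_i)
```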
The cross-entropy for relevance should be low because we want the summary to be as similar and relevant to the source document as possible, whereas the cross-entropy for informativeness should be high because it measures the amount of background knowledge used to generate the summary. Introducing background knowledge allows us to customise the model depending on what kind of knowledge we want to include, whether that is domain-specific knowledge, user-specific knowledge, or general knowledge. It also introduces the notion of update summarisation. Update summarisation involves summarising source document D having already seen document/summary U. Document/summary U can be modelled by background knowledge K, which makes U previous knowledge.
Importance
Importance is the metric that guides what information should be included in the summary. Given a user with knowledge K, the summary should be generated with the objective of bringing the most new information to the user. Therefore, for each semantic unit w_i, we need a function of its probability in the source document D (d_i = P_D(w_i)) and in the background knowledge K (k_i = P_K(w_i)) to determine its importance. The function has four requirements (a form satisfying them is sketched after the list):
Informativeness. If two semantic units are equally important in the source document, we prefer the more informative one; this is governed by the background knowledge.
Relevance. If two semantic units are equally informative, we prefer the one that is more important in the source document.
Additivity. A consistency constraint that allows information measures to be added together.
Normalisation. Ensures that the function defines a valid probability distribution.
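In the paper, a family of functions satisfying these requirements defines the target distribution below, with α, β > 0 controlling the relative strength of relevance and informativeness and C a normalising constant (the exact notation here is a reconstruction from the paper):

```latex
\mathbb{P}_{\frac{D}{K}}(w_i) = \frac{1}{C} \cdot \frac{d_i^{\alpha}}{k_i^{\beta}},
\qquad
C = \sum_{i} \frac{d_i^{\alpha}}{k_i^{\beta}}
```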
Summary scoring function
P_{D/K} encodes the relative importance of semantic units, the trade-off between relevance and informativeness. For example, if a semantic unit is important in the source document but unknown in the background knowledge, then P_{D/K} is very high for that unit: it is very desirable to include it in the summary, as it corresponds to a large knowledge gap. This is illustrated in the figure below. The best summary is the one that is non-redundant and best approximates P_{D/K}.
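Following the paper's definitions, this objective can be written as a scoring function θ_I, with the best summary maximising it (a reconstruction of the notation):

```latex
\theta_I(S, D, K) = -KL\!\left(\mathbb{P}_S \,\middle\|\, \mathbb{P}_{\frac{D}{K}}\right),
\qquad
S^{*} = \operatorname*{arg\,max}_{S} \; \theta_I(S, D, K)
```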
Summarisability
We can use the entropy of P_{D/K}, denoted H_{D/K}, to measure how many good summaries can be extracted from the distribution.
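In symbols, the summarisability measure is simply the entropy of the target distribution:

```latex
H_{\frac{D}{K}} = H\!\left(\mathbb{P}_{\frac{D}{K}}\right)
= -\sum_{w_i \in \Omega} \mathbb{P}_{\frac{D}{K}}(w_i) \log \mathbb{P}_{\frac{D}{K}}(w_i)
```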
If H_{D/K} is high, then there are many similar good summaries that can be generated from the distribution. Conversely, if it is low, there are only a few good summaries. The summary scoring function itself can also be expressed in another way.
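Expanding the KL divergence against P_{D/K} and dropping constant terms gives the following decomposition (my rearrangement, consistent with the definitions above):

```latex
\theta_I(S, D, K) \equiv -Red(S) + \alpha \cdot Rel(S, D) + \beta \cdot Inf(S, K)
```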
Maximising θ_I is equivalent to maximising relevance and informativeness while minimising redundancy, which is exactly what we want in a high-quality summary. α represents the strength of the Relevance component and β represents the strength of the Informativeness component. This means that H(S), CE(S, D), and CE(S, K) are three independent factors that affect the Importance concept.
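To make the pieces concrete, here is a small sketch that computes Red, Rel, Inf, and the combined score for word-level semantic units with simple smoothing; the function names and the smoothing scheme are mine, not the paper's:

```python
import math
from collections import Counter

def distribution(text, vocab, eps=1e-12):
    """Smoothed probability distribution over a shared vocabulary of semantic units."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: (counts[w] + eps) / (total + eps * len(vocab)) for w in vocab}

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p.values() if pi > 0)

def cross_entropy(p, q):
    return -sum(p[w] * math.log(q[w]) for w in p)

def importance_score(summary, document, knowledge, alpha=1.0, beta=1.0):
    """theta_I up to a constant: -Red(S) + alpha * Rel(S, D) + beta * Inf(S, K)."""
    vocab = set((summary + " " + document + " " + knowledge).lower().split())
    p_s = distribution(summary, vocab)
    p_d = distribution(document, vocab)
    p_k = distribution(knowledge, vocab)
    red = -entropy(p_s)               # Red(S) = -H(S), dropping the constant H_max
    rel = -cross_entropy(p_s, p_d)    # Rel(S, D) = -CE(S, D)
    inf = cross_entropy(p_s, p_k)     # Inf(S, K) = CE(S, K)
    return -red + alpha * rel + beta * inf

# Toy usage: a candidate summary scored against a document and some background text
score = importance_score(
    summary="model scores candidate summaries",
    document="a simple theoretical model scores candidate summaries by importance",
    knowledge="generic background text about unrelated topics",
)
```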
Potential information
So far, we have connected summary S with source document D via relevance, and summary S with background knowledge K via informativeness. However, we can also connect source document D with background knowledge K: we can extract a lot of new information from D if it strongly differs from K. The computation is the same as informativeness, except it is between the source document D and the background knowledge K. This new cross-entropy represents the maximum information gain possible from source document D given background knowledge K.
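Following the same cross-entropy construction, this potential information can be written as (the PI notation is a reconstruction from the paper):

```latex
PI_K(D) = CE(D, K) = -\sum_{w_i \in \Omega} \mathbb{P}_D(w_i) \log \mathbb{P}_K(w_i)
```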
Experiments
We used two evaluation datasets: TAC-2008 and TAC-2009. The datasets focus on two different summarisation tasks: generic and update multi-document summarisation. Background knowledge K, α, and β are the parameters of our theoretical model for summarisation. We fix α and β, and set the background knowledge K to either the frequency distribution over words in the background documents or the probability distribution over all words from the source documents.
Correlation with human judgements
We assess how well our quantities correlate with human judgements. Each quantity in our framework can be used to score sentences for a summary, so we can evaluate how well they correlate with human judgement. The results are shown below. Of the three quantities, relevance has the highest correlation with human judgements. As expected, including background knowledge works better for update summarisation. Lastly, θ_I gives the best performance in both types of summarisation. No individual quantity has strong performance on its own, but put together they give a reliable, strong summary scoring function.
Comparison with reference summaries
Ideally, we would want generated summaries to be similar to human reference summaries. We scored both with θ_I and found that human reference summaries scored significantly higher than the generated summaries, supporting the reliability of our scoring function.
Conclusion and Future Work
Importance unifies the three common summarisation metrics of redundancy, relevance, and informativeness, and tells us which information to discard or include in the final summary. The background knowledge and the choice of semantic units are open parameters of the theoretical model, which means they are open for experimentation and exploration. N-grams are a good approximation of semantic units, but what other granularities could we consider?
Potential future work on background knowledge could use the framework to learn knowledge from data. Specifically, you could train a model to learn background knowledge such that the model has the highest correlation with human judgements. If you aggregate all the information over all users and topics, you can find generic background knowledge. If you aggregate all the users within one particular topic, you can find topic-specific background knowledge, and similar work can be done for a single user.
Source:
[1] Peyrard, M., 2018. A simple theoretical model of importance for summarization. arXiv preprint arXiv:1801.08991.
Originally published at https://ryanong.co.uk on April 29, 2020.