This repository contains resources for Natural Language Processing (NLP), with a focus on the task of text classification. The content is drawn mainly from the survey paper "A Survey on Text Classification: From Shallow to Deep Learning".
Deep Learning Models
Shallow Learning Models
Evaluation Metrics
Future Research Challenges
Tools and Repos
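Multi-task deep neural networks for natural language understanding --- MT-DNN --- by Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao (Github)
This paper presents a Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks. MT-DNN not only leverages large amounts of cross-task data, but also benefits from a regularization effect that leads to more general representations, which helps it adapt to new tasks and domains. MT-DNN incorporates the pre-trained bidirectional Transformer language model BERT. It obtains state-of-the-art results on ten NLP tasks, including SNLI, SciTail, and eight of the nine GLUE tasks, pushing the GLUE benchmark to 82.7% (a 2.2% absolute improvement). Experiments on the SNLI and SciTail datasets show that, compared with the pre-trained BERT representations, the representations learned by MT-DNN adapt better to new domains when few in-domain labels are available. The code and pre-trained models will be open-sourced.
BERT: pre-training of deep bidirectional transformers for language understanding --- BERT --- by Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Github)
Graph convolutional networks for text classification --- TextGCN --- by Liang Yao, Chengsheng Mao, Yuan Luo (Github)
XLNet: With its ability to model bidirectional contexts, pre-training based on denoising autoencoding, such as BERT, achieves better performance than pre-training based on autoregressive language modeling (such as GPT). However, because it relies on corrupting the input with masks, BERT neglects the dependency between masked positions and suffers from a discrepancy between pre-training and fine-tuning. In light of these pros and cons, the paper proposes XLNet, a generalized autoregressive pre-training method that (1) learns bidirectional contexts by maximizing the expected likelihood over permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. In addition, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pre-training. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks, including question answering, natural language inference, sentiment analysis, and document ranking.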
Sentiment Analysis (SA) is the process of analyzing and reasoning about subjective text with emotional color. Its key goal is to determine from the text whether the author supports a particular point of view, which differs from traditional text classification that analyzes the objective content of the text. SA can be binary or multi-class. Binary SA divides text into two categories, positive and negative. Multi-class SA classifies text into multi-level or fine-grained labels.
Movie Review (MR)
Stanford Sentiment Treebank (SST)
The Multi-Perspective Question Answering (MPQA)
IMDB reviews
Yelp reviews
Amazon Reviews (AM)
News content is one of the most crucial information sources and has a critical influence on people. A News Classification (NC) system helps users obtain vital knowledge in real time. News classification applications mainly encompass recognizing news topics and recommending related news according to user interest. The news classification datasets include 20NG, AG, R8, R52, Sogou, and so on. Here we detail several of the primary datasets.
20 Newsgroups (20NG)
AG News (AG)
R8 and R52
Sogou News (Sogou)
Topic analysis attempts to get the meaning of a text by identifying its sophisticated themes. Topic labeling is one of the essential components of topic analysis; it aims to assign one or more subjects to each document to simplify the analysis.
DBpedia
Ohsumed
Yahoo answers (YahooA)
The Question Answering (QA) task can be divided into two types: extractive QA and generative QA. Extractive QA provides multiple candidate answers for each question, and the system must choose which one is correct. Thus, text classification models can be used for the extractive QA task. The QA discussed in the survey is all extractive QA. A QA system can apply a text classification model to recognize the correct answer and treat the others as candidates. The question answering datasets include SQuAD, MS MARCO, TREC-QA, WikiQA, and Quora [209]. Here we detail several of the primary datasets.
Stanford Question Answering Dataset (SQuAD)
MS MARCO
TREC-QA
WikiQA
Natural Language Inference (NLI) is used to predict whether the meaning of one text can be deduced from another. Paraphrasing is a generalized form of NLI: it measures the semantic similarity of a sentence pair to decide whether one sentence is a paraphrase of the other. The NLI datasets include SNLI, MNLI, SICK, STS, RTE, SciTail, MSRP, etc. Here we detail several of the primary datasets.
The Stanford Natural Language Inference (SNLI)
Multi-Genre Natural Language Inference (MNLI)
Sentences Involving Compositional Knowledge (SICK)
Microsoft Research Paraphrase (MSRP)
A dialog act describes an utterance in a dialog based on semantic, pragmatic, and syntactic criteria. Dialog Act Classification (DAC) labels a piece of a dialog according to its category of meaning and helps infer the speaker's intentions. Here we detail several of the primary datasets, including DSTC 4, MRDA, and SwDA.
Dialog State Tracking Challenge 4 (DSTC 4)
ICSI Meeting Recorder Dialog Act (MRDA)
Switchboard Dialog Act (SwDA)
In multi-label classification, an instance has multiple labels, and each label can take only one of multiple classes. Many datasets are based on multi-label text classification, including Reuters, Education, Patent, RCV1, RCV1-2K, AmazonCat-13K, BlurbGenreCollection, WOS-11967, AAPD, etc. Here we detail several of the main datasets.
Reuters news
Patent Dataset
Reuters Corpus Volume I (RCV1) and RCV1-2K
Web of Science (WOS-11967)
Arxiv Academic Paper Dataset (AAPD)
There are also datasets for other applications, such as Geonames toponyms, Twitter posts, and so on.
When evaluating text classification models, accuracy and F1 score are the most commonly used metrics. As classification tasks have become more difficult and more specialized tasks have appeared, the evaluation metrics have been refined. For example, P@K and Micro-F1 are used to evaluate multi-label text classification performance, and MRR is usually used to assess performance on QA tasks.
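As a rough illustration of MRR, assuming each question comes with a ranked list of candidate answers flagged as correct or incorrect, the reciprocal rank of a question is 1 divided by the position of its first correct answer, and MRR is the mean over all questions. The function name and the toy data below are hypothetical.

```python
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """Mean Reciprocal Rank over a set of questions.

    ranked_relevance[i] holds 0/1 correctness flags for the candidate
    answers of question i, in the order the model ranked them.
    """
    if not ranked_relevance:
        raise ValueError("need at least one question")
    total = 0.0
    for flags in ranked_relevance:
        reciprocal_rank = 0.0  # stays 0 if no candidate is correct
        for rank, is_correct in enumerate(flags, start=1):
            if is_correct:
                reciprocal_rank = 1.0 / rank
                break
        total += reciprocal_rank
    return total / len(ranked_relevance)


# Toy data (illustrative only).
example = [
    [0, 1, 0],  # first correct answer at rank 2
    [1, 0, 0],  # first correct answer at rank 1
    [0, 0, 0],  # no correct answer
]
mrr = mean_reciprocal_rank(example)
print(f"MRR = {mrr:.3f}")
```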
Single-label text classification assigns each text to its single most likely category and is applied in NLP tasks such as QA, SA, and dialogue systems [9]. In single-label text classification, a text belongs to exactly one category, so the relations among labels need not be considered. Here we introduce some evaluation metrics used for single-label text classification tasks.
Accuracy and Error Rate
Precision, Recall and F1
Accuracy and Error Rate are the fundamental metrics for a text classification model. They are respectively defined as
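In the usual confusion-matrix notation (assumed here, since the symbols are not defined in the surrounding text), let TP, FP, TN and FN be the numbers of true positives, false positives, true negatives and false negatives, and let N = TP + FP + TN + FN be the total number of samples. The two metrics can then be written as:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{N},
\qquad
\mathrm{ErrorRate} = 1 - \mathrm{Accuracy} = \frac{FP + FN}{N}
```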
Exact Match (EM)
Mean Reciprocal Rank (MRR)
Hamming-loss (HL)
Precision, Recall and F1 are vital metrics for unbalanced test sets, regardless of the standard accuracy and error rate, for example when most of the test samples share one class label. F1 is the harmonic mean of Precision and Recall. Precision, Recall, and F1 are defined as
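Let TP, FP and FN denote the true positives, false positives and false negatives for the class of interest (notation assumed here rather than given in the text). The standard definitions then read:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN},
\qquad
F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```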
The best result is obtained when the accuracy, F1 and recall values reach 1; conversely, the worst result is obtained when they drop to 0. For multi-class classification, the precision and recall of each class can be calculated separately, so that both per-class and overall performance can be analyzed.
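As a minimal sketch of this per-class analysis, the Python snippet below counts true positives, false positives and false negatives for each class and derives precision, recall and F1 from them. The function name `per_class_prf` and the toy labels are illustrative assumptions, not part of the survey.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_prf(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy three-class sentiment example (illustrative only).
y_true = ["pos", "pos", "neg", "neu", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "neu", "pos", "pos"]
scores = per_class_prf(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Looking at the per-class rows next to the overall accuracy makes it easier to spot a class that the model handles poorly even when the aggregate number looks acceptable.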
Compared with single-label text classification, multi-label text classification assigns multiple category labels to a text, and the number of labels is variable. The metrics above are designed for single-label text classification and are not suitable for multi-label tasks. Thus, some metrics are designed specifically for multi-label text classification.
Micro-F1
Macro-F1
Micro-F1 is a measure that considers the overall precision and recall of all labels. Micro-F1 is defined as
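One common way to write this, assuming S is the label set and TP_t, FP_t and FN_t are the true-positive, false-positive and false-negative counts for label t (symbols introduced here for illustration), pools the counts over all labels before computing precision P and recall R:

```latex
P = \frac{\sum_{t \in S} TP_t}{\sum_{t \in S} (TP_t + FP_t)},
\qquad
R = \frac{\sum_{t \in S} TP_t}{\sum_{t \in S} (TP_t + FN_t)},
\qquad
\text{Micro-}F1 = \frac{2 P R}{P + R}
```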
Macro-F1 calculates the average F1 over all labels. Unlike Micro-F1, which gives equal weight to every example, Macro-F1 gives equal weight to every label when averaging. Formally, Macro-F1 is defined as
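A common formulation, assuming S is the label set and F1_t is the F1 score computed for label t alone, is Macro-F1 = (1 / |S|) * (sum of F1_t over all t in S). The Python sketch below contrasts it with Micro-F1 computed from pooled counts on a toy multi-label example; the helper names and the data are hypothetical.

```python
from typing import Sequence, Set, Tuple


def label_counts(
    y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]], label: str
) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for one label across all documents."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]], labels: Sequence[str]
) -> Tuple[float, float]:
    counts = [label_counts(y_true, y_pred, label) for label in labels]
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts)
    )
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts) / len(labels)
    return micro, macro


# Toy data: "cs" is frequent and well predicted, "math" and "bio" are rare and missed.
labels = ["cs", "math", "bio"]
y_true = [{"cs", "math"}, {"cs"}, {"cs"}, {"bio"}]
y_pred = [{"cs"}, {"cs"}, {"cs", "math"}, {"cs"}]
micro, macro = micro_macro_f1(y_true, y_pred, labels)
print(f"Micro-F1 = {micro:.3f}, Macro-F1 = {macro:.3f}")
```

In this toy case the frequent label dominates the pooled counts, so Micro-F1 comes out well above Macro-F1, which is pulled down by the two rare labels.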
In addition to the above evaluation metrics, there are some rank-based evaluation metrics for extreme multi-label classification tasks, including P@K and NDCG@K.
Precision at Top K (P@K)
Normalized Discounted Cumulated Gains (NDCG@K)
P@K is the precision at the top k. For P@K, each text has a set of L ground-truth labels L_t = {l_0, l_1, l_2, ..., l_{L-1}} and a list of Q predicted labels in order of decreasing probability, P_t = [p_0, p_1, p_2, ..., p_{Q-1}]. The precision at k is
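Using the symbols L_t, p_j and Q from the text, and writing rel for an indicator of whether a predicted label belongs to the ground-truth set (a name assumed here), a standard form of the metric is:

```latex
\mathrm{rel}_{L_t}(p) =
\begin{cases}
1 & \text{if } p \in L_t \\
0 & \text{otherwise}
\end{cases},
\qquad
P@k = \frac{1}{k} \sum_{j=0}^{\min(Q,k)-1} \mathrm{rel}_{L_t}(p_j)
```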
The NDCG at k is
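A widely used formulation (stated here as an assumption, since several variants exist) takes DCG@k as the sum of rel(p_j) / log2(j + 2) over the top-k predicted labels p_0, ..., p_{k-1}, and NDCG@k as DCG@k divided by the ideal DCG@k obtained when all ground-truth labels are ranked first. The Python sketch below computes both P@K and NDCG@K that way; the function names and the data are illustrative.

```python
import math
from typing import Sequence, Set


def precision_at_k(true_labels: Set[str], ranked_preds: Sequence[str], k: int) -> float:
    """Fraction of the top-k predicted labels that are ground-truth labels."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for p in ranked_preds[:k] if p in true_labels)
    return hits / k


def ndcg_at_k(true_labels: Set[str], ranked_preds: Sequence[str], k: int) -> float:
    """NDCG@k with binary relevance and a log2 position discount."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Positions are 0-indexed, so the discount for position j is log2(j + 2).
    dcg = sum(
        1.0 / math.log2(j + 2)
        for j, p in enumerate(ranked_preds[:k])
        if p in true_labels
    )
    # Ideal ranking: every ground-truth label (up to k of them) comes first.
    ideal_hits = min(len(true_labels), k)
    idcg = sum(1.0 / math.log2(j + 2) for j in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Toy example: labels ranked by decreasing predicted probability.
true_labels = {"a", "c"}
ranked = ["a", "b", "c", "d"]
p3 = precision_at_k(true_labels, ranked, 3)
n3 = ndcg_at_k(true_labels, ranked, 3)
print(f"P@3 = {p3:.3f}, NDCG@3 = {n3:.3f}")
```

Unlike P@K, NDCG@K rewards placing correct labels nearer the top, so swapping "b" and "c" in the toy ranking raises NDCG@3 while leaving P@3 unchanged.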
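Zero-shot/Few-shot learning
External knowledge
Multi-label text classification tasks
Special domains with many terminologies
Most architectures of existing shallow and deep learning models have been tried for text classification, including ensemble methods. BERT learns a language representation that can be fine-tuned for many NLP tasks. The main ways to obtain better results are to add data, increase computing power, and design better training procedures. How to trade off data and computing resources against prediction performance is worth studying.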
Semantic robustness of the model
Interpretability of the model