
NLP text summarization: dataset introduction and preprocessing [CNN/DM (mostly extractive), NYT Annotated Corpus (mostly extractive), Newsroom (extractive + abstractive), XSum (abstractive/BBC), XL-Sum]


I. CNN/DailyMail dataset

The summarization version of this dataset was first proposed in the paper "Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond".
The paper reports that source documents in the training set average 766 words across 29.74 sentences, while the summaries average 53 words and 3.72 sentences.

Total CNN data: 92,579 (about 93K)
Total Daily Mail data: 219,506 (about 220K)

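Background
CNN/Daily Mail (CNN/DM for short) is a single-document summarization corpus in which each summary consists of several summary sentences.

The data was originally collected from CNN (Cable News Network) and the Daily Mail as a machine reading comprehension corpus of about 1 million news items.

It was later lightly modified into a corpus for single-document abstractive summarization.

The highlights of each news story are concatenated, in the order in which they appear in the original, into a multi-sentence summary, with each highlight treated as one sentence.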
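The highlight-to-summary step described above can be sketched as follows. This is a minimal sketch, not the original preprocessing script: it assumes each story is a plain-text file in which the article body is followed by blocks introduced by an `@highlight` marker line, and the function name `split_story` and the sample text are hypothetical.

```python
def split_story(story_text):
    """Split one story into (article, summary).

    Assumption: every highlight is the first non-empty line
    that follows an "@highlight" marker line.
    """
    article_lines = []
    highlights = []
    expect_highlight = False
    for raw_line in story_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "@highlight":
            expect_highlight = True
        elif expect_highlight:
            highlights.append(line)
            expect_highlight = False
        else:
            article_lines.append(line)
    # Treat each highlight as one sentence: add a period if it has no end mark.
    sentences = [h if h[-1] in ".!?" else h + "." for h in highlights]
    # Highlights stay in their original order.
    return " ".join(article_lines), " ".join(sentences)


# Hypothetical sample story, for illustration only.
SAMPLE_STORY = """First sentence of the article.

Second sentence of the article.

@highlight

Point one

@highlight

Point two."""

article, summary = split_story(SAMPLE_STORY)
print(summary)
```

Keeping the highlights in their original order matches the description above; any further cleaning (tokenization, lowercasing) is left out on purpose.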

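Data description
Size of the CNN/DM dataset used for single-document summarization:

  • Training set size: 286,817
  • Validation set size: 13,368
  • Test set size: 11,487
  • Average number of summary sentences in the training set: 3.72

Data source
https://cs.nyu.edu/~kcho/DMQA/

Task description
The dataset is suitable for natural language processing problems such as machine reading comprehension and automatic text summarization.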

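II. New York Times Annotated Corpus dataset

The corpus is built from preprocessed New York Times articles published between 1987 and 2007. It contains about 1.8 million articles, including more than 650,000 summaries written by staff and more than 1.5 million manually tagged articles, together with normalized index tables for people, organizations, locations, and topics.

It can be used for tasks such as automatic summarization, text classification, and content extraction.

For automatic summarization, the summaries tend to resemble the output of an extractive strategy, so the corpus is better suited as a dataset for extractive summarization.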

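1. Folder structure

The New York Times folder contains 9 entries in total: 4 folders and 5 files.

2. Description of the New York Times corpus:

  • 1.8 million articles
  • More than 650k manually written article summaries
  • More than 1.5 million manually tagged articles, with tags covering people, places, organizations, titles, and topics
  • More than 275k articles tagged algorithmically
  • A Java tool for parsing the XML files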

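3. Data format

The articles are written as XML documents that follow the NITF (News Industry Text Format) standard.

4. Applications

The corpus contains 650k manually written article summaries, which can be used to evaluate document summarization algorithms.

It also contains 1.5 million tagged documents, which can be used to develop and evaluate document routing, document classification, entity recognition, cross-document coreference resolution, and information retrieval methods.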

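5. XML file structure

There is one folder per year from 1987 to 2007. Each year folder contains 12 subfolders, one per month from Jan to Dec. Each month folder contains up to 31 day folders, and each day folder contains a little over a hundred articles (that is, the New York Times published 100+ news articles per day). An article picked at random from 1993/12/31 serves as the example here.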
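The year/month/day layout described above can be walked with a few lines of Python. This is a minimal sketch under stated assumptions: the corpus has already been extracted, directory names are zero-padded numbers such as `1993/12/31`, and article files end in `.xml`. The function name `article_paths` and the root path are hypothetical.

```python
from pathlib import Path


def article_paths(corpus_root, year, month, day):
    """Return the sorted article files published on one day.

    Assumption: the extracted corpus is laid out as
    <corpus_root>/<YYYY>/<MM>/<DD>/<article>.xml
    """
    day_dir = Path(corpus_root) / f"{year:04d}" / f"{month:02d}" / f"{day:02d}"
    # glob on a missing directory yields nothing, so the result is just empty.
    return sorted(day_dir.glob("*.xml"))


# Usage with a hypothetical root directory:
# for path in article_paths("nyt_corpus/data", 1993, 12, 31):
#     print(path.name)
```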

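Each article is divided into a header and a body. The header holds metadata, including the publication date, classification, and document id. The focus here is on the body section:

  • body.head
    • headline: the title of the article
  • body.content
    • lead_paragraph: the opening paragraphs of the article
    • full_text: the complete text, with each paragraph wrapped in a p tag

Every XML file stores both metadata and data. The part we care about most is the full_text inside body.content, which is the news content itself and is what we use for retrieval.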
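Extracting these fields can be sketched with Python's standard `xml.etree.ElementTree`. This is a minimal sketch, not the Java tool shipped with the corpus. The sample document is hypothetical, and the concrete element names (`hl1` for the headline, `block` elements whose `class` attribute is `lead_paragraph` or `full_text`) are assumptions modeled on the structure described above; check them against a real file before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened document for illustration only.
SAMPLE_NITF = """<nitf>
  <head><title>Sample title</title></head>
  <body>
    <body.head><hedline><hl1>Sample headline</hl1></hedline></body.head>
    <body.content>
      <block class="lead_paragraph"><p>First paragraph.</p></block>
      <block class="full_text"><p>First paragraph.</p><p>Second paragraph.</p></block>
    </body.content>
  </body>
</nitf>"""


def parse_article(xml_text):
    """Return the headline, lead paragraph, and full text of one article."""
    root = ET.fromstring(xml_text)
    # Assumed tag name for the headline element.
    headline = next(("".join(el.itertext()) for el in root.iter("hl1")), "")
    blocks = {}
    for block in root.iter("block"):
        # Each paragraph is wrapped in a <p> tag; join them with newlines.
        paragraphs = ["".join(p.itertext()).strip() for p in block.iter("p")]
        blocks[block.get("class")] = "\n".join(paragraphs)
    return {
        "headline": headline.strip(),
        "lead_paragraph": blocks.get("lead_paragraph", ""),
        "full_text": blocks.get("full_text", ""),
    }


article = parse_article(SAMPLE_NITF)
print(article["headline"])
print(article["full_text"])
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`; the rest of the function stays the same.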

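III. Newsroom

Newsroom is a large dataset for training and evaluating automatic summarization systems.
It contains 1.3 million articles and human-written summaries from 38 major news publications.
The data was collected from search and social media metadata between 1998 and 2017, and the summaries combine a variety of extractive and abstractive strategies, which makes Newsroom usable as a dataset for both kinds of summarization methods.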

CORNELL NEWSROOM contains three large files for training, development, and released test sets. Each of these files uses the compressed JSON line format. Each line is an object representing a single article-summary pair. An example summary object:

{
            "text": "...",
         "summary": "...",
           "title": "...",
         "archive": "http://...",
            "date": 20160302060024,
         "density": 1.25,
        "coverage": 0.75,
     "compression": 12.5,
 "compression_bin": "medium",
    "coverage_bin": "low",
     "density_bin": "abstractive"
}
{
	"url": "http:\/\/www.foxsports.com\/baseball\/xchange\/teamnote\/xch_sdg.sml",
	"archive": "http:\/\/web.archive.org\/web\/19980117162148id_\/http:\/\/www.foxsports.com:80\/baseball\/xchange\/teamnote\/xch_sdg.sml",
	"title": "Pro Sports Xchange notes",
	"date": "19980117162148",
	"text": "So sayeth Padre general manager Kevin Towers.\n\nLess than two weeks after unofficially setting his starting rotation by trading for No. 1 Kevin Brown and re-signing No. 5 Pete Smith, Towers stirred the pot Jan. 7 by signing Mark Langston to a minor league contract.\n\nBut the Padres have no intention of having Langston pitch at Triple-A Las Vegas. And neither does the 37-year-old left-hander.\n\n\"We didn't get him to pitch at Las Vegas,\" said GM Kevin Towers emphatically. \"If Mark is back to full health, he's going to go for one of the spots in the starting rotation.\"\n\nThere's the rub. Langston had elbow and knee problems last year and made only one start (Aug. 20) after undergoing arthroscopic surgery in May to have bone spurs removed from his left elbow.\n\nIt was his second round of elbow surgery in four seasons.\n\nHowever, the Padres hit the jackpot once before with a over-the-hill left-hander. Fernando Valenzuela was 13-8 with a 3.62 ERA in '96 and helped pitch the Padres to the N.L. West title.\n\nLangston thinks it will. And Towers said it's not that big a gamble.\n\n\"Right now, we're very happy with the rotation we have,\" said Towers of the Brown, Andy Ashby, Joey Hamilton, Sterling Hitchcock and Smith quintet.\n\n\"But you never know about pitching. Something can happen and usually does. I like insurance.\"\n\nLangston has a career 174-150 record with a 3.88 ERA. He is a four-time All-Star and was Anaheim's opening day starter last April.\n\nAnd a year after his 1994 surgery, Langston bounced back with a 15-7 record with a 4.63 ERA.\n\nLangston said his arm is sound and said he shouldn't have tried to return last year.\n\n\"I simply tried to come back too soon last year,\" he said. \"The Angels were in a pennant race and short on starting pitchers. So I tried to help out and it was a mistake. I was advised by the doctors not to do it.\n\n\"But the arm feels good. I've been playing long toss for a month with no problems. 
I see myself winning one of those starting spots in San Diego. And I believe we have the makings of being a contender. If I'm healthy, I will enhance this team.\"\n\nThe Angels did not offer Langston a contract after last season (2-4, 5.65 ERA). Langston said he had better offers elsewhere, but chose to remain as close as possible to his Anaheim Hills home in Southern California.\n\n\"I see no downside to signing Mark,\" said Padre manager Bruce Bochy. \"It's definitely worth taking a look for us. He certainly knows how to pitch.\"\n\nTowers said the addition of Brown and Langston might also help Ashby and Hamilton mature into better pitchers.\n\n\"We've now got three guys here (Brown, Langston and pitching coach Dave Stewart) who have pitched at the highest levels of competition. I think it will do a world of good for our other pitchers to watch how these guys go about their business.\"\n\nLangston has played only one season in the N.L., going 12-9 with a 2.38 in Montreal over two-thirds of the '89 season.\n\nNOTES, QUOTES, ANECDOTES The Padres signed 2B\/leadoff man Quilvio Veras to a two-year contract worth $3.1 million. Veras, who turns 27 April 3, will make $1.1 million this season and $2 in 1998. The switch-hitter struggled early last year while sharing the leadoff job with Rickey Henderson, but batted .293 over the last 100 games. But he needs to reduce his 84 strikeouts, draw more than 72 walks and hit better than .194 from the right side.\n\nIt's official. Native San Diegan 1B Eddie Williams is back for a third fling as a Padre, although Williams will be playing in Las Vegas as an insurance policy against Wally Joyner being hurt. Williams, who signed a minor league contract, hit .240 with three homers and 12 RBI in 38 games with the Dodgers and Pirates last season.\n\nPut Greg Vaughn ahead of Ruben Rivera in the Padre left field derby. 
Rivera has been hovering around .200 in the Dominican Republic Winter League after missing much of the campaign with a broken finger. Over the last year, Rivera has had less than 100 at bats. \"I don't care what he hits, but Ruben needed more swings,\" said Towers.\n\nROSTER REPORT FREE AGENCY UPDATE -- Signed catcher Greg Myers (Braves), catcher Carlos Hernandez (re-signed), signed infielder Craig Shipley (allowed to become free-agent), right-handed pitcher Pete Smith (re-signed), left-handed pitcher Mark Langston (Angels, signed to minor-league contract).\n\nMEDICAL WATCH -- Left fielder Greg Vaughn (recovering from right knee surgery), third baseman Ken Caminiti (recovering from right knee surgery), first baseman Wally Joyner (recovering from right knee surgery), center fielder Steve Finley (recovering from big toe surgery), right fielder Tony Gwynn (recovering from left knee surgery), outfielder Ruben Rivera (recovering from a broken left finger).",
	"summary": "SAN DIEGO PADRES team notebook",
	"compression": 209.0,
	"coverage": 0.8,
	"density": 1.2,
	"compression_bin": "high",
	"coverage_bin": "medium",
	"density_bin": "abstractive"
}
{
	"url": "http:\/\/www.nydailynews.com\/archives\/news\/1995\/10\/17\/1995-10-17_new_yorkers__only_regret_was.html",
	"archive": "http:\/\/web.archive.org\/web\/20110210093603id_\/http:\/\/www.nydailynews.com:80\/archives\/news\/1995\/10\/17\/1995-10-17_new_yorkers__only_regret_was.html",
	"title": "NEW YORKERS' ONLY REGRET WAS STAYING HOME",
	"date": "20110210093603",
	"text": "This story was reported by: NICK CHARLES, AUSTIN EVANS FENNER AND SAMSON MULUGETA It was written by: KAREN HUNTER\n\nTuesday, October 17th 1995, 4:20AM\n\nAs many black men marched on Washington yesterday, some New Yorkers spoke of their pride in the event and their disappointment in not being there, too.\n\n\"I felt like the only black person working,\" said Roderick Vinson, 38, of Harlem. \"That feeling made me sick to my stomach. I couldn't believe I missed one of the important events of my life.\"\n\nWinston Ford, 50, had to work, too. He makes his living selling incense and body oils in Brooklyn.\n\n\"I didn't have the finances to make the trip,\" he said. \"But my heart and soul is with them in Washington.\"\n\nFor HIV-positive Sheldon Julius of Harlem, the Million Man March was a wakeup call. Long an absentee father, he called his 15-year-old son Sunday night and for the first time ever told him that he loved him. \"The calling of the march made me realize my responsibility,\" he said.\n\nBut some other black New Yorkers said they had no use for march organizer Louis Farrakhan and made no apologies for missing the rally.\n\n\"Farrakhan's wrong,\" said Allen Washington, 61, a retired Triborough Bridge and Tunnel Authority worker. \"Whites and blacks need each other. If we worked together, we'd be a great nation.\"\n\n\"Louis Farrakhan shouldn't be at the march because of the remarks he has made about Jews and whites,\" agreed Brooklyn construction worker Cyril Peter, 35. \"As long as he's there, there will be a negative effect.\"\n\nCharles Williams, 45, an East Elmhurst, Queens, graphics worker, decided that the event was hollow. \"There is no agenda,\" he said. 
\"It isn't about jobs or housing, it's just about a paper platform.\"\n\nStill, drug counselor Jeanette Morgan was bursting with hope and pride as she sipped coffee in a Queens diner and thought about her brothers, sons and grandsons marching in Washington.\n\n\"I am so emotional about this day, I can barely talk about it,\" Morgan said. \"When the men return, I hope they go to their brothers at a street corner and offer to help.\"\n\nOn one Harlem corner yesterday, college student Mike Carr stood shaking his head as he watched a man idly nurse a 40-ounce beer in a bag.\n\n\"It's a shame,\" said Carr. \"These are the brothers who could have used the march the most.\"",
	"summary": "As many black men marched on Washington yesterday, some New Yorkers spoke of their pride in the event and their disappointment in not being there, too. \"I felt like the only black person working,\"said Roderick Vinson, 38, of Harlem. \"That feeling made me sick to my stomach. I couldn't believe I missed one of the important events of my life.\"Winston Ford, 50, had to work, too. He makes his living selling",
	"compression": 6.1529411765,
	"coverage": 0.9764705882,
	"density": 24.6,
	"compression_bin": "low",
	"coverage_bin": "high",
	"density_bin": "extractive"
}
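The `coverage`, `density`, and `compression` fields in the objects above can be approximated from the raw text. The sketch below follows the extractive-fragment definitions from the Newsroom paper (Grusky et al., 2018): coverage is the fraction of summary tokens that lie inside fragments copied from the article, density is the average squared fragment length per summary token, and compression is the article-to-summary length ratio. It is not the official analysis tool. Lowercased whitespace tokenization is an assumption, so the numbers will not necessarily match the values stored in the dataset, and the function names are hypothetical.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest shared token runs, scanning the summary left to right."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        j = 0
        while j < len(article_tokens):
            if summary_tokens[i] == article_tokens[j]:
                # Extend the match as far as both sequences agree.
                k = 0
                while (i + k < len(summary_tokens)
                       and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
        # Skip past the fragment, or past one unmatched token.
        i += max(best, 1)
    return fragments


def summary_stats(article, summary):
    """Return coverage, density, and compression for one article-summary pair."""
    # Assumption: lowercased whitespace tokens stand in for the real tokenizer.
    a = article.lower().split()
    s = summary.lower().split()
    if not s:
        raise ValueError("summary must contain at least one token")
    frags = extractive_fragments(a, s)
    return {
        "coverage": sum(len(f) for f in frags) / len(s),
        "density": sum(len(f) ** 2 for f in frags) / len(s),
        "compression": len(a) / len(s),
    }


stats = summary_stats("the cat sat on the mat today", "the cat sat happily")
print(stats)
```

A summary copied verbatim from the article yields a coverage of 1 and a high density, which is why the second example above, whose summary repeats the article's opening, falls into the `extractive` bin.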

The date is an integer using the Internet Archive date format: YYYYMMDDHHMMSS. Density and coverage scores are provided for convenience, computed using the summary analysis tool also provided. Data subset and subsets by density, coverage, and compression are also provided. For example, in Python, each data file can be read as follows:

import gzip
import json

path = "train.jsonl.gz"
data = []

# Each line of the gzip-compressed file is one JSON object
# holding a single article-summary pair.
with gzip.open(path, "rt", encoding="utf-8") as f:
    for ln in f:
        obj = json.loads(ln)
        data.append(obj)
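Once the records are loaded, two small helpers cover common follow-up steps: turning the Internet Archive `date` value into a `datetime`, and selecting a subset by `density_bin`. This is a minimal sketch; the helper names are hypothetical. Since the schema describes `date` as an integer while the sample records store it as a string, the parser accepts both. The demo records reuse field values from the examples above with the long text fields left out.

```python
import gzip
import json
from datetime import datetime


def read_pairs(path):
    """Yield one article-summary object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def parse_archive_date(value):
    """Parse a YYYYMMDDHHMMSS value given either as an int or as a str."""
    return datetime.strptime(str(value), "%Y%m%d%H%M%S")


def select_by_density(pairs, density_bin):
    """Keep only the pairs whose density_bin equals the requested label."""
    return [p for p in pairs if p.get("density_bin") == density_bin]


# In-memory demo records; only the fields needed here are included.
demo = [
    {"date": "19980117162148", "density_bin": "abstractive"},
    {"date": "20110210093603", "density_bin": "extractive"},
]
extractive = select_by_density(demo, "extractive")
print(len(extractive), parse_archive_date(extractive[0]["date"]))
```

With a real file, `select_by_density(read_pairs("train.jsonl.gz"), "extractive")` streams the records instead of using the demo list.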

IV. XSum (Extreme Summarization Dataset)

Our extreme summarization dataset (which we call XSum) consists of BBC articles and accompanying single sentence summaries. Specifically, each article is prefaced with an introductory sentence (aka summary) which is professionally written, typically by the author of the article.

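V. XL-Sum

Paper: http://arxiv.org/pdf/2106.13822v1.pdf

Source: Bangladesh University of Engineering and Technology (BUET)

Paper title: XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages

Original author: Tahmid Hasan

Contemporary work on abstractive text summarization has focused mainly on high-resource languages such as English, largely because datasets for low- and mid-resource languages are scarce.

In this work, the authors present XL-Sum, a comprehensive and diverse dataset of 1 million professionally annotated article-summary pairs from the BBC, extracted with a set of carefully designed heuristics.

The dataset covers 44 languages ranging from low-resource to high-resource, many of which currently have no public dataset available.

XL-Sum is highly abstractive, concise, and of high quality.

The authors fine-tune mT5, a state-of-the-art pretrained multilingual model, on XL-Sum and run experiments on multilingual and low-resource summarization tasks.

Compared with results obtained on similar monolingual datasets, XL-Sum yields competitive results: on the 10 languages benchmarked, ROUGE-2 scores are above 11, and some exceed 15 with multilingual training.

In addition, training on individual low-resource languages also gives competitive performance. To the authors' knowledge, XL-Sum is the largest abstractive summarization dataset in terms of the number of samples collected from a single source and the number of languages covered.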





