Downloading the OntoNotes 5.0 Dataset (a Personal Account)

Author: 小舞很执着 | 2024-08-21 14:12:43

Preface

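Honestly, getting hold of this dataset took a lot of back-and-forth. This post covers only the full process of obtaining the OntoNotes Release 5.0 dataset, plus preprocessing its Chinese and English data for coreference resolution.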

Getting the data

1. First, register an account on the official LDC site. The email address and organization fields need some care; the rest hardly matters.

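After registering, log in. You will see this notice: You currently have a guest account with LDC. If you’d like to be able to access more data please consider a membership for your organization.
At this point, in theory, you have to wait for your organization's administrator to approve your request before you are allowed to download the free data. Until then there is nothing you can do.
In practice, though, you may wait a long time with nobody responding, and the request simply fizzles out. When that happens, send an email to LDC from the address you registered with. Explain that you have waited a while with no response to your request, perhaps because the administrator has graduated or left the university, and that you would like help.
A day or two later (because of the time difference) you will get a reply with your organization administrator's email address and phone number. If your registration details are incomplete, they will first tell you what needs to be filled in.
Our university's administrator is a big name, so I only dared to bother him by email. Fortunately, he replied quickly and approved my request.
Log in again and you will find that you now belong to an organization. The Downloads page lists the content you can download.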

Usually the dataset you want is not in your downloads yet. You need to search for it and place an order; a few days later it will appear on your downloads page.

Processing the data

The following approach applies to building the coreference resolution dataset.

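I currently use this project's processing scripts to preprocess the data. Note that the code is written in Python 2 and has quite a few dependencies. Porting it to Python 3 is easy, though: just fix the places that raise errors, which are usually nothing more than outdated exception and print syntax.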
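To make the kind of fix concrete, here is a minimal sketch of the two syntax changes mentioned above, with the Python 2 form shown in a comment above each Python 3 line. The functions `parse_count` and `describe` are hypothetical and exist only for illustration; they are not taken from the preprocessing scripts.

```python
# Hypothetical helpers, written only to illustrate Python 2 -> 3 syntax fixes.

def parse_count(text):
    if not text.isdigit():
        # Python 2 form: raise ValueError, "not a count: " + text
        raise ValueError("not a count: " + text)
    return int(text)


def describe(text):
    try:
        n = parse_count(text)
    # Python 2 form: except ValueError, e:
    except ValueError as e:
        return "error: {}".format(e)
    return "count {}".format(n)


# Python 2 form: print describe("42")
print(describe("42"))
print(describe("abc"))
```

In Python 3, `print` is a function, exceptions are raised by calling the exception class, and the caught exception is bound with `as`. Other incompatibilities may exist in the real scripts, so treat this as a starting point rather than a complete list.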
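After the preprocessing script finishes, it generates three JSON Lines files in the same directory: the training, validation, and test sets.
In each record, doc_key identifies the document and indicates its text category, sentences holds the dialogue split into tokenized sentences, speakers gives the speaker of each sentence, and clusters holds the coreference result, that is, the mention phrases grouped by the entity they refer to.
English: train.english.jsonlines, dev.english.jsonlines, test.english.jsonlines
Chinese: train.chinese.jsonlines, dev.chinese.jsonlines, test.chinese.jsonlines
Sample data: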

{
   "doc_key": "bc/cctv/00/cctv_0001_0", "sentences": [["What", "kind", "of", "memory", "?"], ["We", "respectfully", "invite", "you", "to", "watch", "a", "special", "edition", "of", "Across", "China", "."], ["WW", "II", "Landmarks", "on", "the", "Great", "Earth", "of", "China", ":", "Eternal", "Memories", "of", "Taihang", "Mountain"], ["Standing", "tall", "on", "Taihang", "Mountain", "is", "the", "Monument", "to", "the", "Hundred", "Regiments", "Offensive", "."], ["It", "is", "composed", "of", "a", "primary", "stele", ",", "secondary", "steles", ",", "a", "huge", "round", "sculpture", "and", "beacon", "tower", ",", "and", "the", "Great", "Wall", ",", "among", "other", "things", "."], ["A", "primary", "stele", ",", "three", "secondary", "steles", ",", "and", "two", "inscribed", "steles", "."], ["The", "Hundred", "Regiments", "Offensive", "was", "the", "campaign", "of", "the", "largest", "scale", "launched", "by", "the", "Eighth", "Route", "Army", "during", "the", "War", "of", "Resistance", "against", "Japan", "."], ["This", "campaign", "broke", "through", "the", "Japanese", "army", "'s", "blockade", "to", "reach", "base", "areas", "behind", "enemy", "lines", ",", "stirring", "up", "anti-Japanese", "spirit", "throughout", "the", "nation", "and", "influencing", "the", "situation", "of", "the", "anti-fascist", "war", "of", "the", "people", "worldwide", "."], ["This", "is", "Zhuanbi", "Village", ",", "Wuxiang", "County", "of", "Shanxi", "Province", ",", "where", "the", "Eighth", "Route", "Army", "was", "headquartered", "back", "then", "."], ["On", "a", "wall", "outside", "the", "headquarters", "we", "found", "a", "map", "."], ["This", "map", "was", "the", "Eighth", "Route", "Army", "'s", "depiction", "of", "the", "Mediterranean", "Sea", "situation", "at", "that", "time", "."], ["This", "map", "reflected", "the1
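As a rough illustration of how such a file might be consumed, the sketch below reads a .jsonlines file and turns each cluster into readable mention strings. It assumes that each line is one JSON document and that every mention in clusters is an inclusive [start, end] pair of token indices into the flattened sentences list; that span convention is an assumption not stated in this post, so check it against your own output. The helper names `read_jsonlines` and `cluster_mentions` and the toy document are made up for the example.

```python
import json
from typing import Iterator, List


def read_jsonlines(path: str) -> Iterator[dict]:
    """Yield one document dict per non-empty line of a .jsonlines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def cluster_mentions(doc: dict, sep: str = " ") -> List[List[str]]:
    """Map each cluster of spans to mention strings.

    Assumption: spans are inclusive [start, end] token indices
    into the concatenation of all sentences.
    """
    tokens = [tok for sent in doc["sentences"] for tok in sent]
    return [
        [sep.join(tokens[start:end + 1]) for start, end in cluster]
        for cluster in doc["clusters"]
    ]


# Toy document (invented for illustration, not real OntoNotes data).
toy = {
    "doc_key": "toy_0",
    "sentences": [
        ["This", "map", "was", "old", "."],
        ["It", "hung", "on", "a", "wall", "."],
    ],
    "speakers": [["-"] * 5, ["-"] * 6],
    "clusters": [[[0, 1], [5, 5]]],
}

print(cluster_mentions(toy))  # [['This map', 'It']]
```

For a real file, something like `for doc in read_jsonlines("train.english.jsonlines"): ...` would iterate over the documents. For the Chinese files, passing `sep=""` joins tokens without spaces.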
