
Downloading the OntoNotes 5.0 Dataset (A Personal Account)


Preface

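Honestly, downloading this dataset took a lot of time and effort. This article covers only the full process of obtaining the OntoNotes Release 5.0 dataset, plus preprocessing the Chinese and English data for coreference resolution.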

Getting the Data

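  1. First, register an account on the official LDC website. Take care with the email address and organization fields; the other fields do not matter much.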

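  2. After registering, log in. You will see the message: "You currently have a guest account with LDC. If you'd like to be able to access more data please consider a membership for your organization."
    At this point, in theory, you have to wait for your organization's administrator to approve your request before you are allowed to download the free data. Until then there is nothing you can do.
    In practice, however, you may wait a long time without any response, and the request simply lapses. In that case, send an email to LDC from the address you registered with. Explain that you have waited for a while with no response to your request, perhaps because the administrator has graduated or left the school, and ask for help.
    A day or two later (because of the time difference) you will receive a reply with the organization administrator's email address and phone number. If your registration details are incomplete, the staff member will first tell you what needs to be filled in.

    The administrator at our school is a senior figure, so I only dared to bother him by email. Fortunately, he replied quickly and approved my request.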

  3. Log in again and you will find that you now belong to an organization. The content you can download is listed under downloads.

  4. Usually the dataset you want is not in the download list yet. You need to search for the dataset and place an order; a few days later the download appears on your downloads page.

Data Processing

The following method applies to obtaining the coreference resolution dataset.

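  1. I currently use the preprocessing script from this project to preprocess the data. Note that the code is written in Python 2 and has quite a few dependencies. Porting it to Python 3 is easy, though: just fix the places that raise errors, which are usually only the syntax for raising or catching exceptions and the print syntax.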

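  2. When the script finishes, it generates three JSON files per language in the same directory: the training, validation, and test sets.
    In each record, doc_key is the document's category, sentences holds the tokenized sentences of the conversation, speakers gives the speaker of each sentence, and clusters holds the clustering of mention phrases that refer to the same thing (coreference).
    English: train.english.jsonlines, dev.english.jsonlines, test.english.jsonlines
    Chinese: train.chinese.jsonlines, dev.chinese.jsonlines, test.chinese.jsonlines
    Sample data: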

{
   "doc_key": "bc/cctv/00/cctv_0001_0", "sentences": [["What", "kind", "of", "memory", "?"], ["We", "respectfully", "invite", "you", "to", "watch", "a", "special", "edition", "of", "Across", "China", "."], ["WW", "II", "Landmarks", "on", "the", "Great", "Earth", "of", "China", ":", "Eternal", "Memories", "of", "Taihang", "Mountain"], ["Standing", "tall", "on", "Taihang", "Mountain", "is", "the", "Monument", "to", "the", "Hundred", "Regiments", "Offensive", "."], ["It", "is", "composed", "of", "a", "primary", "stele", ",", "secondary", "steles", ",", "a", "huge", "round", "sculpture", "and", "beacon", "tower", ",", "and", "the", "Great", "Wall", ",", "among", "other", "things", "."], ["A", "primary", "stele", ",", "three", "secondary", "steles", ",", "and", "two", "inscribed", "steles", "."], ["The", "Hundred", "Regiments", "Offensive", "was", "the", "campaign", "of", "the", "largest", "scale", "launched", "by", "the", "Eighth", "Route", "Army", "during", "the", "War", "of", "Resistance", "against", "Japan", "."], ["This", "campaign", "broke", "through", "the", "Japanese", "army", "'s", "blockade", "to", "reach", "base", "areas", "behind", "enemy", "lines", ",", "stirring", "up", "anti-Japanese", "spirit", "throughout", "the", "nation", "and", "influencing", "the", "situation", "of", "the", "anti-fascist", "war", "of", "the", "people", "worldwide", "."], ["This", "is", "Zhuanbi", "Village", ",", "Wuxiang", "County", "of", "Shanxi", "Province", ",", "where", "the", "Eighth", "Route", "Army", "was", "headquartered", "back", "then", "."], ["On", "a", "wall", "outside", "the", "headquarters", "we", "found", "a", "map", "."], ["This", "map", "was", "the", "Eighth", "Route", "Army", "'s", "depiction", "of", "the", "Mediterranean", "Sea", "situation", "at", "that", "time", "."], ["This", "map", "reflected", "the
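The sample above is truncated, but the record layout can still be explored with a few lines of Python. The following is a minimal sketch and not part of the preprocessing project: `read_jsonlines` and `cluster_mentions` are hypothetical helper names, and the code assumes that each line of a `.jsonlines` file is one JSON object and that every entry in `clusters` is a list of `[start, end]` token spans, indexed over the flattened `sentences` with `end` inclusive. Check these assumptions against your own output before relying on it.

```python
import json
from pathlib import Path


def read_jsonlines(path):
    """Yield one document dict per non-empty line of a .jsonlines file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def cluster_mentions(doc):
    """Turn each cluster's token spans into readable mention strings.

    Assumption: spans are [start, end] token indices over the flattened
    sentences, with `end` inclusive.
    """
    tokens = [tok for sent in doc["sentences"] for tok in sent]
    return [
        [" ".join(tokens[start:end + 1]) for start, end in cluster]
        for cluster in doc["clusters"]
    ]


if __name__ == "__main__":
    # Hypothetical toy document in the same shape as the sample above;
    # it is not taken from OntoNotes, and the per-token speaker lists
    # are an assumption about the layout.
    toy = {
        "doc_key": "bc/toy/00/toy_0000_0",
        "sentences": [["This", "map", "is", "old", "."],
                      ["It", "is", "torn", "."]],
        "speakers": [["spk1"] * 5, ["spk1"] * 4],
        "clusters": [[[0, 1], [5, 5]]],
    }
    print(cluster_mentions(toy))
```

To inspect real output, replace the toy document with something like `next(read_jsonlines("train.english.jsonlines"))`.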
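For the Python 2 to Python 3 fixes mentioned in step 1 of the data processing, the errors are typically of the kind shown here. This is only an illustrative sketch: `parse_count` and `check_positive` are made-up functions, not code from the preprocessing script. The Python 2 forms appear in comments next to their Python 3 equivalents.

```python
def parse_count(text):
    """Parse an integer, returning None for bad input."""
    try:
        return int(text)
    # Python 2 form:  except ValueError, e:
    except ValueError as e:
        # Python 2 form:  print "skipping:", e
        print("skipping:", e)
        return None


def check_positive(n):
    """Return n unchanged, or raise if it is not positive."""
    if n <= 0:
        # Python 2 form:  raise ValueError, "n must be positive"
        raise ValueError("n must be positive")
    return n
```

In Python 3, `print` is a function, caught exceptions are bound with `as`, and `raise` takes an exception instance, so each of the commented Python 2 lines would be a syntax error.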