
Chinese summarization datasets commonly used in research papers


(1) Short texts

1) LCSTS from Harbin Institute of Technology (HIT)

How to load it:

    import pandas as pd
    from datasets import Dataset

    # Regex used as the field separator so that tag lines split away from the text
    TAG_SEP = r'<[/d|/s|do|su|sh][^a].*>'


    def load_lcsts_part(path):
        # A regex separator requires the Python parser engine.
        # on_bad_lines='warn' (pandas >= 1.3) replaces the removed
        # warn_bad_lines=True / error_bad_lines=False arguments.
        raw = pd.read_table(path, header=None, sep=TAG_SEP, engine='python',
                            on_bad_lines='warn', encoding='utf-8')
        text = raw[0].dropna().reset_index(drop=True)
        # Remaining rows alternate: summary (even index), document (odd index)
        pairs = pd.concat([text[1::2].reset_index(drop=True),
                           text[::2].reset_index(drop=True)], axis=1)
        pairs.columns = ['document', 'summary']
        return pairs


    # Note: both calls read PART_III.txt, as in the original script. Point the
    # first call at another LCSTS part if the training split should differ.
    lcsts_part_1 = load_lcsts_part(r'D:\softwares\zwj\nlp\project\long-document\datasets\PART_III.txt')
    lcsts_part_2 = load_lcsts_part(r'D:\softwares\zwj\nlp\project\long-document\datasets\PART_III.txt')
    dataset_train = Dataset.from_pandas(lcsts_part_1).shuffle(seed=42)
    dataset_valid = Dataset.from_pandas(lcsts_part_2).shuffle(seed=42)
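The loading script above relies on a regex separator to strip the tag lines. As a cross-check, here is a minimal sketch of a different approach that pulls the pairs out with the standard `re` module. It assumes each record wraps its summary in `<summary>` tags followed by the text in `<short_text>` tags; that layout, the `parse_lcsts` helper, and the sample strings are all assumptions for illustration, not something the script above confirms.

```python
import re

# Assumed layout of one LCSTS record (hypothetical, for illustration):
#   <doc id=0>
#       <summary> ... </summary>
#       <short_text> ... </short_text>
#   </doc>
PAIR_RE = re.compile(
    r"<summary>\s*(.*?)\s*</summary>\s*<short_text>\s*(.*?)\s*</short_text>",
    re.DOTALL,
)


def parse_lcsts(text):
    """Return a list of {'document', 'summary'} dicts found in `text`."""
    return [
        {"document": doc, "summary": summ}
        for summ, doc in PAIR_RE.findall(text)
    ]


# Made-up sample in the assumed layout
sample = """<doc id=0>
    <summary>
        摘要一
    </summary>
    <short_text>
        正文一
    </short_text>
</doc>
<doc id=1>
    <human_label>5</human_label>
    <summary>
        摘要二
    </summary>
    <short_text>
        正文二
    </short_text>
</doc>"""

pairs = parse_lcsts(sample)
print(pairs)
```

Because the pattern matches summary and text together, a record missing either tag is skipped instead of shifting every later pair by one row, which is the main risk of the alternating-row approach.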

(2) Medium-length texts

1) TTNews, the single-document news summarization test set from NLPCC 2017

2) CNewSum from ByteDance (NLPCC 2021)

Conversion script:

    # coding=utf-8
    import json

    import jsonlines
    from datasets import load_dataset

    json_data_path = r'./test.simple.anno.label.jsonl'
    output_path = r'./CNewSum_test_original.json'

    data = []
    with jsonlines.open(json_data_path) as reader:
        for obj in reader:
            # Concatenate the pieces of each field into one string
            article = ''.join(str(piece) for piece in obj['article'])
            summary = ''.join(str(piece) for piece in obj['summary'])
            data.append({'content': article, 'title': summary})

    # Write all records as one JSON array
    with open(output_path, 'w', encoding='UTF-8') as f:
        f.write(json.dumps(data, indent=4, sort_keys=False, ensure_ascii=False))

    # Load the converted file once it has been written and closed
    dataset_train = load_dataset('json', data_files=[output_path])
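To see what the conversion does to a single record without touching the real files, here is a small sketch using only the standard library. The `to_record` helper and the sample line are hypothetical; real CNewSum records may carry more fields than the two used here.

```python
import json


def to_record(obj):
    """Join the pieces of 'article' and 'summary' into single strings."""
    return {
        "content": "".join(str(piece) for piece in obj["article"]),
        "title": "".join(str(piece) for piece in obj["summary"]),
    }


# Hypothetical input line: 'article' as a list of sentences, 'summary' as a string
line = json.dumps(
    {"article": ["第一句。", "第二句。"], "summary": "一句摘要"},
    ensure_ascii=False,
)

record = to_record(json.loads(line))
print(json.dumps(record, ensure_ascii=False))
```

Joining works the same way whether a field is a list of sentences or a plain string, since iterating a string yields its characters and concatenating them gives back the original text.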

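(3) Long texts

1) CLTS from NLPCC 2020. The quality of this dataset is poor, however: a large share of its summaries are copied verbatim from the article body.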
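One rough way to check how extractive a dataset's summaries are is to count how many summary sentences appear verbatim in the article. The sketch below is only an illustration of that idea: the `copied_fraction` helper, the punctuation set used for splitting, and the sample texts are all assumptions, and it is not a measurement taken on CLTS.

```python
import re


def split_sentences(text):
    """Split on common Chinese and ASCII sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"[。！？!?；;\n]", text) if s.strip()]


def copied_fraction(article, summary):
    """Fraction of summary sentences that occur verbatim in the article."""
    sentences = split_sentences(summary)
    if not sentences:
        return 0.0
    copied = sum(1 for s in sentences if s in article)
    return copied / len(sentences)


# Made-up texts for illustration
article = "今天天气晴朗。市民纷纷外出游玩。公园里人很多。"
extractive_summary = "今天天气晴朗。公园里人很多。"
abstractive_summary = "好天气吸引大量市民出游。"

extractive_score = copied_fraction(article, extractive_summary)
abstractive_score = copied_fraction(article, abstractive_summary)
print(extractive_score, abstractive_score)
```

Averaging this score over a sample of article and summary pairs gives a quick signal: values close to 1 suggest the summaries are mostly lifted from the body text.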
