1) LCSTS from Harbin Institute of Technology (HIT)
Loading:
- import pandas as pd
- from datasets import Dataset
-
- # A regex separator needs engine='python'; on_bad_lines='warn' replaces the
- # warn_bad_lines/error_bad_lines flags, which were removed in pandas 2.0.
- # NOTE: this reads PART_III.txt, the same file as the validation split below;
- # change the path if a different part is intended for training.
- lcsts_part_1 = pd.read_table(r'D:\softwares\zwj\nlp\project\long-document\datasets\PART_III.txt', header=None,
-                              on_bad_lines='warn', sep='<[/d|/s|do|su|sh][^a].*>',
-                              engine='python', encoding='utf-8')
- lcsts_part_1 = lcsts_part_1[0].dropna()
- lcsts_part_1 = lcsts_part_1.reset_index(drop=True)
- # Odd rows become the 'document' column, even rows the 'summary' column.
- lcsts_part_1 = pd.concat([lcsts_part_1[1::2].reset_index(drop=True), lcsts_part_1[::2].reset_index(drop=True)], axis=1)
- lcsts_part_1.columns = ['document', 'summary']
-
- lcsts_part_2 = pd.read_table(r'D:\softwares\zwj\nlp\project\long-document\datasets\PART_III.txt', header=None,
-                              on_bad_lines='warn', sep='<[/d|/s|do|su|sh][^a].*>',
-                              engine='python', encoding='utf-8')
- lcsts_part_2 = lcsts_part_2[0].dropna()
- lcsts_part_2 = lcsts_part_2.reset_index(drop=True)
- lcsts_part_2 = pd.concat([lcsts_part_2[1::2].reset_index(drop=True), lcsts_part_2[::2].reset_index(drop=True)], axis=1)
- lcsts_part_2.columns = ['document', 'summary']
-
- # Dataset.from_pandas is the constructor intended for DataFrames.
- dataset_train = Dataset.from_pandas(lcsts_part_1).shuffle(seed=42)
- dataset_valid = Dataset.from_pandas(lcsts_part_2).shuffle(seed=42)
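The pairing step relies on the cleaned rows alternating between a summary and its document. A minimal sketch of that slicing on plain Python lists, using made-up placeholder rows (the strings are illustrative and not taken from LCSTS):

```python
# Hypothetical cleaned rows: a summary first, then its document, repeated.
rows = ["summary A", "document A", "summary B", "document B"]

# Odd positions hold documents and even positions hold summaries,
# mirroring the [1::2] and [::2] slices used on lcsts_part_1.
documents = rows[1::2]
summaries = rows[::2]

pairs = [{"document": d, "summary": s} for d, s in zip(documents, summaries)]
print(pairs[0])
```

If a bad line is skipped during parsing, the alternation shifts and later pairs may be misaligned, so comparing `len(documents)` with `len(summaries)` is a cheap first sanity check.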
2) TTNews, the single-document news test set from NLPCC 2017
3) CNewSum from ByteDance, NLPCC 2021
Conversion script:
- # coding=utf-8
- import json
- import jsonlines
- from datasets import load_dataset
-
- json_data_path = r'./test.simple.anno.label.jsonl'
- output_path = r'./CNewSum_test_original.json'
-
- data = []
- with jsonlines.open(json_data_path) as reader:
-     for obj in reader:
-         # Concatenate the pieces of each field into a single string.
-         article = ''.join(str(piece) for piece in obj['article'])
-         summary = ''.join(str(piece) for piece in obj['summary'])
-         data.append({'content': article, 'title': summary})
-
- # Write all records as one JSON array, keeping Chinese characters unescaped.
- with open(output_path, 'w', encoding='UTF-8') as f:
-     json.dump(data, f, indent=4, sort_keys=False, ensure_ascii=False)
-
- dataset_train = load_dataset('json', data_files=[output_path])
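The same conversion can be sketched with only the standard library, since a JSON Lines file holds one JSON object per line. The records below are made-up placeholders, and treating `article` as a list of sentence strings is an assumption of this sketch rather than something confirmed by the dataset documentation:

```python
import json

# Made-up JSON Lines content standing in for test.simple.anno.label.jsonl.
jsonl_text = "\n".join([
    json.dumps({"article": ["First sentence. ", "Second sentence."],
                "summary": "Short summary."}),
    json.dumps({"article": ["Another article."],
                "summary": "Another summary."}),
])


def convert(lines):
    """Turn JSON Lines records into a list of {'content', 'title'} dicts."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)
        records.append({
            "content": "".join(str(piece) for piece in obj["article"]),
            "title": "".join(str(piece) for piece in obj["summary"]),
        })
    return records


converted = convert(jsonl_text.splitlines())
print(json.dumps(converted, indent=4, ensure_ascii=False))
```

Joining a string character by character returns the same string, so the loop behaves the same whether a field is stored as a list of pieces or as a single string.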
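4) CLTS from NLPCC 2020. This dataset is of poor quality, however: a large share of its summaries are copied verbatim from the article body.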
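One way to check how extractive a dataset's summaries are is to measure how many of a summary's character n-grams also appear in its article. The function below is a rough sketch, not an established metric: the name `copied_ngram_ratio` and the default `n=4` are hypothetical choices, and the sample strings are made up. Character n-grams are used because they need no word segmentation, which suits Chinese text.

```python
def copied_ngram_ratio(summary: str, article: str, n: int = 4) -> float:
    """Fraction of the summary's character n-grams that also occur in the article."""
    grams = [summary[i:i + n] for i in range(len(summary) - n + 1)]
    if not grams:
        return 0.0  # summary shorter than n characters
    return sum(1 for g in grams if g in article) / len(grams)


# Made-up example strings, not taken from CLTS.
article = "the quick brown fox jumps over the lazy dog"
extractive = "quick brown fox"            # copied straight from the article
abstractive = "a speedy animal leaps"     # reworded

print(copied_ngram_ratio(extractive, article))
print(copied_ngram_ratio(abstractive, article))
```

A ratio close to 1 across most of a dataset would suggest that its summaries are largely lifted from the source text.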