Link: https://pan.baidu.com/s/1PGCIz-yub3ugXYuNivlZzw
Extraction code: nuwl
The archive contains four datasets, of which chnsenticorp is the main one.
chnsenticorp comes in four variants.
Taking the 6000-review variant as an example: it has a pos and a neg folder, each holding 3000 .txt files, and each file contains one review of the corresponding sentiment.
The first step is to merge them into two .txt files, one per sentiment, for easier later use:
```python
import os
import codecs

folder = ["./neg", "./pos"]
record = dict()  # count of reviews merged per folder
for fold in folder:
    record[fold] = 0
    out_file = fold + "_6000.txt"
    out = codecs.open(out_file, "w", errors="ignore", encoding="gbk")
    for _, _, filenames in os.walk(fold):
        for filename in filenames:
            file = codecs.open(os.path.join(fold, filename).replace("\\", '/'),
                               "r", errors="ignore", encoding="gbk")
            context = file.read()
            file.close()
            # collapse each review onto a single line
            context = context.replace('\n', '').replace('\r', '') + "\n"
            out.writelines(context)
            record[fold] += 1
    out.close()
print("record:", record)
```
Next, convert the two files to JSON and give each sentence an id (I need this later for a Transformer; readers who don't can skip it):
```python
import json
import random

def shuffle2list(a: list, b: list):
    # shuffle two lists with the same rule; sklearn.utils.shuffle also works
    c = list(zip(a, b))
    random.shuffle(c)
    a[:], b[:] = zip(*c)
    return a, b

sen_lis = []
label_lis = []  # pos: 1; neg: 0
res = []
with open("./pos_6000.txt", "r", errors="ignore", encoding="gbk") as pos, open(
        "./neg_6000.txt", "r", errors="ignore", encoding="gbk") as neg:
    for line in pos.readlines():
        sen_lis.append(line.strip("\n"))
        label_lis.append(1)
    for line in neg.readlines():
        sen_lis.append(line.strip("\n"))
        label_lis.append(0)

sen_lis, label_lis = shuffle2list(sen_lis, label_lis)
for i in range(len(sen_lis)):
    item = dict()
    item["guid"] = i          # id assigned after shuffling
    item["text_a"] = sen_lis[i]
    item["label"] = label_lis[i]
    res.append(item)
print("all of %d instances" % (i + 1))

# write UTF-8 explicitly so the Chinese text survives regardless of the
# platform's default encoding
with open("./ChnSenticrop.json", "w", encoding="utf-8") as jfile:
    json.dump(res, jfile, ensure_ascii=False)
```
After shuffling: