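For installation, see: doccano安装使用教程 (doccano installation and usage tutorial)
After opening doccano
Log in from the top-right corner; the interface language can also be switched there (for example to Chinese)
Click Create in the top-left corner
Choose Sequence Labeling
Dataset
Import
(Each line of the text file is one sentence. As an example, I used a few chapter titles from Water Margin (水浒传).)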
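The import step expects a plain-text file with one sentence per line. A minimal sketch of preparing such a file is shown here; the helper name `write_sentences` and the output file name are hypothetical and are not part of doccano.

```python
from pathlib import Path


def write_sentences(sentences, path):
    """Write one sentence per line as UTF-8 (hypothetical helper, not a doccano API)."""
    # Newlines inside a sentence would split it into two records, so flatten them.
    cleaned = [s.strip().replace("\n", " ") for s in sentences]
    # Skip entries that are empty after stripping.
    cleaned = [s for s in cleaned if s]
    Path(path).write_text("".join(s + "\n" for s in cleaned), encoding="utf-8")
    return len(cleaned)
```

For example, `write_sentences(["张天师祈禳瘟疫 洪太尉误走妖魔", "王教头私走延安府 九纹龙大闹史家村"], "titles.txt")` writes two lines, and the resulting file can then be selected in the import dialog.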
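Once the import succeeds, the sentences show up in the dataset list
Labels (this example uses 人物, "person", and 动作, "action")
Click Start Annotation in the top-left corner
When annotation is finished, export the data
Unzip the download to get admin.jsonl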
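Before converting, it can help to check what one exported record contains. The sketch below assumes the record layout seen in this export: a `text` field plus a `label` list of `[start, end, name]` entries with an exclusive end offset. Other doccano versions may use different keys, and `extract_spans` is a hypothetical helper name.

```python
import json


def extract_spans(jsonl_line):
    """Return (span_text, label_name) pairs for one exported record."""
    record = json.loads(jsonl_line)
    text = record["text"]
    # Assumption: each entry is [start, end, name] and end is exclusive.
    return [(text[start:end], name) for start, end, name in record["label"]]


sample = '{"id":1,"text":"张天师祈禳瘟疫 洪太尉误走妖魔","label":[[0,3,"人物"],[3,5,"动作"],[8,11,"人物"],[11,13,"动作"]]}'
for span, name in extract_spans(sample):
    print(span, name)
```

If a printed span does not match what was highlighted in the annotation view, the offsets are probably being interpreted differently, and the BIO conversion would be misaligned as well.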
import json


def json2bio(fpath, output):
    """Convert a doccano JSONL export into character-level BIO format."""
    # The output is opened once in 'w' mode, so re-running does not duplicate data.
    with open(fpath, encoding='utf-8') as fin, open(output, 'w', encoding='utf-8') as fout:
        for line in fin:
            if not line.strip():
                continue
            # e.g. {"id":1,"text":"张天师祈禳瘟疫 洪太尉误走妖魔","label":[[0,3,"人物"],[3,5,"动作"],[8,11,"人物"],[11,13,"动作"]]}
            annotations = json.loads(line)
            text = annotations['text'].replace('\n', ' ')  # '张天师祈禳瘟疫 洪太尉误走妖魔'
            # Spaces become ',' so every output line still splits into "token tag".
            all_words = list(text.replace(' ', ','))  # ['张', '天', '师', ..., ',', '洪', ...]
            all_label = ['O'] * len(all_words)
            for start, end, label in annotations['label']:  # e.g. [0, 3, '人物']
                all_label[start] = 'B-' + label
                for idx in range(start + 1, end):
                    all_label[idx] = 'I-' + label
            # Write one "token tag" pair per line.
            for word, tag in zip(all_words, all_label):
                fout.write(word + ' ' + tag + '\n')
            fout.write('\n')  # a blank line separates sentences
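This converts the export into a BIO-format training set: one character and its tag per line, with a blank line between sentences.
You can now train with this dataset.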
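Most training scripts need the BIO file loaded back as token and tag sequences. The reader below is a minimal sketch, assuming the layout described above (a single-character token, one space, then the tag, with blank lines between sentences). `read_bio` is a hypothetical helper and is not tied to any particular training framework.

```python
def read_bio(path):
    """Read a BIO file into a list of (tokens, tags) pairs, one per sentence."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if not line.strip():
                # A blank line ends the current sentence.
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            # Split on the first space only: token on the left, tag on the right.
            token, tag = line.split(" ", 1)
            tokens.append(token)
            tags.append(tag)
    if tokens:  # the file may not end with a blank line
        sentences.append((tokens, tags))
    return sentences
```

Each returned pair holds two lists of equal length, so they can be zipped directly or mapped to label ids before training.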