My first task as an intern
The training set of the People's Daily dataset is annotated in BIO format.
We then convert it to BIOES and BMES.
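To make the difference between the three schemes concrete, here is a small sketch that prints the same sentence tagged in each one. The sentence and the hyphenated `PER`/`LOC` label names are made up for illustration; the real dataset may use other entity types or another separator.

```python
# Hypothetical sentence "王小明在北京" with one PER and one LOC entity.
chars = ["王", "小", "明", "在", "北", "京"]

# BIO: B = first character of an entity, I = any later character, O = outside.
bio = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC"]

# BMES: B = begin, M = middle, E = end, S = single-character entity (O kept).
bmes = ["B-PER", "M-PER", "E-PER", "O", "B-LOC", "E-LOC"]

# BIOES: same as BMES, except the middle tag is written I instead of M.
bioes = ["B-PER", "I-PER", "E-PER", "O", "B-LOC", "E-LOC"]

# Print one character per line with its BIO, BMES and BIOES labels.
for row in zip(chars, bio, bmes, bioes):
    print(" ".join(row))
```

The only places where the schemes disagree are the non-initial characters of an entity, which is why the conversions only need to look at the current label and the next one.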
First, BIO to BMES:
    path = r'./input/data_train.txt'
    res_path = r'./output/BMES.txt'

    # Read the BIO file: one "char label" pair per line, blank line between sentences.
    sentences = []
    sentence = []
    label_set = set()
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line.strip():
                # A blank line closes the current sentence.
                if sentence:
                    sentences.append(sentence)
                    sentence = []
                continue
            splits = line.split(' ')
            sentence.append([splits[0], splits[-1]])
            label_set.add(splits[-1])
    # Keep the last sentence if the file does not end with a blank line.
    if sentence:
        sentences.append(sentence)

    with open(res_path, 'w', encoding='utf-8') as f1:
        for sen in sentences:
            for index, (char, label) in enumerate(sen):
                # The entity continues only if the next label is an I tag.
                continues = index < len(sen) - 1 and sen[index + 1][1][0] == 'I'
                if label[0] == 'B':
                    # B followed by I stays B; otherwise (next is O, B, or
                    # end of sentence) it is a single-character entity: S.
                    if not continues:
                        label = 'S' + label[1:]
                elif label[0] == 'I':
                    # I inside an entity becomes M; the last I becomes E.
                    label = ('M' if continues else 'E') + label[1:]
                # O labels are written unchanged.
                f1.write(f'{char} {label}\n')
            # Blank line between sentences.
            f1.write('\n')

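The conversion rule above can also be written as a pure function on one sentence's label list, which makes it easy to check on small inputs without touching any files. This is only a sketch: `bio_to_bmes` is a name I made up, and it assumes every label starts with `B`, `I` or `O`, as the script above does.

```python
def bio_to_bmes(labels):
    """Convert one sentence's BIO labels to BMES (O is kept unchanged)."""
    out = []
    for i, label in enumerate(labels):
        # Treat the position after the last label as O.
        nxt = labels[i + 1] if i + 1 < len(labels) else "O"
        continues = nxt.startswith("I")
        if label.startswith("B"):
            # B stays B only when the entity goes on; otherwise it is S.
            out.append(label if continues else "S" + label[1:])
        elif label.startswith("I"):
            # I becomes M inside the entity and E on its last character.
            out.append(("M" if continues else "E") + label[1:])
        else:
            out.append(label)
    return out


print(bio_to_bmes(["B-PER", "I-PER", "I-PER", "O", "B-LOC", "B-LOC", "I-LOC"]))
# → ['B-PER', 'M-PER', 'E-PER', 'O', 'S-LOC', 'B-LOC', 'E-LOC']
```

The second-to-last entity in the example is a `B` directly followed by another `B`, which is the case that is easy to miss: it has to become `S`, not stay `B`.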
Next, BMES to BIOES:
    # Convert BMES labels to BIOES: only the M (middle) prefix changes, to I.
    with open(r'./output/BMES.txt', 'r', encoding='utf-8') as f, \
            open(r'./output/BIOES.txt', 'w', encoding='utf-8') as f1:
        for line in f:
            parts = line.split()
            if not parts:
                # Keep the blank line that separates sentences.
                f1.write('\n')
                continue
            # parts[0] is the character; every following column is a label.
            labels = ['I' + lab[1:] if lab[0] == 'M' else lab for lab in parts[1:]]
            f1.write(' '.join([parts[0]] + labels) + '\n')
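As a quick sanity check, the same mapping can be sketched on plain label lists, together with the reverse step from BIOES back to BIO so the result can be compared against the original training labels. Both function names are my own, and the sample labels are invented.

```python
def bmes_to_bioes(labels):
    """BMES -> BIOES: rename the middle tag M to I, leave the rest alone."""
    return ["I" + lab[1:] if lab.startswith("M") else lab for lab in labels]


def bioes_to_bio(labels):
    """BIOES -> BIO: an end tag E becomes I, a single tag S becomes B."""
    mapping = {"E": "I", "S": "B"}
    return [mapping.get(lab[0], lab[0]) + lab[1:] for lab in labels]


bmes = ["B-PER", "M-PER", "E-PER", "O", "S-LOC"]
bioes = bmes_to_bioes(bmes)
print(bioes)                # ['B-PER', 'I-PER', 'E-PER', 'O', 'S-LOC']
print(bioes_to_bio(bioes))  # ['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC']
```

If the round trip does not reproduce the BIO labels you started from, something went wrong in one of the conversion steps.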
The recall you get differs depending on the tagging scheme, so these conversions will come in handy often.
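For reference, here is a rough sketch of how entity-level recall can be computed from BIOES labels. It is not the evaluation script used in the task: the function names, the hyphen separator and the sample labels are all assumptions, and for simplicity the entity type is taken from the closing `E` or `S` tag only.

```python
def extract_entities(labels):
    """Return a set of (start, end, type) spans from BIOES labels (end inclusive)."""
    entities, start = set(), None
    for i, lab in enumerate(labels):
        prefix, _, etype = lab.partition("-")
        if prefix == "S":
            entities.add((i, i, etype))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            entities.add((start, i, etype))
            start = None
        elif prefix == "O":
            start = None
    return entities


def recall(gold, pred):
    """Share of gold entities that the prediction reproduces exactly."""
    gold_ents, pred_ents = extract_entities(gold), extract_entities(pred)
    return len(gold_ents & pred_ents) / len(gold_ents) if gold_ents else 0.0


gold = ["B-PER", "I-PER", "E-PER", "O", "S-LOC"]
pred = ["B-PER", "I-PER", "E-PER", "O", "O"]
print(recall(gold, pred))  # 0.5
```

Because an entity only counts when its whole span and type match, a model trained on a scheme with explicit `E` and `S` tags can end up with different boundary predictions, and so a different recall, than one trained on plain BIO.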