
NLP Study Notes: Named Entity Recognition (BERT-Att-CRF)

Contents

I. Approach and Steps

II. Model Structure and Code Implementation

config.py

dataload.py

CRF.py

BERT_ATT_CRF.py

main.py

实体预测结果提取.py


I. Approach and Steps

1. Get the data: chapters of the novel 《斗破苍穹》, annotated by hand. Data link: 斗破苍穹数据, extraction code: jkzi

2. Clean the data: wherever a pattern allows it, remove text that is not part of the story (for example the author's notes, requests for monthly votes, and so on).

3. Split the chapter text into shorter passages or sentences: a full chapter is far too long and slows the model down considerably.

4. Turn the text into vectors with a pretrained model (BERT, XLNet, etc.) or with an alternative such as Word2Vec: word vectors from a pretrained model usually work better. For Chinese NER the text is normally tokenized character by character, with one label per character, as in the sketch below.
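A minimal sketch of this character-level encoding with the Hugging Face tokenizer (assuming the bert-base-chinese checkpoint used later in the project; the sample sentence is made up):

from transformers import BertTokenizer

# assumes bert-base-chinese is available locally or from the Hugging Face hub
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
encoded = tokenizer('萧炎修炼斗气', padding='max_length', max_length=12, truncation=True)
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
print(tokens)
# expected: ['[CLS]', '萧', '炎', '修', '炼', '斗', '气', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
# one character per token, so each character can carry exactly one NER label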

5. Preprocess the data: normalize the encoded text and labels into fixed-shape tensors: the text tensor has shape (batch size, max sequence length, word-vector dim), and the labels are either one-hot with shape (batch size, max sequence length, number of classes) or, as in the CRF code below, integer ids with shape (batch size, max sequence length). A quick shape check follows.
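A quick sanity check of those shapes with the settings from config.py below (batch_size=12, max_seq_len=200, word_dim=768); the tensors are dummies, only the shapes matter:

import torch

batch_size, max_seq_len, word_dim = 12, 200, 768  # values from config.py
input_ids = torch.zeros(batch_size, max_seq_len, dtype=torch.long)  # token ids fed to BERT
hidden = torch.zeros(batch_size, max_seq_len, word_dim)             # BERT output, the "word vectors"
labels = torch.zeros(batch_size, max_seq_len, dtype=torch.long)     # one label id per character (CRF style)
print(input_ids.shape, hidden.shape, labels.shape)
# torch.Size([12, 200]) torch.Size([12, 200, 768]) torch.Size([12, 200])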

6. Build the model: usually only the encoder part needs to be built, followed by a fully connected layer as the classifier; the loss comes from a conditional random field (CRF), which captures the dependencies between adjacent labels, as formalized below. For an introduction to CRFs, see: 机器学习(有监督)——条件随机场CRF
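Concretely, for emission scores $E$ (from the classifier) and transition scores $T$ (learned by the CRF), a label path $y = (y_1, \dots, y_n)$ for input $x$ is scored and normalized over all possible paths:

$$
\mathrm{score}(x, y) = \sum_{i=1}^{n} E(x_i, y_i) + \sum_{i=1}^{n-1} T(y_i, y_{i+1}), \qquad
p(y \mid x) = \frac{e^{\mathrm{score}(x, y)}}{\sum_{y'} e^{\mathrm{score}(x, y')}}
$$

The training loss is the negative log-likelihood $-\log p(y \mid x) = \log \sum_{y'} e^{\mathrm{score}(x, y')} - \mathrm{score}(x, y)$, which is exactly the denominator-minus-numerator quantity computed in CRF.py below.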

7. Tune, train, and evaluate: the main hyperparameters to tune are the learning rate, the number of layers, and the number of training epochs. Evaluation usually reports accuracy along with precision, recall, and F1, as in the sketch below.
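A minimal sketch of reading per-class precision, recall, and F1 off a confusion matrix (toy 3-class counts; the formulas mirror the acc_prf1 method in main.py below):

import torch

# rows = true class, columns = predicted class (made-up counts)
conf = torch.tensor([[50., 2., 3.],
                     [4., 40., 6.],
                     [1., 5., 30.]])
accuracy = torch.diag(conf).sum() / conf.sum()            # correct / all
precision = torch.diag(conf) / (conf.sum(dim=0) + 1e-8)   # correct / predicted, per class
recall = torch.diag(conf) / (conf.sum(dim=1) + 1e-8)      # correct / actual, per class
f1 = 2 * precision * recall / (precision + recall + 1e-8)
print(accuracy, precision, recall, f1)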

II. Model Structure and Code Implementation

1. Model: Bert-Att-CRF, composed of BERT, a self-attention block, and a CRF layer.

2. Project layout: bert-base-chinese (the pretrained BERT model), Bert_att_crf (files produced during training), data (the data files).

3. The code files:

Project repository: EntityRecognition · 唯有读书高/Knowledge Graph - 码云 - 开源中国 (gitee.com)

config.py
import os
import torch


class Config(object):
    def __init__(self):
        self.save_file_name = 'Bert_att_crf'
        self.base_path = os.path.abspath('./')  # absolute path of the current directory
        self.min_seq_len = 150
        self.max_seq_len = 200
        self.learning_rate = 1e-5
        self.drop_rate = 1e-2
        self.batch_size = 12
        self.label_num = 23
        self.layer_num = 2
        self.epoch = 20
        self.word_dim = 768
        self.save_model_path = os.path.join(self.base_path, self.save_file_name, 'model_weights.pth')
        self.Bert_path = os.path.join(self.base_path, 'bert-base-chinese')
        self.do_lower_case = True
        self.data_set_path = r'data/斗破苍穹_实体识别模型训练数据.xlsx'
        # prefer the GPU when available
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataload.py
import pickle
from torch.utils.data import Dataset
import pandas as pd
from config import Config
import os


class NERDataset(Dataset):
    def __init__(self, config, tokenizer):
        """
        :param config: project configuration
        :param tokenizer: tokenizer of the pretrained model
        """
        super(NERDataset, self).__init__()
        self.config = config
        self.tokenizer = tokenizer

    def load_pkl(self, path: str):
        """Load a pkl file"""
        with open(path, 'rb') as f:
            data = pickle.load(f)
        return data

    def save_pkl(self, data, path: str):
        """Save a pkl file"""
        with open(path, 'wb') as f:
            pickle.dump(data, f)

    def save_label_data(self, label: list, label_pkl_path: str):
        """
        :param label: label data
        :param label_pkl_path: save path
        :return:
        """
        label_set = '、'.join(label).split('、')
        label_set = list(set(label_set))  # deduplicate
        label_set.extend(['<START>', '<END>'])  # add the special labels
        label2id = {value: idx for idx, value in enumerate(label_set)}
        id2label = {idx: value for idx, value in enumerate(label_set)}
        label_dict = {'label2id': label2id, 'id2label': id2label}
        self.save_pkl(label_dict, label_pkl_path)  # save the label file
        return label_dict

    def read_excel(self, excel_path: str, sheet_name: str = 'Sheet1',
                   train_mode: bool = True) -> (list, dict):
        """
        :param excel_path: path of the Excel file
        :param sheet_name: sheet name
        :param train_mode: whether we are in training mode
        :return:
        """
        excel_path = os.path.join(self.config.base_path, excel_path)
        data = pd.read_excel(excel_path, sheet_name=sheet_name)
        # training mode
        if train_mode:
            text = data['文本'].tolist()
            label = data['标签'].tolist()
            # build the label dictionary
            os.makedirs(f'./{self.config.save_file_name}', exist_ok=True)
            label_pkl_path = os.path.join(self.config.base_path, self.config.save_file_name, "label_dict.pkl")
            # reuse the saved label mapping if it exists and its size matches the config
            if os.path.exists(label_pkl_path):
                label_dict = self.load_pkl(label_pkl_path)  # {'label2id': label2id, 'id2label': id2label}
                if len(label_dict['id2label']) != self.config.label_num:
                    label_dict = self.save_label_data(label, label_pkl_path)  # rebuild and keep the fresh mapping
            else:  # otherwise build it from the data set
                label_dict = self.save_label_data(label, label_pkl_path)
            # build the annotated text data
            line = [[' '.join(list(text[i])), ' '.join(label[i].split('、'))] for i in range(len(text))]
        else:  # prediction mode
            text = data['文本'].tolist()
            title = data['标题'].tolist()
            # load the label dictionary
            label_pkl_path = os.path.join(self.config.base_path, self.config.save_file_name, "label_dict.pkl")
            # {'label2id': label2id, 'id2label': id2label}
            label_dict = self.load_pkl(label_pkl_path)
            # split each chapter into sentences within the maximum sequence length, cutting at punctuation
            line = []
            for i in range(len(text)):  # iterate over the chapters
                this_text = text[i].replace('\n', '').replace(' ', '')
                this_title = title[i].replace('\n', '')
                start_idx = 0  # first index of the current sentence
                j = start_idx  # index currently being scanned
                min_len = self.config.min_seq_len  # minimum sentence length
                while j < len(this_text):
                    # the candidate end is inside the text and lands on a terminating symbol
                    if j + min_len < len(this_text) and this_text[j + min_len] in '。?!,':
                        end_idx = j + min_len
                        this_sentence = this_text[start_idx:end_idx + 1]  # extract the sentence
                        line.append((this_title, this_sentence))
                        start_idx = end_idx + 1  # update the start index
                        j = start_idx  # update the scan index
                    # past the end of the text (whatever little is left)
                    elif j + min_len >= len(this_text):
                        this_sentence = this_text[start_idx:]  # take the remaining text
                        line.append((this_title, this_sentence))
                        break
                    # otherwise move on to the next index
                    else:
                        j += 1
        return line, label_dict

    def data_process(self, excel_path: str, sheet_name: str = 'Sheet1',
                     train_mode: bool = True) -> (list, dict):
        """
        :param excel_path: path of the Excel file
        :param sheet_name: sheet name
        :param train_mode: whether we are in training mode
        :return:
        """
        # read the data
        pre_proces_line = []
        line, label_dict = self.read_excel(excel_path, sheet_name, train_mode=train_mode)
        # training mode
        if train_mode:
            label2id = label_dict['label2id']  # label-to-id mapping
            for index, item in enumerate(line):
                text = item[0].split(' ')
                label = item[1].split(' ')
                # encode with BERT's tokenizer
                max_seq_length = self.config.max_seq_len
                encoded_dict = self.tokenizer(''.join(text), padding='max_length', max_length=max_seq_length,
                                              truncation=True)
                decoded_text = self.tokenizer.convert_ids_to_tokens(encoded_dict['input_ids'])
                # pad short label sequences, truncate long ones
                label = [label2id[seq] for seq in label]
                if len(label) >= max_seq_length - 2:  # truncate
                    label = [label2id["<START>"]] + label[:max_seq_length - 2] + [label2id["<END>"]]
                else:  # pad
                    label = [label2id["<START>"]] + label + [label2id["<END>"]]
                while len(label) < max_seq_length:
                    label.append(-1)
                text = encoded_dict['input_ids']  # input token ids
                mask = encoded_dict['attention_mask']  # attention mask
                token_type_ids = encoded_dict['token_type_ids']  # token type ids of the input sequence
                assert len(text) == len(label) == len(mask)
                pre_proces_line.append({'text': text, 'mask': mask, 'label': label,
                                        'token_type_ids': token_type_ids, 'str_text': decoded_text})
            return pre_proces_line, label_dict
        # prediction mode
        else:
            for index, item in enumerate(line):
                title = item[0]
                text = item[1]
                # encode with BERT's tokenizer
                max_seq_length = self.config.max_seq_len
                encoded_dict = self.tokenizer(text, padding='max_length', max_length=max_seq_length,
                                              truncation=True)
                decoded_text = self.tokenizer.convert_ids_to_tokens(encoded_dict['input_ids'])
                text = encoded_dict['input_ids']
                mask = encoded_dict['attention_mask']
                token_type_ids = encoded_dict['token_type_ids']
                pre_proces_line.append({'text': text, 'mask': mask, 'label': title,
                                        'token_type_ids': token_type_ids, 'str_text': decoded_text})
                assert len(text) == len(decoded_text)
            return pre_proces_line, label_dict


if __name__ == '__main__':
    from transformers import BertTokenizer

    tokenizer_ = BertTokenizer.from_pretrained(Config().Bert_path, do_lower_case=Config().do_lower_case)
    dataset = NERDataset(Config(), tokenizer_)
    pre_processing_line, label_tag_dict = dataset.data_process(r'data/斗破苍穹(标注与未标注数据).xlsx',
                                                               sheet_name='未标注数据', train_mode=False)
    print(label_tag_dict)
    print(len(label_tag_dict['label2id']))
CRF.py

        You can also simply use the CRF shipped with TorchCRF; this one was written by hand to understand how a CRF works (a drop-in sketch follows the listing below).

import torch.nn as nn
import torch
from torch import FloatTensor, Tensor, BoolTensor
from config import Config


class CRF(nn.Module):
    def __init__(self, num_labels: int):
        super(CRF, self).__init__()
        self.config = Config()
        self.num_labels = num_labels
        # transition matrix, initialized from a uniform distribution
        self.transfer_matrix = nn.Parameter(torch.empty(self.num_labels, self.num_labels))
        nn.init.uniform_(self.transfer_matrix, -0.1, 0.1)
        # start scores, initialized from a uniform distribution
        self.start_matrix = nn.Parameter(torch.empty(self.num_labels))
        nn.init.uniform_(self.start_matrix, -0.1, 0.1)
        # end scores, initialized from a uniform distribution
        self.end_matrix = nn.Parameter(torch.empty(self.num_labels))
        nn.init.uniform_(self.end_matrix, -0.1, 0.1)

    def forward(self, x: FloatTensor, y: Tensor, mask: BoolTensor) -> Tensor:
        """
        In log space the numerator/denominator ratio becomes a subtraction; the higher the
        desired path probability, the closer this value gets to 0 from the negative side.
        :param x: emission features (typically extracted by BERT, an RNN, etc.)
        :param y: label sequence
        :param mask: padding mask (the sequence contains <pad> tokens, and so do the labels)
        :return: log-likelihood (negative; negate it to obtain a positive loss)
        Formula: probability = score of the labeled path / total score over all paths.
        Taking the log turns this into numerator - denominator; negating it gives the loss.
        """
        molecule = self.formula_molecule(x, y, mask).to(self.config.device)
        denominator = self.formula_denominator(x, mask).to(self.config.device)
        loss = molecule - denominator
        return loss

    def formula_molecule(self, x: FloatTensor, y: Tensor, mask: BoolTensor) -> Tensor:
        """
        Numerator of the formula: the score of the labeled path.
        :param x: emission features
        :param y: label sequence
        :param mask: padding mask
        :return: numerator score
        """
        batch_size, len_seq, _ = x.size()
        batch_idx = torch.arange(batch_size)  # tensor([0, 1, ..., batch_size - 1])
        first_y = y[:, 0]  # first label of each sequence
        last_y = y[:, -1]  # last label of each sequence
        # transition score from <start> to the first label
        score = self.start_matrix[first_y]
        # scores of the middle part
        for i in range(len_seq - 1):
            now_y = y[:, i]  # current label y1
            next_y = y[:, i + 1]  # next label y2
            now_mask = mask[:, i]  # exclude padded positions
            next_mask = mask[:, i + 1]
            transfer = self.transfer_matrix[now_y, next_y]  # transition weight y1 -> y2 at this step
            now_x = x[batch_idx, i, now_y]  # emission score of the current label
            score += now_x * now_mask + transfer * next_mask
        # final score: add the transition to <end>
        score += self.end_matrix[last_y]
        return score

    def formula_denominator(self, x: FloatTensor, mask: BoolTensor):
        """
        Denominator: total score over all paths (all transitions and labels).
        It grows with the effective sequence length.
        :param x: emission features
        :param mask: padding mask
        :return: denominator score
        """
        batch_size, len_seq, _ = x.size()
        # reshape the tensors
        mask = mask.unsqueeze(-1).expand(batch_size, len_seq, self.num_labels).bool()
        start_matrix = self.start_matrix.unsqueeze(0).expand(batch_size, self.num_labels)
        end_matrix = self.end_matrix.unsqueeze(0).expand(batch_size, self.num_labels)
        # first token
        x_0 = x[:, 0]
        score = start_matrix + x_0
        # middle tokens
        for i in range(1, len_seq):
            this_x = x[:, i].unsqueeze(1)
            this_mask = mask[:, i]
            this_score = score.unsqueeze(-1) + self.transfer_matrix + this_x  # scores at this step
            this_score = torch.logsumexp(this_score, dim=1)  # sum over the previous-label dimension
            score = torch.where(this_mask, this_score, score)  # update only where the mask is True
        # last token
        score = score + end_matrix
        score = torch.logsumexp(score, dim=1)  # sum over the label dimension
        return score

    def viterbi_decode(self, x: FloatTensor, mask: BoolTensor):
        """
        At prediction time, decode with the Viterbi algorithm to obtain the predicted labels.
        :param x: emission features
        :param mask: padding mask
        :return: label paths, e.g. [[label ids], [label ids], ...]
        """
        batch_size, len_seq, _ = x.size()
        # keep the highest-scoring path with Viterbi
        start_matrix = self.start_matrix.unsqueeze(0).expand(batch_size, self.num_labels)
        x_0 = x[:, 0]  # emissions of the first position
        score = [start_matrix + x_0]  # running Viterbi scores
        path = []  # backpointers of the best path
        for i in range(1, len_seq):
            # emissions at the current position
            x_i = x[:, i].unsqueeze(1)
            # accumulate the path scores
            this_score = score[i - 1].unsqueeze(-1) + self.transfer_matrix + x_i
            # for each current label, keep the best previous label and its score
            # (comparison happens within the same current label, not across different ones),
            # so the result has shape (batch_size, num_labels)
            last_score, last_path = this_score.max(1)
            score.append(last_score)  # keep the updated scores for the next step
            path.append(last_path)
        # decode the selected path
        effective_length = mask.sum(dim=1)  # effective sequence lengths (without padding)
        new_path = []
        _, max_index = score[-1].max(1)  # best final label from the last step
        # append the result (decoding runs backwards, so the path comes out reversed)
        new_path.append(max_index.tolist())
        for i in range(len(path)):
            rear_path = path[-1 - i]  # backpointers of the i-th position from the end
            batch_id = torch.arange(batch_size)
            max_index = rear_path[batch_id, max_index]  # look up the best previous label via max_index
            new_path.append(max_index.tolist())
        new_path = torch.tensor(new_path).T
        new_path = torch.flip(new_path, [1]).tolist()  # the path was built backwards, so reverse each row
        new_path = [new_path[i][:effective_length[i]] for i in range(batch_size)]  # keep only the valid part
        return new_path


if __name__ == '__main__':
    labels = ['a', 'b', 'c']
    X = torch.FloatTensor([[[0.1, 0.2, 0.8], [0.3, 0.8, 0.3], [0.5, 0.6, 0.3]],
                           [[0.3, 0.2, 0.5], [0.3, 0.2, 0.8], [0.9, 0.1, 0.6]],
                           [[0.7, 0.8, 0.8], [0.9, 0.1, 0.8], [0.2, 0.3, 0.6]]])
    Y = torch.LongTensor([[0, 1, 1],
                          [2, 0, 1],
                          [0, 2, 1]])
    Mask = torch.LongTensor([[1, 1, 1],
                             [1, 1, 0],
                             [1, 1, 1]])
    crf = CRF(len(labels))
    Loss = crf.forward(X, Y, Mask.bool())
    label = crf.viterbi_decode(X, Mask.bool())
    print(Loss)
    print(label)
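For comparison, a minimal sketch of the off-the-shelf alternative using the pytorch-crf package (pip install pytorch-crf, import name torchcrf; the TorchCRF package mentioned above exposes a very similar interface). Its forward returns the log-likelihood, so it is negated for a loss, and decode plays the role of viterbi_decode:

import torch
from torchcrf import CRF  # pip install pytorch-crf

num_labels, batch_size, seq_len = 5, 2, 4
crf = CRF(num_labels, batch_first=True)  # batch_first matches (batch, seq, labels) emissions

emissions = torch.randn(batch_size, seq_len, num_labels)  # e.g. the linear-layer output of BertAttCRF
tags = torch.randint(num_labels, (batch_size, seq_len))
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 1, 0]], dtype=torch.bool)     # False marks padding

loss = -crf(emissions, tags, mask=mask)        # negative log-likelihood
best_paths = crf.decode(emissions, mask=mask)  # list of label-id lists per sequence
print(loss, best_paths)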
BERT_ATT_CRF.py
import torch
import torch.nn as nn
from transformers import BertModel
from CRF import CRF
from torch import Tensor


class BertAttCRF(nn.Module):
    def __init__(self, myconfig, pre_config):
        """
        :param myconfig: project configuration
        :param pre_config: configuration of the pretrained model
        """
        super(BertAttCRF, self).__init__()
        self.config = myconfig
        self.bert = BertModel.from_pretrained(self.config.Bert_path, config=pre_config)
        self.drop = nn.Dropout(p=self.config.drop_rate)  # randomly drop a small part to reduce overfitting
        # self-attention (batch_first so inputs keep the (batch, seq, embed) layout)
        self.attention = nn.MultiheadAttention(embed_dim=self.config.word_dim, num_heads=8,
                                               batch_first=True)
        self.layer_norm = nn.LayerNorm(self.config.word_dim)  # layer normalization
        self.linear_layer = nn.Linear(self.config.word_dim, self.config.label_num)  # fully connected classifier
        self.crf = CRF(num_labels=self.config.label_num)

    def forward(self, input_ids: Tensor, attention_mask: Tensor,
                token_type_ids: Tensor, tags: Tensor):
        """
        :param input_ids: torch.Size([batch_size, seq_len]), the encoded input instances
        :param token_type_ids: torch.Size([batch_size, seq_len]), marks the two possible sentences of an instance
        :param attention_mask: torch.Size([batch_size, seq_len]), which tokens take part in self-attention
        :param tags: labels
        :return:
        """
        output = self.bert(input_ids, token_type_ids=token_type_ids,
                           attention_mask=attention_mask)
        sequence_output = output[0]  # torch.Size([batch_size, seq_len, hidden_size])
        # n attention layers with residual connections
        for _ in range(self.config.layer_num):
            output = self.layer_norm(sequence_output)  # layer normalization
            # key_padding_mask expects True at padded positions
            output = self.attention(output, output, output,
                                    key_padding_mask=(attention_mask == 0))
            sequence_output = torch.add(sequence_output, output[0])  # residual connection
        sequence_output = self.drop(sequence_output)
        emissions = self.linear_layer(sequence_output)  # [batch_size, seq_len, num_labels]
        loss = -1 * self.crf(emissions, tags, mask=attention_mask.bool())
        return loss

    def predict(self, input_ids: Tensor, attention_mask=None,
                token_type_ids: Tensor = None):
        """
        :param input_ids: torch.Size([batch_size, seq_len]), the encoded input instances
        :param token_type_ids: torch.Size([batch_size, seq_len]), marks the two possible sentences of an instance
        :param attention_mask: torch.Size([batch_size, seq_len]), which tokens take part in self-attention
        :return:
        """
        outputs = self.bert(input_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask)
        sequence_output = outputs[0]
        for _ in range(self.config.layer_num):  # residual structure
            output = self.layer_norm(sequence_output)  # layer normalization
            output = self.attention(output, output, output,
                                    key_padding_mask=(attention_mask == 0))
            sequence_output = torch.add(sequence_output, output[0])
        sequence_output = self.drop(sequence_output)
        sequence_output = self.linear_layer(sequence_output)
        # decode with the CRF's Viterbi algorithm
        sequence_output = self.crf.viterbi_decode(sequence_output,
                                                  attention_mask.bool())
        return sequence_output
main.py
import pandas
from tqdm import tqdm
from config import Config
from dataload import NERDataset
from BERT_ATT_CRF import BertAttCRF
import torch
from transformers import BertTokenizer, BertConfig
import time
import random
import os


class RunBertAttCRF(object):
    def __init__(self, config: Config):
        """
        :param config: project configuration
        """
        self.config = config
        # prefer the GPU
        self.device = self.config.device
        # BERT
        self.tokenizer = BertTokenizer.from_pretrained(self.config.Bert_path,
                                                       do_lower_case=self.config.do_lower_case)
        self.pre_config = BertConfig.from_pretrained(self.config.Bert_path,
                                                     num_labels=self.config.label_num)
        self.model = BertAttCRF(self.config, pre_config=self.pre_config).to(self.device)
        # optimizer for the model parameters
        self.optimizer = torch.optim.Adam(self.model.parameters(),
                                          lr=self.config.learning_rate)

    def train(self, excel_path: str, sheet_name: str = 'Sheet1'):
        """
        :param excel_path: path of the training data Excel file
        :param sheet_name: sheet name
        :return:
        """
        self.model.train()
        data_set = NERDataset(self.config, self.tokenizer)  # instantiate the data processing class
        # get the preprocessed data
        process_line, label_tag_dict = data_set.data_process(excel_path,
                                                             sheet_name=sheet_name)
        # process_line = process_line[:int(len(process_line)*0.02)]
        # number of batches per epoch
        batch_num = (len(process_line) // self.config.batch_size
                     if len(process_line) % self.config.batch_size == 0
                     else (len(process_line) // self.config.batch_size) + 1)
        random.shuffle(process_line)  # shuffle
        max_acc = 0
        for e in range(self.config.epoch):
            all_loss = []  # collect the losses of one epoch
            start_time = time.time()  # timer
            for batch in range(batch_num):
                # the batch fits inside the data
                if (batch + 1) * self.config.batch_size <= len(process_line):
                    batch_line = process_line[batch * self.config.batch_size:
                                              (batch + 1) * self.config.batch_size]
                else:  # wrap around to fill the last batch
                    batch_line = process_line + process_line
                    batch_line = batch_line[batch * self.config.batch_size:
                                            (batch + 1) * self.config.batch_size]
                text = torch.tensor([item['text'] for item in batch_line], dtype=torch.long)
                mask = torch.tensor([item['mask'] for item in batch_line], dtype=torch.float)
                token_type_ids = torch.tensor([item['token_type_ids'] for item in batch_line],
                                              dtype=torch.long)
                label_ = torch.tensor([item['label'] for item in batch_line])
                # training step: compute gradients
                self.optimizer.zero_grad()
                loss = self.model.forward(text.to(self.device), mask.to(self.device),
                                          token_type_ids.to(self.device), label_.to(self.device))
                loss.mean().backward()  # backpropagate the loss
                self.optimizer.step()  # update the parameters
                all_loss += loss.tolist()
                print(f'\repoch:{e},batch:{(batch + 1)}, '
                      f'LOSS:{round(loss.mean().item(), 3)}', end='')
            need_time = (time.time() - start_time) / 60  # running time of one epoch
            mean_loss = round(sum(all_loss) / len(all_loss), 3)
            print(f'\repoch:{e}, mean_LOSS:{mean_loss},'
                  f' time:{round(need_time, 3)}m')
            if (e + 1) % 2 == 0:
                # evaluate the current parameters on the verification set
                verify_result, verify_label, _ = self.test(self.config.data_set_path, sheet_name='verify')
                accuracy_, precision_, recall_, f1_, conf_matrix_ = self.acc_prf1(verify_result,
                                                                                  verify_label)
                print(f'acc{accuracy_}\np{precision_}\nr{recall_}\nf1{f1_}\n')  # {conf_matrix_}\n
                os.makedirs(f'./{self.config.save_file_name}', exist_ok=True)
                # log the training process
                with open(f'./{self.config.save_file_name}/verify_result.txt', 'a', encoding='utf-8') as file_:
                    file_.write(f'params: epoch:{e}, mean_loss:{mean_loss}, lr:{self.config.learning_rate}, '
                                f'drop_rate:{self.config.drop_rate}, '
                                f'batch_size:{self.config.batch_size}, layer_num:{self.config.layer_num}\n'
                                f'verify metrics: acc:{accuracy_}, p:{precision_}, r:{recall_}, f1:{f1_}, '
                                f'time:{round(need_time, 3)}\n\n')  # , \nconf_matrix:{conf_matrix_}
                # save the model whenever it performs at least as well as before
                if accuracy_ - max_acc >= 0:
                    torch.save(self.model.state_dict(), self.config.save_model_path)
                    max_acc = accuracy_  # update the best accuracy
                self.model.train()  # back to training mode after evaluation
        # load the best weights so far
        self.model.load_state_dict(torch.load(self.config.save_model_path))

    def test(self, excel_path: str, sheet_name: str = 'Sheet1') -> (list, list, dict):
        """
        :param excel_path: path of the test data Excel file
        :param sheet_name: sheet name
        :return:
        """
        self.model.eval()
        # get the preprocessed data
        data_set = NERDataset(self.config, self.tokenizer)
        process_line, label_tag_dict = data_set.data_process(excel_path, sheet_name=sheet_name)
        batch_num = len(process_line) // self.config.batch_size
        all_result_ = []
        all_label = []
        for batch in range(batch_num):
            # take the batches in order; the remainder that does not fill a batch is dropped
            batch_line = process_line[batch * self.config.batch_size: (batch + 1) * self.config.batch_size]
            text = torch.tensor([item['text'] for item in batch_line], dtype=torch.long)
            mask = torch.tensor([item['mask'] for item in batch_line], dtype=torch.float)
            token_type_ids = torch.tensor([item['token_type_ids'] for item in batch_line], dtype=torch.long)
            label_ = [item['label'] for item in batch_line]
            # model prediction
            result_ = self.model.predict(text.to(self.device), mask.to(self.device),
                                         token_type_ids.to(self.device))
            # collect the results
            all_result_ += result_
            all_label += label_
        # pad the predictions with the padding label so the metrics can be computed
        new_all_result = []
        for item in all_result_:
            if len(item) < self.config.max_seq_len:  # pad predictions shorter than the maximum length
                item = item + [-1] * (self.config.max_seq_len - len(item))
            new_all_result.append(item)
        return new_all_result, all_label, label_tag_dict['label2id']

    def predict(self, excel_path: str, sheet_name: str = 'Sheet1') -> list:
        """
        :param excel_path: path of the data Excel file
        :param sheet_name: sheet name
        :return:
        """
        self.model.eval()
        # get the preprocessed data
        print('loading data...')
        data_set = NERDataset(self.config, self.tokenizer)  # instantiate the data processing class
        process_line, label_tag_dict = data_set.data_process(excel_path, sheet_name=sheet_name, train_mode=False)
        batch_num = len(process_line) // self.config.batch_size
        all_result_ = []
        for batch in tqdm(range(batch_num + 1)):
            end_id = None  # used to drop the filler that only pads out the last batch
            # take the batches in order
            if (batch + 1) * self.config.batch_size <= len(process_line):  # the batch fits inside the data
                batch_line = process_line[batch * self.config.batch_size: (batch + 1) * self.config.batch_size]
            else:  # fill up the last batch
                batch_line = process_line + process_line
                batch_line = batch_line[batch * self.config.batch_size: (batch + 1) * self.config.batch_size]
                end_id = len(process_line) - batch * self.config.batch_size  # where the real data ends
            text = torch.tensor([item['text'] for item in batch_line], dtype=torch.long)
            mask = torch.tensor([item['mask'] for item in batch_line], dtype=torch.float)
            token_type_ids = torch.tensor([item['token_type_ids'] for item in batch_line], dtype=torch.long)
            title = [item['label'] for item in batch_line]
            str_text = [item['str_text'] for item in batch_line]
            # model prediction
            result_ = self.model.predict(text.to(self.device), mask.to(self.device),
                                         token_type_ids.to(self.device))
            # drop the filler part of the last batch, if any
            if end_id is not None:
                result_ = result_[:end_id]
            all_result_ += [(result_[i], title[i], str_text[i]) for i in range(len(result_))]
        return all_result_

    def acc_prf1(self, result_: list, result_label: list):
        """
        :param result_: predictions
        :param result_label: labels
        :return:
        """
        # predictions and targets
        predicted = torch.tensor(result_)
        target = torch.tensor(result_label)
        # accuracy
        correct = torch.sum((predicted == target).int()).item()  # number of correctly predicted positions
        accuracy_ = correct / target.numel()  # accuracy
        # confusion matrix (note: the padding label -1 falls into the last row/column)
        conf_matrix_ = torch.zeros((self.config.label_num, self.config.label_num))
        for t, p in zip(target, predicted):
            for i in range(len(t)):
                conf_matrix_[t[i], p[i]] += 1
        p = torch.diag(conf_matrix_) / (conf_matrix_.sum(dim=0) + 1e-8)  # precision
        r = torch.diag(conf_matrix_) / (conf_matrix_.sum(dim=1) + 1e-8)  # recall
        f1_ = 2 * p * r / (p + r + 1e-8)  # F1
        return accuracy_, p, r, f1_, (conf_matrix_ / conf_matrix_.sum(dim=1, keepdim=True))


if __name__ == '__main__':
    # TensorFlow oneDNN custom-ops environment variable
    os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'
    myconfig = Config()
    """train with different parameter sets, then compare the results for tuning"""
    # params = {'drop_rate': [1e-1, 1e-2, 1e-3, 1e-4],
    #           'learning_rate': [1e-1, 1e-2, 1e-3, 1e-4],
    #           'layer_num': [1, 2, 3, 4]}
    # for key in params.keys():
    #     for item in params[key]:
    #         setattr(myconfig, key, item)  # assign the parameter value with setattr
    #         print(key, item)
    #         the_model = RunBertAttCRF(myconfig)
    #         the_model.train(myconfig.data_set_path, sheet_name='train')
    """train the model"""
    run = RunBertAttCRF(myconfig)
    run.train(myconfig.data_set_path, sheet_name='train')
    """final evaluation on the test set"""
    run = RunBertAttCRF(myconfig)
    run.model.load_state_dict(torch.load(myconfig.save_model_path))
    run.model.eval()
    result, label, label2id = run.test(myconfig.data_set_path, sheet_name='test')
    accuracy, precision, recall, f1, conf_matrix = run.acc_prf1(result, label)
    print(f'acc{accuracy}\np{precision}\nr{recall}\nf1{f1}\n')  # {conf_matrix}
    os.makedirs(f'./{myconfig.save_file_name}', exist_ok=True)
    with open(f'./{myconfig.save_file_name}/test_result.txt', 'a', encoding='utf-8') as file:
        file.write(f'test metrics: acc:{accuracy}, p:{precision}, r:{recall}, f1:{f1}\n'
                   f'\n\n')  # conf_matrix:{conf_matrix}
    """apply the model: predict the unlabeled data"""
    run = RunBertAttCRF(myconfig)
    run.model.load_state_dict(torch.load(myconfig.save_model_path))
    run.model.eval()
    all_result = run.predict(r'data/斗破苍穹(标注与未标注数据).xlsx', sheet_name='未标注数据部分')
    header = ['标签', '标题', '文本']
    all_result = pandas.DataFrame(all_result, columns=header)
    all_result.to_excel("data/斗破苍穹_未标注数据实体预测结果.xlsx")
实体预测结果提取.py
import pickle
import pandas as pd
from config import Config
import os

# labels of the different entity types
jz_entity_target = ['B-jz', 'I-jz']
zmsl_entity_target = ['B-zmsl', 'I-zmsl']
zy_entity_target = ['B-zy', 'I-zy']
djhj_entity_target = ['B-djhj', 'I-djhj']
gf_entity_target = ['B-gf', 'I-gf']
mf_entity_target = ['B-mf', 'I-mf']
yh_entity_target = ['B-yh', 'I-yh']
wq_entity_target = ['B-wq', 'I-wq']
zw_entity_target = ['B-zw', 'I-zw']
rw_entity_target = ['B-rw', 'I-rw']
# look up which type a label belongs to
entity_data_dict = {'jz': jz_entity_target,
                    'zmsl': zmsl_entity_target,
                    'zy': zy_entity_target,
                    'djhj': djhj_entity_target,
                    'gf': gf_entity_target,
                    'mf': mf_entity_target,
                    'yh': yh_entity_target,
                    'wq': wq_entity_target,
                    'zw': zw_entity_target,
                    'rw': rw_entity_target,
                    }
# load the id2label mapping
config = Config()
label_pkl_path = os.path.join(config.base_path, config.save_file_name, "label_dict.pkl")
with open(label_pkl_path, 'rb') as f:
    label_dict = pickle.load(f)
id2label = label_dict['id2label']

results = pd.read_excel('data/斗破苍穹_未标注数据实体预测结果.xlsx')
last_title = ''
last_title_label = []
last_title_text = ''
last_title_entity = []
all_title_label = []
for index, row in results.iterrows():
    label = row['标签']  # a list stored as a string
    label = eval(label)  # back to a list
    text = row['文本']
    text = eval(text)
    text = [item for item in text if item != '[PAD]']  # drop the padding tokens
    title = row['标题']
    entity = []
    assert len(text) == len(label)
    start = None
    start_type = None
    end = None
    label_id_0_type = None
    label_id_1_type = None
    for i in range(len(label) - 1):
        # inspect the current label
        str_label_0 = id2label[label[i]]  # numeric label to string label
        if str_label_0 == '<START>' or str_label_0 == '<END>':  # skip the special symbols
            continue
        elif str_label_0 == 'O':  # non-entity label
            label_id_0 = 9999
        else:  # entity label
            label_id_0 = str_label_0.split('-')[-1]
            label_id_0_type = label_id_0  # entity type of the label
            label_id_0 = entity_data_dict[label_id_0]  # label list of this type
            label_id_0 = label_id_0.index(str_label_0)  # index inside that list (B=0, I=1)
        # inspect the next label
        str_label_1 = id2label[label[i + 1]]  # numeric label to string label
        if str_label_1 == '<START>' or str_label_1 == '<END>':
            label_id_1 = 'special symbol'
        elif str_label_1 == 'O':
            label_id_1 = 9999
        else:
            label_id_1 = str_label_1.split('-')[-1]  # entity type of the label
            label_id_1_type = label_id_1
            label_id_1 = entity_data_dict[label_id_1]  # label list of this type
            label_id_1 = label_id_1.index(str_label_1)  # index inside that list (B=0, I=1)
        # match (B, O)  {B: 0, I: 1, O: 9999}
        if ((label_id_0 == 0 and label_id_1 == 9999) or  # B followed by O
                (label_id_0 == 0 and label_id_1 == 1 and label_id_0_type != label_id_1_type)):  # B, I of different types
            print("(B,O):", str_label_0)
            # start = i
            # start_type = label_id_0_type
            # end = i + 1
            # this data set has no single-character entities, so reset instead
            start = None
            start_type = None
            end = None
        # match (B, I, ..., O) and (B, I, O)
        else:
            if label_id_0 == 0 and label_id_1 == 1 and label_id_0_type == label_id_1_type:  # B, I of the same type (start)
                print("(B,I)start:", str_label_0)
                start = i
                start_type = label_id_0_type
            elif ((label_id_0 == 1 and label_id_1 == 0) or  # I followed by B (end)
                  (label_id_0 == 1 and label_id_1 == 9999)):  # I followed by O (end)
                print("(I,O)end:", str_label_0)
                end = i + 1
            elif label_id_0 == 9999:  # current label is O: clear the markers (the entity has been collected)
                start = None
                start_type = None
                end = None
            else:
                pass
        # slice the entity out using start and end
        if start is not None and end is not None and start_type is not None and int(start) < int(end):
            this_entity = text[start:end]
            this_entity = ''.join(this_entity)
            print('result————>', this_entity)
            entity.append((this_entity, start_type))
            start = None
            start_type = None
            end = None
    # group by chapter
    if title == last_title:  # same chapter title
        last_title_label += label
        last_title_text += text
        last_title_entity += entity
        if int(index) == len(results) - 1:  # last row of the last chapter: append it
            all_title_label.append([last_title, list(set(last_title_entity))])
    else:  # a new chapter title means the previous chapter is complete
        if int(index) > 0:  # skip the very first row
            all_title_label.append([last_title, list(set(last_title_entity))])
        last_title = title
        last_title_label = label
        last_title_text = text
        last_title_entity = entity
# attach the full chapter text
text_file = pd.read_excel(r'data/斗破苍穹(标注与未标注数据).xlsx', sheet_name='未标注数据部分')
for index, row in text_file.iterrows():
    title = row['标题']
    text = row['文本'].replace(' ', '')
    if str(title) != str(all_title_label[int(index)][0]):  # warn if the chapter order does not line up
        print(title, all_title_label[int(index)][0])
    all_title_label[int(index)].append(text)
header = ['标题', '识别结果', '文本']
data = pd.DataFrame(all_title_label, columns=header)
data.to_excel('data/斗破苍穹_预测结果提取.xlsx')
