
Intent Classification with BERT


Adapted from the hands-on tutorial "Text Classification with PyTorch and BERT".

1. BERT

Downloading the BERT model: go to the Hugging Face page "bert-base-cased at main" and download the pretrained model, i.e. the three required files; here the PyTorch version is used.

After downloading, place the files into a local bert-base-cased folder; for the PyTorch version these are typically config.json, vocab.txt, and pytorch_model.bin.

Verify that BERT can be loaded and called successfully:

    from transformers import BertModel, BertTokenizer

    BERT_PATH = './bert-base-cased'
    tokenizer = BertTokenizer.from_pretrained(BERT_PATH)
    print(tokenizer.tokenize('I have a good time, thank you.'))
    bert = BertModel.from_pretrained(BERT_PATH)
    print('load bert model over')

Output:

    ['I', 'have', 'a', 'good', 'time', ',', 'thank', 'you', '.']
    load bert model over

BertTokenizer explained: BertTokenizer converts raw text into the input format BERT requires.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
    example_text = 'I will watch Memento tonight'
    bert_input = tokenizer(example_text,
                           padding='max_length',
                           max_length=10,
                           truncation=True,
                           return_tensors="pt")
    # ------- bert_input ------
    print(bert_input['input_ids'])
    print(bert_input['token_type_ids'])
    print(bert_input['attention_mask'])

Output:

    tensor([[  101,   146,  1209,  2824,  2508, 26173,  3568,   102,     0,     0]])
    tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
    tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])

BertTokenizer parameters:

  • padding: pad each sequence to the specified maximum length.
  • max_length: the maximum length of each sequence. This example uses 10; BERT allows sequences of up to 512 tokens (the SNIPS preprocessing code below uses max_length=20, since the utterances are short).
  • truncation: if True, tokens beyond the maximum length are cut off.
  • return_tensors: the tensor type to return. Since we are using PyTorch we pass pt; with TensorFlow you would use tf.

BertTokenizer outputs (bert_input):

  • input_ids: the id of each token; 101 stands for [CLS], 102 for [SEP], and 0 for [PAD].
  • token_type_ids: a binary mask identifying which sequence a token belongs to. With a single sequence, all token type ids are 0 (see the sentence-pair sketch below for when they differ). For text classification, token_type_ids is an optional input to the BERT model.
  • attention_mask: a binary mask indicating whether a position holds a real token or only padding. The mask is 1 for [CLS], [SEP], and any real word, and 0 for [PAD] positions.
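A minimal sketch of when token_type_ids becomes informative: passing a sentence pair to the tokenizer. The two example sentences are made up for illustration; the call pattern is the standard Hugging Face tokenizer API.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

    # Two texts are packed into one sequence: [CLS] sentence A [SEP] sentence B [SEP]
    pair_input = tokenizer('I will watch Memento tonight',
                           'It is a great movie',
                           padding='max_length',
                           max_length=16,
                           truncation=True,
                           return_tensors='pt')

    # 0 for [CLS] + sentence A + the first [SEP], 1 for sentence B + the second [SEP],
    # and 0 again for the trailing [PAD] positions
    print(pair_input['token_type_ids'])

    # Decoding input_ids shows where the special tokens sit
    print(tokenizer.decode(pair_input['input_ids'][0]))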

2. Defining the model

The BERT output is fed into a fully connected layer and then through a ReLU layer.

    from torch import nn
    from transformers import BertModel
    import torch

    class BertClassifier(nn.Module):
        def __init__(self, dropout=0.5):
            super(BertClassifier, self).__init__()
            # self.bert = BertModel.from_pretrained('bert-base-cased')
            self.bert = BertModel.from_pretrained('/home/jiqiboyi03/chenpp/bert-classification/bert-base-cased')
            self.dropout = nn.Dropout(dropout)
            self.linear = nn.Linear(768, 7)  # 7 intent classes in SNIPS
            self.relu = nn.ReLU()

        def forward(self, input_id, mask):
            # pooled_output is the [CLS] vector, shape [batch_size, 768]
            _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
            dropout_output = self.dropout(pooled_output)
            linear_output = self.linear(dropout_output)
            final_layer = self.relu(linear_output)
            return final_layer

The expected format of input_id and mask, taking batch_size = 2 as an example: input_id must be 2-dimensional, while mask may be either 2- or 3-dimensional (a forward pass with these tensors is sketched after the example):

    input_id = torch.tensor([[  101,   178,   112,   173,  1176,   170,   189,  3624,  3043,  1121,
                              17496,  1396, 11305,  1106,  1207, 26063,  4661,  1664, 26645,   102],
                             [  101,  1110,  1175,   170, 20811,  3043,  1121, 10552,  4121,  1106,
                              21718,  1179,   175,  4047, 21349,  2528,   102,     0,     0,     0]])
    mask = torch.tensor([[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
                         [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]])
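A minimal sketch of running the classifier defined above on these example tensors, assuming the imports from the model definition are in scope and the hard-coded path inside BertClassifier points at a valid local copy of bert-base-cased:

    model = BertClassifier()
    model.eval()

    with torch.no_grad():
        output = model(input_id, mask)

    print(output.shape)          # torch.Size([2, 7]): one row of 7 class scores per utterance
    print(output.argmax(dim=1))  # predicted intent index for each utterance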

3. Data preprocessing

    import torch
    import numpy as np
    from transformers import BertTokenizer
    import torch.utils.data as data

    bert_path = '/home/jiqiboyi03/chenpp/bert-classification/bert-base-cased'
    tokenizer = BertTokenizer.from_pretrained(bert_path)

    labels = {'AddToPlaylist': 0, 'BookRestaurant': 1, 'GetWeather': 2, 'PlayMusic': 3,
              'RateBook': 4, 'SearchCreativeWork': 5, 'SearchScreeningEvent': 6}

    class Dataset(data.Dataset):
        def __init__(self, df):
            self.labels = [labels[label] for label in df['category']]
            self.texts = [tokenizer(text,
                                    padding='max_length',
                                    max_length=20,
                                    truncation=True,
                                    return_tensors="pt")
                          for text in df['text']]

        def classes(self):
            return self.labels

        def __len__(self):
            return len(self.labels)

        def get_batch_labels(self, idx):
            # Fetch a batch of labels
            return np.array(self.labels[idx])

        def get_batch_texts(self, idx):
            # Fetch a batch of inputs
            return self.texts[idx]

        def __getitem__(self, idx):
            batch_texts = self.get_batch_texts(idx)
            batch_y = self.get_batch_labels(idx)
            return batch_texts, batch_y

Here df should have the following format, with a 'category' field and a 'text' field of equal length:

    {'category': ['PlayMusic', ...],
     'text': ['open groove shark and play native us', ...]}
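A minimal sketch of building such a df and wrapping it in the Dataset class above, assuming it runs in the same module as the class (snips_process.py); the CSV file name and its column layout are assumptions for illustration, not from the original article:

    import pandas as pd

    # Hypothetical CSV with 'category' and 'text' columns
    df_train = pd.read_csv('snips_train.csv')

    train_set = Dataset(df_train)
    print(len(train_set))                # number of training utterances
    texts, label = train_set[0]
    print(texts['input_ids'].shape)      # torch.Size([1, 20])
    print(label)                         # integer intent id between 0 and 6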

4. Training

    from torch.optim import Adam
    from tqdm import tqdm
    from snips_process import Dataset
    import torch
    import torch.utils.data as data
    import torch.nn as nn
    import matplotlib.pyplot as plt

    def train(model, train_data, val_data, learning_rate, epochs):
        # Build the training and validation sets with the Dataset class
        train, val = Dataset(train_data), Dataset(val_data)
        # DataLoader batches the data; shuffle the training samples
        train_dataloader = torch.utils.data.DataLoader(train, batch_size=32, shuffle=True)
        val_dataloader = torch.utils.data.DataLoader(val, batch_size=32)
        # Use the GPU if available
        use_cuda = torch.cuda.is_available()
        device = torch.device("cuda" if use_cuda else "cpu")
        # Loss function and optimizer
        criterion = nn.CrossEntropyLoss()
        optimizer = Adam(model.parameters(), lr=learning_rate)
        train_loss = []
        train_acc = []
        val_loss = []
        val_acc = []
        EPOCH = []
        if use_cuda:
            model = model.cuda()
            criterion = criterion.cuda()
        # Training loop
        for epoch_num in range(epochs):
            # Accumulators for training accuracy and loss
            total_acc_train = 0
            total_loss_train = 0
            # tqdm shows a progress bar
            for train_input, train_label in tqdm(train_dataloader):
                train_label = train_label.to(device)
                mask = train_input['attention_mask'].to(device)
                input_id = train_input['input_ids'].squeeze(1).to(device)
                # print("input_id size:", input_id.size())  # [32, 20]
                # print("mask size:", mask.size())
                # Forward pass
                output = model(input_id, mask)  # [32, 7]
                # Loss
                batch_loss = criterion(output, train_label)
                total_loss_train += batch_loss.item()
                # Accuracy
                acc = (output.argmax(dim=1) == train_label).sum().item()
                total_acc_train += acc
                # Backward pass and parameter update
                model.zero_grad()
                batch_loss.backward()
                optimizer.step()
            # ------ Validation -----------
            # Accumulators for validation accuracy and loss
            total_acc_val = 0
            total_loss_val = 0
            # No gradients needed
            with torch.no_grad():
                # Run the current model over the validation set
                for val_input, val_label in val_dataloader:
                    # Same device handling as in training
                    val_label = val_label.to(device)
                    mask = val_input['attention_mask'].to(device)
                    input_id = val_input['input_ids'].squeeze(1).to(device)
                    output = model(input_id, mask)
                    batch_loss = criterion(output, val_label)
                    total_loss_val += batch_loss.item()
                    acc = (output.argmax(dim=1) == val_label).sum().item()
                    total_acc_val += acc
            train_loss.append(total_loss_train / len(train_data['text']))
            train_acc.append(total_acc_train / len(train_data['text']))
            val_loss.append(total_loss_val / len(val_data['text']))
            val_acc.append(total_acc_val / len(val_data['text']))
            EPOCH.append(epoch_num + 1)
            print(
                f'''Epochs: {epoch_num + 1}
                | Train Loss: {total_loss_train / len(train_data['text']): .3f}
                | Train Accuracy: {total_acc_train / len(train_data['text']): .3f}
                | Val Loss: {total_loss_val / len(val_data['text']): .3f}
                | Val Accuracy: {total_acc_val / len(val_data['text']): .3f}''')
        print("saving bert model......")
        torch.save(model.state_dict(), '../bert-base-cased/bert_trained_snips_full.pt')
        # Plot the training curves
        plt.plot(EPOCH, train_loss, 'b', label='train_loss')
        plt.plot(EPOCH, train_acc, 'g', label='train_acc')
        plt.plot(EPOCH, val_loss, 'r', label='val_loss')
        plt.plot(EPOCH, val_acc, 'c', label='val_acc')
        plt.legend()
        plt.show()

Note: train_data is the df from the data-preprocessing step. The tokenizer output for a single text has shape [1, max_length]; the DataLoader adds a batch_size dimension, making it 3-dimensional, but BERT's input must be 2-dimensional, hence the squeeze(1) (see the shape sketch below).
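A minimal sketch of the shape bookkeeping described in the note; batch size 32 and max_length 20 match the code in this article:

    import torch

    # What the tokenizer returns for one utterance with return_tensors="pt"
    single_input_ids = torch.zeros(1, 20, dtype=torch.long)       # [1, 20]

    # After the DataLoader stacks a batch of 32 such items
    batched_input_ids = torch.zeros(32, 1, 20, dtype=torch.long)  # [32, 1, 20]

    # squeeze(1) removes the extra dimension so BERT receives a 2-D tensor
    print(batched_input_ids.squeeze(1).shape)  # torch.Size([32, 20])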

5. Evaluation

    from snips_process import Dataset, df_test
    import torch
    from model import BertClassifier
    import torch.utils.data as data

    def evaluate(model, test_data):
        test = Dataset(test_data)
        length = len(test_data['text'])
        test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)
        use_cuda = torch.cuda.is_available()
        device = torch.device("cuda" if use_cuda else "cpu")
        if use_cuda:
            model = model.cuda()
        total_acc_test = 0
        with torch.no_grad():
            for test_input, test_label in test_dataloader:
                test_label = test_label.to(device)
                mask = test_input['attention_mask'].to(device)
                input_id = test_input['input_ids'].squeeze(1).to(device)
                output = model(input_id, mask)
                acc = (output.argmax(dim=1) == test_label).sum().item()
                total_acc_test += acc
        print(f'Test Accuracy: {total_acc_test / length: .3f}')
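A minimal sketch of tying the pieces together, assuming it runs in the same script as evaluate() above; it loads the checkpoint saved at the end of the training script and evaluates on df_test:

    model = BertClassifier()
    # Load the weights written by the training script
    model.load_state_dict(torch.load('../bert-base-cased/bert_trained_snips_full.pt', map_location='cpu'))
    model.eval()

    evaluate(model, df_test)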
