HW7-BERT(Question Answering)

1.Task description

  • Chinese Extractive Question Answering

    • Input: Paragraph + Question
    • Output: Answer
  • Objective: Learn how to fine-tune a pretrained model on a downstream task using transformers

  • Todo

    • Fine-tune a pretrained Chinese BERT model
    • Change hyperparameters (e.g. doc_stride)
    • Apply linear learning rate decay
    • Try other pretrained models
    • Improve preprocessing
    • Improve postprocessing
  • Training tips

    • Automatic mixed precision
    • Gradient accumulation (a minimal sketch follows this list)
    • Ensemble
  • Estimated training time (Tesla T4 with automatic mixed precision enabled)

    • Simple: 8mins
    • Medium: 8mins
    • Strong: 25mins
    • Boss: 2.5hrs
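
Gradient accumulation, one of the tips above, only needs a few extra lines in a standard training loop. A minimal sketch, assuming a model, optimizer, device and train_loader like the ones defined later in this notebook (accum_steps is a hypothetical hyperparameter, not part of the sample code):

# Minimal gradient-accumulation sketch (names follow the training loop in section 2.9)
accum_steps = 4  # hypothetical: effective batch size = train_batch_size * accum_steps

for step, data in enumerate(train_loader):
    data = [i.to(device) for i in data]
    output = model(input_ids=data[0], token_type_ids=data[1], attention_mask=data[2],
                   start_positions=data[3], end_positions=data[4])
    loss = output.loss / accum_steps      # scale the loss so the accumulated gradients average out
    loss.backward()
    if (step + 1) % accum_steps == 0:     # update weights only every accum_steps mini-batches
        optimizer.step()
        optimizer.zero_grad()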

2.Code

I made several modifications on top of the sample code (linear learning rate decay, doc_stride, automatic mixed precision i.e. fp16, trying other pretrained models, preprocessing, and postprocessing), all of which are reflected in the code below.
Some of them worked well; others did not.

2.1 Download Dataset

# Download link 1
!gdown --id '1AVgZvy3VFeg0fX-6WQJMHPVrx3A-M1kb' --output hw7_data.zip

# Download Link 2 (if the above link fails)
# !gdown --id '1qwjbRjq481lHsnTrrF4OjKQnxzgoLEFR' --output hw7_data.zip

# Download Link 3 (if the above link fails)
# !gdown --id '1QXuWjNRZH6DscSd6QcRER0cnxmpZvijn' --output hw7_data.zip

!unzip -o hw7_data.zip

# For this HW, K80 < P4 < T4 < P100 <= T4(fp16) < V100
!nvidia-smi

2.2 Install transformers

Documentation for the toolkit: https://huggingface.co/transformers/

# You are allowed to change version of transformers or use other toolkits
!pip install transformers==4.5.0

2.3 Import Packages

import json
import numpy as np
import random
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AdamW, BertForQuestionAnswering, BertTokenizerFast

from tqdm.auto import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fix random seed for reproducibility
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
same_seeds(0)
# Change "fp16_training" to True to support automatic mixed precision training (fp16) 由32bit浮点数改为16bit加速计算
fp16_training = True

if fp16_training:
    #============================NEW BEGIN============================
    # !pip install accelerate==0.2.0
    !pip install accelerate==0.16.0
    from accelerate import Accelerator
    # accelerator = Accelerator(fp16=True)
    accelerator = Accelerator(mixed_precision='fp16')
    #============================NEW END============================

    device = accelerator.device

# Documentation for the toolkit:  https://huggingface.co/docs/accelerate/

Automatic mixed precision: 16-bit floats are used in place of 32-bit floats to speed up training. However, because of version changes in accelerate, running the sample code as-is raises TypeError: __init__() got an unexpected keyword argument 'fp16'.
Fix
For fp16, change accelerator = Accelerator(fp16=True) in the code to accelerator = Accelerator(mixed_precision='fp16'), and install the newer version with !pip install accelerate==0.16.0; only then does it run successfully.

2.4 Load Model and Tokenizer

model = BertForQuestionAnswering.from_pretrained("bert-base-chinese").to(device)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

#============================NEW BEGIN============================
### Try to use other model
# model = BertForQuestionAnswering.from_pretrained("bert-base-multilingual-cased").to(device)
# tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
#============================NEW END============================

# You can safely ignore the warning message (it pops up because new prediction heads for QA are initialized randomly)

Try other pretrained models: I tried the bert-base-multilingual-cased pretrained model.
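
If you want to experiment beyond BERT checkpoints, the Auto* classes in transformers can load any QA-capable checkpoint by name. A hedged sketch (the checkpoint name below is only an example of a publicly available Chinese model, not necessarily a better choice):

from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Example only: substitute any QA-capable checkpoint name from the Hugging Face hub
model_name = "hfl/chinese-roberta-wwm-ext"
model = AutoModelForQuestionAnswering.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)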

2.5 Read Data

  • Training set: 31690 QA pairs

  • Dev set: 4131 QA pairs

  • Test set: 4957 QA pairs

  • {train/dev/test}_questions:

    • List of dicts with the following keys:
    • id (int)
    • paragraph_id (int)
    • question_text (string)
    • answer_text (string)
    • answer_start (int)
    • answer_end (int)
  • {train/dev/test}_paragraphs:

    • List of strings
    • paragraph_ids in questions correspond to indices in paragraphs
    • A paragraph may be used by several questions

The concrete data format is as follows.
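
A hypothetical example consistent with the keys listed above (all values are invented for illustration and will differ from the real hw7_train.json):

# Hypothetical example of the data layout (all values invented for illustration)
example = {
    "questions": [
        {
            "id": 0,
            "paragraph_id": 1,
            "question_text": "毕业典礼在哪里举行?",
            "answer_text": "大礼堂",
            "answer_start": 7,   # character offset of the first answer character in paragraphs[1]
            "answer_end": 9,     # inclusive offset of the last answer character (see char_to_token usage in 2.7)
        }
    ],
    "paragraphs": [
        "第一段的内容……",
        "毕业典礼于校内大礼堂举行……",
    ],
}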

def read_data(file):
    with open(file, 'r', encoding="utf-8") as reader:
        data = json.load(reader)
    return data["questions"], data["paragraphs"]

train_questions, train_paragraphs = read_data("hw7_train.json")
dev_questions, dev_paragraphs = read_data("hw7_dev.json")
test_questions, test_paragraphs = read_data("hw7_test.json")

2.6 Tokenize Data

# Tokenize questions and paragraphs separately
# 「add_special_tokens」 is set to False since special tokens will be added when the tokenized questions and paragraphs are combined in the dataset's __getitem__

train_questions_tokenized = tokenizer([train_question["question_text"] for train_question in train_questions], add_special_tokens=False)
dev_questions_tokenized = tokenizer([dev_question["question_text"] for dev_question in dev_questions], add_special_tokens=False)
test_questions_tokenized = tokenizer([test_question["question_text"] for test_question in test_questions], add_special_tokens=False)

train_paragraphs_tokenized = tokenizer(train_paragraphs, add_special_tokens=False)
dev_paragraphs_tokenized = tokenizer(dev_paragraphs, add_special_tokens=False)
test_paragraphs_tokenized = tokenizer(test_paragraphs, add_special_tokens=False)

# You can safely ignore the warning message as tokenized sequences will be further processed in the dataset's __getitem__ before being passed to the model
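
The fast tokenizer keeps a character-to-token alignment for every encoded text, which is what the dataset below uses to convert answer_start / answer_end from character positions to token positions. A small sketch of that mapping (the example string is made up):

# Each encoding produced by the fast tokenizer carries char-to-token alignment
enc = tokenizer("毕业典礼于校内大礼堂举行", add_special_tokens=False)
start_token = enc.char_to_token(7)   # token covering the character at position 7 ("大")
end_token = enc.char_to_token(9)     # token covering the character at position 9 ("堂")
print(start_token, end_token)        # with a character-level Chinese vocab these are typically 7 and 9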

2.7 Dataset and Dataloader

class QA_Dataset(Dataset):
    def __init__(self, split, questions, tokenized_questions, tokenized_paragraphs):
        self.split = split
        self.questions = questions
        self.tokenized_questions = tokenized_questions
        self.tokenized_paragraphs = tokenized_paragraphs
        self.max_question_len = 40
        self.max_paragraph_len = 150

        ##### TODO: Change value of doc_stride #####
        # self.doc_stride = 150 # doc_stride = max_paragraph_len = 150, so two windows never overlap

        #============================NEW BEGIN============================
        self.doc_stride = 80 # a smaller doc_stride makes consecutive windows overlap
        #============================NEW END============================

        # Input sequence length = [CLS] + question + [SEP] + paragraph + [SEP]
        self.max_seq_len = 1 + self.max_question_len + 1 + self.max_paragraph_len + 1

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        tokenized_question = self.tokenized_questions[idx] # tokenized text of question idx
        tokenized_paragraph = self.tokenized_paragraphs[question["paragraph_id"]] # the paragraph containing this question's answer

        ##### TODO: Preprocessing #####
        # Hint: How to prevent model from learning something it should not learn
        # e.g. the answer usually sits near the middle of the sliced window, but the model should not learn to look for answers only in the middle

        if self.split == "train":
            # Convert answer's start/end positions in paragraph_text to start/end positions in tokenized_paragraph
            answer_start_token = tokenized_paragraph.char_to_token(question["answer_start"]) # int: token index of answer_start within the paragraph
            answer_end_token = tokenized_paragraph.char_to_token(question["answer_end"])

            # A single window is obtained by slicing the portion of paragraph containing the answer
            mid = (answer_start_token + answer_end_token) // 2 # midpoint (integer division) of the answer span
            paragraph_start = max(0, min(mid - self.max_paragraph_len // 2, len(tokenized_paragraph) - self.max_paragraph_len))
            paragraph_end = paragraph_start + self.max_paragraph_len # window of the paragraph that contains the answer

            #============================NEW BEGIN============================
            # Cut a random window around the answer so the answer does not always sit in the middle of the window
            # https://github.com/Wangdaoshuai/LHYML2021-Spring/blob/main/HW7-medium.ipynb
            # mid = (answer_start_token + answer_end_token) // 2
            # answer_length = answer_end_token - answer_start_token + 1
            # if answer_length // 2 < self.max_paragraph_len - answer_length // 2:
            #   rnd = random.randint(answer_length // 2, self.max_paragraph_len - answer_length // 2)
            # else:
            #   rnd = self.max_paragraph_len // 2
            # paragraph_start = max(0, min(mid - rnd, len(tokenized_paragraph) - self.max_paragraph_len))
            # paragraph_end = paragraph_start + self.max_paragraph_len
            #============================NEW END============================

            # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
            input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
            input_ids_paragraph = tokenized_paragraph.ids[paragraph_start : paragraph_end] + [102]

            # Convert answer's start/end positions in tokenized_paragraph to start/end positions in the window
            answer_start_token += len(input_ids_question) - paragraph_start
            answer_end_token += len(input_ids_question) - paragraph_start

            # Pad sequence and obtain inputs to model
            input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
            return torch.tensor(input_ids), torch.tensor(token_type_ids), torch.tensor(attention_mask), answer_start_token, answer_end_token

        # Validation/Testing
        else:
            input_ids_list, token_type_ids_list, attention_mask_list = [], [], []

            # Paragraph is split into several windows, each with start positions separated by step "doc_stride"
            for i in range(0, len(tokenized_paragraph), self.doc_stride):

                # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
                input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
                input_ids_paragraph = tokenized_paragraph.ids[i : i + self.max_paragraph_len] + [102]

                # Pad sequence and obtain inputs to model
                input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)

                input_ids_list.append(input_ids)
                token_type_ids_list.append(token_type_ids)
                attention_mask_list.append(attention_mask)

            return torch.tensor(input_ids_list), torch.tensor(token_type_ids_list), torch.tensor(attention_mask_list)

    def padding(self, input_ids_question, input_ids_paragraph):
        # Pad zeros if sequence length is shorter than max_seq_len
        padding_len = self.max_seq_len - len(input_ids_question) - len(input_ids_paragraph)
        # Indices of input sequence tokens in the vocabulary
        input_ids = input_ids_question + input_ids_paragraph + [0] * padding_len
        # Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
        token_type_ids = [0] * len(input_ids_question) + [1] * len(input_ids_paragraph) + [0] * padding_len
        # Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
        attention_mask = [1] * (len(input_ids_question) + len(input_ids_paragraph)) + [0] * padding_len

        return input_ids, token_type_ids, attention_mask

train_set = QA_Dataset("train", train_questions, train_questions_tokenized, train_paragraphs_tokenized)
dev_set = QA_Dataset("dev", dev_questions, dev_questions_tokenized, dev_paragraphs_tokenized)
test_set = QA_Dataset("test", test_questions, test_questions_tokenized, test_paragraphs_tokenized)

train_batch_size = 32

# Note: Do NOT change batch size of dev_loader / test_loader !
# Although batch size=1, it is actually a batch consisting of several windows from the same QA pair
train_loader = DataLoader(train_set, batch_size=train_batch_size, shuffle=True, pin_memory=True)
dev_loader = DataLoader(dev_set, batch_size=1, shuffle=False, pin_memory=True)
test_loader = DataLoader(test_set, batch_size=1, shuffle=False, pin_memory=True)
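
To see what a smaller doc_stride does at inference time, the sketch below lists the window boundaries produced by the loop in __getitem__ for a hypothetical paragraph length: with stride 150 consecutive windows touch but do not overlap, while with stride 80 they share 70 tokens.

# Window start/end token positions for a hypothetical paragraph of 400 tokens
paragraph_len, max_paragraph_len = 400, 150
for doc_stride in (150, 80):
    windows = [(i, min(i + max_paragraph_len, paragraph_len))
               for i in range(0, paragraph_len, doc_stride)]
    print(f"doc_stride={doc_stride}: {windows}")
# doc_stride=150 -> [(0, 150), (150, 300), (300, 400)]
# doc_stride=80  -> [(0, 150), (80, 230), (160, 310), (240, 390), (320, 400)]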

2.8 Function for Evaluation

def evaluate(data, output):
    ##### TODO: Postprocessing #####
    # There is a bug and room for improvement in postprocessing
    # Hint: Open your prediction file to see what is wrong
    # the bug is probably that start_index > end_index

    answer = ''
    max_prob = float('-inf')
    num_of_windows = data[0].shape[1]

    for k in range(num_of_windows):
        # Obtain answer by choosing the most probable start position / end position
        start_prob, start_index = torch.max(output.start_logits[k], dim=0)
        end_prob, end_index = torch.max(output.end_logits[k], dim=0)

        # Probability of answer is calculated as sum of start_prob and end_prob
        prob = start_prob + end_prob

        # Replace answer if calculated probability is larger than previous windows
        #============================NEW BEGIN============================
        # if prob > max_prob and start_index <= end_index:  # extra check that the start position precedes the end position
        #============================NEW END============================
        if prob > max_prob:
            max_prob = prob
            # Convert tokens to chars (e.g. [1920, 7032] --> "大 金")
            answer = tokenizer.decode(data[0][0][k][start_index : end_index + 1])

    # Remove spaces in answer (e.g. "大 金" --> "大金")
    return answer.replace(' ','')
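
One possible repair for the start_index > end_index problem hinted at above is to restrict the end search to positions at or after the chosen start, instead of picking the two independently. A hedged sketch (only one way to do it, not the assignment's reference solution):

# Sketch: choose the end only among positions >= start so the decoded span can never be reversed
def evaluate_fixed(data, output):
    answer = ''
    max_prob = float('-inf')
    num_of_windows = data[0].shape[1]
    for k in range(num_of_windows):
        start_prob, start_index = torch.max(output.start_logits[k], dim=0)
        # Only consider end positions at or after start_index
        end_prob, end_offset = torch.max(output.end_logits[k][start_index:], dim=0)
        end_index = start_index + end_offset
        prob = start_prob + end_prob
        if prob > max_prob:
            max_prob = prob
            answer = tokenizer.decode(data[0][0][k][start_index : end_index + 1])
    return answer.replace(' ', '')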

2.9 Training

num_epoch = 1
validation = True
logging_step = 100
learning_rate = 1e-4
optimizer = AdamW(model.parameters(), lr=learning_rate)

if fp16_training:
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()

print("Start Training ...")

for epoch in range(num_epoch):
    step = 1
    train_loss = train_acc = 0

    for data in tqdm(train_loader):
        # Load all data into GPU
        data = [i.to(device) for i in data]

        # Model inputs: input_ids, token_type_ids, attention_mask, start_positions, end_positions (Note: only "input_ids" is mandatory)
        # Model outputs: start_logits, end_logits, loss (return when start_positions/end_positions are provided)
        output = model(input_ids=data[0], token_type_ids=data[1], attention_mask=data[2], start_positions=data[3], end_positions=data[4])

        # Choose the most probable start position / end position
        start_index = torch.argmax(output.start_logits, dim=1)
        end_index = torch.argmax(output.end_logits, dim=1)

        # Prediction is correct only if both start_index and end_index are correct
        train_acc += ((start_index == data[3]) & (end_index == data[4])).float().mean()
        train_loss += output.loss

        if fp16_training:
            accelerator.backward(output.loss)
        else:
            output.loss.backward()

        optimizer.step()
        optimizer.zero_grad()
        step += 1

        ##### TODO: Apply linear learning rate decay #####
        #============================NEW BEGIN============================
        # check at the end of training that the learning rate has decayed close to zero
        optimizer.param_groups[0]["lr"] -= learning_rate / len(train_loader)
        #============================NEW END============================
        
        # Print training loss and accuracy over past logging step
        if step % logging_step == 0:
            print(f"Epoch {epoch + 1} | Step {step} | loss = {train_loss.item() / logging_step:.3f}, acc = {train_acc / logging_step:.3f}")
            train_loss = train_acc = 0
    a = optimizer.param_groups[0]["lr"]
    print(f"Learning rate is {a}")
    if validation:
        print("Evaluating Dev Set ...")
        model.eval()
        with torch.no_grad():
            dev_acc = 0
            for i, data in enumerate(tqdm(dev_loader)):
                output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
                # prediction is correct only if answer text exactly matches
                dev_acc += evaluate(data, output) == dev_questions[i]["answer_text"]
            print(f"Validation | Epoch {epoch + 1} | acc = {dev_acc / len(dev_loader):.3f}")
        model.train()

# Save a model and its configuration file to the directory 「saved_model」
# i.e. there are two files under the directory 「saved_model」: 「pytorch_model.bin」 and 「config.json」
# Saved model can be re-loaded using 「model = BertForQuestionAnswering.from_pretrained("saved_model")」
print("Saving Model ...")
model_save_dir = "saved_model"
model.save_pretrained(model_save_dir)
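
As an alternative to decrementing the learning rate by hand inside the loop, transformers provides a linear decay scheduler. A minimal sketch of wiring it in (num_warmup_steps=0 is a deliberate choice here, i.e. pure linear decay with no warmup):

from transformers import get_linear_schedule_with_warmup

# Linear decay from the initial learning rate down to 0 over all update steps
total_steps = num_epoch * len(train_loader)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Inside the training loop, step the scheduler right after each optimizer update:
#     optimizer.step()
#     scheduler.step()
#     optimizer.zero_grad()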

2.10 Testing

print("Evaluating Test Set ...")

result = []

model.eval()
with torch.no_grad():
    for data in tqdm(test_loader):
        output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
        result.append(evaluate(data, output))

result_file = "result.csv"
with open(result_file, 'w') as f:
    f.write("ID,Answer\n")
    for i, test_question in enumerate(test_questions):
        # Replace commas in answers with empty strings (since csv is separated by comma)
        # Answers in kaggle are processed in the same way
        f.write(f"{test_question['id']},{result[i].replace(',','')}\n")

print(f"Completed! Result is in {result_file}")

2.11 Look at the result (read the csv file)

import pandas
df = pandas.read_csv("./result.csv")
# for i in range(9):
print(type(df))
print(df.iloc[0:8,:])

3.Training results

3.1 Criterion

[Figures: grading criteria]

3.2 Sample Code


  • Without fp16: runtime about 18 minutes, acc = 0.46, which reaches the simple baseline; inspecting the test-set output shows only a handful of correct answers.


  • fp16: runtime about 6 minutes and acc = 0.504; not only faster, but accuracy is also slightly higher.

3.3 Medium (fp16 + linear learning rate + doc_stride)

fp16 & doc_stride=80: without linear learning rate decay, acc = 0.481.
fp16 & linear learning rate decay & doc_stride=80: runtime about 6 minutes; after training the learning rate is close to zero, and acc = 0.651, roughly reaching the Medium baseline, an improvement over the run without linear learning rate decay above. However, answer 0 in the dev-set predictions is NaN, which is most likely caused by start_index > end_index in the predicted span, so a check on the start/end positions should be added.

3.4 Strong

On top of the above, random windows were added so that the answer is not always in the middle of the window, which would otherwise teach the model to look for answers only in the middle. acc reached 0.724.
Switching the pretrained model to bert-base-multilingual-cased gives acc = 0.732, slightly higher than before, and answer 0 is no longer NaN, but it still does not reach the Strong baseline. Perhaps I did not pick a suitable model.

3.5 Boss

In the sample code, start_index and end_index are simply the positions whose probabilities sum to the maximum; there is no constraint that start_index <= end_index.
After adding the check that the predicted start position must come before the end position, the loss dropped very quickly during training but acc hardly improved, so the check may not have been added correctly.


In the Medium run that used only fp16 & doc_stride & linear learning rate, answer 0 was NaN, and after adding the position check answer 0 was no longer NaN.
However, that earlier run was done quite a while ago; rerunning it, answer 0 was not NaN even without the check, so it is unclear whether the position check actually made a difference.
