
【Relation Extraction: R-BERT】Loading the Dataset


Component-Whole(e2,e1) The system as described above has its greatest application in an arrayed configuration of antenna elements .
Other The child was carefully wrapped and bound into the cradle by means of a cord.
Instrument-Agency(e2,e1) The author of a keygen uses a disassembler to look at the raw assembly code.
Other A misty ridge uprises from the surge .
Member-Collection(e1,e2) The student association is the voice of the undergraduate student population of the State University of New York at Buffalo.
Other This is the sprawling complex that is Peru’s largest producer of silver.
Cause-Effect(e2,e1) The current view is that the chronic inflammation in the distal part of the stomach caused by Helicobacter pylori infection results in an increased acid production from the non-infected upper corpus region of the stomach.
Entity-Destination(e1,e2) People have been moving back into downtown .
Content-Container(e1,e2) The lawsonite was contained in a platinum crucible and the counter-weight was a plastic crucible with metal pieces.
Entity-Destination(e1,e2) The solute was placed inside a beaker and 5 mL of the solvent was pipetted into a 25 mL glass flask for each trial.
Member-Collection(e1,e2) The fifty essays collected in this volume testify to most of the prominent themes from Professor Quispel’s scholarly career.
Other Their composer has sunk into oblivion .
These sample lines come from the SemEval-2010 Task 8 dataset. In the actual train.tsv, each line is a relation label, a tab, and the sentence, with the two entities wrapped in <e1>...</e1> and <e2>...</e2> marker tokens. For a more detailed introduction to the dataset, see: https://blog.csdn.net/qq_29883591/article/details/88567561
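For illustration, a made-up line in the expected format (label and sentence separated by a tab; this example is not taken from the actual file) would look like:

Cause-Effect(e1,e2)	The <e1> earthquake </e1> generated a large <e2> tsunami </e2> .

The loader code below relies on these <e1>/<e2> markers being present in the sentence.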

The data-processing code is in data_loader.py:
import copy
import csv
import json
import logging
import os

import torch
from torch.utils.data import TensorDataset

from utils import get_label

logger = logging.getLogger(__name__)

class InputExample(object):
    """
    A single training/test example for simple sequence classification.

    Args:
        guid: Unique id for the example.
        text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
        label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
    """

    def __init__(self, guid, text_a, label):
        self.guid = guid
        self.text_a = text_a
        self.label = label

    def __repr__(self):
        return str(self.to_json_string())

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        output = copy.deepcopy(self.__dict__)
        return output

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

class InputFeatures(object):
    """
    A single set of features of data.

    Args:
        input_ids: Indices of input sequence tokens in the vocabulary.
        attention_mask: Mask to avoid performing attention on padding token indices.
            Mask values selected in ``[0, 1]``:
            Usually ``1`` for tokens that are NOT MASKED, ``0`` for MASKED (padded) tokens.
        token_type_ids: Segment token indices to indicate first and second portions of the inputs.
    """

    def __init__(self, input_ids, attention_mask, token_type_ids, label_id, e1_mask, e2_mask):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.token_type_ids = token_type_ids
        self.label_id = label_id
        self.e1_mask = e1_mask
        self.e2_mask = e2_mask

    def __repr__(self):
        return str(self.to_json_string())

    def to_dict(self):
        """Serializes this instance to a Python dictionary."""
        output = copy.deepcopy(self.__dict__)
        return output

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

class SemEvalProcessor(object):
    """Processor for the Semeval data set"""

    def __init__(self, args):
        self.args = args
        self.relation_labels = get_label(args)

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r", encoding="utf-8") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                lines.append(line)
            return lines

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text_a = line[1]
            label = self.relation_labels.index(line[0])
            if i % 1000 == 0:
                logger.info(line)
            examples.append(InputExample(guid=guid, text_a=text_a, label=label))
        return examples

    def get_examples(self, mode):
        """
        Args:
            mode: train, dev, test
        """
        file_to_read = None
        if mode == "train":
            file_to_read = self.args.train_file
        elif mode == "dev":
            file_to_read = self.args.dev_file
        elif mode == "test":
            file_to_read = self.args.test_file

        logger.info("LOOKING AT {}".format(os.path.join(self.args.data_dir, file_to_read)))
        return self._create_examples(self._read_tsv(os.path.join(self.args.data_dir, file_to_read)), mode)

processors = {"semeval": SemEvalProcessor}

def convert_examples_to_features(
    examples,
    max_seq_len,
    tokenizer,
    cls_token="[CLS]",
    cls_token_segment_id=0,
    sep_token="[SEP]",
    pad_token=0,
    pad_token_segment_id=0,
    sequence_a_segment_id=0,
    add_sep_token=False,
    mask_padding_with_zero=True,
):
    features = []
    for (ex_index, example) in enumerate(examples):
        if ex_index % 5000 == 0:
            logger.info("Writing example %d of %d" % (ex_index, len(examples)))

        tokens_a = tokenizer.tokenize(example.text_a)

        e11_p = tokens_a.index("<e1>")  # the start position of entity1
        e12_p = tokens_a.index("</e1>")  # the end position of entity1
        e21_p = tokens_a.index("<e2>")  # the start position of entity2
        e22_p = tokens_a.index("</e2>")  # the end position of entity2

        # Replace the token
        tokens_a[e11_p] = "$"
        tokens_a[e12_p] = "$"
        tokens_a[e21_p] = "#"
        tokens_a[e22_p] = "#"

        # Add 1 because of the [CLS] token
        e11_p += 1
        e12_p += 1
        e21_p += 1
        e22_p += 1

        # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
        if add_sep_token:
            special_tokens_count = 2
        else:
            special_tokens_count = 1
        if len(tokens_a) > max_seq_len - special_tokens_count:
            tokens_a = tokens_a[: (max_seq_len - special_tokens_count)]

        tokens = tokens_a
        if add_sep_token:
            tokens += [sep_token]

        token_type_ids = [sequence_a_segment_id] * len(tokens)

        tokens = [cls_token] + tokens
        token_type_ids = [cls_token_segment_id] + token_type_ids

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.
        attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding_length = max_seq_len - len(input_ids)
        input_ids = input_ids + ([pad_token] * padding_length)
        attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
        token_type_ids = token_type_ids + ([pad_token_segment_id] * padding_length)

        # e1 mask, e2 mask
        e1_mask = [0] * len(attention_mask)
        e2_mask = [0] * len(attention_mask)

        for i in range(e11_p, e12_p + 1):
            e1_mask[i] = 1
        for i in range(e21_p, e22_p + 1):
            e2_mask[i] = 1

        assert len(input_ids) == max_seq_len, "Error with input length {} vs {}".format(len(input_ids), max_seq_len)
        assert len(attention_mask) == max_seq_len, "Error with attention mask length {} vs {}".format(
            len(attention_mask), max_seq_len
        )
        assert len(token_type_ids) == max_seq_len, "Error with token type length {} vs {}".format(
            len(token_type_ids), max_seq_len
        )

        label_id = int(example.label)

        if ex_index < 5:
            logger.info("*** Example ***")
            logger.info("guid: %s" % example.guid)
            logger.info("tokens: %s" % " ".join([str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("attention_mask: %s" % " ".join([str(x) for x in attention_mask]))
            logger.info("token_type_ids: %s" % " ".join([str(x) for x in token_type_ids]))
            logger.info("label: %s (id = %d)" % (example.label, label_id))
            logger.info("e1_mask: %s" % " ".join([str(x) for x in e1_mask]))
            logger.info("e2_mask: %s" % " ".join([str(x) for x in e2_mask]))

        features.append(
            InputFeatures(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
                label_id=label_id,
                e1_mask=e1_mask,
                e2_mask=e2_mask,
            )
        )

    return features

def load_and_cache_examples(args, tokenizer, mode):
    processor = processors[args.task](args)

    # Load data features from cache or dataset file
    cached_features_file = os.path.join(
        args.data_dir,
        "cached_{}_{}_{}_{}".format(
            mode,
            args.task,
            list(filter(None, args.model_name_or_path.split("/"))).pop(),
            args.max_seq_len,
        ),
    )

    if os.path.exists(cached_features_file):
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        if mode == "train":
            examples = processor.get_examples("train")
        elif mode == "dev":
            examples = processor.get_examples("dev")
        elif mode == "test":
            examples = processor.get_examples("test")
        else:
            raise Exception("For mode, only train, dev, test is available")

        features = convert_examples_to_features(
            examples, args.max_seq_len, tokenizer, add_sep_token=args.add_sep_token
        )
        logger.info("Saving features into cached file %s", cached_features_file)
        torch.save(features, cached_features_file)

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    all_e1_mask = torch.tensor([f.e1_mask for f in features], dtype=torch.long)  # add e1 mask
    all_e2_mask = torch.tensor([f.e2_mask for f in features], dtype=torch.long)  # add e2 mask

    all_label_ids = torch.tensor([f.label_id for f in features], dtype=torch.long)

    dataset = TensorDataset(
        all_input_ids,
        all_attention_mask,
        all_token_type_ids,
        all_label_ids,
        all_e1_mask,
        all_e2_mask,
    )
    return dataset

This uses the get_label function from utils.py:

def get_label(args):
    return [label.strip() for label in open(os.path.join(args.data_dir, args.label_file), "r", encoding="utf-8")]
The contents of label.txt are as follows (a small indexing check follows the list):

Other
Cause-Effect(e1,e2)
Cause-Effect(e2,e1)
Instrument-Agency(e1,e2)
Instrument-Agency(e2,e1)
Product-Producer(e1,e2)
Product-Producer(e2,e1)
Content-Container(e1,e2)
Content-Container(e2,e1)
Entity-Origin(e1,e2)
Entity-Origin(e2,e1)
Entity-Destination(e1,e2)
Entity-Destination(e2,e1)
Component-Whole(e1,e2)
Component-Whole(e2,e1)
Member-Collection(e1,e2)
Member-Collection(e2,e1)
Message-Topic(e1,e2)
Message-Topic(e2,e1)
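
With label.txt as above, get_label simply returns the labels in file order, and _create_examples stores each example's label as its position in that list. A quick sanity check (a sketch only; it assumes args.data_dir and args.label_file point at the file above):

relation_labels = get_label(args)
print(len(relation_labels))                              # 19
print(relation_labels.index("Other"))                    # 0
print(relation_labels.index("Cause-Effect(e1,e2)"))      # 1
print(relation_labels.index("Component-Whole(e2,e1)"))   # 14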
Finally, everything is tied together like this:

import argparse

from data_loader import load_and_cache_examples
from trainer import Trainer
from utils import init_logger, load_tokenizer, set_seed

def main(args):
    init_logger()
    set_seed(args)
    tokenizer = load_tokenizer(args)

    train_dataset = load_and_cache_examples(args, tokenizer, mode="train")

This uses init_logger, load_tokenizer, and set_seed from utils.py:

import logging
import os
import random

import numpy as np
import torch
from transformers import BertTokenizer

ADDITIONAL_SPECIAL_TOKENS = ["<e1>", "</e1>", "<e2>", "</e2>"]  # entity marker tokens

def get_label(args):
    return [label.strip() for label in open(os.path.join(args.data_dir, args.label_file), "r", encoding="utf-8")]

def load_tokenizer(args):
    tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path)
    # Registering the markers as special tokens keeps the tokenizer from splitting them into word pieces
    tokenizer.add_special_tokens({"additional_special_tokens": ADDITIONAL_SPECIAL_TOKENS})
    return tokenizer
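
The article references init_logger and set_seed but does not reproduce them. A typical implementation, sketched here as an assumption rather than the author's exact utils.py, would be:

def init_logger():
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
    )

def set_seed(args):
    # Seed every RNG in play so that runs are reproducible
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if not args.no_cuda and torch.cuda.is_available():
        torch.cuda.manual_seed_all(args.seed)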
The relevant command-line arguments are defined as follows:

parser = argparse.ArgumentParser()

parser.add_argument("--task", default="semeval", type=str, help="The name of the task to train")
parser.add_argument(
    "--data_dir",
    default="./data",
    type=str,
    help="The input data dir. Should contain the .tsv files (or other data files) for the task.",
)
parser.add_argument("--model_dir", default="./model", type=str, help="Path to model")
parser.add_argument(
    "--eval_dir",
    default="./eval",
    type=str,
    help="Evaluation script, result directory",
)
parser.add_argument("--train_file", default="train.tsv", type=str, help="Train file")
parser.add_argument("--test_file", default="test.tsv", type=str, help="Test file")
parser.add_argument("--label_file", default="label.txt", type=str, help="Label file")

parser.add_argument(
    "--model_name_or_path",
    type=str,
    default="bert-base-uncased",
    help="Model Name or Path",
)

parser.add_argument("--seed", type=int, default=77, help="random seed for initialization")
parser.add_argument("--train_batch_size", default=16, type=int, help="Batch size for training.")
parser.add_argument("--eval_batch_size", default=32, type=int, help="Batch size for evaluation.")
parser.add_argument(
    "--max_seq_len",
    default=384,
    type=int,
    help="The maximum total input sequence length after tokenization.",
)
parser.add_argument(
    "--learning_rate",
    default=2e-5,
    type=float,
    help="The initial learning rate for Adam.",
)
parser.add_argument(
    "--num_train_epochs",
    default=10.0,
    type=float,
    help="Total number of training epochs to perform.",
)
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
parser.add_argument(
    "--gradient_accumulation_steps",
    type=int,
    default=1,
    help="Number of updates steps to accumulate before performing a backward/update pass.",
)
parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
parser.add_argument(
    "--max_steps",
    default=-1,
    type=int,
    help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
)
parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
parser.add_argument(
    "--dropout_rate",
    default=0.1,
    type=float,
    help="Dropout for fully-connected layers",
)

parser.add_argument("--logging_steps", type=int, default=250, help="Log every X updates steps.")
parser.add_argument(
    "--save_steps",
    type=int,
    default=250,
    help="Save checkpoint every X updates steps.",
)

parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the test set.")
parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
parser.add_argument(
    "--add_sep_token",
    action="store_true",
    help="Add [SEP] token at the end of the sentence",
)

args = parser.parse_args()

main(args)
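
Assuming the argument-parsing snippet above lives in the same script as main (the file name main.py is an assumption, not stated in the article), training would be launched with something like: python main.py --task semeval --do_train --do_eval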

Step-by-step walkthrough of the data-processing code
The entry point is the load_and_cache_examples(args, tokenizer, mode) function: args carries the configured parameters, tokenizer converts tokens (characters or symbols) into their numeric ids, and mode indicates whether we are loading training, dev, or test data.
Inside load_and_cache_examples, the first step is processor = processors[args.task](args). Here processors is a dictionary whose keys are dataset names and whose values are the processor classes that handle them; to use a different dataset, add a corresponding key-value pair here, as sketched below.
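For example, a hypothetical MyTaskProcessor (not part of the original code) could be registered like this, provided it exposes the same interface as SemEvalProcessor:

# Hypothetical sketch: registering an additional dataset processor
class MyTaskProcessor(SemEvalProcessor):
    pass  # override _create_examples / file handling as needed for the new data

processors = {
    "semeval": SemEvalProcessor,
    "mytask": MyTaskProcessor,  # selected via --task mytask
}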
The args object is then passed to the SemEvalProcessor constructor. Its job is to produce one sample per line, each represented by an InputExample holding the sample's unique id (guid), its text, and its label index; the result is a list of InputExample objects.
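For instance, the made-up train.tsv line shown earlier would roughly become the following object (a sketch; the guid suffix is the row index and the label is the index of the relation in label.txt):

example = InputExample(
    guid="train-0",   # "<set_type>-<row index>"
    text_a="The <e1> earthquake </e1> generated a large <e2> tsunami </e2> .",
    label=1,          # relation_labels.index("Cause-Effect(e1,e2)")
)
print(example)        # __repr__ delegates to to_json_string()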
The examples are then converted by calling
convert_examples_to_features(
    examples, args.max_seq_len, tokenizer, add_sep_token=args.add_sep_token
)
For each example, the following are computed:
input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids,
label_id=label_id,
e1_mask=e1_mask,
e2_mask=e2_mask,
These are wrapped into an InputFeatures object, and the function finally returns a list of InputFeatures.
A few details worth being clear about (illustrated in the sketch below):
The <e1> and </e1> markers around entity 1 are replaced with $, and the <e2> and </e2> markers around entity 2 with #.
Because [CLS] is prepended, the recorded entity positions are shifted by +1.
Whether a [SEP] token is appended is controlled by add_sep_token.
Sentences shorter than max_seq_len are padded, and longer ones are truncated.
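A minimal sketch of the marker replacement and entity-mask construction, using the made-up sentence from earlier and a hand-tokenized list instead of the real BERT tokenizer:

# Toy illustration of what convert_examples_to_features does per example
tokens_a = ["the", "<e1>", "earthquake", "</e1>", "generated", "a", "large",
            "<e2>", "tsunami", "</e2>", "."]

e11_p, e12_p = tokens_a.index("<e1>"), tokens_a.index("</e1>")  # 1, 3
e21_p, e22_p = tokens_a.index("<e2>"), tokens_a.index("</e2>")  # 7, 9

tokens_a[e11_p] = tokens_a[e12_p] = "$"
tokens_a[e21_p] = tokens_a[e22_p] = "#"

# +1 because [CLS] will be prepended in front of the sentence
e11_p, e12_p, e21_p, e22_p = e11_p + 1, e12_p + 1, e21_p + 1, e22_p + 1

tokens = ["[CLS]"] + tokens_a  # no [SEP] since add_sep_token=False by default
max_seq_len = 16
e1_mask = [0] * max_seq_len
e2_mask = [0] * max_seq_len
for i in range(e11_p, e12_p + 1):
    e1_mask[i] = 1  # covers "$ earthquake $"
for i in range(e21_p, e22_p + 1):
    e2_mask[i] = 1  # covers "# tsunami #"

print(tokens)
# ['[CLS]', 'the', '$', 'earthquake', '$', 'generated', 'a', 'large', '#', 'tsunami', '#', '.']
print(e1_mask)  # [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(e2_mask)  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]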
Finally, the per-feature lists are converted into tensors and assembled:
dataset = TensorDataset(
    all_input_ids,
    all_attention_mask,
    all_token_type_ids,
    all_label_ids,
    all_e1_mask,
    all_e2_mask,
)
The resulting TensorDataset is returned.
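Downstream, the training code (not shown in this article) would iterate over this dataset with a DataLoader; a minimal sketch of how the returned TensorDataset is consumed, assuming the batch size comes from args.train_batch_size:

from torch.utils.data import DataLoader, RandomSampler

train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)

for batch in train_dataloader:
    # Tensors come back in the order they were passed to TensorDataset
    input_ids, attention_mask, token_type_ids, label_ids, e1_mask, e2_mask = batch
    break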