赞
踩
使用Transformers.trainer 进行实现
code_dir:
https://gitee.com/liuyu_1997/ml-nlp/blob/master/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0/nlp/BertNER/BertNER.ipynb
加载数据以及数据的展示,这里使用最常见的conll2003数据集进行实验
task = "ner" # Should be one of "ner", "pos" or "chunk"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16
from datasets import load_dataset, load_metric,Dataset
datasets = load_dataset("conll2003")
展示数据集的第一条数据
datasets["train"][0]
{'id': '0',
'tokens': ['EU',
'rejects',
'German',
'call',
'to',
'boycott',
'British',
'lamb',
'.'],
'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}
datasets["train"].features[f"ner_tags"].feature.names
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
datasets["train"].features[f"chunk_tags"].feature.names
['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP']
datasets["train"].features[f"pos_tags"].feature.names
['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer不仅可以 对 字符串进行序列化 还可以对 分词后的token进行序列化 需要设置 is_split_into_words=True
tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."], is_split_into_words=True)
{'input_ids': [101, 7592, 1010, 2023, 2003, 2028, 6251, 3975, 2046, 2616, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokenizer.decode([101, 7592, 1010, 2023, 2003, 2028, 6251, 3975, 2046, 2616, 1012, 102])
'[CLS] hello, this is one sentence split into words. [SEP]'
值得注意的是 tokenizer可能将 单词 分割成 单词的 词根或词缀 即 经过 tokenizer后 序列的长度可能发生改变
Transformers通常使用子词标记器进行预训练,这意味着即使您的输入已经被分割成单词,这些单词中的每一个都可以被标记器再次分割。让我们看一个例子:
example = datasets["train"][4]
print("原始token:",example["tokens"])
print("-"*100)
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print("转换后的 token:",tokens)
print("-"*100)
tokenizer.decode(tokenized_input["input_ids"])
原始token: ['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']
----------------------------------------------------------------------------------------------------
转换后的 token: ['[CLS]', 'germany', "'", 's', 'representative', 'to', 'the', 'european', 'union', "'", 's', 'veterinary', 'committee', 'werner', 'z', '##wing', '##mann', 'said', 'on', 'wednesday', 'consumers', 'should', 'buy', 'sheep', '##me', '##at', 'from', 'countries', 'other', 'than', 'britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.', '[SEP]']
----------------------------------------------------------------------------------------------------
"[CLS] germany's representative to the european union's veterinary committee werner zwingmann said on wednesday consumers should buy sheepmeat from countries other than britain until the scientific advice was clearer. [SEP]"
这意味着我们需要对标签 做一些处理。 因为 tokenizer 返回的 id 比我们的数据集所包含的标签列表要长,其原因了是 单词被再次拆分
或者 添加了一些特殊的标记 例如 CLS 和 SEP
len(example[task+"_tags"]), len(tokenized_input["input_ids"]) # ner_tags长度 和 input_ids 长度无法匹配
(31, 39)
为此,我们可以使用 tokenized_input.word_ids()方法 来进行操作
print(tokenized_input.word_ids())
print(len(tokenized_input.word_ids()))
[None, 0, 1, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 11, 11, 12, 13, 14, 15, 16, 17, 18, 18, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, None]
39
正如我们所看到的,它返回一个列表,其中的元素数量与我们处理过的输入id相同,将特殊标记映射为None,将所有其他标记映射为各自的词。
这样,我们就可以将标签与处理后的输入id对齐。(其中 相同的数字表示 由同一个词 拆分而成的 子token )
word_ids = tokenized_input.word_ids()
aligned_labels = [-100 if i is None else example[f"{task}_tags"][i] for i in word_ids]
print(len(aligned_labels), len(tokenized_input["input_ids"]))
39 39
在这里,我们将所有特殊标记的标签设置为-100(PyTorch所忽略的索引),将所有其他标记的标签设置为它们所来自的单词的标签。另一种策略是只对从一个给定的单词中获得的第一个标记设置标签,而对来自同一单词的其他子标记给予-100的标签。我们在此提出这两种策略,只需改变以下标志的值。
label_all_tokens = True # True 是第一种策略 Fale 是第二种策略
def tokenize_and_align_labels(examples): tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True) labels = [] for i, label in enumerate(examples[f"{task}_tags"]): word_ids = tokenized_inputs.word_ids(batch_index=i) previous_word_idx = None label_ids = [] for word_idx in word_ids: if word_idx is None: label_ids.append(-100) elif word_idx != previous_word_idx: label_ids.append(label[word_idx]) # For the other tokens in a word, we set the label to either the current label or -100, depending on # the label_all_tokens flag. else: label_ids.append(label[word_idx] if label_all_tokens else -100) previous_word_idx = word_idx labels.append(label_ids) tokenized_inputs["labels"] = labels return tokenized_inputs
tokenize_and_align_labels(datasets['train'][:1])
{'input_ids': [[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]]}
tokenizer.batch_decode(tokenize_and_align_labels(datasets['train'][:1])["input_ids"])
['[CLS] eu rejects german call to boycott british lamb. [SEP]']
将数据集整体 进行token对齐操作 调用map()
datasets
DatasetDict({
train: Dataset({
features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
num_rows: 14042
})
validation: Dataset({
features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
num_rows: 3251
})
test: Dataset({
features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
num_rows: 3454
})
})
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)
0%| | 0/15 [00:00<?, ?ba/s]
0%| | 0/4 [00:00<?, ?ba/s]
0%| | 0/4 [00:00<?, ?ba/s]
tokenized_datasets
DatasetDict({
train: Dataset({
features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
num_rows: 14042
})
validation: Dataset({
features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
num_rows: 3251
})
test: Dataset({
features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
num_rows: 3454
})
})
tokenized_datasets["train"]["labels"][0]
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]
tokenized_datasets["train"]["ner_tags"][0]
[3, 0, 7, 0, 0, 0, 7, 0, 0]
进行模型Fine-tuning
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
label_list = datasets["train"].features[f"{task}_tags"].feature.names
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))
Downloading: 0%| | 0.00/256M [00:00<?, ?B/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForTokenClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
f"{model_name}-finetuned-{task}",
evaluation_strategy = "epoch",
learning_rate=2e-5,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
num_train_epochs=3,
weight_decay=0.01,
)
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer)
! pip install seqeval
metric = load_metric("seqeval")
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) Looking in indexes: https://mirrors.ustc.edu.cn/pypi/web/simple [33mWARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fc0e6d54cd0>: Failed to establish a new connection: [Errno -2] Name or service not known')': /pypi/web/simple/seqeval/[0m[33m [0m[33mWARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fc0e6d6e190>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /pypi/web/simple/seqeval/[0m[33m [0mCollecting seqeval Downloading https://mirrors.bfsu.edu.cn/pypi/web/packages/9d/2d/233c79d5b4e5ab1dbf111242299153f3caddddbb691219f363ad55ce783d/seqeval-1.2.2.tar.gz (43 kB) [2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.6/43.6 KB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0m [?25h Preparing metadata (setup.py) ... [?25ldone [?25hRequirement already satisfied: numpy>=1.14.0 in /home/zutnlp/miniconda3/envs/liuyu/lib/python3.7/site-packages (from seqeval) (1.19.5) Requirement already satisfied: scikit-learn>=0.21.3 in /home/zutnlp/miniconda3/envs/liuyu/lib/python3.7/site-packages (from seqeval) (1.0.2) Requirement already satisfied: scipy>=1.1.0 in /home/zutnlp/miniconda3/envs/liuyu/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval) (1.7.3) Requirement already satisfied: threadpoolctl>=2.0.0 in /home/zutnlp/miniconda3/envs/liuyu/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval) (2.2.0) Requirement already satisfied: joblib>=0.11 in /home/zutnlp/miniconda3/envs/liuyu/lib/python3.7/site-packages (from scikit-learn>=0.21.3->seqeval) (1.1.0) Building wheels for collected packages: seqeval Building wheel for seqeval (setup.py) ... [?25ldone [?25h Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16180 sha256=c2ceaaad2863968f5428daced7949c6ef089729d9617dc5ed07bd6a4893a8ac6 Stored in directory: /home/zutnlp/.cache/pip/wheels/11/f9/5f/edc55bc2839444a3a60c455e3a9e75879a3e489c06fd92bdf2 Successfully built seqeval Installing collected packages: seqeval Successfully installed seqeval-1.2.2 [33mWARNING: You are using pip version 22.0.3; however, version 22.1 is available. You should consider upgrading via the '/home/zutnlp/miniconda3/envs/liuyu/bin/python -m pip install --upgrade pip' command.[0m[33m [0m
评估算法
import numpy as np def compute_metrics(p): predictions, labels = p predictions = np.argmax(predictions, axis=2) # 删除忽略的索引(特殊令牌) true_predictions = [ [label_list[p] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels) ] true_labels = [ [label_list[l] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels) ] results = metric.compute(predictions=true_predictions, references=true_labels) return { "precision": results["overall_precision"], "recall": results["overall_recall"], "f1": results["overall_f1"], "accuracy": results["overall_accuracy"], }
trainer = Trainer(
model,
args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
compute_metrics=compute_metrics
)
trainer.train()
The following columns in the training set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, chunk_tags, pos_tags, ner_tags, tokens. If id, chunk_tags, pos_tags, ner_tags, tokens are not expected by `DistilBertForTokenClassification.forward`, you can safely ignore this message. /home/zutnlp/miniconda3/envs/liuyu/lib/python3.7/site-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning FutureWarning, ***** Running training ***** Num examples = 14042 Num Epochs = 3 Instantaneous batch size per device = 16 Total train batch size (w. parallel, distributed & accumulation) = 16 Gradient Accumulation steps = 1 Total optimization steps = 2634 <div> <progress value='2' max='2634' style='width:300px; height:20px; vertical-align: middle;'></progress> [ 2/2634 : < :, Epoch 0.00/3] </div> <table border="1" class="dataframe">
Saving model checkpoint to distilbert-base-uncased-finetuned-ner/checkpoint-500 Configuration saved in distilbert-base-uncased-finetuned-ner/checkpoint-500/config.json Model weights saved in distilbert-base-uncased-finetuned-ner/checkpoint-500/pytorch_model.bin tokenizer config file saved in distilbert-base-uncased-finetuned-ner/checkpoint-500/tokenizer_config.json Special tokens file saved in distilbert-base-uncased-finetuned-ner/checkpoint-500/special_tokens_map.json The following columns in the evaluation set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, chunk_tags, pos_tags, ner_tags, tokens. If id, chunk_tags, pos_tags, ner_tags, tokens are not expected by `DistilBertForTokenClassification.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 3251 Batch size = 16 Saving model checkpoint to distilbert-base-uncased-finetuned-ner/checkpoint-1000 Configuration saved in distilbert-base-uncased-finetuned-ner/checkpoint-1000/config.json Model weights saved in distilbert-base-uncased-finetuned-ner/checkpoint-1000/pytorch_model.bin tokenizer config file saved in distilbert-base-uncased-finetuned-ner/checkpoint-1000/tokenizer_config.json Special tokens file saved in distilbert-base-uncased-finetuned-ner/checkpoint-1000/special_tokens_map.json Saving model checkpoint to distilbert-base-uncased-finetuned-ner/checkpoint-1500 Configuration saved in distilbert-base-uncased-finetuned-ner/checkpoint-1500/config.json Model weights saved in distilbert-base-uncased-finetuned-ner/checkpoint-1500/pytorch_model.bin tokenizer config file saved in distilbert-base-uncased-finetuned-ner/checkpoint-1500/tokenizer_config.json Special tokens file saved in distilbert-base-uncased-finetuned-ner/checkpoint-1500/special_tokens_map.json The following columns in the evaluation set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, chunk_tags, pos_tags, ner_tags, tokens. If id, chunk_tags, pos_tags, ner_tags, tokens are not expected by `DistilBertForTokenClassification.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 3251 Batch size = 16 Saving model checkpoint to distilbert-base-uncased-finetuned-ner/checkpoint-2000 Configuration saved in distilbert-base-uncased-finetuned-ner/checkpoint-2000/config.json Model weights saved in distilbert-base-uncased-finetuned-ner/checkpoint-2000/pytorch_model.bin tokenizer config file saved in distilbert-base-uncased-finetuned-ner/checkpoint-2000/tokenizer_config.json Special tokens file saved in distilbert-base-uncased-finetuned-ner/checkpoint-2000/special_tokens_map.json Saving model checkpoint to distilbert-base-uncased-finetuned-ner/checkpoint-2500 Configuration saved in distilbert-base-uncased-finetuned-ner/checkpoint-2500/config.json Model weights saved in distilbert-base-uncased-finetuned-ner/checkpoint-2500/pytorch_model.bin tokenizer config file saved in distilbert-base-uncased-finetuned-ner/checkpoint-2500/tokenizer_config.json Special tokens file saved in distilbert-base-uncased-finetuned-ner/checkpoint-2500/special_tokens_map.json The following columns in the evaluation set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, chunk_tags, pos_tags, ner_tags, tokens. If id, chunk_tags, pos_tags, ner_tags, tokens are not expected by `DistilBertForTokenClassification.forward`, you can safely ignore this message. ***** Running Evaluation ***** Num examples = 3251 Batch size = 16 Training completed. Do not forget to share your model on huggingface.co/models =) TrainOutput(global_step=2634, training_loss=0.08670986667218857, metrics={'train_runtime': 96.0161, 'train_samples_per_second': 438.739, 'train_steps_per_second': 27.433, 'total_flos': 510309848641824.0, 'train_loss': 0.08670986667218857, 'epoch': 3.0})
trainer.evaluate()
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, chunk_tags, pos_tags, ner_tags, tokens. If id, chunk_tags, pos_tags, ner_tags, tokens are not expected by `DistilBertForTokenClassification.forward`, you can safely ignore this message.
***** Running Evaluation *****
Num examples = 3251
Batch size = 16
[ 1/204 : < :]
{'eval_loss': 0.061025530099868774,
'eval_precision': 0.9237063246351173,
'eval_recall': 0.9345564380803222,
'eval_f1': 0.9290997052772062,
'eval_accuracy': 0.9831127774159213,
'eval_runtime': 2.6055,
'eval_samples_per_second': 1247.758,
'eval_steps_per_second': 78.297,
'epoch': 3.0}
predictions, labels, _ = trainer.predict(tokenized_datasets["test"])
predictions = np.argmax(predictions, axis=2)
# Remove ignored index (special tokens)
true_predictions = [
[label_list[p] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
true_labels = [
[label_list[l] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
results = metric.compute(predictions=true_predictions, references=true_labels)
results
The following columns in the test set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, chunk_tags, pos_tags, ner_tags, tokens. If id, chunk_tags, pos_tags, ner_tags, tokens are not expected by `DistilBertForTokenClassification.forward`, you can safely ignore this message. ***** Running Prediction ***** Num examples = 3454 Batch size = 16 {'LOC': {'precision': 0.8881818181818182, 'recall': 0.9199623352165726, 'f1': 0.9037927844588344, 'number': 2124}, 'MISC': {'precision': 0.7567567567567568, 'recall': 0.7309236947791165, 'f1': 0.7436159346271707, 'number': 996}, 'ORG': {'precision': 0.8615443134271586, 'recall': 0.875193199381762, 'f1': 0.8683151236342725, 'number': 2588}, 'PER': {'precision': 0.9602673598217601, 'recall': 0.9514348785871964, 'f1': 0.9558307152097579, 'number': 2718}, 'overall_precision': 0.887906647807638, 'overall_recall': 0.8940185141229527, 'overall_f1': 0.8909520993494974, 'overall_accuracy': 0.974760030384642}
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。