Tutorial: SKEP, a Pretrained Model for Sentiment Analysis

This project demonstrates how to use the pretrained sentiment-analysis model SKEP to complete sentence-level sentiment classification, aspect-level sentiment classification, and opinion extraction. Along the way, starting from the sentiment-analysis task, it introduces a traditional text-classification model (TextCNN), the pretrained model SKEP, and how to use both in PaddleNLP.

The project has four parts: "The sentiment analysis task", "Common datasets", "A traditional sentiment classifier: TextCNN", and "The pretrained sentiment model SKEP".

In [ ]
!pip install --upgrade paddlenlp

The sentiment analysis task

Natural language is rich in sentiment: it expresses emotions (sadness, joy), moods (fatigue, melancholy), preferences (like, dislike), personality traits, and stances. Sentiment analysis is used in scenarios such as product-preference mining, purchase decisions, and public-opinion analysis. Automatically analyzing these sentiment tendencies helps a company understand how consumers feel about its products and gives a basis for product improvement; it also helps the company analyze the attitudes of its business partners and make better business decisions.

Sentiment analysis is usually treated as a three-way classification problem:

Positive: positive emotions, such as happiness, joy, surprise, anticipation.
Negative: negative emotions, such as sadness, grief, anger, fear.
Other: all remaining sentiment types.

Sentiment analysis data

ChnSentiCorp is a public Chinese sentiment-analysis dataset with two classes. PaddleNLP ships with this dataset, so it can be loaded with a single call.

In [ ]
from paddlenlp.datasets import load_dataset

train_ds, dev_ds, test_ds = load_dataset(
    "chnsenticorp", splits=["train", "dev", "test"])

idx = 0
for data in train_ds:
    print(data)
    idx += 1
    if idx >= 3:
        break

A traditional sentiment classifier: TextCNN

Traditional sentiment classifiers use networks such as CNN, RNN, LSTM, or GRU to encode a text into a single vector. Recurrent networks (RNN, LSTM, GRU) cannot be parallelized, while a CNN is much faster and, because it parallelizes well, is widely used in industry. In 2014, Yoon Kim proposed TextCNN for text classification and obtained good results. In a text, not every token depends on every other token, so n-gram information can capture local correlations. A CNN works on exactly this principle: convolution kernels capture local features of the text, and several kernels of different widths capture several n-gram sizes at once (a minimal sketch of the CNNEncoder appears after the component list below).

PaddleNLP provides the sequence-modeling module paddlenlp.seq2vec, which turns a token sequence into a single semantic text vector.

For more about seq2vec, see: [paddlenlp.seq2vec是什么?快来看看如何用它完成情感分析任务](https://aistudio.baidu.com/aistudio/projectdetail/1283423)

Next, we implement the TextCNN model from three components:

paddle.nn.Embedding builds the word-embedding layer
paddlenlp.seq2vec.CNNEncoder builds the sentence-encoding layer
paddle.nn.Linear builds the binary classifier
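Before assembling the full model, the following minimal sketch (not part of the original notebook) shows how paddlenlp.seq2vec.CNNEncoder maps a batch of token embeddings to one fixed-size vector; the batch size, sequence length, and embedding size below are made-up values.

import paddle
import paddlenlp as nlp

# made-up toy shapes: 4 sentences, 20 tokens each, 128-dim embeddings
embedded_text = paddle.randn([4, 20, 128])
encoder = nlp.seq2vec.CNNEncoder(
    emb_dim=128, num_filter=128, ngram_filter_sizes=(1, 2, 3))

# one max-pooled feature vector per ngram width, concatenated:
# output dim = len(ngram_filter_sizes) * num_filter = 3 * 128 = 384
print(encoder.get_output_dim())        # 384
print(encoder(embedded_text).shape)    # [4, 384]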
""" def __init__(self, vocab_size, num_classes, emb_dim=128, padding_idx=0, num_filter=128, ngram_filter_sizes=(1, 2, 3), fc_hidden_size=96): super().__init__() self.embedder = nn.Embedding( vocab_size, emb_dim, padding_idx=padding_idx) self.encoder = nlp.seq2vec.CNNEncoder( emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes) self.fc = nn.Linear(self.encoder.get_output_dim(), fc_hidden_size) self.output_layer = nn.Linear(fc_hidden_size, num_classes) def forward(self, text): # Shape: (batch_size, num_tokens, embedding_dim) embedded_text = self.embedder(text) # Shape: (batch_size, len(ngram_filter_sizes)*num_filter) encoder_out = self.encoder(embedded_text) encoder_out = paddle.tanh(encoder_out) # Shape: (batch_size, fc_hidden_size) fc_out = paddle.tanh(self.fc(encoder_out)) # Shape: (batch_size, num_classes) logits = self.output_layer(fc_out) return logits model = TextCNNModel( len(vocab.idx_to_token), len(train_ds.label_list), padding_idx=vocab.to_indices('[PAD]')) model = paddle.Model(model) 构建词汇表 由于TextCNN模型输入的是文本单词,所以我们还需要对文本进行切词操作。 首先需要对整体语料构造词表。通过切词统计词频,去除低频词,从而完成构造词表。我们使用jieba作为中文切词工具。 停用词表,我们从网上直接获取:https://github.com/goto456/stopwords/blob/master/baidu_stopwords.txt In [ ] import os from collections import Counter from itertools import chain import jieba def sort_and_write_words(all_words, file_path): words = list(chain(*all_words)) words_vocab = Counter(words).most_common() with open(file_path, "w", encoding="utf8") as f: f.write('[UNK]\n[PAD]\n') # filter the count of words below 5 # 过滤低频词,词频<5 for word, num in words_vocab: if num < 5: continue f.write(word + "\n") all_texts = [data['text'] for data in train_ds] all_texts += [data['text'] for data in dev_ds] all_texts += [data['text'] for data in test_ds] all_words = [] for text in all_texts: words = jieba.lcut(text) words = [word for word in words if word.strip() !=''] all_words.append(words) # 写入词表 sort_and_write_words(all_words, "work/vocab.txt") In [ ] # 词汇表大小 !wc -l work/vocab.txt # 停用词表大小 !wc -l work/stopwords.txt 还需对数据作以下处理: 将原始数据处理成模型可以读入的格式。首先使用jieba切词,之后将jieba切完后的单词映射词表中单词id。 使用paddle.io.DataLoader接口多线程异步加载数据。 In [ ] from functools import partial from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab from utils import create_dataloader,convert_example vocab = Vocab.load_vocabulary( "work/vocab.txt", unk_token='[UNK]', pad_token='[PAD]') tokenizer = JiebaTokenizer(vocab) trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False) # 将读入的数据batch化处理,便于模型batch化运算。 # batch中的每个句子将会padding到这个batch中的文本最大长度batch_max_seq_len。 # 当文本长度大于batch_max_seq时,将会截断到batch_max_seq_len;当文本长度小于batch_max_seq时,将会padding补齐到batch_max_seq_len. batch_size = 64 batchify_fn = lambda samples, fn=Tuple( Pad(axis=0, pad_val=vocab.token_to_idx.get('[PAD]', 1)), # word_ids Stack(dtype="int64") # label ): [data for data in fn(samples)] train_loader = create_dataloader( train_ds, trans_fn=trans_fn, batch_size=batch_size, mode='train', batchify_fn=batchify_fn) dev_loader = create_dataloader( dev_ds, trans_fn=trans_fn, batch_size=batch_size, mode='validation', batchify_fn=batchify_fn) TextCNN模型训练 处理完了数据之后,还需要定义优化器和损失函数。此处选择准确率Accuracy作为评价指标。 In [ ] # 定义优化器、损失和评价指标. 
TextCNN training

With the data ready, we still need to define the optimizer and the loss function; accuracy is used as the evaluation metric.

In [ ]
# Define the optimizer, loss, and metric.
optimizer = paddle.optimizer.Adam(
    parameters=model.parameters(), learning_rate=5e-5)
criterion = paddle.nn.CrossEntropyLoss()
metric = paddle.metric.Accuracy()

model.prepare(optimizer, criterion, metric)

# Start training and evaluation.
model.fit(train_loader, dev_loader, epochs=5, save_dir='./textcnn_ckpt')

The pretrained sentiment model SKEP

In recent years, a large body of research has shown that pretrained models (PTMs) trained on large corpora learn general-purpose language representations, which benefit downstream NLP tasks and avoid training models from scratch. With growing compute, deeper architectures (notably the Transformer), and better training techniques, PTMs have steadily evolved from shallow to deep.

SKEP (Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis) enhances pretraining with sentiment knowledge and surpasses the previous state of the art on 14 typical Chinese and English sentiment-analysis tasks; the work was accepted at ACL 2020. Proposed by Baidu Research, SKEP mines sentiment knowledge with unsupervised methods and then uses that knowledge to build pretraining objectives, so that the model learns sentiment semantics. SKEP provides a unified and powerful sentiment representation for a wide range of sentiment-analysis tasks.

Paper: https://arxiv.org/abs/2005.05635

Baidu Research validated SKEP on three typical sentiment-analysis tasks, sentence-level sentiment classification, aspect-level sentiment classification, and opinion role labeling, across 14 Chinese and English datasets. Initialized from the general pretrained model ERNIE (an internal version), SKEP improves over ERNIE by about 1.2% on average and over the previous SOTA by about 2% on average (see the paper for the full result table).

Let us again take the sentence-level ChnSentiCorp task and see how SKEP performs.

Loading SKEP

PaddleNLP already implements the SKEP pretrained model, so it can be loaded with one line of code.

In [ ]
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

model = SkepForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="skep_ernie_1.0_large_ch",
    num_classes=2)  # == len(train_ds.label_list)
tokenizer = SkepTokenizer.from_pretrained(
    pretrained_model_name_or_path="skep_ernie_1.0_large_ch")

SkepForSequenceClassification can be used for both sentence-level and aspect-level sentiment analysis: it obtains a representation of the input text from the pretrained SKEP model and then classifies that representation.

pretrained_model_name_or_path: the model name. Supported values are "skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en", and "skep_roberta_large_en".
"skep_ernie_1.0_large_ch": a Chinese model obtained by continuing SKEP pretraining on massive Chinese data on top of ernie_1.0_large_ch.
"skep_ernie_2.0_large_en": an English model obtained by continuing SKEP pretraining on massive English data on top of ernie_2.0_large_en.
"skep_roberta_large_en": an English model obtained by continuing SKEP pretraining on massive English data on top of roberta_large_en.
num_classes: the number of classes in the dataset.

For details of the SKEP implementation, see: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/skep

Data processing

As before, the raw ChnSentiCorp data must be converted into the format the model can read. SKEP processes Chinese text at the character level, and the built-in SkepTokenizer handles this in one step.
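As a quick illustration (not part of the original notebook), the tokenizer call used in convert_example below returns a dict with the two fields the model consumes; the sentence here is made up and the exact ids depend on the skep_ernie_1.0_large_ch vocabulary.

encoded = tokenizer(text="这家酒店不错", max_seq_len=128)
# character-level token ids wrapped in [CLS] ... [SEP], plus segment ids
print(encoded["input_ids"])
print(encoded["token_type_ids"])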
In [ ]
def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequences for sequence
    classification tasks by concatenating and adding special tokens, and
    creates the token-type mask used in sequence-pair classification.

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the
    following format:

    - single sequence:   ``[CLS] X [SEP]``
    - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence-pair mask has
    the following format:

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence      | second sequence |

    If only one sequence is given, this method only returns the first portion
    of the mask (0s).

    Args:
        example(obj:`dict`): The input data, containing text and, if present,
            label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from
            :class:`~paddlenlp.transformers.PretrainedTokenizer`, which
            contains most of the methods.
        max_seq_length(obj:`int`): The maximum total input sequence length
            after tokenization. Longer sequences are truncated, shorter ones
            are padded.
        is_test(obj:`bool`, defaults to `False`): Whether the example lacks a
            label.

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj:`list[int]`): The list of token type ids.
        label(obj:`int`, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(text=example["text"], max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = example["label"]
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids


train_ds, dev_ds, test_ds = load_dataset(
    "chnsenticorp", splits=["train", "dev", "test"])

batch_size = 32
max_seq_length = 128

trans_func = partial(
    convert_example, tokenizer=tokenizer, max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="int64")  # labels
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

Training and evaluation

After defining the loss function, optimizer, and evaluation metric, training can start.

In [13]
import time

from utils import evaluate

epochs = 1
ckpt_dir = "skep_ckpt"
num_training_steps = len(train_data_loader) * epochs

# Apply weight decay to every parameter except biases and LayerNorm weights.
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]
optimizer = paddle.optimizer.AdamW(
    learning_rate=3e-6,
    parameters=model.parameters(),
    weight_decay=0.01,
    apply_decay_param_fun=lambda x: x in decay_params)

criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()

global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                   10 / (time.time() - tic_train)))
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        if global_step % 100 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            evaluate(model, criterion, metric, dev_data_loader)
            model.save_pretrained(save_dir)
            tokenizer.save_pretrained(save_dir)

Prediction

The trained model can then be used to predict the sentiment of new texts.

In [ ]
from utils import predict

data = [
    '这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般',
    '怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片',
    '作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。',
]
label_map = {0: 'negative', 1: 'positive'}

results = predict(
    model, data, tokenizer, label_map, batch_size, max_seq_length)
for idx, text in enumerate(data):
    print('Data: {} \t Label: {}'.format(text, results[idx]))

Aspect-level sentiment analysis

Besides classifying the sentiment of a whole sentence, researchers also analyze sentiment with respect to specific "aspects" mentioned in the sentence (aspect-level sentiment analysis). For example:

这个薯片口味有点咸,太辣了,不过口感很脆。 — The flavor aspect of the chips receives a negative evaluation (salty, too spicy), while the texture aspect receives a positive one (crispy).

我很喜欢夏威夷,就是这边的海鲜太贵了。 — Hawaii as a whole receives a positive evaluation (love it), while its seafood receives a negative one (too expensive).

SKEP also supports aspect-level sentiment analysis. Running the following commands completes training and prediction for this task (the scripts train_aspect.py and predict_aspect.py are listed at the end of this page); a small sketch of how an aspect-level example is organized follows the commands.

In [ ]
# Aspect-level sentiment analysis: training
!python train_aspect.py --save_dir skep_aspect

In [ ]
# Aspect-level sentiment analysis: prediction
!python predict_aspect.py --params_path skep_aspect/model_900/model_state.pdparams
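A minimal sketch (not part of the original scripts) of how an aspect-level example is encoded for SKEP. The field roles below are an assumption based on the text/text_pair fields consumed by train_aspect.py: the aspect term is assumed to sit in "text" and the full review in "text_pair".

example = {
    "text": "口感",  # the aspect being judged (assumed field role)
    "text_pair": "这个薯片口味有点咸,太辣了,不过口感很脆。",  # the full review
}
encoded = tokenizer(
    text=example["text"], text_pair=example["text_pair"], max_seq_len=128)
# [CLS] aspect [SEP] review [SEP]; token_type_ids mark the two segments
print(encoded["input_ids"])
print(encoded["token_type_ids"])

A SkepForSequenceClassification model fine-tuned on SE-ABSA16 (as in train_aspect.py) then predicts one polarity for each such (aspect, review) pair.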
Opinion extraction

Given a user review, this task extracts the triples (aspect term, opinion term, sentiment polarity) that express an opinion.

Example: 这家旅店服务还是不错的,但是房间比较简陋
Opinion 1: <服务, 不错, positive>
Opinion 2: <房间, 简陋, negative>

In [ ]
# Opinion extraction: training
!python train_opinion.py --save_dir skep_opinion

In [ ]
# Opinion extraction: prediction
!python predict_opinion.py --params_path skep_opinion/model_900/model_state.pdparams
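The extracted terms come out of the CRF head as per-token BIO tags (see predict_opinion.py below, where label_map is {0: "B", 1: "I", 2: "O"}). The following minimal sketch (not part of the original scripts, with a made-up tag sequence) shows how contiguous B/I tags are grouped back into extracted terms:

def bio_to_terms(tokens, tags):
    """Group contiguous B/I tags into extracted terms."""
    terms, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":  # start of a new term
            if current:
                terms.append("".join(current))
            current = [token]
        elif tag == "I" and current:  # continuation of the current term
            current.append(token)
        else:  # "O": outside any term
            if current:
                terms.append("".join(current))
                current = []
    if current:
        terms.append("".join(current))
    return terms

# made-up example
tokens = list("这家旅店服务还是不错的")
tags = ["O", "O", "O", "O", "B", "I", "O", "O", "O", "O", "O"]
print(bio_to_terms(tokens, tags))  # ['服务']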
The helper scripts referenced above are listed below.

utils.py
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License" # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. import numpy as np import paddle def read_vocab(vocab_path): vocab = {} with open(vocab_path, "r", encoding="utf8") as f: for idx, line in enumerate(f): word = line.strip("\n") vocab[word] = idx return vocab def create_dataloader(dataset, trans_fn=None, mode='train', batch_size=1, batchify_fn=None): """ Creats dataloader. Args: dataset(obj:`paddle.io.Dataset`): Dataset instance. trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging the sample list, None for only stack each fields of sample in axis 0(same as :attr::`np.stack(..., axis=0)`). Returns: dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. """ if trans_fn: dataset = dataset.map(trans_fn) shuffle = True if mode == 'train' else False if mode == "train": sampler = paddle.io.DistributedBatchSampler( dataset=dataset, batch_size=batch_size, shuffle=shuffle) else: sampler = paddle.io.BatchSampler( dataset=dataset, batch_size=batch_size, shuffle=shuffle) dataloader = paddle.io.DataLoader( dataset, batch_sampler=sampler, collate_fn=batchify_fn) return dataloader def convert_example(example, tokenizer, is_test=False): """ Builds model inputs from a sequence for sequence classification tasks. It use `jieba.cut` to tokenize text. Args: example(obj:`list[str]`): List of input data, containing text and label if it have label. tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. Returns: input_ids(obj:`list[int]`): The list of token ids. valid_length(obj:`int`): The input sequence valid length. label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. """ input_ids = tokenizer.encode(example["text"]) input_ids = np.array(input_ids, dtype='int64') if not is_test: label = np.array(example["label"], dtype="int64") return input_ids, label else: return input_ids @paddle.no_grad() def evaluate(model, criterion, metric, data_loader): """ Given a dataset, it evals model and computes the metric. Args: model(obj:`paddle.nn.Layer`): A model to classify texts. criterion(obj:`paddle.nn.Layer`): It can compute the loss. metric(obj:`paddle.metric.Metric`): The evaluation metric. data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. 
""" model.eval() metric.reset() losses = [] for batch in data_loader: input_ids, token_type_ids, labels = batch logits = model(input_ids, token_type_ids) loss = criterion(logits, labels) losses.append(loss.numpy()) correct = metric.compute(logits, labels) metric.update(correct) accu = metric.accumulate() print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu)) model.train() metric.reset() @paddle.no_grad() def predict(model, data, tokenizer, label_map, batch_size=1, max_seq_length=128): examples = [] for text in data: input_ids, token_type_ids = convert_example( text, tokenizer, max_seq_length=max_seq_length, is_test=True) examples.append((input_ids, token_type_ids)) # Seperates data into some batches. batches = [ examples[idx:idx + batch_size] for idx in range(0, len(examples), batch_size) ] batchify_fn = lambda samples, fn=Tuple( Pad(axis=0, pad_val=tokenizer.pad_token_id), # input ids Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # token type ids ): [data for data in fn(samples)] results = [] model.eval() for batch in batches: input_ids, token_type_ids = batchify_fn(batch) input_ids = paddle.to_tensor(input_ids) token_type_ids = paddle.to_tensor(token_type_ids) logits = model(input_ids, token_type_ids) probs = F.softmax(logits, axis=1) idx = paddle.argmax(probs, axis=1).numpy() idx = idx.tolist() labels = [label_map[i] for i in idx] results.extend(labels) return results
train_aspect.py
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. from functools import partial import argparse import os import random import time import numpy as np import paddle import paddle.nn.functional as F from paddlenlp.data import Stack, Tuple, Pad from paddlenlp.datasets import load_dataset from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer # yapf: disable parser = argparse.ArgumentParser() parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") parser.add_argument("--max_seq_length", default=400, type=int, help="The maximum total input sequence length after tokenization. " "Sequences longer than this will be truncated, sequences shorter will be padded.") parser.add_argument("--batch_size", default=6, type=int, help="Batch size per GPU/CPU for training.") parser.add_argument("--learning_rate", default=3e-6, type=float, help="The initial learning rate for Adam.") parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") parser.add_argument("--epochs", default=50, type=int, help="Total number of training epochs to perform.") parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") args = parser.parse_args() # yapf: enable def set_seed(seed): """Sets random seed.""" random.seed(seed) np.random.seed(seed) paddle.seed(seed) def convert_example(example, tokenizer, max_seq_length=512, is_test=False, dataset_name="chnsenticorp"): """ Builds model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. And creates a mask from the two sequences passed to be used in a sequence-pair classification task. A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format: :: - single sequence: ``[CLS] X [SEP]`` - pair of sequences: ``[CLS] A [SEP] B [SEP]`` A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format: :: 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 | first sequence | second sequence | If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s). note: There is no need token type ids for skep_roberta_large_ch model. Args: example(obj:`list[str]`): List of input data, containing text and label if it have label. tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` which contains most of the methods. Users should refer to the superclass for more information regarding methods. max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded. is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. dataset_name((obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2". Returns: input_ids(obj:`list[int]`): The list of token ids. token_type_ids(obj: `list[int]`): List of sequence pair mask. label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. """ encoded_inputs = tokenizer( text=example["text"], text_pair=example["text_pair"], max_seq_len=max_seq_length) input_ids = encoded_inputs["input_ids"] token_type_ids = encoded_inputs["token_type_ids"] if not is_test: label = np.array([example["label"]], dtype="int64") return input_ids, token_type_ids, label else: return input_ids, token_type_ids def create_dataloader(dataset, mode='train', batch_size=1, batchify_fn=None, trans_fn=None): if trans_fn: dataset = dataset.map(trans_fn) shuffle = True if mode == 'train' else False if mode == 'train': batch_sampler = paddle.io.DistributedBatchSampler( dataset, batch_size=batch_size, shuffle=shuffle) else: batch_sampler = paddle.io.BatchSampler( dataset, batch_size=batch_size, shuffle=shuffle) return paddle.io.DataLoader( dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) if __name__ == "__main__": set_seed(args.seed) paddle.set_device(args.device) rank = paddle.distributed.get_rank() if paddle.distributed.get_world_size() > 1: paddle.distributed.init_parallel_env() train_ds = load_dataset("seabsa16", "phns", splits=["train"]) model = SkepForSequenceClassification.from_pretrained( 'skep_ernie_1.0_large_ch', num_classes=len(train_ds.label_list)) tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch') trans_func = partial( convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) batchify_fn = lambda samples, fn=Tuple( Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # token_type_ids Stack(dtype="int64") # labels ): [data for data in fn(samples)] train_data_loader = create_dataloader( train_ds, mode='train', batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func) if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): state_dict = paddle.load(args.init_from_ckpt) model.set_dict(state_dict) model = paddle.DataParallel(model) num_training_steps = len(train_data_loader) * args.epochs # Generate parameter names needed to perform weight decay. # All bias and LayerNorm parameters are excluded. 
decay_params = [ p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"]) ] optimizer = paddle.optimizer.AdamW( learning_rate=args.learning_rate, parameters=model.parameters(), weight_decay=args.weight_decay, apply_decay_param_fun=lambda x: x in decay_params) criterion = paddle.nn.loss.CrossEntropyLoss() metric = paddle.metric.Accuracy() global_step = 0 tic_train = time.time() for epoch in range(1, args.epochs + 1): for step, batch in enumerate(train_data_loader, start=1): input_ids, token_type_ids, labels = batch logits = model(input_ids, token_type_ids) loss = criterion(logits, labels) probs = F.softmax(logits, axis=1) correct = metric.compute(probs, labels) metric.update(correct) acc = metric.accumulate() global_step += 1 if global_step % 10 == 0 and rank == 0: print( "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" % (global_step, epoch, step, loss, acc, 10 / (time.time() - tic_train))) tic_train = time.time() loss.backward() optimizer.step() optimizer.clear_grad() if global_step % 100 == 0 and rank == 0: save_dir = os.path.join(args.save_dir, "model_%d" % global_step) if not os.path.exists(save_dir): os.makedirs(save_dir) # Need better way to get inner model of DataParallel model._layers.save_pretrained(save_dir) tokenizer.save_pretrained(save_dir)
predict_aspect.py
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. from functools import partial import argparse import os import random import time import numpy as np import paddle import paddle.nn.functional as F from paddlenlp.data import Stack, Tuple, Pad from paddlenlp.datasets import load_dataset from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer # yapf: disable parser = argparse.ArgumentParser() parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") parser.add_argument("--max_seq_length", default=400, type=int, help="The maximum total input sequence length after tokenization. " "Sequences longer than this will be truncated, sequences shorter will be padded.") parser.add_argument("--batch_size", default=6, type=int, help="Batch size per GPU/CPU for prediction.") parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") args = parser.parse_args() # yapf: enable @paddle.no_grad() def predict(model, data_loader, label_map): """ Given a prediction dataset, it gives the prediction results. Args: model(obj:`paddle.nn.Layer`): A model to classify texts. data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. label_map(obj:`dict`): The label id (key) to label str (value) map. """ model.eval() results = [] for batch in data_loader: input_ids, token_type_ids = batch logits = model(input_ids, token_type_ids) probs = F.softmax(logits, axis=1) idx = paddle.argmax(probs, axis=1).numpy() idx = idx.tolist() labels = [label_map[i] for i in idx] results.extend(labels) return results def convert_example(example, tokenizer, max_seq_length=512, is_test=False, dataset_name="chnsenticorp"): """ Builds model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. And creates a mask from the two sequences passed to be used in a sequence-pair classification task. A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format: :: - single sequence: ``[CLS] X [SEP]`` - pair of sequences: ``[CLS] A [SEP] B [SEP]`` A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format: :: 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 | first sequence | second sequence | If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s). note: There is no need token type ids for skep_roberta_large_ch model. Args: example(obj:`list[str]`): List of input data, containing text and label if it have label. tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` which contains most of the methods. Users should refer to the superclass for more information regarding methods. max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded. is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. dataset_name((obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2". Returns: input_ids(obj:`list[int]`): The list of token ids. token_type_ids(obj: `list[int]`): List of sequence pair mask. label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. """ encoded_inputs = tokenizer( text=example["text"], text_pair=example["text_pair"], max_seq_len=max_seq_length) input_ids = encoded_inputs["input_ids"] token_type_ids = encoded_inputs["token_type_ids"] if not is_test: label = np.array([example["label"]], dtype="int64") return input_ids, token_type_ids, label else: return input_ids, token_type_ids def create_dataloader(dataset, mode='train', batch_size=1, batchify_fn=None, trans_fn=None): if trans_fn: dataset = dataset.map(trans_fn) shuffle = True if mode == 'train' else False if mode == 'train': batch_sampler = paddle.io.DistributedBatchSampler( dataset, batch_size=batch_size, shuffle=shuffle) else: batch_sampler = paddle.io.BatchSampler( dataset, batch_size=batch_size, shuffle=shuffle) return paddle.io.DataLoader( dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) if __name__ == "__main__": test_ds = load_dataset("seabsa16", "phns", splits=["test"]) label_map = {0: 'negative', 1: 'positive'} model = SkepForSequenceClassification.from_pretrained( 'skep_ernie_1.0_large_ch', num_classes=len(label_map)) tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch') trans_func = partial( convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_test=True) batchify_fn = lambda samples, fn=Tuple( Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # token_type_ids ): [data for data in fn(samples)] test_data_loader = create_dataloader( test_ds, mode='test', batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func) if args.params_path and os.path.isfile(args.params_path): state_dict = paddle.load(args.params_path) model.set_dict(state_dict) print("Loaded parameters from %s" % args.params_path) results = predict(model, test_data_loader, label_map) for idx, text in enumerate(test_ds.data): print('Data: {} \t Label: {}'.format(text, results[idx]))
train_opinion.py
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. from functools import partial import argparse import os import random import time import numpy as np import paddle import paddle.nn.functional as F from paddlenlp.data import Stack, Tuple, Pad from paddlenlp.datasets import load_dataset from paddlenlp.metrics import ChunkEvaluator from paddlenlp.transformers import SkepCrfForTokenClassification, SkepModel, SkepTokenizer # yapf: disable parser = argparse.ArgumentParser() parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. " "Sequences longer than this will be truncated, sequences shorter will be padded.") parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") parser.add_argument("--learning_rate", default=5e-7, type=float, help="The initial learning rate for Adam.") parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") args = parser.parse_args() # yapf: enable def set_seed(seed): """Sets random seed.""" random.seed(seed) np.random.seed(seed) paddle.seed(seed) def convert_example_to_feature(example, tokenizer, max_seq_len=512, no_entity_label="O", is_test=False): """ Builds model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. And creates a mask from the two sequences passed to be used in a sequence-pair classification task. A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format: :: - single sequence: ``[CLS] X [SEP]`` - pair of sequences: ``[CLS] A [SEP] B [SEP]`` A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format: :: 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 | first sequence | second sequence | If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s). Args: example(obj:`list[str]`): List of input data, containing text and label if it have label. tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` which contains most of the methods. Users should refer to the superclass for more information regarding methods. max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded. no_entity_label(obj:`str`, defaults to "O"): The label represents that the token isn't an entity. is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. Returns: input_ids(obj:`list[int]`): The list of token ids. token_type_ids(obj: `list[int]`): List of sequence pair mask. label(obj:`list[int]`, optional): The input label if not test data. """ tokens = example['tokens'] labels = example['labels'] tokenized_input = tokenizer( tokens, return_length=True, is_split_into_words=True, max_seq_len=max_seq_len) input_ids = tokenized_input['input_ids'] token_type_ids = tokenized_input['token_type_ids'] seq_len = tokenized_input['seq_len'] if is_test: return input_ids, token_type_ids, seq_len else: labels = labels[:(max_seq_len - 2)] encoded_label = np.array( [no_entity_label] + labels + [no_entity_label], dtype="int64") return input_ids, token_type_ids, seq_len, encoded_label def create_dataloader(dataset, mode='train', batch_size=1, batchify_fn=None, trans_fn=None): if trans_fn: dataset = dataset.map(trans_fn) shuffle = True if mode == 'train' else False if mode == 'train': batch_sampler = paddle.io.DistributedBatchSampler( dataset, batch_size=batch_size, shuffle=shuffle) else: batch_sampler = paddle.io.BatchSampler( dataset, batch_size=batch_size, shuffle=shuffle) return paddle.io.DataLoader( dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) if __name__ == "__main__": paddle.set_device(args.device) rank = paddle.distributed.get_rank() if paddle.distributed.get_world_size() > 1: paddle.distributed.init_parallel_env() train_ds = load_dataset("cote", "dp", splits=['train']) # The COTE_DP dataset labels with "BIO" schema. label_map = {label: idx for idx, label in enumerate(train_ds.label_list)} # `no_entity_label` represents that the token isn't an entity. no_entity_label_idx = label_map.get("O", 2) # `ignore_label` is using to pad input labels. ignore_label = -1 set_seed(args.seed) skep = SkepModel.from_pretrained('skep_ernie_1.0_large_ch') model = SkepCrfForTokenClassification( skep, num_classes=len(train_ds.label_list)) tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch') trans_func = partial( convert_example_to_feature, tokenizer=tokenizer, max_seq_len=args.max_seq_length, no_entity_label=no_entity_label_idx, is_test=False) batchify_fn = lambda samples, fn=Tuple( Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]), # input ids Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]), # token type ids Stack(dtype='int64'), # sequence lens Pad(axis=0, pad_val=ignore_label) # labels ): [data for data in fn(samples)] train_data_loader = create_dataloader( train_ds, mode='train', batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func) if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): state_dict = paddle.load(args.init_from_ckpt) model.set_dict(state_dict) model = paddle.DataParallel(model) num_training_steps = len(train_data_loader) * args.epochs # Generate parameter names needed to perform weight decay. # All bias and LayerNorm parameters are excluded. 
decay_params = [ p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"]) ] optimizer = paddle.optimizer.AdamW( learning_rate=args.learning_rate, parameters=model.parameters(), weight_decay=args.weight_decay, apply_decay_param_fun=lambda x: x in decay_params) metric = ChunkEvaluator(label_list=train_ds.label_list, suffix=True) global_step = 0 tic_train = time.time() for epoch in range(1, args.epochs + 1): for step, batch in enumerate(train_data_loader, start=1): input_ids, token_type_ids, seq_lens, labels = batch loss = model( input_ids, token_type_ids, seq_lens=seq_lens, labels=labels) avg_loss = paddle.mean(loss) global_step += 1 if global_step % 10 == 0 and rank == 0: print( "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" % (global_step, epoch, step, avg_loss, 10 / (time.time() - tic_train))) tic_train = time.time() loss.backward() optimizer.step() optimizer.clear_grad() if global_step % 100 == 0 and rank == 0: save_dir = os.path.join(args.save_dir, "model_%d" % global_step) if not os.path.exists(save_dir): os.makedirs(save_dir) file_name = os.path.join(save_dir, "model_state.pdparam") # Need better way to get inner model of DataParallel paddle.save(model._layers.state_dict(), file_name)
predict_opinion.py
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. import argparse import os from functools import partial import numpy as np import paddle import paddle.nn.functional as F from paddlenlp.data import Stack, Tuple, Pad from paddlenlp.datasets import load_dataset from paddlenlp.transformers import SkepCrfForTokenClassification, SkepModel, SkepTokenizer # yapf: disable parser = argparse.ArgumentParser() parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. " "Sequences longer than this will be truncated, sequences shorter will be padded.") parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") args = parser.parse_args() # yapf: enable def convert_example(example, tokenizer, max_seq_length=512, is_test=False): """ Builds model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. And creates a mask from the two sequences passed to be used in a sequence-pair classification task. A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format: :: - single sequence: ``[CLS] X [SEP]`` - pair of sequences: ``[CLS] A [SEP] B [SEP]`` A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format: :: 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 | first sequence | second sequence | If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s). Args: example(obj:`list[str]`): List of input data, containing text and label if it have label. tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` which contains most of the methods. Users should refer to the superclass for more information regarding methods. max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. Returns: input_ids(obj:`list[int]`): The list of token ids. token_type_ids(obj: `list[int]`): List of sequence pair mask. """ tokens = example["tokens"] encoded_inputs = tokenizer( tokens, return_length=True, is_split_into_words=True, max_seq_len=max_seq_length) input_ids = encoded_inputs["input_ids"] token_type_ids = encoded_inputs["token_type_ids"] seq_len = encoded_inputs["seq_len"] return input_ids, token_type_ids, seq_len @paddle.no_grad() def predict(model, data_loader, label_map): """ Given a prediction dataset, it gives the prediction results. Args: model(obj:`paddle.nn.Layer`): A model to classify texts. data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. 
label_map(obj:`dict`): The label id (key) to label str (value) map. """ model.eval() results = [] for input_ids, token_type_ids, seq_lens in data_loader: preds = model(input_ids, token_type_ids, seq_lens=seq_lens) tags = parse_predict_result(preds.numpy(), seq_lens.numpy(), label_map) results.extend(tags) return results def parse_predict_result(predictions, seq_lens, label_map): """ Parses the prediction results to the label tag. """ pred_tag = [] for idx, pred in enumerate(predictions): seq_len = seq_lens[idx] # drop the "[CLS]" and "[SEP]" token tag = [label_map[i] for i in pred[1:seq_len - 1]] pred_tag.append(tag) return pred_tag def create_dataloader(dataset, mode='train', batch_size=1, batchify_fn=None, trans_fn=None): if trans_fn: dataset = dataset.map(trans_fn) shuffle = True if mode == 'train' else False if mode == 'train': batch_sampler = paddle.io.DistributedBatchSampler( dataset, batch_size=batch_size, shuffle=shuffle) else: batch_sampler = paddle.io.BatchSampler( dataset, batch_size=batch_size, shuffle=shuffle) return paddle.io.DataLoader( dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) if __name__ == "__main__": paddle.set_device(args.device) test_ds = load_dataset("cote", "dp", splits=['test']) # The COTE_DP dataset labels with "BIO" schema. label_map = {0: "B", 1: "I", 2: "O"} # `no_entity_label` represents that the token isn't an entity. no_entity_label_idx = 2 skep = SkepModel.from_pretrained('skep_ernie_1.0_large_ch') model = SkepCrfForTokenClassification( skep, num_classes=len(test_ds.label_list)) tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch') if args.params_path and os.path.isfile(args.params_path): state_dict = paddle.load(args.params_path) model.set_dict(state_dict) print("Loaded parameters from %s" % args.params_path) trans_func = partial( convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length) batchify_fn = lambda samples, fn=Tuple( Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]), # input ids Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]), # token type ids Stack(dtype='int64'), # sequence lens ): [data for data in fn(samples)] test_data_loader = create_dataloader( test_ds, mode='test', batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func) results = predict(model, test_data_loader, label_map) for idx, example in enumerate(test_ds.data): print(len(example['tokens']), len(results[idx])) print('Data: {} \t Label: {}'.format(example, results[idx]))