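Prompt Learning is an important topic in NLP right now, and many articles have already discussed it.
In essence, Prompt Learning can be understood as a way of redefining downstream tasks: almost every downstream task is recast as the pretraining language-model task, which avoids the gap between the pretrained model and the downstream task.
As a result, almost any downstream NLP task can be handled this way, in some cases without any training data, and with a small-sample dataset it can even outperform Fine-Tuning, while making all tasks more uniform in how they are used. A literal understanding alone is far from enough, so below we explain it in a simple, concrete way.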
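To make the "redefine the task as a language-model task" idea concrete, here is a minimal sketch that turns a raw input into a cloze-style prompt. The helper name `build_cloze` and the `{text}` / `{mask}` slot syntax are assumptions made for illustration; the two templates are the ones used later in this post.

```python
MASK = "[MASK]"  # mask token used by bert-base-uncased


def build_cloze(text: str, template: str, mask_token: str = MASK) -> str:
    """Fill a template that has a {text} slot and a {mask} slot."""
    return template.format(text=text.strip(), mask=mask_token)


# Sentiment classification rewritten as "fill in the blank".
sentiment = build_cloze("I really like the film a lot.",
                        "{text} Because it was {mask}.")

# Entity typing rewritten the same way.
entity = build_cloze("John went to Paris to visit the University.",
                     "{text} John is a type of {mask}.")

print(sentiment)
print(entity)
```

Both tasks end up as the same kind of input, a sentence with one masked slot, which is exactly what a masked language model was pretrained to complete.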
- import torch.nn as nn
- from transformers import BertForMaskedLM
- class Bert_Model(nn.Module):
-     def __init__(self, bert_path, config_file):
-         super(Bert_Model, self).__init__()
-         self.bert = BertForMaskedLM.from_pretrained(bert_path, config=config_file)  # load the pretrained weights
-     def forward(self, input_ids, attention_mask, token_type_ids):
-         # The masked-LM head scores every vocabulary token at every position.
-         outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
-         logit = outputs[0]  # raw logits (not probabilities), [bs, seq_len, config.vocab_size]
-         return logit
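The code below uses bert-base-uncased from Hugging Face to predict a missing word. It returns the words the pretrained model finds most probable at the given [MASK] position (the words come from the pretrained language model's vocabulary).
For example, given the sentence "natural language processing is a [MASK] technology.", the model is asked to predict the word at [MASK]: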
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
- >>> unmasker("natural language processing is a [MASK] technology.")
- [{'score': 0.18927036225795746, 'token': 3274, 'token_str': 'computer', 'sequence': 'natural language processing is a computer technology.'},
- {'score': 0.14354903995990753, 'token': 4807, 'token_str': 'communication', 'sequence': 'natural language processing is a communication technology.'},
- {'score': 0.09429361671209335, 'token': 2047, 'token_str': 'new', 'sequence': 'natural language processing is a new technology.'},
- {'score': 0.05184786394238472, 'token': 2653, 'token_str': 'language', 'sequence': 'natural language processing is a language technology.'},
- {'score': 0.04084266722202301, 'token': 15078, 'token_str': 'computational', 'sequence': 'natural language processing is a computational technology.'}]
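Conceptually, a fill-mask call takes the model's logits at the [MASK] position, applies a softmax over the vocabulary, and returns the top-k tokens. The sketch below reproduces only that last step in plain Python; the five-word vocabulary and its logits are made up for illustration and do not come from BERT.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_fill(vocab, mask_logits, k=3):
    """Rank vocabulary tokens by their probability at the mask position."""
    probs = softmax(mask_logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return [{"token_str": tok, "score": score} for tok, score in ranked[:k]]


# Toy vocabulary and logits (assumed values, not real model output).
toy_vocab = ["computer", "communication", "new", "banana", "the"]
toy_logits = [4.0, 3.7, 3.3, -2.0, 0.5]

top3 = top_k_fill(toy_vocab, toy_logits, k=3)
for item in top3:
    print(item)
```

The output has the same shape as the `score` / `token_str` fields shown above, which is all the later verbalizer step needs.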
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
- >>> text = "I really like the film a lot."
- >>> prompt_template = "Because it was [MASK]."
- >>> pred1 = unmasker(text + prompt_template)
- >>> pred1
- [
- {'score': 0.14730973541736603, 'token': 2307, 'token_str': 'great', 'sequence': 'i really like the film a lot. because it was great.'},
- {'score': 0.10884211212396622, 'token': 6429, 'token_str': 'amazing', 'sequence': 'i really like the film a lot. because it was amazing.'},
- {'score': 0.09781625121831894, 'token': 2204, 'token_str': 'good', 'sequence': 'i really like the film a lot. because it was good.'},
- {'score': 0.04627735912799835, 'token': 4569, 'token_str': 'fun', 'sequence': 'i really like the film a lot. because it was fun.'},
- {'score': 0.043138038367033005, 'token': 10392, 'token_str': 'fantastic', 'sequence': 'i really like the film a lot. because it was fantastic.'}]
- >>> text = "this movie makes me very disgusting. "
- >>> prompt_template = "Because it was [MASK]."
- >>> pred2 = unmasker(text + prompt_template)
- >>> pred2
- [
- {'score': 0.05464331805706024, 'token': 9643, 'token_str': 'awful', 'sequence': 'this movie makes me very disgusting. because it was awful.'},
- {'score': 0.050322480499744415, 'token': 2204, 'token_str': 'good', 'sequence': 'this movie makes me very disgusting. because it was good.'},
- {'score': 0.04008950665593147, 'token': 9202, 'token_str': 'horrible', 'sequence': 'this movie makes me very disgusting. because it was horrible.'},
- {'score': 0.03569378703832626, 'token': 3308, 'token_str': 'wrong', 'sequence': 'this movie makes me very disgusting. because it was wrong.'},
- {'score': 0.033358603715896606, 'token': 2613, 'token_str': 'real', 'sequence': 'this movie makes me very disgusting. because it was real.'}]

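Besides constructing the prompt template, another key component is the verbalizer, which maps words to classes. The words an MLM predicts are open-ended, so they have to be aligned with concrete classes: for example, "great", "amazing", "good", "fun", "fantastic", and "better" are aligned to "positive", and when the model predicts one of these words the whole prediction is labeled positive.
Likewise, "awful", "horrible", "bad", "wrong", and "ugly" are mapped to "negative", and a prediction containing them is labeled negative: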
- >>> verblize_dict = {"pos": ["great", "amazing", "good", "fun", "fantastic", "better"], "neg": ["awful", "horrible", "bad", "wrong", "ugly"]
- ... }
- >>> hash_dict = dict()
- >>> for k, v in verblize_dict.items():
- ... for v_ in v:
- ... hash_dict[v_] = k
- >>> hash_dict
- {'great': 'pos', 'amazing': 'pos', 'good': 'pos', 'fun': 'pos', 'fantastic': 'pos', 'better': 'pos', 'awful': 'neg', 'horrible': 'neg', 'bad': 'neg', 'wrong': 'neg', 'ugly': 'neg'}
- >>> [{"label":hash_dict[i["token_str"]], "score":i["score"]} for i in pred1]
- [{'label': 'pos', 'score': 0.14730973541736603}, {'label': 'pos', 'score': 0.10884211212396622}, {'label': 'pos', 'score': 0.09781625121831894}, {'label': 'pos', 'score': 0.04627735912799835}, {'label': 'pos', 'score': 0.043138038367033005}]
- >>> [{"label":hash_dict.get(i["token_str"], i["token_str"]), "score":i["score"]} for i in pred2]
- [{'label': 'neg', 'score': 0.05464331805706024}, {'label': 'pos', 'score': 0.050322480499744415}, {'label': 'neg', 'score': 0.04008950665593147}, {'label': 'neg', 'score': 0.03569378703832626}, {'label': 'real', 'score': 0.033358603715896606}]
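The mapping above labels each predicted word separately. One simple way to reach a single decision per sentence, sketched here as an assumption rather than as the method used in the rest of this post, is to sum the scores per label and take the largest total, skipping words the verbalizer does not cover (such as "real"). The scores are copied from `pred2` above; the function name `classify` is hypothetical.

```python
from collections import defaultdict

verbalizer = {
    "pos": ["great", "amazing", "good", "fun", "fantastic", "better"],
    "neg": ["awful", "horrible", "bad", "wrong", "ugly"],
}
# Invert to word -> label, as hash_dict does above.
word_to_label = {w: label for label, words in verbalizer.items() for w in words}


def classify(predictions, word_to_label):
    """Sum fill-mask scores per label; ignore words with no label."""
    totals = defaultdict(float)
    for p in predictions:
        label = word_to_label.get(p["token_str"])
        if label is not None:
            totals[label] += p["score"]
    if not totals:
        return None, {}
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Top-5 predictions for "this movie makes me very disgusting." (from pred2).
pred2 = [
    {"token_str": "awful", "score": 0.05464331805706024},
    {"token_str": "good", "score": 0.050322480499744415},
    {"token_str": "horrible", "score": 0.04008950665593147},
    {"token_str": "wrong", "score": 0.03569378703832626},
    {"token_str": "real", "score": 0.033358603715896606},
]

label, totals = classify(pred2, word_to_label)
print(label, totals)
```

Only the top five words are visible here, so the totals cover a small part of the probability mass; scoring the verbalizer words directly, as the class further down does, avoids that limitation.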
- [
-   {"text": "I really like the film a lot.", "label": "pos"},
-   {"text": "this movie makes me very disgusting. ", "label": "neg"}
- ]
- from transformers import AutoModelForMaskedLM, AutoTokenizer
- import torch
- class Prompting(object):
-     def __init__(self, **kwargs):
-         model_path = kwargs['model']
-         # Fall back to the model path when no separate tokenizer is given.
-         tokenizer_path = kwargs.get('tokenizer', model_path)
-         self.model = AutoModelForMaskedLM.from_pretrained(model_path)
-         self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
-     def prompt_pred(self, text):
-         """
-         Take a sequence containing [MASK] and return every vocabulary token
-         with its logit at the mask position, sorted from highest to lowest.
-         """
-         indexed_tokens = self.tokenizer(text, return_tensors="pt").input_ids
-         tokenized_text = self.tokenizer.convert_ids_to_tokens(indexed_tokens[0].tolist())
-         mask_pos = tokenized_text.index(self.tokenizer.mask_token)
-         self.model.eval()
-         with torch.no_grad():
-             outputs = self.model(indexed_tokens)
-             predictions = outputs[0]  # logits, [1, seq_len, vocab_size]
-         values, indices = torch.sort(predictions[0, mask_pos], descending=True)
-         result = list(zip(self.tokenizer.convert_ids_to_tokens(indices.tolist()), values))
-         self.scores_dict = {a: b for a, b in result}
-         return result
-     def compute_tokens_prob(self, text, token_list1, token_list2):
-         """
-         Given two word lists (token_list1: positive words such as good, great;
-         token_list2: negative words such as bad, terrible), sum the mask-position
-         logits of each list and apply a softmax over the two sums to get the
-         class probabilities. Words missing from the vocabulary contribute 0.
-         """
-         _ = self.prompt_pred(text)
-         score1 = sum(self.scores_dict.get(token1, 0) for token1 in token_list1)
-         score2 = sum(self.scores_dict.get(token2, 0) for token2 in token_list2)
-         softmax_rt = torch.nn.functional.softmax(torch.Tensor([score1, score2]), dim=0)
-         return softmax_rt
-     def fine_tune(self, sentences, labels, prompt=" Since it was [MASK].", goodToken="good", badToken="bad"):
-         """
-         Fine-tune on labeled data. Labels index into [goodToken, badToken],
-         so 0 means positive and 1 means negative.
-         """
-         good = self.tokenizer.convert_tokens_to_ids(goodToken)
-         bad = self.tokenizer.convert_tokens_to_ids(badToken)
-         # torch.optim.AdamW replaces the deprecated transformers.AdamW.
-         optimizer = torch.optim.AdamW(self.model.parameters(), lr=1e-3)
-         lossFunc = torch.nn.CrossEntropyLoss()
-         self.model.train()
-         for sen, label in zip(sentences, labels):
-             # Encode the same way as prompt_pred so [CLS] and [SEP] are added.
-             tokens_tensor = self.tokenizer(sen + prompt, return_tensors="pt").input_ids
-             tokenized_text = self.tokenizer.convert_ids_to_tokens(tokens_tensor[0].tolist())
-             mask_pos = tokenized_text.index(self.tokenizer.mask_token)
-             outputs = self.model(tokens_tensor)
-             predictions = outputs[0]
-             pred = predictions[0, mask_pos][[good, bad]]
-             # CrossEntropyLoss applies softmax itself, so pass the raw logits.
-             loss = lossFunc(pred.unsqueeze(0), torch.tensor([label]))
-             optimizer.zero_grad()
-             loss.backward()
-             optimizer.step()
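One detail worth checking in a training loop like `fine_tune`: softmax cross-entropy losses such as `torch.nn.CrossEntropyLoss` apply log-softmax internally, so they expect raw logits. The plain-Python sketch below, with made-up logits for the two label words, shows what happens if probabilities are passed instead: the loss gets stuck near 0.31 even for a confident, correct prediction, which weakens the training signal.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def cross_entropy(scores, label):
    """Softmax cross-entropy of `scores` against the integer `label`."""
    return -log_softmax(scores)[label]


# Made-up logits for [goodToken, badToken]; the true label is 0 (good).
logits = [9.0, 2.0]
label = 0

# Correct usage: feed raw logits.
loss_from_logits = cross_entropy(logits, label)

# Mistaken usage: softmax first, then cross-entropy (a second softmax).
probs = [math.exp(v) for v in log_softmax(logits)]
loss_from_probs = cross_entropy(probs, label)

print(loss_from_logits, loss_from_probs)
```

With probabilities as input the two scores can differ by at most 1, so the loss can never drop below log(1 + e^-1), about 0.313, no matter how well the model fits the data.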

- >>> from transformers import AutoModelForMaskedLM, AutoTokenizer
- >>> import torch
- >>> model_path = "bert-base-uncased"
- >>> tokenizer = AutoTokenizer.from_pretrained(model_path)
- >>> from prompt import Prompting
- >>> prompting = Prompting(model=model_path)
- >>> prompt = "Because it was [MASK]."
- >>> text = "I really like the film a lot."
- >>> prompting.prompt_pred(text+prompt)[:10]
- [('great', tensor(9.5558)),
- ('amazing', tensor(9.2532)),
- ('good', tensor(9.1464)),
- ('fun', tensor(8.3979)),
- ('fantastic', tensor(8.3277)),
- ('wonderful', tensor(8.2719)),
- ('beautiful', tensor(8.1584)),
- ('awesome', tensor(8.1071)),
- ('incredible', tensor(8.0140)),
- ('funny', tensor(7.8785))]
- >>> text = "I did not like the film."
- >>> prompting.prompt_pred(text+prompt)[:10]
- [('bad', tensor(8.6784)),
- ('funny', tensor(8.1660)),
- ('good', tensor(7.9858)),
- ('awful', tensor(7.7454)),
- ('scary', tensor(7.3526)),
- ('boring', tensor(7.1553)),
- ('wrong', tensor(7.1402)),
- ('terrible', tensor(7.1296)),
- ('horrible', tensor(6.9923)),
- ('ridiculous', tensor(6.7731))]

- >>> text = "not worth watching"
- >>> prompting.compute_tokens_prob(text+prompt, token_list1=["great","amazin","good"], token_list2=["bad","awfull","terrible"])
- tensor([0.1496, 0.8504])
- >>> text = "I strongly recommend that moview"
- >>> prompting.compute_tokens_prob(text+prompt, token_list1=["great","amazin","good"], token_list2=["bad","awfull","terrible"])
- tensor([0.9321, 0.0679])
- >>> text = "I strongly recommend that moview"
- >>> prompting.compute_tokens_prob(text+prompt, token_list1=["good"], token_list2=["bad"])
- tensor([0.9223, 0.0777])
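The arithmetic behind `compute_tokens_prob` can be checked by hand. The sketch below repeats it in plain Python on the logits printed earlier for "I did not like the film." (rounded to four decimals as shown there); the function name `class_probs` is an assumption for illustration.

```python
import math

# Mask-position logits copied from the prompt_pred output above.
logits = {
    "bad": 8.6784, "funny": 8.1660, "good": 7.9858, "awful": 7.7454,
    "scary": 7.3526, "boring": 7.1553, "wrong": 7.1402,
    "terrible": 7.1296, "horrible": 6.9923, "ridiculous": 6.7731,
}


def class_probs(scores, token_list1, token_list2):
    """Sum the logits of each word list, then softmax over the two sums."""
    s1 = sum(scores.get(t, 0.0) for t in token_list1)  # unknown words add 0
    s2 = sum(scores.get(t, 0.0) for t in token_list2)
    m = max(s1, s2)
    e1, e2 = math.exp(s1 - m), math.exp(s2 - m)
    return [e1 / (e1 + e2), e2 / (e1 + e2)]


one_vs_one = class_probs(logits, ["good"], ["bad"])
one_vs_three = class_probs(logits, ["good"], ["bad", "awful", "terrible"])
print(one_vs_one)
print(one_vs_three)
```

Because the logits here are all positive, a longer word list produces a larger sum, so the second call is pushed almost entirely toward the negative class. Keeping the two lists the same length, or averaging instead of summing, is one way to reduce that effect.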
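1. Build the prompt template by appending a type question about the entity to the sentence: Sentence. John is a type of [MASK]
2. Predict directly with prompt_pred. We run it as-is and look at the results:
- >>> prompting.prompt_pred("John went to Paris to visit the University. John is a type of [MASK].")[:5]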
- [('man', tensor(8.1382)),
- ('john', tensor(7.1325)),
- ('guy', tensor(6.9672)),
- ('writer', tensor(6.4336)),
- ('philosopher', tensor(6.3823))]
- >>> prompting.prompt_pred("Savaş went to Paris to visit the university. Savaş is a type of [MASK].")[:5]
- [('philosopher', tensor(7.6558)),
- ('poet', tensor(7.5621)),
- ('saint', tensor(7.0104)),
- ('man', tensor(6.8890)),
- ('pigeon', tensor(6.6780))]
- >>> prompting.compute_tokens_prob("It is a type of [MASK].",
- ...     token_list1=["person","man"], token_list2=["location","city","place"])
- tensor([0.7603, 0.2397])
- >>> # The neutral prompt above gives a baseline of 0.76 for person; a score above 0.76 is judged as person.
- >>> prompting.compute_tokens_prob("Savaş went to Paris to visit the parliament. Savaş is a type of [MASK].",
- ...     token_list1=["person","man"], token_list2=["location","city","place"])
- tensor([9.9987e-01, 1.2744e-04])
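The results above show that framing zero-shot entity recognition as classification is direct and effective: "Savaş" is judged a person with probability 0.99. Asking the same question about "Laris" in a similar sentence gives:
- >>> prompting.compute_tokens_prob("Savaş went to Laris to visit the parliament. Laris is a type of [MASK].",
- ...     token_list1=["person","man"], token_list2=["location","city","place"])
- tensor([0.3263, 0.6737])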
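The baseline idea from the neutral prompt can be written down as a small decision rule. This is a sketch of one reading of it, assuming a two-class setup where anything not above the baseline falls to location; the probabilities are the person scores printed above, and the names `BASELINE_PERSON` and `decide` are hypothetical.

```python
# P(person) for the neutral prompt "It is a type of [MASK]." (from above).
BASELINE_PERSON = 0.7603

# P(person) for each entity, copied from the outputs above.
person_scores = {
    "Savaş": 0.99987,
    "Laris": 0.3263,
}


def decide(p_person: float, baseline: float = BASELINE_PERSON) -> str:
    """Call it a person only if the score beats the neutral-prompt baseline."""
    return "person" if p_person > baseline else "location"


decisions = {name: decide(p) for name, p in person_scores.items()}
print(decisions)
```

Comparing against the baseline rather than 0.5 compensates for the word lists themselves leaning toward person even when the prompt carries no information about the entity.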