
The Self-Cultivation of Engineers & Programmers, Episode 4: A Deep Learning System for Girlfriend Emotion Analysis & "Don't-Get-Hit" Message Recommendation, Built on Baidu's PaddlePaddle Framework


Why this topic, exactly? Probably because I couldn't come up with any other interesting topic or project to work on.

One day, Plato asked his teacher Socrates what love is. Socrates sent him into a wheat field and told him to pick the single largest, most golden ear of wheat in the whole field; he could pick only once, could only walk forward, and could never turn back. Plato did as he was told and came out of the field empty-handed. When Socrates asked why, he said: "Since I could pick only once and could not go back, even when I saw a large, golden ear I didn't pick it, because I didn't know whether a better one lay ahead. As I walked on, nothing measured up to what I had already seen, and I realized I had long since passed the largest and most golden ear. So in the end I picked nothing." Socrates said: that is "love".

Reality, however, is harsh. Most hard-core engineering students in information, computer science, or mechanical majors are still stuck at the "ah, my girlfriend... we didn't even get to exchange a word today" stage, and the constant grind of code that won't compile, experiments that won't work, and debugging that never ends tends to make their choice of words rather unpredictable. A simple tool that infers a girlfriend's hidden mood from her WeChat/QQ messages and recommends replies that won't get you hit would therefore be genuinely useful.

Friendly reminder: live testing of this deep learning model may be hazardous to your life. Proceed with caution!

The article is organized as follows:

Part 1: A natural multi-turn dialogue bot built on Baidu UNIT

Part 2: Theory of dialogue emotion recognition

Part 3: Dialogue emotion recognition with PaddleHub

Part 4: An upgraded dialogue emotion recognition with PaddleNLP

Part 5: Detailed emotion recognition code built directly on PaddlePaddle

Part 6: Monitoring your girlfriend's Weibo mood with PaddlePaddle and a Python crawler


Part 1: A natural multi-turn dialogue bot built on Baidu UNIT

Thanks to @没入门的研究生, whose related articles, together with Baidu's currently open UNIT dialogue bot platform, were a great source of inspiration.

1. Preparation:

Baidu UNIT (Understanding and Interaction Technology), Baidu's platform for building and hosting custom dialogue systems, bundles years of Baidu's natural language processing work. Four simple steps take you from nothing to a working dialogue system: create a skill, configure intents and slots, configure training data, and train the model.

UNIT is a commercial platform and charges for advanced dialogue features, but registered developers get a free development environment with five skills, which is enough to build some very entertaining things. Before we can do multi-turn dialogue through UNIT, we have to register as a developer on the platform and apply for a UNIT application ID and KEY in the Baidu console. Details below.

After opening the UNIT website, click through to the registration page; once registration is complete you land in UNIT's skill library.

At this point the skill library is still empty. In the top-left corner there are two sections, "My Bots" (我的机器人) and "My Chitchat" (我的闲聊); both are needed for multi-turn dialogue.

Next, click "My Chitchat" and create a new chitchat skill. There are Basic, Professional, and Enhanced versions; the Enhanced version is the Chinese dialogue model of Baidu's PLATO.

Then click "My Bots" and the "+" button to create a new bot.

Once the bot is created, open its page, click "Add Skill", and attach the chitchat skill you just created.

Finally, log into the Baidu console (this also requires registration; sign up first if you have not). On the home page, find "智能对话定制与服务平台UNIT" under Products and Services and open it; click "Create Application", fill in the application name and description, and click "Create". The application's details then appear on the page; record the APPID and APIKEY.

At this point, the preparation for multi-turn dialogue is done.

2. Calling the service from local code:

With the preparation done, we can hold multi-turn conversations by calling the API and handling the responses. More detailed usage is documented on the UNIT website; here I simply give a ready-made dialogue function, shown below. APIKEY and SECRETKEY are the API Key and Secret Key of the application created in the console, SERVICEID is the bot's ID, and SKILLID is the chitchat skill's ID. The function parameters are:

text: the current utterance (what the human says)

user_id: an ID for the speaker (different IDs distinguish different people)

session_id, history: explained below

log_id: optional, an ID used for logging

# encoding:utf-8
import requests

APIKEY = "*************"        # API Key (AK) of the console application
SECRETKEY = "****************"  # Secret Key (SK) of the console application
SERVICEID = '****************'  # bot ID
SKILLID = '***************'     # chitchat skill ID

# client_id is the AK obtained from the console, client_secret is the SK
host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=%s&client_secret=%s' \
       % (APIKEY, SECRETKEY)
response = requests.get(host)
access_token = response.json()["access_token"]
url = 'https://aip.baidubce.com/rpc/2.0/unit/service/chat?access_token=' + access_token

def dialog(text, user_id, session_id='', history='', log_id='LOG_FOR_TEST'):
    post_data = "{\"log_id\":\"%s\",\"version\":\"2.0\",\"service_id\":\"%s\",\"session_id\":\"%s\"," \
                "\"request\":{\"query\":\"%s\",\"user_id\":\"%s\"}," \
                "\"dialog_state\":{\"contexts\":{\"SYS_REMEMBERED_SKILLS\":[\"%s\"]}}}" \
                % (log_id, SERVICEID, session_id, text, user_id, SKILLID)
    if len(history) > 0:
        post_data = "{\"log_id\":\"%s\",\"version\":\"2.0\",\"service_id\":\"%s\",\"session_id\":\"\"," \
                    "\"request\":{\"query\":\"%s\",\"user_id\":\"%s\"}," \
                    "\"dialog_state\":{\"contexts\":{\"SYS_REMEMBERED_SKILLS\":[\"%s\"], " \
                    "\"SYS_CHAT_HIST\":[%s]}}}" \
                    % (log_id, SERVICEID, text, user_id, SKILLID, history)
    post_data = post_data.encode('utf-8')
    headers = {'content-type': 'application/x-www-form-urlencoded'}
    response = requests.post(url, data=post_data, headers=headers)
    resp = response.json()
    ans = resp["result"]["response_list"][0]["action_list"][0]['say']
    session_id = resp['result']['session_id']
    return ans, session_id

Note that the dialogue history can be carried in two ways: by passing back the session_id from the previous response, or by recording the conversation history ourselves and writing it into SYS_CHAT_HIST inside the request's dialog_state. To that end we define a User class that stores each user's dialogue history and session_id. When history is supplied, session_id is ignored; when there is no history, be sure to pass session_id so the conversation keeps its context. As follows:

class User:
    def __init__(self, user_id):
        self.user_id = user_id
        self.session_id = ''
        self._history = []
        self.history = ''
        self.MAX_TURN = 7

    def get_service_id(self, session_id):
        self.session_id = session_id

    def update_history(self, text):
        self._history.append(text)
        self._history = self._history[-self.MAX_TURN*2-1:]
        self.history = ','.join(["\"" + sent + "\"" for sent in self._history])

    def start_new_dialog(self):
        self.session_id = ''
        self._history = []
        self.history = ''

    def change_max_turn(self, max_turn):
        self.MAX_TURN = max_turn

With that in place, we can start chatting:

from dialog import dialog
from user import User

user_id = 'test_user'
user = User(user_id)
while True:
    human_ans = input()
    if len(human_ans) > 0:
        user.update_history(human_ans)
        robot_resp, session_id = dialog(human_ans, user.user_id, user.session_id, user.history)
        user.session_id = session_id
        user.update_history(robot_resp)
        print("Robot: %s" % robot_resp)
    else:
        break
user.start_new_dialog()

For an actual demo conversation, the following variant also echoes what you typed:

from dialog import dialog
from user import User

user_id = 'test_user'
user = User(user_id)
while True:
    human_ans = input()
    if len(human_ans) > 0:
        user.update_history(human_ans)
        robot_resp, session_id = dialog(human_ans, user.user_id, user.session_id, user.history)
        user.session_id = session_id
        user.update_history(robot_resp)
        print("You: %s" % human_ans)
        print("Robot: %s" % robot_resp)
    else:
        break
user.start_new_dialog()

Here each "You:" line is what we said ourselves and each "Robot:" line is the reply UNIT generated; some of the bot's turns are questions back to us. Picking out the replies that are not questions gives us material for validating the emotion recognition later.
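As a small illustration of that last step, here is a minimal sketch of my own (not from the original article) that keeps only the replies that do not look like questions, using a naive check for question marks and common Chinese question particles:

def non_question_replies(replies):
    """Keep only replies that do not look like questions (naive heuristic)."""
    question_marks = ('?', '?')
    question_particles = ('吗', '呢', '么')
    kept = []
    for r in replies:
        r = r.strip()
        if r.endswith(question_marks) or r.endswith(question_particles):
            continue
        kept.append(r)
    return kept

# Example: feed in the collected Robot replies, pass the result to the emotion model later.
print(non_question_replies(["你在干什么?", "成大事者需要专注力哟~", "晚安啦~"]))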


Part 2: Theory of dialogue emotion recognition

Credit to the official PaddlePaddle article "七夕礼物没送对?飞桨PaddlePaddle帮你读懂女朋友的小心思" ("Got the Qixi gift wrong? PaddlePaddle helps you read your girlfriend's mind"), which was very inspiring.

Dialogue emotion recognition is useful in chat, customer service, and many other scenarios: it helps companies track conversation quality and improve the user experience of their products, and it can also be used to assess customer-service quality and reduce the cost of manual quality inspection. The task (Emotion Detection, EmoTect for short) focuses on recognizing the user's emotion in intelligent-dialogue settings: given a user utterance, it automatically predicts the emotion category, one of positive, negative, or neutral, together with a confidence score.

The table below shows evaluation results on Baidu's in-house test sets (chitchat and customer service) and the NLPCC2014 Weibo emotion dataset. PaddleNLP also open-sources a model that Baidu trained on massive data; fine-tuning it on chat/dialogue corpora gives even better results.

| Model | Chitchat | Customer service | Weibo |
| -------- | -------- | -------- | -------- |
| BOW | 90.2% | 87.6% | 74.2% |
| LSTM | 91.4% | 90.1% | 73.8% |
| Bi-LSTM | 91.2% | 89.9% | 73.6% |
| CNN | 90.8% | 90.7% | 76.3% |
| TextCNN | 91.1% | 91.0% | 76.8% |
| BERT | 93.6% | 92.3% | 78.6% |
| ERNIE | 94.4% | 94.0% | 80.6% |

  • BOW: Bag Of Words, a non-sequential model with a basic fully connected structure.

  • CNN: a shallow CNN that handles variable-length input and extracts features within a local window.

  • TextCNN: a CNN with multiple kernel sizes, which captures local correlations within a sentence better.

  • LSTM: a single-layer LSTM, which handles long-range dependencies in sequential text reasonably well.

  • BI-LSTM: a single-layer bidirectional LSTM, whose two directions capture sentence semantics better.

  • ERNIE: Baidu's general-purpose text representation model trained on massive data and prior knowledge, fine-tuned here on the dialogue emotion classification dataset.

Within the PaddlePaddle high-level API family, PaddleNLP is the library dedicated to natural language processing, while PaddleHub is a repository of ready-made pre-trained models that can be called directly. Both approaches are simple and high level, but they hide the details; later we also sketch an emotion analysis implementation built from the lower-level PaddlePaddle API.

Technically, the PaddleNLP "dialogue emotion recognition" model is tuned for the quirks of colloquial Chinese: heavy use of spoken language, particles, and scrambled word order. Preprocessing steps such as stripping colloquial fillers and mapping synonyms keep the input clean so the model can better grasp what the user actually means. The model is therefore broadly applicable in e-commerce, education, map navigation, and similar scenarios, helping machines understand people a little better.
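To make the idea concrete, here is a tiny, purely illustrative sketch of that kind of text clean-up; it is not the model's actual preprocessing, and the filler list and synonym table are invented for the example:

import re

FILLERS = ['嗯', '呃', '啊', '哈', '呀', '啦']            # example colloquial fillers (assumed)
SYNONYMS = {'木有': '没有', '肿么': '怎么', '灰常': '非常'}  # example synonym mapping (assumed)

def normalize(text):
    """Very rough clean-up: map slang to standard words, then drop filler particles."""
    for slang, standard in SYNONYMS.items():
        text = text.replace(slang, standard)
    pattern = '[' + ''.join(FILLERS) + ']'
    return re.sub(pattern, '', text)

print(normalize('嗯我肿么灰常想打你呀'))   # -> '我怎么非常想打你'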


Part 3: Dialogue emotion recognition with PaddleHub

1. Preparation:

This includes looking at the mounted dataset and workspace files, and upgrading both PaddlePaddle and PaddleHub to version 2.0.

# View dataset directory. Changes here are reset when the environment restarts.
!ls /home/aistudio/data
# View personal work directory. Changes here persist across resets; clean up unneeded files to speed up loading.
!ls /home/aistudio/work
# Upgrade PaddleHub and PaddlePaddle to 2.0
!pip install paddlehub==2.0.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install paddlepaddle==2.0.0 -i https://pypi.tuna.tsinghua.edu.cn/simple

2. Import the libraries and the sentiment analysis model from PaddleHub:

import paddlehub as hub
senta = hub.Module(name="senta_bilstm")

3. Put the messages you want to analyze into the test_text variable, one message per string in a flat list:

test_text = [
    "天啦,千万别多说,扰乱军心,哈哈",
    "该做什么的时候就得好好做,别多想了",
    "你的老师和伙伴都你需要专心,哈哈",
    "其实你说你想我我肯定很开心,说明你在乎呀",
    "成大事者需要专注力哟~",
    "晚安啦~"
]

4. Load the model parameters and run prediction:

input_dict = {"text": test_text}
results = senta.sentiment_classify(data=input_dict)
for result in results:
    print(result['text'])
    print(result['sentiment_label'])
    print(result['sentiment_key'])
    print(result['positive_probs'])
    print(result['negative_probs'])

5. Prediction results:

For the senta_bilstm module the label is binary: a sentiment_label of 1 means positive and 0 means negative, and the last two numbers are the probabilities with which the model leans positive or negative.
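Since the title promises "don't-get-hit" message recommendation, here is a tongue-in-cheek sketch of my own showing how the predicted label could drive a canned reply. The reply lists are invented, this is not part of any Paddle API, and it simply reuses the results list from the snippet above:

import random

SAFE_REPLIES = {
    'negative': ["对不起,是我的错,我马上改", "别生气啦,今晚想吃什么我去买"],
    'positive': ["嘿嘿,你开心我就开心", "那我们周末出去玩吧"],
}

def recommend_reply(result):
    """Pick a reply based on the sentiment_key returned by the PaddleHub module."""
    mood = result.get('sentiment_key', 'negative')   # default to the safe side
    return random.choice(SAFE_REPLIES.get(mood, SAFE_REPLIES['negative']))

for result in results:
    print(result['text'], '->', recommend_reply(result))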


Part 4: An upgraded dialogue emotion recognition with PaddleNLP

1. Preparation:

This means upgrading PaddleNLP to 2.0 and importing the necessary modules.

# First, install paddlenlp 2.0.
!pip install paddlenlp

# Import the relevant modules
import paddle
import paddlenlp as ppnlp
from paddlenlp.data import Stack, Pad, Tuple
import paddle.nn.functional as F
import numpy as np
from functools import partial  # partial() fixes some argument values and returns a new callable

2. Preparing the dataset:

We use the public Chinese sentiment analysis dataset ChnSentiCorp, which can be loaded with PaddleNLP's ppnlp.datasets.ChnSentiCorp.get_datasets method.

# Use the ChnSentiCorp corpus built into paddlenlp; it is mainly used for sentiment classification.
# The training set trains the model, the dev set selects it, and the test set measures generalization.
train_ds, dev_ds, test_ds = ppnlp.datasets.ChnSentiCorp.get_datasets(['train', 'dev', 'test'])
# Get the label list
label_list = train_ds.get_labels()
# Peek at the data: print the first 3 examples of each split.
print("Training examples: {}\n".format(train_ds[0:3]))
print("Dev examples: {}\n".format(dev_ds[0:3]))
print("Test examples: {}\n".format(test_ds[0:3]))
print("Number of training examples: {}".format(len(train_ds)))
print("Number of dev examples: {}".format(len(dev_ds)))
print("Number of test examples: {}".format(len(test_ds)))

3. Data preprocessing:

# Use ppnlp.transformers.BertTokenizer to turn raw text into the input format the model expects.
tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained("bert-base-chinese")

# Preprocess a single example
def convert_example(example, tokenizer, label_list, max_seq_length=256, is_test=False):
    if is_test:
        text = example
    else:
        text, label = example
    # tokenizer.encode tokenizes, maps tokens to IDs, and adds the special tokens
    encoded_inputs = tokenizer.encode(text=text, max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    segment_ids = encoded_inputs["segment_ids"]
    if not is_test:
        label_map = {}
        for (i, l) in enumerate(label_list):
            label_map[l] = i
        label = label_map[label]
        label = np.array([label], dtype="int64")
        return input_ids, segment_ids, label
    else:
        return input_ids, segment_ids

# Build a data loader
def create_dataloader(dataset, trans_fn=None, mode='train', batch_size=1, use_gpu=False, pad_token_id=0, batchify_fn=None):
    if trans_fn:
        dataset = dataset.apply(trans_fn, lazy=True)
    if mode == 'train' and use_gpu:
        sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True)
    else:
        shuffle = True if mode == 'train' else False  # do not shuffle unless training
        sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)  # build a sampler
    dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True, collate_fn=batchify_fn)
    return dataloader

# Use partial() to fix the tokenizer, label_list, max_seq_length and is_test arguments of convert_example
trans_fn = partial(convert_example, tokenizer=tokenizer, label_list=label_list, max_seq_length=128, is_test=False)
batchify_fn = lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id), Pad(axis=0, pad_val=tokenizer.pad_token_id), Stack(dtype="int64")): [data for data in fn(samples)]
# Training data loader
train_loader = create_dataloader(train_ds, mode='train', batch_size=64, batchify_fn=batchify_fn, trans_fn=trans_fn)
# Dev data loader
dev_loader = create_dataloader(dev_ds, mode='dev', batch_size=64, batchify_fn=batchify_fn, trans_fn=trans_fn)
# Test data loader
test_loader = create_dataloader(test_ds, mode='test', batch_size=64, batchify_fn=batchify_fn, trans_fn=trans_fn)

4. Model training:

# Load BertForSequenceClassification, the fine-tuning network for text classification:
# pre-trained BERT followed by a fully connected classification layer.
# Sentiment classification here is binary, so num_classes is set to 2.
model = ppnlp.transformers.BertForSequenceClassification.from_pretrained("bert-base-chinese", num_classes=2)

# Training hyperparameters
learning_rate = 1e-5        # learning rate
epochs = 20                 # number of training epochs
warmup_proportion = 0.1     # proportion of steps used for learning-rate warmup
weight_decay = 0.01         # weight decay coefficient

num_training_steps = len(train_loader) * epochs
num_warmup_steps = int(warmup_proportion * num_training_steps)

def get_lr_factor(current_step):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    else:
        return max(0.0,
                   float(num_training_steps - current_step) /
                   float(max(1, num_training_steps - num_warmup_steps)))

# Learning-rate scheduler
lr_scheduler = paddle.optimizer.lr.LambdaDecay(learning_rate, lr_lambda=lambda current_step: get_lr_factor(current_step))
# Optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])
# Loss function
criterion = paddle.nn.loss.CrossEntropyLoss()
# Evaluation metric
metric = paddle.metric.Accuracy()

# Evaluation loop
def evaluate(model, criterion, metric, data_loader):
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
        accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
    model.train()
    metric.reset()

# Start training
global_step = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_loader, start=1):  # pull batches from the training loader
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)  # compute the loss
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        global_step += 1
        if global_step % 50 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_gradients()
    evaluate(model, criterion, metric, dev_loader)

5. Model prediction:

As with the PaddleHub approach earlier, put the text you want to analyze into the data list near the end of the snippet, one message per string.

def predict(model, data, tokenizer, label_map, batch_size=1):
    examples = []
    for text in data:
        input_ids, segment_ids = convert_example(text, tokenizer, label_list=label_map.values(), max_seq_length=128, is_test=True)
        examples.append((input_ids, segment_ids))
    batchify_fn = lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id), Pad(axis=0, pad_val=tokenizer.pad_token_id)): fn(samples)

    # Split the examples into batches
    batches = []
    one_batch = []
    for example in examples:
        one_batch.append(example)
        if len(one_batch) == batch_size:
            batches.append(one_batch)
            one_batch = []
    if one_batch:
        batches.append(one_batch)

    results = []
    model.eval()
    for batch in batches:
        input_ids, segment_ids = batchify_fn(batch)
        input_ids = paddle.to_tensor(input_ids)
        segment_ids = paddle.to_tensor(segment_ids)
        logits = model(input_ids, segment_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results

# Text to analyze
data = ['有点东西啊', '这个老师讲课水平挺高的', '你在干什么']
label_map = {0: '负向情绪', 1: '正向情绪'}
predictions = predict(model, data, tokenizer, label_map, batch_size=32)
for idx, text in enumerate(data):
    print('预测文本: {} \n情绪标签: {}'.format(text, predictions[idx]))

6. Prediction results:

The output mirrors the PaddleHub section and is omitted here; the program prints whether each sentence leans toward positive or negative emotion.


Part 5: NLP emotion analysis built from scratch on low-level PaddlePaddle (ERNIE model)

For results, the models were evaluated on Baidu's in-house test sets (chitchat and customer service) and the NLPCC2014 Weibo emotion dataset; the numbers are in the table shown in Part 2. Baidu has also open-sourced a model trained on massive data which, after fine-tuning on chat/dialogue corpora, does even better.

The input of the dialogue emotion recognition task is a piece of user text and the output is the detected emotion category: negative, positive, or neutral. This is a classic three-way short-text classification task. Dataset link: https://aistudio.baidu.com/aistudio/datasetdetail/9740

Unpacking the dataset produces a data directory containing the training set (train.tsv), dev set (dev.tsv), test set (test.tsv), data to predict on (infer.tsv), and the vocabulary (vocab.txt).
The data used for training, prediction, and evaluation has two columns separated by a tab ('\t'): the first column is the emotion label (0 negative, 1 neutral, 2 positive) and the second column is space-segmented Chinese text.
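For instance, a couple of lines of pandas (my own choice here, not something the original code uses) are enough to peek at data in this format:

import pandas as pd

# train.tsv has a header line "label\ttext_a" followed by tab-separated rows
df = pd.read_csv('data/data9740/data/train.tsv', sep='\t')
print(df.head())                    # e.g. label=0, text_a="谁 骂人 了 ? ..."
print(df['label'].value_counts())   # class balance of negative/neutral/positive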

ERNIE: Baidu's general-purpose text semantic representation model, trained on massive data and prior knowledge, and fine-tuned here on the dialogue emotion classification dataset.

ERNIE was released in March 2019. It learns real-world semantic knowledge by modeling words, entities, and entity relations in massive data; where BERT learns from the raw language signal, ERNIE models prior semantic knowledge units directly, which strengthens its semantic representations.
In July of the same year Baidu released ERNIE 2.0, a continual-learning pre-training framework that builds up its pre-training tasks incrementally with multi-task learning. Newly constructed pre-training task types can be added to the framework seamlessly so that semantic learning continues. With the added tasks of entity prediction, sentence causality judgment, and document sentence-structure reconstruction, ERNIE 2.0 absorbs lexical, syntactic, and semantic information from the training data and greatly strengthens its general-purpose semantic representations.

Four cells define the basic network structures used in ErnieModel:

  1. multi_head_attention
  2. positionwise_feed_forward
  3. pre_post_process_layer: adds residual connections, layer normalization, and dropout; used before and after multi_head_attention and positionwise_feed_forward
  4. encoder_layer: combines the three structures above into a single encoder layer
  5. encoder: stacks encoder_layer to build the complete encoder

For background on multi_head_attention and positionwise_feed_forward, see The Annotated Transformer.

Three cells define the tokenizer classes:

  1. FullTokenizer: the complete tokenizer used by the data readers, built on BasicTokenizer and WordpieceTokenizer
  2. BasicTokenizer: basic tokenization, including punctuation splitting and lower-casing
  3. WordpieceTokenizer: word-piece splitting

Four cells define the data readers and preprocessing code:

  1. BaseReader: base class for the data readers
  2. ClassifyReader: data reader for the classification model; overrides _read_tsv and _pad_batch_records
  3. pad_batch_data: preprocessing that pads each batch and generates position data and the input mask
  4. ernie_pyreader: builds the pyreaders used for training, validation, and prediction

Dataset and ERNIE network configuration:

# Dataset configuration
data_config = {
    'data_dir': 'data/data9740/data',                           # directory of the training data
    'vocab_path': 'pretrained_model/ernie_finetune/vocab.txt',  # vocabulary path
    'batch_size': 32,        # total number of examples per training batch
    'random_seed': 0,        # random seed
    'num_labels': 3,         # number of labels
    'max_seq_len': 512,      # number of words in the longest sequence
    'train_set': 'data/data9740/data/test.tsv',   # path to training data
    'test_set': 'data/data9740/data/test.tsv',    # path to test data
    'dev_set': 'data/data9740/data/dev.tsv',      # path to validation data
    'infer_set': 'data/data9740/data/infer.tsv',  # path to inference data
    'label_map_config': None,  # label_map path
    'do_lower_case': True,   # whether to lower-case the input text: True for uncased models, False for cased models
}

# ERNIE network configuration
ernie_net_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "relu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 513,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 18000,
}

Training configuration:

train_config = {
    'init_checkpoint': 'pretrained_model/ernie_finetune/params',
    'output_dir': 'train_model',
    'epoch': 10,
    'save_steps': 100,
    'validation_steps': 100,
    'lr': 0.00002,
    'skip_steps': 10,
    'verbose': False,
    'use_cuda': True,
}

Inference configuration:

infer_config = {
    'init_checkpoint': 'train_model',
    'use_cuda': True,
}

The full implementation, in detail:

  1. #!/usr/bin/env python
  2. # coding: utf-8
  3. # ### 一、项目背景介绍
  4. # 对话情绪识别(Emotion Detection,简称EmoTect),专注于识别智能对话场景中用户的情绪,针对智能对话场景中的用户文本,自动判断该文本的情绪类别并给出相应的置信度,情绪类型分为积极、消极、中性。
  5. #
  6. # 对话情绪识别适用于聊天、客服等多个场景,能够帮助企业更好地把握对话质量、改善产品的用户交互体验,也能分析客服服务质量、降低人工质检成本。可通过 AI开放平台-对话情绪识别 线上体验。
  7. #
  8. # 效果上,我们基于百度自建测试集(包含闲聊、客服)和nlpcc2014微博情绪数据集,进行评测,效果如下表所示,此外我们还开源了百度基于海量数据训练好的模型,该模型在聊天对话语料上fine-tune之后,可以得到更好的效果。
  9. #
  10. #
  11. # | 模型 | 闲聊 | 客服 | 微博 |
  12. # | -------- | -------- | -------- | -------- |
  13. # | BOW | 90.2% | 87.6% | 74.2% |
  14. # | LSTM | 91.4% | 90.1% | 73.8% |
  15. # | Bi-LSTM | 91.2% | 89.9% | 73.6% |
  16. # | CNN | 90.8% | 90.7% | 76.3% |
  17. # | TextCNN | 91.1% | 91.0% | 76.8% |
  18. # | BERT | 93.6% | 92.3% | 78.6% |
  19. # | ERNIE | 94.4% | 94.0% | 80.6% |
  20. # ### 二、数据集介绍
  21. #
  22. # 对话情绪识别任务输入是一段用户文本,输出是检测到的情绪类别,包括消极、积极、中性,这是一个经典的短文本三分类任务。
  23. #
  24. # 数据集解压后生成data目录,data目录下有训练集数据(train.tsv)、开发集数据(dev.tsv)、测试集数据(test.tsv)、 待预测数据(infer.tsv)以及对应词典(vocab.txt)
  25. # 训练、预测、评估使用的数据示例如下,数据由两列组成,以制表符('\t')分隔,第一列是情绪分类的类别(0表示消极;1表示中性;2表示积极),第二列是以空格分词的中文文本:
  26. #
  27. # label text_a
  28. # 0 谁 骂人 了 ? 我 从来 不 骂人 , 我 骂 的 都 不是 人 , 你 是 人 吗 ?
  29. # 1 我 有事 等会儿 就 回来 和 你 聊
  30. # 2 我 见到 你 很高兴 谢谢 你 帮 我
  31. # In[1]:
  32. # 解压数据集
  33. get_ipython().system('cd /home/aistudio/data/data9740 && unzip -qo 对话情绪识别.zip')
  34. # In[2]:
  35. # 各种引用库
  36. from __future__ import absolute_import
  37. from __future__ import division
  38. from __future__ import print_function
  39. import io
  40. import os
  41. import six
  42. import sys
  43. import time
  44. import random
  45. import string
  46. import logging
  47. import argparse
  48. import collections
  49. import unicodedata
  50. from functools import partial
  51. from collections import namedtuple
  52. import multiprocessing
  53. import paddle
  54. import paddle.fluid as fluid
  55. import paddle.fluid.layers as layers
  56. import numpy as np
  57. # In[3]:
  58. # 统一的 logger 配置
  59. logger = None
  60. def init_log_config():
  61. """
  62. 初始化日志相关配置
  63. :return:
  64. """
  65. global logger
  66. logger = logging.getLogger()
  67. logger.setLevel(logging.INFO)
  68. log_path = os.path.join(os.getcwd(), 'logs')
  69. if not os.path.exists(log_path):
  70. os.makedirs(log_path)
  71. log_name = os.path.join(log_path, 'train.log')
  72. sh = logging.StreamHandler()
  73. fh = logging.FileHandler(log_name, mode='w')
  74. fh.setLevel(logging.DEBUG)
  75. formatter = logging.Formatter("%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s")
  76. fh.setFormatter(formatter)
  77. sh.setFormatter(formatter)
  78. logger.handlers = []
  79. logger.addHandler(sh)
  80. logger.addHandler(fh)
  81. # In[4]:
  82. # util
  83. def print_arguments(args):
  84. """
  85. 打印参数
  86. """
  87. logger.info('----------- Configuration Arguments -----------')
  88. for key in args.keys():
  89. logger.info('%s: %s' % (key, args[key]))
  90. logger.info('------------------------------------------------')
  91. def init_checkpoint(exe, init_checkpoint_path, main_program):
  92. """
  93. 加载缓存模型
  94. """
  95. assert os.path.exists(
  96. init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path
  97. def existed_persitables(var):
  98. """
  99. If existed presitabels
  100. """
  101. if not fluid.io.is_persistable(var):
  102. return False
  103. return os.path.exists(os.path.join(init_checkpoint_path, var.name))
  104. fluid.io.load_vars(
  105. exe,
  106. init_checkpoint_path,
  107. main_program=main_program,
  108. predicate=existed_persitables)
  109. logger.info("Load model from {}".format(init_checkpoint_path))
  110. def csv_reader(fd, delimiter='\t'):
  111. """
  112. csv 文件读取
  113. """
  114. def gen():
  115. for i in fd:
  116. slots = i.rstrip('\n').split(delimiter)
  117. if len(slots) == 1:
  118. yield slots,
  119. else:
  120. yield slots
  121. return gen()
  122. # ### 三、网络结构构建
  123. # **ERNIE**:百度自研基于海量数据和先验知识训练的通用文本语义表示模型,并基于此在对话情绪分类数据集上进行 fine-tune 获得。
  124. #
  125. # **ERNIE** 于 2019 年 3 月发布,通过建模海量数据中的词、实体及实体关系,学习真实世界的语义知识。相较于 BERT 学习原始语言信号,**ERNIE** 直接对先验语义知识单元进行建模,增强了模型语义表示能力。
  126. # 同年 7 月,百度发布了 **ERNIE 2.0**。**ERNIE 2.0** 是基于持续学习的语义理解预训练框架,使用多任务学习增量式构建预训练任务。**ERNIE 2.0** 中,新构建的预训练任务类型可以无缝的加入训练框架,持续的进行语义理解学习。 通过新增的实体预测、句子因果关系判断、文章句子结构重建等语义任务,**ERNIE 2.0** 语义理解预训练模型从训练数据中获取了词法、句法、语义等多个维度的自然语言信息,极大地增强了通用语义表示能力,示意图如下:
  127. # <img src="https://ai-studio-static-online.cdn.bcebos.com/b2c3107a955147238ecd0d44fff9850f9c47b45c189044009e688de80e7fc826" width="50%" />
  128. #
  129. # 参考资料:
  130. # 1. [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223)
  131. # 2. [ERNIE 2.0: A Continual Pre-training Framework for Language Understanding](https://arxiv.org/abs/1907.12412)
  132. # 3. [ERNIE预览:百度 知识增强语义表示模型ERNIE](https://www.jianshu.com/p/fb66f444bb8c)
  133. # 4. [ERNIE 2.0 GitHub](https://github.com/PaddlePaddle/ERNIE)
  134. # #### 3.1 ERNIE 模型定义
  135. # class ErnieModel 定义 ERNIE encoder 网络结构
  136. # **输入** src_ids、position_ids、sentence_ids 和 input_mask
  137. # **输出** sequence_output 和 pooled_output
  138. # In[5]:
  139. class ErnieModel(object):
  140. """Ernie模型定义"""
  141. def __init__(self,
  142. src_ids,
  143. position_ids,
  144. sentence_ids,
  145. input_mask,
  146. config,
  147. weight_sharing=True,
  148. use_fp16=False):
  149. # Ernie 相关参数
  150. self._emb_size = config['hidden_size']
  151. self._n_layer = config['num_hidden_layers']
  152. self._n_head = config['num_attention_heads']
  153. self._voc_size = config['vocab_size']
  154. self._max_position_seq_len = config['max_position_embeddings']
  155. self._sent_types = config['type_vocab_size']
  156. self._hidden_act = config['hidden_act']
  157. self._prepostprocess_dropout = config['hidden_dropout_prob']
  158. self._attention_dropout = config['attention_probs_dropout_prob']
  159. self._weight_sharing = weight_sharing
  160. self._word_emb_name = "word_embedding"
  161. self._pos_emb_name = "pos_embedding"
  162. self._sent_emb_name = "sent_embedding"
  163. self._dtype = "float16" if use_fp16 else "float32"
  164. # Initialize all weigths by truncated normal initializer, and all biases
  165. # will be initialized by constant zero by default.
  166. self._param_initializer = fluid.initializer.TruncatedNormal(
  167. scale=config['initializer_range'])
  168. self._build_model(src_ids, position_ids, sentence_ids, input_mask)
  169. def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
  170. # padding id in vocabulary must be set to 0
  171. emb_out = fluid.layers.embedding(
  172. input=src_ids,
  173. size=[self._voc_size, self._emb_size],
  174. dtype=self._dtype,
  175. param_attr=fluid.ParamAttr(
  176. name=self._word_emb_name, initializer=self._param_initializer),
  177. is_sparse=False)
  178. position_emb_out = fluid.layers.embedding(
  179. input=position_ids,
  180. size=[self._max_position_seq_len, self._emb_size],
  181. dtype=self._dtype,
  182. param_attr=fluid.ParamAttr(
  183. name=self._pos_emb_name, initializer=self._param_initializer))
  184. sent_emb_out = fluid.layers.embedding(
  185. sentence_ids,
  186. size=[self._sent_types, self._emb_size],
  187. dtype=self._dtype,
  188. param_attr=fluid.ParamAttr(
  189. name=self._sent_emb_name, initializer=self._param_initializer))
  190. emb_out = emb_out + position_emb_out
  191. emb_out = emb_out + sent_emb_out
  192. emb_out = pre_process_layer(
  193. emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
  194. if self._dtype == "float16":
  195. input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
  196. self_attn_mask = fluid.layers.matmul(
  197. x=input_mask, y=input_mask, transpose_y=True)
  198. self_attn_mask = fluid.layers.scale(
  199. x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
  200. n_head_self_attn_mask = fluid.layers.stack(
  201. x=[self_attn_mask] * self._n_head, axis=1)
  202. n_head_self_attn_mask.stop_gradient = True
  203. self._enc_out = encoder(
  204. enc_input=emb_out,
  205. attn_bias=n_head_self_attn_mask,
  206. n_layer=self._n_layer,
  207. n_head=self._n_head,
  208. d_key=self._emb_size // self._n_head,
  209. d_value=self._emb_size // self._n_head,
  210. d_model=self._emb_size,
  211. d_inner_hid=self._emb_size * 4,
  212. prepostprocess_dropout=self._prepostprocess_dropout,
  213. attention_dropout=self._attention_dropout,
  214. relu_dropout=0,
  215. hidden_act=self._hidden_act,
  216. preprocess_cmd="",
  217. postprocess_cmd="dan",
  218. param_initializer=self._param_initializer,
  219. name='encoder')
  220. def get_sequence_output(self):
  221. """Get embedding of each token for squence labeling"""
  222. return self._enc_out
  223. def get_pooled_output(self):
  224. """Get the first feature of each sequence for classification"""
  225. next_sent_feat = fluid.layers.slice(
  226. input=self._enc_out, axes=[1], starts=[0], ends=[1])
  227. next_sent_feat = fluid.layers.fc(
  228. input=next_sent_feat,
  229. size=self._emb_size,
  230. act="tanh",
  231. param_attr=fluid.ParamAttr(
  232. name="pooled_fc.w_0", initializer=self._param_initializer),
  233. bias_attr="pooled_fc.b_0")
  234. return next_sent_feat
  235. # #### 3.2 基本网络结构定义
  236. # 以下 4 个 cell 定义 ErnieModel 中使用的基本网络结构,包括:
  237. # 1. multi_head_attention
  238. # 2. positionwise_feed_forward
  239. # 3. pre_post_process_layer:增加 residual connection, layer normalization 和 droput,在 multi_head_attention 和 positionwise_feed_forward 前后使用
  240. # 4. encoder_layer:调用上述三种结构生成 encoder 层
  241. # 5. encoder:堆叠 encoder_layer 生成完整的 encoder
  242. #
  243. # 关于 multi_head_attention 和 positionwise_feed_forward 的介绍可以参考:[The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
  244. # In[6]:
  245. def multi_head_attention(queries, keys, values, attn_bias, d_key, d_value, d_model, n_head=1, dropout_rate=0.,
  246. cache=None, param_initializer=None, name='multi_head_att'):
  247. """
  248. Multi-Head Attention. Note that attn_bias is added to the logit before
  249. computing softmax activiation to mask certain selected positions so that
  250. they will not considered in attention weights.
  251. """
  252. keys = queries if keys is None else keys
  253. values = keys if values is None else values
  254. if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
  255. raise ValueError(
  256. "Inputs: quries, keys and values should all be 3-D tensors.")
  257. def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
  258. """
  259. Add linear projection to queries, keys, and values.
  260. """
  261. q = layers.fc(input=queries,
  262. size=d_key * n_head,
  263. num_flatten_dims=2,
  264. param_attr=fluid.ParamAttr(
  265. name=name + '_query_fc.w_0',
  266. initializer=param_initializer),
  267. bias_attr=name + '_query_fc.b_0')
  268. k = layers.fc(input=keys,
  269. size=d_key * n_head,
  270. num_flatten_dims=2,
  271. param_attr=fluid.ParamAttr(
  272. name=name + '_key_fc.w_0',
  273. initializer=param_initializer),
  274. bias_attr=name + '_key_fc.b_0')
  275. v = layers.fc(input=values,
  276. size=d_value * n_head,
  277. num_flatten_dims=2,
  278. param_attr=fluid.ParamAttr(
  279. name=name + '_value_fc.w_0',
  280. initializer=param_initializer),
  281. bias_attr=name + '_value_fc.b_0')
  282. return q, k, v
  283. def __split_heads(x, n_head):
  284. """
  285. Reshape the last dimension of inpunt tensor x so that it becomes two
  286. dimensions and then transpose. Specifically, input a tensor with shape
  287. [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
  288. with shape [bs, n_head, max_sequence_length, hidden_dim].
  289. """
  290. hidden_size = x.shape[-1]
  291. # The value 0 in shape attr means copying the corresponding dimension
  292. # size of the input as the output dimension size.
  293. reshaped = layers.reshape(
  294. x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
  295. # permuate the dimensions into:
  296. # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
  297. return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
  298. def __combine_heads(x):
  299. """
  300. Transpose and then reshape the last two dimensions of inpunt tensor x
  301. so that it becomes one dimension, which is reverse to __split_heads.
  302. """
  303. if len(x.shape) == 3:
  304. return x
  305. if len(x.shape) != 4:
  306. raise ValueError("Input(x) should be a 4-D Tensor.")
  307. trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
  308. # The value 0 in shape attr means copying the corresponding dimension
  309. # size of the input as the output dimension size.
  310. return layers.reshape(
  311. x=trans_x,
  312. shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
  313. inplace=True)
  314. def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
  315. """
  316. Scaled Dot-Product Attention
  317. """
  318. scaled_q = layers.scale(x=q, scale=d_key**-0.5)
  319. product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
  320. if attn_bias:
  321. product += attn_bias
  322. weights = layers.softmax(product)
  323. if dropout_rate:
  324. weights = layers.dropout(
  325. weights,
  326. dropout_prob=dropout_rate,
  327. dropout_implementation="upscale_in_train",
  328. is_test=False)
  329. out = layers.matmul(weights, v)
  330. return out
  331. q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
  332. if cache is not None: # use cache and concat time steps
  333. # Since the inplace reshape in __split_heads changes the shape of k and
  334. # v, which is the cache input for next time step, reshape the cache
  335. # input from the previous time step first.
  336. k = cache["k"] = layers.concat(
  337. [layers.reshape(
  338. cache["k"], shape=[0, 0, d_model]), k], axis=1)
  339. v = cache["v"] = layers.concat(
  340. [layers.reshape(
  341. cache["v"], shape=[0, 0, d_model]), v], axis=1)
  342. q = __split_heads(q, n_head)
  343. k = __split_heads(k, n_head)
  344. v = __split_heads(v, n_head)
  345. ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
  346. dropout_rate)
  347. out = __combine_heads(ctx_multiheads)
  348. # Project back to the model size.
  349. proj_out = layers.fc(input=out,
  350. size=d_model,
  351. num_flatten_dims=2,
  352. param_attr=fluid.ParamAttr(
  353. name=name + '_output_fc.w_0',
  354. initializer=param_initializer),
  355. bias_attr=name + '_output_fc.b_0')
  356. return proj_out
  357. # In[7]:
  358. def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
  359. """
  360. Position-wise Feed-Forward Networks.
  361. This module consists of two linear transformations with a ReLU activation
  362. in between, which is applied to each position separately and identically.
  363. """
  364. hidden = layers.fc(input=x, size=d_inner_hid, num_flatten_dims=2, act=hidden_act,
  365. param_attr=fluid.ParamAttr(
  366. name=name + '_fc_0.w_0',
  367. initializer=param_initializer),
  368. bias_attr=name + '_fc_0.b_0')
  369. if dropout_rate:
  370. hidden = layers.dropout(hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
  371. out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2,
  372. param_attr=fluid.ParamAttr(
  373. name=name + '_fc_1.w_0', initializer=param_initializer),
  374. bias_attr=name + '_fc_1.b_0')
  375. return out
  376. # In[8]:
  377. def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
  378. name=''):
  379. """
  380. Add residual connection, layer normalization and droput to the out tensor
  381. optionally according to the value of process_cmd.
  382. This will be used before or after multi-head attention and position-wise
  383. feed-forward networks.
  384. """
  385. for cmd in process_cmd:
  386. if cmd == "a": # add residual connection
  387. out = out + prev_out if prev_out else out
  388. elif cmd == "n": # add layer normalization
  389. out_dtype = out.dtype
  390. if out_dtype == fluid.core.VarDesc.VarType.FP16:
  391. out = layers.cast(x=out, dtype="float32")
  392. out = layers.layer_norm(
  393. out,
  394. begin_norm_axis=len(out.shape) - 1,
  395. param_attr=fluid.ParamAttr(
  396. name=name + '_layer_norm_scale',
  397. initializer=fluid.initializer.Constant(1.)),
  398. bias_attr=fluid.ParamAttr(
  399. name=name + '_layer_norm_bias',
  400. initializer=fluid.initializer.Constant(0.)))
  401. if out_dtype == fluid.core.VarDesc.VarType.FP16:
  402. out = layers.cast(x=out, dtype="float16")
  403. elif cmd == "d": # add dropout
  404. if dropout_rate:
  405. out = layers.dropout(
  406. out,
  407. dropout_prob=dropout_rate,
  408. dropout_implementation="upscale_in_train",
  409. is_test=False)
  410. return out
  411. pre_process_layer = partial(pre_post_process_layer, None)
  412. post_process_layer = pre_post_process_layer
  413. # In[9]:
  414. def encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout,
  415. attention_dropout, relu_dropout, hidden_act, preprocess_cmd="n", postprocess_cmd="da",
  416. param_initializer=None, name=''):
  417. """The encoder layers that can be stacked to form a deep encoder.
  418. This module consits of a multi-head (self) attention followed by
  419. position-wise feed-forward networks and both the two components companied
  420. with the post_process_layer to add residual connection, layer normalization
  421. and droput.
  422. """
  423. attn_output = multi_head_attention(
  424. pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
  425. None, None, attn_bias, d_key, d_value, d_model, n_head, attention_dropout,
  426. param_initializer=param_initializer, name=name + '_multi_head_att')
  427. attn_output = post_process_layer(enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
  428. ffd_output = positionwise_feed_forward(
  429. pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
  430. d_inner_hid, d_model, relu_dropout, hidden_act, param_initializer=param_initializer,
  431. name=name + '_ffn')
  432. return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
  433. def encoder(enc_input, attn_bias, n_layer, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout,
  434. attention_dropout, relu_dropout, hidden_act, preprocess_cmd="n", postprocess_cmd="da",
  435. param_initializer=None, name=''):
  436. """
  437. The encoder is composed of a stack of identical layers returned by calling
  438. encoder_layer.
  439. """
  440. for i in range(n_layer):
  441. enc_output = encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model, d_inner_hid,
  442. prepostprocess_dropout, attention_dropout, relu_dropout, hidden_act, preprocess_cmd,
  443. postprocess_cmd, param_initializer=param_initializer, name=name + '_layer_' + str(i))
  444. enc_input = enc_output
  445. enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
  446. return enc_output
  447. # #### 3.3 编码器 和 分类器 定义
  448. # 以下 cell 定义 encoder 和 classification 的组织结构:
  449. # 1. ernie_encoder:根据 ErnieModel 组织输出 embeddings
  450. # 2. create_ernie_model:定义分类网络,以 embeddings 为输入,使用全连接网络 + softmax 做分类
  451. # In[10]:
  452. def ernie_encoder(ernie_inputs, ernie_config):
  453. """return sentence embedding and token embeddings"""
  454. ernie = ErnieModel(
  455. src_ids=ernie_inputs["src_ids"],
  456. position_ids=ernie_inputs["pos_ids"],
  457. sentence_ids=ernie_inputs["sent_ids"],
  458. input_mask=ernie_inputs["input_mask"],
  459. config=ernie_config)
  460. enc_out = ernie.get_sequence_output()
  461. unpad_enc_out = fluid.layers.sequence_unpad(
  462. enc_out, length=ernie_inputs["seq_lens"])
  463. cls_feats = ernie.get_pooled_output()
  464. embeddings = {
  465. "sentence_embeddings": cls_feats,
  466. "token_embeddings": unpad_enc_out,
  467. }
  468. for k, v in embeddings.items():
  469. v.persistable = True
  470. return embeddings
  471. def create_ernie_model(args,
  472. embeddings,
  473. labels,
  474. is_prediction=False):
  475. """
  476. Create Model for sentiment classification based on ERNIE encoder
  477. """
  478. sentence_embeddings = embeddings["sentence_embeddings"]
  479. token_embeddings = embeddings["token_embeddings"]
  480. cls_feats = fluid.layers.dropout(
  481. x=sentence_embeddings,
  482. dropout_prob=0.1,
  483. dropout_implementation="upscale_in_train")
  484. logits = fluid.layers.fc(
  485. input=cls_feats,
  486. size=args['num_labels'],
  487. param_attr=fluid.ParamAttr(
  488. name="cls_out_w",
  489. initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
  490. bias_attr=fluid.ParamAttr(
  491. name="cls_out_b", initializer=fluid.initializer.Constant(0.)))
  492. ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
  493. logits=logits, label=labels, return_softmax=True)
  494. if is_prediction:
  495. return probs
  496. loss = fluid.layers.mean(x=ce_loss)
  497. num_seqs = fluid.layers.create_tensor(dtype='int64')
  498. accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)
  499. return loss, accuracy, num_seqs
  500. # #### 3.4 分词代码
  501. # 以下 3 个 cell 定义分词代码类,包括:
  502. # 1. FullTokenizer:完整的分词,在数据读取代码中使用,调用 BasicTokenizer 和 WordpieceTokenizer 实现
  503. # 2. BasicTokenizer:基本分词,包括标点划分、小写转换等
  504. # 3. WordpieceTokenizer:单词划分
  505. # In[11]:
  506. class FullTokenizer(object):
  507. """Runs end-to-end tokenziation."""
  508. def __init__(self, vocab_file, do_lower_case=True):
  509. self.vocab = load_vocab(vocab_file)
  510. self.inv_vocab = {v: k for k, v in self.vocab.items()}
  511. self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
  512. self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
  513. def tokenize(self, text):
  514. split_tokens = []
  515. for token in self.basic_tokenizer.tokenize(text):
  516. for sub_token in self.wordpiece_tokenizer.tokenize(token):
  517. split_tokens.append(sub_token)
  518. return split_tokens
  519. def convert_tokens_to_ids(self, tokens):
  520. return convert_by_vocab(self.vocab, tokens)
  521. # In[12]:
  522. class BasicTokenizer(object):
  523. """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
  524. def __init__(self, do_lower_case=True):
  525. """Constructs a BasicTokenizer.
  526. Args:
  527. do_lower_case: Whether to lower case the input.
  528. """
  529. self.do_lower_case = do_lower_case
  530. def tokenize(self, text):
  531. """Tokenizes a piece of text."""
  532. text = convert_to_unicode(text)
  533. text = self._clean_text(text)
  534. # This was added on November 1st, 2018 for the multilingual and Chinese
  535. # models. This is also applied to the English models now, but it doesn't
  536. # matter since the English models were not trained on any Chinese data
  537. # and generally don't have any Chinese data in them (there are Chinese
  538. # characters in the vocabulary because Wikipedia does have some Chinese
  539. # words in the English Wikipedia.).
  540. text = self._tokenize_chinese_chars(text)
  541. orig_tokens = whitespace_tokenize(text)
  542. split_tokens = []
  543. for token in orig_tokens:
  544. if self.do_lower_case:
  545. token = token.lower()
  546. token = self._run_strip_accents(token)
  547. split_tokens.extend(self._run_split_on_punc(token))
  548. output_tokens = whitespace_tokenize(" ".join(split_tokens))
  549. return output_tokens
  550. def _run_strip_accents(self, text):
  551. """Strips accents from a piece of text."""
  552. text = unicodedata.normalize("NFD", text)
  553. output = []
  554. for char in text:
  555. cat = unicodedata.category(char)
  556. if cat == "Mn":
  557. continue
  558. output.append(char)
  559. return "".join(output)
  560. def _run_split_on_punc(self, text):
  561. """Splits punctuation on a piece of text."""
  562. chars = list(text)
  563. i = 0
  564. start_new_word = True
  565. output = []
  566. while i < len(chars):
  567. char = chars[i]
  568. if _is_punctuation(char):
  569. output.append([char])
  570. start_new_word = True
  571. else:
  572. if start_new_word:
  573. output.append([])
  574. start_new_word = False
  575. output[-1].append(char)
  576. i += 1
  577. return ["".join(x) for x in output]
  578. def _tokenize_chinese_chars(self, text):
  579. """Adds whitespace around any CJK character."""
  580. output = []
  581. for char in text:
  582. cp = ord(char)
  583. if self._is_chinese_char(cp):
  584. output.append(" ")
  585. output.append(char)
  586. output.append(" ")
  587. else:
  588. output.append(char)
  589. return "".join(output)
  590. def _is_chinese_char(self, cp):
  591. """Checks whether CP is the codepoint of a CJK character."""
  592. # This defines a "chinese character" as anything in the CJK Unicode block:
  593. # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
  594. #
  595. # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
  596. # despite its name. The modern Korean Hangul alphabet is a different block,
  597. # as is Japanese Hiragana and Katakana. Those alphabets are used to write
  598. # space-separated words, so they are not treated specially and handled
  599. # like the all of the other languages.
  600. if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
  601. (cp >= 0x3400 and cp <= 0x4DBF) or #
  602. (cp >= 0x20000 and cp <= 0x2A6DF) or #
  603. (cp >= 0x2A700 and cp <= 0x2B73F) or #
  604. (cp >= 0x2B740 and cp <= 0x2B81F) or #
  605. (cp >= 0x2B820 and cp <= 0x2CEAF) or
  606. (cp >= 0xF900 and cp <= 0xFAFF) or #
  607. (cp >= 0x2F800 and cp <= 0x2FA1F)): #
  608. return True
  609. return False
  610. def _clean_text(self, text):
  611. """Performs invalid character removal and whitespace cleanup on text."""
  612. output = []
  613. for char in text:
  614. cp = ord(char)
  615. if cp == 0 or cp == 0xfffd or _is_control(char):
  616. continue
  617. if _is_whitespace(char):
  618. output.append(" ")
  619. else:
  620. output.append(char)
  621. return "".join(output)
  622. # In[13]:
  623. class WordpieceTokenizer(object):
  624. """Runs WordPiece tokenziation."""
  625. def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
  626. self.vocab = vocab
  627. self.unk_token = unk_token
  628. self.max_input_chars_per_word = max_input_chars_per_word
  629. def tokenize(self, text):
  630. """Tokenizes a piece of text into its word pieces.
  631. This uses a greedy longest-match-first algorithm to perform tokenization
  632. using the given vocabulary.
  633. For example:
  634. input = "unaffable"
  635. output = ["un", "##aff", "##able"]
  636. Args:
  637. text: A single token or whitespace separated tokens. This should have
  638. already been passed through `BasicTokenizer.
  639. Returns:
  640. A list of wordpiece tokens.
  641. """
  642. text = convert_to_unicode(text)
  643. output_tokens = []
  644. for token in whitespace_tokenize(text):
  645. chars = list(token)
  646. if len(chars) > self.max_input_chars_per_word:
  647. output_tokens.append(self.unk_token)
  648. continue
  649. is_bad = False
  650. start = 0
  651. sub_tokens = []
  652. while start < len(chars):
  653. end = len(chars)
  654. cur_substr = None
  655. while start < end:
  656. substr = "".join(chars[start:end])
  657. if start > 0:
  658. substr = "##" + substr
  659. if substr in self.vocab:
  660. cur_substr = substr
  661. break
  662. end -= 1
  663. if cur_substr is None:
  664. is_bad = True
  665. break
  666. sub_tokens.append(cur_substr)
  667. start = end
  668. if is_bad:
  669. output_tokens.append(self.unk_token)
  670. else:
  671. output_tokens.extend(sub_tokens)
  672. return output_tokens
  673. # #### 3.5 分词辅助代码
  674. # 以下 cell 定义分词中的辅助性代码,包括 convert_to_unicode、whitespace_tokenize 等。
  675. # In[14]:
  676. def convert_to_unicode(text):
  677. """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
  678. if six.PY3:
  679. if isinstance(text, str):
  680. return text
  681. elif isinstance(text, bytes):
  682. return text.decode("utf-8", "ignore")
  683. else:
  684. raise ValueError("Unsupported string type: %s" % (type(text)))
  685. elif six.PY2:
  686. if isinstance(text, str):
  687. return text.decode("utf-8", "ignore")
  688. elif isinstance(text, unicode):
  689. return text
  690. else:
  691. raise ValueError("Unsupported string type: %s" % (type(text)))
  692. else:
  693. raise ValueError("Not running on Python2 or Python 3?")
  694. def load_vocab(vocab_file):
  695. """Loads a vocabulary file into a dictionary."""
  696. vocab = collections.OrderedDict()
  697. fin = io.open(vocab_file, encoding="utf8")
  698. for num, line in enumerate(fin):
  699. items = convert_to_unicode(line.strip()).split("\t")
  700. if len(items) > 2:
  701. break
  702. token = items[0]
  703. index = items[1] if len(items) == 2 else num
  704. token = token.strip()
  705. vocab[token] = int(index)
  706. return vocab
  707. def convert_by_vocab(vocab, items):
  708. """Converts a sequence of [tokens|ids] using the vocab."""
  709. output = []
  710. for item in items:
  711. output.append(vocab[item])
  712. return output
  713. def whitespace_tokenize(text):
  714. """Runs basic whitespace cleaning and splitting on a peice of text."""
  715. text = text.strip()
  716. if not text:
  717. return []
  718. tokens = text.split()
  719. return tokens
  720. def _is_whitespace(char):
  721. """Checks whether `chars` is a whitespace character."""
  722. # \t, \n, and \r are technically contorl characters but we treat them
  723. # as whitespace since they are generally considered as such.
  724. if char == " " or char == "\t" or char == "\n" or char == "\r":
  725. return True
  726. cat = unicodedata.category(char)
  727. if cat == "Zs":
  728. return True
  729. return False
  730. def _is_control(char):
  731. """Checks whether `chars` is a control character."""
  732. # These are technically control characters but we count them as whitespace
  733. # characters.
  734. if char == "\t" or char == "\n" or char == "\r":
  735. return False
  736. cat = unicodedata.category(char)
  737. if cat.startswith("C"):
  738. return True
  739. return False
  740. def _is_punctuation(char):
  741. """Checks whether `chars` is a punctuation character."""
  742. cp = ord(char)
  743. # We treat all non-letter/number ASCII as punctuation.
  744. # Characters such as "^", "$", and "`" are not in the Unicode
  745. # Punctuation class but we treat them as punctuation anyways, for
  746. # consistency.
  747. if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
  748. (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
  749. return True
  750. cat = unicodedata.category(char)
  751. if cat.startswith("P"):
  752. return True
  753. return False
  754. # #### 3.6 数据读取 及 预处理代码
  755. # 以下 4 个 cell 定义数据读取器和预处理代码,包括:
  756. # 1. BaseReader:数据读取器基类
  757. # 2. ClassifyReader:用于分类模型的数据读取器,重写 _readtsv 和 _pad_batch_records 方法
  758. # 3. pad_batch_data:数据预处理,给数据加 padding,并生成位置数据和 mask
  759. # 4. ernie_pyreader:生成训练、验证和预测使用的 pyreader
  760. # In[15]:
  761. class BaseReader(object):
  762. """BaseReader for classify and sequence labeling task"""
  763. def __init__(self,
  764. vocab_path,
  765. label_map_config=None,
  766. max_seq_len=512,
  767. do_lower_case=True,
  768. in_tokens=False,
  769. random_seed=None):
  770. self.max_seq_len = max_seq_len
  771. self.tokenizer = FullTokenizer(
  772. vocab_file=vocab_path, do_lower_case=do_lower_case)
  773. self.vocab = self.tokenizer.vocab
  774. self.pad_id = self.vocab["[PAD]"]
  775. self.cls_id = self.vocab["[CLS]"]
  776. self.sep_id = self.vocab["[SEP]"]
  777. self.in_tokens = in_tokens
  778. np.random.seed(random_seed)
  779. self.current_example = 0
  780. self.current_epoch = 0
  781. self.num_examples = 0
  782. if label_map_config:
  783. with open(label_map_config) as f:
  784. self.label_map = json.load(f)
  785. else:
  786. self.label_map = None
  787. def _read_tsv(self, input_file, quotechar=None):
  788. """Reads a tab separated value file."""
  789. with io.open(input_file, "r", encoding="utf8") as f:
  790. reader = csv_reader(f, delimiter="\t")
  791. headers = next(reader)
  792. Example = namedtuple('Example', headers)
  793. examples = []
  794. for line in reader:
  795. example = Example(*line)
  796. examples.append(example)
  797. return examples
  798. def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
  799. """Truncates a sequence pair in place to the maximum length."""
  800. # This is a simple heuristic which will always truncate the longer sequence
  801. # one token at a time. This makes more sense than truncating an equal percent
  802. # of tokens from each, since if one sequence is very short then each token
  803. # that's truncated likely contains more information than a longer sequence.
  804. while True:
  805. total_length = len(tokens_a) + len(tokens_b)
  806. if total_length <= max_length:
  807. break
  808. if len(tokens_a) > len(tokens_b):
  809. tokens_a.pop()
  810. else:
  811. tokens_b.pop()
  812. def _convert_example_to_record(self, example, max_seq_length, tokenizer):
  813. """Converts a single `Example` into a single `Record`."""
  814. text_a = convert_to_unicode(example.text_a)
  815. tokens_a = tokenizer.tokenize(text_a)
  816. tokens_b = None
  817. if "text_b" in example._fields:
  818. text_b = convert_to_unicode(example.text_b)
  819. tokens_b = tokenizer.tokenize(text_b)
  820. if tokens_b:
  821. # Modifies `tokens_a` and `tokens_b` in place so that the total
  822. # length is less than the specified length.
  823. # Account for [CLS], [SEP], [SEP] with "- 3"
  824. self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
  825. else:
  826. # Account for [CLS] and [SEP] with "- 2"
  827. if len(tokens_a) > max_seq_length - 2:
  828. tokens_a = tokens_a[0:(max_seq_length - 2)]
  829. # The convention in BERT/ERNIE is:
  830. # (a) For sequence pairs:
  831. # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
  832. # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
  833. # (b) For single sequences:
  834. # tokens: [CLS] the dog is hairy . [SEP]
  835. # type_ids: 0 0 0 0 0 0 0
  836. #
  837. # Where "type_ids" are used to indicate whether this is the first
  838. # sequence or the second sequence. The embedding vectors for `type=0` and
  839. # `type=1` were learned during pre-training and are added to the wordpiece
  840. # embedding vector (and position vector). This is not *strictly* necessary
  841. # since the [SEP] token unambiguously separates the sequences, but it makes
  842. # it easier for the model to learn the concept of sequences.
  843. #
  844. # For classification tasks, the first vector (corresponding to [CLS]) is
  845. # used as as the "sentence vector". Note that this only makes sense because
  846. # the entire model is fine-tuned.
  847. tokens = []
  848. text_type_ids = []
  849. tokens.append("[CLS]")
  850. text_type_ids.append(0)
  851. for token in tokens_a:
  852. tokens.append(token)
  853. text_type_ids.append(0)
  854. tokens.append("[SEP]")
  855. text_type_ids.append(0)
  856. if tokens_b:
  857. for token in tokens_b:
  858. tokens.append(token)
  859. text_type_ids.append(1)
  860. tokens.append("[SEP]")
  861. text_type_ids.append(1)
  862. token_ids = tokenizer.convert_tokens_to_ids(tokens)
  863. position_ids = list(range(len(token_ids)))
  864. if self.label_map:
  865. label_id = self.label_map[example.label]
  866. else:
  867. label_id = example.label
  868. Record = namedtuple(
  869. 'Record',
  870. ['token_ids', 'text_type_ids', 'position_ids', 'label_id', 'qid'])
  871. qid = None
  872. if "qid" in example._fields:
  873. qid = example.qid
  874. record = Record(
  875. token_ids=token_ids,
  876. text_type_ids=text_type_ids,
  877. position_ids=position_ids,
  878. label_id=label_id,
  879. qid=qid)
  880. return record
  881. def _prepare_batch_data(self, examples, batch_size, phase=None):
  882. """generate batch records"""
  883. batch_records, max_len = [], 0
  884. for index, example in enumerate(examples):
  885. if phase == "train":
  886. self.current_example = index
  887. record = self._convert_example_to_record(example, self.max_seq_len,
  888. self.tokenizer)
  889. max_len = max(max_len, len(record.token_ids))
  890. if self.in_tokens:
  891. to_append = (len(batch_records) + 1) * max_len <= batch_size
  892. else:
  893. to_append = len(batch_records) < batch_size
  894. if to_append:
  895. batch_records.append(record)
  896. else:
  897. yield self._pad_batch_records(batch_records)
  898. batch_records, max_len = [record], len(record.token_ids)
  899. if batch_records:
  900. yield self._pad_batch_records(batch_records)
  901. def get_num_examples(self, input_file):
  902. """return total number of examples"""
  903. examples = self._read_tsv(input_file)
  904. return len(examples)
  905. def get_examples(self, input_file):
  906. examples = self._read_tsv(input_file)
  907. return examples
  908. def data_generator(self,
  909. input_file,
  910. batch_size,
  911. epoch,
  912. shuffle=True,
  913. phase=None):
  914. """return generator which yields batch data for pyreader"""
  915. examples = self._read_tsv(input_file)
  916. def _wrapper():
  917. for epoch_index in range(epoch):
  918. if phase == "train":
  919. self.current_example = 0
  920. self.current_epoch = epoch_index
  921. if shuffle:
  922. np.random.shuffle(examples)
  923. for batch_data in self._prepare_batch_data(
  924. examples, batch_size, phase=phase):
  925. yield batch_data
  926. return _wrapper
  927. # In[16]:
  928. class ClassifyReader(BaseReader):
  929. """ClassifyReader"""
  930. def _read_tsv(self, input_file, quotechar=None):
  931. """Reads a tab separated value file."""
  932. with io.open(input_file, "r", encoding="utf8") as f:
  933. reader = csv_reader(f, delimiter="\t")
  934. headers = next(reader)
  935. text_indices = [
  936. index for index, h in enumerate(headers) if h != "label"
  937. ]
  938. Example = namedtuple('Example', headers)
  939. examples = []
  940. for line in reader:
  941. for index, text in enumerate(line):
  942. if index in text_indices:
  943. line[index] = text.replace(' ', '')
  944. example = Example(*line)
  945. examples.append(example)
  946. return examples
  947. def _pad_batch_records(self, batch_records):
  948. batch_token_ids = [record.token_ids for record in batch_records]
  949. batch_text_type_ids = [record.text_type_ids for record in batch_records]
  950. batch_position_ids = [record.position_ids for record in batch_records]
  951. batch_labels = [record.label_id for record in batch_records]
  952. batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1])
  953. # padding
  954. padded_token_ids, input_mask, seq_lens = pad_batch_data(
  955. batch_token_ids,
  956. pad_idx=self.pad_id,
  957. return_input_mask=True,
  958. return_seq_lens=True)
  959. padded_text_type_ids = pad_batch_data(
  960. batch_text_type_ids, pad_idx=self.pad_id)
  961. padded_position_ids = pad_batch_data(
  962. batch_position_ids, pad_idx=self.pad_id)
  963. return_list = [
  964. padded_token_ids, padded_text_type_ids, padded_position_ids,
  965. input_mask, batch_labels, seq_lens
  966. ]
  967. return return_list
  968. # In[17]:
  969. def pad_batch_data(insts,
  970. pad_idx=0,
  971. return_pos=False,
  972. return_input_mask=False,
  973. return_max_len=False,
  974. return_num_token=False,
  975. return_seq_lens=False):
  976. """
  977. Pad the instances to the max sequence length in batch, and generate the
  978. corresponding position data and input mask.
  979. """
  980. return_list = []
  981. max_len = max(len(inst) for inst in insts)
  982. # Any token included in dict can be used to pad, since the paddings' loss
  983. # will be masked out by weights and make no effect on parameter gradients.
  984. inst_data = np.array(
  985. [inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
  986. return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
  987. # position data
  988. if return_pos:
  989. inst_pos = np.array([
  990. list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
  991. for inst in insts
  992. ])
  993. return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
  994. if return_input_mask:
  995. # This is used to avoid attention on paddings.
  996. input_mask_data = np.array([[1] * len(inst) + [0] *
  997. (max_len - len(inst)) for inst in insts])
  998. input_mask_data = np.expand_dims(input_mask_data, axis=-1)
  999. return_list += [input_mask_data.astype("float32")]
  1000. if return_max_len:
  1001. return_list += [max_len]
  1002. if return_num_token:
  1003. num_token = 0
  1004. for inst in insts:
  1005. num_token += len(inst)
  1006. return_list += [num_token]
  1007. if return_seq_lens:
  1008. seq_lens = np.array([len(inst) for inst in insts])
  1009. return_list += [seq_lens.astype("int64").reshape([-1])]
  1010. return return_list if len(return_list) > 1 else return_list[0]
  1011. # In[18]:
  1012. def ernie_pyreader(args, pyreader_name):
  1013. """define standard ernie pyreader"""
  1014. pyreader_name += '_' + ''.join(random.sample(string.ascii_letters + string.digits, 6))
  1015. pyreader = fluid.layers.py_reader(
  1016. capacity=50,
  1017. shapes=[[-1, args['max_seq_len'], 1], [-1, args['max_seq_len'], 1],
  1018. [-1, args['max_seq_len'], 1], [-1, args['max_seq_len'], 1], [-1, 1],
  1019. [-1]],
  1020. dtypes=['int64', 'int64', 'int64', 'float32', 'int64', 'int64'],
  1021. lod_levels=[0, 0, 0, 0, 0, 0],
  1022. name=pyreader_name,
  1023. use_double_buffer=True)
  1024. (src_ids, sent_ids, pos_ids, input_mask, labels,
  1025. seq_lens) = fluid.layers.read_file(pyreader)
  1026. ernie_inputs = {
  1027. "src_ids": src_ids,
  1028. "sent_ids": sent_ids,
  1029. "pos_ids": pos_ids,
  1030. "input_mask": input_mask,
  1031. "seq_lens": seq_lens
  1032. }
  1033. return pyreader, ernie_inputs, labels
  1034. # #### 通用参数介绍
  1035. # 1. 数据集相关配置
  1036. # ```
  1037. # data_config = {
  1038. # 'data_dir': 'data/data9740/data',
  1039. # 'vocab_path': 'data/data9740/data/vocab.txt',
  1040. # 'batch_size': 32,
  1041. # 'random_seed': 0,
  1042. # 'num_labels': 3,
  1043. # 'max_seq_len': 512,
  1044. # 'train_set': 'data/data9740/data/test.tsv',
  1045. # 'test_set': 'data/data9740/data/test.tsv',
  1046. # 'dev_set': 'data/data9740/data/dev.tsv',
  1047. # 'infer_set': 'data/data9740/data/infer.tsv',
  1048. # 'label_map_config': None,
  1049. # 'do_lower_case': True,
  1050. # }
  1051. # ```
  1052. # 参数介绍:
  1053. # * **data_dir**:数据集路径,默认 'data/data9740/data'
  1054. # * **vocab_path**:vocab.txt所在路径,默认 'data/data9740/data/vocab.txt'
  1055. # * **batch_size**:训练和验证的批处理大小,默认:32
  1056. # * **random_seed**:随机种子,默认 0
  1057. # * **num_labels**:类别数,默认 3
  1058. # * **max_seq_len**:句子中最长词数,默认 512
  1059. # * **train_set**:训练集路径,默认 'data/data9740/data/test.tsv'
  1060. # * **test_set**: 测试集路径,默认 'data/data9740/data/test.tsv'
  1061. # * **dev_set**: 验证集路径,默认 'data/data9740/data/dev.tsv'
  1062. # * **infer_set**:预测集路径,默认 'data/data9740/data/infer.tsv'
  1063. # * **label_map_config**:label_map路径,默认 None
  1064. # * **do_lower_case**:是否对输入进行额外的小写处理,默认 True
  1065. # <br><br>
  1066. #
  1067. # 2. ERNIE 网络结构相关配置
  1068. # ```
  1069. # ernie_net_config = {
  1070. # "attention_probs_dropout_prob": 0.1,
  1071. # "hidden_act": "relu",
  1072. # "hidden_dropout_prob": 0.1,
  1073. # "hidden_size": 768,
  1074. # "initializer_range": 0.02,
  1075. # "max_position_embeddings": 513,
  1076. # "num_attention_heads": 12,
  1077. # "num_hidden_layers": 12,
  1078. # "type_vocab_size": 2,
  1079. # "vocab_size": 18000,
  1080. # }
  1081. # ```
  1082. # 参数介绍:
  1083. # * **attention_probs_dropout_prob**:attention块dropout比例,默认 0.1
  1084. # * **hidden_act**:隐层激活函数,默认 'relu'
  1085. # * **hidden_dropout_prob**:隐层dropout比例,默认 0.1
  1086. # * **hidden_size**:隐层大小,默认 768
  1087. # * **initializer_range**:参数初始化缩放范围,默认 0.02
  1088. # * **max_position_embeddings**:position序列最大长度,默认 513
  1089. # * **num_attention_heads**:attention块头部数量,默认 12
  1090. # * **num_hidden_layers**:隐层数,默认 12
  1091. # * **type_vocab_size**:sentence类别数,默认 2
  1092. # * **vocab_size**:字典长度,默认 18000
  1093. # In[19]:
  1094. # 数据集相关配置
  1095. data_config = {
  1096. 'data_dir': 'data/data9740/data', # Directory path to training data.
  1097. 'vocab_path': 'pretrained_model/ernie_finetune/vocab.txt', # Vocabulary path.
  1098. 'batch_size': 32, # Total examples' number in batch for training.
  1099. 'random_seed': 0, # Random seed.
  1100. 'num_labels': 3, # label number
  1101. 'max_seq_len': 512, # Number of words of the longest seqence.
  1102. 'train_set': 'data/data9740/data/test.tsv', # Path to training data.
  1103. 'test_set': 'data/data9740/data/test.tsv', # Path to test data.
  1104. 'dev_set': 'data/data9740/data/dev.tsv', # Path to validation data.
  1105. 'infer_set': 'data/data9740/data/infer.tsv', # Path to infer data.
  1106. 'label_map_config': None, # label_map_path
  1107. 'do_lower_case': True, # Whether to lower case the input text. Should be True for uncased models and False for cased models.
  1108. }
  1109. # Ernie 网络结构相关配置
  1110. ernie_net_config = {
  1111. "attention_probs_dropout_prob": 0.1,
  1112. "hidden_act": "relu",
  1113. "hidden_dropout_prob": 0.1,
  1114. "hidden_size": 768,
  1115. "initializer_range": 0.02,
  1116. "max_position_embeddings": 513,
  1117. "num_attention_heads": 12,
  1118. "num_hidden_layers": 12,
  1119. "type_vocab_size": 2,
  1120. "vocab_size": 18000,
  1121. }
  1122. # ### 四、模型训练
  1123. # 用户可基于百度开源的对话情绪识别模型在自有数据上实现 Finetune 训练,以期获得更好的效果提升,百度提供 TextCNN、ERNIE 两种预训练模型,具体模型 Finetune 方法如下所示:
  1124. # 1. 下载预训练模型
  1125. # 2. 修改参数
  1126. # * 'init_checkpoint':'pretrained_model/ernie_finetune/params'
  1127. # 3. 执行 “ERNIE 训练代码”
  1128. # <br><br>
  1129. #
  1130. # #### 训练阶段相关配置
  1131. # ```
  1132. # train_config = {
  1133. # 'init_checkpoint': 'pretrained_model/ernie_finetune/params',
  1134. # 'output_dir': 'train_model',
  1135. #
  1136. # 'epoch': 10,
  1137. # 'save_steps': 100,
  1138. # 'validation_steps': 100,
  1139. # 'lr': 0.00002,
  1140. #
  1141. # 'skip_steps': 10,
  1142. # 'verbose': False,
  1143. #
  1144. # 'use_cuda': True,
  1145. # }
  1146. # ```
  1147. # 参数介绍:
  1148. # * **init_checkpoint**:是否使用预训练模型,默认:'pretrained_model/ernie_finetune/params'
  1149. # * **output_dir**:模型缓存路径,默认 'train_model'
  1150. # * **epoch**:训练轮数,默认 10
  1151. # * **save_steps**:模型缓存间隔,默认 100
  1152. # * **validation_steps**:验证间隔,默认 100
  1153. # * **lr**:学习率,默认0.00002
  1154. # * **skip_steps**:日志输出间隔,默认 10
  1155. # * **verbose**:是否输出详细日志,默认 False
  1156. # * **use_cuda**:是否使用 GPU,默认 True
  1157. # In[20]:
  1158. # 下载预训练模型
  1159. get_ipython().system('mkdir pretrained_model')
  1160. # 下载并解压 ERNIE 预训练模型
  1161. get_ipython().system('cd pretrained_model && wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/emotion_detection_ernie_finetune-1.0.0.tar.gz')
  1162. get_ipython().system('cd pretrained_model && tar xzf emotion_detection_ernie_finetune-1.0.0.tar.gz')
  1163. # In[21]:
  1164. # ERNIE 训练代码
  1165. train_config = {
  1166. 'init_checkpoint': 'pretrained_model/ernie_finetune/params', # Init checkpoint to resume training from.
  1167. # 'init_checkpoint': 'None',
  1168. 'output_dir': 'train_model', # Directory path to save checkpoints
  1169. 'epoch': 5, # Number of epoches for training.
  1170. 'save_steps': 100, # The steps interval to save checkpoints.
  1171. 'validation_steps': 100, # The steps interval to evaluate model performance.
  1172. 'lr': 0.00002, # The Learning rate value for training.
  1173. 'skip_steps': 10, # The steps interval to print loss.
  1174. 'verbose': False, # Whether to output verbose log
  1175. 'use_cuda':True, # If set, use GPU for training.
  1176. }
  1177. train_config.update(data_config)
  1178. def evaluate(exe, test_program, test_pyreader, fetch_list, eval_phase):
  1179. """
  1180. Evaluation Function
  1181. """
  1182. test_pyreader.start()
  1183. total_cost, total_acc, total_num_seqs = [], [], []
  1184. time_begin = time.time()
  1185. while True:
  1186. try:
  1187. # 执行一步验证
  1188. np_loss, np_acc, np_num_seqs = exe.run(program=test_program,
  1189. fetch_list=fetch_list,
  1190. return_numpy=False)
  1191. np_loss = np.array(np_loss)
  1192. np_acc = np.array(np_acc)
  1193. np_num_seqs = np.array(np_num_seqs)
  1194. total_cost.extend(np_loss * np_num_seqs)
  1195. total_acc.extend(np_acc * np_num_seqs)
  1196. total_num_seqs.extend(np_num_seqs)
  1197. except fluid.core.EOFException:
  1198. test_pyreader.reset()
  1199. break
  1200. time_end = time.time()
  1201. logger.info("[%s evaluation] avg loss: %f, avg acc: %f, elapsed time: %f s" %
  1202. (eval_phase, np.sum(total_cost) / np.sum(total_num_seqs),
  1203. np.sum(total_acc) / np.sum(total_num_seqs), time_end - time_begin))
  1204. def main(config):
  1205. """
  1206. Main Function
  1207. """
  1208. # 定义 executor
  1209. if config['use_cuda']:
  1210. place = fluid.CUDAPlace(0)
  1211. dev_count = fluid.core.get_cuda_device_count()
  1212. else:
  1213. place = fluid.CPUPlace()
  1214. dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
  1215. exe = fluid.Executor(place)
  1216. # 定义数据 reader
  1217. reader = ClassifyReader(
  1218. vocab_path=config['vocab_path'],
  1219. label_map_config=config['label_map_config'],
  1220. max_seq_len=config['max_seq_len'],
  1221. do_lower_case=config['do_lower_case'],
  1222. random_seed=config['random_seed'])
  1223. startup_prog = fluid.Program()
  1224. if config['random_seed'] is not None:
  1225. startup_prog.random_seed = config['random_seed']
  1226. # 训练阶段初始化
  1227. train_data_generator = reader.data_generator(
  1228. input_file=config['train_set'],
  1229. batch_size=config['batch_size'],
  1230. epoch=config['epoch'],
  1231. shuffle=True,
  1232. phase="train")
  1233. num_train_examples = reader.get_num_examples(config['train_set'])
  1234. # 通过训练集大小 * 训练轮数得出总训练步数
  1235. max_train_steps = config['epoch'] * num_train_examples // config['batch_size'] // dev_count + 1
  1236. logger.info("Device count: %d" % dev_count)
  1237. logger.info("Num train examples: %d" % num_train_examples)
  1238. logger.info("Max train steps: %d" % max_train_steps)
  1239. train_program = fluid.Program()
  1240. with fluid.program_guard(train_program, startup_prog):
  1241. with fluid.unique_name.guard():
  1242. # create ernie_pyreader
  1243. train_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='train_reader')
  1244. embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)
  1245. # user defined model based on ernie embeddings
  1246. loss, accuracy, num_seqs = create_ernie_model(config, embeddings, labels=labels, is_prediction=False)
  1247. """
  1248. sgd_optimizer = fluid.optimizer.Adagrad(learning_rate=config['lr'])
  1249. sgd_optimizer.minimize(loss)
  1250. """
  1251. optimizer = fluid.optimizer.Adam(learning_rate=config['lr'])
  1252. optimizer.minimize(loss)
  1253. if config['verbose']:
  1254. lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
  1255. program=train_program, batch_size=config['batch_size'])
  1256. logger.info("Theoretical memory usage in training: %.3f - %.3f %s" %
  1257. (lower_mem, upper_mem, unit))
  1258. # 验证阶段初始化
  1259. test_prog = fluid.Program()
  1260. with fluid.program_guard(test_prog, startup_prog):
  1261. with fluid.unique_name.guard():
  1262. # create ernie_pyreader
  1263. test_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='eval_reader')
  1264. embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)
  1265. # user defined model based on ernie embeddings
  1266. loss, accuracy, num_seqs = create_ernie_model(config, embeddings, labels=labels, is_prediction=False)
  1267. test_prog = test_prog.clone(for_test=True)
  1268. exe.run(startup_prog)
  1269. # 加载预训练模型
  1270. # if config['init_checkpoint']:
  1271. # init_checkpoint(exe, config['init_checkpoint'], main_program=train_program)
  1272. # 模型训练代码
  1273. if not os.path.exists(config['output_dir']):
  1274. os.mkdir(config['output_dir'])
  1275. logger.info('Start training')
  1276. train_pyreader.decorate_tensor_provider(train_data_generator)
  1277. train_pyreader.start()
  1278. steps = 0
  1279. total_cost, total_acc, total_num_seqs = [], [], []
  1280. time_begin = time.time()
  1281. while True:
  1282. try:
  1283. steps += 1
  1284. if steps % config['skip_steps'] == 0:
  1285. fetch_list = [loss.name, accuracy.name, num_seqs.name]
  1286. else:
  1287. fetch_list = []
  1288. # 执行一步训练
  1289. outputs = exe.run(program=train_program, fetch_list=fetch_list, return_numpy=False)
  1290. if steps % config['skip_steps'] == 0:
  1291. # 打印日志
  1292. np_loss, np_acc, np_num_seqs = outputs
  1293. np_loss = np.array(np_loss)
  1294. np_acc = np.array(np_acc)
  1295. np_num_seqs = np.array(np_num_seqs)
  1296. total_cost.extend(np_loss * np_num_seqs)
  1297. total_acc.extend(np_acc * np_num_seqs)
  1298. total_num_seqs.extend(np_num_seqs)
  1299. if config['verbose']:
  1300. verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size()
  1301. logger.info(verbose)
  1302. time_end = time.time()
  1303. used_time = time_end - time_begin
  1304. logger.info("step: %d, avg loss: %f, "
  1305. "avg acc: %f, speed: %f steps/s" %
  1306. (steps, np.sum(total_cost) / np.sum(total_num_seqs),
  1307. np.sum(total_acc) / np.sum(total_num_seqs),
  1308. config['skip_steps'] / used_time))
  1309. total_cost, total_acc, total_num_seqs = [], [], []
  1310. time_begin = time.time()
  1311. if steps % config['save_steps'] == 0:
  1312. # 缓存模型
  1313. # fluid.io.save_persistables(exe, config['output_dir'], train_program)
  1314. fluid.save(train_program, os.path.join(config['output_dir'], "checkpoint"))
  1315. if steps % config['validation_steps'] == 0:
  1316. # 在验证集上执行验证
  1317. test_pyreader.decorate_tensor_provider(
  1318. reader.data_generator(
  1319. input_file=config['dev_set'],
  1320. batch_size=config['batch_size'],
  1321. phase='dev',
  1322. epoch=1,
  1323. shuffle=False))
  1324. evaluate(exe, test_prog, test_pyreader,
  1325. [loss.name, accuracy.name, num_seqs.name],
  1326. "dev")
  1327. except fluid.core.EOFException:
  1328. # 训练结束
  1329. # fluid.io.save_persistables(exe, config['output_dir'], train_program)
  1330. fluid.save(train_program, os.path.join(config['output_dir'], "checkpoint"))
  1331. train_pyreader.reset()
  1332. logger.info('Training end.')
  1333. break
  1334. # 模型验证代码
  1335. test_pyreader.decorate_tensor_provider(
  1336. reader.data_generator(
  1337. input_file=config['test_set'],
  1338. batch_size=config['batch_size'], phase='test', epoch=1,
  1339. shuffle=False))
  1340. logger.info("Final validation result:")
  1341. evaluate(exe, test_prog, test_pyreader,
  1342. [loss.name, accuracy.name, num_seqs.name], "test")
  1343. if __name__ == "__main__":
  1344. init_log_config()
  1345. print_arguments(train_config)
  1346. main(train_config)
  1347. # ### 五、模型预测
  1348. #
  1349. #
  1350. # 预测阶段加载保存的模型,对预测集进行预测,通过修改如下参数实现
  1351. # <br><br>
  1352. #
  1353. # #### 预测阶段相关配置
  1354. # ```
  1355. # infer_config = {
  1356. # 'init_checkpoint': 'train_model',
  1357. # 'use_cuda': True,
  1358. # }
  1359. # ```
  1360. # 参数介绍:
  1361. # * **init_checkpoint**:加载预训练模型,默认:'train_model'
  1362. # * **use_cuda**:是否使用 GPU,默认 True
  1363. # In[22]:
  1364. # ERNIE 预测代码
  1365. infer_config = {
  1366. 'init_checkpoint': 'train_model', # Init checkpoint to resume training from.
  1367. 'use_cuda': True, # If set, use GPU for training.
  1368. }
  1369. infer_config.update(data_config)
  1370. def init_checkpoint_infer(exe, init_checkpoint_path, main_program):
  1371. """
  1372. 加载缓存模型
  1373. """
  1374. assert os.path.exists(
  1375. init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path
  1376. # fluid.io.load_vars(
  1377. # exe,
  1378. # init_checkpoint_path,
  1379. # main_program=main_program,
  1380. # predicate=existed_persitables)
  1381. fluid.load(main_program, os.path.join(init_checkpoint_path, "checkpoint"), exe)
  1382. logger.info("Load model from {}".format(init_checkpoint_path))
  1383. def infer(exe, infer_program, infer_pyreader, fetch_list, infer_phase, examples):
  1384. """Infer"""
  1385. infer_pyreader.start()
  1386. time_begin = time.time()
  1387. while True:
  1388. try:
  1389. # 进行一步预测
  1390. batch_probs = exe.run(program=infer_program, fetch_list=fetch_list,
  1391. return_numpy=True)
  1392. for i, probs in enumerate(batch_probs[0]):
  1393. logger.info("Probs: %f %f %f, prediction: %d, input: %s" % (probs[0], probs[1], probs[2], np.argmax(probs), examples[i]))
  1394. except fluid.core.EOFException:
  1395. infer_pyreader.reset()
  1396. break
  1397. time_end = time.time()
  1398. logger.info("[%s] elapsed time: %f s" % (infer_phase, time_end - time_begin))
  1399. def main(config):
  1400. """
  1401. Main Function
  1402. """
  1403. # 定义 executor
  1404. if config['use_cuda']:
  1405. place = fluid.CUDAPlace(0)
  1406. dev_count = fluid.core.get_cuda_device_count()
  1407. else:
  1408. place = fluid.CPUPlace()
  1409. dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
  1410. exe = fluid.Executor(place)
  1411. # 定义数据 reader
  1412. reader = ClassifyReader(
  1413. vocab_path=config['vocab_path'],
  1414. label_map_config=config['label_map_config'],
  1415. max_seq_len=config['max_seq_len'],
  1416. do_lower_case=config['do_lower_case'],
  1417. random_seed=config['random_seed'])
  1418. startup_prog = fluid.Program()
  1419. if config['random_seed'] is not None:
  1420. startup_prog.random_seed = config['random_seed']
  1421. # 预测阶段初始化
  1422. test_prog = fluid.Program()
  1423. with fluid.program_guard(test_prog, startup_prog):
  1424. with fluid.unique_name.guard():
  1425. infer_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='infer_reader')
  1426. embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)
  1427. probs = create_ernie_model(config, embeddings, labels=labels, is_prediction=True)
  1428. test_prog = test_prog.clone(for_test=True)
  1429. exe.run(startup_prog)
  1430. # 加载预训练模型
  1431. if not config['init_checkpoint']:
  1432. raise ValueError("args 'init_checkpoint' should be set if "
  1433. "only doing validation or infer!")
  1434. init_checkpoint_infer(exe, config['init_checkpoint'], main_program=test_prog)
  1435. # 模型预测代码
  1436. infer_pyreader.decorate_tensor_provider(
  1437. reader.data_generator(
  1438. input_file=config['infer_set'],
  1439. batch_size=config['batch_size'],
  1440. phase='infer',
  1441. epoch=1,
  1442. shuffle=False))
  1443. logger.info("Final test result:")
  1444. infer(exe, test_prog, infer_pyreader,
  1445. [probs.name], "infer", reader.get_examples(config['infer_set']))
  1446. if __name__ == "__main__":
  1447. init_log_config()
  1448. print_arguments(infer_config)
  1449. main(infer_config)
  1450. # ### 六、总结
  1451. #
  1452. # ERNIE 在对话情绪识别数据集上的实际运行结果如下:
  1453. #
  1454. #
  1455. # | 模型 | 准确率 |
  1456. # | -------- | -------- |
  1457. # | ERNIE pretrained | 0.944981
  1458. # | ERNIE finetuned | 0.999035
  1459. # <br>
  1460. #
  1461. # 本项目实现 ERNIE 1.0 版本,在对话情绪识别任务上表现良好,除此之外,ERNIE 还可以执行:
  1462. # * 自然语言推断任务 XNLI
  1463. # * 阅读理解任务 DRCD、DuReader、CMRC2018
  1464. # * 命名实体识别任务 MSRA-NER (SIGHAN2006)
  1465. # * 情感分析任务 ChnSentiCorp
  1466. # * 语义相似度任务 BQ Corpus、LCQMC
  1467. # * 问答任务 NLPCC2016-DBQA
  1468. #
  1469. # 读者也可以尝试移植 ERNIE 2.0 进行对比测试。

 六、PaddlePaddle结合爬虫实现女票微博情绪长期监控:

1、IMDB 数据集可在网上搜索获取,整理后灌入数据:

  1. from __future__ import print_function
  2. import sys
  3. import os
  4. import json
  5. import nltk
  6. import paddle.v2 as paddle
  7. if __name__ == '__main__':
  8. # init
  9. paddle.init(use_gpu=False)
  10. print('load dictionary...')
  11. word_dict = paddle.dataset.imdb.word_dict()
  12. print(word_dict)

2、数据读取及网络定义准备:

  1. def reader_creator(pos_pattern, neg_pattern, word_idx, buffer_size):
  2. # '<unk>' is the unknown-word token in the dictionary
  3. UNK = word_idx['<unk>']
  4. # start two queues and load pos/neg samples with two threads
  5. qs = [Queue.Queue(maxsize=buffer_size), Queue.Queue(maxsize=buffer_size)]
  6. def load(pattern, queue):
  7. for doc in tokenize(pattern):
  8. queue.put(doc)
  9. queue.put(None)
  10. def reader():
  11. # Creates two threads that loads positive and negative samples
  12. # into qs.
  13. t0 = threading.Thread(
  14. target=load, args=(
  15. pos_pattern,
  16. qs[0], ))
  17. t0.daemon = True
  18. t0.start()
  19. t1 = threading.Thread(
  20. target=load, args=(
  21. neg_pattern,
  22. qs[1], ))
  23. t1.daemon = True
  24. t1.start()
  25. # Read alternatively from qs[0] and qs[1].
  26. i = 0
  27. doc = qs[i].get()
  28. while doc != None:
  29. yield [word_idx.get(w, UNK) for w in doc], i % 2
  30. i += 1
  31. doc = qs[i % 2].get()
  32. # If any queue is empty, reads from the other queue.
  33. i += 1
  34. doc = qs[i % 2].get()
  35. while doc != None:
  36. yield [word_idx.get(w, UNK) for w in doc], i % 2
  37. doc = qs[i % 2].get()
  38. return reader()

这个方法其实已经内置在 Paddle 中(paddle.dataset.imdb),我们不需要自己写,但为了便于理解,我把它单独拿出来讲解一下。这个函数执行的操作非常简单:根据上面得到的 word_dict,把文本中的每一个句子转换成一维的数字(id)向量。由于 IMDB 的正、负样本是交替读取的,所以会有一个 %2 的操作来给出 0/1 标签。
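
下面给出一个极简的示意(假设 word_dict 已按上文加载,词典中的未登录词标记为 '<unk>';例句与标签均为虚构示例),演示"句子 → id 序列 + 0/1 标签"这一步的效果:

  1. # 极简示意:把分词后的句子映射为 id 序列(假设 word_dict 已按上文加载)
  2. def sentence_to_ids(sentence, word_dict):
  3.     UNK = word_dict['<unk>']
  4.     return [word_dict.get(w, UNK) for w in sentence.lower().split()]
  5. docs = ['read the book forget the movie', 'this is a great movie']
  6. for i, doc in enumerate(docs):
  7.     # 与 reader 中一致:交替给出 0(正样本队列)/ 1(负样本队列)标签
  8.     print(sentence_to_ids(doc, word_dict), i % 2)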

好了,接下来是重点:我们要用 PaddlePaddle 构建模型了。前面提到了 embedding 层,我们在 embedding 之后接一个 CNN,构建一个简单的文本卷积分类网络:

  1. def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
  2. # we are starting with a embed layer
  3. data = paddle.layer.data("word",
  4. paddle.data_type.integer_value_sequence(input_dim))
  5. emb = paddle.layer.embedding(input=data, size=emb_dim)
  6. # this convolution is a sequence convolution
  7. conv_3 = paddle.networks.sequence_conv_pool(
  8. input=emb, context_len=3, hidden_size=hid_dim)
  9. conv_4 = paddle.networks.sequence_conv_pool(
  10. input=emb, context_len=4, hidden_size=hid_dim)
  11. output = paddle.layer.fc(
  12. input=[conv_3, conv_4], size=class_dim, act=paddle.activation.Softmax())
  13. lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
  14. cost = paddle.layer.classification_cost(input=output, label=lbl)
  15. return cost, output

这里面有一个词嵌入操作,紧接着是两个卷积层。注意这里的卷积层并非图片卷积,而是文本序列卷积(sequence_conv_pool),这是 PaddlePaddle 中比较有特色的一个层,百度在文本序列和语音序列处理上还是有一套。等一下大家会看到,这么一个简单的模型仅仅 6 个 epoch 就能达到 99.99% 的精确度。embedding 的维度是 128,隐藏层神经元个数是 128。
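
为帮助理解"文本序列卷积",下面给出一个与 Paddle 内部实现无关的 numpy 小示意(假设词向量维度 emb_dim=4、窗口 context_len=3、隐藏维度 hid_dim=6,数值均为随机生成):对相邻 3 个词向量拼接后做一次线性映射,再在时间维上做最大池化,得到整句的定长表示,这大致就是 sequence_conv_pool 做的事情。

  1. # 示意:n-gram 式的序列卷积 + 最大池化(仅帮助理解,并非 Paddle 实现)
  2. import numpy as np
  3. emb_dim, hid_dim, context_len = 4, 6, 3
  4. words = np.random.rand(10, emb_dim)                 # 一句话的 10 个词向量
  5. W = np.random.rand(context_len * emb_dim, hid_dim)  # 卷积核(线性映射)
  6. feats = [np.concatenate(words[i:i + context_len]).dot(W)
  7.          for i in range(len(words) - context_len + 1)]
  8. sent_vec = np.max(np.stack(feats), axis=0)          # 时间维最大池化 -> 定长句向量
  9. print(sent_vec.shape)                               # (6,)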

3、训练:

  1. from __future__ import print_function
  2. import sys
  3. import os
  4. import json
  5. import nltk
  6. import paddle.v2 as paddle
  7. # 与上文相同的文本卷积网络定义
  8. def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
  9. data = paddle.layer.data("word",
  10. paddle.data_type.integer_value_sequence(input_dim))
  11. emb = paddle.layer.embedding(input=data, size=emb_dim)
  12. conv_3 = paddle.networks.sequence_conv_pool(
  13. input=emb, context_len=3, hidden_size=hid_dim)
  14. conv_4 = paddle.networks.sequence_conv_pool(
  15. input=emb, context_len=4, hidden_size=hid_dim)
  16. output = paddle.layer.fc(
  17. input=[conv_3, conv_4], size=class_dim, act=paddle.activation.Softmax())
  18. lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
  19. cost = paddle.layer.classification_cost(input=output, label=lbl)
  20. return cost, output
  21. if __name__ == '__main__':
  22. # init
  23. paddle.init(use_gpu=False)
  24. # those lines are get the code
  25. print('load dictionary...')
  26. word_dict = paddle.dataset.imdb.word_dict()
  27. print(word_dict)
  28. dict_dim = len(word_dict)
  29. class_dim = 2
  30. train_reader = paddle.batch(
  31. paddle.reader.shuffle(
  32. lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
  33. batch_size=100)
  34. test_reader = paddle.batch(
  35. lambda: paddle.dataset.imdb.test(word_dict),
  36. batch_size=100)
  37. feeding = {'word': 0, 'label': 1}
  38. # get the output of the model
  39. [cost, output] = convolution_net(dict_dim, class_dim=class_dim)
  40. parameters = paddle.parameters.create(cost)
  41. adam_optimizer = paddle.optimizer.Adam(
  42. learning_rate=2e-3,
  43. regularization=paddle.optimizer.L2Regularization(rate=8e-4),
  44. model_average=paddle.optimizer.ModelAverage(average_window=0.5))
  45. trainer = paddle.trainer.SGD(
  46. cost=cost, parameters=parameters, update_equation=adam_optimizer)
  47. def event_handler(event):
  48. if isinstance(event, paddle.event.EndIteration):
  49. if event.batch_id % 100 == 0:
  50. print("\nPass %d, Batch %d, Cost %f, %s" % (
  51. event.pass_id, event.batch_id, event.cost, event.metrics))
  52. else:
  53. sys.stdout.write('.')
  54. sys.stdout.flush()
  55. if isinstance(event, paddle.event.EndPass):
  56. with open('./params_pass_%d.tar' % event.pass_id, 'w') as f:
  57. trainer.save_parameter_to_tar(f)
  58. result = trainer.test(reader=test_reader, feeding=feeding)
  59. print("\nTest with Pass %d, %s" % (event.pass_id, result.metrics))
  60. inference_topology = paddle.topology.Topology(layers=output)
  61. with open("./inference_topology.pkl", 'wb') as f:
  62. inference_topology.serialize_for_inference(f)
  63. trainer.train(
  64. reader=train_reader,
  65. event_handler=event_handler,
  66. feeding=feeding,
  67. num_passes=20)

训练开始前目录下只有一个 main.py,也就是我们的训练脚本;训练过程中,每个 pass 结束会保存一份参数文件 params_pass_*.tar,同时还会生成一个 inference_topology.pkl,这是网络拓扑序列化后的二进制文件。
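
这里顺带给出一个加载已保存参数的小示意(假设训练跑满了 20 个 pass,因此存在 params_pass_19.tar;基于 paddle.v2 的 Parameters.from_tar 接口)。下文的预测脚本为了演示方便直接用 paddle.parameters.create 新建了参数,如果想使用训练好的权重,可以改为像这样从 tar 文件加载:

  1. # 示意:加载训练阶段保存的参数(文件名以实际保存的 pass 为准)
  2. import paddle.v2 as paddle
  3. paddle.init(use_gpu=False)
  4. with open('./params_pass_19.tar', 'r') as f:
  5.     parameters = paddle.parameters.Parameters.from_tar(f)
  6. # 之后即可把该 parameters 传给 paddle.infer(...) 进行预测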

4、预测:

  1. from __future__ import print_function
  2. import numpy as np
  3. import sys
  4. import os
  5. import json
  6. import nltk
  7. import paddle.v2 as paddle
  8. # 注意:from __future__ 导入必须放在文件最前面,否则会报语法错误
  9. def convolution_net(input_dim,
  10. class_dim=2,
  11. emb_dim=128,
  12. hid_dim=128,
  13. is_predict=False):
  14. data = paddle.layer.data("word",
  15. paddle.data_type.integer_value_sequence(input_dim))
  16. emb = paddle.layer.embedding(input=data, size=emb_dim)
  17. conv_3 = paddle.networks.sequence_conv_pool(
  18. input=emb, context_len=3, hidden_size=hid_dim)
  19. conv_4 = paddle.networks.sequence_conv_pool(
  20. input=emb, context_len=4, hidden_size=hid_dim)
  21. output = paddle.layer.fc(input=[conv_3, conv_4],
  22. size=class_dim,
  23. act=paddle.activation.Softmax())
  24. if not is_predict:
  25. lbl = paddle.layer.data("label", paddle.data_type.integer_value(2))
  26. cost = paddle.layer.classification_cost(input=output, label=lbl)
  27. return cost
  28. else:
  29. return output
  30. if __name__ == '__main__':
  31. # Movie Reviews, from imdb test
  32. paddle.init(use_gpu=False)
  33. word_dict = paddle.dataset.imdb.word_dict()
  34. dict_dim = len(word_dict)
  35. class_dim = 2
  36. reviews = [
  37. 'Read the book, forget the movie!',
  38. 'This is a great movie.'
  39. ]
  40. print(reviews)
  41. reviews = [c.split() for c in reviews]
  42. UNK = word_dict['<unk>']  # 未登录词统一映射到 '<unk>'
  43. input = []
  44. for c in reviews:
  45. input.append([[word_dict.get(words, UNK) for words in c]])
  46. # 0 stands for positive sample, 1 stands for negative sample
  47. label = {0: 'pos', 1: 'neg'}
  48. # Use the network used by trainer
  49. out = convolution_net(dict_dim, class_dim=class_dim, is_predict=True)
  50. parameters = paddle.parameters.create(out)
  51. print(parameters)
  52. # out = stacked_lstm_net(dict_dim, class_dim=class_dim, stacked_num=3, is_predict=True)
  53. probs = paddle.infer(output_layer=out, parameters=parameters, input=input)
  54. print('probs:', probs)
  55. labs = np.argsort(-probs)
  56. print(labs)
  57. for idx, lab in enumerate(labs):
  58. print(idx, "predicting probability is", probs[idx], "label is", label[lab[0]])

reviews 的内容可以替换为由微博爬虫抓取到的文本(爬虫与情绪预测的串联示意见下文第 5 小节代码之后)。

5、微博爬虫程序爬取微博内容,作为上述review的信息:

此处感谢@cici_富贵的文章,同样深受启发。

首先打开手机端微博的登录网址 url = 'https://m.weibo.cn',登录自己的微博账号。

之后打开你的特别关注,搜索你喜欢的那个人,进入她的微博主页,然后按 F12 打开开发者工具,切换到 Network 面板,选择 XHR,按 F5 刷新一下就会出现 Ajax 响应:
接着继续下拉页面,找到加载新内容时返回数据的 Ajax 请求响应,并打开对其进行分析:


很容易在所选响应的 data->cards->mblog 中找到自己想要的内容,并发现其是以 JSON 形式返回的。那么我们就开始分析其请求头和所携带的参数列表,寻找其中的不同之处,以便构造自己的 Ajax 请求:


观察发现,刚进入微博主页刷新得到的参数列表中没有 since_id 这一项,而下拉加载后返回的响应参数列表中却出现了不同的 since_id。可以发现在请求响应过程中只有 since_id 在变化,其它参数都没有变(since_id 就是用来控制翻页加载的)。

搞清楚这些之后,我们就可以构造自己的请求头和参数,模拟 Ajax 请求从而爬取内容:

  1. from urllib.parse import urlencode
  2. from pyquery import PyQuery as py
  3. import requests
  4. def get_information(since_id='', uid=0):
  5. #X-Requested-With 用来标识Ajax请求,必须得有
  6. #Referer 用来指明请求来源 必须有
  7. #User-Agent 伪装浏览器,必须有
  8. headers = {'Referer': 'https://m.weibo.cn',
  9. 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
  10. 'X-Requested-With': 'XMLHttpRequest'}
  11. params = {
  12. 'uid': uid,
  13. 'luicode': 10000011,
  14. 'lfid': '231093_-_selffollowed',
  15. 'type': 'uid',  # 按用户 ID 查询
  16. 'value': uid,   # 使用传入的 uid,避免硬编码
  17. 'containerid': '107603' + str(uid),
  18. 'since_id': since_id
  19. }
  20. # urlencode() 方法将参数转化为URL的GET请求参数
  21. url = 'https://m.weibo.cn/api/container/getIndex?' + urlencode(params)
  22. response = requests.get(url,headers = headers)
  23. #获得返回的 json 内容,做分析
  24. json_text = response.json()
  25. #获得 since_id 为增加页面做准备
  26. since_id = json_text.get('data').get('cardlistInfo').get('since_id')
  27. return json_text,since_id
  28. def parse_json(json):
  29. items = json.get('data').get('cards')
  30. for item in items:
  31. item = item.get('mblog')
  32. weibo = {}
  33. weibo['发表时间'] = item.get('created_at')
  34. weibo['手机类型'] = item.get('source')
  35. weibo['内容'] = py(item.get('text')).text()
  36. weibo['图片链接'] = item.get('bmiddle_pic')
  37. weibo['点赞数'] = item.get('attitudes_count')
  38. weibo['评论数'] = item.get('comments_count')
  39. yield weibo
  40. if __name__ == '__main__':
  41. #uid 你所要爬取的微博的ID,在响应的参数列表中可以得到,图中可以找到
  42. uid = 5768045317
  43. #p 爬取的页数
  44. p = 3
  45. #获得返回的 JSON 内容 和 since_id
  46. s = get_information(since_id = '',uid = uid)
  47. #解析 JSON
  48. parse_json(s[0])
  49. #输出解析后的内容
  50. for i in parse_json(s[0]):
  51. print(i)
  52. '''
  53. #多页爬取
  54. for i in range(p):
  55. s = get_information(since_id = s[1],uid = uid)
  56. for i in parse_json(s[0]):
  57. print(i)
  58. '''
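
最后给出一个把爬虫结果接入前面情绪预测脚本的串联示意(假设 get_information、parse_json 以及预测脚本中的 word_dict、out、parameters 都已按上文准备好;uid 沿用上面的示例值)。需要说明的是,上面的模型是在英文 IMDB 数据上训练的,中文微博文本严格来说需要先分词并换用中文词典/模型,这里仅按空格切分做流程演示:

  1. # 串联示意:把爬到的微博内容作为 reviews 喂给上文的预测网络
  2. json_text, _ = get_information(since_id='', uid=5768045317)
  3. reviews = [w['内容'] for w in parse_json(json_text)]
  4. UNK = word_dict['<unk>']
  5. input = [[[word_dict.get(tok, UNK) for tok in text.split()]] for text in reviews]
  6. probs = paddle.infer(output_layer=out, parameters=parameters, input=input)
  7. label = {0: 'pos', 1: 'neg'}
  8. for text, p in zip(reviews, probs):
  9.     print(text, '->', label[int(np.argmax(p))])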

 


笔者首先需要澄清,本文的核心目的是学习自然语言处理预训练模型的调用和 Finetune 等方法,并利用一个有趣的现实问题进行实战。虽然笔者找到了这几种测试对话情绪的方式,也利用飞桨框架实现了对对话内容的预测,但笔者并不会把它们用在真实生活中。同样,作为工程师或程序员的你,也不要忘记用对待工作的那份认真和拼劲去对待自己心爱的女孩哦。
不要忘记那句歌词:“女孩儿的心思男孩你别猜,你猜来猜去也猜不明白。不知道她为什么掉眼泪,也不知她为什么笑开怀……”

 
