赞
踩
从THUCNews中抽取了20万条新闻标题,已上传至github,文本长度在20到30之间。一共10个类别,每类2万条。数据以字为单位输入模型。
类别:财经、房产、股票、教育、科技、社会、时政、体育、游戏、娱乐。
数据集 | 数据量 |
---|---|
训练集 | 18万 |
验证集 | 1万 |
测试集 | 1万 |
注意:更换自己的数据集:按照我数据集的格式来格式化你的中文数据集。
训练集有18w条,测试集和验证集均为1w条,每条样本都是20-30个中文汉字。词表直接使用huggingface
上的中文词表即可。划分训练测试验证集用如下函数,返回train
列表有18w条样本数据,每条样本数据为一个tuple元组,比如train[0]
元组大小为4。
def build_dataset(config): def load_dataset(path, pad_size=32): contents = [] with open(path, 'r', encoding='UTF-8') as f: for line in tqdm(f): lin = line.strip() if not lin: continue content, label = lin.split('\t') token = config.tokenizer.tokenize(content) token = [CLS] + token seq_len = len(token) mask = [] token_ids = config.tokenizer.convert_tokens_to_ids(token) if pad_size: if len(token) < pad_size: mask = [1] * len(token_ids) + [0] * (pad_size - len(token)) token_ids += ([0] * (pad_size - len(token))) else: mask = [1] * pad_size token_ids = token_ids[:pad_size] seq_len = pad_size contents.append((token_ids, int(label), seq_len, mask)) return contents train = load_dataset(config.train_path, config.pad_size) dev = load_dataset(config.dev_path, config.pad_size) test = load_dataset(config.test_path, config.pad_size) return train, dev, test
为了模型后面的训练,先要做文本数据预处理:
bert模型放在 bert_pretain目录下,ERNIE模型放在ERNIE_pretrain目录下,每个目录下都是三个文件:
下载地址:
解压后,按照上面说的放在对应目录下,文件名称确认无误即可。
【使用说明】下载好预训练模型就可以跑了。
# 训练并测试:
# bert
python run.py --model bert
# bert + 其它
python run.py --model bert_CNN
# ERNIE
python run.py --model ERNIE
参数:模型都在models目录下,超参定义和模型定义在同一文件中。
大模型的痛点:
Ernie 2.0是百度在Ernie 1.0基础之上的第二代预训练结构。
模型结构与Ernie 1.0还有 Bert保持一致,12层的transformer的encoder层。
具体可以参考:一文读懂最强中文NLP预训练模型ERNIE
ERNIE:Enhanced Representation through Knowledge Integration
文心 ERNIE 开源版地址:https://github.com/PaddlePaddle/ERNIE
文心 ERNIE 官网地址:https://wenxin.baidu.com/
论文地址:https://arxiv.org/pdf/1904.09223v1.pdf
[mask]
,小部分(10%)单词被随机替换为其他单词,另外小部分(10%)单词保持不变。[cls]
预测标记和[sep]
分句标记后输入BERT模型,使用[cls]
预测标记收集到的全局表征进行二分类预测,并使用分类损失优化BERT模型。
[cls]我喜欢打篮球[sep]尤其是3v3篮球[sep]
。[cls]我喜欢打[mask][sep]尤其是3v3篮球[sep]
。class Model(nn.Module):
def __init__(self, config):
super(Model, self).__init__()
self.bert = BertModel.from_pretrained(config.bert_path)
for param in self.bert.parameters():
param.requires_grad = True
self.fc = nn.Linear(config.hidden_size, config.num_classes)
def forward(self, x):
context = x[0] # 输入的句子
mask = x[2] # 对padding部分进行mask,和句子一个size,padding部分用0表示,如:[1, 1, 1, 1, 0, 0]
_, pooled = self.bert(context, attention_mask=mask, output_all_encoded_layers=False)
out = self.fc(pooled)
return out
进入上面forward
函数的x
列表有3个元素,x[0]
是输入的句子,x[2]
是mask_tensor,即如果当前句子位置存在元素则为1,否则为0(这0为padding元素)。比如第一个句子有19个中文汉字,first_lst = x[2][0].cpu().numpy()
,通过sum([x for x in first_lst])
统计到确实是有19个1,剩下全为0。
通过torchinfo.summary
看model
对象结构如下:
================================================================================ Layer (type:depth-idx) Param # ================================================================================ Model -- ├─BertModel: 1-1 -- │ └─BertEmbeddings: 2-1 -- │ │ └─Embedding: 3-1 16,226,304 │ │ └─Embedding: 3-2 393,216 │ │ └─Embedding: 3-3 1,536 │ │ └─BertLayerNorm: 3-4 1,536 │ │ └─Dropout: 3-5 -- │ └─BertEncoder: 2-2 -- │ │ └─ModuleList: 3-6 85,054,464 │ └─BertPooler: 2-3 -- │ │ └─Linear: 3-7 590,592 │ │ └─Tanh: 3-8 -- ├─Linear: 1-2 7,690 ================================================================================ Total params: 102,275,338 Trainable params: 102,275,338 Non-trainable params: 0 ================================================================================
class Model(nn.Module):
def __init__(self, config):
super(Model, self).__init__()
self.bert = BertModel.from_pretrained(config.bert_path)
for param in self.bert.parameters():
param.requires_grad = True
self.fc = nn.Linear(config.hidden_size, config.num_classes)
def forward(self, x):
context = x[0] # 输入的句子
mask = x[2] # 对padding部分进行mask,和句子一个size,padding部分用0表示,如:[1, 1, 1, 1, 0, 0]
_, pooled = self.bert(context, attention_mask=mask, output_all_encoded_layers=False)
out = self.fc(pooled)
return out
对应的模型结构:
Model( (bert): BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(18000, 768, padding_idx=0) (position_embeddings): Embedding(513, 768) (token_type_embeddings): Embedding(2, 768) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( (layer): ModuleList( (0): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (1): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (2): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (3): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (4): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (5): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (6): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (7): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (8): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (9): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (10): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (11): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) (pooler): BertPooler( (dense): Linear(in_features=768, out_features=768, bias=True) (activation): Tanh() ) ) (fc): Linear(in_features=768, out_features=10, bias=True) )
跑Ernie的文本分类效果是比Bert效果好的:
No optimization for a long time, auto-stopping... Test Loss: 0.24, Test Acc: 92.54% Precision, Recall and F1-Score... precision recall f1-score support finance 0.9465 0.8840 0.9142 1000 realty 0.9171 0.9510 0.9337 1000 stocks 0.8738 0.8520 0.8628 1000 education 0.9658 0.9590 0.9624 1000 science 0.8395 0.9100 0.8733 1000 society 0.9348 0.9170 0.9258 1000 politics 0.9232 0.9010 0.9119 1000 sports 0.9649 0.9890 0.9768 1000 game 0.9630 0.9360 0.9493 1000 entertainment 0.9335 0.9550 0.9441 1000 accuracy 0.9254 10000 macro avg 0.9262 0.9254 0.9254 10000 weighted avg 0.9262 0.9254 0.9254 10000 Confusion Matrix... [[884 13 68 3 10 7 10 4 0 1] [ 9 951 3 0 8 5 5 4 2 13] [ 31 43 852 1 37 1 25 6 2 2] [ 1 2 1 959 10 16 3 0 0 8] [ 3 1 20 4 910 14 8 1 29 10] [ 2 16 0 10 14 917 22 2 1 16] [ 2 4 29 10 36 11 901 1 0 6] [ 0 2 0 1 2 1 1 989 0 4] [ 1 1 2 1 42 5 1 3 936 8] [ 1 4 0 4 15 4 0 15 2 955]] Time usage: 0:00:09
项目完整代码链接稍等放上。
[1] 百度搜索首届技术创新挑战赛:赛道一
[2] 超详细中文预训练模型ERNIE使用指南
[3] BERT与知识图谱的结合——ERNIE模型浅析
[4] 赛题baseline:使用RocketQA进行篇章检索判定是否存在答案 + 使用 ERNIE 3.0进行阅读理解答案抽取
搜索问答赛道,包含答案抽取和答案检验两大任务。
Coggle阿水:https://aistudio.baidu.com/aistudio/projectdetail/5013840
致Great:https://aistudio.baidu.com/aistudio/projectdetail/5043272
洪荒流1st:https://aistudio.baidu.com/aistudio/projectdetail/4960090
[5] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[6] ERNIE: Enhanced Representation through Knowledge Integration
[6] 百度文心大模型ERNIE:https://github.com/PaddlePaddle/ERNIE
[7] https://github.com/649453932/Bert-Chinese-Text-Classification-Pytorch
[8] 一文读懂最强中文NLP预训练模型ERNIE.飞桨
[9] 超详细的 Bert 文本分类源码解读 | 附源码
[10] pytorch的bert预训练模型名称及下载路径
[11] 记一次工业界文本分类任务的尝试(文本分类+序列标注+文本抽取)
[12] NLP学习之使用pytorch搭建textCNN模型进行中文文本分类
[13] Pytorch实现中文文本分类任务(Bert,ERNIE,TextCNN,TextRNN,FastText,TextRCNN,BiLSTM_Attention, DPCNN, Transformer)
[14] 领域自适应之继续预训练demo.预训练模型加载
[15] 百度飞桨项目:基于BERT的中文分词
[16] https://github.com/huggingface/transformers/tree/main/src/transformers/models/bert
[17] NLP&深度学习:PyTorch文本分类
[18] 自然语言处理 — Bert开发实战 (Transformers)
[19] BERT模型实战之多文本分类(附源码)
[20] 使用chatGPT看论文:https://www.humata.ai/
[21] 四大模型」革新NLP技术应用,百度文心ERNIE最新开源预训练模型
[22] 论文解读 | 百度 ERNIE: Enhanced Representation through Knowledge Integration
[23] 预训练模型之Ernie2.0. 某乎
[24] 预训练模型(四)—Ernie
[25] https://wenxin.baidu.com/
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。