This article walks through training a large model from scratch, where "from scratch" means processing the data and building the model yourself at the source-code level.
Loading the data
Overly long texts are split with max_len = 128: long sentences are first cut into short ones at punctuation marks, and the short sentences are then packed back together as long as the result does not exceed the maximum length; anything beyond that starts a new sequence. Once all the data has been processed this way, we have a Dataset.
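A minimal sketch of this splitting-and-packing step might look like the following (the helper name split_and_pack and the punctuation set are assumptions, not code from the original post):

import re

def split_and_pack(text, max_len=128):
    # Cut the long text into short pieces at sentence-ending punctuation.
    pieces = [p for p in re.split(r'(?<=[。！？；!?;.])', text) if p.strip()]
    samples, current = [], ''
    for piece in pieces:
        # Pack short pieces together while the result stays within max_len.
        if len(current) + len(piece) <= max_len:
            current += piece
        else:
            if current:
                samples.append(current)
            current = piece  # the overflow starts a new sample
    if current:
        samples.append(current)
    return samples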
The Dataset is then passed to a DataLoader, and the collate_fn inside the DataLoader turns each batch into the model inputs (a small wiring sketch follows).
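A minimal sketch of that wiring, assuming a simple list-backed Dataset (the names TextDataset and all_samples are assumptions, and batch_size is only illustrative; collate_fn is the function described step by step next):

import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    # Wraps the list of packed short texts produced by the splitting step above.
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]

# collate_fn receives a list of raw texts and returns the batched tensors the model expects.
loader = DataLoader(TextDataset(all_samples), batch_size=32, shuffle=True, collate_fn=collate_fn)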
The collate_fn function:
1) Encode the input text: input_ids = tokenizer.encode(list(text))
2) Compute the number of tokens to mask (15% of the tokens, at least 1 and at most max_pred): n_pred = min(max_pred, max(1, int(len(input_ids) * 0.15)))
3) Collect the candidate positions that may be masked, excluding [CLS] and [SEP]: cand_maked_pos = [i for i, token in enumerate(input_ids) if token != word2idx['[CLS]'] and token != word2idx['[SEP]']]
4) Shuffle the candidate positions: shuffle(cand_maked_pos)
5) Apply the masking, following BERT's 80%/10%/10% rule (80% become [MASK], 10% are replaced by a random token, 10% are left unchanged):
    for pos in cand_maked_pos[:n_pred]:
        masked_pos.append(pos)
        masked_tokens.append(input_ids[pos])
        r = random()
        if r < 0.8:
            # 80%: replace with [MASK]
            input_ids[pos] = word2idx['[MASK]']
        elif r < 0.9:
            # 10%: replace with a random token, avoiding the special ids
            # (here 0 = [PAD], 101 = [CLS], 102 = [SEP], 103 = [MASK])
            index = randint(0, vocab_size - 1)
            while index in (0, 101, 102, 103):
                index = randint(0, vocab_size - 1)
            input_ids[pos] = index
        # remaining 10%: keep the original token
6) Zero-pad input_ids up to maxlen:
    n_pad = maxlen - len(input_ids)
    input_ids.extend([0] * n_pad)
7) Zero-pad the masked tokens and their position vectors up to max_pred:
    n_pad = max_pred - n_pred
    masked_tokens.extend([0] * n_pad)
    masked_pos.extend([0] * n_pad)
8) Convert the resulting model inputs input_ids, masked_tokens and masked_pos to torch tensors:
    torch.tensor(input_ids, dtype=torch.long)
    torch.tensor(masked_tokens, dtype=torch.long)
    torch.tensor(masked_pos, dtype=torch.long)
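Steps 1) through 8) describe a single text; inside collate_fn they run once per sample in the batch, and the per-sample tensors are stacked before being returned. A minimal sketch of that final assembly (encode_one_text is a hypothetical helper standing in for steps 1-8, not a function from the original post):

def collate_fn(batch_texts):
    batch_input_ids, batch_masked_tokens, batch_masked_pos = [], [], []
    for text in batch_texts:
        # encode_one_text is a hypothetical wrapper around steps 1)-8);
        # it returns three fixed-length torch.long tensors for one sample.
        input_ids, masked_tokens, masked_pos = encode_one_text(text)
        batch_input_ids.append(input_ids)
        batch_masked_tokens.append(masked_tokens)
        batch_masked_pos.append(masked_pos)
    # Stack into [batch_size, maxlen] and [batch_size, max_pred] tensors.
    return (torch.stack(batch_input_ids),
            torch.stack(batch_masked_tokens),
            torch.stack(batch_masked_pos))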
Building the model: the embedding layer sums token embeddings and learned position embeddings, then applies LayerNorm and dropout.

import torch
import torch.nn as nn
import numpy as np

class Embedding(nn.Module):
    def __init__(self):
        super(Embedding, self).__init__()
        self.tok_embed = nn.Embedding(vocab_size, hidden_size)
        self.pos_embed = nn.Embedding(maxlen, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(hidden_dropout_prob)

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        pos = torch.arange(seq_len, dtype=torch.long)
        pos = pos.unsqueeze(0).expand_as(input_ids)  # [seq_len] -> [batch_size, seq_len]
        embedding = self.tok_embed(input_ids) + self.pos_embed(pos)  # [batch_size, seq_len, hidden_size]
        embedding = self.norm(embedding)
        embedding = self.dropout(embedding)
        return embedding
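A quick shape check, assuming vocab_size, hidden_size, maxlen and hidden_dropout_prob are defined as hyperparameters elsewhere in the script (the values below are only illustrative):

# Hypothetical hyperparameter values, for illustration only.
vocab_size, hidden_size, maxlen, hidden_dropout_prob = 21128, 768, 128, 0.1

embed = Embedding()
dummy_ids = torch.randint(0, vocab_size, (2, maxlen))  # [batch_size=2, seq_len=128]
print(embed(dummy_ids).shape)                          # torch.Size([2, 128, 768])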
Inside each encoder layer, attention is computed with scaled dot-product attention; its forward pass:

def forward(self, Q, K, V, attn_mask):
    # Q, K, V: [batch_size, n_heads, seq_len, d_k]; d_k is the per-head dimension
    scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k)  # [batch_size, n_heads, seq_len, seq_len]
    scores.masked_fill_(attn_mask, -1e9)  # mask out padding positions before the softmax
    attn = nn.Softmax(dim=-1)(scores)
    context = torch.matmul(attn, V)       # [batch_size, n_heads, seq_len, d_k]
    return context
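The attn_mask passed in above marks the padding positions. The helper get_attn_pad_mask, which the full model below relies on but which is not shown in this excerpt, can be sketched as follows (a minimal sketch assuming [PAD] has id 0):

def get_attn_pad_mask(seq_q, seq_k):
    # seq_q, seq_k: [batch_size, seq_len]; positions equal to 0 are [PAD].
    batch_size, len_q = seq_q.size()
    _, len_k = seq_k.size()
    pad_mask = seq_k.eq(0).unsqueeze(1)                # [batch_size, 1, len_k], True where padded
    return pad_mask.expand(batch_size, len_q, len_k)   # [batch_size, len_q, len_k]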
The masked-language-model head is a linear layer with a GELU activation, followed by an output projection whose weight is tied to the token embedding matrix. In __init__:

    self.linear = nn.Linear(hidden_size, hidden_size)
    self.activ2 = gelu
    embed_weight = self.embedding.tok_embed.weight
    self.fc2 = nn.Linear(hidden_size, vocab_size)
    self.fc2.weight = embed_weight   # tie the output projection to the token embedding

In forward, the hidden states at the masked positions are gathered and projected onto the vocabulary:

    masked_pos = masked_pos[:, :, None].expand(-1, -1, hidden_size)  # [batch_size, max_pred, hidden_size]
    h_masked = torch.gather(output, 1, masked_pos)                   # [batch_size, max_pred, hidden_size]
    h_masked = self.activ2(self.linear(h_masked))                    # [batch_size, max_pred, hidden_size]
    logits_lm = self.fc2(h_masked)                                   # [batch_size, max_pred, vocab_size]
    return logits_lm
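The torch.gather trick above selects, for every sample, the hidden vectors at the masked positions. A tiny self-contained example of the same pattern (values are purely illustrative):

import torch

output = torch.arange(24, dtype=torch.float).view(1, 4, 6)  # [batch=1, seq_len=4, hidden=6]
masked_pos = torch.tensor([[1, 3]])                          # positions 1 and 3 were masked
idx = masked_pos[:, :, None].expand(-1, -1, 6)               # [1, 2, 6]
h_masked = torch.gather(output, 1, idx)                      # rows 1 and 3 of the sequence
print(h_masked.shape)  # torch.Size([1, 2, 6])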
The complete model combines the embedding, a stack of encoder layers and the MLM head:

class RoBERTa(nn.Module):
    def __init__(self):
        super(RoBERTa, self).__init__()
        self.embedding = Embedding()
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.activ2 = gelu
        embed_weight = self.embedding.tok_embed.weight
        self.fc2 = nn.Linear(hidden_size, vocab_size)
        self.fc2.weight = embed_weight

    def forward(self, input_ids, masked_pos):
        output = self.embedding(input_ids)  # [batch_size, seq_len, hidden_size]
        enc_self_attn_mask = get_attn_pad_mask(input_ids, input_ids)  # [batch_size, maxlen, maxlen]
        for layer in self.layers:
            output = layer(output, enc_self_attn_mask)  # [batch_size, seq_len, hidden_size]
        masked_pos = masked_pos[:, :, None].expand(-1, -1, hidden_size)  # [batch_size, max_pred, hidden_size]
        h_masked = torch.gather(output, 1, masked_pos)  # [batch_size, max_pred, hidden_size]
        h_masked = self.activ2(self.linear(h_masked))   # [batch_size, max_pred, hidden_size]
        logits_lm = self.fc2(h_masked)                  # [batch_size, max_pred, vocab_size]
        return logits_lm
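The gelu activation and the EncoderLayer stack are referenced above but not shown in this excerpt. gelu can be defined in the usual BERT way (a standard formulation; the exact code in the original post may differ):

import math
import torch

def gelu(x):
    # Exact GELU, as used in BERT-style models.
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))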
The training loop computes cross-entropy only over the masked positions (ignore_index=0 skips the zero-padded labels):

import os
import tqdm
import torch.optim as optim

model = RoBERTa()
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(model.parameters(), lr=0.0001)

for epoch in range(epochs):
    loss = 0
    pbar = tqdm.tqdm(loader, desc='Train', nrows=200, ncols=100)
    for input_ids, masked_tokens, masked_pos in pbar:
        logits_lm = model(input_ids, masked_pos)  # [batch_size, max_pred, vocab_size]
        loss_lm = criterion(logits_lm.view(-1, vocab_size), masked_tokens.view(-1))
        loss += loss_lm.item()   # accumulate a Python float, not the graph-carrying tensor
        optimizer.zero_grad()
        loss_lm.backward()
        optimizer.step()
    print('Epoch:', '%04d' % (epoch + 1), 'loss =', '{:.6f}'.format(loss))
    output_path = './outputs/'
    os.makedirs(output_path, exist_ok=True)
    save_path = os.path.join(output_path, 'checkpoint_RoBERTa-' + str(epoch + 1))
    torch.save(model, save_path)
The figures below show the model test and its results. Of the two test texts, 13 and 2 tokens were masked respectively, but only one token was predicted correctly in each. This is because only a small amount of training data was used and the model was not trained sufficiently.
Model inference:
model = torch.load(save_path)
input_ids = input_ids.numpy().tolist()
masked_tokens = masked_tokens.numpy().tolist()
masked_pos = masked_pos.numpy().tolist()
print([idx2word[w] for w in input_ids[0] if idx2word[w] != '[PAD]'])
logits_lm = model(torch.LongTensor([input_ids[0]]), torch.LongTensor([masked_pos[0]]))  # [batch_size, max_pred, vocab_size]
logits_lm = logits_lm.data.max(2)[1][0].data.numpy()  # predicted token ids at the masked positions, length max_pred
print('masked tokens list : ', [pos for pos in masked_tokens[0] if pos != 0])
print('predict masked tokens list : ', [pos for pos in logits_lm if pos != 0])
Test result 1:
Test result 2:
ALBERT's architecture is essentially the same as BERT's, using Transformer encoders and the GELU activation. Its three main innovations are: factorized embedding parameterization, cross-layer parameter sharing, and replacing the next-sentence-prediction task with sentence-order prediction (SOP).
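As a quick illustration of the first point, factorized embedding parameterization splits the large vocab-by-hidden embedding matrix into two much smaller matrices (a minimal sketch; the dimension values are only examples):

import torch.nn as nn

vocab_size, embed_size, hidden_size = 30000, 128, 768  # illustrative values

# BERT-style: vocab_size x hidden_size parameters (about 23M here)
bert_embed = nn.Embedding(vocab_size, hidden_size)

# ALBERT-style: vocab_size x embed_size + embed_size x hidden_size (about 3.9M here)
albert_embed = nn.Sequential(
    nn.Embedding(vocab_size, embed_size),
    nn.Linear(embed_size, hidden_size, bias=False),
)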
ELECTRA's main contribution is a new pre-training task and framework: it replaces the generative masked language model (MLM) objective with a discriminative replaced token detection (RTD) objective, i.e. deciding for each token whether it has been replaced by a language model.
Concretely: first, a certain proportion of the original input sequence X-ORI is randomly masked to obtain a new sequence X-MASK; next, X-MASK is fed into a generator model, which predicts new tokens (over the full vocabulary) for the masked positions, producing the sequence X-Generator; finally, X-Generator is fed into a discriminator model, which judges for every token whether it is the original one (compared against X-ORI).
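A conceptual sketch of this generator/discriminator pipeline (not ELECTRA's actual training code: generator and discriminator stand in for two Transformer encoders, sampling is simplified to argmax, and the loss weighting is omitted):

import torch

def electra_rtd_step(x_ori, mask_positions, generator, discriminator, mask_id):
    # 1) Build X-MASK by masking a subset of the original tokens.
    x_mask = x_ori.clone()
    x_mask[mask_positions] = mask_id

    # 2) The generator fills the masked positions from its MLM output.
    gen_logits = generator(x_mask)            # [batch, seq_len, vocab_size]
    sampled = gen_logits.argmax(dim=-1)       # sampling simplified to argmax
    x_generator = x_ori.clone()
    x_generator[mask_positions] = sampled[mask_positions]

    # 3) The discriminator labels every token: 1 = replaced, 0 = original.
    rtd_labels = (x_generator != x_ori).float()
    rtd_logits = discriminator(x_generator)   # [batch, seq_len]
    return rtd_logits, rtd_labels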
Differences between ELECTRA and BERT:
Compared with BERT, ERNIE makes the following improvements:
This article is a recap of material I studied earlier. It only covers the main parts of training such a model, and some details could not be shown one by one. If you need the source code, feel free to message me; questions are welcome in the comments.