
Statistical Language Models: Bi-grams

  This post trains a Bigram model on a small Chinese corpus and then uses it to generate a short passage of Chinese text autoregressively, purely for learning and fun. To produce good results a Bigram model generally needs corpora with hundreds of millions of tokens, whereas the training data used here is only about 10k reviews.

  A Bigram (2-gram) model is a statistical language model that is simple to implement but of limited practical value; it is a special case of the N-gram model. Given one preceding character, it assigns a probability to the two-character combination:

$$P(W_n \mid W_{n-1}) = \frac{P(W_{n-1}, W_n)}{P(W_{n-1})} = \frac{\text{probability that } W_{n-1} \text{ and } W_n \text{ occur together}}{\text{probability that } W_{n-1} \text{ occurs on its own}}$$

That is: given the previous character $W_{n-1}$, the probability of the next character $W_n$ is the probability of the bigram they form, divided by how often $W_{n-1}$ occurs on its own.
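
As a concrete illustration of the formula, here is a minimal, self-contained sketch that estimates one such conditional probability from raw counts; the tiny corpus is made up purely for this example and is unrelated to the review data used below.

from collections import Counter

# toy corpus: three short character "sentences" (hypothetical example)
corpus = ["abab", "abc", "ba"]

# count every adjacent character pair
bigram_counts = Counter()
for s in corpus:
    bigram_counts.update(zip(s, s[1:]))

# P(W_n = 'b' | W_{n-1} = 'a') = count('a','b') / count('a' followed by anything)
count_ab = bigram_counts[('a', 'b')]
count_a = sum(c for (first, _), c in bigram_counts.items() if first == 'a')
print(count_ab / count_a)   # 3 / 3 = 1.0 -- in this toy corpus 'a' is always followed by 'b'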

import os
import time
import pandas as pd
from dataclasses import dataclass

import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
from torch.utils.tensorboard import SummaryWriter

Loading the data

The data comes from a dataset of 10k+ food-delivery (waimai) reviews:

data = pd.read_csv('./dataset/waimai_10k.csv')
data.dropna(subset='review', inplace=True)        # drop rows with a missing review
data['review_length'] = data.review.apply(len)    # character count of each review
data.sample(5)
|       | label | review | review_length |
|------:|------:|--------|--------------:|
| 7324  | 0 | 等的花儿都谢了 | 7 |
| 8309  | 0 | 速度快,味道一般 | 8 |
| 8979  | 0 | 冷面套餐只有一碗面?难道我记错了?有点凉了,第一次点外卖,不是很满意。 | 35 |
| 6517  | 0 | 2个小时送到的,气都不想生了,自己看着办吧 | 21 |
| 10105 | 0 | 味道著實一般送到時候飯涼涼的 | 14 |

Dataset statistics:

data = data[data.review_length <= 50]       # keep only reviews of at most 50 characters
words = data.review.tolist()
chars = sorted(list(set(''.join(words))))   # all the possible characters
max_word_length = max(len(w) for w in words)

print(f"number of examples: {len(words)}")
print(f"max word length: {max_word_length}")
print(f"size of vocabulary: {len(chars)}")
number of examples: 10796
max word length: 50
size of vocabulary: 2272
Train/test split
test_set_size = min(1000, int(len(words) * 0.1)) # 10% of the dataset, capped at 1000 examples
rp = torch.randperm(len(words)).tolist()
train_words = [words[i] for i in rp[:-test_set_size]]
test_words = [words[i] for i in rp[-test_set_size:]]
print(f"split up the dataset into {len(train_words)} training examples and {len(test_words)} test examples")
split up the dataset into 9796 training examples and 1000 test examples
Building the character dataset [tensor]
  • <BLANK> : 0
  • token seq : [1, 2, 3, 4, 5, 6]
  • block_size : 1, the context length
  • x : [0, 1, 2, 3, 4, 5, 6]
  • y : [1, 2, 3, 4, 5, 6, 0]
A worked example of one x/y pair is sketched right after the dataset objects are created below.
class CharDataset(Dataset):

    def __init__(self, words, chars, max_word_length):
        self.words = words
        self.chars = chars
        self.max_word_length = max_word_length
        # char-->index-->char
        self.char2i = {ch:i+1 for i,ch in enumerate(chars)}
        self.i2char = {i:s for s,i in self.char2i.items()}    

    def __len__(self):
        return len(self.words)

    def contains(self, word):
        return word in self.words

    def get_vocab_size(self):
        return len(self.chars) + 1      # add a special token 0

    def get_output_length(self):
        return self.max_word_length + 1 # +1 for the special <START> token at index 0

    def encode(self, word):
        # char sequence ---> index sequence
        ix = torch.tensor([self.char2i[w] for w in word], dtype=torch.long)
        return ix

    def decode(self, ix):
        # index sequence ---> char sequence
        word = ''.join(self.i2char[i] for i in ix)
        return word

    def __getitem__(self, idx):
        word = self.words[idx][:self.max_word_length]
        ix = self.encode(word)
        x = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        y = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        x[1:1+len(ix)] = ix
        y[:len(ix)] = ix
        # y[len(ix)] stays 0, i.e. the <END> token; everything after it is masked out
        y[len(ix)+1:] = -1 # index -1 will mask the loss at the inactive locations
        return x, y
train_dataset = CharDataset(train_words, chars, max_word_length)
test_dataset = CharDataset(test_words, chars, max_word_length)
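
As a quick sanity check of the x/y layout described above (a minimal sketch; the exact indices depend on which review lands at position 0 after the random split), one sample can be inspected directly:

# x begins with the <START>/<BLANK> token 0 followed by the encoded review;
# y is x shifted left by one, with 0 as the <END> target and -1 marking masked positions
x, y = train_dataset[0]
print(train_dataset.decode([i for i in x.tolist() if i > 0]))   # the review text itself
print(x[:10])
print(y[:10])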
Data loader [DataLoader]
class InfiniteDataLoader:
    
    def __init__(self, dataset, **kwargs):
        train_sampler = torch.utils.data.RandomSampler(dataset, replacement=True, num_samples=int(1e10))
        self.train_loader = DataLoader(dataset, sampler=train_sampler, **kwargs)
        self.data_iter = iter(self.train_loader)

    def next(self):
        try:
            batch = next(self.data_iter)
        except StopIteration: # this will technically only happen after 1e10 samples... (i.e. basically never)
            self.data_iter = iter(self.train_loader)
            batch = next(self.data_iter)
        return batch
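
A quick check (a minimal sketch; the batch size of 4 is arbitrary) that the infinite loader yields tensors of the expected shape:

# both tensors should have shape (batch_size, max_word_length + 1) = (4, 51)
loader = InfiniteDataLoader(train_dataset, batch_size=4)
X, Y = loader.next()
print(X.shape, Y.shape)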

Building the model

  • Initialize a trainable parameter matrix $logits$ of size $vocab\_size \times vocab\_size$ (every character plus the special token 0), which serves as a lookup table of scores for each ordered pair of characters (a count-based version of this table is sketched right after this list for comparison)
  • Optimize $logits$ with the cross-entropy loss so that, after a softmax, it approximates the true bigram distribution between characters
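
For comparison, the distribution that the gradient-based training below tries to approximate can also be estimated in closed form, simply by counting character pairs in the training reviews. This count-based sketch is not part of the original pipeline; it reuses train_words and the train_dataset vocabulary defined above, and wraps each review with the special token 0 so that the start and end of a review are counted as well.

# classical maximum-likelihood bigram table with add-one smoothing
V = train_dataset.get_vocab_size()
counts = torch.zeros((V, V))
for w in train_words:
    ix = [0] + train_dataset.encode(w).tolist() + [0]   # <START> ... <END>
    for a, b in zip(ix, ix[1:]):
        counts[a, b] += 1
probs = (counts + 1) / (counts + 1).sum(dim=1, keepdim=True)   # each row sums to 1

Each row of probs is then an estimate of $P(W_n \mid W_{n-1})$; the trained logits matrix, once passed through a row-wise softmax, should end up close to this table.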
@dataclass
class ModelConfig:
    block_size: int = None # length of the input sequences 
    vocab_size: int = None # size of vocabulary
class Bigram(nn.Module):

    def __init__(self, config):
        super().__init__()
        n = config.vocab_size
        # the bigram lookup table, parameterized so it can be trained
        self.logits = nn.Parameter(torch.ones((n, n))/n)

    # context length is 1: only the previous character is used to predict the next one
    def get_block_size(self):
        return 1

    def forward(self, idx, targets=None):
        # 'forward pass': a table lookup, one row of logits per input index
        logits = self.logits[idx]
        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        return logits, loss
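
To make the lookup-style forward pass concrete, here is a standalone sketch (toy sizes, not the real vocabulary) of what self.logits[idx] does: indexing a (V, V) parameter with a (B, T) index tensor returns a (B, T, V) tensor, one row of scores per input position.

# advanced indexing: each entry of idx selects one full row of the table
V = 5
table = torch.arange(V * V, dtype=torch.float).view(V, V)   # stand-in for self.logits
idx = torch.tensor([[1, 3], [4, 0]])                        # (B=2, T=2) previous-character indices
out = table[idx]
print(out.shape)     # torch.Size([2, 2, 5])
print(out[0, 0])     # row 1 of the table: tensor([5., 6., 7., 8., 9.])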
@torch.inference_mode()
def evaluate(model, dataset, batch_size=10, max_batches=None):
    model.eval()
    loader = DataLoader(dataset, shuffle=True, batch_size=batch_size, num_workers=0)
    losses = []
    for i, batch in enumerate(loader):
        batch = [t.to('cuda') for t in batch]
        X, Y = batch
        logits, loss = model(X, Y)
        losses.append(loss.item())
        if max_batches is not None and i >= max_batches:
            break
    mean_loss = torch.tensor(losses).mean().item()
    model.train() # reset model back to training mode
    return mean_loss

Training the model

Environment initialization
torch.manual_seed(seed=12345)
torch.cuda.manual_seed_all(seed=12345)

work_dir = "./bigram_log"
os.makedirs(work_dir, exist_ok=True)
writer = SummaryWriter(log_dir=work_dir)
Initializing the model
config = ModelConfig(vocab_size=train_dataset.get_vocab_size(),
                     block_size=1)

model = Bigram(config)

model.to('cuda')
Bigram()
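The printed Bigram() shows no child modules because the lookup table is registered as a bare nn.Parameter rather than as a submodule such as nn.Embedding; it still appears in model.parameters() and is trained normally.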
# init optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01, betas=(0.9, 0.99), eps=1e-8)
# init dataloader
batch_loader = InfiniteDataLoader(train_dataset, batch_size=64, pin_memory=True, num_workers=4)

# training loop
best_loss = None
step = 0
train_losses, test_losses = [],[]
while True:

    t0 = time.time()

    # get the next batch, ship to device, and unpack it to input and target
    batch = batch_loader.next()
    batch = [t.to('cuda') for t in batch]
    X, Y = batch
    # feed into the model
    logits, loss = model(X, Y)

    # calculate the gradient, update the weights
    model.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    # wait for all CUDA work on the GPU to finish then calculate iteration time taken
    torch.cuda.synchronize()
    t1 = time.time()

    # logging
    if step % 1000 == 0:
        print(f"step {step} | loss {loss.item():.4f} | step time {(t1-t0)*1000:.2f}ms")

    # evaluate the model
    if step > 0 and step % 100 == 0:
        train_loss = evaluate(model, train_dataset, batch_size=100, max_batches=10)
        test_loss  = evaluate(model, test_dataset,  batch_size=100, max_batches=10)
        train_losses.append(train_loss)
        test_losses.append(test_loss)
        #writer.add_scalar("Loss/train", train_loss, step)
        #writer.add_scalar("Loss/test", test_loss, step)
        #writer.flush()
        #print(f"step {step} train loss: {train_loss} test loss: {test_loss}")
        # save the model to disk if it has improved
        if best_loss is None or test_loss < best_loss:
            out_path = os.path.join(work_dir, "model.pt")
            print(f"test loss {test_loss} is the best so far, saving model to {out_path}")
            torch.save(model.state_dict(), out_path)
            best_loss = test_loss

    step += 1
    # termination conditions
    if step > 15100:
        break
step 0 | loss 7.7289 | step time 22.08ms
test loss 7.676259517669678 is the best so far, saving model to ./bigram_log/model.pt
test loss 7.622135162353516 is the best so far, saving model to ./bigram_log/model.pt
test loss 7.568359375 is the best so far, saving model to ./bigram_log/model.pt
test loss 7.5148138999938965 is the best so far, saving model to ./bigram_log/model.pt
test loss 7.461203098297119 is the best so far, saving model to ./bigram_log/model.pt
test loss 7.407928466796875 is the best so far, saving model to ./bigram_log/model.pt
test loss 7.354836463928223 is the best so far, saving model to ./bigram_log/model.pt
test loss 7.3021063804626465 is the best so far, saving model to ./bigram_log/model.pt
test loss 7.249790191650391 is the best so far, saving model to ./bigram_log/model.pt
step 1000 | loss 7.1475 | step time 3.47ms
....
step 15000 | loss 3.6146 | step time 3.21ms
test loss 4.035704135894775 is the best so far, saving model to ./bigram_log/model.pt
test loss 4.032766819000244 is the best so far, saving model to ./bigram_log/model.pt
best test loss : 4.032766
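
Since F.cross_entropy returns the average negative log-likelihood in nats over the non-masked positions, this loss can be read as a per-character perplexity (a rough interpretation, not part of the original post):

import math
print(math.exp(4.032766))   # ~56.4: roughly as uncertain as a uniform choice among ~56 characters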

Testing: a Bi-gram review generator

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
    block_size = model.get_block_size()
    for _ in range(max_new_tokens):
        # if the sequence context is growing too long we must crop it at block_size
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        # forward the model to get the logits for the index in the sequence
        logits, _ = model(idx_cond)
        # pluck the logits at the final step and scale by desired temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop the logits to only the top k options
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')
        # apply softmax to convert logits to (normalized) probabilities
        probs = F.softmax(logits, dim=-1)
        # either sample from the distribution or take the most likely element
        if do_sample:
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            _, idx_next = torch.topk(probs, k=1, dim=-1)
        # append sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
def print_samples(num=13, top_k = None):
    # initial <START> tokens (index 0)
    X_init = torch.zeros(num, 1, dtype=torch.long).to('cuda')
    steps = train_dataset.get_output_length() - 1 # -1 because we already start with <START> token (index 0)
    X_samp = generate(model, X_init, steps, top_k=top_k, do_sample=True).to('cuda')
    new_samples = []
    for i in range(X_samp.size(0)):
        # get the i'th row of sampled integers, as python list
        row = X_samp[i, 1:].tolist() # note: we need to crop out the first <START> token
        # token 0 is the <END> token, so we crop the output sequence at that point
        crop_index = row.index(0) if 0 in row else len(row)
        row = row[:crop_index]
        word_samp = train_dataset.decode(row)
        new_samples.append(word_samp)
    return new_samples
print_samples(num=5)
['特意写挑烦热吗水煮丽蛹姨滥转胎描芝染渣熘健故者绒豪贴。',
 '送,但是够有点的咽奉伤蜗密餐.。鸡肉块么特别家还跑当赖焦欢饥屏印两个小哥不错',
 '我也不错',
 '刚厨睡共竹系百度特别还是怎么好',
 '不说,吃的那么味道还是速度快']