
huggingface-introduction: A Worked Example


Preface

If you want to get hands-on, the first library to consider is transformers. For an overview of how to use it, this video by a foreign creator is short and very practical:
【双语字幕+资料下载】Hugging Face速成指南!一遍搞定NLP任务中最常用的功能板块<实战教程系列>

This article follows along with that video.
First, if you do not have the transformers library yet, run

pip install transformers

In addition, torch can be downloaded from the official website (latest version). If you need a different build, see https://download.pytorch.org/whl/torch_stable.html
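
For example, a specific CPU build can be installed straight from that index (a sketch; the version string below is only illustrative, pick the one you actually need):

pip install torch==1.13.1+cpu -f https://download.pytorch.org/whl/torch_stable.html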



Pipeline

The pipeline API is meant for common tasks where you just want something quick, do not need a custom model, and do not need fine-tuning.

from transformers import pipeline
classifier = pipeline("sentiment-analysis")
res = classifier("We are very happy to show you the Transformers library.")
print(res)

You can see that the default model is distilbert-base-uncased-finetuned-sst-2-english. Output:
[{'label': 'POSITIVE', 'score': 0.9997994303703308}]



Multiple inputs:

classifier = pipeline("sentiment-analysis")
res = classifier(["We are very happy to show you the Transformers library.", 
          "We hope you don't hate it."])
print(res)

Output:
[{'label': 'POSITIVE', 'score': 0.9997994303703308}, {'label': 'NEGATIVE', 'score': 0.5308617353439331}]


Specifying the model:

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

classifier = pipeline("sentiment-analysis", model=model_name)
res = classifier(["We are very happy to show you the Transformers library.", 
          "We hope you don't hate it."])
print(res)

Output:
[{'label': 'POSITIVE', 'score': 0.9997994303703308}, {'label': 'NEGATIVE', 'score': 0.5308617353439331}]


Custom tokenizer and model

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
res = classifier(["We are very happy to show you the Transformers library.", 
          "We hope you don't hate it."])
print(res)

Output:
[{'label': 'POSITIVE', 'score': 0.9997994303703308}, {'label': 'NEGATIVE', 'score': 0.5308617353439331}]


tokenizer

from transformers import AutoTokenizer

This is the tokenizer, which converts text into the inputs the model expects (for the BERT family these are input_ids and attention_mask; note that input_ids do not need one-hot encoding).

tokens = tokenizer.tokenize('We are very happy to show you the Transformers library.')
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f'tokens: {tokens}')
print(f'token_ids: {token_ids}')

tokens: ['we', 'are', 'very', 'happy', 'to', 'show', 'you', 'the', 'transformers', 'library', '.']
token_ids: [2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012]

As you can see, the tokenize method splits the text into tokens, and convert_tokens_to_ids maps those tokens to their ids.
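
The reverse direction also exists. A minimal sketch (reusing the tokenizer and token_ids from above): convert_ids_to_tokens maps ids back to tokens, and decode turns ids back into a readable string.

# map the ids back to tokens, then back to a string
print(tokenizer.convert_ids_to_tokens(token_ids))
print(tokenizer.decode(token_ids))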


Here is another example. From the output you can see that calling the tokenizer directly returns a dictionary, and that dictionary can be unpacked and passed straight to the model as its input.

input = tokenizer("We are very happy to show you the Transformers library.")
print(input)

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


A batch with several samples. padding means that shorter sentences are padded (usually with 0) so every sequence in the batch has the same length.

train_x = ["We are very happy to show you the Transformers library.", 
       "We hope you don't hate it."]

batch = tokenizer(train_x, padding=True, truncation=True, max_length=512, return_tensors='pt')
print(batch)

{'input_ids': tensor([[ 101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081,
3075, 1012, 102],
[ 101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012,
102, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}

Without return_tensors='pt', plain Python lists are returned instead of torch tensors:

train_x = ["We are very happy to show you the Transformers library.", 
       "We hope you don't hate it."]

batch = tokenizer(train_x, padding=True, truncation=True, max_length=512)
print(batch)

{'input_ids': [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]}



Another option is encode. Its output is just the input_ids, but everything gets concatenated into a single one-dimensional list, so the usage below is not correct for a batch; you would need a for loop over the sentences instead (see the sketch after the output below, and my other post on sentiment analysis with the BERT family in PyTorch).

train_x = ["We are very happy to show you the Transformers library.", 
       "We hope you don't hate it."]

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)

batch = tokenizer.encode(train_x, padding=True, truncation=True, max_length=512)
print(batch)

[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012, 102, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102]
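
A minimal sketch of the per-sentence loop (same tokenizer and train_x as above); each call to encode returns one sentence's ids:

batch_ids = []
for text in train_x:
  # encode one sentence at a time so the ids are not concatenated
  batch_ids.append(tokenizer.encode(text, truncation=True, max_length=512))
print(batch_ids)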



inference

Use the model for classification. Here we only run inference; fine-tuning is covered later.
The ** here unpacks the dictionary into keyword arguments; see my other post on Python's * and ** function parameters.
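
A tiny reminder of what ** does (the function f and the dict d are made up purely for illustration):

def f(input_ids, attention_mask):
  return input_ids, attention_mask

d = {'input_ids': [1, 2, 3], 'attention_mask': [1, 1, 1]}
print(f(**d))  # same as f(input_ids=[1, 2, 3], attention_mask=[1, 1, 1])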

import torch
import torch.nn.functional as F

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

train_x = ["We are very happy to show you the Transformers library.", 
       "We hope you don't hate it."]

batch = tokenizer(train_x, padding=True, truncation=True, max_length=512, return_tensors='pt')

with torch.no_grad():
  outputs = model(**batch)
  predictions = F.softmax(outputs.logits, dim=-1)
  labels = torch.argmax(predictions, dim=-1)
  labels_en = [model.config.id2label[label_id] for label_id in labels.tolist()]

print('outputs:\n', outputs)
print('predictions:\n', predictions)
print('labels:', labels)
print('labels_en:', labels_en)

outputs:
SequenceClassifierOutput(loss=None, logits=tensor([[-4.1329, 4.3811],
[ 0.0818, -0.0418]]), hidden_states=None, attentions=None)
predictions:
tensor([[2.0060e-04, 9.9980e-01],
[5.3086e-01, 4.6914e-01]])
labels: tensor([1, 0])
labels_en: ['POSITIVE', 'NEGATIVE']



Notice that loss is None above, because no ground-truth labels were passed. This time we add them:

# with labels, so a loss is computed
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

train_x = ["We are very happy to show you the Transformers library.", 
       "We hope you don't hate it."]
train_y = [1, 0]

batch = tokenizer(train_x, padding=True, truncation=True, max_length=512, return_tensors='pt')

with torch.no_grad():
  outputs = model(**batch, labels=torch.tensor(train_y))
  predictions = F.softmax(outputs.logits, dim=-1)
  labels = torch.argmax(predictions, dim=-1)
  labels_en = [model.config.id2label[label_id] for label_id in labels.tolist()]

print('outputs:\n', outputs)
print('predictions:\n', predictions)
print('labels:', labels)
print('labels_en:', labels_en)

outputs:
SequenceClassifierOutput(loss=tensor(0.3167), logits=tensor([[-4.1329, 4.3811],
[ 0.0818, -0.0418]]), hidden_states=None, attentions=None)
predictions:
tensor([[2.0060e-04, 9.9980e-01],
[5.3086e-01, 4.6914e-01]])
labels: tensor([1, 0])
labels_en: ['POSITIVE', 'NEGATIVE']
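
The loss is simply the cross-entropy of the logits against the labels, averaged over the batch. A quick sketch to verify this (reusing outputs and train_y from above):

# should print roughly 0.3167, matching outputs.loss
manual_loss = F.cross_entropy(outputs.logits, torch.tensor(train_y))
print(manual_loss)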



In fact, the values in the dictionary returned by the tokenizer do not have to be torch tensors; we can convert the input_ids ourselves:

# no labels are passed here, so loss will be None
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

train_x = ["We are very happy to show you the Transformers library.", 
       "We hope you don't hate it."]

batch_dict = tokenizer(train_x, padding=True, truncation=True, max_length=512)
print('batch_dict:', batch_dict)
batch = torch.tensor(batch_dict['input_ids'])
print('batch:', batch)


with torch.no_grad():
  outputs = model(batch)
  predictions = F.softmax(outputs.logits, dim=-1)
  labels = torch.argmax(predictions, dim=-1)
  labels_en = [model.config.id2label[label_id] for label_id in labels.tolist()]

print('outputs:\n', outputs)
print('predictions:\n', predictions)
print('labels:', labels)
print('labels_en:', labels_en)

batch_dict: {'input_ids': [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]}
batch: tensor([[ 101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012, 102], [ 101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0]])
outputs:
SequenceClassifierOutput(loss=None, logits=tensor([[-4.1329, 4.3811],
[ 1.5112, -1.3358]]), hidden_states=None, attentions=None)
predictions:
tensor([[2.0060e-04, 9.9980e-01],
[9.4517e-01, 5.4834e-02]])
labels: tensor([1, 0])
labels_en: ['POSITIVE', 'NEGATIVE']



Now add the labels and look at the loss. Without passing attention_mask the loss is actually different (and so are the logits for the padded sentence), even though the predicted labels end up the same.

# with labels, but without attention_mask
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

train_x = ["We are very happy to show you the Transformers library.", 
       "We hope you don't hate it."]
train_y = [1, 0]

batch_dict = tokenizer(train_x, padding=True, truncation=True, max_length=512)
print('batch_dict:', batch_dict)
batch = torch.tensor(batch_dict['input_ids'])
print('batch:', batch)


with torch.no_grad():
  outputs = model(batch, labels=torch.tensor(train_y))
  predictions = F.softmax(outputs.logits, dim=-1)
  labels = torch.argmax(predictions, dim=-1)
  labels_en = [model.config.id2label[label_id] for label_id in labels.tolist()]

print('outputs:\n', outputs)
print('predictions:\n', predictions)
print('labels:', labels)
print('labels_en:', labels_en)

batch_dict: {'input_ids': [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]}
batch: tensor([[ 101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012, 102], [ 101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0]])
outputs:
SequenceClassifierOutput(loss=tensor(0.0283), logits=tensor([[-4.1329, 4.3811],
[ 1.5112, -1.3358]]), hidden_states=None, attentions=None)
predictions:
tensor([[2.0060e-04, 9.9980e-01],
[9.4517e-01, 5.4834e-02]])
labels: tensor([1, 0])
labels_en: ['POSITIVE', 'NEGATIVE']



save & load

Saving the model

import os
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

Save the weights to a chosen directory:

save_dir = 'save_dir'

if not os.path.exists(save_dir):
  os.mkdir(save_dir)

tokenizer.save_pretrained(save_dir)
model.save_pretrained(save_dir)


Load them again later:

load_tokenizer = AutoTokenizer.from_pretrained(save_dir)
load_model = AutoModelForSequenceClassification.from_pretrained(save_dir)
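
The reloaded tokenizer and model can be used exactly like before, for example inside a pipeline (a small usage sketch; pipeline was imported earlier):

classifier = pipeline("sentiment-analysis", model=load_model, tokenizer=load_tokenizer)
print(classifier("We are very happy to show you the Transformers library."))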



fine-tune

from pathlib import Path
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

def read_imdb_split(split_dir):
  # the IMDb layout is split_dir/pos/*.txt and split_dir/neg/*.txt
  split_dir = Path(split_dir)
  texts = []
  labels = []
  for label_dir in ['pos', 'neg']:
    for text_file in (split_dir/label_dir).iterdir():
      texts.append(text_file.read_text())
      labels.append(0 if label_dir == 'neg' else 1)  # neg -> 0, pos -> 1
  return texts, labels

Download the dataset:

# http://ai.stanford.edu/~amaas/data/sentiment
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

Extract the dataset:

!tar -zxvf /content/aclImdb_v1.tar.gz


Read the dataset:

train_texts, train_labels = read_imdb_split('/content/aclImdb/train')
test_texts, test_labels = read_imdb_split('/content/aclImdb/test')
print('train_texts.len:', len(train_texts))
print('test_texts.len: ', len(test_texts))

train_texts.len: 25000
test_texts.len: 25000


Split off part of the training data for validation:

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2)
print('train_texts.len:', len(train_texts))
print('val_texts.len: ', len(val_texts))

train_texts.len: 12800
val_texts.len: 3200


class IMDbDataset(Dataset):
  # wraps the tokenized encodings and labels so they can be consumed
  # by the Trainer or a torch DataLoader
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    return len(self.labels)



model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

tokenizer = DistilBertTokenizer.from_pretrained(model_name)

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)



train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

Trainer approach

Set the training hyperparameters:

training_args = TrainingArguments(
  output_dir = './results',
  num_train_epochs=2,
  per_device_train_batch_size=16,
  per_device_eval_batch_size=64,
  warmup_steps=500,
  learning_rate=5e-5,
  weight_decay=0.01,
  logging_dir='./logs',
  logging_steps=10
)
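
If you also want accuracy reported when the Trainer evaluates, you can define a compute_metrics function (a minimal sketch; this function and the metric choice are my own addition, not from the video):

import numpy as np

def compute_metrics(eval_pred):
  # eval_pred is a (logits, labels) pair produced by the Trainer
  logits, labels = eval_pred
  preds = np.argmax(logits, axis=-1)
  return {'accuracy': (preds == labels).mean()}

Then pass compute_metrics=compute_metrics when constructing the Trainer below.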



Load the model:

model = DistilBertForSequenceClassification.from_pretrained(model_name)

Create the Trainer:

trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=train_dataset,
  eval_dataset=val_dataset
)

Start training:

trainer.train()
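
After training you can evaluate on the held-out test set and save the fine-tuned weights (a sketch; without a compute_metrics function the Trainer only reports the evaluation loss, and the output directory name is arbitrary):

metrics = trainer.evaluate(eval_dataset=test_dataset)
print(metrics)
trainer.save_model('./finetuned_model')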



Native PyTorch approach

from torch.utils.data import DataLoader
from transformers import AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

Load the model:

model = DistilBertForSequenceClassification.from_pretrained(model_name)
model.to(device)
model.train()

Create the DataLoader:

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

Instantiate the optimizer and set the number of epochs:

optim = AdamW(model.parameters(), lr=5e-5)
num_train_epochs = 2
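
If you want the same warmup behaviour as the Trainer version (warmup_steps=500), you could also add a learning-rate scheduler (a sketch using transformers' get_linear_schedule_with_warmup; remember to call scheduler.step() right after optim.step() in the loop below):

from transformers import get_linear_schedule_with_warmup

num_training_steps = num_train_epochs * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
  optim, num_warmup_steps=500, num_training_steps=num_training_steps)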

for epoch in range(num_train_epochs):
  for step, batch in enumerate(train_loader):
    optim.zero_grad()
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)

    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

    loss = outputs[0]
    print('epoch:{}, step:{}, loss:{}'.format(epoch, step, loss.item()))
    loss.backward()
    optim.step()
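
After the training loop you would typically switch to evaluation mode and measure accuracy on the test set (a minimal sketch; the batch size of 64 is arbitrary):

test_loader = DataLoader(test_dataset, batch_size=64)

model.eval()
correct = total = 0
with torch.no_grad():
  for batch in test_loader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)
    logits = model(input_ids, attention_mask=attention_mask).logits
    preds = torch.argmax(logits, dim=-1)
    correct += (preds == labels).sum().item()
    total += labels.size(0)
print('test accuracy:', correct / total)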