
Deep Learning Series 35: Getting Started with the transformers Library


1. Introduction

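First, install the library: pip install transformers
A list of pretrained models that can be downloaded for offline use, grouped by language, is available here: https://huggingface.co/languages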

2. Pipeline examples

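The simplest way to use the library is through a ready-made pipeline. The workflow behind it is shown below:
[Figure: workflow behind a pipeline]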
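As a rough illustration of that workflow, a pipeline can be thought of as three stages chained together: preprocessing, a model forward pass, and post-processing. The sketch below is a toy stand-in written in plain Python, not the library's actual code; the word lists, the scoring rule, and every function name are hypothetical.

```python
import math

# Toy stand-ins for the three stages; none of this is the library's real code.
POSITIVE_WORDS = {"happy", "good", "great"}   # hypothetical word lists
NEGATIVE_WORDS = {"sad", "bad", "terrible"}

def preprocess(text):
    """Stage 1: turn raw text into model inputs (here: lowercase word tokens)."""
    return text.lower().replace(".", " ").split()

def forward(tokens):
    """Stage 2: produce one raw score (logit) per class: [negative, positive]."""
    neg = sum(tok in NEGATIVE_WORDS for tok in tokens)
    pos = sum(tok in POSITIVE_WORDS for tok in tokens)
    return [float(neg), float(pos)]

def postprocess(logits, labels=("NEGATIVE", "POSITIVE")):
    """Stage 3: softmax the logits and report the most likely label."""
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

def toy_pipeline(text):
    """Chain the three stages, mirroring what a single pipeline call hides."""
    return postprocess(forward(preprocess(text)))

result = toy_pipeline("We are very happy to introduce pipeline to the transformers repository.")
print(result)
```

In the real library the first stage is a tokenizer (or a feature extractor for audio and images) and the second stage is a neural network; only the overall shape of the flow is meant to carry over.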
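Models can be found on Hugging Face. Take sentiment analysis as an example: the default pipeline handles English, so what should we do for Chinese?
First, search the model hub for a suitable model (the tasks and language filters on the left narrow the results):
[Figure: model hub with the tasks and language filters]
The example below uses the IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment model: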

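from transformers import BertForSequenceClassification
from transformers import BertTokenizer
import torch

# Load the tokenizer and the Chinese sentiment model from the Hugging Face hub
tokenizer = BertTokenizer.from_pretrained('IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment')
model = BertForSequenceClassification.from_pretrained('IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment')

text = '今天心情不好'  # "I am in a bad mood today"

# Encode the text, run the model without tracking gradients,
# and convert the logits into class probabilities
with torch.no_grad():
    output = model(torch.tensor([tokenizer.encode(text)]))
print(torch.nn.functional.softmax(output.logits, dim=-1))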

The model and tokenizer can be saved as follows:

pt_save_directory = "./pt_save_pretrained"
# Save the tokenizer and the model loaded above into the same directory
tokenizer.save_pretrained(pt_save_directory)
model.save_pretrained(pt_save_directory)

2.1 Overview

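Pretrained models are available for the following pipeline tasks:
"audio-classification": audio classification
"automatic-speech-recognition": speech recognition
"conversational": conversation
"feature-extraction": feature extraction
"fill-mask": masked-token filling
"image-classification": image classification
"question-answering": question answering
"table-question-answering": table question answering
"text2text-generation": text-to-text generation
"text-classification" (alias "sentiment-analysis"): text classification
"text-generation": text generation
"token-classification" (alias "ner"): token classification
"translation": translation
"translation_xx_to_yy": translation from language xx to language yy
"summarization": summarization
"zero-shot-classification": zero-shot classification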

A pipeline loads the following components:
[Figure: components loaded by a pipeline]

2.2 Sentiment analysis

from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier('We are very happy to introduce pipeline to the transformers repository.')

2.3 Question answering

from transformers import pipeline
question_answerer = pipeline('question-answering')
question_answerer({ 'question': 'What is the name of the repository ?', 'context': 'Pipeline has been included in the huggingface/transformers repository'})
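The question-answering pipeline is extractive: its result is a dictionary with the keys score, start, end, and answer, where start and end are character offsets into the context. The sketch below only illustrates how those offsets relate to the answer text. The question and context come from the example above, while the answer string is an assumption about what a model would select, and the helper span_result is hypothetical.

```python
# Context taken from the example above. The answer string is an assumption,
# not a recorded model output.
context = "Pipeline has been included in the huggingface/transformers repository"
assumed_answer = "huggingface/transformers"

def span_result(context, answer, score=None):
    """Build a dict shaped like the pipeline's result for a known answer span."""
    start = context.find(answer)
    if start == -1:
        raise ValueError("an extractive answer must occur verbatim in the context")
    end = start + len(answer)
    # The answer is always a slice of the context between the two offsets
    return {"score": score, "start": start, "end": end, "answer": context[start:end]}

result = span_result(context, assumed_answer)
print(result["start"], result["end"], result["answer"])
```

Because the answer is a slice of the context, the model cannot return text that does not appear there; the score is left as None here since no model is run.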

2.4 Speech recognition

from transformers import pipeline
import torch
from datasets import load_dataset, Audio

# Create the recognizer first: its feature extractor defines the expected sampling rate
speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
# Resample the audio to the sampling rate the model expects
dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate))
# Transcribe the first four recordings
result = speech_recognizer([a['array'] for a in dataset[:4]["audio"]])
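Resampling matters because a speech model expects audio at the sampling rate it was trained on. To show what changing the sampling rate means, here is a deliberately naive linear-interpolation resampler in plain Python. It is not how the datasets library resamples audio (that relies on a dedicated audio backend), and the rates used (8 kHz to 16 kHz) and the function name resample_linear are illustrative assumptions.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampling of a 1-D list of samples."""
    if not samples:
        return []
    n_out = round(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate             # fractional position in the source signal
        lo = min(int(pos), len(samples) - 1)      # sample at or before that position
        hi = min(lo + 1, len(samples) - 1)        # next sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Doubling the rate doubles the number of samples for the same duration
upsampled = resample_linear([0.0, 1.0, 2.0, 3.0], src_rate=8000, dst_rate=16000)
print(len(upsampled), upsampled)
```

Real resamplers also apply low-pass filtering to avoid aliasing, which this sketch omits.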

2.5 Text generation

from transformers import pipeline
generator = pipeline(task="text-generation")
generator("Eight people were kill at party in California.")

2.6 Image classification

from transformers import pipeline
vision_classifier = pipeline(task="image-classification")
vision_classifier(images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")

3. General usage

3.1 Encoding (tokenizer or feature extractor)

For text, you need to define a tokenizer. The tokenizer converts text into a dictionary, for example:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encoding = tokenizer("We are very happy to show you the 
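To give a feel for the dictionary a BERT-style tokenizer returns, the toy encoder below produces the same three keys (input_ids, token_type_ids, attention_mask) in plain Python. The vocabulary, the ids, the whitespace splitting, and the function name toy_encode are all hypothetical; a real checkpoint ships its own vocabulary and uses subword tokenization.

```python
# Hypothetical toy vocabulary; real checkpoints ship their own vocabulary and ids.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "we": 4, "are": 5, "very": 6, "happy": 7}

def toy_encode(text, max_length=8):
    """Return a dict with the same three keys a BERT-style tokenizer produces."""
    words = text.lower().split()[:max_length - 2]   # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [TOY_VOCAB["[PAD]"]] * pad,
        "token_type_ids": [0] * max_length,             # a single segment, so all zeros
        "attention_mask": [1] * len(ids) + [0] * pad,   # 1 = real token, 0 = padding
    }

encoding = toy_encode("We are very happy today")
print(encoding)
```

Padding is applied here only to make the role of attention_mask visible; the real tokenizer pads only when asked to, for example with padding=True.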