The Transformers library groups today's NLP tasks into the following categories.

The most basic object in the Transformers library is the pipeline() function, which wraps a pretrained model together with its corresponding pre-processing and post-processing steps. We only need to pass in text to get the expected answer. Commonly used pipelines include:

- feature-extraction (obtain vector representations of text)
- fill-mask (fill in masked words or spans)
- ner (named entity recognition)
- question-answering (automatic question answering)
- sentiment-analysis (sentiment analysis)
- summarization (automatic summarization)
- text-generation (text generation)
- translation (machine translation)
- zero-shot-classification (classification with zero training examples)
Below we take several common NLP tasks as examples to show how to call these pipeline models.

With the sentiment-analysis pipeline, we simply pass in text and get back its sentiment label (positive/negative) along with the corresponding probability:
```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I've been waiting for a HuggingFace course my whole life.")
print(result)
results = classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)
print(results)
```
```
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)

[{'label': 'POSITIVE', 'score': 0.9598048329353333}]
[{'label': 'POSITIVE', 'score': 0.9598048329353333}, {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
```
A pipeline model automatically performs three steps: it pre-processes the text into the model's input format, passes those inputs through the model, and post-processes the model's outputs into a human-readable answer.
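The three steps can be sketched in plain Python, with a stub standing in for the real network (the toy vocabulary, stub scoring rule, and label order below are illustrative assumptions, not the actual distilbert internals):

```python
import math

def preprocess(text):
    # Step 1: turn raw text into model inputs. A real tokenizer maps
    # subwords to vocabulary ids; here we fake ids from a toy vocab.
    vocab = {"i": 0, "love": 1, "hate": 2, "this": 3}
    return [vocab.get(tok, 4) for tok in text.lower().split()]

def stub_model(input_ids):
    # Step 2: the model maps inputs to raw, unnormalized scores (logits).
    # This stub just counts toy sentiment words.
    pos = sum(1 for i in input_ids if i == 1)
    neg = sum(1 for i in input_ids if i == 2)
    return [neg - pos, pos - neg]  # [NEGATIVE logit, POSITIVE logit]

def postprocess(logits):
    # Step 3: softmax the logits into probabilities and attach labels.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    labels = ["NEGATIVE", "POSITIVE"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

result = postprocess(stub_model(preprocess("I love this")))
print(result)  # label 'POSITIVE' with score > 0.5
```

The real pipeline follows the same shape, only with a learned tokenizer and a neural network in the middle.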
The pipeline automatically chooses a suitable pretrained model for the task. For sentiment analysis, for example, it defaults to the fine-tuned English sentiment model distilbert-base-uncased-finetuned-sst-2-english.

Note: the Transformers library downloads and caches the model when the object is created. The download only happens the first time the model is loaded; afterwards the cached copy is used directly.
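By default this cache lives under ~/.cache/huggingface; it can be relocated with an environment variable (a config sketch, the path is a placeholder):

```shell
# Move the Hugging Face cache (downloaded models etc.) to another disk.
export HF_HOME=/data/hf_cache
```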
The zero-shot-classification pipeline lets us define our own classification labels without providing any labeled data.
```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
print(result)
```
```
No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)

{'sequence': 'This is a course about the Transformers library', 'labels': ['education', 'business', 'politics'], 'scores': [0.8445973992347717, 0.11197526752948761, 0.043427325785160065]}
```

Analysis: we classify the input sentence against three candidate labels (education, politics, business), and the model assigns the "education" label a probability of about 0.8446. This is similar to multi-label news classification.
As you can see, the pipeline automatically selected the pretrained facebook/bart-large-mnli model for the task.
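bart-large-mnli is a natural language inference (NLI) model: the pipeline pairs the input with a hypothesis such as "This example is about education." for each candidate label, then normalizes the per-label entailment scores so they sum to 1. A minimal sketch of that final normalization step (the entailment logits below are made-up numbers, not real model outputs):

```python
import math

def zero_shot_scores(labels, entailment_logits):
    # Softmax the per-label entailment logits so the scores sum to 1,
    # then sort labels from most to least likely.
    exps = [math.exp(x) for x in entailment_logits]
    total = sum(exps)
    ranked = sorted(zip(labels, (e / total for e in exps)),
                    key=lambda pair: pair[1], reverse=True)
    return ranked

# Hypothetical entailment logits for the three candidate labels.
print(zero_shot_scores(["education", "politics", "business"], [2.1, -0.9, 0.2]))
```

This is why the returned scores for a single input always sum to 1 in the default (single-label) mode.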
For text generation, we first build a prompt according to the task and feed it to the model to generate the continuation. Note that text generation is stochastic, so each run produces a different result.
```python
from transformers import pipeline

generator = pipeline("text-generation")
results = generator("In this course, we will teach you how to")
print(results)
results = generator(
    "In this course, we will teach you how to",
    num_return_sequences=2,
    max_length=50
)
print(results)
```
Code explanation: we pass in a sentence, ask for at most 2 generated sequences, and cap the maximum length at 50 tokens.
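The randomness comes from sampling the next token from the model's predicted probability distribution at each step. A minimal sketch of one such sampling step, using top-k sampling (the toy vocabulary and probabilities are invented for illustration):

```python
import random

def sample_next_token(probs, k=2, rng=random):
    # Keep only the k most likely tokens (top-k sampling), renormalize
    # their probabilities, then draw one token according to those weights.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return rng.choices([t for t, _ in top],
                       weights=[p / total for _, p in top])[0]

# Toy next-token distribution after "In this course, we will teach you how to"
probs = {"use": 0.40, "build": 0.30, "make": 0.20, "zebra": 0.10}
rng = random.Random(0)  # seeding the generator makes the draw reproducible
print(sample_next_token(probs, k=2, rng=rng))  # either "use" or "build"
```

Repeating the draw step by step is what makes each run of the generator produce a different continuation.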
```
No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)

[{'generated_text': "In this course, we will teach you how to use data and models that can be applied in any real-world, everyday situation. In most cases, the following will work better than other courses I've offered for an undergrad or student. In order"}]
[{'generated_text': 'In this course, we will teach you how to make your own unique game called "Mono" from scratch by doing a game engine, a framework and the entire process starting with your initial project. We are planning to make some basic gameplay scenarios and'}, {'generated_text': 'In this course, we will teach you how to build a modular computer, how to run it on a modern Windows machine, how to install packages, and how to debug and debug systems. We will cover virtualization and virtualization without a programmer,'}]
```

The pipeline automatically selected the pretrained gpt2 model for the task. We can also specify which model to use: for text generation, select the Text Generation tag on the left of the Model Hub page to browse the supported models. For example, we can load the distilgpt2 model into the same pipeline:
```python
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
results = generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
print(results)
```

```
[{'generated_text': 'In this course, we will teach you how to use React in any form, and how to use React without having to worry about your React dependencies because'},
 {'generated_text': 'In this course, we will teach you how to use a computer system in order to create a working computer. It will tell you how you can use'}]
```
We can also pick models in other languages via the language tags on the left. For example, load the gpt2-chinese-poem model, which is specialized for generating classical Chinese poetry:
```python
from transformers import pipeline

generator = pipeline("text-generation", model="uer/gpt2-chinese-poem")
results = generator(
    "[CLS] 万 叠 春 山 积 雨 晴 ,",
    max_length=40,
    num_return_sequences=2,
)
print(results)
```

```
[{'generated_text': '[CLS] 万 叠 春 山 积 雨 晴 , 孤 舟 遥 送 子 陵 行 。 别 情 共 叹 孤 帆 远 , 交 谊 深 怜 一 座 倾 。 白 日 风 波 身 外 幻'},
 {'generated_text': '[CLS] 万 叠 春 山 积 雨 晴 , 满 川 烟 草 踏 青 行 。 何 人 唤 起 伤 春 思 , 江 畔 画 船 双 橹 声 。 桃 花 带 雨 弄 晴 光'}]
```
Fill-mask: given a piece of text in which some words have been masked out, a pretrained model predicts the words that could fill those positions.
```python
from transformers import pipeline

unmasker = pipeline("fill-mask")
results = unmasker("This course will teach you all about <mask> models.", top_k=2)
print(results)
```
With top_k=2, two candidate sequences are returned, each filling the mask with a different word and each carrying its own score. As you can see, the pipeline automatically selected the pretrained distilroberta-base model for the task.

```
No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)

[{'sequence': 'This course will teach you all about mathematical models.', 'score': 0.19619858264923096, 'token': 30412, 'token_str': ' mathematical'}, {'sequence': 'This course will teach you all about computational models.', 'score': 0.04052719101309776, 'token': 38163, 'token_str': ' computational'}]
```
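Conceptually, top_k just controls how many of the highest-scoring candidate fills are kept; the selection step amounts to a sort (the candidate words and scores below are invented for illustration):

```python
def top_k_fills(candidates, top_k=2):
    # Sort candidate fills by score (descending) and keep the best top_k.
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:top_k]

candidates = [
    {"token_str": "computational", "score": 0.041},
    {"token_str": "mathematical", "score": 0.196},
    {"token_str": "banana", "score": 0.001},
]
print(top_k_fills(candidates, top_k=2))
```

This is why the returned list is always ordered from the most to the least likely fill.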
The named entity recognition (NER) pipeline extracts entities of specified types from text, such as people, locations, and organizations.
```python
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
results = ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
print(results)
```

```
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)

[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
 {'entity_group': 'ORG', 'score': 0.97960186, 'word': 'Hugging Face', 'start': 33, 'end': 45},
 {'entity_group': 'LOC', 'score': 0.99321055, 'word': 'Brooklyn', 'start': 49, 'end': 57}]
```
As you can see, the model correctly identified Sylvain as a person (PER), Hugging Face as an organization (ORG), and Brooklyn as a location (LOC).

Here the parameter grouped_entities=True makes the pipeline automatically merge the subword tokens that belong to the same entity: "Hugging" and "Face" are merged into a single organization entity, and Sylvain is in fact also assembled from subwords, since the tokenizer splits it into the four tokens S, ##yl, ##va, and ##in.
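The merging that grouped_entities=True performs can be sketched for WordPiece-style tokens, where a continuation piece starts with ## (a simplified illustration; the real pipeline also merges scores and character offsets):

```python
def merge_wordpieces(tokens):
    # Rebuild whole words from WordPiece tokens: a token starting with
    # "##" continues the previous word, anything else starts a new word.
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

print(merge_wordpieces(["S", "##yl", "##va", "##in"]))  # ['Sylvain']
```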
The question-answering pipeline answers questions based on a given context, for example:
```python
from transformers import pipeline

question_answerer = pipeline("question-answering")
answer = question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
print(answer)
```

```
No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)

{'score': 0.6949771046638489, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
```
As you can see, the pipeline automatically selected a distilbert-base model fine-tuned on the SQuAD dataset. This question-answering pipeline is actually an extractive QA model: it extracts the answer from the given context rather than generating it.

Depending on the form of the answer, question answering (QA) systems fall into three kinds:

- Extractive QA: assumes the answer is contained in the document, and extracts it directly from the document;
- Multiple-choice QA: chooses the answer from several given options, like a reading-comprehension exam question;
- Free-form QA: generates the answer text directly, with no constraint on the answer's format.
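Because the pipeline is extractive, the start and end fields in its output are character offsets into the context, and the answer is literally a slice of that string:

```python
# The context and output offsets from the question-answering example above.
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
answer = {"score": 0.6949771046638489, "start": 33, "end": 45}

# An extractive model predicts start/end positions; the answer text is
# recovered simply by slicing the original context.
print(context[answer["start"]:answer["end"]])  # Hugging Face
```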
The summarization pipeline compresses a long text into a short one while preserving as much of the original's key information as possible, for example:
```python
from transformers import pipeline

summarizer = pipeline("summarization")
results = summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.
    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
    """
)
print(results)
```
```
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)

[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil, electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India, as well as other industrial countries in Europe and Asia, continue to encourage and advance engineering .'}]
```
As you can see, the pipeline automatically selected the pretrained distilbart-cnn-12-6 model for the task. As with text generation, we can control the length of the returned summary with the max_length and min_length parameters.