
The Transformers Library and LLM Inference

Author: IT小白 | 2024-08-04 06:35:05


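The Transformers library

Transformers is an open-source Python library for natural language processing (NLP) tasks. It provides tools for using and training pretrained Transformer models.
Almost every open-source LLM can be used through Transformers. Its concise, consistent API greatly lowers the barrier to using open-source LLMs. The library is also compatible with PyTorch, so developers can still perform lower-level operations when they need custom behavior.

Official documentation: https://huggingface.co/docs/transformers/index
Source code: https://github.com/huggingface/transformers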

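LLM inference: Pipelines

Pipelines are the simplest LLM inference tool in Transformers. For users who only need to run inference and do not need any custom behavior, a pipeline is usually the best choice.

Using a pipeline takes two steps:

Call the pipeline function to instantiate a Pipeline object for the desired task and model.
Call that pipeline object to get the output.

pipeline supports many tasks, which are listed in the official Transformers documentation. For LLM inference, the task is usually text-generation.

Examples

Llama 2 inference (from the Hugging Face Llama 2 blog):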

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Llama 3 inference (from the Transformers Llama 3 documentation):

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B"

pipeline = transformers.pipeline("text-generation", model=model_id, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto")
pipeline("Hey how are you doing today?")

Function reference

Instantiating a pipeline

Source: https://github.com/huggingface/transformers/blob/f5c0fa9f6fe0eea2ad69bb1b03aff04824aa4870/src/transformers/pipelines/__init__.py#L562
Documentation: https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#pipelines

pipeline = transformers.pipeline("text-generation", model=model_id)

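Common parameters:

Parameter	Type	Description
task	str	Name of the inference task; for LLM inference this is usually "text-generation"
model	Optional[Union[str, "PreTrainedModel", "TFPreTrainedModel"]]	The model used for inference: either a path (a local path or a Hugging Face Hub model id) or an already loaded model object
device_map	`str` or `Dict[str, Union[int, str, torch.device]]`, optional	A specific device such as cuda:1 (supported in newer versions of transformers), or auto to place the model automatically
torch_dtype	`str` or `torch.dtype`, optional	The dtype used to load the weights: either auto or an explicit dtype. Different models suit different dtypes; if unsure, auto or bfloat16 is a reasonable choice. For Llama 2, see the note on its data type near the end of this article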

Calling the text generation pipeline

pipeline(text_inputs, **kwargs)

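Documentation: https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.TextGenerationPipeline

Common parameters:

Parameter	Type	Description
text_inputs	str, List[str], List[Dict[str, str]], or List[List[Dict[str, str]]]	A string or a list of strings is used directly as the prompt(s). A list of dicts (a conversation with different roles) is first converted to a prompt with the chat template of the model's tokenizer. The dict format is described later.
return_tensors	bool, optional, defaults to False	Whether to return the generated token ids (not decoded) instead of text
return_text	bool, optional, defaults to True	Whether to return the decoded text
return_full_text	bool, optional, defaults to True	If set to False, only the newly generated text is returned, without the prompt
generate_kwargs	dict, optional	Parameters that control generation, such as the maximum length. See the Llama 2 example above for usage and the generate section below for the parameters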

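Return value:
A list of texts or of token ids, depending on the arguments passed.

When text is returned:
Single input: sample
Multiple inputs: [sample1, sample2, sample3]

Each sample has the following format: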

[{"generated_text": generated_content}, ...]

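If text_inputs is a List[dict], generated_content is also returned as a List[dict]: the input messages with an extra {"role": "assistant", "content": …} entry appended. Otherwise it is a plain string.

Note that for batched input the output is a list nested two levels deep. For example, the newly generated text of the i-th sample is outputs[i][0]["generated_text"][len(batch_prompt[i]) :]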
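
The nested indexing above can be wrapped in a small helper. The sketch below is illustrative only: new_text is a hypothetical helper name, and outputs is a hand-written stand-in with the shape described above, not real model output. It assumes return_full_text is left at its default, so a string output begins with the prompt.

```python
from typing import Any, Dict, List, Union

Message = Dict[str, str]
Sample = List[Dict[str, Any]]


def new_text(sample: Sample, prompt: Union[str, List[Message]]) -> str:
    """Return only the newly generated text of the first sequence in one sample."""
    generated = sample[0]["generated_text"]
    if isinstance(generated, str):
        # String prompt: the output repeats the prompt, so slice it off.
        if isinstance(prompt, str) and generated.startswith(prompt):
            return generated[len(prompt):]
        return generated
    # Chat prompt: the output is the message list with the assistant reply appended.
    return generated[-1]["content"]


# Hand-written stand-in for the result of a batched pipeline call.
batch_prompts = ["Hello, ", "2 + 2 = "]
outputs = [
    [{"generated_text": "Hello, world!"}],
    [{"generated_text": "2 + 2 = 4"}],
]
texts = [new_text(sample, prompt) for sample, prompt in zip(outputs, batch_prompts)]
print(texts)  # ['world!', '4']

# Stand-in for a chat-style call on a single conversation.
chat_prompt = [{"role": "user", "content": "Who are you?"}]
chat_output = [{"generated_text": chat_prompt + [{"role": "assistant", "content": "A pirate!"}]}]
print(new_text(chat_output, chat_prompt))  # A pirate!
```

Passing return_full_text=False to the pipeline call avoids the slicing for string prompts altogether.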

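Advanced usage

Some scenarios require changing the details of inference or training the model, which calls for lower-level tools. For LLMs, the two most commonly used are the model and the tokenizer.

Model: plays the same role as a model in PyTorch. Transformers models in fact inherit from PyTorch's torch.nn.Module and add functionality on top of it.
Tokenizer: encodes strings into the tokens that the LLM takes as input.

Model

Loading a model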

model = transformers.AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path)

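The Auto classes (here AutoModelForCausalLM) can load almost any model on Hugging Face. Pass the model path, either a local directory or a Hub model id, to load the corresponding model.

Common parameters:

Parameter	Type	Description
pretrained_model_name_or_path	str or os.PathLike	Path of the model
device_map	`str` or `Dict[str, Union[int, str, torch.device]]` or `int` or `torch.device`, optional	The device to load the model onto: either auto or an explicit device
torch_dtype	`str` or `torch.dtype`, optional	Data type of the weights

These are essentially the same as the model arguments passed to pipeline above.

Using a model

A Transformers model is used much like a PyTorch model. Individual model developers may add their own features, so consult the documentation or the source for details. The following uses the commonly used Llama model to outline where it differs from a plain PyTorch model.
Official documentation: https://huggingface.co/docs/transformers/main/en/model_doc/llama2
Source: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1084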

def forward()

def forward(
    self,
    input_ids: torch.LongTensor = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
    inputs_embeds: Optional[torch.FloatTensor] = None,
    labels: Optional[torch.LongTensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    cache_position: Optional[torch.LongTensor] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:

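Notable parameters:

labels: if labels is passed, the loss is computed automatically and is available as loss = output.loss (this is the loss value, not the gradient). The labels do not need to be shifted beforehand, that is, moved by one position so that each position's target is the next token. The model shifts them internally before computing the loss, so labels can simply be a copy of input_ids. See the model source linked above for details.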
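
To make the internal shift concrete, here is a framework-free sketch of the computation as I understand it: the logits at position t are scored against labels[t + 1], and positions labelled -100 are skipped. shifted_causal_lm_loss is a hypothetical name for this illustration. It is not the library's implementation, which works on batched tensors.

```python
import math
from typing import List, Sequence

IGNORE_INDEX = -100  # label value that the loss skips


def shifted_causal_lm_loss(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean cross-entropy where the logits at position t are scored against labels[t + 1]."""
    losses: List[float] = []
    # Drop the last row of logits and the first label: this is the shift.
    for row, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        log_norm = math.log(sum(math.exp(x) for x in row))
        losses.append(log_norm - row[target])
    if not losses:
        raise ValueError("every label is ignored, so the loss is undefined")
    return sum(losses) / len(losses)


# Toy sequence of 3 tokens over a vocabulary of size 2; labels are a copy of input_ids.
input_ids = [0, 1, 1]
logits = [[0.0, 0.0], [0.0, 0.0], [5.0, -5.0]]  # the last row is never scored
loss = shifted_causal_lm_loss(logits, labels=input_ids)
print(round(loss, 4))  # 0.6931, i.e. ln(2) for uniform predictions
```

Because the last row of logits has no next token to predict, it never contributes to the loss, which is why the caller does not have to align anything by hand.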

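def generate()

The generate method produces text autoregressively. A pipeline also performs its text generation through generate.

Parameter	Type	Description
inputs	torch.Tensor of varying shape depending on the modality, optional	A batch of prompts, usually as encoded token ids; inputs.shape[0] is the batch size
generation_config	GenerationConfig, optional	Text generation settings; common ones are listed in the next table

Some commonly set parameters:

Parameter	Type	Description
do_sample	bool	Whether to sample the next token randomly; if False, decoding is deterministic
max_length	int, optional	Maximum length of prompt plus generated tokens. If the prompt is already longer than max_length, an error may be raised
max_new_tokens	int, optional	Maximum number of tokens generated by the LLM, not counting the prompt

For more parameters, see the GenerationConfig section of the HF documentation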
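
The difference between max_length and max_new_tokens is easy to get wrong, so here is a small arithmetic sketch. generation_budget is a hypothetical helper, not part of Transformers. It assumes the documented behavior that max_new_tokens takes precedence when both are set.

```python
from typing import Optional


def generation_budget(prompt_len: int, max_length: int, max_new_tokens: Optional[int] = None) -> int:
    """Upper bound on the number of tokens that generation may add to the prompt."""
    if max_new_tokens is not None:
        # max_new_tokens counts generated tokens only and wins when both are set.
        return max_new_tokens
    # max_length counts prompt + generated tokens, so a long prompt can leave no room.
    return max(max_length - prompt_len, 0)


print(generation_budget(prompt_len=30, max_length=200))                      # 170
print(generation_budget(prompt_len=250, max_length=200))                     # 0
print(generation_budget(prompt_len=250, max_length=200, max_new_tokens=64))  # 64
```

This is only an upper bound: generation can stop earlier, for example when the model emits its end-of-sequence token. For prompts of varying length, max_new_tokens is usually the more predictable setting.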

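Tokenizer

HF documentation: https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#transformers.PreTrainedTokenizer

A tokenizer converts strings to tokens and tokens back to strings. Its main capabilities are:

Converting a single string to tokens, or tokens back to a string.
Converting a batch of strings to tokens, or a batch of tokens back to strings. When encoding a batch it can pad automatically and return the matching attention mask.
Converting conversation data into the input string expected by a chat LLM.

Creating a tokenizer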

tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)

Encoding

The encoding methods convert strings into tokens.

`__call__`

The most commonly used method; it can be thought of as a batch encode.

Common parameters:

Parameter	Type	Description
text	(str, List[str], List[List[str]], optional)	The string(s) to encode
padding	bool, str or PaddingStrategy, optional, defaults to False	Whether to pad automatically. Supported strategies include max_length and longest
truncation	bool, str or TruncationStrategy, optional, defaults to False	Whether to truncate the sequences
max_length	int, optional	Maximum length used for padding or truncation
return_tensors	str or TensorType, optional	Whether to return tensors (the default is Python lists). Accepts `pt` (PyTorch), `tf` (TensorFlow), or `np` (NumPy).

Return type: BatchEncoding
It can be accessed like a dict and contains these key fields:

input_ids: List of token ids to be fed to a model.
attention_mask: List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names). In the returned mask, 0 marks positions that are masked out (padding).

Typical usage:

inputs = tokenizer(prompts, return_tensors='pt', padding=True).to(device)
outputs = model(**inputs)

`encode`

Similar to __call__, but mainly used to encode a single sentence.

Output: List[int], torch.Tensor, tf.Tensor or np.ndarray

Decoding

The decoding methods convert tokens back into strings.

`batch_decode`

Input: sequences (Union[List[int], List[List[int]], np.ndarray, torch.Tensor, tf.Tensor]): List of tokenized input ids.
Output: List[str] The list of decoded sentences.

`decode`

Input: token_ids (Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]): List of tokenized input ids.
Output: str The decoded sentence.
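
Putting the pieces together, the following is a minimal sketch of batched generation without a pipeline: encode with __call__, call generate, then decode with batch_decode. generate_batch and strip_prompt are hypothetical helper names, and several choices here are assumptions: reusing the EOS token for padding when the tokenizer has no pad token, left padding for generation, and device_map="auto" (which needs the accelerate package).

```python
from typing import List


def strip_prompt(output_ids: List[List[int]], prompt_len: int) -> List[List[int]]:
    """Drop the first prompt_len positions of every row, keeping only generated ids."""
    return [row[prompt_len:] for row in output_ids]


def generate_batch(model_id: str, prompts: List[str], max_new_tokens: int = 32) -> List[str]:
    """Encode a batch of prompts, generate continuations, and decode them to strings."""
    # Imported here so strip_prompt can be used without loading the heavy library.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.padding_side = "left"  # keep each prompt adjacent to its generated tokens
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS for padding

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # For decoder-only models the output holds prompt + continuation; with left
    # padding every padded prompt has the same length, so one slice fits all rows.
    new_ids = strip_prompt(output_ids.tolist(), inputs["input_ids"].shape[1])
    return tokenizer.batch_decode(new_ids, skip_special_tokens=True)
```

A call would look like generate_batch("meta-llama/Llama-2-7b-chat-hf", ["Hello, my name is"]), which downloads the weights and requires access to that model.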

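Building a chat prompt

Many LLMs support conversation. The user's messages must be converted to the model's input string according to the model's template, which is what the tokenizer's apply_chat_template method does.
For example: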

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

Common parameters of apply_chat_template:

Parameter	Type	Description
conversation	Union[List[Dict[str, str]], List[List[Dict[str, str]]]]	A list of dicts with "role" and "content" keys, representing the chat history so far.
tokenize	bool, defaults to True	Whether to tokenize the output. If False, the output will be a string.
padding	bool, defaults to False	Whether to pad sequences to the maximum length. Has no effect if tokenize is False.
truncation	bool, defaults to False	Whether to truncate sequences at the maximum length. Has no effect if tokenize is False.
return_tensors	str or TensorType, optional	If set, will return tensors of a particular framework. Has no effect if tokenize is False.
add_generation_prompt	bool, defaults to False	Whether to end the prompt with the token(s) that indicate the start of an assistant message. This is useful when you want to generate a response from the model. Note that this argument will be passed to the chat template, and so it must be supported in the template for this argument to have any effect.

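Return types

Return type	Input condition	Notes
List[int]	Default settings, single conversation
List[List[int]]	Input is a batch of conversations
str	tokenize=False with a single conversation
List[str]	tokenize=False with multiple conversations
torch.Tensor	tokenize=True and return_tensors='pt'	Shape is [batch_size, seq_len]

This method is similar to __call__, except that its input is a conversation given as a list of dicts. With tokenize=False it returns the string produced by the chat template; otherwise it returns token ids, much like the output of __call__.

Format of conversation: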

[{"role": role, "content": content}, ...]

Example:

[{"role": "system", "content": "You are a helpful assistant"},
 {"role": "user", "content": "What is the capital city of China?"}]
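
To show what a chat template does with such a conversation, here is a toy renderer. The <|role|> markers are made up for this sketch and do not match the template of Llama or any other real model; render_chat is a hypothetical function. It only illustrates how add_generation_prompt changes the end of the prompt.

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render messages with a made-up template; every real model defines its own."""
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so that the model continues with its reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the capital city of China?"},
]
print(render_chat(messages, add_generation_prompt=True))
```

With a real tokenizer, tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) returns the equivalent string in the model's own format, which is a convenient way to inspect the prompt before tokenizing it.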

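Others

Padding side

Most models pad on the right, which is also the usual tokenizer default.
For LLM inference, however, padding is usually set to the left, so that the last token of the prompt is directly followed by the first generated token. Combined with attention_mask, this keeps the output for a padded prompt the same as for the unpadded prompt.
In Transformers, set tokenizer.padding_side = 'left' and every subsequent encode call pads on the left.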
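
The effect of the padding side on input_ids and attention_mask can be seen in a plain-Python sketch. pad_batch is a hypothetical stand-in for what the tokenizer does when padding=True, and the token ids are arbitrary.

```python
from typing import Dict, List


def pad_batch(seqs: List[List[int]], pad_id: int, side: str = "left") -> Dict[str, List[List[int]]]:
    """Pad token id lists to the longest length and build the matching attention mask."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(s) for s in seqs)
    input_ids, attention_mask = [], []
    for s in seqs:
        pad = [pad_id] * (width - len(s))
        mask_pad, mask_real = [0] * len(pad), [1] * len(s)
        if side == "left":
            input_ids.append(pad + s)
            attention_mask.append(mask_pad + mask_real)
        else:
            input_ids.append(s + pad)
            attention_mask.append(mask_real + mask_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[11, 12, 13], [21]], pad_id=0, side="left")
print(batch["input_ids"])       # [[11, 12, 13], [0, 0, 21]]
print(batch["attention_mask"])  # [[1, 1, 1], [0, 0, 1]]
```

With left padding the last column always holds a real token, so the first generated token follows the prompt directly. With right padding the shorter prompt would be followed by pad tokens before generation starts.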

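Llama 2 data type

The dtype of Llama 2 is a little unusual: the model was trained in bfloat16, while inference uses float16. float16 is fine for normal use, but for training, or if NaN values appear, switch to bfloat16. See the HF notes on Llama 2.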
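
One likely reason float16 is more prone to NaN is its narrow range: the largest finite float16 is 65504, whereas bfloat16 keeps the exponent range of float32 at the cost of precision. The standard-library sketch below illustrates this. to_bfloat16 truncates the low bits, which is a simplification (real conversions usually round to nearest), and both function names are hypothetical.

```python
import struct


def fits_float16(x: float) -> bool:
    """True if x can be packed as an IEEE 754 half-precision value."""
    try:
        struct.pack("<e", x)
        return True
    except (OverflowError, struct.error):
        return False


def to_bfloat16(x: float) -> float:
    """Truncate a float32 to bfloat16 by keeping only its upper 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


print(fits_float16(65504.0))  # True, the largest finite float16
print(fits_float16(70000.0))  # False, out of range
print(to_bfloat16(70000.0))   # 69632.0, in range but with fewer significant bits
```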

(To be continued)

Training

Data collator function

DataCollatorForSeq2Seq

Mainly used for SFT training. Pass it to a loader as data_loader = torch.utils.data.DataLoader(dataset, collate_fn=data_collator_fn), which removes the need to write your own padding function.

Each item of dataset has this format:

{
    'input_ids': input_ids,
    'labels': labels
}

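For such tasks, input_ids and labels are usually identical. To compute the loss only on the text the LLM should generate, replace the label positions that should not contribute to the loss with -100. A simple example: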

return {
    "input_ids": prompt_ids + target_ids + [self.eos_token_id],
    "labels": [-100] * len(prompt_ids) + target_ids + [self.eos_token_id]
}

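Common parameters:

Parameter	Type	Description
tokenizer	PreTrainedTokenizerBase	The tokenizer, used to look up the pad token id. If the tokenizer has no pad token (as with Llama), remember to set one beforehand
return_tensors	str, optional, defaults to "pt"	Type of the returned tensors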
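
As a rough picture of what the collator does with a batch, the sketch below pads input_ids with the pad token id and labels with -100 to a common length. collate is a hypothetical plain-Python stand-in, not the library's implementation. It assumes right padding and returns lists rather than tensors, and the token ids are arbitrary.

```python
from typing import Dict, List

LABEL_PAD_ID = -100  # positions with this label are ignored by the loss


def collate(features: List[Dict[str, List[int]]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad input_ids with pad_token_id and labels with -100 to a common length."""
    width = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = width - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * n_pad)
    return batch


features = [
    {"input_ids": [5, 6, 7, 2], "labels": [-100, -100, 7, 2]},  # prompt [5, 6], target [7], EOS 2
    {"input_ids": [8, 9, 2], "labels": [-100, 9, 2]},           # prompt [8], target [9], EOS 2
]
batch = collate(features, pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7, 2], [8, 9, 2, 0]]
print(batch["labels"])     # [[-100, -100, 7, 2], [-100, 9, 2, -100]]
```

In real code, DataCollatorForSeq2Seq(tokenizer, return_tensors="pt") can be passed as collate_fn and returns tensors. Padding labels with -100 instead of the pad token id keeps padded positions out of the loss.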

Learning Rate Schedules
