
HuggingFace generation models and beam search

GPT2

Training

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    outputs = model(**inputs, labels=inputs["input_ids"])

    loss = outputs.loss
    print(loss)
    logits = outputs.logits
    print(logits)

You can see that this tokenizer defines both a bos and an eos token.
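
A quick, minimal check of these special tokens (for GPT-2 both attributes point to the same <|endoftext|> token):

    # Minimal check of the GPT-2 tokenizer's special tokens
    print(tokenizer.bos_token)     # '<|endoftext|>'
    print(tokenizer.eos_token)     # '<|endoftext|>'
    print(tokenizer.bos_token_id)  # 50256
    print(tokenizer.eos_token_id)  # 50256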

inference / generate

The default decoding strategy is greedy search.

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
    # ipdb> inputs.keys()
    # dict_keys(['input_ids', 'attention_mask'])

    generation_output = model.generate(
        # **inputs,
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        return_dict_in_generate=True,
        output_scores=True,
    )

    gen_texts = tokenizer.batch_decode(
        generation_output['sequences'],
        skip_special_tokens=True,
    )
    print(gen_texts)

Note that for a decoder-only model like GPT-2, the generated sequences include the input prompt.
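
If you only want the newly generated continuation, a minimal sketch (reusing the `inputs` and `generation_output` from above) is to slice off the prompt length:

    # Strip the prompt from the generated sequences (decoder-only models prepend it)
    prompt_len = inputs['input_ids'].shape[1]
    new_tokens = generation_output['sequences'][:, prompt_len:]
    print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))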

beam_search

You only need to pass a num_beams argument.

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
    # ipdb> inputs.keys()
    # dict_keys(['input_ids', 'attention_mask'])

    generation_output = model.generate(
        # **inputs,
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        return_dict_in_generate=True,
        output_scores=True,
        num_beams=5,
        max_length=50,
        early_stopping=True,
    )

    gen_texts = tokenizer.batch_decode(
        generation_output['sequences'],
        skip_special_tokens=True,
    )
    print(gen_texts)

With early_stopping=True, beam search finishes as soon as num_beams complete candidates (i.e. candidates that have reached the EOS token) have been found, rather than continuing to search for potentially better beams.

You can see that the output still contains repetition.

Introducing an n-gram penalty

A simple remedy is to introduce an n-gram (i.e., word sequence of n words) penalty, as introduced by Paulus et al. (2017) and Klein et al. (2017). The most common n-gram penalty makes sure that no n-gram appears twice, by manually setting the probability of next words that would create an already-seen n-gram to 0.

However, n-gram penalties have to be used with care. An article about New York City should not use a 2-gram penalty, otherwise the city's name would appear only once in the entire text!

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
    # ipdb> inputs.keys()
    # dict_keys(['input_ids', 'attention_mask'])

    generation_output = model.generate(
        # **inputs,
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        return_dict_in_generate=True,
        output_scores=True,
        num_beams=5,
        max_length=50,
        early_stopping=True,
        no_repeat_ngram_size=2,
        # use n-gram penalties with care: an article about New York City should not
        # use a 2-gram penalty, or the city's name would appear only once in the text
    )

    gen_texts = tokenizer.batch_decode(
        generation_output['sequences'],
        skip_special_tokens=True,
    )
    print(gen_texts)

Returning multiple beam search results

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
    # ipdb> inputs.keys()
    # dict_keys(['input_ids', 'attention_mask'])

    generation_output = model.generate(
        # **inputs,
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        return_dict_in_generate=True,
        output_scores=True,
        num_beams=5,
        max_length=50,
        early_stopping=True,
        num_return_sequences=5,
    )

    for i, beam_output in enumerate(generation_output['sequences']):
        print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

As shown above, beam search output still suffers from repetition, and the n-gram penalty is a rather blunt fix that cannot be used everywhere. A better alternative is sampling.

As argued by Ari Holtzman et al. (2019), high-quality human language does not follow a distribution of high-probability next words. In other words, as humans, we want generated text to surprise us, not to be boring or predictable. The authors show this nicely by plotting the probability a model assigns to human text against what beam search produces.

This is beam-search multinomial sampling:

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
    # ipdb> inputs.keys()
    # dict_keys(['input_ids', 'attention_mask'])

    generation_output = model.generate(
        # **inputs,
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        return_dict_in_generate=True,
        output_scores=True,
        num_beams=5,
        max_length=50,
        early_stopping=True,
        do_sample=True,
    )

    gen_texts = tokenizer.batch_decode(
        generation_output['sequences'],
        skip_special_tokens=True,
    )
    print(gen_texts)

The do_sample=True argument enables strategies such as multinomial sampling, beam-search multinomial sampling, top-k sampling, and top-p (nucleus) sampling. All of these strategies select the next token from the probability distribution over the full vocabulary, with various strategy-specific adjustments.
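
For example, a minimal sketch (reusing the model, tokenizer and inputs from above) combining top-k and top-p sampling:

    # Top-k + top-p (nucleus) sampling -- a sketch, not the only way to combine them
    generation_output = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        do_sample=True,      # enable sampling
        top_k=50,            # keep only the 50 highest-probability next tokens
        top_p=0.95,          # nucleus sampling: smallest token set with cumulative prob >= 0.95
        max_length=50,
        return_dict_in_generate=True,
    )
    print(tokenizer.batch_decode(generation_output['sequences'], skip_special_tokens=True))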

How to generate text: using different decoding methods for language generation with Transformers

Inspecting a model's generation-related config parameters

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    print(model.generation_config)

The printed model.generation_config only shows values that differ from the default generation configuration; it does not list any of the defaults.

The default max_length is 20.

The default decoding strategy is greedy search.
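
A quick check of those defaults (attribute names as in GenerationConfig):

    # max_length defaults to 20; do_sample defaults to False, i.e. greedy search
    print(model.generation_config.max_length)   # 20
    print(model.generation_config.do_sample)    # False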

Saving a (commonly used) generation configuration

    from transformers import GPT2Tokenizer, GPT2LMHeadModel, GenerationConfig

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    generation_config = GenerationConfig(
        max_new_tokens=50,
        do_sample=True,
        top_k=50,
        eos_token_id=model.config.eos_token_id,
    )
    generation_config.save_pretrained("my_generation_config")
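
A minimal sketch of loading this saved configuration back and passing it to generate() (assuming a tokenizer and `inputs` as in the earlier GPT-2 examples):

    from transformers import GenerationConfig

    # reload the saved configuration and use it for generation
    generation_config = GenerationConfig.from_pretrained("my_generation_config")
    outputs = model.generate(**inputs, generation_config=generation_config)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))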

Another way to write generation

greedy search

    from transformers import GPT2Tokenizer, GPT2LMHeadModel, LogitsProcessorList, MinLengthLogitsProcessor, StoppingCriteriaList, MaxLengthCriteria

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # set pad_token_id to eos_token_id because GPT2 does not have a PAD token
    model.generation_config.pad_token_id = model.generation_config.eos_token_id

    inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
    # ipdb> inputs.keys()
    # dict_keys(['input_ids', 'attention_mask'])

    # instantiate logits processors
    logits_processor = LogitsProcessorList(
        [
            MinLengthLogitsProcessor(10, eos_token_id=model.generation_config.eos_token_id),
        ]
    )
    stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])

    generation_output = model.greedy_search(
        inputs['input_ids'],
        logits_processor=logits_processor,
        stopping_criteria=stopping_criteria,
    )

    gen_texts = tokenizer.batch_decode(
        generation_output,
        skip_special_tokens=True,
    )
    print(gen_texts)

beam search

You can even pass in the encoder's outputs directly!

    from transformers import (
        AutoTokenizer,
        AutoModelForSeq2SeqLM,
        LogitsProcessorList,
        MinLengthLogitsProcessor,
        BeamSearchScorer,
    )
    import torch

    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

    encoder_input_str = "translate English to German: How old are you?"
    encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids

    # lets run beam search using 3 beams
    num_beams = 3

    # define decoder start token ids
    input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
    input_ids = input_ids * model.config.decoder_start_token_id

    # add encoder_outputs to model keyword arguments
    model_kwargs = {
        "encoder_outputs": model.get_encoder()(
            encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
        )
    }

    # instantiate beam scorer
    beam_scorer = BeamSearchScorer(
        batch_size=1,
        num_beams=num_beams,
        device=model.device,
    )

    # instantiate logits processors
    logits_processor = LogitsProcessorList(
        [
            MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
        ]
    )

    outputs = model.beam_search(input_ids, beam_scorer, logits_processor=logits_processor, **model_kwargs)
    gen_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(gen_texts)

Differences between the GPT and BERT tokenizers

In the transformers library, GPT2Tokenizer has to read two files, vocab_file and merges_file, while BertTokenizer only reads a single vocab_file. The main reason is that the two models use different encodings:

  • BERT uses WordPiece (subword) encoding and ships a single vocabulary file; the official vocabulary contains roughly 30k tokens, and each token's position in the vocabulary is its index in the embedding matrix. BERT also reserves about 100 [unused] slots so that users can manually add important tokens from their own data to the vocabulary.
  • GPT-2 uses byte-level BPE; the official vocabulary contains 50k+ byte-level tokens. merges.txt stores the BPE merge rules, while vocab.json is a mapping from token to index (in general, the more frequent a token, the smaller its index). The conversion therefore proceeds in two steps: the input is first split into byte-level symbols and merged according to merges.txt, and the resulting tokens are then mapped to indices via the dictionary in vocab.json.

For GPT-2, suppose the input text is

What's up with the tokenizer?

It is first turned into byte-level BPE tokens using the merge rules in merges.txt (roughly a normalization step; the Ġ symbol marks a preceding space):

['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']

and then converted into the corresponding indices via the mapping stored in vocab.json:

    ['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
    ---- becomes ----
    [2061, 338, 510, 351, 262, 11241, 7509, 30]
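
A minimal sketch reproducing this mapping with the GPT-2 tokenizer:

    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokens = tokenizer.tokenize("What's up with the tokenizer?")
    print(tokens)                                   # byte-level BPE tokens, 'Ġ' marks a leading space
    print(tokenizer.convert_tokens_to_ids(tokens))  # indices looked up in vocab.json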

BERT

Training

    from transformers import BertModel, BertTokenizer

    # load the pretrained model and tokenizer
    model = BertModel.from_pretrained('bert-large-uncased')
    tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')

    # encode the input text
    text = "This is a test sentence."
    input_ids = tokenizer.encode(text, add_special_tokens=True, return_tensors='pt')
    # input_ids has shape [bs, l]

    # run BertModel over the input
    # output_hidden_states=True is needed, otherwise outputs.hidden_states is None
    outputs = model(input_ids, output_hidden_states=True)
    last_hidden_state = outputs.last_hidden_state  # [bs, l, dim]
    all_hidden_states = outputs.hidden_states      # tuple of per-layer [bs, l, dim] tensors

With add_special_tokens=True, the tokenizer wraps the sequence with BERT's [CLS] and [SEP] special tokens.
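
A quick check of this (reusing `tokenizer` and `input_ids` from above); the exact word pieces depend on the vocabulary, but the output should look roughly like the comment:

    # convert the ids back to tokens to see the added special tokens
    print(tokenizer.convert_ids_to_tokens(input_ids[0]))
    # e.g. ['[CLS]', 'this', 'is', 'a', 'test', 'sentence', '.', '[SEP]']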

Using BERT as encoder and decoder

    from transformers import BertGenerationEncoder, BertGenerationDecoder

    encoder = BertGenerationEncoder.from_pretrained(
        "bert-large-uncased",
        bos_token_id=101,
        eos_token_id=102,
    )

    # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
    decoder = BertGenerationDecoder.from_pretrained(
        "bert-large-uncased",
        add_cross_attention=True,
        is_decoder=True,
        bos_token_id=101,
        eos_token_id=102,
    )
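
A minimal sketch (following the pattern in the transformers BertGeneration docs) of wiring these two into a seq2seq model with EncoderDecoderModel; the input and label sentences are just placeholders:

    from transformers import BertTokenizer, EncoderDecoderModel

    # combine the encoder and decoder defined above into one seq2seq model
    bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)

    tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
    input_ids = tokenizer("This is a long article to summarize", return_tensors="pt").input_ids
    labels = tokenizer("This is a short summary", return_tensors="pt").input_ids

    # train it like any other encoder-decoder model
    loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss
    loss.backward()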

Verify that it really is a decoder:

    # verify the decoder (for the `decoder` defined above, both should be True)
    decoder.config.is_decoder
    decoder.config.add_cross_attention

Differences between BertModel and BertGenerationEncoder

BertGenerationEncoder (from "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks") is a slimmed-down BERT intended to be plugged into an EncoderDecoderModel, paired with BertGenerationDecoder: it returns only the hidden states and has no pooling layer, whereas BertModel is the standard encoder that additionally returns a pooler_output.

General operations

Passing inputs_embeds to generate()

This requires an up-to-date transformers release (4.30.0).

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    text = "Hello world"
    input_ids = tokenizer.encode(text, return_tensors="pt")

    # Traditional way of generating text
    outputs = model.generate(input_ids)
    print("\ngenerate + input_ids:", tokenizer.decode(outputs[0], skip_special_tokens=True))

    # From inputs_embeds -- exact same output if you also pass `input_ids`. If you don't
    # pass `input_ids`, you will get the same generated content but without the prompt
    inputs_embeds = model.transformer.wte(input_ids)
    outputs = model.generate(input_ids, inputs_embeds=inputs_embeds)
    print("\ngenerate + inputs_embeds:", tokenizer.decode(outputs[0], skip_special_tokens=True))

Passing inputs_embeds into GenerationMixin.generate() · Issue #6535 · huggingface/transformers · GitHub
