
Some observations about GPT2Tokenizer: add_prefix_space

add_prefix_space

 

While using the GPT2Tokenizer from transformers, I came across this note in its documentation:

GPT-2 BPE tokenizer. Peculiarities:

  • Byte-level Byte-Pair-Encoding

  • Requires a space to start the input string => the encoding methods should be called with the add_prefix_space flag set to True. Otherwise, this tokenizer encode and decode method will not conserve the absence of a space at the beginning of a string:

tokenizer.decode(tokenizer.encode("Hello")) = " Hello"

I didn't fully understand this at first. Later, while inspecting the tokenizer's vocabulary, I found:

"negative": 31591,

"Ġnegative": 4633,

These are the same word, but one entry is prefixed with Ġ. At the time I guessed the unprefixed one was something like a suffix.

In fact, when negative is preceded by a space it encodes to 4633, and without a space it encodes to 31591: in "Y negative", negative encodes to 4633, while in "Ynegative" it encodes to 31591. So the variant without Ġ can be read as a word continuation (a "suffix"), while the Ġ-prefixed variant marks a token that starts a new word.
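You can confirm this directly from the vocabulary. A minimal sketch (get_vocab() returns the token-to-id mapping):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
vocab = tokenizer.get_vocab()  # maps token strings to ids

print(vocab["negative"])   # 31591, the word-internal (continuation) variant
print(vocab["Ġnegative"])  # 4633, the word-initial variant; Ġ marks a preceding space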

What add_prefix_space means is this: if you encode without it, the tokenizer assumes there is no space before the first character, which is usually wrong for a full sentence. In "Attention is all you need", the first word should be treated as word-initial, but a plain

tokenizer.encode("Attention is all you need")

treats Attention as a continuation (the "suffix" form above). So you should pass add_prefix_space=True.

Experimental verification:

Method 1

Pass the flag explicitly on every call, e.g.:

tokenizer.encode("negative", add_prefix_space=True)
import warnings
warnings.filterwarnings("ignore")
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import tokenizers

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.save_pretrained('./config')
text = "I love you"
print(text)
print(tokenizer.decode(tokenizer.encode(text)))  # I love you (no space before I)
print(tokenizer.all_special_tokens)  # ['<|endoftext|>']
'''
Hypothesis: with a leading space the word encodes as "Ġnegative": 4633,
otherwise as "negative": 31591.
'''
print(tokenizer.encode("negative"))      # [31591]  vocab: "negative": 31591
print(tokenizer.encode("negativeY"))     # [31591, 56]
print(tokenizer.encode(" negative"))     # [4633]   vocab: "Ġnegative": 4633
print(tokenizer.encode("you negative"))  # [5832, 4633]
print(tokenizer.encode("Knegative"))     # [42, 31591]
special_tokens_dict = {'cls_token': '<CLS>', 'bos_token': '<s>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print('We have added', num_added_toks, 'tokens')  # We have added 2 tokens
print(tokenizer.encode("negative", add_special_tokens=True))  # [31591]
print(tokenizer.encode("negative", add_prefix_space=True))    # [4633]
print(tokenizer.encode("<s> negative", add_special_tokens=True))  # [50258, 4633]
print(tokenizer.encode("<s> negative", add_special_tokens=True, add_prefix_space=True))  # [50258, 4633]; the space after <s> already yields Ġnegative
Method 2

Use tokenizers.ByteLevelBPETokenizer and set the flag once in the constructor, e.g.:

import warnings
warnings.filterwarnings("ignore")
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import tokenizers

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.save_pretrained('./config')
text = "I love you"
PATH = './config/'
tokenizer = tokenizers.ByteLevelBPETokenizer(
    vocab_file=PATH + 'vocab.json',
    merges_file=PATH + 'merges.txt',
    lowercase=False,
    add_prefix_space=True
)
print(text)
print(tokenizer.decode(tokenizer.encode(text).ids))  # " I love you" (space before I)
print(tokenizer.encode("negative").ids)      # [4633]  vocab: "Ġnegative": 4633
print(tokenizer.encode(" negative").ids)     # [4633]  an existing single space is not doubled
print(tokenizer.encode("  negative").ids)    # [220, 4633]  two leading spaces; vocab: "Ġ": 220
print(tokenizer.encode("you negative").ids)  # [345, 4633]
print(tokenizer.encode("Knegative").ids)     # [509, 31591]
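One consequence of setting add_prefix_space in the constructor is that the decode round trip gains a leading space, which is exactly the behavior quoted from the docs at the top. A small sketch, reusing the ByteLevelBPETokenizer built above:

ids = tokenizer.encode("Hello").ids
print(tokenizer.decode(ids))  # " Hello": the prefix space added at encode time survives decoding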

 
