While using the GPT2Tokenizer from transformers, I came across this passage in its docstring:
GPT-2 BPE tokenizer. Peculiarities:
- Byte-level Byte-Pair-Encoding
- Requires a space to start the input string => the encoding methods should be called with the add_prefix_space flag set to True. Otherwise, this tokenizer's encode and decode methods will not conserve the absence of a space at the beginning of a string: tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
I didn't quite understand this at the time. Then, while looking through the tokenizer's vocabulary file, I noticed:
"negative": 31591,
"Ġnegative": 4633,
These are the same word, except one has a Ġ in front; my first guess was that the distinction marked something like a suffix.
In fact, negative is encoded as 4633 when preceded by a space and as 31591 when not: in "Y negative" the word negative encodes to 4633, while in "Ynegative" it encodes to 31591. So the form without Ġ can be read as a suffix (a continuation inside a word), while the form with Ġ marks a token that begins a word, i.e. follows a space.
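This guess is easy to confirm from the vocabulary itself. A minimal sketch, assuming the standard GPT2Tokenizer API (get_vocab returns the token-to-id mapping; the ids are the ones quoted above):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
vocab = tokenizer.get_vocab()  # token string -> id
print(vocab["negative"])   # 31591, the continuation ("suffix") form
print(vocab["Ġnegative"])  # 4633, the word-initial form (Ġ stands for a leading space)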
add_prefix_space addresses exactly this: without it, encoding assumes there is no space before the string, which is usually wrong for the first word of a sentence. In "Attention is all you need", the first word marks the start of the sentence, but a plain tokenizer.encode("Attention is all you need") will treat Attention as a suffix. So add_prefix_space=True should be passed.
Experimental verification:
Each call takes the form:
tokenizer.encode("negative", add_prefix_space=True)
import warnings
warnings.filterwarnings("ignore")

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.save_pretrained('./config')
text = "I love you"
print(text)
print(tokenizer.decode(tokenizer.encode(text)))  # I love you (no leading space before I)
print(tokenizer.all_special_tokens)  # ['<|endoftext|>']

'''
Hypothesis: with a leading space the token is "Ġnegative": 4633,
otherwise [31591] (vocab entry: "negative": 31591)
'''
print(tokenizer.encode("negative"))      # [31591]  vocab: "negative": 31591
print(tokenizer.encode("negativeY"))     # [31591, 56]
print(tokenizer.encode(" negative"))     # [4633]   vocab: "Ġnegative": 4633
print(tokenizer.encode("you negative"))  # [5832, 4633]
print(tokenizer.encode("Knegative"))     # [42, 31591]

special_tokens_dict = {'cls_token': '<CLS>', 'bos_token': '<s>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print('We have added', num_added_toks, 'tokens')  # We have added 2 tokens
print(tokenizer.encode("negative", add_special_tokens=True))  # [31591]
print(tokenizer.encode("negative", add_prefix_space=True))    # [4633]
print(tokenizer.encode("<s> negative", add_special_tokens=True))                         # [50258, 4633]  '<s>' = 50258
print(tokenizer.encode("<s> negative", add_special_tokens=True, add_prefix_space=True))  # [50258, 4633]
Using tokenizers.ByteLevelBPETokenizer, for example:
import warnings
warnings.filterwarnings("ignore")

from transformers import GPT2Tokenizer
import tokenizers

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.save_pretrained('./config')  # writes vocab.json and merges.txt
text = "I love you"

PATH = './config/'
tokenizer = tokenizers.ByteLevelBPETokenizer(
    vocab_file=PATH + 'vocab.json',
    merges_file=PATH + 'merges.txt',
    lowercase=False,
    add_prefix_space=True
)
print(text)
print(tokenizer.decode(tokenizer.encode(text).ids))  # " I love you" (a leading space before I)

print(tokenizer.encode("negative").ids)      # [4633]  vocab: "Ġnegative": 4633
print(tokenizer.encode(" negative").ids)     # [4633]  the prefix space is only added when missing
print(tokenizer.encode("  negative").ids)    # [220, 4633]  two leading spaces; vocab: "Ġ": 220
print(tokenizer.encode("you negative").ids)  # [345, 4633]
print(tokenizer.encode("Knegative").ids)     # [509, 31591]