
You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>

诸神缄默不语 - Personal CSDN Blog Post Index

This is a warning message that appears when loading the mT5 tokenizer.

Original code:

from transformers import AutoTokenizer
tokenizer=AutoTokenizer.from_pretrained("/data/pretrained_models/mt5_small")

Full text of the warning messages:
You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
env_path/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
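The second message (the UserWarning from convert_slow_tokenizer.py) is a separate issue: the fast tokenizer does not implement SentencePiece's byte-fallback option, so it may emit <unk> where the slow tokenizer would fall back to byte tokens. Below is a minimal sketch of my own (not from the original post) to observe this; the rare character is an arbitrary choice, and whether it actually yields <unk> depends on the mT5 vocabulary:

from transformers import AutoTokenizer

path = "/data/pretrained_models/mt5_small"  # local mT5 checkpoint used in this post
slow_tok = AutoTokenizer.from_pretrained(path, use_fast=False)
fast_tok = AutoTokenizer.from_pretrained(path, use_fast=True)

rare_char = "𓀀"  # an arbitrary rare character, likely outside the vocabulary
print("slow:", slow_tok.tokenize(rare_char))  # byte fallback may produce <0x..> byte tokens
print("fast:", fast_tok.tokenize(rare_char))  # the fast tokenizer may emit <unk> instead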

The first warning is the one this post is about: loading the tokenizer this way uses the old (legacy) tokenization logic, which has a bug where an extra space is inserted after special tokens such as </s>. To get the fixed behaviour you need to pass legacy=False. In addition, the fast tokenizer has not received this fix either, so if you want the fully fixed behaviour you need to pass both legacy=False and use_fast=False.
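In other words, the recommended loading call looks like this (a minimal sketch; the path is the local mT5 checkpoint used in this post, replace it with your own):

from transformers import AutoTokenizer

# legacy=False selects the fixed tokenization logic; use_fast=False avoids the
# fast tokenizer, which still inserts the extra space (see the tests below).
tokenizer = AutoTokenizer.from_pretrained(
    "/data/pretrained_models/mt5_small",  # local mT5 path used in this post
    legacy=False,
    use_fast=False,
)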

If you are calling a pretrained mT5 directly for inference, it is best to use the same tokenizer settings that were used during training. If you are fine-tuning it yourself, I don't think it makes a difference, but I would still recommend legacy=False, use_fast=False. Whether it actually matters depends on whether special tokens such as </s> appear in your text; if they don't, there is no difference (especially for Chinese text there should be no difference; see the tests below, where the configurations differ only at the encoding step):

Testing the outputs of the different configurations:

tokenizer=AutoTokenizer.from_pretrained("/data/pretrained_models/mt5_small",legacy=True,use_fast=False)

a_sentence="没有任何特殊符号的部分,加上有特殊符号比如</s>、数字202401111356的部分"
print(tokenizer.encode(a_sentence))
print(tokenizer.tokenize(a_sentence))
print(tokenizer.decode(tokenizer.encode(a_sentence)))

Output:

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
[259, 13342, 43803, 64479, 82061, 3688, 227830, 261, 158919, 1637, 64479, 82061, 3688, 82734, 1, 259, 227830, 1]
['▁', '没有', '任何', '特殊', '符', '号', '的部分', ',', '加上', '有', '特殊', '符', '号', '比如', '</s>', '▁', '的部分']
没有任何特殊符号的部分,加上有特殊符号比如</s> 、数字202401111356的部分</s>
tokenizer=AutoTokenizer.from_pretrained("/data/pretrained_models/mt5_small",legacy=True,use_fast=True)

a_sentence="没有任何特殊符号的部分,加上有特殊符号比如</s>、数字202401111356的部分"
print(tokenizer.encode(a_sentence))
print(tokenizer.tokenize(a_sentence))
print(tokenizer.decode(tokenizer.encode(a_sentence)))

Output:

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
env_path/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
[259, 13342, 43803, 64479, 82061, 3688, 227830, 261, 158919, 1637, 64479, 82061, 3688, 82734, 1, 259, 227830, 1]
['▁', '没有', '任何', '特殊', '符', '号', '的部分', ',', '加上', '有', '特殊', '符', '号', '比如', '</s>', '▁', '的部分']
没有任何特殊符号的部分,加上有特殊符号比如</s> 、数字202401111356的部分</s>
tokenizer=AutoTokenizer.from_pretrained("/data/pretrained_models/mt5_small",legacy=False,use_fast=False)

a_sentence="没有任何特殊符号的部分,加上有特殊符号比如</s>、数字202401111356的部分"
print(tokenizer.encode(a_sentence))
print(tokenizer.tokenize(a_sentence))
print(tokenizer.decode(tokenizer.encode(a_sentence)))

Output:

[259, 13342, 43803, 64479, 82061, 3688, 227830, 261, 158919, 1637, 64479, 82061, 3688, 82734, 1, 292, 39542, 79806, 122631, 176372, 227830, 1]
['▁', '没有', '任何', '特殊', '符', '号', '的部分', ',', '加上', '有', '特殊', '符', '号', '比如', '</s>', '、', '数字', '2024', '0111', '1356', '的部分']
没有任何特殊符号的部分,加上有特殊符号比如</s> 、数字202401111356的部分</s>
tokenizer=AutoTokenizer.from_pretrained("/data/pretrained_models/mt5_small",legacy=False,use_fast=True)

a_sentence="没有任何特殊符号的部分,加上有特殊符号比如</s>、数字202401111356的部分"
print(tokenizer.encode(a_sentence))
print(tokenizer.tokenize(a_sentence))
print(tokenizer.decode(tokenizer.encode(a_sentence)))

Output:

env_path/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py:470: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
[259, 13342, 43803, 64479, 82061, 3688, 227830, 261, 158919, 1637, 64479, 82061, 3688, 82734, 1, 259, 292, 39542, 79806, 122631, 176372, 227830, 1]
['▁', '没有', '任何', '特殊', '符', '号', '的部分', ',', '加上', '有', '特殊', '符', '号', '比如', '</s>', '▁', '、', '数字', '2024', '0111', '1356', '的部分']
没有任何特殊符号的部分,加上有特殊符号比如</s> 、数字202401111356的部分</s>
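To summarize the four runs above, here is a small sketch of my own (same local path and test sentence as in the post) that loads all four configurations and reports whether a spurious "▁" piece is inserted right after </s>:

from transformers import AutoTokenizer

path = "/data/pretrained_models/mt5_small"  # same local checkpoint as above
a_sentence = "没有任何特殊符号的部分,加上有特殊符号比如</s>、数字202401111356的部分"

for legacy in (True, False):
    for use_fast in (False, True):
        tokenizer = AutoTokenizer.from_pretrained(path, legacy=legacy, use_fast=use_fast)
        tokens = tokenizer.tokenize(a_sentence)
        # Check whether a spurious "▁" piece appears right after the </s> special token.
        idx = tokens.index("</s>") if "</s>" in tokens else -1
        extra_space = idx != -1 and idx + 1 < len(tokens) and tokens[idx + 1] == "▁"
        print(f"legacy={legacy}, use_fast={use_fast}: extra '▁' after </s> -> {extra_space}")

In the runs shown above, only legacy=False with use_fast=False avoids the extra piece.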

References:

  1. ⚠️⚠️[T5Tokenize] Fix T5 family tokenizers⚠️⚠️ by ArthurZucker · Pull Request #24565 · huggingface/transformers
  2. Using transformers legacy tokenizer · Issue #305 · OpenAccess-AI-Collective/axolotl
  3. Slow Tokenizer adds whitespace after special token · Issue #25073 · huggingface/transformers