The basic meaning of a tokenizer
A tokenizer is simply a segmenter. In BERT, however, it is not quite the same thing as Chinese word segmentation as we usually understand it, and the main difference is not the matching algorithm itself: BERT essentially relies on greedy longest-match-first (maximum matching).
The biggest difference lies in how a "word" is understood and defined. For Chinese, the basic unit is the single character.
For English, the unit is the subword: for example, "unwanted" is decomposed into ["un", "##want", "##ed"]. It is worth thinking carefully about the advantages of this approach (a sketch follows below).
This is one of the key ideas behind the tokenizer.
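To make the subword splitting concrete, here is a minimal sketch of greedy longest-match-first (WordPiece-style) splitting in Python. The toy VOCAB and the wordpiece function name are assumptions for illustration only; BERT loads its real vocabulary from a vocab.txt file.

VOCAB = {"un", "want", "##want", "##ed"}  # toy vocabulary, for illustration only

def wordpiece(word, vocab=VOCAB, unk="[UNK]"):
    """Split one word by repeatedly taking the longest match in the vocab."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the ## marker
            if piece in vocab:
                cur = piece
                break
            end -= 1  # shrink the window until some vocab entry matches
        if cur is None:
            return [unk]  # nothing matched: treat the whole word as unknown
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece("unwanted"))  # ['un', '##want', '##ed']

The advantage is that rare or unseen words such as "unwanted" can still be represented with a small, closed vocabulary, while the ## marker preserves the fact that the pieces came from one original word.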
The tokenizers involved in BERT
BasicTokenizer
The main class is BasicTokenizer. It performs basic operations such as converting the text to unicode, cleaning it, splitting on punctuation, lowercasing, splitting Chinese characters apart, and stripping accents, and it finally returns a list of word tokens (for Chinese, a list of characters):
def tokenize(self, text):
  """Tokenizes a piece of text."""
  text = convert_to_unicode(text)
  text = self._clean_text(text)

  # This was added on November 1st, 2018 for the multilingual and Chinese
  # models. This is also applied to the English models now, but it doesn't
  # matter since the English models were not trained on any Chinese data
  # and generally don't have any Chinese data in them (there are Chinese
  # characters in the vocabulary because Wikipedia does have some Chinese words in the English Wikipedia.).
  text = self._tokenize_chinese_chars(text)
  orig_tokens = whitespace_tokenize(text)
  split_tokens = []
  for token in orig_tokens:
    if self.do_lower_case:
      token = token.lower()
      token = self._run_strip_accents(token)
    split_tokens.extend(self._run_split_on_punc(token))
  output_tokens = whitespace_tokenize(" ".join(split_tokens))
  return output_tokens
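A minimal usage sketch of the class above, assuming the tokenization.py file from google-research/bert (the file this method comes from) is importable as tokenization; the sample sentence and the printed result are only illustrative.

import tokenization  # tokenization.py from google-research/bert, assumed to be on the path

basic = tokenization.BasicTokenizer(do_lower_case=True)

# Lowercasing, accent stripping, punctuation splitting, and per-character
# splitting of Chinese text all happen inside tokenize().
print(basic.tokenize(u"Héllo, BERT! 我爱北京"))
# Roughly: ['hello', ',', 'bert', '!', '我', '爱', '北', '京']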