Contents
1. Space, punctuation, and rule-based tokenization
2. Subword tokenization
2.1 Byte-Pair Encoding (BPE)
Byte-level BPE
2.2 WordPiece
2.3 Unigram
2.4 SentencePiece: ALBERT, XLNet, Marian, and T5
References
Tokenizing a text means splitting it into words or subwords, which are then converted to IDs through a look-up table. We will look at