从前慢现在也慢

这个屌丝很懒，什么也没留下！

热门标签

NLP tokenizer (分词器) 介绍_文本分类tokenizer分词器

作者：从前慢现在也慢 | 2024-05-03 02:53:12

踩

文本分类tokenizer分词器

主流的三种分词算法

Byte-Pair Encoding tokenization

使用BPE的模型：GPT、GPT-2、RoBERTa、BART 和 DeBERTa

训练算法（Training algorithm）

1. 获得基础词表（base vocabulary），由语料的所有单个字符组成（是在标准化和预处理之后）

假设我们的语料为

"hug", "pug", "pun", "bun", "hugs"

那么我们的基础词表就是 ["b", "g", "h", "n", "p", "s", "u"]，

2. 统计词表中，任意两个token的组合次数，次数最大的加入词表，直到词表达到设定大小（vocabulary size）

3.保存词表和合并顺序

分词算法（Tokenization algorithm）

分词算法和训练算法类似，步骤如下

Tokenization follows the training process closely, in the sense that new inputs are tokenized by applying the following steps:

Normalization
Pre-tokenization
Splitting the words into individual characters
Applying the merge rules learned in order on those splits

首先将一个单词拆成单个字符，然后应用训练时学习到的合并顺序，依次合并字符。

WordPiece tokenization

使用WordPiece的模型：BERT, DistilBERT

WordPiece的训练算法和BPE类似，但是分词算法则不一样。

训练算法（Training algorithm）

1.获得初始词表

假设我的语料如下

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

那么初始词表就是

["b", "h", "p", "##g", "##n", "##s", "##u"]

注意，WordPiece会为非开头的token添加##前缀

2. 根据如下公式，计算每个组合的score，score最高的组合添加到词表中，直到词表达到设定大小（vocabulary size）

score=(freq_of_pair)/(freq_of_first_element×freq_of_second_element)

合并后的词表

["b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug"]

分词算法（Tokenization algorithm）

WordPiece 只保存最终词表，不保存合并规则（这点和BPE不一样）

从要标记化的单词开始，WordPiece 找到词汇表中最长的子词，然后对其进行分割。例如，如果我们使用在上面的示例中学习的词汇表，对于单词，"hugs"词汇表中从头开始的最长子词是"hug"，因此我们在那里进行分割并得到["hug", "##s"]。然后我们继续"##s"，它在词汇表中，所以的标记化"hugs"是["hug", "##s"]。

如果使用 BPE，按照合并规则的顺序，最红将获得["hu", "##gs"]，因此编码是不同的。

另一个例子，让我们看看如何对单词"bugs"进行标记。"b"是从词汇表中单词开头开始的最长子词，因此我们在那里拆分并得到["b", "##ugs"]。然后"##u"是词汇表中从其开头开始的最长子词，因此我们将其拆分并得到["b", "##u, "##gs"]。最后，"##gs"是在词汇表中，所以最终分词结果是["b", "##u, "##gs"]。

wordPiece 和BPE的不同

1. BPE保存合并顺序，而wordPiece不保存，总是从单词起始搜索在词表中出现的最长子词。

2. 合并规则不同，BPE简单计算共现次数，而wordPiece是计算score(计算联合概率除以边缘概率)

3. 当分某个单词分词后，只要有一个token不在词表，wordPiece将整个单词标记为UNK，而BPE只将不在词表的token标记为UNK

例如："bum"，"b" and "##u"都在词表，但是"##m"不在词表，那最终的分词结果就是["[UNK]"], 而不是["b", "##u", "[UNK]"]，而如果使用BPE，则结果为["b", "##u", "[UNK]"]。

Unigram tokenization

Unigram tokenization also starts with setting a desired vocabulary size. However, the main difference between unigram and the previous 2 approaches is that we don’t start with a base vocabulary of characters only. Instead, the base vocabulary has all the words and symbols. And tokens are gradually removed to arrive at the final vocabulary.

The way that tokens are removed is key to the unigram tokenizer. It uses a language model at each step and keeps removing x% of the pair (definition of pair is same as in word piece) which have the highest loss. Loss is generally defined as the log likelihood over the vocabulary at that training step.

The Unigram algorithm always keeps the base characters so that any word can be tokenized.

Unigram is mostly used in conjunction with the SentencePiece.

演示代码


corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]
 
from transformers import AutoTokenizer
 
tokenizer = AutoTokenizer.from_pretrained("/home/chenjq/model/xlnet-base-cased/")
 
 
from collections import defaultdict
 
word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1
 
word_freqs
 
 
 
char_freqs = defaultdict(int)
subwords_freqs = defaultdict(int)
for word, freq in word_freqs.items():
    for i in range(len(word)):
        char_freqs[word[i]] += freq
        # Loop through the subwords of length at least 2
        for j in range(i + 2, len(word) + 1):
            subwords_freqs[word[i:j]] += freq
 
# Sort subwords by frequency
sorted_subwords = sorted(subwords_freqs.items(), key=lambda x: x[1], reverse=True)
sorted_subwords[:10]
 
 
 
token_freqs = list(char_freqs.items()) + sorted_subwords[: 300 - len(char_freqs)]
token_freqs = {token: freq for token, freq in token_freqs}
 
 
 
from math import log
 
total_sum = sum([freq for token, freq in token_freqs.items()])
model = {token: -log(freq / total_sum) for token, freq in token_freqs.items()}
 
 
 
def encode_word(word, model):
    best_segmentations = [{"start": 0, "score": 1}] + [
        {"start": None, "score": None} for _ in range(len(word))
    ]
    for start_idx in range(len(word)):
        # This should be properly filled by the previous steps of the loop
        best_score_at_start = best_segmentations[start_idx]["score"]
        for end_idx in range(start_idx + 1, len(word) + 1):
            token = word[start_idx:end_idx]
            if token in model and best_score_at_start is not None:
                score = model[token] + best_score_at_start
                # If we have found a better segmentation ending at end_idx, we update
                if (
                    best_segmentations[end_idx]["score"] is None
                    or best_segmentations[end_idx]["score"] > score
                ):
                    best_segmentations[end_idx] = {"start": start_idx, "score": score}
 
    segmentation = best_segmentations[-1]
    if segmentation["score"] is None:
        # We did not find a tokenization of the word -> unknown
        return ["<unk>"], None
 
    score = segmentation["score"]
    start = segmentation["start"]
    end = len(word)
    tokens = []
    while start != 0:
        tokens.insert(0, word[start:end])
        next_start = best_segmentations[start]["start"]
        end = start
        start = next_start
    tokens.insert(0, word[start:end])
    return tokens, score
 
 
print(encode_word("Hopefully", model))
print(encode_word("This", model))
 
 
 
def compute_loss(model):
    loss = 0
    for word, freq in word_freqs.items():
        _, word_loss = encode_word(word, model)
        loss += freq * word_loss
    return loss
 
 
 
compute_loss(model)
 
 
 
import copy
 
 
def compute_scores(model):
    scores = {}
    model_loss = compute_loss(model)
    for token, score in model.items():
        # We always keep tokens of length 1
        if len(token) == 1:
            continue
        model_without_token = copy.deepcopy(model)
        _ = model_without_token.pop(token)
        scores[token] = compute_loss(model_without_token) - model_loss
    return scores
 
 
 
scores = compute_scores(model)
print(scores["ll"])
print(scores["his"])
 
 
percent_to_remove = 0.1
while len(model) > 100:
    scores = compute_scores(model)
    sorted_scores = sorted(scores.items(), key=lambda x: x[1])
    # Remove percent_to_remove tokens with the lowest scores.
    for i in range(int(len(model) * percent_to_remove)):
        _ = token_freqs.pop(sorted_scores[i][0])
 
    total_sum = sum([freq for token, freq in token_freqs.items()])
    model = {token: -log(freq / total_sum) for token, freq in token_freqs.items()}
 
 
 
def tokenize(text, model):
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in words_with_offsets]
    encoded_words = [encode_word(word, model)[0] for word in pre_tokenized_text]
    return sum(encoded_words, [])
 
 
tokenize("This is the Hugging Face course.", model)

SentencePiece

All the tokenizers discussed above assume that space separates words. This is true except for a few languages like Chinese, Japanese etc. SentencePiece does not treat space as a separator, instead, it takes the string as input in its original raw format, i.e. along with all spaces. It then uses BPE or unigram as its tokenizers to construct the vocabulary.

Example: “I just got a funky phone case!”

Tokenized: [“_I”, “_just”, “_got”, “_a”, “_fun”, “ky”, “_phone”, “_case”]

The tokens can be joined to form a string and “_” can be replaced with space to get the original string back.

https://towardsdatascience.com/a-comprehensive-guide-to-subword-tokenisers-4bbd3bad9a7c

仅训练词表

场景：当我们使用如bert（wordPiece分词）预训练模型的时候，由于预训练的bert语料可能和我们手上的不一样，导致很多token被分成[unk]，这时候我们希望对词表重新训练，但是使用的分词算法保持不变（依旧是wordPiece）

https://huggingface.co/learn/nlp-course/chapter6/2


import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer
 
# 加载原有的bert tokenizer
old_tokenizer = AutoTokenizer.from_pretrained('../models/bert-base-chinese')
 
# 读取数据
src = pd.read_csv('./test.src', names=['src'])
tgt = pd.read_csv('./test.tgt', names=['tgt'])
src_tgt = pd.concat([src[['src']], tgt[['tgt']]], axis=1)
src_tgt.columns = ['document', 'summary']
raw_datasets = Dataset.from_pandas(src_tgt)
print(raw_datasets)
 
# 构建一个生成器读取语料
def get_training_corpus():
    dataset = raw_datasets
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx: start_idx + 1000]
        yield samples["document"]
 
 
training_corpus = get_training_corpus()
 
# 训练一个和原来一样的分词器，但是词表不一样，是基于我们自己的语料生成的
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)  # 设置新的词表大小为52000
 
# 保存到test-tok目录
tokenizer.save_pretrained("test-tok")

重新训练分词器

https://huggingface.co/learn/nlp-course/chapter6/8?fw=pt#building-a-wordpiece-tokenizer-from-scratch

训练


from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
 
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
 
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
 
files = ['./train_sample.txt']
tokenizer.train(files, trainer)
tokenizer.save("./my-vocab.json")

使用分词器


from tokenizers import Tokenizer
from tokenizers.models import BPE
 
sent1 = "有感伤才完美 声明：本文内容由网友自发贡献，不代表【wpsshop博客】立场，版权归原作者所有，本站不承担相应法律责任。如您发现有侵权的内容，请联系我们。转载请注明出处：https://www.wpsshop.cn/w/从前慢现在也慢/article/detail/527425
推荐阅读
article2023年全国职业院校技能大赛-应用软件系统开发赛项（高职组）赛题第3套_双碳管理系统api文档...
1.实现智能制造的过程控制、任务派工、质量管理、数据采集等；2.实时监测每个设备的能源消耗数据，进行能源分析，实现能源利...
                                    赞
踩
article我眼中的docker 如何启动多个docker任务容器_docker 启动多个oracle容器...
docker类式于为沙盒中运行的线程，可以看做是一个简易的linux系统，容器的ID与名称都是唯一的，可以通过ID的前3...
                                    赞
踩
article【面试系列】百度，字节跳动，小米算法三面感悟，（内附复习资料及答案）...
作为一名菜鸡，浑浑噩噩的也到了研二了，琢磨着也要开始找暑期实习了，我的老本行是计算机视觉，小方向是做迁移学习的。但是这年...
                                    赞
踩
article香港Web3，步履蹒跚...
4月30日，6支香港虚拟资产现货ETF在香港交易所敲钟上市，正式迎来市场首秀。而在前一周，香港证监会（SFC）官网就已列...
                                    赞
踩
article怎么搭建一个单机的聊天机器人...
搭建一个单机的聊天机器人需要以下步骤：选择语言：可以使用 Python、Java、C++ 等编程语言。选择框架：可以使用...
                                    赞
踩
articleJava线程安全问题的原因和解决方案...
线程安全的确切定义是比较复杂的,不过我们可以这样认为:当多线程环境下的代码运行的结果是符合我们预期的,即在单线程环境下应...
                                    赞
踩
article【SpringCloud Alibaba】 介绍及微服务模块搭建_搭建springcloud ali...
Spring Cloud Alibaba致力于提供微服务开发的一站式解决方案。此项目包含开发分布式应用微服务的必需组件，...
                                    赞
踩
article某东jd滑块系列第三-变态的注册页面滑块_京东滑块验证码识别平台...
某东jd滑块系列第三-变态的注册页面滑块什么，还有浏览器指纹？声明：本文章纯属技术探讨学习防止不法用途轨迹代码为旧版本需...
                                    赞
踩
article【C语言】switch应用：输入1~7之内的数字，输出对应 星期几 | C语言面试：解释C语言中的t...
代码例如，输入1输出Monday。输入 3输出Wednesday。输入 0输出error！//switch实例#incl...
                                    赞
踩
articleGitHub Copilot一直等待问题_vscode github copilot 的聊天窗口一直...
GitHub Copilot一直等待问题_vscode github copilot 的聊天窗口一直思考中vscode ...
                                    赞
踩
article你知道什么是Prompt吗...
然而，设计有效的Prompt需要考虑任务目标、Prompt位置、关键词和指令的选择，并进行迭代和优化。例如，在问答任务中...
                                    赞
踩
articlepython笔记整理第二十六篇（for-in循环）_python for in 循环图...
**for-in循环：**in表达从（字符串、序列等）中依次取值，又称为遍历；for-in遍历的对象必须是可迭代对象。f...
                                    赞
踩
article在 Web3 方向有哪些工作，需要掌握哪些知识呢？...
综上所述，无论您选择Web3领域的哪个岗位，都需要构建全面的知识体系，既包括基础理论与核心技术，也涵盖特定岗位的专业技能...
                                    赞
踩
article【华为机试】2023年真题C卷（python）-机器人搬砖_华为od 机器人搬砖,一共有n堆砖存放在...
【华为机试】2023年真题C卷（python）-机器人搬砖_华为od 机器人搬砖,一共有n堆砖存放在n个不同的仓库中,第...
                                    赞
踩
articlepython如何随机生成100个0～100的整数_python面试题...
1、在python里面如何生成随机数1>生成随机小数random.random()是指0-1之间的随机小数，不包括1ra...
                                    赞
踩
articleWindows下安装Kafka_kafka2.7.2windows安装...
1、官网下载安装包，注意下载Binary downloads，我这里下载的是kafka_2.12-2.7.0.tgz2、...
                                    赞
踩
article关于keil显示行数_keil统计代码行数...
关于keil显示行数一张图说明如何 显示行数，找keil界面中的小板子，即Configuration，然后勾选 Show...
                                    赞
踩
articleSAS正则表达式之prxchange—短句中唯首单词首字母大写_sas 首字母大写...
医药临床试验编程技巧分享之SAS正则表达式_sas 首字母大写sas 首字母大写             ADAM数据集...
                                    赞
踩
article[SQL系列]从零开始学Clickhouse...
简单开始学习clickhouse[SQL系列]从零开始学Clickhouse             起因  ...
                                    赞
踩
articleSpringBoot+Minio实现文件断点续传_springboot minio断点续传...
断点续传的实现方式。_springboot minio断点续传springboot minio断点续传         ...
                                    赞
踩
相关标签
大数据
人工智能
系统软件
算法
java
面试
spring
python
web3
区块链
机器人
深度学习
数据挖掘
自然语言处理
开发语言
多线程安全
spring cloud
微服务
SpringBoot
MyBatis
验证码
C语言
switch
break