Generating Text with GPT-2
Note: This blog was originally posted at the following link.
We have all heard that modern-day Natural Language Processing (NLP) has progressed by leaps and bounds in the past couple of years, following the development of attention networks and transformers. This paved the way for a plethora of new algorithms achieving State-Of-The-Art (SOTA) results on the different tasks of NLP.
OpenAI has been one of the leaders in providing its own language model (now the released GPT-3), which is trained on a huge corpus of internet data. Since GPT-3 is a recent phenomenon, is English-only at the moment, and is accessible only through the API provided by OpenAI, we shift our focus to its earlier version, i.e. GPT-2. To learn about the internal nuts and bolts of GPT-2, I suggest you go through this link. For more depth on Attention and Transformers, here are some excellent links:
The Illustrated Transformer by Jay Alammar
The Annotated Transformer by Harvard NLP
GPT-2 was also released for English, which makes it difficult for someone trying to generate text in a different language.
So why not train your own GPT-2 model on your favorite language for text generation? That is exactly what we are going to do. So, without further ado, let us jump in.
For the demo, I have considered a non-Latin alphabet script (Bengali here), because why not? I have used Hugging Face’s implementation of the model.
1. Gathering the data
Gathering good quality data is one of the most important stages, as all data scientists would agree. So we are going to assume that you already have a folder containing .txt files with all the data cleaned and stored. For ease, you can use the Wikipedia article data, which is freely available and can be downloaded with the following code:
import tensorflow as tf
from gensim.corpora import WikiCorpus
import os
import argparse

# lang = 'bn'


def store(corpus, lang):
    base_path = os.getcwd()
    store_path = os.path.join(base_path, '{}_corpus'.format(lang))
    if not os.path.exists(store_path):
        os.mkdir(store_path)
    file_idx = 1
    for text in corpus.get_texts():
        current_file_path = os.path.join(store_path, 'article_{}.txt'.format(file_idx))
        with open(current_file_path, 'w', encoding='utf-8') as file:
            file.write(bytes(' '.join(text), 'utf-8').decode('utf-8'))
        file_idx += 1


def tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list:
    return [token for token in text.split() if token_min_len <= len(token) <= token_max_len]


def run(lang):
    origin = 'https://dumps.wikimedia.org/{}wiki/latest/{}wiki-latest-pages-articles.xml.bz2'.format(lang, lang)
    fname = '{}wiki-latest-pages-articles.xml.bz2'.format(lang)
    file_path = tf.keras.utils.get_file(origin=origin, fname=fname, untar=False, extract=False)
    corpus = WikiCorpus(file_path, lemmatize=False, lower=False, tokenizer_func=tokenizer_func)
    store(corpus, lang)


if __name__ == '__main__':
    ARGS_PARSER = argparse.ArgumentParser()
    ARGS_PARSER.add_argument(
        '--lang',
        default='en',
        type=str,
        help='language code to download from wikipedia corpus'
    )
    ARGS = ARGS_PARSER.parse_args()
    run(**vars(ARGS))
Save the script as wikipedia_download.py and run it with the language code of your choice:

python wikipedia_download.py --lang bn
This will create a folder named bn_corpus (for --lang bn) containing each Wikipedia article as a separate .txt file.
Note: Due to resource constraints, and since it is for demo purposes, I have trained the model on a small subset of books by Satyajit Ray, especially his detective Feluda series.
2. Tokenization
Now, the second step will be to tokenize the data. For that, we use the following class:
import os
from tokenizers.models import BPE
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer


class BPE_token(object):
    def __init__(self):
        self.tokenizer = Tokenizer(BPE())
        # NFKC Unicode normalization (see the notes below)
        self.tokenizer.normalizer = Sequence([
            NFKC()
        ])
        self.tokenizer.pre_tokenizer = ByteLevel()
        self.tokenizer.decoder = ByteLevelDecoder()

    def bpe_train(self, paths):
        trainer = BpeTrainer(vocab_size=50000, show_progress=True, initial_alphabet=ByteLevel.alphabet(), special_tokens=[
            "<s>",
            "<pad>",
            "</s>",
            "<unk>",
            "<mask>"
        ])
        # note: this argument order matches older releases of the tokenizers
        # library; newer releases expect train(paths, trainer) instead
        self.tokenizer.train(trainer, paths)

    def save_tokenizer(self, location, prefix=None):
        if not os.path.exists(location):
            os.makedirs(location)
        self.tokenizer.model.save(location, prefix)
Some notes on the tokenization:
We use BPE (Byte Pair Encoding), which is a sub-word encoding. This generally takes care of not treating different forms of a word as completely unrelated. (E.g., ‘greatest’ can be treated as two tokens, ‘great’ and ‘est’, which is advantageous since it retains the similarity between ‘great’ and ‘greatest’, while the extra token ‘est’ is what marks ‘greatest’ as different.) It is also not as low-level as character-level encoding, which doesn’t retain any notion of a particular word.
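To make the sub-word idea concrete, here is a minimal sketch (not from the original post) that inspects how the standard pretrained English GPT-2 tokenizer splits these words; the exact pieces for our custom Bengali tokenizer will of course depend on the corpus and vocabulary it is trained on:

from transformers import GPT2Tokenizer

# the standard pretrained English GPT-2 tokenizer, used only for illustration
en_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# prints the sub-word pieces; the exact splits depend on the learned merges
print(en_tokenizer.tokenize("great"))
print(en_tokenizer.tokenize("greatest"))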
Another small but subtle point is the NFKC (Normalization Form Compatibility Composition) normalizer in the code above. It is one of the standard Unicode compatibility forms. It would not matter much if the language were English, but since we are using Bengali, which contains different forms of characters, we use this specific one. More on it can be found at this link.
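For a quick feel of what NFKC normalization does, here is a small sketch using only Python’s standard library (the Bengali-specific effects depend on the exact characters in your corpus):

import unicodedata

# compatibility characters are mapped to their canonical equivalents,
# e.g. the 'ﬁ' ligature (U+FB01) becomes the two letters 'fi'
print(unicodedata.normalize('NFKC', '\ufb01'))

# full-width Latin letters are mapped to their ordinary ASCII forms
print(unicodedata.normalize('NFKC', '\uff21\uff22\uff23'))  # 'ＡＢＣ' -> 'ABC'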
So what we do here is tokenize our data and save it in a folder. Two files will be created (merges.txt and vocab.json) in the specified directory. To train and save the tokenizer, run the following code:
from tokenise import BPE_token
from pathlib import Path
import os

# the folder 'text' contains all the files
paths = [str(x) for x in Path("./text/").glob("**/*.txt")]
tokenizer = BPE_token()

# train the tokenizer model
tokenizer.bpe_train(paths)

# saving the tokenized data in our specified folder
save_path = 'tokenized_data'
tokenizer.save_tokenizer(save_path)
3. Model Initialization
Before the real magic begins, we need to make sure the artillery is ready. Let us start with some initializations.
import tensorflow as tf
from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2Tokenizer

# loading tokenizer from the saved model path
tokenizer = GPT2Tokenizer.from_pretrained(save_path)
tokenizer.add_special_tokens({
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>"
})

# creating the configuration from which the model can be made
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

# creating the model
model = TFGPT2LMHeadModel(config)
We also create a single string from all our documents and tokenize it.
single_string = ''
for filename in paths:
    with open(filename, "r", encoding='utf-8') as f:
        x = f.read()
    single_string += x + tokenizer.eos_token
string_tokenized = tokenizer.encode(single_string)
After we have encoded the whole string, we move on to making a TensorFlow dataset, slicing the data into equal intervals so that our model can learn. Here we use a block size of 100 (the length of the token sequence in each example) and a batch size of 16. These are kept low so that we can run the training with ease on an RTX 2060 GPU.
examples = []
block_size = 100
BATCH_SIZE = 16
BUFFER_SIZE = 1000

for i in range(0, len(string_tokenized) - block_size + 1, block_size):
    examples.append(string_tokenized[i:i + block_size])

inputs, labels = [], []
for ex in examples:
    inputs.append(ex[:-1])
    labels.append(ex[1:])

dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
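As a quick sanity check (not part of the original post), you can peek at a single batch to confirm the shapes: each example is a block of 100 tokens, so inputs and labels are each 99 tokens long after the shift.

# inspect one batch; expected shape is (16, 99) for both inputs and labels,
# assuming the corpus yields at least one full batch
for batch_inputs, batch_labels in dataset.take(1):
    print(batch_inputs.shape, batch_labels.shape)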
4. Model Training
Now comes the part we’ve been waiting for: building the model and training it. So we define our optimizer, loss function and metric, and start training.
# defining our optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)

# defining our loss function
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# defining our metric which we want to observe
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

# compiling the model (only the first output, the LM logits, gets a loss)
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])
Now, let’s train the model.
num_epoch = 10
history = model.fit(dataset, epochs=num_epoch)
5. Prediction
To predict, we simply need to encode the input text and pass it to the model.
text = "লালমোহনবাবু "# encoding the input textinput_ids = tokenizer.encode(text, return_tensors='tf')# getting out outputbeam_output = model.generate( input_ids, max_length = 50, num_beams = 5, temperature = 0.7, no_repeat_ngram_size=2, num_return_sequences=5)
Now, if you read Bengali, you may point out that although the generated sentences are syntactically correct, they don’t look cohesive. True, but for this demo I have kept the setup as minimal as possible.
6. Save the Model
Well, after a long training time, what good would it do if we closed our session, lost the trained model, and had to train it again from scratch? So, let’s save the model and the tokenizer so that we can retrain from where we left off.
from transformers import WEIGHTS_NAME, CONFIG_NAME

output_dir = './model_bn_custom/'

# creating directory if it is not present
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

model_to_save = model.module if hasattr(model, 'module') else model
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)

# save model and model configs
model.save_pretrained(output_dir)
model_to_save.config.to_json_file(output_config_file)

# save tokenizer
tokenizer.save_pretrained(output_dir)
Bonus
We have already done all the hard work. So to load the saved model and tokenizer, we only need to execute two lines of code and we’re all set.
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = TFGPT2LMHeadModel.from_pretrained(output_dir)
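If you want to pick up training again from the reloaded weights, a minimal sketch (not from the original post, reusing the same dataset and training setup as in step 4) could look like this:

# re-compile with the same optimizer, loss and metric as before
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])

# continue training for a few more epochs on the same dataset
model.fit(dataset, epochs=2)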
Voila! Now you can train your own model in your own language, and create content that can compete with some of the best literary works in any language.
Future scope:
This blog gives a framework for how one can train a GPT-2 model in any language. The result is not on par with some of the pre-trained models available, but to reach that level we would need a lot more training data and computational power.