Implementing BERT in PyTorch
As of the time of writing this piece, state-of-the-art results on NLP and NLU tasks are obtained with Transformer models. There is a trend of performance improving as models become deeper and larger; GPT-3 comes to mind. Training even small versions of such models from scratch takes a significant amount of time, even with a GPU. This problem can be addressed via pre-training, where a model is first trained on a large text corpus using a high-performance cluster. Later it can be fine-tuned for a specific task in a much shorter amount of time. During the fine-tuning stage, additional layers can be added to the model for specific tasks, which can be different from those for which the model was initially trained. This technique is related to transfer learning, a concept applied to areas of machine learning beyond NLP (see here and here for a quick intro).
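To make this concrete, here is a minimal PyTorch sketch of the setup: a pre-trained encoder is wrapped together with a new, task-specific classification head, and optionally only the new head is trained. The `FineTunedClassifier` class, the stand-in encoder, and all sizes are hypothetical placeholders for illustration; in practice the pre-trained part would be loaded from a checkpoint.

```python
import torch
import torch.nn as nn

class FineTunedClassifier(nn.Module):
    """A pre-trained encoder extended with a new task-specific head."""

    def __init__(self, pretrained_encoder: nn.Module, hidden_size: int, num_classes: int):
        super().__init__()
        self.encoder = pretrained_encoder                       # weights learned during pre-training
        self.classifier = nn.Linear(hidden_size, num_classes)   # new, randomly initialized layer

    def forward(self, x):
        hidden = self.encoder(x)         # reuse pre-trained representations
        return self.classifier(hidden)   # task-specific prediction

# Stand-in for a pre-trained encoder; a real one would come from a checkpoint.
pretrained_encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU())
model = FineTunedClassifier(pretrained_encoder, hidden_size=256, num_classes=2)

# Optionally freeze the pre-trained part and train only the new head.
for param in model.encoder.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3
)

logits = model(torch.randn(4, 128))  # forward pass on a dummy batch
```

Freezing the encoder trains only the new layers, which is fast; unfreezing everything and fine-tuning the whole model with a small learning rate is the more common choice when enough labeled data is available.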
In this post, I would like to share my experience of fine-tuning BERT and RoBERTa, available from the transformers library by Hugging Face, for a document classification task. Both models share a transformer architecture, which consists of at least two distinct blocks: an encoder and a decoder. Both the encoder and the decoder are made up of multiple layers built around the attention mechanism. The encoder processes the input token sequence into a vector of floating-point numbers, a hidden state, which is picked up by the decoder. It is the h