NLP Text Classification

Text classification is one of the most important applications of NLP. Tasks such as sentiment analysis and identifying spam, bots, and offensive comments all fall under text classification. Until now, the usual approach to these problems has been to build a machine learning or deep learning model from scratch, train it on your text data, and tune its hyperparameters. Such models give decent results for tasks like classifying a movie review as positive or negative, but they may perform terribly when the text becomes more ambiguous, because most of the time there is simply not enough labeled data to learn from.

But wait a minute: don’t ImageNet models classify images with the same approach? How do they achieve such great results? The difference is that, instead of building every model from scratch, a model already trained to solve one problem (classifying images from ImageNet) is reused as the basis for solving another, somewhat similar problem. Because the fine-tuned model doesn’t have to learn from scratch, it reaches higher accuracy without needing a lot of data. This is the principle of transfer learning, on which Universal Language Model Fine-tuning (ULMFiT) is built: a language model pre-trained on a large text corpus is fine-tuned and then used for text classification.

Today we are going to see how you can apply this approach to sentiment analysis. You can read more about ULMFiT, its advantages, and how it compares with other approaches here.

The fastai library provides the modules needed to train and use ULMFiT models. You can view the library here.

The problem we are going to solve is sentiment analysis of tweets about US airlines. You can download the dataset from here. So without further ado, let’s start!

First, let’s import all the libraries.

Now we will load the CSV file of our data into a Pandas DataFrame and take a look at the data.

Now we check whether there are any nulls in the dataframe. We observe that there are 5462 nulls in the negativereason column. These nulls belong to tweets with positive or neutral sentiment, which makes sense. We verify this by counting all non-negative tweets: the two numbers match. The negativereason_confidence count doesn’t match the negativereason count because the 0 values in the negativereason_confidence column correspond to blanks in the negativereason column.
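The original code for this check is not reproduced here, so the following is only a minimal sketch of the same idea using Python's standard library instead of Pandas. The three column names come from the dataset; the five sample rows and the helper `count_blanks` are made up for illustration.

```python
import csv
import io

# Hypothetical miniature of the tweets CSV; only the column names are real.
SAMPLE_CSV = """airline_sentiment,negativereason,negativereason_confidence
negative,Late Flight,0.7
neutral,,
positive,,0
negative,Bad Flight,1.0
neutral,,0
"""

def count_blanks(rows, columns):
    """Return {column: number of rows whose value is empty} for the given columns."""
    return {col: sum(1 for row in rows if not row[col].strip()) for col in columns}

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
blanks = count_blanks(rows, list(rows[0].keys()))

# Every tweet that is not negative should account for one blank negativereason.
non_negative = sum(1 for row in rows if row["airline_sentiment"] != "negative")

print(blanks)
print(non_negative == blanks["negativereason"])
```

In this toy data the confidence column has fewer blanks than negativereason because some of its entries are 0 rather than empty, which mirrors the mismatch described above.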

The total count of data samples is 14640. The columns airline_sentiment_gold, negativereason_gold, and tweet_coord have large numbers of blanks, in the range of 13000–14000. We can conclude that these columns will not provide any significant information, so they can be discarded.
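As a rough illustration of that decision, here is a small standard-library sketch that flags columns that are mostly blank and drops them. The 0.85 threshold, the helper names, and the sample rows are assumptions for this example, not values taken from the article's code.

```python
def mostly_blank_columns(rows, threshold=0.85):
    """Return names of columns whose share of empty values is at least `threshold`."""
    if not rows:
        return []
    total = len(rows)
    return [
        col for col in rows[0]
        if sum(1 for row in rows if not (row[col] or "").strip()) / total >= threshold
    ]

def drop_columns(rows, to_drop):
    """Return copies of the rows without the columns listed in `to_drop`."""
    to_drop = set(to_drop)
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]

# Hypothetical rows; the column names are from the dataset, the values are made up.
rows = [
    {"text": "great flight", "airline_sentiment": "positive", "tweet_coord": ""},
    {"text": "lost my bag", "airline_sentiment": "negative", "tweet_coord": ""},
    {"text": "on time", "airline_sentiment": "neutral", "tweet_coord": ""},
]
sparse = mostly_blank_columns(rows)
cleaned = drop_columns(rows, sparse)
print(sparse)
print(list(cleaned[0]))
```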

Now that we have the relevant data, let’s start building our model.

When we build an NLP model with fastai, there are two stages:

  • Creating a language model (LM) and fine-tuning it from the pre-trained model
  • Using the fine-tuned model as a classifier

Here I’m using TextList, which is part of the data block API, instead of the factory methods of TextClasDataBunch and TextLMDataBunch, because the data block API is more flexible and powerful.

We can see that, since we are training a language model, all the texts are concatenated together (with a random shuffle between them at each new epoch).
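To make the concatenation concrete, here is a toy sketch in plain Python of how a language-model stream could be built: shuffle the texts, join them into one token stream, and cut it into windows whose target is the input shifted by one token. This is not fastai's implementation; the function names and sample tweets are hypothetical, and `xxbos` is used only as a beginning-of-text marker.

```python
import random

def language_model_stream(tokenized_texts, seed):
    """Shuffle the texts, then join them into one long token stream."""
    order = list(range(len(tokenized_texts)))
    random.Random(seed).shuffle(order)
    stream = []
    for i in order:
        stream.extend(tokenized_texts[i])
    return stream

def next_token_pairs(stream, seq_len):
    """Cut the stream into (input, target) windows; the target is the input shifted by one."""
    pairs = []
    for start in range(0, len(stream) - 1, seq_len):
        inputs = stream[start:start + seq_len]
        targets = stream[start + 1:start + 1 + seq_len]
        if len(inputs) == len(targets):  # drop an incomplete final window
            pairs.append((inputs, targets))
    return pairs

# Hypothetical tokenized tweets.
texts = [
    ["xxbos", "flight", "delayed"],
    ["xxbos", "thanks", "united"],
    ["xxbos", "lost", "bag"],
]
stream = language_model_stream(texts, seed=0)  # use a different seed for each epoch
pairs = next_token_pairs(stream, seq_len=4)
print(stream)
print(pairs)
```

No labels appear anywhere in this step: the "label" of each token is simply the token that follows it.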

Now we will fine-tune our model starting from the weights of a model pre-trained on a larger corpus, Wikitext 103. That model was trained to predict the next word of the sentence given to it as input. Because the language of tweets is not always grammatically perfect, we will have to adjust the parameters of our model. Next, we will run the learning rate finder and visualize the result. The plot helps us spot a suitable range of learning rates to choose from when training our model.
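The idea behind a learning rate finder can be sketched without any library: sweep the learning rate geometrically, record the loss at each step, and look for the region where the loss falls fastest. The snippet below is a simplified illustration under that assumption; the loss values are invented, and picking the single steepest step is only one possible heuristic, not necessarily the one fastai uses.

```python
def lr_sweep(start_lr, end_lr, num_steps):
    """Learning rates spaced geometrically from start_lr to end_lr."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]

def steepest_descent_lr(lrs, losses):
    """Return the learning rate at which the loss dropped fastest between consecutive steps."""
    drops = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    steepest = min(range(len(drops)), key=drops.__getitem__)
    return lrs[steepest]

lrs = lr_sweep(1e-5, 1.0, 6)
losses = [5.0, 4.9, 4.5, 3.6, 3.4, 7.0]  # hypothetical losses, one per learning rate
best = steepest_descent_lr(lrs, losses)
print(lrs)
print(best)
```

The final jump in the made-up losses stands for the point where the learning rate is too high and training diverges, which is why the tail of such a plot is usually ignored.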

By default, the Learner object is frozen, so at first we train only the embeddings. Here, instead of running the cycle for one epoch, I am going to run it for 6 to see how the accuracy varies. I picked the learning rate with the help of the plot above.

We got very low accuracy, which was expected because the rest of our model is still frozen, but we can see that the accuracy is increasing.

We see that the accuracy improved slightly but is still hovering in the same range. This is because, firstly, we started from a pre-trained model with a different vocabulary and, secondly, there were no labels: we passed the data without specifying any.

Now we will test our model with a random input and see whether it accurately completes the sentence.

Now, we’ll create a new data object that only grabs the labeled data and keeps those labels.

The classifier needs a little less dropout, so we pass drop_mult=0.5 to multiply all the dropouts by this amount. Instead of loading the pre-trained model, we load our fine-tuned encoder from the previous section.
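What "multiply all the dropouts" means can be shown with a tiny sketch. The config keys and values below are illustrative only and are not claimed to be fastai's defaults; `scale_dropouts` is a hypothetical helper.

```python
def scale_dropouts(config, drop_mult):
    """Multiply every dropout probability (keys ending in '_p') by drop_mult."""
    return {
        key: value * drop_mult if key.endswith("_p") else value
        for key, value in config.items()
    }

# Illustrative values only; they are not claimed to be fastai's defaults.
base_config = {"n_layers": 3, "input_p": 0.4, "hidden_p": 0.3, "output_p": 0.4}
clas_config = scale_dropouts(base_config, drop_mult=0.5)
print(clas_config)
```

A single multiplier keeps the relative strength of the different dropouts intact while making the whole model more or less regularized.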

Again we perform steps similar to those for the language model. Here I am skipping the last 15 data points of the learning rate plot, as I’m only interested in learning rates up to 1e-1.

Here we see that, now that we provide labels, the accuracy has improved drastically compared with the language model in step 1.

Now we will partially train the model by unfreezing one layer group at a time and using differential learning rates. Here I am using slice, which spreads the specified learning rates across the 3 layer groups of the model.
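The two ideas in this step, differential learning rates and gradual unfreezing, can be sketched in a few lines of plain Python. The geometric spacing between the lowest and highest rate is an assumption about how a range such as slice(1e-4, 1e-2) is spread over the groups, and both helper names are hypothetical.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Geometrically spaced rates: `lowest` for the earliest group, `highest` for the last."""
    if n_groups == 1:
        return [highest]
    step = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * step ** i for i in range(n_groups)]

def trainable_mask(n_groups, n_unfrozen):
    """True for the last `n_unfrozen` groups, False for the frozen groups before them."""
    return [i >= n_groups - n_unfrozen for i in range(n_groups)]

rates = spread_learning_rates(1e-4, 1e-2, 3)
# Gradual unfreezing: first only the last group, then the last two, then all three.
schedule = [trainable_mask(3, k) for k in (1, 2, 3)]
print(rates)
print(schedule)
```

Earlier groups hold more general language knowledge, so they get smaller learning rates and are unfrozen last.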

We see that the accuracy is improving gradually, which is expected as we are gradually unfreezing the layers. With more layers unfrozen, more of the model’s depth is trained.

Finally, we will unfreeze the whole model and plot the learning rate finder again to choose the learning rate for the final training.

We see that we have achieved a maximum accuracy of 80% by the end of training.

For our final results, we’ll take the average of the predictions of the model. Since the samples are sorted by text length for batching, we pass the argument ordered=True to get the predictions in the order of the texts.
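Why the ordering matters is easy to see in a small standalone sketch: if predictions are produced for length-sorted texts, they have to be put back in the original order before they are compared with the labels. Everything below is hypothetical, including `toy_model`, which merely stands in for a trained classifier.

```python
def predict_in_original_order(texts, predict_batch):
    """Run predictions on length-sorted texts, then put them back in the input order."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    sorted_preds = predict_batch([texts[i] for i in order])
    restored = [None] * len(texts)
    for position, original_index in enumerate(order):
        restored[original_index] = sorted_preds[position]
    return restored

def accuracy(probabilities, labels):
    """Share of rows whose highest-probability class equals the label."""
    hits = sum(
        1 for probs, label in zip(probabilities, labels)
        if max(range(len(probs)), key=probs.__getitem__) == label
    )
    return hits / len(labels)

def toy_model(batch):
    """Hypothetical stand-in for a classifier: 'late' means negative, anything else positive."""
    return [[0.8, 0.1, 0.1] if "late" in text else [0.1, 0.2, 0.7] for text in batch]

texts = ["my flight was late again", "thanks", "late"]
labels = [0, 2, 0]
probs = predict_in_original_order(texts, toy_model)
acc = accuracy(probs, labels)
print(probs)
print(acc)
```

Without the reordering step, the predictions would line up with the sorted texts rather than with the labels, and the accuracy would be meaningless.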

We got an accuracy of 80.09%.

Now it’s time to test our model with new text inputs & see how it performs!

The databunch has converted the text labels into numerical values. They are as follows:

  • 0 => Negative
  • 1 => Neutral
  • 2 => Positive
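Reading a prediction back then comes down to taking the index with the highest probability and looking it up in the mapping listed above. A minimal sketch, with a made-up probability vector and a hypothetical helper name:

```python
LABELS = ["Negative", "Neutral", "Positive"]  # index -> label, as listed above

def decode_prediction(probabilities):
    """Return (label, probability) for the most likely class."""
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return LABELS[index], probabilities[index]

# Hypothetical class probabilities for one new tweet.
label, confidence = decode_prediction([0.05, 0.15, 0.80])
print(label, confidence)
```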

We see that our model has performed pretty well!

You can test the model with negative as well as mixed-sentiment text and verify the results.

Hope you find this article helpful :D

Also, any suggestions/corrections are welcome.

Happy Coding!!

Translated from: https://towardsdatascience.com/nlp-classification-with-universal-language-model-fine-tuning-ulmfit-4e1d5077372b
