Twitter Sentiment Analysis using ULMFiT
In today's data-saturated world, companies gather enormous amounts of data on customer feedback, shopping behavior, and more. By analyzing this data, they can adapt their digital presence, products, or services to better fit the market and their customers. However, it is still difficult for any human to interpret it all manually without mistakes or bias.
Sentiment analysis is a method for evaluating whether a piece of text is positive, negative, or neutral. It lets data analysts at multinational corporations gauge public opinion, monitor how products are perceived, and understand consumer attitudes.
Today, deep learning and Natural Language Processing (NLP) play a significant role in sentiment analysis. This blog focuses on applying sentiment analysis to Twitter data scraped for every major U.S. airline. Our goal is to classify customer tweets into three categories: positive, negative, and neutral. There are several machine learning algorithms for classification, such as K-Nearest Neighbors, explained in my blog Telecom Industry Customer Churn Prediction with K Nearest Neighbor. In this case, however, we will use NLP techniques since our data is unstructured (raw text). By the end of this blog, we will have built and trained a State-of-The-Art (SoTA) machine learning model to classify tweets by sentiment.
This dataset is available on Kaggle: https://www.kaggle.com/crowdflower/twitter-airline-sentiment.
We will apply a supervised ULMFiT model to the Twitter data of major U.S. airlines. We will follow the ULMFiT approach of Howard and Ruder presented in the paper Universal Language Model Fine-tuning for Text Classification. ULMFiT stands for Universal Language Model Fine-tuning. It is an efficient transfer learning approach that can be applied to any NLP task via language model fine-tuning.
We will follow a step-by-step procedure to build the ULMFiT model: data exploration, then text preprocessing, then building the language model, and finally building the classifier model. At the end, we will evaluate the accuracy of our ULMFiT model.
The complete Jupyter notebook for this can be found here: Twitter-Sentiment-Analysis-using-ULMFiT. So let’s begin.
Data Exploration and Processing
Exploratory Data Analysis (EDA) of the dataset revealed missing values in a few of its columns.
The columns with more than 90% missing values were removed: tweet_coord, airline_sentiment_gold, and negativereason_gold.
Hence, it is better to delete these columns because they will not provide any useful information to our model.
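As a concrete illustration of this step, here is a minimal pandas sketch that loads the Kaggle file and drops the mostly-empty columns. The file name Tweets.csv is the one distributed with the Kaggle dataset, and the 90% threshold follows the text above.

```python
import pandas as pd

# Load the Kaggle airline tweets file (file name as distributed on Kaggle).
dataset = pd.read_csv('Tweets.csv')

# Fraction of missing values in each column, largest first.
missing = dataset.isnull().mean().sort_values(ascending=False)
print(missing.head())

# Drop the columns that are more than 90% empty
# (tweet_coord, airline_sentiment_gold, negativereason_gold).
dataset = dataset.drop(columns=missing[missing > 0.9].index)
```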
The majority of the comments are negative, which means people are generally dissatisfied with the airline company’s service.
From the bar graph above, it is evident that United Airlines is mentioned the most on Twitter. Of course, we do not yet know whether that attention is positive or negative. Apart from that, the fact that there are very few tweets about Virgin America suggests that its service is perhaps neither notably good nor notably bad.
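A minimal sketch of the two bar charts discussed above; the use of seaborn/matplotlib here is an assumption, since the original notebook's plotting code is not shown.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of sentiment labels: negative tweets dominate.
sns.countplot(x='airline_sentiment', data=dataset)
plt.show()

# Number of tweets per airline: United is mentioned the most,
# Virgin America the least.
sns.countplot(x='airline', data=dataset,
              order=dataset['airline'].value_counts().index)
plt.show()
```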
We will then define a new column, 'tweet_len', which holds the length of each tweet in the 'text' column of our dataset.
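A short sketch of how 'tweet_len' can be added and inspected; the histogram code is illustrative rather than taken from the original notebook.

```python
# Length of each tweet in characters.
dataset['tweet_len'] = dataset['text'].str.len()

# Tweet-length distribution per sentiment class.
g = sns.FacetGrid(dataset, col='airline_sentiment')
g.map(plt.hist, 'tweet_len', bins=30)
plt.show()
```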
There is not much connection between tweet length and the number of positive or neutral tweets. For negative tweets, however, the distribution is strongly biased towards longer tweets. This is plausible: the angrier the person tweeting, the more he or she has to say. After completing the EDA, we can say that sentiment varies significantly from airline to airline. Considering overall sentiment, the most positive is Virgin America, while the most negative is United.
Text Preprocessing
Before building the model, we will process the column named 'text', which contains the raw text of the tweets posted by customers.
We will perform the text preprocessing using the well-known nltk library. To do this, we will import the necessary libraries into our notebook and create a new data frame containing just two columns: airline_sentiment and text.
```python
tweet_senti = dataset[['airline_sentiment', 'text']]
tweet_senti
```
We will clean the text column and store the cleaned tweets in a list named corpus. We will do this by:
- Converting all characters to lowercase.
- Removing all characters other than A-Z and a-z.
- Removing the hashtag symbol (#).
- Replacing URLs of the form 'https://…' with the plain token 'link'.
```python
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

wordnet = WordNetLemmatizer()
ps = PorterStemmer()

corpus = []
for i in range(0, len(tweet_senti)):
    sntm = tweet_senti['text'][i]
    # Replace URLs with the plain token 'link'.
    sntm = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'link', sntm)
    # Remove the hashtag symbol but keep the word.
    sntm = re.sub(r'#([^\s]+)', r'\1', sntm)
    # Keep only alphabetic characters, then lowercase everything.
    sntm = re.sub('[^a-zA-Z]', ' ', sntm)
    sntm = sntm.lower()
    # Optional stop-word removal and stemming (left disabled, as in the original).
    # sntm = ' '.join(ps.stem(word) for word in sntm.split()
    #                 if word not in set(stopwords.words('english')))
    corpus.append(sntm)
```
Now we will replace the 'text' column in the data frame tweet_senti with the cleaned values stored in corpus. We do not need the rest of the columns in the dataset; thus, we keep only the two relevant columns, 'newtext' and 'airline_sentiment', as the new data frame.
```python
tweet_senti['newtext'] = corpus
tweet_senti.drop(["text"], axis=1, inplace=True)
```
We will then split the generated data frame into a Training set and Testing set, where 80% will be in the training set, and 20% will be in the testing set.
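A sketch of the split, assuming scikit-learn's train_test_split; tweet_train.csv is the file name used when building the language model below, while the random seed and stratification are assumptions.

```python
from sklearn.model_selection import train_test_split

# 80/20 split, stratified on the sentiment label so class proportions are preserved.
train, test = train_test_split(tweet_senti, test_size=0.2, random_state=42,
                               stratify=tweet_senti['airline_sentiment'])

# The training portion is written to disk for fastai to load later.
train.to_csv('tweet_train.csv', index=False)
```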
Building the ULMFiT Model
Universal Language Model Fine-tuning (ULMFiT) is a transfer learning technique that can assist with different NLP tasks. It has been a State-of-The-Art (SoTA) NLP technique for some time. The original paper demonstrates ULMFiT on the IMDB sentiment problem. ULMFiT consists of three stages:

a) LM pre-training: the language model (LM) is trained on a general-domain corpus to capture general features of the language in its different layers. Transfer learning and the ULMFiT method aim to adapt this pre-trained model to our problem.

b) LM fine-tuning: the language of tweets is different, so we fine-tune the full LM on the target dataset (the tweets), using discriminative fine-tuning and slanted triangular learning rates (STLR) to learn task-specific features.

c) Classifier fine-tuning: the classifier is fine-tuned on the target task using gradual unfreezing, discriminative fine-tuning, and STLR to preserve low-level representations while adapting high-level ones.
Build the Language Model
We will make heavy use of the fastai library to build this model, so we will import it to develop and train our ULMFiT model. The fastai.text module contains everything needed to prepare a dataset for different NLP (Natural Language Processing) tasks and to quickly build models from it. We will use the TextLMDataBunch class, which is designed for training a language model, and create one from tweet_train.csv. We specify valid_pct=0.1 to reserve 10% of our training data for the validation set.
```python
from fastai.text import *

tweet = TextLMDataBunch.from_csv(path='', csv_name='tweet_train.csv', valid_pct=0.1)
```
Now we will build the language model, train it, and finally save the encodings (the optimized weights of the trained model). To build the language model, we use the language_model_learner() function from fastai. We pass in our 'tweet' object to point it at our Twitter dataset, along with AWD_LSTM to specify the architecture of our language model. AWD-LSTM, which stands for ASGD Weight-Dropped LSTM, has long dominated State-of-The-Art language modeling and uses a variety of well-known regularization techniques.
```python
tweet_model = language_model_learner(tweet, AWD_LSTM, drop_mult=0.3)
tweet_model.model  # inspect the AWD-LSTM architecture
```
We will use fastai's learning rate finder to find a good learning rate. The learning rate is a hyper-parameter that regulates how much we adjust the neural network weights at each update; it determines the step size taken at each iteration while moving toward a minimum of the loss function.
```python
tweet_model.lr_find()
tweet_model.recorder.plot(show_grid=True, suggestion=True)
```
As the plot above shows, we take 3.98e-02 as the learning rate because the loss reaches its minimum around that value. We set the cycle length to 1 since we first train for only one epoch. We will also use another hyper-parameter, moms, a tuple of the form (max_momentum, min_momentum).
```python
tweet_model.fit_one_cycle(cyc_len=1, max_lr=3.98e-02, moms=(0.85, 0.75))
tweet_model.unfreeze()
tweet_model.fit_one_cycle(cyc_len=5, max_lr=slice(3.98e-02/(2.6**4), 3.98e-02),
                          moms=(0.85, 0.75))
```
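The text above mentions saving the encodings after training the language model, but the corresponding call is not shown; a one-line sketch follows, where the encoder name 'ft_enc' is an arbitrary choice rather than a name from the original post.

```python
# Save the fine-tuned encoder weights so the classifier can reuse them later.
tweet_model.save_encoder('ft_enc')
```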
Building the Classification Model
Once we have built the language model, we modify it accordingly to perform the classification task.
To do this, we first create a new learner object using 'text_classifier_learner'. The idea behind this learner is similar to 'language_model_learner' because it uses the same AWD_LSTM architecture, and it can likewise take callbacks that let us train the model with special optimization techniques. After that, we load the saved encoder into this learner.
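The original post does not show this code; the sketch below uses the fastai v1 API, reuses the name tweet_model for the classifier learner (the prediction code at the end calls tweet_model.predict), and assumes the classifier data bunch is built from the same CSV with the language model's vocabulary and that the encoder was saved as 'ft_enc' above.

```python
# Data bunch for classification, sharing the language model's vocabulary.
tweet_clas = TextClasDataBunch.from_csv(path='', csv_name='tweet_train.csv',
                                        vocab=tweet.vocab, valid_pct=0.1, bs=32)

# Classifier with the same AWD_LSTM backbone; load the fine-tuned encoder.
tweet_model = text_classifier_learner(tweet_clas, AWD_LSTM, drop_mult=0.3)
tweet_model.load_encoder('ft_enc')
```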
Next, we perform gradual unfreezing. We first unfreeze only the last layer group, since it contains the least general knowledge, and fine-tune it for one epoch. We then unfreeze the next lower layer group and repeat, until all layers are unfrozen and fine-tuned until convergence at the last iteration. In other words, we unfreeze and train the layers of our model one group at a time, from the top (last) layers down to the inner layers. This is done to prevent the model from catastrophically forgetting what it has learned.
As before, we run the learning rate finder and choose a learning rate from the region where the loss is still decreasing, before it reaches its minimum. Along with it, we use the one-cycle policy.
Now we keep unfreezing layer by layer. As different layers capture different types of information, they should be fine-tuned to different extents. Instead of using the same learning rate for all layers of the model, discriminative fine-tuning lets us apply a specific learning rate to each layer group. Thus, we train with the next layer group unfrozen and apply discriminative fine-tuning.
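A sketch of this training schedule in fastai v1; the specific learning rates, momenta, and number of epochs per stage are illustrative assumptions rather than values from the original post.

```python
# Stage 1: train only the last layer group.
tweet_model.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))

# Stage 2: unfreeze one more layer group; slice(...) applies
# discriminative learning rates (smaller for earlier layers).
tweet_model.freeze_to(-2)
tweet_model.fit_one_cycle(1, slice(1e-2 / (2.6**4), 1e-2), moms=(0.8, 0.7))

# Stage 3: unfreeze a third layer group.
tweet_model.freeze_to(-3)
tweet_model.fit_one_cycle(1, slice(5e-3 / (2.6**4), 5e-3), moms=(0.8, 0.7))

# Finally, unfreeze everything and fine-tune the whole classifier.
tweet_model.unfreeze()
tweet_model.fit_one_cycle(1, slice(1e-3 / (2.6**4), 1e-3), moms=(0.8, 0.7))
```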
Prediction and Model Evaluation
After applying gradual unfreezing and discriminative fine-tuning, we evaluate the accuracy of the model on the testing set that we generated by splitting the original dataset.
```python
test['airline_senti_pred'] = test['newtext'].apply(
    lambda row: str(tweet_model.predict(row)[0]))

from sklearn.metrics import accuracy_score, confusion_matrix

print("Accuracy of Model: {}".format(
    accuracy_score(test['airline_sentiment'], test['airline_senti_pred'])))

# Plotting the confusion matrix
cf_matrix = confusion_matrix(test['airline_sentiment'], test['airline_senti_pred'])
print(cf_matrix)
```
Conclusion
After experimenting with the different learning rates and training schedules described above, the accuracy we obtained was 0.825 (82.5%). Language modeling can be viewed as an ideal source task for NLP: it captures many aspects of language, such as long-term dependencies, hierarchical relationships, and sentiment, and it provides training data in almost unlimited quantities for most domains and languages. As the gradual unfreezing showed, increasing the number of unfrozen layers per epoch caused the validation loss to rise, indicating that the model was overfitting. The best result we obtained was with two layer groups unfrozen, because in that case both the validation loss and the training loss were considerably lower.
Originally published at: https://medium.com/@rajasubhare/twitter-sentiment-analysis-using-ulmfit-e27d8326d1c0