
Data Science and Big Data Analytics, Project 5: Sentiment Analysis

Project Description

Pick a topic you are interested in on Twitter, such as a movie, a celebrity, or any buzzword. Collect at least 200 tweets related to this topic and manually label them as positive, neutral, or negative. Next, randomly split them into 75% of the tweets as a training set and the remaining 25% as a test set. Deploy several classifiers on these tweets to perform sentiment analysis. Report the classification accuracy and AUC, and plot the confusion matrices. Finally, evaluate which classifier outperforms the others here.

This project includes:

  1. A description of the process of collecting tweets and manually labeling them, and of how each tweet is represented for classification.

  2. A description of the obtained dataset.

  3. For each classifier, a description of how it works, the classification procedure, and the parameter settings.

  4. The classification accuracy, AUC, and confusion matrix of each classifier.

  5. Which classifier performs better than the others here?

Project Work

1. Collecting, Labeling, and Representing the Tweets

Mulan is a recently released movie with rich subject matter. In this project we perform sentiment analysis by collecting tweets about Mulan.

To collect the data we are interested in, we first apply for a Twitter developer account and then use the Twitter API with Tweepy for authorized searching. The dataset is collected with a simple Python program, sketched below.
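Since the original script only appeared as a screenshot, here is a minimal sketch of the kind of collection code used, assuming Tweepy 3.x; the credential strings are hypothetical placeholders.

```python
import csv
import tweepy

# Placeholder credentials from the Twitter developer account (hypothetical values).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Search English tweets that mention "mulan", skipping retweets,
# and save datetime + text to result_mulan2.csv.
with open("result_mulan2.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["datetime", "text"])
    for tweet in tweepy.Cursor(api.search,
                               q="mulan -filter:retweets",
                               lang="en",
                               tweet_mode="extended").items(400):
        writer.writerow([tweet.created_at, tweet.full_text])
```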
We collected more than 400 tweets containing the word "mulan", ignoring retweets and non-English tweets. All data are saved in a CSV file named "result_mulan2.csv". We then manually label these tweets as positive, neutral, or negative.

A tweet is labeled positive if it contains praise, expresses happiness or joy, uses approving emoji such as a thumbs-up, recommends the movie to others, and so on.
A tweet is labeled neutral if it contains neither positive nor negative content, for example ordinary impressions or objective analysis.
A tweet is labeled negative if it contains criticism, sarcasm or irony, unfavorable comparisons with other well-regarded movies, negative emoji, and so on.

While labeling manually, we also delete tweets that are completely irrelevant or advertisements. In the end we keep 248 tweets.

Tweets contain many Twitter handles (@user), which is how a Twitter user is identified on Twitter. We remove all of these handles from the data because they do not convey much information.

In general text processing, punctuation, numbers, and special characters are not very helpful. Here we replace everything except letters and hashtags with spaces, and we remove all words of length 3 or shorter, for example words like "hmm" and "ohh". Because the tweets were collected with the keyword "mulan", almost every tweet contains words such as "mulan", "movie", and "disney"; to prevent overfitting we remove these words as well. We also remove all links containing "http" or "https".
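A hedged sketch of the cleaning steps described above; the function name and the exact list of removed topic words are illustrative, not the original code.

```python
import re

import pandas as pd

TOPIC_WORDS = {"mulan", "movie", "disney"}  # assumed topic words to drop

def clean_tweet(text):
    text = re.sub(r"https?://\S+", " ", text)    # remove links
    text = re.sub(r"@\w+", " ", text)            # remove twitter handles
    text = re.sub(r"[^a-zA-Z#]", " ", text)      # keep only letters and '#'
    words = [w for w in text.lower().split() if len(w) > 3]   # drop words of length <= 3
    words = [w for w in words if w not in TOPIC_WORDS]        # drop topic words
    return " ".join(words)

df = pd.read_csv("result_mulan2.csv")
df["tidy_text"] = df["text"].apply(clean_tweet)
```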
Then we tokenize all the cleaned tweets in the dataset.
Stemming is a rule-based process that strips suffixes ("ing", "ly", "es", "s", etc.) from a word.

For example, "player", "played", "plays", and "playing" are all variations of the word "play".

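A minimal sketch of the tokenization and stemming step, assuming NLTK's PorterStemmer and the hypothetical df built above.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Split each cleaned tweet into tokens, stem every token, and join back into a string.
df["tidy_text"] = df["tidy_text"].apply(
    lambda text: " ".join(stemmer.stem(token) for token in text.split()))
```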

We randomly split the tweets into 75% as a training set and the remaining 25% as a test set.

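A sketch of the split, assuming the labeled data live in the tidy_text and sentiment columns of the hypothetical df above.

```python
from sklearn.model_selection import train_test_split

# 75% of the tweets for training, the remaining 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    df["tidy_text"], df["sentiment"], test_size=0.25, random_state=42)
```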
We use a bag-of-words model to construct the text features.
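A minimal bag-of-words sketch using sklearn's CountVectorizer; the max_features value is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each tweet becomes a sparse vector of word counts over the training vocabulary.
vectorizer = CountVectorizer(max_features=1000, stop_words="english")
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
```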
We use the DecisionTreeClassifier from the sklearn library, then compute the accuracy and the AUC and build the confusion matrix. A sketch of this evaluation step, including the function used to plot the confusion matrix, follows.
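A hedged sketch of the evaluation step for the decision tree (the same pattern applies to the other classifiers); the multiclass AUC is computed one-vs-rest from predicted probabilities, and the plotting helper is illustrative rather than the original function.

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_bow, y_train)
y_pred = clf.predict(X_test_bow)
y_prob = clf.predict_proba(X_test_bow)

print("Accuracy:", accuracy_score(y_test, y_pred))
# One-vs-rest AUC averaged over the three sentiment classes.
print("AUC:", roc_auc_score(y_test, y_prob, multi_class="ovr"))

def plot_confusion_matrix(y_true, y_predicted, classes):
    """Draw the confusion matrix with the count annotated in each cell."""
    cm = confusion_matrix(y_true, y_predicted, labels=classes)
    plt.imshow(cm, cmap="Blues")
    plt.xticks(range(len(classes)), classes)
    plt.yticks(range(len(classes)), classes)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, cm[i, j], ha="center", va="center")
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.show()

plot_confusion_matrix(y_test, y_pred, list(clf.classes_))
```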

2. Dataset Description

The dataset contains three attributes: datetime, text, and sentiment. There are 248 tweets in total: 59 negative, 99 positive, and 90 neutral. A sketch for counting and plotting the tweets per sentiment follows.
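A minimal sketch for the per-class counts and bar chart, assuming the sentiment labels are stored as strings in the hypothetical df above.

```python
import matplotlib.pyplot as plt

# Number of tweets per sentiment class.
counts = df["sentiment"].value_counts()
print(counts)

counts.plot(kind="bar")
plt.xlabel("Sentiment")
plt.ylabel("Number of tweets")
plt.show()
```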
(Figures: word clouds of all tweets labeled positive, of all tweets labeled neutral, and of all tweets labeled negative)
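A hedged sketch for generating one of these word clouds, assuming the wordcloud package and that the labels are stored as "positive", "neutral", and "negative".

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Concatenate all cleaned positive tweets and render a word cloud
# (repeat with "neutral" and "negative" for the other two figures).
positive_text = " ".join(df.loc[df["sentiment"] == "positive", "tidy_text"])
cloud = WordCloud(width=800, height=500, background_color="white").generate(positive_text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```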
3. Classifier Descriptions

Logistic Regression
Principle:
Assume that the data follows a Bernoulli distribution, then use maximum likelihood estimation to estimate the parameters. Fit the decision boundary (not limited to linear; it can also be polynomial), and then establish the relationship between the boundary and the classification probability to obtain the probability in the two-class case.
Classification procedure:

  1. After a linear transformation, use the sigmoid function to squash the value into the range 0-1.
  2. Use the cross-entropy objective function to measure the discrepancy between the predicted and the true distributions.
  3. Use gradient updates to update the weights.

Parameter setting:
solver='liblinear': for small datasets, liblinear is a good choice.
max_iter: the maximum number of iterations taken for the solver to converge.
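A minimal usage sketch on the bag-of-words features from Part 1; the max_iter value is an assumption.

```python
from sklearn.linear_model import LogisticRegression

# liblinear works well on small datasets; max_iter caps the solver iterations.
lr = LogisticRegression(solver="liblinear", max_iter=1000)
lr.fit(X_train_bow, y_train)
print("Accuracy:", lr.score(X_test_bow, y_test))
```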

K-Nearest Neighbors

Principle:
In the feature space, if the majority of the k nearest samples around a sample belong to a category, then the sample also belongs to that category.
(Illustration: a green circle to be classified, surrounded by yellow squares and blue triangles)
If K=3, the three nearest neighbors of the green circle are 2 yellow squares and 1 blue triangle. By majority vote, the green circle to be classified is assigned to the yellow squares.

If K=9, the nine nearest neighbors of the green circle are 4 yellow squares and 5 blue triangles. By majority vote, the green circle to be classified is assigned to the blue triangles.

When it is not obvious which category a point belongs to, we can look at its position relative to its neighbors: each neighbor is weighted, and the point is put into the category with the greater total weight. This is the kernel of the K-nearest neighbor algorithm.

The K-nearest neighbor algorithm does not train on the data. It compares the unknown data with the known data and obtains the result directly, so the k-nearest neighbor algorithm has no explicit learning process.

Distance measurement: Euclidean distance.

d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
Classification Procedure:
Raw materials: a sample dataset and the corresponding labels.

Input new data without labels and compare each feature of the new data with the corresponding feature of the data in the sample dataset. Then extract the classification labels of the most similar samples (nearest neighbors).

We only select the first k most similar samples in the dataset.

Select the label that appears most frequently among these k most similar samples; this serves as the label for the new data.

Parameter Setting:
class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
n_neighbors: the number of neighbors used for kneighbors queries (set to 3 here).
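A minimal usage sketch with n_neighbors=3, as described above, on the bag-of-words features from Part 1.

```python
from sklearn.neighbors import KNeighborsClassifier

# Majority vote among the 3 nearest neighbors in bag-of-words space.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_bow, y_train)
print("Accuracy:", knn.score(X_test_bow, y_test))
```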

Decision Tree Classifier

Principle:
A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents a test outcome, and each leaf represents a category. Decision tree learning is a form of example-based inductive learning. It adopts a top-down recursive method; the basic idea is to build, guided by information entropy, the tree along which the entropy decreases fastest, until the entropy at the leaf nodes is zero, at which point the instances in each leaf node all belong to the same class.

Classification Procedure:

Calculate the information entropy of the dataset before partitioning.

Iterate over all the features as partition conditions and calculate the information entropy of the dataset after partitioning by each feature.

Select the feature with the maximum information gain and use it as the data-partitioning node to classify the data.

Process the partitioned child datasets recursively, repeating the steps above on the features not yet selected; the optimal partitioning feature is then chosen to divide each sub-dataset.

Conditions for stopping the recursion:

All the features have been used up.

The information gain after partitioning is small enough.

For the second stopping condition, a threshold on the information gain needs to be set.

Parameter Setting:
class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0)
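The signature above shows the default criterion='gini'; a sketch using criterion='entropy', which matches the information-gain procedure described above (the parameter choice is an assumption).

```python
from sklearn.tree import DecisionTreeClassifier

# criterion="entropy" selects splits by information gain, as described above.
dt = DecisionTreeClassifier(criterion="entropy", random_state=42)
dt.fit(X_train_bow, y_train)
print("Accuracy:", dt.score(X_test_bow, y_test))
```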

SVC()

Principle:

SVM performs linear classification. It can also efficiently perform non-linear classification using the kernel trick, implicitly mapping the inputs into high-dimensional feature spaces.

When the data are unlabelled, supervised learning is not possible and an unsupervised approach is required, which attempts to find a natural clustering of the data into groups and then maps new data onto these groups.

Classification Procedure:
Step 1: Implement the traditional SMO algorithm
Step 2: Implement kernel function cache
Step 3: Optimize the error value solution
Step 4: Realize the separation of hot and cold data
Step 5: Support Ensemble
Step 6: Continue to optimize the kernel function calculation
Step 7: Support sparse vector and non-sparse vector

Parameter Setting:
class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
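A minimal usage sketch on the bag-of-words features; probability=True is only needed if the AUC is computed from predicted probabilities (an assumption about how the evaluation was done).

```python
from sklearn.svm import SVC

# RBF kernel by default; probability=True enables predict_proba for the AUC computation.
svc = SVC(kernel="rbf", probability=True)
svc.fit(X_train_bow, y_train)
print("Accuracy:", svc.score(X_test_bow, y_test))
```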

GaussianNB()

Principle:

Bayes' principle can be summarized as: prior probability + data = posterior probability.

In Scikit-Learn there are three naive Bayes classifiers: GaussianNB, MultinomialNB, and BernoulliNB. GaussianNB is naive Bayes with a Gaussian likelihood for each feature. In general, GaussianNB is the better choice when the sample features are continuously distributed.

Classification Procedure:

Continuous features could also be handled by cutting their range into intervals and discretizing them directly, but with that method it is harder to control the size of the intervals and the quality requirements on the training set are relatively high; GaussianNB instead models each feature within each class with a Gaussian distribution.

Parameter Setting:
class sklearn.naive_bayes.GaussianNB(priors=None, var_smoothing=1e-09)
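A minimal usage sketch; note that GaussianNB expects dense input, so the sparse bag-of-words matrices are converted with toarray().

```python
from sklearn.naive_bayes import GaussianNB

# GaussianNB does not accept sparse matrices, so convert the bag-of-words features to dense arrays.
gnb = GaussianNB()
gnb.fit(X_train_bow.toarray(), y_train)
print("Accuracy:", gnb.score(X_test_bow.toarray(), y_test))
```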

4. Classification Accuracy, AUC, and Confusion Matrix of Each Classifier

Logistic Regression:
Accuracy = 0.6935
AUC = 0.8054
Confusion matrix: (figure)
K-Neighbors Classifier:

Accuracy = 0.5645
AUC = 0.6994
Confusion matrix: (figure)
Decision Tree Classifier:
Accuracy = 0.7580
AUC = 0.8308
Confusion matrix: (figure)
SVC:
Accuracy = 0.4032
AUC = 0.5
Confusion matrix: (figure)
GaussianNB:

Accuracy = 0.4354
AUC = 0.6978
Confusion matrix: (figure)

5. Which Classifier Performs Better Than the Others?

The decision tree performs better than the other classifiers here: it has the highest accuracy and AUC, about 0.76 and 0.83 respectively.
