Read the file financial-news.csv into a DataFrame.
# read csv
import pandas as pd
df_financial_news = pd.read_csv("C:/Users/86157/Desktop/Python/financial-news.csv")
df_financial_news.columns = ["Sentiment", "Title"]
df_financial_news
| | Sentiment | Title |
|---|---|---|
| 0 | neutral | Technopolis plans to develop in stages an area... |
| 1 | negative | The international electronic industry company ... |
| 2 | positive | With the new production plant the company woul... |
| 3 | positive | According to the company 's updated strategy f... |
| 4 | positive | FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag... |
| ... | ... | ... |
| 4840 | negative | LONDON MarketWatch -- Share prices ended lower... |
| 4841 | neutral | Rinkuskiai 's beer sales fell by 6.5 per cent ... |
| 4842 | negative | Operating profit fell to EUR 35.4 mn from EUR ... |
| 4843 | negative | Net sales of the Paper segment decreased to EU... |
| 4844 | negative | Sales in Finland decreased by 10.5 % in Januar... |

4845 rows × 2 columns
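If financial-news.csv has no header row, the pattern above (read, then assign `df.columns`) would silently treat the first record as the header. Passing `header=None` and `names=` at read time avoids that. A minimal sketch with a hypothetical two-line file:

```python
import io
import pandas as pd

# Hypothetical stand-in for a headerless financial-news.csv.
csv_text = ("neutral,Technopolis plans to develop an area\n"
            "negative,Share prices ended lower in London")

# header=None keeps the first line as data; names= labels the columns.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=["Sentiment", "Title"])
print(df.shape)  # (2, 2)
```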
Create a new DataFrame containing only the rows whose Sentiment is negative or positive.
# filter data; .copy() makes an independent DataFrame, so the later
# column reassignment does not raise a SettingWithCopyWarning
financial_news = df_financial_news[df_financial_news["Sentiment"].isin(["negative", "positive"])].copy()
financial_news
| | Sentiment | Title |
|---|---|---|
| 1 | negative | The international electronic industry company ... |
| 2 | positive | With the new production plant the company woul... |
| 3 | positive | According to the company 's updated strategy f... |
| 4 | positive | FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag... |
| 5 | positive | For the last quarter of 2010 , Componenta 's n... |
| ... | ... | ... |
| 4839 | negative | HELSINKI Thomson Financial - Shares in Cargote... |
| 4840 | negative | LONDON MarketWatch -- Share prices ended lower... |
| 4842 | negative | Operating profit fell to EUR 35.4 mn from EUR ... |
| 4843 | negative | Net sales of the Paper segment decreased to EU... |
| 4844 | negative | Sales in Finland decreased by 10.5 % in Januar... |

1967 rows × 2 columns
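Note in the output above that the original index labels (1, 2, 3, ..., 4844, with gaps) survive the filter: `isin` builds a boolean mask and keeps the matching rows as-is, while `.copy()` makes the subset independent of the original DataFrame. A toy sketch:

```python
import pandas as pd

# Toy stand-in for the news DataFrame.
df = pd.DataFrame({"Sentiment": ["neutral", "negative", "positive", "neutral"],
                   "Title": ["t0", "t1", "t2", "t3"]})

# isin builds a boolean mask; .copy() gives an independent DataFrame,
# so later column assignments do not trigger SettingWithCopyWarning.
subset = df[df["Sentiment"].isin(["negative", "positive"])].copy()
print(len(subset))            # 2
print(subset.index.tolist())  # [1, 2]  (original index labels are kept)
```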
Use the news titles as the classification feature and Sentiment as the prediction target. Convert Sentiment to numeric values (negative → 0, positive → 1), then split the data into training and testing sets.
# define X and y
financial_news["Sentiment"] = financial_news.Sentiment.map({"negative": 0, "positive": 1})
X = financial_news.Title
y = financial_news.Sentiment
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
(1475,)
(492,)
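By default `train_test_split` holds out 25% of the rows for testing (1967 × 0.25 ≈ 492 above), and `random_state` pins the shuffle so the split is reproducible. A toy sketch:

```python
from sklearn.model_selection import train_test_split

X_toy = list(range(8))
y_toy = [0, 1] * 4

# random_state makes the shuffle deterministic; test_size defaults to 0.25.
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, random_state=1)
print(len(Xtr), len(Xte))  # 6 2
```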
Use CountVectorizer to convert X_train and X_test into document-term matrices.
# import and instantiate the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# fit and transform X_train, but only transform X_test
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
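`fit_transform` learns the vocabulary from the training titles, while `transform` only encodes the test titles with that same vocabulary (words unseen during training are dropped). A sketch on a hypothetical mini-corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the headlines.
train_docs = ["profit rose sharply", "profit fell", "sales fell sharply"]
test_docs = ["sales rose strongly"]  # "strongly" is not in the training vocabulary

vect = CountVectorizer()
train_dtm = vect.fit_transform(train_docs)  # learns the vocabulary
test_dtm = vect.transform(test_docs)        # reuses it; unseen words are ignored

print(sorted(vect.vocabulary_))  # ['fell', 'profit', 'rose', 'sales', 'sharply']
print(test_dtm.toarray())        # [[0 0 1 1 0]] -- no column for "strongly"
```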
Using a Naive Bayes model, predict the sentiment class of the headlines in the test set and compute the prediction accuracy.
# import/instantiate/fit
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
MultinomialNB()
# make class predictions
y_pred_class = nb.predict(X_test_dtm)
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
0.8414634146341463
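MultinomialNB models each class by its smoothed word-count frequencies, so its behavior is easy to check on a tiny hypothetical corpus where each word is clearly tied to one class:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical corpus: 0 = negative, 1 = positive.
docs = ["profit fell", "sales fell sharply", "profit rose", "sales rose sharply"]
labels = [0, 0, 1, 1]

vect = CountVectorizer()
dtm = vect.fit_transform(docs)

nb = MultinomialNB()
nb.fit(dtm, labels)

# "fell" only appears in negative documents, so it dominates the prediction.
print(nb.predict(vect.transform(["profit fell sharply"])))  # [0]
```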
Compute the AUC.
# predict class probabilities
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([6.27558532e-01, 8.44644302e-01, 9.99392889e-01, ..., 1.18427134e-01,
       9.99999999e-01, 6.29683404e-01])
# calculate the AUC using y_test and y_pred_prob
print(metrics.roc_auc_score(y_test, y_pred_prob))
0.8969479630593633
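`roc_auc_score` equals the probability that a randomly chosen positive example receives a higher predicted probability than a randomly chosen negative one. On a toy example this pairwise reading is easy to verify by hand:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Of the 4 (negative, positive) pairs, 3 are ranked correctly -> AUC = 0.75.
print(roc_auc_score(y_true, y_score))  # 0.75
```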
Plot the ROC curve.
# plot ROC curve using y_test and y_pred_prob
import matplotlib.pyplot as plt
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for financial news sentiment classification')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
Display the confusion matrix, compute the sensitivity and specificity, and comment on the results.
# print the confusion matrix
confusion = metrics.confusion_matrix(y_test, y_pred_class)
print(confusion)
[[109 44]
[ 34 305]]
# calculate sensitivity
print(metrics.recall_score(y_test, y_pred_class))
0.8997050147492626
# calculate specificity
TP = confusion[1][1]
TN = confusion[0][0]
FP = confusion[0][1]
FN = confusion[1][0]
print(TN / float(TN + FP))
0.7124183006535948
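Both rates can be read straight off the confusion matrix printed above, whose layout is [[TN, FP], [FN, TP]]:

```python
# Values taken from the confusion matrix above: [[TN, FP], [FN, TP]]
TN, FP, FN, TP = 109, 44, 34, 305

sensitivity = TP / (TP + FN)  # true positive rate, same as recall_score
specificity = TN / (TN + FP)  # true negative rate

print(round(sensitivity, 4))  # 0.8997
print(round(specificity, 4))  # 0.7124
```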
Comment on the model's sensitivity and specificity: the sensitivity (0.90) is clearly higher than the specificity (0.71). The model is good at catching genuinely positive headlines, but it misclassifies a sizable share of negative headlines as positive; in other words, it is prone to false positives.
Examine some of the misclassified headlines in the test set, i.e. the false positives and false negatives.
# first 10 false positives (meaning they were incorrectly classified as positive sentiment)
df1 = X_test[y_test < y_pred_class]
print(df1[0:10])
4186 EB announced in its stock exchange release on ...
4787 According to the company , in addition to norm...
4162 According to Laavainen , Raisio 's food market...
4283 This is bad news for the barbeque season .
4436 `` We can say that the number of deals has bec...
4055 Cerberus Capital Management LP-backed printing...
1707 Furthermore , sales of new passenger cars and ...
4696 Net sales of Kyro 's main business area , Glas...
1989 Danish company FLSmidth has acknowledged that ...
4028 Myllykoski , with one paper plant in Finland ,...
Name: Title, dtype: object
# first 10 false negatives (meaning they were incorrectly classified as negative sentiment)
df2 = X_test[y_test > y_pred_class]
print(df2[0:10])
107 In the fourth quarter of 2008 , net sales incr...
483 Compared with the FTSE 100 index , which rose ...
2060 SCANIA Morgan Stanley lifted the share target ...
2235 The members of the management team will contri...
2169 Operating loss was EUR 179mn , compared to a l...
689 Finnish Sampo Bank , of Danish Danske Bank gro...
708 Due to rapid expansion , the market share of T...
201 First quarter underlying operating profit rose...
215 In the fourth quarter of 2009 , Orion 's net p...
2291 Operating profit totaled EUR 17.7 mn compared ...
Name: Title, dtype: object
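The comparisons `y_test < y_pred_class` and `y_test > y_pred_class` work because the labels are 0/1: "less than" can only mean true 0 predicted 1 (a false positive), and "greater than" can only mean true 1 predicted 0 (a false negative). A toy sketch:

```python
import pandas as pd

y_true = pd.Series([0, 1, 0, 1, 1])
y_pred = pd.Series([1, 1, 0, 0, 1])

# true 0 < predicted 1  ->  false positive
print((y_true < y_pred).sum())  # 1
# true 1 > predicted 0  ->  false negative
print((y_true > y_pred).sum())  # 1
```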
Why were these headlines misclassified?
Many of the misclassified headlines are long and mix positive and negative wording. Year-on-year comparisons are a typical case: "Operating loss was EUR 179mn , compared to a l..." above is presumably labeled positive because the loss narrowed, yet the word "loss" pushes a bag-of-words model toward negative. With many words carrying conflicting signals, individual terms can dominate over the headline's overall meaning.
Use all of the news headlines for prediction, not just the negative and positive ones.
# define X and y using the original DataFrame; remember to map y to integers
df_financial_news["Sentiment"] = df_financial_news.Sentiment.map({"negative": 0, "neutral": 1, "positive": 2})
X_new = df_financial_news.Title
y_new = df_financial_news.Sentiment
# split into training and testing sets
X_new_train, X_new_test, y_new_train, y_new_test = train_test_split(X_new, y_new, random_state=1)
print(X_new_train.shape)
print(X_new_test.shape)
(3633,)
(1212,)
# create document-term matrices
X_new_train_dtm = vect.fit_transform(X_new_train)
X_new_test_dtm = vect.transform(X_new_test)
# fit a Naive Bayes model
nb.fit(X_new_train_dtm, y_new_train)
MultinomialNB()
# make class predictions
y_new_pred_class = nb.predict(X_new_test_dtm)
# calculate the testing accuracy
print(metrics.accuracy_score(y_new_test, y_new_pred_class))
0.7285478547854786
# print the confusion matrix
confusion_new = metrics.confusion_matrix(y_new_test, y_new_pred_class)
print(confusion_new)
[[ 77 44 22]
[ 19 611 65]
[ 18 161 195]]
Comment: with three classes instead of two, the accuracy of the Naive Bayes model drops (0.73 here versus 0.84 for the binary task). The confusion matrix shows that the neutral class is recognized well, while a large share of the positive headlines (161 of 374) are misclassified as neutral.
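The per-class performance can be read off the confusion matrix above (rows are the true class, columns the predicted class, in the order negative, neutral, positive):

```python
# Confusion matrix values from the output above.
confusion = [[77, 44, 22],
             [19, 611, 65],
             [18, 161, 195]]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(3))
print(round(correct / total, 4))  # 0.7285  (matches the accuracy above)

# Per-class recall = diagonal entry / row sum.
for i, name in enumerate(["negative", "neutral", "positive"]):
    recall = confusion[i][i] / sum(confusion[i])
    print(name, round(recall, 4))  # negative 0.5385, neutral 0.8791, positive 0.5214
```

The neutral class's high recall and the low recall for negative and positive explain most of the accuracy drop.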