```python
import graphlab  # import the package
products = graphlab.SFrame.read_csv('amazon_baby.csv', verbose=False)
# load the table
products.head(2)  # display the first two rows
```
| name | review | rating |
| --- | --- | --- |
| Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion … | 3 |
| Planetwise Wipe Pouch | it came early and was not disappointed. i love … | 5 |
We use the text_analytics toolkit to compute the word frequencies of each product review and store them as a new column.
```python
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
# word counts per review, stored as a new column: an SArray of dicts
products.head(2)  # display
```
| name | review | rating | word_count |
| --- | --- | --- | --- |
| Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion … | 3 | {'and': 5L, 'stink': 1L, 'because': 1L, 'order … |
| Planetwise Wipe Pouch | it came early and was not disappointed. i love … | 5 | {'and': 3L, 'love': 1L, 'it': 2L, 'highly': 1L, … |
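The idea behind `count_words` is simple: tokenize each review and tally occurrences into a dict. A minimal plain-Python sketch of the same idea (the sample text is made up; GraphLab's exact tokenization rules may differ):

```python
from collections import Counter

def count_words(text):
    # Lowercase, split on whitespace, and tally occurrences,
    # mirroring the per-row dicts that count_words produces.
    return dict(Counter(text.lower().split()))

counts = count_words("it came early and was not disappointed. i love it")
print(counts["it"])    # 2
print(counts["love"])  # 1
```

Note that a whitespace-only split keeps punctuation attached to words (e.g. `'disappointed.'`), which matches the dicts shown in the table above.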
Normally we would train a sentiment classifier on the word counts of all the words in the reviews. Here we follow the same recipe but use only a subset of those words: selected_words. In practice, ML practitioners often discard words they consider "unimportant" before training their models, and this pruning usually helps accuracy. Here, however, we discard everything except the handful of words below. Using so few words will hurt our accuracy, but it helps us understand what our classifier is doing.
```python
selected_words = ['awesome', 'great', 'fantastic',
                  'amazing', 'love', 'horrible', 'bad',
                  'terrible', 'awful', 'wow', 'hate']
# products.unpack('word_count', limit=selected_words, column_name_prefix="", na_value=0)
for word in selected_words:
    def single_word_count(sf):
        # look up this word's count in the row's word_count dict, defaulting to 0
        dic = sf['word_count']
        if word in dic:
            return dic[word]
        else:
            return 0
    # apply() runs inside the loop, so the closure sees the current word
    products[word] = products.apply(single_word_count)
```
```python
products.head(2)
```
| name | review | rating | word_count | awesome | great | fantastic |
| --- | --- | --- | --- | --- | --- | --- |
| Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion … | 3 | {'and': 5L, 'stink': 1L, 'because': 1L, 'order … | 0 | 0 | 0 |
| Planetwise Wipe Pouch | it came early and was not disappointed. i love … | 5 | {'and': 3L, 'love': 1L, 'it': 2L, 'highly': 1L, … | 0 | 0 | 0 |

(remaining columns:)

| amazing | love | horrible | bad | terrible | awful | wow | hate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
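The per-word extraction above can be reproduced without GraphLab: for each review's word-count dict, read out each selected word's count, defaulting to 0. A sketch over made-up rows standing in for the SFrame:

```python
selected_words = ['awesome', 'great', 'love', 'hate']  # subset, for brevity

rows = [  # hypothetical word_count dicts, one per review
    {'and': 5, 'stink': 1},
    {'and': 3, 'love': 1, 'it': 2},
]

for row in rows:
    for word in selected_words:
        # dict.get plays the role of the if/else in single_word_count
        row[word] = row.get(word, 0)

print(rows[0]['love'])  # 0
print(rows[1]['love'])  # 1
```

This is also what the commented-out `unpack` call does in one shot: it spreads dict keys into columns with `na_value=0` filling the gaps.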
Now sum each of the new columns. Because SArray summation is not optimized, this step is slow on a dataset this large; we could convert to pandas or NumPy with SFrame.to_dataframe() or SFrame.to_numpy() and process the data there instead. But to stay familiar with the SFrame data structure, I avoid bringing in anything extra.
```python
word_sums = {}
for word in selected_words:
    word_sums[word] = products[word].sum()  # total occurrences across all reviews
print word_sums
print max(word_sums, key=word_sums.get)  # most frequent selected word
print min(word_sums, key=word_sums.get)  # least frequent selected word
```
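The max/min trick above deserves a note: `key=word_sums.get` compares the words by their summed counts rather than alphabetically. A self-contained sketch with hypothetical totals (not the real sums from the dataset):

```python
word_sums = {'love': 42004, 'great': 45206, 'hate': 1220}  # hypothetical totals

# max/min iterate over the dict's keys; the key function maps each word
# to its count, so the comparison happens on counts, not on the strings
most_common = max(word_sums, key=word_sums.get)
least_common = min(word_sums, key=word_sums.get)
print(most_common)   # great
print(least_common)  # hate
```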
```python
products['sentiment'] = products['rating'] >= 4  # label: positive iff rating >= 4
train_data, test_data = products.random_split(.8, seed=0)
selected_words_model = graphlab.logistic_classifier.create(train_data,
                                                           target='sentiment',
                                                           features=selected_words,
                                                           validation_set=test_data)
# logistic regression classifier on the selected-word counts
```
```python
coe = selected_words_model['coefficients']
print coe
coe.show()
coe.sort('value').print_rows(num_rows=12, num_columns=4)  # 11 words plus the intercept
```
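A logistic classifier turns those coefficients into a probability by passing the linear score through the sigmoid. A sketch with made-up coefficients (not the values the model above actually learned):

```python
import math

def predict_proba(features, coefficients, intercept):
    # score = intercept + sum(w_i * x_i); probability = sigmoid(score)
    score = intercept + sum(coefficients[w] * x for w, x in features.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients: positive words push toward sentiment = 1
coefficients = {'love': 1.4, 'great': 0.9, 'hate': -1.4}
p = predict_proba({'love': 2, 'great': 1, 'hate': 0}, coefficients, intercept=1.0)
print(round(p, 3))  # 0.991
```

This is why the sorted coefficient table is informative: large positive values mark words that pull a review toward the positive class, large negative values toward the negative class.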
```python
selected_words_model.evaluate(test_data, metric='roc_curve')
# evaluate the model with the ROC curve as the metric; AUC is the area under it
selected_words_model.show(view='Evaluation')  # plot the ROC curve
```
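AUC, the area under the ROC curve, has a handy rank interpretation: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal pairwise sketch (toy scores, not this model's output):

```python
def auc(scores, labels):
    # Compare every (positive, negative) pair of scores; ties count half.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0 -- perfect ranking
print(auc([0.9, 0.2, 0.8, 0.3], [1, 1, 0, 0]))  # 0.5 -- no better than chance
```

Production implementations compute this from the ROC curve via the trapezoid rule, but the pairwise definition gives the same number.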
```python
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target='sentiment',
                                                      features=['word_count'],
                                                      validation_set=test_data)
# logistic regression on the full word-count dicts, for comparison
sentiment_model.evaluate(test_data, metric='roc_curve')  # ROC curve as the metric
sentiment_model.show(view='Evaluation')  # plot the ROC curve
```
```python
disper_champ_reviews = products[products['name'] == 'Baby Trend Diaper Champ']
disper_champ_reviews.head(2)
```
| name | review | rating | word_count | awesome | great | fantastic |
| --- | --- | --- | --- | --- | --- | --- |
| Baby Trend Diaper Champ | Ok - newsflash. Diapers are just smelly. We've … | 4 | {'just': 2L, 'less': 1L, '-': 3L, 'smell- … | 0 | 0 | 0 |
| Baby Trend Diaper Champ | This is a good product to start and very easy to … | 3 | {'and': 3L, 'because': 1L, 'old': 1L, 'use.': … | 0 | 0 | 0 |

(remaining columns:)

| amazing | love | horrible | bad | terrible | awful | wow | hate | sentiment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We use sentiment_model to predict the sentiment of every review in disper_champ_reviews and sort by the value of predicted_sentiment.
The so-called predicted_sentiment is the probability that a review expresses positive sentiment, i.e. belongs to class 1 (rating greater than or equal to 4).
```python
disper_champ_reviews['predicted_sentiment'] = sentiment_model.predict(
    disper_champ_reviews, output_type='probability')  # probability of positive sentiment
disper_champ_reviews = disper_champ_reviews.sort('predicted_sentiment',
                                                 ascending=False)  # most positive first
disper_champ_reviews.head(2)
```
| name | review | rating | word_count | awesome | great | fantastic |
| --- | --- | --- | --- | --- | --- | --- |
| Baby Trend Diaper Champ | Diaper Champ or Diaper Genie? That was my … | 5 | {'all': 1L, 'bags.': 1L, 'son,': 1L, '(i': 1L, … | 0 | 0 | 0 |
| Baby Trend Diaper Champ | Baby Luke can turn a clean diaper to a dirty … | 5 | {'all': 1L, 'less': 1L, "friend's": 1L, '(whi … | 0 | 0 | 0 |

(remaining columns:)

| amazing | love | horrible | bad | terrible | awful | wow | hate | sentiment | predicted_sentiment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.999999830319 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.999999781997 |
Now we use selected_words_model to predict on the top row above, the most positive review of the Baby Trend Diaper Champ. The result is 1, as if this were a 100% positive review. But in general a predicted probability only tends toward 1: it can get arbitrarily close, yet (with, say, a sigmoid producing the probability) it never actually reaches 1. This shows that a model built from only a few selected words really does lack precision.
```python
print selected_words_model.predict(disper_champ_reviews[0:1])
# [1L]  -- the predicted class for the top review is 1 (positive)
```
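The point about probabilities approaching but never reaching 1 is easy to check: the sigmoid is strictly below 1 for any finite score, although at large scores float64 rounding can make it display as exactly 1.0. A quick illustration (my own, not part of the original notebook):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))          # 0.5
print(sigmoid(10))         # 0.9999546... close to 1, but mathematically never 1
print(sigmoid(40) == 1.0)  # True: exp(-40) is below float64 machine epsilon
```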