
Python Machine Learning: Sentiment Analysis of Comments


I. Experiment Objectives

1. Classify the comments in the file into two categories, positive and negative, where negative comments include off-topic answers, omitted answers, and refusals to answer.

(In this article I label positive comments as 0 and negative comments as 1.)

II. Approach

1. Preprocess the raw data with the jieba library (word segmentation, stopword removal, and so on), and pick out rows whose comment is empty (NaN); these can be assigned a prediction of 1 (negative) directly.

2. Split a 1,500-row labeled sample into training and test sets, evaluate several models on it, and pick the method with the highest accuracy.

(I ultimately chose the naive Bayes method.)

3. Run the chosen method on 300,000 rows to check that the approach is feasible.

4. Process the larger 3,000,000-row dataset.

III. Background Knowledge

1. CSV file operations

(1) Creating and opening a file

import csv

with open(filepath, mode, encoding="utf8", newline="") as f:
    rows = csv.reader(f)  # rows is an iterator; each row comes back as a list of fields
    for row in rows:
        print(row)

A. mode: the file-open mode

r: read-only; raises an error if the file does not exist; the default mode (the file pointer starts at the beginning of the file)
r+: read and write; raises an error if the file does not exist (the file pointer starts at the beginning)
w: write mode; creates the file if it does not exist; the existing content is truncated when the file is opened, and you can keep writing as long as the file stays open (the content is only cleared at open time)
w+: write and read; creates the file if it does not exist; otherwise behaves like w (the content is cleared once, at open time)
a: append mode; creates the file if it does not exist; writes always go to the end; not readable.
a+: append plus read; the file pointer sits at the end of the file right after opening.
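A minimal sketch of the practical difference between w (truncate) and a (append); demo.txt is just a throwaway file name used for illustration:

# Hypothetical demo of 'w' vs 'a'; demo.txt is a placeholder file.
with open("demo.txt", "w", encoding="utf8") as f:
    f.write("first line\n")     # file now contains one line

with open("demo.txt", "a", encoding="utf8") as f:
    f.write("second line\n")    # appended after the existing content

with open("demo.txt", "w", encoding="utf8") as f:
    f.write("only line\n")      # reopening with 'w' wiped the previous two lines

with open("demo.txt", "r", encoding="utf8") as f:
    print(f.read())             # -> "only line"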

 

B. utf-8 vs. utf-8-sig

"utf-8" uses the byte as its code unit; the byte order is the same on every system, so there is no byte-order issue and no BOM is needed. As a result, when a BOM-prefixed file is read with the plain "utf-8" codec, the BOM is treated as part of the file content and shows up as stray characters at the start of the first field.

In "utf-8-sig", sig is short for signature, i.e. "utf-8 with a signature". When reading a BOM-prefixed utf-8 file, "utf-8-sig" handles the BOM separately and keeps it out of the text content, which is the behavior we want.
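A small sketch of the difference (bom.csv is a made-up example file):

# Hypothetical illustration: write a file with a BOM, then read it back two ways.
with open("bom.csv", "w", encoding="utf-8-sig") as f:
    f.write("code,reply\n1,hello\n")

with open("bom.csv", "r", encoding="utf-8") as f:
    print(repr(f.readline()))      # '\ufeffcode,reply\n'  <- the BOM leaks into the first field

with open("bom.csv", "r", encoding="utf-8-sig") as f:
    print(repr(f.readline()))      # 'code,reply\n'        <- the BOM is stripped, as expected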
 

(2) Writing a CSV file

headers = []
datas = []
with open(filepath, "w", encoding="utf8", newline="") as f:   # open in a write mode such as "w" or "a"
    writer = csv.writer(f)
    writer.writerow(headers)    # write the header row
    writer.writerows(datas)     # write all the data rows at once

2. Viewing and editing large files

I recommend the editor EmEditor. Excel can open at most about 1,048,576 rows, and anything larger tends to crash it, so for viewing or editing bigger CSV files EmEditor is the better choice.
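If you only need to peek at a huge CSV from Python rather than open it in a GUI, a hedged alternative is to read just a slice of it with pandas (data1.csv is the file split later in this article):

import pandas as pd

# Read only the first 1,000 rows instead of loading the whole file into memory.
preview = pd.read_csv("data1.csv", nrows=1000)
print(preview.head())
print(preview.columns.tolist())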

3. Using jieba, sklearn, and related libraries

These are explained in detail alongside the code.

IV. Experiment Code

import jieba
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from pandas import DataFrame
import openpyxl as op

# Training set
new_reply = pd.DataFrame(None, columns=["reply", "answer"])
reply = pd.read_excel("sum.xlsx").astype(str)
print("训练集读取成功!")

# New data
reply1 = pd.read_csv("result297.csv").astype(str)
new_reply1 = pd.DataFrame(None, columns=["reply", "answer"])
print("数据集读取成功")

# Load the stopword list
with open("stopwords.txt", "r", encoding="utf-8") as f:
    stops = f.readlines()
stopwords = [x.strip() for x in stops]

# Segment the training text with jieba
new_reply["reply"] = reply.回答.apply(lambda s: " ".join(jieba.cut(s)))
new_reply["answer"] = reply["answer(人工阅读)"]
new_reply["reply"] = new_reply.reply.apply(lambda s: s if str(s) != "nan" else np.nan)
df = new_reply["reply"]
df = df.to_frame(name="reply")
df_isnull_remark = df[df["reply"].isnull()]   # rows whose reply is empty
new_reply.dropna(subset=["reply"], inplace=True)
cut_list = new_reply["reply"].apply(lambda s: s.split(" "))
result = cut_list.apply(lambda s: [x for x in s if x not in stopwords])  # result is a Series

# Segment the new data with jieba
new_reply1["reply"] = reply1.回答.apply(lambda s: " ".join(jieba.cut(s)))
# new_reply1["answer"] = reply1["answer(人工)"]
new_reply1["reply"] = new_reply1.reply.apply(lambda s: s if str(s) != "nan" else np.nan)
cf = new_reply1["reply"]
cf = cf.to_frame(name="reply")
cf_isnull_remark = cf[cf["reply"].isnull()]
new_reply1.dropna(subset=["reply"], inplace=True)
cut_list1 = new_reply1["reply"].apply(lambda s: s.split(" "))
result1 = cut_list1.apply(lambda s: [x for x in s if x not in stopwords])  # result1 is a Series

# x_train, x_test, y_train, y_test = train_test_split(result, new_reply["answer"], test_size=0.25, random_state=3)
x_train = result
y_train = new_reply["answer"]
# test_index = x_test.index
test_index = result1.index
x_train = x_train.reset_index(drop=True)
# print(type(x_test))  # x_test is a Series

words = []
for i in range(len(x_train)):
    words.append(' '.join(x_train[i]))
result1 = result1.reset_index(drop=True)
test_words = []
for i in range(len(result1)):
    test_words.append(' '.join(result1[i]))
# print(test_words)

# vec = CountVectorizer(analyzer='word', max_features=4000, lowercase=False)
# vec.fit(words)
# classifer = MultinomialNB()
# classifer.fit(vec.transform(words), y_train)
# score = classifer.score(vec.transform(test_words), y_test)
# # a = classifer.predict_proba(vec.transform(test_words))
# print(score)

cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(words)
classifer = MultinomialNB()
classifer.fit(cv.transform(words), y_train)
# score = classifer.score(cv.transform(test_words), y_test)
a = classifer.predict_proba(cv.transform(test_words))
# print(score)

# tv = TfidfVectorizer(analyzer='word', max_features=4000, ngram_range=(1, 2), lowercase=False)
# tv.fit(words)
# classifer = MultinomialNB()
# classifer.fit(tv.transform(words), y_train)
# score = classifer.score(tv.transform(test_words), y_test)
# a = classifer.predict_proba(tv.transform(test_words))
# print(score)

# Turn the predicted probabilities into 0/1 labels and write them back (kept commented out here)
# b = []
# c = []
# for nums in a:
#     for num in nums:
#         num = int(num + 0.5)
#         b.append(num)
#     if b[0] == 1:
#         c.append(1)
#     if b[1] == 1:
#         c.append(0)
#     b = []
# print(cf_isnull_remark.index)
# tableAll = op.load_workbook('201401.xlsx')
# table1 = tableAll['Sheet1']
# for i in range(len(c)):
#     table1.cell(test_index[i] + 2, 12, c[i])
# for i in range(len(df_isnull_remark.index)):
#     table1.cell(df_isnull_remark.index[i] + 2, 12, 0)
# tableAll.save('201401.xlsx')
# judge = True
# p = 1
# with open("result297.csv", "r", encoding='utf-8', newline='') as f:
#     rows = csv.reader(f)
#     with open("help297.csv", 'w', encoding='utf-8', newline='') as file:
#         writer = csv.writer(file)
#         for row in rows:
#             if p == 1:
#                 row.append("result")
#                 p += 1
#             else:
#                 for o in range(len(c)):
#                     if p == test_index[o] + 2:
#                         row.append(c[o])
#                         break
#                 for u in range(len(cf_isnull_remark.index)):
#                     if p == cf_isnull_remark.index[u] + 2:
#                         row.append(1)
#                         break
#                 p += 1
#             writer.writerow(row)
print("ok")

V. Walkthrough of Selected Code

1. jieba word segmentation

# Segment the text with jieba
new_reply["reply"] = reply.回答.apply(lambda s: " ".join(jieba.cut(s)))
new_reply["answer"] = reply["answer(人工阅读)"]
cut_list = new_reply["reply"].apply(lambda s: s.split(" "))
# new_reply["reply"] holds the space-joined tokens; splitting on " " turns cut_list into a Series of token lists

The effect of jieba segmentation is illustrated by the short example below: it automatically splits the input string into word tokens.

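A minimal sketch of what jieba.cut produces (the sample sentence is invented for illustration):

import jieba

sentence = "公司未来将继续聚焦主营业务"   # made-up example sentence
tokens = " ".join(jieba.cut(sentence))
print(tokens)   # e.g. "公司 未来 将 继续 聚焦 主营业务" (the exact split depends on jieba's dictionary)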

2. Removing stopwords

Stopwords are words in the comments that have little influence on the classification result; standard stopword lists are easy to find online.

The removal works by walking through the token list of each comment and dropping every token that appears in the stopword list, until the comment contains no stopwords.


 

with open("stopwords.txt", "r", encoding="utf-8") as f:
    stops = f.readlines()
stopwords = [x.strip() for x in stops]
# strip() removes the trailing newline, spaces, and other meaningless whitespace
cut_list1 = new_reply1["reply"].apply(lambda s: s.split(" "))
result1 = cut_list1.apply(lambda s: [x for x in s if x not in stopwords])
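A tiny standalone illustration of the filtering step (the tokens and stopwords here are invented):

tokens = ["公司", "的", "业绩", "非常", "好"]     # invented token list
stopwords = {"的", "非常"}                        # invented stopword set
filtered = [t for t in tokens if t not in stopwords]
print(filtered)   # ['公司', '业绩', '好']

As a side note, storing the stopwords in a set rather than a list makes the membership test much faster when this filter runs over hundreds of thousands of comments.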

3. Splitting the training and test sets

x_train, x_test, y_train, y_test = train_test_split(result, new_reply["answer"], test_size=0.25, random_state=3)
# 75% of the data is used for training and 25% for testing

4. Training with machine-learning methods

# vec = CountVectorizer(analyzer='word', max_features=4000, lowercase=False)
# vec.fit(words)
# classifer = MultinomialNB()
# classifer.fit(vec.transform(words), y_train)
# score = classifer.score(vec.transform(test_words), y_test)
# # a = classifer.predict_proba(vec.transform(test_words))
# print(score)
cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(words)
classifer = MultinomialNB()
classifer.fit(cv.transform(words), y_train)
# score = classifer.score(cv.transform(test_words), y_test)
a = classifer.predict_proba(cv.transform(test_words))
# print(score)
# tv = TfidfVectorizer(analyzer='word', max_features=4000, ngram_range=(1, 2), lowercase=False)
# tv.fit(words)
# classifer = MultinomialNB()
# classifer.fit(tv.transform(words), y_train)
# score = classifer.score(tv.transform(test_words), y_test)
# a = classifer.predict_proba(tv.transform(test_words))
# print(score)

cv.fit is fed the space-joined tokens of the training set and learns the vocabulary; cv.transform then turns each comment into a vector of token counts.

classifer.predict_proba returns the predicted probability of each class for every comment, which is then rounded into a 0/1 label. Measured on the held-out test set, the naive Bayes classifier reaches about 91% accuracy, which is fairly satisfactory.
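As a hedged sketch of how the probabilities relate to hard labels and to the accuracy figure (classifer, cv, and test_words are the variables from the code above; y_test comes from the split in section 3):

# predict_proba gives one probability per class; classes_ tells you which column is which label.
probs = classifer.predict_proba(cv.transform(test_words))
print(classifer.classes_)        # column order of predict_proba

# predict() picks the most probable class directly, giving hard 0/1 predictions.
labels = classifer.predict(cv.transform(test_words))

# When a held-out split (test_words / y_test) is available, accuracy is simply:
# acc = classifer.score(cv.transform(test_words), y_test)
# print(acc)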


 

 

5. Handling the large CSV file in practice

When processing the 3,000,000-row file, loading it all into Python made my machine grind to a halt, so I split the data into small CSV files of 10,000 rows each, processed each small file separately, and finally merged all the result files back together.

A. Split into roughly 300 small files =========>>

import csv
import os

# Split the source CSV into chunks of 10,000 rows each
example_path = 'data1.csv'        # path of the file to split
example_result_dir = 'chaifen'    # directory for the split files
with open(example_path, 'r', newline='', encoding='utf-8') as example_file:
    example = csv.reader(example_file)
    i = j = 1
    for row in example:
        # print(row)
        # print(f'i 等于 {i}, j 等于 {j}')
        # every 10,000 rows, bump j so that a new output file name is used
        if i % 10000 == 0:
            print(f'第{j}个文件生成完成')
            j += 1
        example_result_path = example_result_dir + '\\result' + str(j) + '.csv'
        # print(example_result_path)
        # if the output file does not exist yet, create it and write the header row first
        if not os.path.exists(example_result_path):
            with open(example_result_path, 'w', newline='', encoding='utf-8-sig') as file:
                csvwriter = csv.writer(file)
                csvwriter.writerow(['code', '提问', '回答', '回答时间', 'Question', 'NegAnswer_Reg', '查找方式', 'year', 'DueOccup', 'Gender', 'SOEearly', 'Exchange', 'NegAnswer_Lasso', 'Length', 'Salary', 'TotleQuestion', 'Analyst', 'NegTone', 'AccTerm', 'Readability', 'ShareHold', 'InstOwn', 'SOE', '行业名称', '行业代码', 'industry', 'quarter'])
                csvwriter.writerow(row)
            i += 1
        # otherwise append the row to the existing file
        else:
            with open(example_result_path, 'a', newline='', encoding='utf-8') as file:
                csvwriter = csv.writer(file)
                csvwriter.writerow(row)
            i += 1

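As an aside, a hedged alternative to the manual splitting is pandas' built-in chunked reading; this sketch assumes the same data1.csv and output directory as above:

import os
import pandas as pd

os.makedirs('chaifen', exist_ok=True)
# read_csv with chunksize returns an iterator of DataFrames, 10,000 rows at a time
for k, chunk in enumerate(pd.read_csv('data1.csv', chunksize=10000), start=1):
    chunk.to_csv(f'chaifen/result{k}.csv', index=False, encoding='utf-8-sig')
    print(f'第{k}个文件生成完成')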

B. Process each file separately =========>>

import jieba
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from pandas import DataFrame
import openpyxl as op
import threading
from concurrent.futures import ThreadPoolExecutor
import time
import csv

# Training set
new_reply = pd.DataFrame(None, columns=["reply", "answer"])
reply = pd.read_excel("sum.xlsx").astype(str)
print("训练集读取成功!")
# print(reply.head())

# Load the stopword list
with open("stopwords.txt", "r", encoding="utf-8") as f:
    stops = f.readlines()
stopwords = [x.strip() for x in stops]

# Segment the training text with jieba
new_reply["reply"] = reply.回答.apply(lambda s: " ".join(jieba.cut(s)))
new_reply["answer"] = reply["answer(人工阅读)"]
new_reply["reply"] = new_reply.reply.apply(lambda s: s if str(s) != "nan" else np.nan)
df = new_reply["reply"]
df = df.to_frame(name="reply")
df_isnull_remark = df[df["reply"].isnull()]
new_reply.dropna(subset=["reply"], inplace=True)
cut_list = new_reply["reply"].apply(lambda s: s.split(" "))
result = cut_list.apply(lambda s: [x for x in s if x not in stopwords])  # result is a Series
x_train = result
y_train = new_reply["answer"]
# print(new_reply.head())

time1 = time.time()
x_train = x_train.reset_index(drop=True)
words = []
for i in range(len(x_train)):
    words.append(' '.join(x_train[i]))

# Fit the vectorizer and the naive Bayes classifier once on the training set
cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(words)
classifer = MultinomialNB()
classifer.fit(cv.transform(words), y_train)
i = 0
# print(words)

# New data: process the split files one by one
for i in range(296):
    t = time.time()
    path = './chaifen/result' + str(i + 1) + '.csv'
    reply1 = pd.read_csv(path).astype(str)
    new_reply1 = pd.DataFrame(None, columns=["reply", "answer"])
    print(f"第{i+1}个数据集读取成功")
    # segment, drop empty replies, and remove stopwords, exactly as for the training set
    new_reply1["reply"] = reply1.回答.apply(lambda s: " ".join(jieba.cut(s)))
    new_reply1["reply"] = new_reply1.reply.apply(lambda s: s if str(s) != "nan" else np.nan)
    cf = new_reply1["reply"]
    cf = cf.to_frame(name="reply")
    cf_isnull_remark = cf[cf["reply"].isnull()]
    new_reply1.dropna(subset=["reply"], inplace=True)
    cut_list1 = new_reply1["reply"].apply(lambda s: s.split(" "))
    result1 = cut_list1.apply(lambda s: [x for x in s if x not in stopwords])  # result1 is a Series
    test_index = result1.index
    # print(result1)
    result1 = result1.reset_index(drop=True)
    test_words = []
    for j in range(len(result1)):
        test_words.append(' '.join(result1[j]))
    # print(test_words[0])
    a = classifer.predict_proba(cv.transform(test_words))
    # print(a)
    # round the per-class probabilities into 0/1 labels
    b = []
    c = []
    for nums in a:
        for num in nums:
            num = int(num + 0.5)
            b.append(num)
        if b[0] == 1:
            c.append(1)
        if b[1] == 1:
            c.append(0)
        b = []
    print(len(c))
    # write the predictions back as a new "result" column
    p = 1
    with open(path, "r", encoding='utf-8', newline='') as f:
        rows = csv.reader(f)
        with open(f"./chaifen/help{i+1}.csv", 'w', encoding='utf-8', newline='') as file:
            writer = csv.writer(file)
            for row in rows:
                if p == 1:
                    row.append("result")
                else:
                    for o in range(len(c)):
                        if p == test_index[o] + 2:
                            row.append(c[o])
                            break
                    for u in range(len(cf_isnull_remark.index)):
                        if p == cf_isnull_remark.index[u] + 2:
                            row.append(1)
                            break
                writer.writerow(row)
                p += 1
    print(f"第{i+1}个数据集处理成功,用时{time.time() - t}")

print(f"总计用时{time.time() - time1}")
print("all is over!")


It took a whole night, but the final results finally came out.

(Perhaps parallelizing the work could speed this up; something to learn next. A rough sketch follows below.)
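As a rough, untested sketch of that idea: since jieba segmentation is CPU-bound, Python's GIL means threads help little, so a process pool is the more promising route. The helper below assumes the per-file logic from section B has been wrapped into a function process_one_file(i), a name invented here for illustration:

# Hypothetical parallel driver; process_one_file(i) is assumed to contain the body of the
# for-loop from section B (read ./chaifen/result{i+1}.csv, predict, write ./chaifen/help{i+1}.csv).
# The fitted cv / classifer would need to be created inside each worker or passed to it.
from concurrent.futures import ProcessPoolExecutor

def process_one_file(i):
    ...  # per-file segmentation, prediction, and writing, as in section B

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(process_one_file, range(296)))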

C. Merge =========>>

import csv
import time

start = time.time()
for i in range(297):
    time1 = time.time()
    with open(f"./chaifen/help{i+1}.csv", "r", newline='', encoding="utf8") as af:
        rows = csv.reader(af)
        with open("./chaifen/a.csv", "a+", encoding="utf8", newline='') as bf:
            writer = csv.writer(bf)
            for k, row in enumerate(rows):
                # only keep the header row from the first file, so it is not repeated in the merged output
                if k == 0 and i > 0:
                    continue
                writer.writerow(row)
    print(f"第{i+1}个数据集已加入,用时{time.time()-time1}")
print(f"all is over,用时{time.time()-start}")

Finished, time to celebrate!

 

 
