
Sentiment Analysis of Weibo Comments

#Text processing: sentiment analysis, text similarity, text classification (TF-IDF, inverse document frequency)
#NLP pipeline: raw string -> vectorization -> Naive Bayes training -> testing
#Text similarity: word frequency
#Text classification: TF-IDF (term frequency-inverse document frequency)
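To make the TF-IDF idea concrete before the full pipeline, here is a toy example (not from the original post) of what a TfidfVectorizer computes on a three-document corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['the cat sat', 'the dog sat', 'the cat ran']
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)        # sparse (3 docs x vocabulary) matrix
print(vec.get_feature_names_out())   # learned vocabulary (get_feature_names() before scikit-learn 1.0)
print(X.toarray().round(3))          # rows are L2-normalised TF-IDF weights

Words that appear in every document (here 'the' and 'sat') get low weights; words specific to one document get high weights.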

#1. Raw text
#2. Word segmentation
#3. Word-form normalization
#4. Stopword removal
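As a quick illustration of these four steps on a single sentence, here is a minimal sketch; the three-word stopword set is a hypothetical stand-in for 中文停用词库.txt, and step 3 is effectively a no-op for Chinese, which has no inflection to normalize:

import re
import jieba

stopwords = {'的', '了', '是'}  # hypothetical tiny stopword set, stand-in for the real list

raw = '今天的天气真是太好了!'
# Step 1: keep only Chinese characters
chinese_only = re.sub(r'[^\u4e00-\u9fd5]+', '', raw)
# Step 2: word segmentation with jieba
words = jieba.lcut(chinese_only)
# Step 4: remove stopwords (step 3, word-form normalization, is skipped for Chinese)
words = [w for w in words if w not in stopwords]
print(' '.join(words))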

import os, re
import numpy as np
import pandas as pd
import jieba.posseg as pseg
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Reference: https://blog.csdn.net/mpk_no1/article/details/71698725
dataset_path = './dataset'
text_filenames = ['0_simplifyweibo.txt', '1_simplifyweibo.txt',
                  '2_simplifyweibo.txt', '3_simplifyweibo.txt']
# CSV file for the raw data
output_text_filename = 'raw_weibo_text.csv'
# CSV file for the cleaned text data
output_cln_text_filename = 'clean_weibo_text.csv'
stopwords1 = [line.rstrip() for line in open('./中文停用词库.txt', 'r', encoding='utf-8')]
stopwords = stopwords1

# Raw data preparation (run once, then leave commented out):
'''
text_w_label_df_lst = []
for text_filename in text_filenames:
    text_file = os.path.join(dataset_path, text_filename)
    # The label is the leading digit of the filename: 0, 1, 2 or 3
    label = int(text_filename[0])
    # Read the text file
    with open(text_file, 'r', encoding='utf-8') as f:
        lines = f.read().splitlines()
    labels = [label] * len(lines)
    text_series = pd.Series(lines)
    label_series = pd.Series(labels)
    # Build one dataframe per file
    text_w_label_df = pd.concat([label_series, text_series], axis=1)
    text_w_label_df_lst.append(text_w_label_df)
result_df = pd.concat(text_w_label_df_lst, axis=0)
# Save as a CSV file
result_df.columns = ['label', 'text']
result_df.to_csv(os.path.join(dataset_path, output_text_filename),
                 index=False, encoding='utf-8')
'''

# 1. Read, clean, and prepare the data (run once, then leave commented out):
'''
# Read the prepared CSV file and build the dataset
text_df = pd.read_csv(os.path.join(dataset_path, output_text_filename), encoding='utf-8')
print(text_df)

def proc_text(raw_line):
    """
    Process one line of text.
    Returns the segmented words joined by spaces.
    """
    # 1. Remove non-Chinese characters with a regular expression
    filter_pattern = re.compile('[^\u4E00-\u9FD5]+')
    chinese_only = filter_pattern.sub('', raw_line)
    # 2. Segment with jieba and tag parts of speech
    words_lst = pseg.cut(chinese_only)
    # 3. Remove stopwords
    meaningful_words = []
    for word, flag in words_lst:
        # if (word not in stopwords) and (flag == 'v'):
        # the POS tag could also be used, e.g. to keep only verbs
        if word not in stopwords:
            meaningful_words.append(word)
    return ' '.join(meaningful_words)

# Process the text data
text_df['text'] = text_df['text'].apply(proc_text)
# Filter out rows that became empty strings
text_df = text_df[text_df['text'] != '']
# Save the cleaned text data
text_df.to_csv(os.path.join(dataset_path, output_cln_text_filename),
               index=False, encoding='utf-8')
print('Done; results saved.')
'''

# 2. Split into training and test sets
# The labels correspond to different emotions:
# 0: joy
# 1: anger
# 2: disgust
# 3: sadness
clean_text_df = pd.read_csv(os.path.join(dataset_path, output_cln_text_filename), encoding='utf-8')
x_train, x_test, y_train, y_test = train_test_split(
    clean_text_df['text'].values, clean_text_df['label'].values, test_size=0.25)

# 3. Feature extraction: TF-IDF weights
tf = TfidfVectorizer()
# Fit the vocabulary on the training set and weight each document's terms
x_train = tf.fit_transform(x_train)
print(tf.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0
x_test = tf.transform(x_test)

# 4. Train a Naive Bayes model
mlt = MultinomialNB(alpha=1.0)
# print(x_train.toarray())
mlt.fit(x_train, y_train)
y_predict = mlt.predict(x_test)
print('Predicted classes:', y_predict)

# 5. Evaluate: accuracy and recall are the standard criteria for a
# classification model (higher is better)
print('Prediction accuracy:', mlt.score(x_test, y_test))
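The comment in step 5 mentions both accuracy and recall, but mlt.score() reports accuracy only. A sketch of a fuller per-class evaluation with sklearn.metrics, assuming the mlt, x_test, and y_test objects from above and that all four labels appear in the test split:

from sklearn.metrics import classification_report, confusion_matrix

y_predict = mlt.predict(x_test)
# Per-class precision, recall, and F1; the English names map labels 0-3
print(classification_report(y_test, y_predict,
                            target_names=['joy', 'anger', 'disgust', 'sadness']))
# Rows are true labels, columns are predicted labels
print(confusion_matrix(y_test, y_predict))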

Word2vec maps words to vectors in a high-dimensional space, and these vectors capture contextual relationships between words. One refinement of the pipeline above is to first train word2vec and use the resulting word vectors as feature weights, then filter out the valuable features with two feature-selection methods: one based on a sentiment lexicon and one based on part-of-speech tags.
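The paragraph above stays at the level of prose; a minimal sketch of the word2vec step with gensim (an assumption here, since the original post shows no code for it) could look like this, averaging word vectors into a document feature:

import numpy as np
from gensim.models import Word2Vec

# `sentences` would be the tokenised weibo texts, e.g. clean_text_df['text'].str.split();
# the two toy sentences below are placeholders
sentences = [['今天', '天气', '好'], ['心情', '糟糕', '难过']]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

def doc_vector(tokens, model):
    """Average the vectors of in-vocabulary tokens into one document feature."""
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(doc_vector(['今天', '心情', '好'], model).shape)  # (100,)

The resulting document vectors could then stand in for the TF-IDF features above, with the lexicon- and POS-based feature selection applied on top.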
