
AI-NLP-3. Word2Vec Hands-On Case Study


Contents

Installing notebooks

1. Text sentiment analysis: English && Chinese
Data Set
File descriptions
Data fields
Approach 1: bag_of_words_model
Load the training data with pandas
Preprocess the review data
Add the cleaned data to the dataframe
Extract bag-of-words features (TF, with sklearn's CountVectorizer)
Train a classifier
Delete unused variables to free memory
Load the test data and predict
Approach 2: Word2Vec
Load the unlabeled data
Preprocess the data as in the first notebook
Train a word-embedding model with gensim
Inspect the trained word vectors
Approach 2 (continued): using the trained Word2Vec model
Same setup as before
Load the previously trained Word2Vec model
Encode the review text using the word2vec results
Build a classifier with a random forest
Clean up variables to free memory
Predict on the test set and submit to Kaggle

2. Chinese application: chinese-sentiment-analysis



Installing notebooks

The .ipynb extension is, as the name suggests, an IPython notebook file.

    C:\Users\Administrator>python -m pip install jupyter notebook
    Collecting jupyter
      Downloading https://files.pythonhosted.org/packages/83/df/0f5dd132200728a86190397e1ea87cd76244e42d39ec5e88efd25b2abd7e/jupyter-1.0.0-py2.py3-none-any.whl
    Collecting notebook
      Downloading https://files.pythonhosted.org/packages/5e/7c/7fd8e9584779d65dfcad9fa2e09c76131a41f999f853a9c7026ed8585586/notebook-5.6.0-py2.py3-none-any.whl (8.9MB)
        100% |████████████████████████████████| 8.9MB 227kB/s

After installation, typing jupyter notebook at the cmd prompt opens a page in the browser; first upload the .ipynb file there.

    C:\Users\Administrator>jupyter notebook
    [I 16:47:35.666 NotebookApp] Writing notebook server cookie secret to C:\Users\Administrator\AppData\Roaming\jupyter\runtime\notebook_cookie_secret

Then click the uploaded .ipynb file and press the Run button; after it runs, the output appears in the lower part of the page.

Selecting a cell turns its border blue, which means it is selected (command mode).

Clicking into the code turns the border green, which means the cell is in edit mode.

With a cell selected (blue border), pressing the lowercase letter l toggles line numbers.

Inside a cell you can press Tab for auto-completion, which is extremely handy.

 

1. Text sentiment analysis: English && Chinese

First example: https://www.kaggle.com/c/word2vec-nlp-tutorial/data
("Bag of Words Meets Bags of Popcorn")

This example covers:
1. Basic text preprocessing (HTML parsing, text extraction, regular expressions, etc.)
2. word2vec word-vector encoding and machine-learning models for sentiment analysis


Data

Data Set

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.

File descriptions

  • labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.  
  • testData - The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment for each one. 
  • unlabeledTrainData - An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review. 
  • sampleSubmission - A comma-delimited sample submission file in the correct format.

Data fields

  • id - Unique ID of each review
  • sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
  • review - Text of the review

 

Approach 1: bag_of_words_model

    import os
    import re
    import numpy as np
    import pandas as pd
    from bs4 import BeautifulSoup
    import warnings  # silence sklearn warnings
    warnings.filterwarnings(action='ignore', category=UserWarning, module='sklearn')
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix

pandas reads the TSV into a spreadsheet-like DataFrame, BeautifulSoup parses the HTML in the reviews, and sklearn is used to extract text features.

    import nltk
    # nltk.download()
    from nltk.corpus import stopwords

Load the training data with pandas

    datafile = os.path.join('E:/AI/NLP/NLTK/Python/3/', 'data', 'labeledTrainData.tsv')
    df = pd.read_csv(datafile, sep='\t', escapechar='\\')
    print('Number of reviews: {}'.format(len(df)))
    df.head()

    Number of reviews: 25000

Out[19]:

           id  sentiment                                             review
    0  5814_8          1  With all this stuff going down at the moment w...
    1  2381_9          1  "The Classic War of the Worlds" by Timothy Hin...

A quick look at the first review:

df['review'][0]
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."

Preprocess the review data; roughly, the steps are:

1. Strip the HTML tags
2. Remove punctuation
3. Split into words/tokens
4. Remove stopwords
5. Re-join the words into a new string

A small helper to display intermediate results:

    def display(text, title):
        print(title)
        print("\n----------separator-------------\n")
        print(text)

Next, display the second review:

    raw_example = df['review'][1]
    display(raw_example, 'Raw data')

    Raw data

    ----------separator-------------

    "The Classic War of the Worlds" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur "critics" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the "critics". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells' classic novel, and we found it to be very entertaining. This made it easy to overlook what the "critics" perceive to be its shortcomings.
Remove the HTML tags:

    example = BeautifulSoup(raw_example, 'html.parser').get_text()
    display(example, 'Data with HTML tags removed')

    Data with HTML tags removed

    ----------separator-------------

    "The Classic War of the Worlds" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur "critics" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the "critics". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells' classic novel, and we found it to be very entertaining. This made it easy to overlook what the "critics" perceive to be its shortcomings.
Remove punctuation:

    example_letters = re.sub(r'[^a-zA-Z]', ' ', example)
    display(example_letters, 'Data with punctuation removed')

    Data with punctuation removed

    ----------separator-------------

    The Classic War of the Worlds by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H G Wells classic book Mr Hines succeeds in doing so I and those who watched his film with me appreciated the fact that it was not the standard predictable Hollywood fare that comes out every year e g the Spielberg version with Tom Cruise that had only the slightest resemblance to the book Obviously everyone looks for different things in a movie Those who envision themselves as amateur critics look only to criticize everything they can Others rate a movie on more important bases like being entertained which is why most people never agree with the critics We enjoyed the effort Mr Hines put into being faithful to H G Wells classic novel and we found it to be very entertaining This made it easy to overlook what the critics perceive to be its shortcomings
Plain word list:

    words = example_letters.lower().split()  # lowercase everything and split on whitespace
    display(words, 'Word list data')

    Word list data

    ----------separator-------------

    ['the', 'classic', 'war', 'of', 'the', 'worlds', 'by', 'timothy', 'hines', 'is', 'a', 'very', 'entertaining', 'film', 'that', 'obviously', 'goes', 'to', 'great', 'effort', 'and', 'lengths', 'to', 'faithfully', 'recreate', 'h', 'g', 'wells', 'classic', 'book', 'mr', 'hines', 'succeeds', 'in', 'doing', 'so', 'i', 'and', 'those', 'who', 'watched', 'his', 'film', 'with', 'me', 'appreciated', 'the', 'fact', 'that', 'it', 'was', 'not', 'the', 'standard', 'predictable', 'hollywood', 'fare', 'that', 'comes', 'out', 'every', 'year', 'e', 'g', 'the', 'spielberg', 'version', 'with', 'tom', 'cruise', 'that', 'had', 'only', 'the', 'slightest', 'resemblance', 'to', 'the', 'book', 'obviously', 'everyone', 'looks', 'for', 'different', 'things', 'in', 'a', 'movie', 'those', 'who', 'envision', 'themselves', 'as', 'amateur', 'critics', 'look', 'only', 'to', 'criticize', 'everything', 'they', 'can', 'others', 'rate', 'a', 'movie', 'on', 'more', 'important', 'bases', 'like', 'being', 'entertained', 'which', 'is', 'why', 'most', 'people', 'never', 'agree', 'with', 'the', 'critics', 'we', 'enjoyed', 'the', 'effort', 'mr', 'hines', 'put', 'into', 'being', 'faithful', 'to', 'h', 'g', 'wells', 'classic', 'novel', 'and', 'we', 'found', 'it', 'to', 'be', 'very', 'entertaining', 'this', 'made', 'it', 'easy', 'to', 'overlook', 'what', 'the', 'critics', 'perceive', 'to', 'be', 'its', 'shortcomings']
Remove stopwords:

    from nltk.corpus import stopwords
    words_nostop = [w for w in words if w not in stopwords.words('english')]
    display(words_nostop, 'Data with stopwords removed')

    Data with stopwords removed

    ----------separator-------------

    ['classic', 'war', 'worlds', 'timothy', 'hines', 'entertaining', 'film', 'obviously', 'goes', 'great', 'effort', 'lengths', 'faithfully', 'recreate', 'h', 'g', 'wells', 'classic', 'book', 'mr', 'hines', 'succeeds', 'watched', 'film', 'appreciated', 'fact', 'standard', 'predictable', 'hollywood', 'fare', 'comes', 'every', 'year', 'e', 'g', 'spielberg', 'version', 'tom', 'cruise', 'slightest', 'resemblance', 'book', 'obviously', 'everyone', 'looks', 'different', 'things', 'movie', 'envision', 'amateur', 'critics', 'look', 'criticize', 'everything', 'others', 'rate', 'movie', 'important', 'bases', 'like', 'entertained', 'people', 'never', 'agree', 'critics', 'enjoyed', 'effort', 'mr', 'hines', 'put', 'faithful', 'h', 'g', 'wells', 'classic', 'novel', 'found', 'entertaining', 'made', 'easy', 'overlook', 'critics', 'perceive', 'shortcomings']

All of the steps above can be combined into a single cleaning function:

    eng_stopwords = set(stopwords.words('english'))

    def clean_text(text):
        text = BeautifulSoup(text, 'html.parser').get_text()
        text = re.sub(r'[^a-zA-Z]', ' ', text)
        words = text.lower().split()
        words = [w for w in words if w not in eng_stopwords]
        return ' '.join(words)

    display(clean_text(raw_example), 'Cleaned data')

    Cleaned data

    ----------separator-------------

    classic war worlds timothy hines entertaining film obviously goes great effort lengths faithfully recreate h g wells classic book mr hines succeeds watched film appreciated fact standard predictable hollywood fare comes every year e g spielberg version tom cruise slightest resemblance book obviously everyone looks different things movie envision amateur critics look criticize everything others rate movie important bases like entertained people never agree critics enjoyed effort mr hines put faithful h g wells classic novel found entertaining made easy overlook critics perceive shortcomings

Add the cleaned data to the dataframe

Apply the cleaning function to every review:

    df['clean_review'] = df.review.apply(clean_text)
    print(df.head())

           id  ...                                       clean_review
    0  5814_8  ...  stuff going moment mj started listening music ...
    1  2381_9  ...  classic war worlds timothy hines entertaining ...
    2  7759_3  ...  film starts manager nicholas bell giving welco...
    3  3630_4  ...  must assumed praised film greatest filmed oper...
    4  9495_8  ...  superbly trashy wondrously unpretentious explo...

Extract bag-of-words features (TF, with sklearn's CountVectorizer)

    vectorizer = CountVectorizer(max_features=5000)  # keep only the 5000 terms with the highest term frequency (tf) as the vocabulary
    train_data_features = vectorizer.fit_transform(df.clean_review).toarray()  # fit_transform builds the document-term matrix; toarray converts it to a dense array
    print(train_data_features.shape)

    (25000, 5000)

That is, 25,000 rows (one per review), each with 5,000 features.

    train_data_features

    array([[0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           ...,
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
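If you want to see which 5,000 terms actually made it into the vocabulary, you can inspect the fitted vectorizer directly. A minimal sketch, not part of the original notebook; the exact accessor depends on your sklearn version (older releases use get_feature_names(), newer ones get_feature_names_out()):

    import numpy as np

    # fitted vocabulary: term -> column index
    vocab = vectorizer.vocabulary_
    print(len(vocab))                      # 5000

    # terms in column order (use vectorizer.get_feature_names() on older sklearn)
    terms = vectorizer.get_feature_names_out()
    print(terms[:10])

    # the ten most frequent terms across the training matrix
    counts = train_data_features.sum(axis=0)
    top = np.argsort(counts)[::-1][:10]
    print([terms[i] for i in top])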

Train a classifier

    forest = RandomForestClassifier(n_estimators=100)  # random forest classifier
    forest = forest.fit(train_data_features, df.sentiment)  # fit on the training data

Run predict on the training set to see how it does (this is in-sample, so the numbers are optimistic):

    print(confusion_matrix(df.sentiment, forest.predict(train_data_features)))

    array([[12500,     0],
           [    0, 12500]], dtype=int64)
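The perfect confusion matrix above mostly reflects the forest memorizing its own training data. For a fairer estimate you could cross-validate; a hedged sketch using sklearn's cross_val_score (not part of the original notebook):

    from sklearn.model_selection import cross_val_score

    # 5-fold cross-validation on the bag-of-words features; the mean accuracy
    # will typically be noticeably lower than the in-sample result above
    scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                             train_data_features, df.sentiment, cv=5)
    print(scores.mean())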

Delete unused variables to free memory

    del df
    del train_data_features

Load the test data and predict

    datafile = os.path.join('..', 'data', 'testData.tsv')
    df = pd.read_csv(datafile, sep='\t', escapechar='\\')
    print('Number of reviews: {}'.format(len(df)))
    df['clean_review'] = df.review.apply(clean_text)
    df.head()
    test_data_features = vectorizer.transform(df.clean_review).toarray()
    test_data_features.shape
    result = forest.predict(test_data_features)  # predicted labels
    output = pd.DataFrame({'id': df.id, 'sentiment': result})
    print(output.head())
    output.to_csv(os.path.join('..', 'data', 'Bag_of_Words_model.csv'), index=False)
    del df
    del test_data_features

    Number of reviews: 25000

Out[84]:

             id                                             review                                       clean_review
    0  12311_10  Naturally in a film who's main themes are of m...  naturally film main themes mortality nostalgia...
    1    8348_2  This movie is a disaster within a disaster fil...  movie disaster within disaster film full great...
    2    5828_4  All in all, this is a movie for kids. We saw i...  movie kids saw tonight child loved one point k...

The resulting Bag_of_Words_model.csv has one row per test review, with an id column and the predicted sentiment column.
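As a quick sanity check of the submission format (a sketch, not part of the original notebook), the file can be read back to confirm it has the two expected columns and 25,000 rows:

    import os
    import pandas as pd

    sub = pd.read_csv(os.path.join('..', 'data', 'Bag_of_Words_model.csv'))
    print(sub.columns.tolist())   # expect ['id', 'sentiment']
    print(len(sub))               # expect 25000
    print(sub.head())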

 

Approach 2: Word2Vec

    import os
    import re
    import numpy as np
    import pandas as pd
    import warnings
    warnings.filterwarnings(action='ignore', category=UserWarning, module="gensim")
    from bs4 import BeautifulSoup
    from gensim.models.word2vec import Word2Vec

Define a helper function that loads the data files:

    def load_dataset(name, nrows=None):
        datasets = {
            'unlabeled_train': 'unlabeledTrainData.tsv',
            'labeled_train': 'labeledTrainData.tsv',
            'test': 'testData.tsv'
        }
        if name not in datasets:
            raise ValueError(name)
        data_file = os.path.join('..', 'data', datasets[name])
        df = pd.read_csv(data_file, sep='\t', escapechar='\\', nrows=nrows)
        print('Number of reviews: {}'.format(len(df)))
        return df

Load the unlabeled data

This data is used to train the word2vec word vectors.

    df = load_dataset('unlabeled_train')
    print(df.head())

    Number of reviews: 50000
            id                                             review
    0   9999_0  Watching Time Chasers, it obvious that it was ...
    1  45057_0  I saw this film about 20 years ago and remembe...
    2  15561_0  Minor Spoilers<br /><br />In New York, Joan Ba...
    3   7161_0  I went to see this film with a great deal of e...
    4  43971_0  Yes, I agree with everyone on this site this m...

Preprocess the data as in the first notebook

The only small difference is that we keep an option: stopwords can either be removed or kept.

    import nltk
    from nltk.corpus import stopwords

    eng_stopwords = set(stopwords.words('english'))

    def clean_text(text, remove_stopwords=False):
        text = BeautifulSoup(text, 'html.parser').get_text()
        text = re.sub(r'[^a-zA-Z]', ' ', text)
        words = text.lower().split()
        if remove_stopwords:
            words = [w for w in words if w not in eng_stopwords]
        return words

    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    def print_call_counts(f):
        n = 0
        def wrapped(*args, **kwargs):
            nonlocal n
            n += 1
            if n % 1000 == 1:
                print('method {} called {} times'.format(f.__name__, n))
            return f(*args, **kwargs)
        return wrapped

    # When the interpreter sees the @ decorator, it passes the function defined on the
    # next line to print_call_counts and rebinds the name to the returned wrapper.
    @print_call_counts
    def split_sentences(review):
        raw_sentences = tokenizer.tokenize(review.strip())  # strip() trims surrounding whitespace; tokenize() splits the review into sentences
        sentences = [clean_text(s) for s in raw_sentences if s]
        return sentences

    sentences = sum(df.review.apply(split_sentences), [])

Output:

    ................
    method split_sentences called 46001 times
    method split_sentences called 47001 times
    method split_sentences called 48001 times
    method split_sentences called 49001 times

 

Train a word-embedding model with gensim

    import logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    # word2vec training parameters
    num_features = 300     # vector dimensionality; 300 or 500 is a common choice
    min_word_count = 40    # ignore words with fewer occurrences
    num_workers = 4        # number of worker threads
    context = 10           # context window size
    downsampling = 1e-3    # downsample setting for frequent words

    model_name = '{}features_{}minwords_{}context.model'.format(num_features, min_word_count, context)  # file name to save under

    print('Training model...')
    model = Word2Vec(sentences, workers=num_workers, size=num_features,
                     min_count=min_word_count, window=context, sample=downsampling)

    # If you don't plan to train the model any further, calling
    # init_sims will make the model much more memory-efficient.
    model.init_sims(replace=True)

    # It can be helpful to create a meaningful model name and
    # save the model for later use. You can load it later using Word2Vec.load()
    model.save(os.path.join('..', 'data', model_name))

    .....
    2018-08-16 17:10:55,051 : INFO : worker thread finished; awaiting finish of 0 more threads
    2018-08-16 17:10:55,052 : INFO : EPOCH - 5 : training on 11877527 raw words (8394318 effective words) took 19.0s, 442045 effective words/s
    2018-08-16 17:10:55,053 : INFO : training on a 59387635 raw words (41968004 effective words) took 95.8s, 438035 effective words/s
    2018-08-16 17:14:34,696 : INFO : precomputing L2-norms of word weight vectors
    2018-08-16 17:14:35,547 : INFO : saving Word2Vec object under ..\data\300features_40minwords_10context.model, separately None
    2018-08-16 17:14:35,554 : INFO : not storing attribute vectors_norm
    2018-08-16 17:14:35,562 : INFO : not storing attribute cum_table
    2018-08-16 17:14:35,886 : INFO : saved ..\data\300features_40minwords_10context.model

 

Inspect the trained word vectors

    print(model.most_similar("man"))

    [('woman', 0.6256189346313477),
     ('lady', 0.5953349471092224),
     ('lad', 0.576863169670105),
     ('person', 0.5407935380935669),
     ('farmer', 0.5382746458053589),
     ('chap', 0.536788821220398),
     ('soldier', 0.5292650461196899),
     ('men', 0.5261573791503906),
     ('monk', 0.5237958431243896),
     ('guy', 0.5213091373443604)]
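A couple of other quick probes of the embedding space; a sketch, not part of the original notebook (on recent gensim the vectors live under model.wv, and on the 3.x version used here the same calls are also accepted directly on model):

    # which word does not belong with the others?
    print(model.wv.doesnt_match("man woman child kitchen".split()))

    # cosine similarity between two words
    print(model.wv.similarity('man', 'woman'))

    # nearest neighbours of another word
    print(model.wv.most_similar('awful')[:5])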

 

Approach 2 (continued): using the trained Word2Vec model

The trained model is the 300features_40minwords_10context.model file saved above.

    import warnings
    warnings.filterwarnings('ignore')
    import os
    import re
    import numpy as np
    import pandas as pd
    from bs4 import BeautifulSoup
    from nltk.corpus import stopwords
    from gensim.models.word2vec import Word2Vec
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.cluster import KMeans

Same setup as before

    def load_dataset(name, nrows=None):
        datasets = {
            'unlabeled_train': 'unlabeledTrainData.tsv',
            'labeled_train': 'labeledTrainData.tsv',
            'test': 'testData.tsv'
        }
        if name not in datasets:
            raise ValueError(name)
        data_file = os.path.join('..', 'data', datasets[name])
        df = pd.read_csv(data_file, sep='\t', escapechar='\\', nrows=nrows)
        print('Number of reviews: {}'.format(len(df)))
        return df

    eng_stopwords = set(stopwords.words('english'))

    def clean_text(text, remove_stopwords=False):
        text = BeautifulSoup(text, 'html.parser').get_text()
        text = re.sub(r'[^a-zA-Z]', ' ', text)
        words = text.lower().split()
        if remove_stopwords:
            words = [w for w in words if w not in eng_stopwords]
        return words

Load the previously trained Word2Vec model

    model_name = '300features_40minwords_10context.model'
    model = Word2Vec.load(os.path.join('..', 'data', model_name))

Encode the review text using the word2vec results

The encoding is a little crude: for each review, we simply average the word vectors of the words it contains.

    df = load_dataset('labeled_train')
    df.head()

    Number of reviews: 25000

Out[14]:

           id  sentiment                                             review
    0  5814_8          1  With all this stuff going down at the moment w...
    1  2381_9          1  "The Classic War of the Worlds" by Timothy Hin...
    2  7759_3          0  The film starts with a manager (Nicholas Bell)...
    def to_review_vector(review):
        words = clean_text(review, remove_stopwords=True)
        array = np.array([model[w] for w in words if w in model])
        return pd.Series(array.mean(axis=0))

    train_data_features = df.review.apply(to_review_vector)
    print(train_data_features.head())

              0         1         2         3         4  ...       295       296       297       298       299
    0 -0.002746  0.005741  0.004646 -0.001938  0.009835  ... -0.005903  0.010316  0.000723 -0.014974 -0.007718
    1 -0.003350 -0.006660  0.000073  0.004966  0.001066  ...  0.014924  0.002365  0.012350 -0.006034 -0.025690
    2 -0.016884 -0.006035  0.000061  0.003758  0.008695  ...  0.006264  0.002883  0.002217 -0.026501 -0.041674
    3 -0.009798 -0.000712  0.006659 -0.017110  0.006017  ...  0.015451  0.011731  0.008902 -0.020935 -0.036668
    4 -0.008019 -0.006775  0.009767  0.002874  0.014989  ... -0.000688 -0.000424 -0.003103 -0.031588 -0.019807

    [5 rows x 300 columns]

Build a classifier with a random forest

    forest = RandomForestClassifier(n_estimators=100, random_state=42)
    forest = forest.fit(train_data_features, df.sentiment)

Again check it on the training set to make sure the model works:

confusion_matrix(df.sentiment, forest.predict(train_data_features))

Clean up variables to free memory

    del df
    del train_data_features

Predict on the test set and submit to Kaggle

    df = load_dataset('test')
    df.head()
    test_data_features = df.review.apply(to_review_vector)
    print(test_data_features.head())
    result = forest.predict(test_data_features)
    output = pd.DataFrame({'id': df.id, 'sentiment': result})
    output.to_csv(os.path.join('..', 'data', 'Word2Vec_model.csv'), index=False)
    output.head()
    del df
    del test_data_features
    del forest

    Number of reviews: 25000
              0         1         2         3         4  ...       295       296       297       298       299
    0  0.003222 -0.002921  0.009352 -0.027743  0.018592  ...  0.011904  0.004627  0.015087 -0.016692 -0.018632
    1 -0.013426  0.003515  0.002579 -0.022269 -0.009693  ... -0.008517 -0.005674 -0.007146 -0.026965 -0.019395
    2  0.001031 -0.001867  0.021952 -0.033233  0.005209  ... -0.004877  0.008913  0.017697 -0.007476 -0.006233
    3 -0.014347  0.002951  0.022032 -0.009660  0.005736  ...  0.003137  0.004633  0.020197 -0.016389 -0.033783
    4 -0.000612 -0.006142  0.000142 -0.000970  0.011840  ...  0.007408 -0.011372  0.014652 -0.018350 -0.011623

    [5 rows x 300 columns]

 

 

2. Chinese application: chinese-sentiment-analysis

    # We again use gensim for the word2vec step, and an SVM from sklearn for the classifier
    import warnings
    warnings.filterwarnings("ignore")
    from sklearn.model_selection import train_test_split
    from gensim.models.word2vec import Word2Vec
    import numpy as np
    import pandas as pd
    import jieba
    from sklearn.externals import joblib
    from sklearn.svm import SVC
    import sys

    # Load the data, preprocess it (word segmentation), and split into train/test sets
    def load_file_and_preprocessing():
        neg = pd.read_excel("../data/neg.xls", header=None, index=None)
        pos = pd.read_excel("../data/pos.xls", header=None, index=None)
        cw = lambda x: list(jieba.cut(x))
        pos['words'] = pos[0].apply(cw)
        neg['words'] = neg[0].apply(cw)
        # print(pos['words'])
        # use 1 for positive sentiment, 0 for negative
        y = np.concatenate((np.ones(len(pos)), np.zeros(len(neg))))  # ones/zeros build the label vectors, concatenate joins them
        # train_test_split randomly splits the samples and labels into training and test subsets
        x_train, x_test, y_train, y_test = train_test_split(np.concatenate((pos['words'], neg['words'])), y, test_size=0.2)
        np.save('../data/y_train.npy', y_train)
        np.save('../data/y_test.npy', y_test)
        return x_train, x_test

    # Average the word vectors of all words in a sentence to build the sentence vector
    def build_sentence_vector(text, size, imdb_w2v):
        vec = np.zeros(size).reshape((1, size))
        count = 0
        for word in text:
            try:
                vec += imdb_w2v[word].reshape((1, size))  # element-wise sum
                count += 1.
            except KeyError:
                continue
        if count != 0:
            vec /= count
        return vec

    # Compute the word vectors
    def get_train_vecs(x_train, x_test):
        n_dim = 300
        # initialize the model and vocabulary
        imdb_w2v = Word2Vec(size=n_dim, min_count=10)
        imdb_w2v.build_vocab(x_train)
        # train on the review training set (this may take a few minutes)
        imdb_w2v.train(x_train, total_examples=imdb_w2v.corpus_count, epochs=imdb_w2v.epochs)
        train_vecs = np.concatenate([build_sentence_vector(z, n_dim, imdb_w2v) for z in x_train])
        np.save("../data/train_vecs.npy", train_vecs)
        print(train_vecs.shape)
        # continue training on the test set
        imdb_w2v.train(x_test, total_examples=imdb_w2v.corpus_count, epochs=imdb_w2v.epochs)
        imdb_w2v.save("../data/w2v_model.pkl")
        # build the test set vectors, then save them
        test_vecs = np.concatenate([build_sentence_vector(z, n_dim, imdb_w2v) for z in x_test])
        np.save("../data/test_vecs.npy", test_vecs)
        print(test_vecs.shape)

    def get_data():
        train_vecs = np.load('../data/train_vecs.npy')
        y_train = np.load('../data/y_train.npy')
        test_vecs = np.load('../data/test_vecs.npy')
        y_test = np.load('../data/y_test.npy')
        return train_vecs, y_train, test_vecs, y_test

    # Train the SVM model
    def svm_train(train_vecs, y_train, test_vecs, y_test):
        clf = SVC(kernel='rbf', verbose=True)
        clf.fit(train_vecs, y_train)
        joblib.dump(clf, '../data/model.pkl')
        print(clf.score(test_vecs, y_test))

    # Build the vector for a sentence to be predicted
    def get_predict_vecs(words):
        n_dim = 300
        imdb_w2v = Word2Vec.load('../data/w2v_model.pkl')
        train_vecs = build_sentence_vector(words, n_dim, imdb_w2v)
        return train_vecs

    # Sentiment prediction for a single sentence
    def svm_predict(string):
        words = jieba.lcut(string)
        words_vecs = get_predict_vecs(words)
        clf = joblib.load('../data/model.pkl')
        result = clf.predict(words_vecs)
        if int(result[0]) == 1:
            print(string, ' positive')
        else:
            print(string, ' negative')

    # One-time training: the first call generates model.pkl, after which these lines can stay commented out
    # x_train, x_test = load_file_and_preprocessing()
    # get_train_vecs(x_train, x_test)
    # train_vecs, y_train, test_vecs, y_test = get_data()
    # svm_train(train_vecs, y_train, test_vecs, y_test)

    ## sentiment prediction for the input sentences
    string = '电池充完了电连手机都打不开.简直烂的要命.真是金玉其外,败絮其中!连5号电池都不如'
    svm_predict(string)
    string = '牛逼的手机,从3米高的地方摔下去都没坏,质量非常好'
    svm_predict(string)
    Loading model cost 0.742 seconds.
    Prefix dict has been built succesfully.
    电池充完了电连手机都打不开.简直烂的要命.真是金玉其外,败絮其中!连5号电池都不如  negative
    牛逼的手机,从3米高的地方摔下去都没坏,质量非常好  positive
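To reproduce the two predictions above from scratch, the commented-out training block has to run once so that w2v_model.pkl and model.pkl exist under ../data/; after that, only svm_predict() is needed. A sketch of that first run, using the functions defined above:

    # one-time training run: segments the Excel data with jieba, trains word2vec
    # and the SVM, and saves w2v_model.pkl / model.pkl under ../data/
    x_train, x_test = load_file_and_preprocessing()
    get_train_vecs(x_train, x_test)
    train_vecs, y_train, test_vecs, y_test = get_data()
    svm_train(train_vecs, y_train, test_vecs, y_test)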

 

 

 

 

 

 

 
