In this section we use gensim to turn words into vectors.
import numpy as np
import spacy

# Combine training and test tweets so both are tokenized the same way.
all_texts = np.array(twitter_train_df['text']).tolist() + np.array(twitter_test_df['text']).tolist()
all_tokenized_texts = []
token_freq_dict = {}
nlp = spacy.load("en_core_web_sm")
for twitt in all_texts:
    doc = nlp(twitt)
    token_twitt = []
    for token in doc:
        # Lowercase each token and count how often it occurs in the corpus.
        token = token.text.lower()
        token_twitt.append(token)
        if token in token_freq_dict:
            token_freq_dict[token] += 1
        else:
            token_freq_dict[token] = 1
    all_tokenized_texts.append(token_twitt)
For how to use the gensim package, see the official documentation:
https://radimrehurek.com/gensim/models/word2vec.html
from gensim.models import Word2Vec
# gensim 4.0 renamed the size parameter to vector_size; on gensim 3.x use size=300.
model = Word2Vec(all_tokenized_texts, vector_size=300)
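Word2Vec's default `min_count=5` leaves tokens seen fewer than five times out of the vocabulary, so only tokens at or above that threshold end up with a vector. The following is a minimal sketch of that filter using only the standard library. The toy corpus and the names `toy_corpus`, `MIN_COUNT`, and `kept_tokens` are made up for illustration and are not part of gensim.

```python
from collections import Counter

# Toy tokenized corpus, made up for illustration.
toy_corpus = [["good", "movie"]] * 5 + [["bad", "movie"]] * 2

MIN_COUNT = 5  # mirrors gensim's default min_count

# Count every token across the whole corpus.
token_counts = Counter(token for sentence in toy_corpus for token in sentence)

# Tokens at or above the threshold are the ones that would get a vector.
kept_tokens = {token for token, count in token_counts.items() if count >= MIN_COUNT}

print(sorted(kept_tokens))
```

Here "bad" occurs only twice, so it falls below the threshold and would have no vector to look up.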
The vector representation of each tweet can be computed by averaging the vectors of all its tokens:
all_vec_tweets = []
for tweet in all_tokenized_texts:
    tw_vecs = []
    for token in tweet:
        # Only tokens seen at least 5 times have a vector in the model.
        if token_freq_dict[token]>=5:
            tw_vecs.append(model.wv[token].tolist())
    if len(tw_vecs)==0:
        # No known tokens: fall back to a zero vector.
        all_vec_tweets.append(np.zeros(300).tolist())
    else:
        all_vec_tweets.append(np.mean(np.array(tw_vecs), 0).tolist())
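The averaging step can be checked in isolation on a tiny example. The sketch below assumes a plain dictionary of 2-dimensional vectors in place of the trained model. The helper `average_vector` and the `toy_vectors` data are hypothetical and exist only for illustration.

```python
import numpy as np

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 2-dimensional embeddings, made up for illustration.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}

mixed = average_vector(["good", "movie", "unseen"], toy_vectors, dim=2)
empty = average_vector(["unseen"], toy_vectors, dim=2)
print(mixed, empty)
```

Unknown tokens are simply skipped, so a tweet made up entirely of unknown tokens maps to the zero vector, as in the loop above.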
From here on, the procedure is the same as in the previous section.
from sklearn.linear_model import LogisticRegression
train_X = np.array(all_vec_tweets[:len(twitter_train_df)])
train_y = twitter_train_df['sentiment']
test_X = np.array(all_vec_tweets[len(twitter_train_df):])
test_y = twitter_test_df['sentiment']
clf = LogisticRegression(random_state=0).fit(train_X, train_y)
print("The accuracy of the trained classifier is "+str(clf.score(test_X, test_y)*100)+"%")