I am Happy Because I am learning NLP @deeplearning
After preprocessing this becomes
[happi, learn, nlp]
and feature extraction then yields
[1, 4, 2]
Running feature extraction over all m tweets in a corpus produces a matrix:
{%raw%}
$$
X = \begin{bmatrix}
1 & X_1^{(1)} & X_2^{(1)} \\
1 & X_1^{(2)} & X_2^{(2)} \\
\vdots & \vdots & \vdots \\
1 & X_1^{(m)} & X_2^{(m)}
\end{bmatrix}
$$
{%endraw%}
freqs = build_freqs(tweets, labels)  # build the frequencies dictionary
X = np.zeros((m, 3))                 # initialize the matrix (assumes numpy was imported as np)
for i in range(m):                   # for every tweet
    p_tweet = process_tweet(tweets[i])            # process the tweet
    X[i, :] = extract_features(p_tweet, freqs)    # extract its features
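extract_features is not defined in this excerpt. A minimal sketch consistent with the [1, X₁, X₂] layout above (a bias term, the summed positive-class counts, and the summed negative-class counts), assuming freqs maps (word, label) pairs to counts as described below:

```python
import numpy as np

def extract_features(processed_tweet, freqs):
    """Map a processed tweet to a 3-dim vector:
    [bias, sum of positive frequencies, sum of negative frequencies]."""
    x = np.zeros(3)
    x[0] = 1  # bias term
    for word in processed_tweet:
        x[1] += freqs.get((word, 1.0), 0)  # count of word in positive tweets
        x[2] += freqs.get((word, 0.0), 0)  # count of word in negative tweets
    return x
```

Using `.get()` with a default of 0 means words never seen in a class simply contribute nothing.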
First, as before, import the necessary libraries:
import nltk # Python library for NLP
from nltk.corpus import twitter_samples # sample Twitter dataset from NLTK
import matplotlib.pyplot as plt # visualization library
import numpy as np # library for scientific computing and matrix operations
Here:
process_tweet() cleans the text, splits it into individual words, removes stopwords, and converts each word to its stem.
build_freqs() counts, over the whole corpus, how often each word appears in positive tweets and in negative tweets, building the freqs dictionary whose keys are (word, label) tuples.
Download the required resources:
# download the stopwords for the process_tweet function
nltk.download('stopwords')
# import our convenience functions
from utils import process_tweet, build_freqs
Load the dataset
This is the same as in the previous section.
# select the lists of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
# concatenate the lists, 1st part is the positive tweets followed by the negative
tweets = all_positive_tweets + all_negative_tweets
# let's see how many tweets we have
print("Number of tweets: ", len(tweets))
Build a labels array in which the first 5000 elements have label 1 and the last 5000 have label 0.
# make a numpy array representing labels of the tweets
labels = np.append(np.ones((len(all_positive_tweets))), np.zeros((len(all_negative_tweets))))
(For background on Python dictionaries, see the Runoob tutorial.)
Build the word frequency dictionary
def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its frequency
    """
    # Convert the np array to a list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)          # the key is the (word, label) tuple
            if pair in freqs:
                freqs[pair] += 1      # increment the count
            else:
                freqs[pair] = 1       # initialize to 1 on first occurrence
    return freqs
Now put this function to use:
# create frequency dictionary
freqs = build_freqs(tweets, labels)
# check data type
print(f'type(freqs) = {type(freqs)}')
# check length of the dictionary
print(f'len(freqs) = {len(freqs)}')
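Individual entries can be looked up directly by their (word, label) key. A small illustration with a hand-built stand-in for the real dictionary (note the labels are floats, since they come from np.ones / np.zeros):

```python
# tiny stand-in for the real freqs dictionary
freqs = {('happi', 1.0): 3, ('happi', 0.0): 1}

# .get() with a default of 0 handles pairs that never occurred
pos = freqs.get(('happi', 1.0), 0)  # count of 'happi' in positive tweets
neg = freqs.get(('happi', 0.0), 0)  # count of 'happi' in negative tweets
print(pos, neg)
```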
The full dictionary is too large to inspect directly, however.
Instead, select a subset of words to visualize;
this temporary selection can be stored in a list.
# select some words to appear in the report. we will assume that each word is unique (i.e. no duplicates)
keys = ['happi', 'merri', 'nice', 'good', 'bad', 'sad', 'mad', 'best', 'pretti',
        '❤', ':)', ':(']