
Coursera Natural Language Processing (NLP) Notes, Part 2 (end of Week 1 content)

1. Word Frequency Extraction and Visualization

1.1. Basics

I am Happy Because I am learning NLP @deeplearning

After preprocessing, this becomes

[happy, learn, nlp]

Feature extraction then yields

[1, 4, 2]

Extracting features from all m examples in a corpus produces the matrix
{%raw%}
$$X = \begin{bmatrix} 1 & X_1^{(1)} & X_2^{(1)} \\ 1 & X_1^{(2)} & X_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & X_1^{(m)} & X_2^{(m)} \end{bmatrix}$$
{%endraw%}

  1. First, build the frequency dictionary

freqs = build_freqs(tweets, labels)  # build frequencies dictionary

  2. Initialize a matrix X of matching size

X = np.zeros((m, 3))  # initialize matrix (requires numpy imported as np)

  3. Loop over the tweets, removing stop words, URLs, and handles, then stemming and lower-casing

for i in range(m):                      # for every tweet
    p_tweet = process_tweet(tweets[i])  # process tweet

  4. Extract features by summing the positive and negative frequencies

    X[i, :] = extract_features(p_tweet, freqs)  # extract features
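The extract_features helper itself is not shown in these notes; a minimal sketch of what it computes, assuming freqs maps (word, label) tuples to counts and labels are the floats 1.0 / 0.0 produced by np.ones and np.zeros:

```python
import numpy as np

def extract_features(processed_tweet, freqs):
    """Feature vector: [bias, summed positive counts, summed negative counts]."""
    x = np.zeros(3)
    x[0] = 1  # bias term
    for word in processed_tweet:
        x[1] += freqs.get((word, 1.0), 0)  # frequency under the positive label
        x[2] += freqs.get((word, 0.0), 0)  # frequency under the negative label
    return x
```

With a toy dictionary such as `{('happi', 1.0): 3, ('happi', 0.0): 1, ('sad', 0.0): 2}`, the tweet `['happi', 'sad']` maps to `[1, 3, 3]`.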

1.2. Code Implementation

First, as before, import the required libraries.

import nltk                                  # Python library for NLP
from nltk.corpus import twitter_samples      # sample Twitter dataset from NLTK
import matplotlib.pyplot as plt              # visualization library
import numpy as np                           # library for scientific computing and matrix operations

where:
process_tweet() cleans the text, splits it into individual words, removes stopwords, and converts each word to its stem.
build_freqs() counts how often each word in the corpus appears with a positive label and with a negative label, building the freqs dictionary whose keys are (word, label) tuples.

Download the required resources.

# download the stopwords for the process_tweet function
nltk.download('stopwords')

# import our convenience functions
from utils import process_tweet, build_freqs

Load the dataset.
Same as in the previous section.

# select the lists of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# concatenate the lists, 1st part is the positive tweets followed by the negative
tweets = all_positive_tweets + all_negative_tweets

# let's see how many tweets we have
print("Number of tweets: ", len(tweets))

Build a labels array whose first 5000 elements are labeled 1 and whose last 5000 elements are labeled 0.

# make a numpy array representing labels of the tweets
labels = np.append(np.ones((len(all_positive_tweets))), np.zeros((len(all_negative_tweets))))
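On a toy scale, the np.append call above simply concatenates a vector of ones and a vector of zeros into one flat labels array:

```python
import numpy as np

# e.g. 3 positive tweets followed by 2 negative tweets
labels = np.append(np.ones(3), np.zeros(2))
print(labels)        # [1. 1. 1. 0. 0.]
print(labels.shape)  # (5,)
```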

(For background on Python dictionaries, see the Runoob tutorial.)

Build the word frequency dictionary.

def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)     # key is the (word, label) tuple
            if pair in freqs:
                freqs[pair] += 1 # increment the count
            else:
                freqs[pair] = 1  # first occurrence: initialize to 1
    return freqs

Now put the function to use.

# create frequency dictionary
freqs = build_freqs(tweets, labels)

# check data type
print(f'type(freqs) = {type(freqs)}')

# check length of the dictionary
print(f'len(freqs) = {len(freqs)}')

But the full dictionary is far too large to read through directly.

Select a subset of words to visualize.
A list can hold this temporary information.

# select some words to appear in the report. we will assume that each word is unique (i.e. no duplicates)
keys = ['happi', 'merri', 'nice', 'good', 'bad', 'sad', 'mad', 'best', 'pretti',
        '❤', ':)', ':(']
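One way to inspect just these words (a sketch, not necessarily the course's exact plotting code) is to look up each word's positive and negative counts in freqs and scatter them on log axes, so that rare and frequent words stay visible on one plot; the toy freqs below stands in for the real dictionary:

```python
import numpy as np
import matplotlib.pyplot as plt

def key_counts(keys, freqs):
    """[word, positive count, negative count] for each selected word."""
    return [[w, freqs.get((w, 1.0), 0), freqs.get((w, 0.0), 0)] for w in keys]

# toy frequency dictionary standing in for the real freqs
freqs = {('happi', 1.0): 120, ('happi', 0.0): 5,
         ('sad', 1.0): 3, ('sad', 0.0): 90}
data = key_counts(['happi', 'sad'], freqs)

# log scale; +1 avoids log(0) for words missing under one label
x = np.log([row[1] + 1 for row in data])
y = np.log([row[2] + 1 for row in data])
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x, y)
for i, row in enumerate(data):
    ax.annotate(row[0], (x[i], y[i]))  # label each point with its word
ax.set_xlabel('Log positive count')
ax.set_ylabel('Log negative count')
plt.show()
```

Words far below the diagonal are strongly positive, words far above it strongly negative.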