CountVectorizer — converts text to a matrix of term counts
TfidfTransformer — converts a count matrix into tf-idf weights
TfidfVectorizer — combines the two steps in one class (with use_idf=False it produces only normalized counts, without the idf factor)
vectorizer = CountVectorizer()             # converts the corpus to a term-count matrix; element a[i][j] is the count of word j in document i
count = vectorizer.fit_transform(corpus)   # fit on the corpus and produce the count matrix
transformer = TfidfTransformer()           # computes a tf-idf weight for every term
tfidf = transformer.fit_transform(count)   # compute tf-idf from the counts
TfidfVec = TfidfVectorizer()               # both steps in a single class
count2 = TfidfVec.fit_transform(corpus)
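Since TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, the two routes should produce the same matrix. A minimal sanity check, assuming the variables defined above:

import numpy as np

# tfidf comes from CountVectorizer + TfidfTransformer, count2 from
# TfidfVectorizer; with default settings the two matrices match
assert np.allclose(tfidf.toarray(), count2.toarray())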
Putting # coding:utf-8 at the top of the file allows Chinese comments in the source.
# coding:utf-8
__author__ = "liuxuejiang"
#import jieba
#import jieba.posseg as pseg
import os
import sys
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

if __name__ == "__main__":
    corpus = [
        'Today the weather is sunny',                  # first document, already tokenized, words separated by spaces
        'Sunny day weather is suitable to exercise ',  # second document after tokenization
        'I ate a Hotdog']                              # third document after tokenization
    vectorizer = CountVectorizer()            # converts the corpus to a term-count matrix; a[i][j] is the count of word j in document i
    count = vectorizer.fit_transform(corpus)  # fit on the corpus and produce the count matrix

    print(vectorizer.vocabulary_)
    word = vectorizer.get_feature_names()     # all terms in the bag-of-words vocabulary
    print(word)
    print(vectorizer.fit_transform(corpus))
    print(vectorizer.fit_transform(corpus).todense())  # show the count matrix in dense form

    transformer = TfidfTransformer()          # computes a tf-idf weight for every term
    tfidf = transformer.fit_transform(count)  # compute tf-idf from the counts
    print(tfidf)
    weight = tfidf.toarray()  # dense tf-idf matrix; a[i][j] is the tf-idf weight of word j in document i
    print(weight)

    for i in range(len(weight)):  # outer loop over documents, inner loop over the terms of each document
        print("------- tf-idf weights for document", i + 1, "-------")
        for j in range(len(word)):
            print(word[j], weight[i][j])

    TfidfVec = TfidfVectorizer()
    count2 = TfidfVec.fit_transform(corpus)
    print("-------- using TfidfVectorizer directly --------")
    print(TfidfVec.fit_transform(corpus).todense())
Output:

{'ate': 0, 'is': 4, 'sunny': 6, 'to': 8, 'weather': 10, 'today': 9, 'the': 7, 'suitable': 5, 'day': 1, 'exercise': 2, 'hotdog': 3}
['ate', 'day', 'exercise', 'hotdog', 'is', 'suitable', 'sunny', 'the', 'to', 'today', 'weather']
  (0, 6)	1
  (0, 4)	1
  (0, 10)	1
  (0, 7)	1
  (0, 9)	1
  (1, 2)	1
  (1, 8)	1
  (1, 5)	1
  (1, 1)	1
  (1, 6)	1
  (1, 4)	1
  (1, 10)	1
  (2, 3)	1
  (2, 0)	1
[[0 0 0 0 1 0 1 1 0 1 1]
 [0 1 1 0 1 1 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0 0]]
  (0, 9)	0.517419943932
  (0, 7)	0.517419943932
  (0, 10)	0.393511204094
  (0, 4)	0.393511204094
  (0, 6)	0.393511204094
  (1, 10)	0.317570180428
  (1, 4)	0.317570180428
  (1, 6)	0.317570180428
  (1, 1)	0.417566623878
  (1, 5)	0.417566623878
  (1, 8)	0.417566623878
  (1, 2)	0.417566623878
  (2, 0)	0.707106781187
  (2, 3)	0.707106781187
[[ 0.          0.          0.          0.          0.3935112   0.
   0.3935112   0.51741994  0.          0.51741994  0.3935112 ]
 [ 0.          0.41756662  0.41756662  0.          0.31757018  0.41756662
   0.31757018  0.          0.41756662  0.          0.31757018]
 [ 0.70710678  0.          0.          0.70710678  0.          0.
   0.          0.          0.          0.          0.        ]]
------- tf-idf weights for document 1 -------
ate 0.0
day 0.0
exercise 0.0
hotdog 0.0
is 0.393511204094
suitable 0.0
sunny 0.393511204094
the 0.517419943932
to 0.0
today 0.517419943932
weather 0.393511204094
------- tf-idf weights for document 2 -------
ate 0.0
day 0.417566623878
exercise 0.417566623878
hotdog 0.0
is 0.317570180428
suitable 0.417566623878
sunny 0.317570180428
the 0.0
to 0.417566623878
today 0.0
weather 0.317570180428
------- tf-idf weights for document 3 -------
ate 0.707106781187
day 0.0
exercise 0.0
hotdog 0.707106781187
is 0.0
suitable 0.0
sunny 0.0
the 0.0
to 0.0
today 0.0
weather 0.0
-------- using TfidfVectorizer directly --------
[[ 0.          0.          0.          0.          0.3935112   0.
   0.3935112   0.51741994  0.          0.51741994  0.3935112 ]
 [ 0.          0.41756662  0.41756662  0.          0.31757018  0.41756662
   0.31757018  0.          0.41756662  0.          0.31757018]
 [ 0.70710678  0.          0.          0.70710678  0.          0.
   0.          0.          0.          0.          0.        ]]
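The 0.70710678 values in the last row are easy to sanity-check: document 3 contains only 'ate' and 'hotdog', each appearing once and in no other document, so they receive identical tf-idf scores, and the default L2 normalization (norm='l2') scales each to 1/√2. A minimal check:

import numpy as np

# two equal weights under L2 normalization each become 1/sqrt(2)
print(1 / np.sqrt(2))  # 0.7071067811865475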
Chinese word segmentation is done with jieba, so the jieba package must be installed as well.
1 Install the scikit-learn package.
Running jieba's part-of-speech tagger over the sentence 对这句话进行分词 ("segment this sentence") prints one word/tag pair per line:
对 p
这 r
句 q
话 n
进行 v
分词 n
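A minimal sketch that reproduces this output, assuming jieba is installed:

import jieba.posseg as pseg

# pseg.cut() yields (word, part-of-speech flag) pairs
for word, flag in pseg.cut("对这句话进行分词"):
    print(word, flag)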
4 Computing tf-idf term weights with scikit-learn relies on two key classes, CountVectorizer and TfidfTransformer; see the scikit-learn documentation for details.
A simple code example follows:
CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

print(cv.vocabulary_)
# {'bird': 0, 'cat': 1, 'dog': 2, 'fish': 3}
Here cv.vocabulary_ is a dict whose keys are the words (features) found in the corpus and whose values are their column indices. The values 0, 1, 2, 3 are alphabetical column positions, not counts and not a frequency ordering. To get the counts you need the cv_fit object:
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

print(cv.get_feature_names())
print(cv_fit.toarray())
# ['bird', 'cat', 'dog', 'fish']
# [[0 1 1 1]
#  [0 2 1 0]
#  [1 0 0 1]
#  [1 0 0 0]]
Each row of the array corresponds to one of the original documents (strings), each column to one feature (word), and each element is the count of that word in that document. You can verify that summing each column gives the right totals:
print(cv_fit.toarray().sum(axis=0))
# [2 3 2 2]
Honestly, unless you have a specific reason to use scikit-learn, I would suggest collections.Counter or something from NLTK instead, because they are simpler. For example:
import itertools
from collections import Counter

def build_vocab(sentences):
    """
    Builds a vocabulary mapping from word to index based on the sentences.
    Returns vocabulary mapping and inverse vocabulary mapping.
    """
    # Build vocabulary
    word_counts = Counter(itertools.chain(*sentences))
    # Mapping from index to word, most frequent first
    vocabulary_inv = [x[0] for x in word_counts.most_common()]
    # Mapping from word to index
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    return [vocabulary, vocabulary_inv]

'''
import collections
sentence = ["i", "love", "mom", "mom", "loves", "me"]
collections.Counter(sentence)
>>> Counter({'mom': 2, 'i': 1, 'love': 1, 'loves': 1, 'me': 1})
'''
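A quick usage sketch with hypothetical pre-tokenized sentences:

sentences = [["dog", "cat", "fish"], ["dog", "cat", "cat"]]
vocabulary, vocabulary_inv = build_vocab(sentences)
print(vocabulary)      # {'cat': 0, 'dog': 1, 'fish': 2}
print(vocabulary_inv)  # ['cat', 'dog', 'fish']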