Official documentation: sklearn.feature_extraction.text.CountVectorizer
Official documentation: sklearn.feature_extraction.text.TfidfVectorizer
class sklearn.feature_extraction.text.CountVectorizer()
input=’content’
:string {‘filename’, ‘file’, ‘content’}
encoding='utf-8'
:string
decode_error='strict'
:{‘strict’, ‘ignore’, ‘replace’}
strip_accents=None
:{‘ascii’, ‘unicode’}
lowercase=True
:boolean
preprocessor=None
:callable
tokenizer=None
:callable
stop_words=None
: string{‘english’}, list
token_pattern='(?u)\b\w\w+\b'
:string
ngram_range=(1, 1)
:tuple (min_n, max_n)
analyzer=’word’
:string, {‘word’, ‘char’, ‘char_wb’} or callable
char_wb
Creates n-grams only from text inside word boundaries; n-grams at the edges of words are padded with spaces. If input is ‘filename’ or ‘file’, the data is first read from the file and then passed to the given callable analyzer. (A short sketch of this and a few other parameters follows the example below.)
max_df=1.0
:float in range [0.0, 1.0] or int
min_df=1
:float in range [0.0, 1.0] or int
max_features=None
:int or None
vocabulary=None
: Mapping or iterable, optional
binary=False
:bool
dtype=np.int64
:type
vocabulary_
:dict. A mapping of terms to feature indices.
fixed_vocabulary_
:boolean. True if a fixed vocabulary mapping terms to indices was provided by the user.
stop_words_
:set. Terms that were ignored because they occurred in too many documents (max_df), occurred in too few documents (min_df), or were cut off by feature selection (max_features).
fit(self, raw_documents[, y])
fit_transform(self, raw_documents, y=None)
transform(self, raw_documents)
get_feature_names(self)
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
>>> vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
>>> X2 = vectorizer2.fit_transform(corpus)
>>> print(vectorizer2.get_feature_names())
['and this', 'document is', 'first document', 'is the', 'is this', 'second document', 'the first', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the']
>>> print(X2.toarray())
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]
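The effect of a few of the parameters listed above (stop_words, min_df, analyzer='char_wb') is easiest to see on the same small corpus. This is only a minimal sketch; it keeps get_feature_names() for consistency with the rest of this page (newer sklearn releases rename it to get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# stop_words='english' removes common English function words ('is', 'the', ...)
# from the vocabulary before counting
v_stop = CountVectorizer(stop_words='english')
v_stop.fit(corpus)
print(v_stop.get_feature_names())

# min_df=2 keeps only terms that appear in at least 2 documents; the dropped
# terms end up in the stop_words_ attribute
v_df = CountVectorizer(min_df=2)
v_df.fit(corpus)
print(v_df.get_feature_names())   # ['document', 'first', 'is', 'the', 'this']
print(sorted(v_df.stop_words_))   # ['and', 'one', 'second', 'third']

# analyzer='char_wb' builds character n-grams only inside word boundaries,
# padding n-grams at the edges of a word with a space
v_char = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3))
v_char.fit(['word'])
print(v_char.get_feature_names())  # [' wo', 'ord', 'rd ', 'wor']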
$$TF(d,t)=\frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d}$$

$$IDF(d,t)=\log\frac{\text{total number of documents}}{\text{number of documents containing term } t + 1}$$

$$TF\text{-}IDF(d,t)=TF(d,t)\times IDF(d,t)$$
The inverse document frequency measures how important a word is for expressing meaning: if a word appears in a very large number of documents, it is probably a fairly generic term and contributes little to distinguishing the specific semantics of any one document.
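As a quick sanity check of the formulas, TF-IDF can be computed by hand in plain Python. This is a sketch of the textbook definition above only; sklearn's TfidfVectorizer uses a slightly different smoothed IDF and normalizes each row, so its values will not match these exactly:

import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

def tf(term, doc_tokens):
    # TF(d, t) = occurrences of term t in document d / total terms in d
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    # IDF(d, t) = log(total documents / (documents containing t + 1))
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(n_docs / (df + 1))

def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

# 'document' appears in 3 of 4 documents -> IDF = log(4/4) = 0, so TF-IDF is 0
print(tf_idf('document', docs[1]))
# 'second' appears in only 1 document -> IDF = log(4/2) ~ 0.69, TF-IDF ~ 0.12
print(tf_idf('second', docs[1]))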
class sklearn.feature_extraction.text.TfidfVectorizer
In addition to all of the parameters of CountVectorizer, it has the following parameters; their effect is sketched after the example at the end of this section.
norm=’l2’
:{‘l1’, ‘l2’}
use_idf=True
:bool
smooth_idf=True
:bool
sublinear_tf=False
:bool
vocabulary_
:dict. A mapping of terms to feature indices.
fixed_vocabulary_
:bool. True if a fixed vocabulary mapping terms to indices was provided by the user.
idf_
:array of shape (n_features,). The inverse document frequency (IDF) vector; only defined if use_idf is True.
stop_words_
:set. Terms that were ignored because they occurred in too many documents (max_df), occurred in too few documents (min_df), or were cut off by feature selection (max_features).
fit(self, raw_documents[, y])
fit_transform(self, raw_documents, y=None)
transform(self, raw_documents)
get_feature_names(self)
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.shape)
(4, 9)
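Building on the example above, the following sketch shows the fitted idf_ attribute and the default normalization. Note that sklearn's IDF is not exactly the textbook formula given earlier: with smooth_idf=True (the default) it computes ln((1 + n) / (1 + df)) + 1, and with norm='l2' each output row is scaled to unit length:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()            # norm='l2', use_idf=True, smooth_idf=True
X = vectorizer.fit_transform(corpus)

# idf_ is only defined when use_idf=True
print(vectorizer.idf_)

# reproduce idf_ by hand: with smooth_idf=True, idf = ln((1 + n_docs) / (1 + df)) + 1,
# where df is the number of documents containing each term
n_docs = len(corpus)
df = np.asarray(CountVectorizer(binary=True).fit_transform(corpus).sum(axis=0)).ravel()
print(np.log((1 + n_docs) / (1 + df)) + 1)   # matches vectorizer.idf_

# with norm='l2' every document vector is scaled to unit length
print(np.linalg.norm(X.toarray(), axis=1))   # [1. 1. 1. 1.]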