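CountVectorizer converts the words in a text corpus into a term-count matrix. Calling fit_transform() counts how many times each word occurs, get_feature_names_out() returns every term in the bag-of-words vocabulary (older code uses get_feature_names(), which was removed in scikit-learn 1.2), and toarray() displays the resulting count matrix as a dense array.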
The TfidfTransformer class converts that count matrix into a TF-IDF weight for each term, again through the fit_transform() function.
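To make the counting step concrete, here is a minimal pure-Python sketch of a bag-of-words count matrix. It is not scikit-learn's implementation; it only assumes CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, vocabulary sorted alphabetically). The names `TOKEN_PATTERN` and `count_matrix` are hypothetical helpers introduced for illustration.

```python
import re
from collections import Counter

# Assumption: same token rule as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_matrix(docs):
    """Return (vocabulary, rows): sorted terms and one count row per document."""
    # Lowercase each document and split it into tokens.
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    # The vocabulary is every distinct token, sorted alphabetically.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for term, n in Counter(toks).items():
            row[index[term]] = n
        rows.append(row)
    return vocabulary, rows


corpus = [
    'this is the first document',
    'this is the second second document',
    'and And and the third one',
]

vocabulary, rows = count_matrix(corpus)
print(vocabulary)
for row in rows:
    print(row)
```

Because of the lowercasing, 'and' and 'And' fall into the same column, which is why the third document has a count of 3 there.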
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer

    corpus = [
        'this is the first document',
        'this is the second second document',
        'and And and the third one',
    ]

    # Learn the vocabulary, then build the document-term count matrix.
    vectorizer = CountVectorizer()
    vectorizer.fit(corpus)
    x = vectorizer.transform(corpus)

    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() returns an array, so convert it to a list.
    word = vectorizer.get_feature_names_out().tolist()
    word_1 = vectorizer.vocabulary_

    print(word)
    # ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

    print(word_1)
    # {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
    # key: the term; value: the term's index in the vocabulary list.

    print(x.toarray())
    # [[0 1 1 1 0 0 1 0 1]
    #  [0 1 0 1 0 2 1 0 1]
    #  [3 0 0 0 1 0 1 1 0]]
    # One column per distinct term in the vocabulary, one row per document.

    # Convert the raw counts into TF-IDF weights.
    tf = TfidfTransformer()
    y = tf.fit_transform(x)

    print(y.toarray())
    # [[0.         0.43306685 0.56943086 0.43306685 0.         0.
    #   0.33631504 0.         0.43306685]
    #  [0.         0.30833187 0.         0.30833187 0.         0.81083871
    #   0.2394472  0.         0.30833187]
    #  [0.89052427 0.         0.         0.         0.29684142 0.
    #   0.17531933 0.29684142 0.        ]]
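The TF-IDF values above can be reproduced by hand, which shows what TfidfTransformer does with its default settings. The sketch below is a NumPy re-derivation, not the library's code. It assumes the defaults smooth_idf=True and norm='l2', meaning idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row. The function name `manual_tfidf` is hypothetical.

```python
import numpy as np

# Count matrix printed by the CountVectorizer example (rows = documents).
counts = np.array([
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [3, 0, 0, 0, 1, 0, 1, 1, 0],
], dtype=float)


def manual_tfidf(counts):
    """TF-IDF with smoothed idf and L2 row normalisation (assumed defaults)."""
    n_docs = counts.shape[0]
    # Document frequency: number of documents containing each term.
    df = (counts > 0).sum(axis=0)
    # Smoothed inverse document frequency.
    idf = np.log((1 + n_docs) / (1 + df)) + 1
    weighted = counts * idf
    # Scale each row to unit Euclidean length; leave all-zero rows unchanged.
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return weighted / norms


result = manual_tfidf(counts)
print(np.round(result, 8))
```

The term 'the' appears in all three documents, so its idf is the minimum value of 1, which is why it gets the smallest weight among the terms present in each row.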
Copyright © 2003-2013 www.wpsshop.cn. All rights reserved.