CountVectorizer is a common class for turning text into numeric features; it is a text feature extraction method.
For each training document, it only considers how often each term occurs in that document.
CountVectorizer converts the terms of a corpus into a term-frequency matrix. In PySpark this takes two steps: fit builds the vocabulary from the corpus, and transform counts how many times each vocabulary term occurs in every row.
The result for each row is a sparse vector whose content is a set of key-value pairs mapping term index to count (stored as double).
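Before the full walkthrough, here is a minimal sketch of that sparse key-value representation; the indices and counts below are copied from the label=1 row of the example that follows.
from pyspark.ml.linalg import Vectors
# A count vector over a 6-term vocabulary: term index -> count (stored as double).
# These values match the row label=1 (["a", "b", "b", "c", "a"]) further down.
v = Vectors.sparse(6, {0: 2.0, 2: 2.0, 4: 1.0})
print(v)            # (6,[0,2,4],[2.0,2.0,1.0])
print(v.toArray())  # [2. 0. 2. 0. 1. 0.]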
01. Create a SparkSession and mock data
from pyspark.ml.feature import CountVectorizer
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.host", "192.168.1.4")\
    .config("spark.ui.showConsoleProgress", "false")\
    .appName("CountVectorizer").master("local[*]").getOrCreate()
data = spark.createDataFrame([
    (0, ["a", "b", "c"]),
    (1, ["a", "b", "b", "c", "a"]),
    (2, ["d", "d", "d", "b", "c", "b", "a"]),
    (3, ["a", "d", "f"]),
    (4, ["a", "e", "a", "a"]),
    (5, ["e", "f", "f", "d"]),
], ["label", "raw"])
data.show()
Output:
+-----+--------------------+
|label| raw|
+-----+--------------------+
| 0| [a, b, c]|
| 1| [a, b, b, c, a]|
| 2|[d, d, d, b, c, b...|
| 3| [a, d, f]|
| 4| [a, e, a, a]|
| 5| [e, f, f, d]|
+-----+--------------------+
02. Inspect the schema:
data.printSchema()
Output:
root
|-- label: long (nullable = true)
|-- raw: array (nullable = true)
| |-- element: string (containsNull = true)
03. Fit a CountVectorizer, transform the DataFrame, and inspect the result
cv = CountVectorizer(inputCol="raw", outputCol="vectors")
model = cv.fit(data)
model.transform(data).show()
Output:
+-----+--------------------+--------------------+
|label| raw| vectors|
+-----+--------------------+--------------------+
| 0| [a, b, c]|(6,[0,2,4],[1.0,1...|
| 1| [a, b, b, c, a]|(6,[0,2,4],[2.0,2...|
| 2|[d, d, d, b, c, b...|(6,[0,1,2,4],[1.0...|
| 3| [a, d, f]|(6,[0,1,3],[1.0,1...|
| 4| [a, e, a, a]| (6,[0,5],[3.0,1.0])|
| 5| [e, f, f, d]|(6,[1,3,5],[1.0,2...|
+-----+--------------------+--------------------+
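CountVectorizer also exposes a few tuning parameters (vocabSize, minDF, minTF, binary). The sketch below is not part of the original example; the parameter values are assumptions chosen only to illustrate the options.
# Restrict the vocabulary; the values here are illustrative only.
cv_limited = CountVectorizer(inputCol="raw", outputCol="vectors",
                             vocabSize=4,   # keep at most the 4 most frequent terms
                             minDF=2.0,     # a term must appear in at least 2 rows
                             minTF=1.0,     # per-row count threshold
                             binary=False)  # True would turn every non-zero count into 1.0
model_limited = cv_limited.fit(data)
model_limited.transform(data).show()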
04. Take a detailed look at the first 6 rows (this DataFrame only has 6 rows)
resdata = model.transform(data)
resdata.head(6)
Output:
[Row(label=0, raw=['a', 'b', 'c'], vectors=SparseVector(6, {0: 1.0, 2: 1.0, 4: 1.0})),
Row(label=1, raw=['a', 'b', 'b', 'c', 'a'], vectors=SparseVector(6, {0: 2.0, 2: 2.0, 4: 1.0})),
Row(label=2, raw=['d', 'd', 'd', 'b', 'c', 'b', 'a'], vectors=SparseVector(6, {0: 1.0, 1: 3.0, 2: 2.0, 4: 1.0})),
Row(label=3, raw=['a', 'd', 'f'], vectors=SparseVector(6, {0: 1.0, 1: 1.0, 3: 1.0})),
Row(label=4, raw=['a', 'e', 'a', 'a'], vectors=SparseVector(6, {0: 3.0, 5: 1.0})),
Row(label=5, raw=['e', 'f', 'f', 'd'], vectors=SparseVector(6, {1: 1.0, 3: 2.0, 5: 1.0}))]
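Each cell of the vectors column is a SparseVector; to read the per-term counts directly, toArray() expands it into a dense array. A small sketch, assuming the resdata DataFrame from step 04:
# head(1) returns a list of Row objects; .vectors is the SparseVector cell.
first = resdata.head(1)[0]
print(first.vectors)            # (6,[0,2,4],[1.0,1.0,1.0])
print(first.vectors.toArray())  # [1. 0. 1. 0. 1. 0.]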
05. Inspect the schema of the transformed DataFrame
resdata.printSchema()
Output:
root
|-- label: long (nullable = true)
|-- raw: array (nullable = true)
| |-- element: string (containsNull = true)
|-- vectors: vector (nullable = true)
Explanation of the output:
step1: Take the first three rows of the example as the corpus ("raw data" column below). Count every term across these rows and sort the totals from largest to smallest:
result: b (5) > a (4) > c (3) > d (3)
step2: The vocabulary [b, a, c, d] is then mapped to the indices [0, 1, 2, 3] ("term indices" column below), and the number of occurrences of each term in a row becomes the value ("counts" column below).
raw data | term indices | counts |
---|---|---|
[a,b,c] | [0,1,2] | [1,1,1] |
[a,b,b,c,a] | [0,1,2] | [2,2,1] |
[d,d,d,b,c,b,a] | [0,1,2,3] | [2,1,1,3] |
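This mini example uses only three rows, so its vocabulary order differs from the model fitted on all six rows above. To confirm the index-to-term mapping the model actually learned, print the model's vocabulary; based on the sparse vectors shown in step 03, the expected order is a, d, b, f, c, e (ties between equally frequent terms could be broken differently).
# Index i of each SparseVector corresponds to model.vocabulary[i].
print(model.vocabulary)
# Expected from the sparse vectors above: ['a', 'd', 'b', 'f', 'c', 'e']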