
PySpark Feature Engineering: CountVectorizer

CountVectorizer is a common text feature-extraction transformer: it turns tokenized documents into numeric feature vectors.

For each training document, it considers only how often each vocabulary term occurs in that document.

CountVectorizer converts the tokens of a corpus into a term-frequency matrix. (The one-step `fit_transform` shorthand belongs to scikit-learn; in PySpark you call `fit` to learn the vocabulary and then `transform` to count terms.)

The result for each row is a sparse vector: effectively a map from term index to occurrence count, with counts stored as doubles.

01. Create a SparkSession and mock data

from pyspark.ml.feature import CountVectorizer
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.host","192.168.1.4")\
    .config("spark.ui.showConsoleProgress","false")\
    .appName("CountVectorizer").master("local[*]").getOrCreate()
data = spark.createDataFrame([
    (0, ["a", "b", "c"]),
    (1, ["a", "b", "b", "c", "a"]),
    (2, ["d", "d", "d", "b", "c", "b", "a"]),
    (3, ["a", "d", "f"]),
    (4, ["a", "e", "a", "a"]),
    (5, ["e", "f", "f", "d"]),
], ["label", "raw"])
data.show()

Output:

+-----+--------------------+
|label|                 raw|
+-----+--------------------+
|    0|           [a, b, c]|
|    1|     [a, b, b, c, a]|
|    2|[d, d, d, b, c, b...|
|    3|           [a, d, f]|
|    4|        [a, e, a, a]|
|    5|        [e, f, f, d]|
+-----+--------------------+

02. Inspect the schema

data.printSchema()

Output:

root
 |-- label: long (nullable = true)
 |-- raw: array (nullable = true)
 |    |-- element: string (containsNull = true)

03. Fit a CountVectorizer, transform the DataFrame, and view the result

cv = CountVectorizer(inputCol="raw", outputCol="vectors")
model = cv.fit(data)
model.transform(data).show()

Output:

+-----+--------------------+--------------------+
|label|                 raw|             vectors|
+-----+--------------------+--------------------+
|    0|           [a, b, c]|(6,[0,2,4],[1.0,1...|
|    1|     [a, b, b, c, a]|(6,[0,2,4],[2.0,2...|
|    2|[d, d, d, b, c, b...|(6,[0,1,2,4],[1.0...|
|    3|           [a, d, f]|(6,[0,1,3],[1.0,1...|
|    4|        [a, e, a, a]| (6,[0,5],[3.0,1.0])|
|    5|        [e, f, f, d]|(6,[1,3,5],[1.0,2...|
+-----+--------------------+--------------------+

04. Inspect the first 6 rows in detail (the DataFrame has exactly 6 rows)

resdata = model.transform(data)
resdata.head(6)

Output:

[Row(label=0, raw=['a', 'b', 'c'], vectors=SparseVector(6, {0: 1.0, 2: 1.0, 4: 1.0})),
 Row(label=1, raw=['a', 'b', 'b', 'c', 'a'], vectors=SparseVector(6, {0: 2.0, 2: 2.0, 4: 1.0})),
 Row(label=2, raw=['d', 'd', 'd', 'b', 'c', 'b', 'a'], vectors=SparseVector(6, {0: 1.0, 1: 3.0, 2: 2.0, 4: 1.0})),
 Row(label=3, raw=['a', 'd', 'f'], vectors=SparseVector(6, {0: 1.0, 1: 1.0, 3: 1.0})),
 Row(label=4, raw=['a', 'e', 'a', 'a'], vectors=SparseVector(6, {0: 3.0, 5: 1.0})),
 Row(label=5, raw=['e', 'f', 'f', 'd'], vectors=SparseVector(6, {1: 1.0, 3: 2.0, 5: 1.0}))]

05. View the schema of the transformed DataFrame

resdata.printSchema()

Output:

root
 |-- label: long (nullable = true)
 |-- raw: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- vectors: vector (nullable = true)

How to read the output:

Step 1: Suppose the raw data is the three documents in the "Raw document" column of the table below. Count how many times each term appears across the whole corpus and sort the terms from most to least frequent:

result: b appears 5 times > a appears 4 times > c appears 3 times > d appears 2 times

Step 2: The vocabulary [b, a, c, d] is therefore assigned indices [0, 1, 2, 3]. Each document is then encoded by the indices of the terms it contains ("Term indices" column) and the corresponding per-document counts ("Counts" column):

| Raw document       | Term indices | Counts       |
|--------------------|--------------|--------------|
| [a, b, c]          | [0, 1, 2]    | [1, 1, 1]    |
| [a, b, b, c, a]    | [0, 1, 2]    | [2, 2, 1]    |
| [d, d, b, c, b, a] | [0, 1, 2, 3] | [2, 1, 1, 2] |
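The two steps above can be reproduced in plain Python (no Spark needed) with `collections.Counter`, which makes the worked example easy to check by hand. One caveat: Spark breaks frequency ties in an implementation-defined order, but this toy corpus has no ties:

```python
from collections import Counter

docs = [
    ["a", "b", "c"],
    ["a", "b", "b", "c", "a"],
    ["d", "d", "b", "c", "b", "a"],
]

# Step 1: corpus-wide counts decide the vocabulary order (most frequent first).
totals = Counter()
for doc in docs:
    totals.update(doc)
vocab = [term for term, _ in totals.most_common()]
index = {term: i for i, term in enumerate(vocab)}

# Step 2: per document, record the indices of present terms and their counts.
encoded = []
for doc in docs:
    counts = Counter(doc)
    idx = sorted(index[t] for t in counts)
    encoded.append((idx, [counts[vocab[i]] for i in idx]))

print(vocab)    # ['b', 'a', 'c', 'd']
for idx, cnt in encoded:
    print(idx, cnt)
```

The printed pairs match the table above row for row.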