Chapter 1: Classification Models
Step 1: create the raw model data with pandas and createDataFrame:
# spark version 3.0.1
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
import pandas as pd
spark = SparkSession.builder.getOrCreate()
# model data
pandas_df = pd.DataFrame({
'a': [1,1,0,1,0],
'b': [1,0,1,1,1],
'c': [0,1,0,0,0],
'y': [0,0,0,1,1],
'id':['A001', 'A002', 'A003','A004','A005']
})
df = spark.createDataFrame(pandas_df).select("id","a","b","c","y")
df.show()
+----+---+---+---+---+
| id| a| b| c| y|
+----+---+---+---+---+
|A001| 1| 1| 0| 0|
|A002| 1| 0| 1| 0|
|A003| 0| 1| 0| 0|
|A004| 1| 1| 0| 1|
|A005| 0| 1| 0| 1|
+----+---+---+---+---+
Step 2: assemble the features into a vector and normalize them
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer
# assemble the feature columns into a single vector column
vecAss = VectorAssembler(inputCols=['a','b','c'], outputCol='features')
df_features = vecAss.transform(df)
df_features.show()
+----+---+---+---+---+-------------+
| id| a| b| c| y| features|
+----+---+---+---+---+-------------+
|A001| 1| 1| 0| 0|[1.0,1.0,0.0]|
|A002| 1| 0| 1| 0|[1.0,0.0,1.0]|
|A003| 0| 1| 0| 0|[0.0,1.0,0.0]|
|A004| 1| 1| 0| 1|[1.0,1.0,0.0]|
|A005| 0| 1| 0| 1|[0.0,1.0,0.0]|
+----+---+---+---+---+-------------+
Norm = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)
df_norm_features = Norm.transform(df_features)
df_norm_features.show()
+----+---+---+---+---+-------------+-------------+
| id| a| b| c| y| features| normFeatures|
+----+---+---+---+---+-------------+-------------+
|A001| 1| 1| 0| 0|[1.0,1.0,0.0]|[0.5,0.5,0.0]|
|A002| 1| 0| 1| 0|[1.0,0.0,1.0]|[0.5,0.0,0.5]|
|A003| 0| 1| 0| 0|[0.0,1.0,0.0]|[0.0,1.0,0.0]|
|A004| 1| 1| 0| 1|[1.0,1.0,0.0]|[0.5,0.5,0.0]|
|A005| 0| 1| 0| 1|[0.0,1.0,0.0]|[0.0,1.0,0.0]|
+----+---+---+---+---+-------------+-------------+
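Normalizer with p=1.0 divides each row vector by its L1 norm (the sum of absolute values of its components), so every row's entries sum to 1. A minimal pure-Python sketch of that per-row transformation, checked against the rows in the table above:

```python
def l1_normalize(vec):
    """Divide each component by the vector's L1 norm (sum of absolute values)."""
    norm = sum(abs(x) for x in vec)
    return [x / norm for x in vec] if norm else vec

# rows A001 and A002 from the features column above
print(l1_normalize([1.0, 1.0, 0.0]))  # [0.5, 0.5, 0.0]
print(l1_normalize([1.0, 0.0, 1.0]))  # [0.5, 0.0, 0.5]
```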
Step 3: model training
# train a logistic regression model on the normalized features
model = LogisticRegression(featuresCol='normFeatures', labelCol='y', maxIter=100, tol=1e-06,
                           threshold=0.5, predictionCol='prediction', probabilityCol='probability',
                           rawPredictionCol='rawPrediction', standardization=True).fit(df_norm_features)
print(model.coefficients)
[1.8029996152867545,1.803003434834563,-36.96577573215852]
print(model.intercept)
-1.80300332247
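The printed coefficients and intercept relate to the probability column through the logistic (sigmoid) function: for a feature vector x, the margin is coefficients·x + intercept and P(y=1) = 1/(1+e^(-margin)). A quick pure-Python check with the values printed above, for row A001 (normFeatures = [0.5, 0.5, 0.0]):

```python
import math

# values printed by the fitted model above
coefficients = [1.8029996152867545, 1.803003434834563, -36.96577573215852]
intercept = -1.80300332247

def predict_proba(x):
    # margin = w . x + b; P(y=1) = sigmoid(margin)
    margin = sum(w * xi for w, xi in zip(coefficients, x)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))

p1 = predict_proba([0.5, 0.5, 0.0])  # row A001
print(p1)       # just under 0.5, so P(y=0) is about 0.5000004, matching the table
```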
Step 4: model prediction
# score the training data with the fitted model
result = model.transform(df_norm_features)
result.show()
+----+---+---+---+---+-------------+-------------+--------------------+--------------------+----------+
| id| a| b| c| y| features| normFeatures| rawPrediction| probability|prediction|
+----+---+---+---+---+-------------+-------------+--------------------+--------------------+----------+
|A001| 1| 1| 0| 0|[1.0,1.0,0.0]|[0.5,0.5,0.0]|[1.79741316608250...|[0.50000044935329...| 0.0|
|A002| 1| 0| 1| 0|[1.0,0.0,1.0]|[0.5,0.0,0.5]|[19.3843913809097...|[0.99999999618525...| 0.0|
|A003| 0| 1| 0| 0|[0.0,1.0,0.0]|[0.0,1.0,0.0]|[-1.1236073826914...|[0.49999997190981...| 1.0|
|A004| 1| 1| 0| 1|[1.0,1.0,0.0]|[0.5,0.5,0.0]|[1.79741316608250...|[0.50000044935329...| 0.0|
|A005| 0| 1| 0| 1|[0.0,1.0,0.0]|[0.0,1.0,0.0]|[-1.1236073826914...|[0.49999997190981...| 1.0|
+----+---+---+---+---+-------------+-------------+--------------------+--------------------+----------+
result.printSchema()
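Comparing the y and prediction columns above, the model gets 3 of the 5 training rows right (A003 and A004 are misclassified). A small pure-Python sketch computing that training accuracy from the labels and predictions shown in the table:

```python
# labels (y) and predictions for A001..A005, copied from the result table above
labels      = [0, 0, 0, 1, 1]
predictions = [0, 0, 1, 0, 1]

accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
print(accuracy)  # 0.6 -- A003 and A004 are misclassified
```

On the Spark side, the same number can be computed on `result` directly with `pyspark.ml.evaluation.MulticlassClassificationEvaluator(labelCol='y', metricName='accuracy')`.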