This project trains a decision tree model on Spark to predict a person's marital status from the other attributes in the census data.
It is based on the open adult.data dataset from the UCI Machine Learning Repository.
GitHub: AdultBase - Truedick23
name := "AdultBase"

version := "0.1"

scalaVersion := "2.11.8"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-mllib
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-mllib-local
libraryDependencies += "org.apache.spark" %% "spark-mllib-local" % "2.3.1"
// https://mvnrepository.com/artifact/org.scalanlp/breeze-viz
libraryDependencies += "org.scalanlp" %% "breeze-viz" % "0.13.2"
Since we are not writing this inside the Spark shell, we first need to create a SparkContext to read the data:
import org.apache.spark.SparkContext
val sc = new SparkContext("local[*]", "AdultData")
val raw_data = sc.textFile("./data/machine-learning-databases/adult.data")
val data = raw_data.map(line => line.split(", ")).filter(fields => fields.length == 15)
data.cache()
First we extract the values of each feature; the distinct function returns a dataset with duplicates removed:
val number_set = data.map(fields => fields(2).toInt).collect().toSet
val education_types = data.map(fields => fields(3)).distinct.collect()
val marriage_types = data.map(fields => fields(5)).distinct.collect()
val family_condition_types = data.map(fields => fields(7)).distinct.collect()
val occupation_category_types = data.map(fields => fields(1)).distinct.collect()
val occupation_types = data.map(fields => fields(6)).distinct.collect()
val racial_types = data.map(fields => fields(8)).distinct.collect()
val nationality_types = data.map(fields => fields(13)).distinct.collect()
println(marriage_types.length)
Define a function that builds a value-to-index map from an array of distinct values:
def acquireDict(types: Array[String]): Map[String, Int] = {
var idx = 0
var dict: Map[String, Int] = Map()
for (item <- types) {
dict += (item -> idx)
idx += 1
}
dict
}
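As a side note, the same mapping can be built more idiomatically with Scala's zipWithIndex; this sketch is equivalent to the acquireDict above:

```scala
// Pair each distinct value with its position, then convert to a Map.
// Produces the same value -> index dictionary as the loop version.
def acquireDict(types: Array[String]): Map[String, Int] =
  types.zipWithIndex.toMap
```

Because the input array comes from distinct, every key is unique, so no entries are lost in the conversion.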
Generate the feature-value mappings by calling this function:
val education_dict = acquireDict(education_types)
val marriage_dict = acquireDict(marriage_types)
val family_condition_dict = acquireDict(family_condition_types)
val occupation_category_dict = acquireDict(occupation_category_types)
val occupation_dict = acquireDict(occupation_types)
val racial_dict = acquireDict(racial_types)
val nationality_dict = acquireDict(nationality_types)
val sex_dict = Map("Male" -> 1, "Female" -> 0)
Construct LabeledPoint objects, which are the input format the decision tree expects:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data_set = data.map { fields =>
val number = fields(2).toInt
val education = education_dict(fields(3))
val marriage = marriage_dict(fields(5))
val family_condition = family_condition_dict(fields(7))
val occupation_category = occupation_category_dict(fields(1))
val occupation = occupation_dict(fields(6))
val sex = sex_dict(fields(9))
val race = racial_dict(fields(8))
val nationality = nationality_dict(fields(13))
val featureVector = Vectors.dense(education, occupation, occupation_category, sex, family_condition, race, nationality)
val label = marriage
  LabeledPoint(label, featureVector)
}
As the code shows, marital status is the class label, and the remaining attributes (education, occupation, occupation category, sex, family relationship, race, native country) are assembled into a Vector as the feature values. Let's print a few records to check the format:
data_set.take(10).foreach(println)
The first few results:
(3.0,[11.0,3.0,1.0,1.0,0.0,4.0,2.0])
(4.0,[11.0,11.0,4.0,1.0,4.0,4.0,2.0])
(2.0,[1.0,9.0,6.0,1.0,0.0,4.0,2.0])
(4.0,[4.0,9.0,6.0,1.0,4.0,3.0,2.0])
(4.0,[11.0,1.0,6.0,0.0,1.0,3.0,31.0])
We use the randomSplit function to split data_set randomly into three subsets, used for training, cross-validation, and testing respectively, and cache them in memory for convenience:
val Array(trainData, cvData, testData) = data_set.randomSplit(Array(0.8, 0.1, 0.1))
trainData.cache
cvData.cache
testData.cache
First we build the decision tree model. trainClassifier takes six arguments: the training data, the number of classes, a map marking which features are categorical (and how many categories each has), the impurity measure, the maximum tree depth, and the maximum number of bins. For a first training attempt we set the parameters as follows:
import org.apache.spark.mllib.tree.DecisionTree

val model = DecisionTree.
trainClassifier(trainData, 7, Map[Int, Int](), "entropy", 10, 100)
Next we build tuples of (prediction, true label) and use MulticlassMetrics to evaluate the model's accuracy on the cross-validation set:
import org.apache.spark.mllib.evaluation.MulticlassMetrics

val predictionsAndLabels = cvData.map(example =>
(model.predict(example.features), example.label)
)
val metrics = new MulticlassMetrics(predictionsAndLabels)
println(metrics.precision)
The accuracy:
0.8082788671023965
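Beyond this single overall number, MulticlassMetrics also offers a per-class view, which is useful for seeing which marriage categories the model confuses; a quick sketch using methods from the MulticlassMetrics API:

```scala
// Confusion matrix: rows are true labels, columns are predicted labels
println(metrics.confusionMatrix)

// Precision and recall for each marriage category individually
metrics.labels.foreach { l =>
  println(s"label $l: precision = ${metrics.precision(l)}, recall = ${metrics.recall(l)}")
}
```

A class with high precision but low recall (or vice versa) hints that the features are not separating that category well.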
Considering we only have a bit over thirty thousand records, this accuracy is already quite decent. Next we explore better parameter settings with a triple loop:
val evaluations =
for (impurity <- Array("gini", "entropy");
depth <- Array(1, 10, 25);
bins <- Array(10, 50, 150))
yield{
val _model = DecisionTree.
trainClassifier(trainData, 7, Map[Int, Int](), impurity, depth, bins)
val _predictionsAndLabels = cvData.map(example =>
(_model.predict(example.features), example.label)
)
val _accuracy = new MulticlassMetrics(_predictionsAndLabels).precision
((depth, bins, impurity), _accuracy)
}
evaluations.sortBy(_._2).reverse.foreach(println)
The results:
((10,150,entropy),0.8085365853658537)
((10,50,entropy),0.8085365853658537)
((10,10,entropy),0.8042682926829269)
((10,150,gini),0.8021341463414634)
((10,50,gini),0.8021341463414634)
((10,10,gini),0.8009146341463415)
((25,10,gini),0.7969512195121952)
((25,150,entropy),0.7957317073170732)
((25,50,entropy),0.7957317073170732)
((25,10,entropy),0.7942073170731707)
((25,150,gini),0.7905487804878049)
((25,50,gini),0.7905487804878049)
((1,150,entropy),0.7024390243902439)
((1,50,entropy),0.7024390243902439)
((1,10,entropy),0.7024390243902439)
((1,150,gini),0.7024390243902439)
((1,50,gini),0.7024390243902439)
((1,10,gini),0.7024390243902439)
We can see that the combination (10, 150, entropy) performs best. The accuracy is still modest, so we will try to improve the model later.
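One likely improvement: all seven features here are actually categorical, but we passed an empty Map, so the tree treats their indices as ordered continuous values. A hedged sketch of declaring them categorical instead, where each feature's arity is taken from the dictionaries built earlier (exact sizes depend on the dataset, and maxBins must be at least as large as the largest arity):

```scala
// Feature index -> number of categories, in the same order the
// feature vector was assembled: education, occupation, occupation
// category, sex, family relationship, race, native country.
val categoricalFeaturesInfo = Map(
  0 -> education_dict.size,
  1 -> occupation_dict.size,
  2 -> occupation_category_dict.size,
  3 -> sex_dict.size,
  4 -> family_condition_dict.size,
  5 -> racial_dict.size,
  6 -> nationality_dict.size
)

// Retrain with the best parameters found above; the tree can now
// split on unordered category subsets rather than index thresholds.
val improvedModel = DecisionTree.trainClassifier(
  trainData, 7, categoricalFeaturesInfo, "entropy", 10, 150)
```

Whether this actually helps should be checked against cvData the same way as before.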