Keywords: Spark MLlib, text classification, Naive Bayes
Text classification assigns a document to one or more predefined categories. A typical application on a data platform is to crawl the pages a user has browsed and identify the user's browsing preferences, enriching that user's profile.
This post uses the Naive Bayes algorithm provided by Spark MLlib to classify Chinese text, covering Chinese word segmentation, text representation (TF-IDF), model training, and prediction.
For Chinese text classification, the documents must be segmented into words first. I use the IKAnalyzer segmenter, which I introduced in an earlier post, 《中文分词工具-IKAnalyzer下载及使用》. You can configure extension dictionaries to make the segmentation more sensible; I downloaded cell dictionaries from Sogou and the Baidu IME and loaded them as extensions. Segmentation itself is not covered again here.
After segmentation, every word becomes a feature, but each word must be converted to a Double value; the usual choice is the word's TF-IDF score. Spark offers a comprehensive and convenient feature extraction and transformation API, documented at http://spark.apache.org/docs/latest/ml-features.html. The TF-IDF part of that API is introduced below.
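As a reminder of what those API calls compute: Spark's IDF uses the smoothed formula below (per the ml-features documentation), where $|D|$ is the total number of documents and $\mathrm{DF}(t,D)$ is the number of documents containing term $t$:

$$\mathrm{IDF}(t,D)=\log\frac{|D|+1}{\mathrm{DF}(t,D)+1},\qquad \mathrm{TFIDF}(t,d,D)=\mathrm{TF}(t,d)\cdot\mathrm{IDF}(t,D)$$

With the two-document toy corpus used below, a word appearing in only one document gets $\log(3/2)\approx 0.4055$, and a word appearing in both documents gets $\log(3/3)=0$, exactly the values that show up in the step4 output later.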
Machine learning algorithms usually involve many iterative steps and stop only once the error is small enough or the model has sufficiently converged. Running those iterations on Hadoop's MapReduce framework means paying for disk reads/writes and task startup on every pass, which causes heavy I/O and CPU overhead. Spark's in-memory computation model, by contrast, is built for iteration: consecutive steps complete directly in memory, touching disk and network only when necessary. That is why Spark is an ideal platform for machine learning.
MLlib (Machine Learning library) is Spark's implementation of common machine learning algorithms, together with related tests and data generators. Spark was designed from the outset to support iterative jobs, which fits the character of many machine learning algorithms. MLlib currently covers four common problem types: classification, regression, clustering, and collaborative filtering. Note that it exposes two APIs, the original RDD-based org.apache.spark.mllib package and the newer DataFrame-based org.apache.spark.ml package; the code below uses ml for the feature transformers and mllib for NaiveBayes and LabeledPoint.
Naive Bayes is a classification method based on Bayes' theorem plus the assumption that features are conditionally independent. Given a training set, it first learns the joint input/output probability distribution under that independence assumption; then, for a given input x, it uses Bayes' theorem to output the prediction y with the largest posterior probability. It is a generative model.
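In formula form, the conditional-independence assumption lets the posterior factorize over features, so the predicted class is

$$\hat{y}=\arg\max_{c}\;P(y=c)\prod_{i=1}^{n}P(x_i\mid y=c),$$

which is essentially what MLlib's multinomial NaiveBayes evaluates (in log space) over the TF-IDF features built below.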
For example, the training corpus C:\\Users\\yyz\\Downloads\\test_tokenizer\\1.txt:
0,苹果 官网 苹果 宣布
1,苹果 梨 香蕉
In each line, the first comma-separated column is the category label: 0 means technology, 1 means fruit.
- import TestNaiveBayes.RawDataRecord
- import org.apache.log4j.{Level, Logger}
- import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
- import org.apache.spark.mllib.linalg.Vectors
- import org.apache.spark.mllib.regression.LabeledPoint
- import org.apache.spark.sql.Row
- import org.apache.spark.{SparkConf, SparkContext}
-
- /**
- * @class TestNB
- * @author yyz
- * @date 2021/06/20 17:15
- * */
- object TestNB {
- def main(args : Array[String]) {
- Logger.getLogger("org").setLevel(Level.OFF)
-
- val conf = new SparkConf().setMaster("local").setAppName("local_testNB")
- val sc = new SparkContext(conf)
- val sqlContext = new org.apache.spark.sql.SQLContext(sc)
- import sqlContext.implicits._
-
- //Map the raw data into a DataFrame: column "category" is the class label, column "text" holds the segmented words separated by spaces
- println("step1: read the data and split it into columns")
- var srcRDD = sc.textFile("C:\\Users\\yyz\\Downloads\\test_tokenizer\\1.txt").map {
- x =>
- var data = x.split(",")
- RawDataRecord(data(0), data(1))
- }
-
- var srcDF = srcRDD.toDF()
- srcDF.select("category", "text").take(2).foreach(println)
-
-
- //Turn the segmented text into arrays of words
- println("step2: turn the segmented text into word arrays and show the result")
- var tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
- var wordsData = tokenizer.transform(srcDF)
- wordsData.select($"category",$"text",$"words").take(2).foreach(println)
-
-
- //Map each word to an integer index and compute its term frequency (TF) per document
- println("step3: compute TF values and show them")
- var hashingTF =
- new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(100)
- var featurizedData = hashingTF.transform(wordsData)
-
- featurizedData.select($"category", $"words", $"rawFeatures").take(2).foreach(println)
-
-
- //Compute TF-IDF values
- println("step4: compute TF-IDF values and show them")
- var idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
- var idfModel = idf.fit(featurizedData)
- var rescaledData = idfModel.transform(featurizedData)
- rescaledData.select($"category", $"words", $"features").take(2).foreach(println)
-
- rescaledData.select($"category",$"features").printSchema()
-
- //Convert the data above into the format the NaiveBayes algorithm expects
- //https://github.com/apache/spark/blob/branch-1.5/data/mllib/sample_naive_bayes_data.txt
- println("step5: convert the data into the format the NaiveBayes algorithm expects and print a sample")
- var trainDataRdd = rescaledData.select($"category",$"features").map {
- row => LabeledPoint(
- row.getAs[String]("category").toDouble,
- Vectors.dense(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features").toArray))
- // case Row(label: String, features: Vector) =>
- // LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
- }
-
- trainDataRdd.show()
- trainDataRdd.collect().foreach(println)
-
- trainDataRdd.printSchema()
- }
- }
- step1: read the data and split it into columns
- [0,苹果 官网 苹果 宣布]
- [1,苹果 梨 香蕉]
- step2: turn the segmented text into word arrays and show the result
- [0,苹果 官网 苹果 宣布,WrappedArray(苹果, 官网, 苹果, 宣布)]
- [1,苹果 梨 香蕉,WrappedArray(苹果, 梨, 香蕉)]
- step3: compute TF values and show them
- [0,WrappedArray(苹果, 官网, 苹果, 宣布),(100,[8,60,85],[1.0,2.0,1.0])]
- [1,WrappedArray(苹果, 梨, 香蕉),(100,[29,60,65],[1.0,1.0,1.0])]
- step4: compute TF-IDF values and show them
- [0,WrappedArray(苹果, 官网, 苹果, 宣布),(100,[8,60,85],[0.4054651081081644,0.0,0.4054651081081644])]
- [1,WrappedArray(苹果, 梨, 香蕉),(100,[29,60,65],[0.4054651081081644,0.0,0.4054651081081644])]
- root
- |-- category: string (nullable = true)
- |-- features: vector (nullable = true)
-
- step5: convert the data into the format the NaiveBayes algorithm expects and print a sample
- +-----+--------------------+
- |label| features|
- +-----+--------------------+
- | 0.0|[0.0,0.0,0.0,0.0,...|
- | 1.0|[0.0,0.0,0.0,0.0,...|
- +-----+--------------------+
-
- (0.0,[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4054651081081644,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4054651081081644,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])
- (1.0,[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4054651081081644,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.4054651081081644,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])
- root
- |-- label: double (nullable = false)
- |-- features: vector (nullable = true)
-
-
- Process finished with exit code 0
The hashing trick used here to turn Chinese words into integer indices works much like a Bloom filter. setNumFeatures(100) above sets the number of hash buckets to 100; the default is 2^20 = 1,048,576. Tune it to your vocabulary size: the larger the value, the smaller the probability that two different words hash to the same bucket and the more accurate the features, at the cost of more memory; the trade-off is the same as with a Bloom filter. (A small collision-checking sketch follows the example below.)
- featurizedData.select($"category", $"words", $"rawFeatures").take(2).foreach(println)
- [0,WrappedArray(苹果, 官网, 苹果, 宣布),(100,[8,60,85],[1.0,2.0,1.0])]
- [1,WrappedArray(苹果, 梨, 香蕉),(100,[29,60,65],[1.0,1.0,1.0])]
In this output, "苹果" is represented by index 60: its term frequency is 2 in the first document and 1 in the second.
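If you want to check how crowded the hash space is, one rough approach (a sketch, not part of the original program; it assumes the wordsData DataFrame and the implicits from step2) is to compare the number of distinct terms against the number of distinct hash buckets actually used. If terms outnumber buckets, at least two words collided:

- import org.apache.spark.ml.feature.HashingTF
- import org.apache.spark.ml.linalg.SparseVector
- import org.apache.spark.sql.functions.explode
-
- val tf100 = new HashingTF().setInputCol("words").setOutputCol("tf").setNumFeatures(100)
- //distinct terms in the corpus
- val distinctTerms = wordsData.select(explode($"words")).distinct().count()
- //distinct hash indices that received at least one term
- val usedBuckets = tf100.transform(wordsData).select($"tf").rdd
-   .flatMap(row => row.getAs[SparseVector]("tf").indices)
-   .distinct().count()
- println(s"distinct terms = $distinctTerms, hash buckets used = $usedBuckets")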
In each LabeledPoint the feature array has length 100 (from setNumFeatures(100)). "官网" and "宣布" were hashed to feature indices 8 and 85, so positions 8 and 85 of the first document's array hold their TF-IDF values; "苹果" (index 60) ends up with a TF-IDF of 0 because it appears in every document.
At this point the feature representation of the Chinese text is complete, and trainDataRdd is ready to serve as input to the NaiveBayes algorithm.
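To make the format concrete, the first training example is equivalent to building this by hand (a sketch; indices and values are copied from the step4 output, using the same mllib classes imported above):

- //label 0.0 ("technology"); 100 features with TF-IDF mass at indices 8 and 85
- val first = LabeledPoint(0.0,
-   Vectors.sparse(100, Array(8, 60, 85),
-     Array(0.4054651081081644, 0.0, 0.4054651081081644)).toDense)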
For model training the corpus matters a great deal. I used the categorized corpus published by Sogou Labs; it is quite old and is used here only for learning and testing.
The preprocessed files read by the program (the sougou directory) are available for download:
Link: https://pan.baidu.com/s/1qYWWK48  Password: qrcj
The original corpus comes from http://www.sogou.com/labs/dl/c.html and contains 10 categories:
C000007 Automobile
C000008 Finance
C000010 IT
C000013 Health
C000014 Sports
C000016 Travel
C000020 Education
C000022 Recruitment
C000023 Culture
C000024 Military
Each category holds a few thousand documents. I segmented this corpus and generated one file per category; each line in a file is the segmentation result of one document. The 10 categories were relabeled 0 through 9 (a sketch of this preprocessing follows the list below):
0 Automobile
1 Finance
2 IT
3 Health
4 Sports
5 Travel
6 Education
7 Recruitment
8 Culture
9 Military
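I did that preprocessing offline; a minimal sketch of the idea is below. The paths and the tokenize() helper are placeholders (tokenization was done with IKAnalyzer as described earlier), so treat this as an outline rather than the exact script that was run:

- //Map each Sogou category code to its 0-9 label and emit one
- //"label,tokenized text" line per document
- val labelOf = Map(
-   "C000007" -> 0, "C000008" -> 1, "C000010" -> 2, "C000013" -> 3,
-   "C000014" -> 4, "C000016" -> 5, "C000020" -> 6, "C000022" -> 7,
-   "C000023" -> 8, "C000024" -> 9)
-
- def tokenize(content: String): String = ??? //placeholder: IKAnalyzer segmentation, words joined by spaces
-
- labelOf.foreach { case (code, label) =>
-   sc.wholeTextFiles(s"/path/to/sogou_raw/$code/*.txt")   //placeholder path
-     .map { case (_, content) => s"$label,${tokenize(content)}" }
-     .saveAsTextFile(s"/path/to/sougou/$label")            //placeholder path
- }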
For example, each line in the automobile category's file is one tokenized document, in the same "label,tokenized text" layout as the toy corpus above.
With the data ready, we can move on to model training and prediction.
- import org.apache.log4j.{Level, Logger}
- import org.apache.spark.SparkConf
- import org.apache.spark.SparkContext
- import org.apache.spark.ml.feature.HashingTF
- import org.apache.spark.ml.feature.IDF
- import org.apache.spark.ml.feature.Tokenizer
- import org.apache.spark.mllib.classification.NaiveBayes
- import org.apache.spark.mllib.linalg.Vectors
- import org.apache.spark.mllib.regression.LabeledPoint
-
- /**
- * @class TestNaiveBayes
- * @author yyz
- * @date 2021/06/19 10:31
- * */
- object TestNaiveBayes {
- case class RawDataRecord(category: String, text: String)
-
- def main(args : Array[String]) {
- Logger.getLogger("org").setLevel(Level.OFF)
-
-
- val conf = new SparkConf().setMaster("local").setAppName("local_testNB")
- val sc = new SparkContext(conf)
-
- val sqlContext = new org.apache.spark.sql.SQLContext(sc)
- import sqlContext.implicits._
-
- var srcRDD = sc.textFile("C:\\Users\\yyz\\Downloads\\sougou").map {
- x =>
- var data = x.split(",")
- RawDataRecord(data(0),data(1))
- }
-
- //70% of the data for training, 30% for testing
- val splits = srcRDD.randomSplit(Array(0.7, 0.3))
- var trainingDF = splits(0).toDF()
- var testDF = splits(1).toDF()
-
- //Turn the segmented text into arrays of words
- var tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
- var wordsData = tokenizer.transform(trainingDF)
- println("output1:(将词语转换成数组)")
- println(wordsData.select($"category",$"text",$"words").take(2).mkString("#"))
-
- //Compute each word's term frequency per document
- var hashingTF = new HashingTF().setNumFeatures(500000).setInputCol("words").setOutputCol("rawFeatures")
- var featurizedData = hashingTF.transform(wordsData)
- println("output2:(计算每个词在文档中的词频)")
- println(featurizedData.select($"category", $"words", $"rawFeatures").take(2).mkString("#"))
-
-
- //Compute each word's TF-IDF
- var idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
- var idfModel = idf.fit(featurizedData)
- var rescaledData = idfModel.transform(featurizedData)
- println("output3:(计算每个词的TF-IDF)")
- println(rescaledData.select($"category", $"features").take(2).mkString("#"))
-
- rescaledData.select($"category",$"features").show(false)
-
- // rescaledData.select($"category",$"features").rdd.foreach(println)
- println(rescaledData.select($"category",$"features").schema)
-
- rescaledData.select($"category",$"features").printSchema()
- println("****************************************")
-
-
- //Convert into the NaiveBayes input format
- // var trainDataRdd = rescaledData.select($"category",$"features").map {
- // case Row(label: String, features: Vector) =>
- // LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
- // }
- var trainDataRdd = rescaledData.select($"category",$"features").map {
- row => LabeledPoint(
- row.getAs[String]("category").toDouble,
- Vectors.dense(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features").toArray))
- }
- println("output4:训练数据集样例")
- trainDataRdd.show()
-
-
- //Apply the same feature transformations and format conversion to the test set
- var testwordsData = tokenizer.transform(testDF)
- var testfeaturizedData = hashingTF.transform(testwordsData)
- var testrescaledData = idfModel.transform(testfeaturizedData)
- var testDataRdd = testrescaledData.select($"category",$"features").map {
- row => LabeledPoint(
- row.getAs[String]("category").toDouble,
- Vectors.dense(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features").toArray))
- }
- println("output5:测试数据集样例")
- testDataRdd.show()
-
- //Train the model
- println("output6: train the Naive Bayes model on the training set")
- val model = NaiveBayes.train(trainDataRdd.rdd, lambda = 1.0, modelType = "multinomial")
-
- //Use the trained model to classify the test set
- println("output7: evaluate the trained model on the test set")
- val testpredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))
-
- //Compute classification accuracy
- println("output8: accuracy of the Naive Bayes model on the test set")
- var testaccuracy = 1.0 * testpredictionAndLabel.filter(x => x._1 == x._2).count() / testDataRdd.count()
- println(testaccuracy)
-
- }
-
- }
- output1: (turn the segmented text into word arrays)
- [0,昨天下午 北四环 二里庄 南口 一辆 726 公交 刹车 滑行 20米 重重 在前 一班 726 公交 屁股 前车 进站 停靠 一位 乘客 飞溅 玻璃 破脸 乘客 司机 斗气 车速 过快 导致 追尾 王女士 下午 点多 五道口 上了 726 公交 坐在 司机 后方 王女士 发现 跑得 很快 超过 车辆 发现 司机 一辆 656 公交 斗气 王女士 几名 乘客 出声 提醒 减速 司机 理会 王女士 这辆 726 公交车 车牌 号为 ab2032 临近 二里庄 南口 656 公交 总算 分道扬镳 车速 很快 司机 车站 公交车 停靠 不住 撞了 下午 5点 记者 来到 726 公交 二里庄 南口 停车 管理员 何先生 指着 一地 碎玻璃 前车 乘客 坐在 后排 脸上 玻璃 划了 几处 口子 年轻人 下车 捂着 726 公交 搭载 2位 受伤 男子 去了 医院 记者 注意到 后车 刹车 长达 20米 之多 末端 向左 痕迹 下午 点多 726 车队 相关 负责人 承认 追尾 事件 追尾 下过 所致 ,WrappedArray(昨天下午, 北四环, 二里庄, 南口, 一辆, 726, 公交, 刹车, 滑行, 20米, 重重, 在前, 一班, 726, 公交, 屁股, 前车, 进站, 停靠, 一位, 乘客, 飞溅, 玻璃, 破脸, 乘客, 司机, 斗气, 车速, 过快, 导致, 追尾, 王女士, 下午, 点多, 五道口, 上了, 726, 公交, 坐在, 司机, 后方, 王女士, 发现, 跑得, 很快, 超过, 车辆, 发现, 司机, 一辆, 656, 公交, 斗气, 王女士, 几名, 乘客, 出声, 提醒, 减速, 司机, 理会, 王女士, 这辆, 726, 公交车, 车牌, 号为, ab2032, 临近, 二里庄, 南口, 656, 公交, 总算, 分道扬镳, 车速, 很快, 司机, 车站, 公交车, 停靠, 不住, 撞了, 下午, 5点, 记者, 来到, 726, 公交, 二里庄, 南口, 停车, 管理员, 何先生, 指着, 一地, 碎玻璃, 前车, 乘客, 坐在, 后排, 脸上, 玻璃, 划了, 几处, 口子, 年轻人, 下车, 捂着, 726, 公交, 搭载, 2位, 受伤, 男子, 去了, 医院, 记者, 注意到, 后车, 刹车, 长达, 20米, 之多, 末端, 向左, 痕迹, 下午, 点多, 726, 车队, 相关, 负责人, 承认, 追尾, 事件, 追尾, 下过, 所致)]#[0,昨天下午 3点 机场 五元 桥底下 一辆 973路 公交车 桥洞 发生 撞倒 路边 一根 又将 一辆 三轮车 公交 车队 工作人员 介绍 事故 下雨 目击者 973路 公车 由东向西 拐过 五元 桥洞 逆行 车道 滑去 撞上 路边 一根 一辆 三轮车 顶到 一辆 909路 公交车 停下 三轮 车主 倒在 身上 流出 下午 4点 车牌 g31316 973路 公交车 停在 由西向东 道上 车前 挡风玻璃 破裂 左侧 一根 10米 路灯 连根拔起 车头 左侧 一辆 装有 马达 三轮车 变形 车上 留有 血迹 公交车 尾部 一条 5米 刹车 下午 5点 酒仙桥医院 受伤 三轮车 车主 接受 治疗 医生 伤者 右腿 骨折 生命危险 973路 车队 一名 工作人员 下雨 路面 司机 拐弯 发生 车速 三十 事故 发生后 上将 伤者 送到 医院 救治 付了 医疗 费用 受伤 住院治疗 ,WrappedArray(昨天下午, 3点, 机场, 五元, 桥底下, 一辆, 973路, 公交车, 桥洞, 发生, 撞倒, 路边, 一根, 又将, 一辆, 三轮车, 公交, 车队, 工作人员, 介绍, 事故, 下雨, 目击者, 973路, 公车, 由东向西, 拐过, 五元, 桥洞, 逆行, 车道, 滑去, 撞上, 路边, 一根, 一辆, 三轮车, 顶到, 一辆, 909路, 公交车, 停下, 三轮, 车主, 倒在, 身上, 流出, 下午, 4点, 车牌, g31316, 973路, 公交车, 停在, 由西向东, 道上, 车前, 挡风玻璃, 破裂, 左侧, 一根, 10米, 路灯, 连根拔起, 车头, 左侧, 一辆, 装有, 马达, 三轮车, 变形, 车上, 留有, 血迹, 公交车, 尾部, 一条, 5米, 刹车, 下午, 5点, 酒仙桥医院, 受伤, 三轮车, 车主, 接受, 治疗, 医生, 伤者, 右腿, 骨折, 生命危险, 973路, 车队, 一名, 工作人员, 下雨, 路面, 司机, 拐弯, 发生, 车速, 三十, 事故, 发生后, 上将, 伤者, 送到, 医院, 救治, 付了, 医疗, 费用, 受伤, 住院治疗)]
- output2: (compute each word's term frequency per document)
- [0,WrappedArray(昨天下午, 北四环, 二里庄, 南口, 一辆, 726, 公交, 刹车, 滑行, 20米, 重重, 在前, 一班, 726, 公交, 屁股, 前车, 进站, 停靠, 一位, 乘客, 飞溅, 玻璃, 破脸, 乘客, 司机, 斗气, 车速, 过快, 导致, 追尾, 王女士, 下午, 点多, 五道口, 上了, 726, 公交, 坐在, 司机, 后方, 王女士, 发现, 跑得, 很快, 超过, 车辆, 发现, 司机, 一辆, 656, 公交, 斗气, 王女士, 几名, 乘客, 出声, 提醒, 减速, 司机, 理会, 王女士, 这辆, 726, 公交车, 车牌, 号为, ab2032, 临近, 二里庄, 南口, 656, 公交, 总算, 分道扬镳, 车速, 很快, 司机, 车站, 公交车, 停靠, 不住, 撞了, 下午, 5点, 记者, 来到, 726, 公交, 二里庄, 南口, 停车, 管理员, 何先生, 指着, 一地, 碎玻璃, 前车, 乘客, 坐在, 后排, 脸上, 玻璃, 划了, 几处, 口子, 年轻人, 下车, 捂着, 726, 公交, 搭载, 2位, 受伤, 男子, 去了, 医院, 记者, 注意到, 后车, 刹车, 长达, 20米, 之多, 末端, 向左, 痕迹, 下午, 点多, 726, 车队, 相关, 负责人, 承认, 追尾, 事件, 追尾, 下过, 所致),(500000,[1296,3390,12567,18147,18317,21709,38692,46566,49180,54504,55987,56219,58510,63631,71744,75805,89198,89407,91372,94813,105958,111754,112759,112894,118541,121199,148264,153559,154087,161703,162204,166211,170490,180565,194689,194758,195120,201318,204097,204802,209623,210585,217798,223075,223241,231720,237554,247599,255591,259029,259093,265117,271928,289921,291931,293855,296009,306001,306005,307745,317246,317558,331245,338068,346693,353701,355940,360047,361336,366501,369213,369450,371230,371544,374335,377757,380582,382794,385884,392309,409203,416762,436888,440476,454285,463234,466443,467066,469676,478824,486902,488240,492343,498071],[1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,4.0,3.0,1.0,2.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,7.0,1.0,3.0,1.0,2.0,2.0,1.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,1.0,1.0,7.0,4.0,1.0])]#[0,WrappedArray(昨天下午, 3点, 机场, 五元, 桥底下, 一辆, 973路, 公交车, 桥洞, 发生, 撞倒, 路边, 一根, 又将, 一辆, 三轮车, 公交, 车队, 工作人员, 介绍, 事故, 下雨, 目击者, 973路, 公车, 由东向西, 拐过, 五元, 桥洞, 逆行, 车道, 滑去, 撞上, 路边, 一根, 一辆, 三轮车, 顶到, 一辆, 909路, 公交车, 停下, 三轮, 车主, 倒在, 身上, 流出, 下午, 4点, 车牌, g31316, 973路, 公交车, 停在, 由西向东, 道上, 车前, 挡风玻璃, 破裂, 左侧, 一根, 10米, 路灯, 连根拔起, 车头, 左侧, 一辆, 装有, 马达, 三轮车, 变形, 车上, 留有, 血迹, 公交车, 尾部, 一条, 5米, 刹车, 下午, 5点, 酒仙桥医院, 受伤, 三轮车, 车主, 接受, 治疗, 医生, 伤者, 右腿, 骨折, 生命危险, 973路, 车队, 一名, 工作人员, 下雨, 路面, 司机, 拐弯, 发生, 车速, 三十, 事故, 发生后, 上将, 伤者, 送到, 医院, 救治, 付了, 医疗, 费用, 受伤, 住院治疗),(500000,[14684,23020,28472,37616,50673,64428,83064,85891,85935,89831,101129,103023,112086,112894,117415,121199,126612,136872,140043,154087,156233,161703,161806,162204,165819,166049,169260,179445,183814,189061,194722,202127,204097,206359,212909,219214,222404,223241,238709,247629,251151,254347,255591,258087,265432,276069,276900,284767,285702,295891,296009,296108,296444,303945,320885,321352,326751,331245,338068,340864,341901,348484,360047,365147,365174,386812,388022,390741,391341,391487,396487,402754,403756,410303,419182,433404,436888,440718,441264,445257,449430,452052,456858,468856,469507,480510,499548],[1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,3.0,1.0,1.0,2.0,1.0,1.0,1.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])]
- output3: (compute each word's TF-IDF)
- [0,(500000,[1296,3390,12567,18147,18317,21709,38692,46566,49180,54504,55987,56219,58510,63631,71744,75805,89198,89407,91372,94813,105958,111754,112759,112894,118541,121199,148264,153559,154087,161703,162204,166211,170490,180565,194689,194758,195120,201318,204097,204802,209623,210585,217798,223075,223241,231720,237554,247599,255591,259029,259093,265117,271928,289921,291931,293855,296009,306001,306005,307745,317246,317558,331245,338068,346693,353701,355940,360047,361336,366501,369213,369450,371230,371544,374335,377757,380582,382794,385884,392309,409203,416762,436888,440476,454285,463234,466443,467066,469676,478824,486902,488240,492343,498071],[6.2766987730854735,6.936010092407153,3.321482168425204,4.679095318298376,4.5403691087967495,4.951029033782018,5.421032663027753,8.416764936581744,5.554564055652276,3.7369512075978872,6.912687539805471,4.092632280326765,5.851815579120208,6.470854787526431,3.0095931651216254,4.799112991756176,6.712016844343319,12.941709575052862,5.2921997911847845,6.165473137975249,4.604562266435809,3.5297402354568645,6.712016844343319,3.2928009571784855,2.2704356789128473,4.49479160030043,2.542536699611783,26.50002186941476,9.069412170688148,5.207939447567045,9.480928529349336,8.704447009033526,5.421032663027753,6.507222431697306,16.116727496574967,8.416764936581744,2.703857554148543,7.238109940240098,4.5613122826419925,4.079474195749254,5.826935978694339,6.433775285693921,6.219540359245525,4.369337294147395,33.54696802523766,4.208347918099796,19.521667295091916,3.3338089809058626,5.8432959881182,10.842065326055506,2.033258301697739,5.512599856553244,13.424033688686638,26.113341027100578,4.2347147939405385,6.584183472833434,3.868165102082047,7.500474204707589,4.2657250306830985,6.306551736235154,4.193587502516675,8.120112219784305,8.46181053104774,8.765048596858698,3.908656463436784,6.219540359245525,6.584183472833434,17.38454912273268,5.568952793104375,4.644003998487106,1.7549101960364333,5.97441790121254,2.6531424615695265,5.408610143029196,4.326595745770124,4.519855568963647,3.1723759120592634,7.500474204707589,5.952911695991577,2.2193029969945233,3.138650277351227,4.484939303857419,5.54037942066032,5.52639317868558,4.437083282679784,12.037739327566747,18.21982423428338,4.974745560399334,2.6967600233492797,6.470854787526431,5.583551592525528,60.93112906323468,15.368215832313446,6.165473137975249])]#[0,(500000,[14684,23020,28472,37616,50673,64428,83064,85891,85935,89831,101129,103023,112086,112894,117415,121199,126612,136872,140043,154087,156233,161703,161806,162204,165819,166049,169260,179445,183814,189061,194722,202127,204097,206359,212909,219214,222404,223241,238709,247629,251151,254347,255591,258087,265432,276069,276900,284767,285702,295891,296009,296108,296444,303945,320885,321352,326751,331245,338068,340864,341901,348484,360047,365147,365174,386812,388022,390741,391341,391487,396487,402754,403756,410303,419182,433404,436888,440718,441264,445257,449430,452052,456858,468856,469507,480510,499548],[4.649767703203856,11.288352428683925,15.447235512043598,5.795726112469164,1.9513981198123695,9.10991211714169,3.4061296424854888,6.969845953645419,6.19214138505741,6.807327024147644,13.71724063707039,4.519571901992378,3.402801852392814,3.2928009571784855,7.209161162418654,8.98958320060086,14.3178641289276,5.613404555675209,3.480494057774351,6.0462747804587655,2.0357953009443275,5.207939447567045,5.337151179047051,18.961857058698673,10.151342957978589,5.777707606966485,3.9031619441191436,5.644176214341963,5.911238999591008,6.858620318535195,6.807327024147644,32.04519931389432,4.5613122826419925,4.62127574740955,12.037739327566747,2.8735425269379853,5.30324962737137,4.79242400360538,5.6599245713101025,8.71264385207065,6.308149495353717,6.62500546735369,14.6082399702955,7.405164024903264,4.632575302663483,4.465521218000317,4.293670961073658,4.868585364570944,5.675924912656543,4.999038252968378,7.736330204164094,2.9073765999537673,5.30324962737137,6.758536859978212,6.711506870320844,6.089487230997327,25.349293579607632,4.23090526552387,4.382524298429349,5.408610143029196,5.832767384149513,8.704447009033526,3.476909824546536,4.990874942329217,5.628672027805998,8.193621385267535,4.347738182343933,8.704447009033526,6.507222431697306,5.814075251137361,6.62500546735369,5.911238999591008,5.093529096389301,2.9003165605430423,6.19214138505741,4.951029033782018,5.54037942066032,2.8027248374264047,5.568952793104375,4.386958895497215,4.2657250306830985,6.758536859978212,4.446473023029623,8.193621385267535,5.52639317868558,3.824173532800513,5.675924912656543])]
- +--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
- |category|features |
- +--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
- |0 |(500000,[1296,3390,12567,18147,18317,21709,38692,46566,49180,54504,55987,56219,58510,63631,71744,75805,89198,89407,91372,94813,105958,111754,112759,112894,118541,121199,148264,153559,154087,161703,162204,166211,170490,180565,194689,194758,195120,201318,204097,204802,209623,210585,217798,223075,223241,231720,237554,247599,255591,259029,259093,265117,271928,289921,291931,293855,296009,306001,306005,307745,317246,317558,331245,338068,346693,353701,355940,360047,361336,366501,369213,369450,371230,371544,374335,377757,380582,382794,385884,392309,409203,416762,436888,440476,454285,463234,466443,467066,469676,478824,486902,488240,492343,498071],[6.2766987730854735,6.936010092407153,3.321482168425204,4.679095318298376,4.5403691087967495,4.951029033782018,5.421032663027753,8.416764936581744,5.554564055652276,3.7369512075978872,6.912687539805471,4.092632280326765,5.851815579120208,6.470854787526431,3.0095931651216254,4.799112991756176,6.712016844343319,12.941709575052862,5.2921997911847845,6.165473137975249,4.604562266435809,3.5297402354568645,6.712016844343319,3.2928009571784855,2.2704356789128473,4.49479160030043,2.542536699611783,26.50002186941476,9.069412170688148,5.207939447567045,9.480928529349336,8.704447009033526,5.421032663027753,6.507222431697306,16.116727496574967,8.416764936581744,2.703857554148543,7.238109940240098,4.5613122826419925,4.079474195749254,5.826935978694339,6.433775285693921,6.219540359245525,4.369337294147395,33.54696802523766,4.208347918099796,19.521667295091916,3.3338089809058626,5.8432959881182,10.842065326055506,2.033258301697739,5.512599856553244,13.424033688686638,26.113341027100578,4.2347147939405385,6.584183472833434,3.868165102082047,7.500474204707589,4.2657250306830985,6.306551736235154,4.193587502516675,8.120112219784305,8.46181053104774,8.765048596858698,3.908656463436784,6.219540359245525,6.584183472833434,17.38454912273268,5.568952793104375,4.644003998487106,1.7549101960364333,5.97441790121254,2.6531424615695265,5.408610143029196,4.326595745770124,4.519855568963647,3.1723759120592634,7.500474204707589,5.952911695991577,2.2193029969945233,3.138650277351227,4.484939303857419,5.54037942066032,5.52639317868558,4.437083282679784,12.037739327566747,18.21982423428338,4.974745560399334,2.6967600233492797,6.470854787526431,5.583551592525528,60.93112906323468,15.368215832313446,6.165473137975249])|
- +--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
- only showing top 1 row
-
- StructType(StructField(category,StringType,true), StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
- root
- |-- category: string (nullable = true)
- |-- features: vector (nullable = true)
-
- ****************************************
- output4: training set sample
- +-----+--------------------+
- |label| features|
- +-----+--------------------+
- | 0.0|[0.0,0.0,0.0,0.0,...|
- +-----+--------------------+
- only showing top 1 row
-
- output5: test set sample
- +-----+--------------------+
- |label| features|
- +-----+--------------------+
- | 0.0|[0.0,0.0,0.0,0.0,...|
- +-----+--------------------+
- only showing top 1 row
-
- output6: train the Naive Bayes model on the training set
- 21/06/20 18:12:42 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
- 21/06/20 18:12:42 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
- output7: evaluate the trained model on the test set
- output8: accuracy of the Naive Bayes model on the test set
- 0.901815392482741
-
- Process finished with exit code 0
90% accuracy: not bad. The next step is to collect more finely categorized and more recent data for training and testing.
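With 10 classes, overall accuracy can hide weak classes. A possible follow-up (a sketch, not part of the original program) is to feed the testpredictionAndLabel pairs into mllib's MulticlassMetrics for per-class precision and recall, and to persist the trained model for reuse; the save path here is a placeholder:

- import org.apache.spark.mllib.evaluation.MulticlassMetrics
- import org.apache.spark.mllib.classification.NaiveBayesModel
-
- //per-class precision and recall from the (prediction, label) pairs
- val metrics = new MulticlassMetrics(testpredictionAndLabel.rdd)
- metrics.labels.foreach { l =>
-   println(s"class $l  precision=${metrics.precision(l)}  recall=${metrics.recall(l)}")
- }
-
- //persist and reload the model (placeholder path)
- model.save(sc, "C:\\Users\\yyz\\Downloads\\nb_model")
- val reloaded = NaiveBayesModel.load(sc, "C:\\Users\\yyz\\Downloads\\nb_model")

Finally, for reference, the relevant sections of the project's pom.xml: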
- <properties>
- <maven.compiler.source>1.8</maven.compiler.source>
- <maven.compiler.target>1.8</maven.compiler.target>
- <scala.version>2.11.8</scala.version>
- <spark.version>2.3.4</spark.version>
- <scope>compile</scope>
- </properties>
-
- <dependency>
- <groupId>org.scala-lang</groupId>
- <artifactId>scala-library</artifactId>
- <version>2.11.8</version>
- </dependency>
- <dependency>
- <groupId>org.apache.spark</groupId>
- <artifactId>spark-core_2.11</artifactId>
- <exclusions>
- <exclusion>
- <groupId>com.google.guava</groupId>
- <artifactId>guava</artifactId>
- </exclusion>
- </exclusions>
- <version>2.3.4</version>
- <scope>${scope}</scope>
- </dependency>
- <dependency>
- <groupId>com.google.guava</groupId>
- <artifactId>guava</artifactId>
- <version>29.0-jre</version>
- </dependency>
- <dependency>
- <groupId>org.apache.spark</groupId>
- <artifactId>spark-sql_2.11</artifactId>
- <version>2.3.4</version>
- <scope>${scope}</scope>
- </dependency>
- <dependency>
- <groupId>org.apache.spark</groupId>
- <artifactId>spark-hive_2.11</artifactId>
- <version>2.3.4</version>
- <scope>${scope}</scope>
- </dependency>
- <!-- 20210618 add-->
- <dependency>
- <groupId>org.apache.spark</groupId>
- <artifactId>spark-mllib_2.11</artifactId>
- <version>${spark.version}</version>
- </dependency>
- <!-- 20210618 add-->