The ML API uses DataFrames from Spark SQL as its dataset abstraction. A DataFrame can hold a variety of data types, such as text, feature vectors, true labels, and predictions.
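As a minimal sketch (assuming an existing SparkSession named `spark`, the same as in the examples below), a single DataFrame can mix string, numeric, and vector-valued columns:

```scala
import org.apache.spark.ml.linalg.Vectors

// One DataFrame holding text, a numeric label, and a feature vector side by side.
val df = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0, Vectors.dense(0.0, 1.1)),
  (1L, "hadoop mapreduce", 0.0, Vectors.dense(2.0, 1.0))
)).toDF("id", "text", "label", "features")

df.printSchema()  // shows long, string, double, and vector column types
```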
A Transformer is an algorithm that transforms one DataFrame into another. For example, a model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.
Transformers include feature transformers and learned models. A Transformer implements the transform() method, which converts one DataFrame into another, generally by appending one or more columns.
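For instance, a feature transformer such as Tokenizer (also used in the Pipeline example later) appends a new column when transform() is called. A minimal sketch, again assuming a SparkSession named `spark`:

```scala
import org.apache.spark.ml.feature.Tokenizer

val docs = spark.createDataFrame(Seq(
  (0L, "a b c d e spark")
)).toDF("id", "text")

// transform() does not modify the input DataFrame; it returns a new one
// with an extra "words" column appended.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
tokenizer.transform(docs).show(false)
```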
An Estimator produces a Transformer from a DataFrame. For example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model. An Estimator implements the fit() method.
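A minimal sketch of the fit()/transform() relationship (the first full example below covers this in much more detail):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0))
)).toDF("label", "features")

// fit() is the Estimator side; the returned model is a Transformer,
// so it can be applied to new data with transform().
val model = new LogisticRegression().setMaxIter(10).fit(training)
val scored = model.transform(training)  // appends prediction-related columns
```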
A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
For example, simple text document processing requires the following steps:

- Split each document's text into words.
- Convert each document's words into a numerical feature vector.
- Learn a prediction model using the feature vectors and labels.
Together these steps form an ML workflow, i.e., a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) run in a specific order.
The following example demonstrates the Estimator, Transformer, and Param concepts: it trains a LogisticRegression model, then overrides its parameters with a ParamMap and retrains.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Create a LogisticRegression instance. This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println(s"LogisticRegression parameters:\n ${lr.explainParams()}\n")

// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model. This uses the parameters stored in lr.
val model1 = lr.fit(training)
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs for this
// LogisticRegression instance.
println(s"Model 1 was fit using parameters: ${model1.parent.extractParamMap}")

// We may alternatively specify parameters using a ParamMap,
// which supports several methods for specifying parameters.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30) // Specify 1 Param. This overwrites the original maxIter.
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55) // Specify multiple Params.

// One can also combine ParamMaps.
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability") // Change output column name.
val paramMapCombined = paramMap ++ paramMap2

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
val model2 = lr.fit(training, paramMapCombined)
println(s"Model 2 was fit using parameters: ${model2.parent.extractParamMap}")

// Prepare test data.
val test = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Make predictions on test data using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the usual
// 'probability' column since we renamed the lr.probabilityCol parameter previously.
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
  }
```
The following example demonstrates a Pipeline implementing the simple text document workflow described above (Tokenizer, HashingTF, and LogisticRegression), including saving the fitted model to disk and loading it back.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Now we can optionally save the fitted pipeline to disk.
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// We can also save this unfit pipeline to disk.
pipeline.write.overwrite().save("/tmp/unfit-lr-model")

// And load it back in during production.
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
```
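As a quick follow-up sketch (assuming the `model` fitted above): after fit(), each Estimator stage in the Pipeline has been replaced by the Transformer it produced, and the fitted stages can be inspected through the PipelineModel's stages array.

```scala
import org.apache.spark.ml.classification.LogisticRegressionModel

// The third stage of the fitted pipeline is the trained LogisticRegressionModel,
// whereas the unfit pipeline held the LogisticRegression estimator.
val lrModel = model.stages(2).asInstanceOf[LogisticRegressionModel]
println(s"Learned coefficients: ${lrModel.coefficients} intercept: ${lrModel.intercept}")
```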