When building user profiles, we need fact tags such as a user's evaluation of a given product or brand. These values are hard to obtain: there is no obvious way to turn a free-text comment into a score. One approach is to define a rule for computing an evaluation preference score.

The business system already holds a large number of user product reviews in the comment table:
| sku_id | user_id | comment |
| --- | --- | --- |
| sku0001 | user0008 | 穿的舒服,卖家发货挺快的,服务态度也很好 |
| sku0001 | user0006 | 东西质量不错,性价比高,特别轻便,舒适透气,宝贝与描述完全一致。 |
| sku0002 | user0003 | 版型挺好看的,穿起来挺合身的。这个价买到相当值! |
| sku0003 | user0012 | 穿了没几天就坏了,客服还不理人,不会再买了! |
| … | … | … |
How can a program, given a comment, automatically decide whether it is a positive, neutral, or negative review? This cannot reasonably be expressed in SQL; it calls for a machine learning algorithm, such as Naive Bayes.
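Before reaching for Spark, the core of Naive Bayes is simple enough to sketch by hand: each class gets a prior P(label) and per-word likelihoods P(word|label) with add-one smoothing, and the class with the highest log-probability wins. A minimal sketch, where the toy documents and the labels `good`/`bad` are purely illustrative, not the article's dataset:

```scala
// Toy multinomial Naive Bayes; the data and label names are illustrative only.
val docs: Seq[(String, Seq[String])] = Seq(
  ("good", Seq("舒服", "不错")), // comfortable, not bad
  ("good", Seq("不错", "轻便")), // not bad, lightweight
  ("bad",  Seq("坏", "不理人"))  // broken, ignores customers
)
val vocab  = docs.flatMap(_._2).distinct
val labels = docs.map(_._1).distinct

// log P(label) + sum of log P(word | label), with add-one (Laplace) smoothing
def logProb(words: Seq[String], label: String): Double = {
  val inClass = docs.filter(_._1 == label)
  val counts  = inClass.flatMap(_._2).groupBy(identity).map { case (w, ws) => w -> ws.size }
  val total   = inClass.map(_._2.size).sum
  val prior   = math.log(inClass.size.toDouble / docs.size)
  prior + words.map(w => math.log((counts.getOrElse(w, 0) + 1.0) / (total + vocab.size))).sum
}

// Pick the label with the highest log-probability
def predict(words: Seq[String]): String = labels.maxBy(logProb(words, _))

println(predict(Seq("舒服", "轻便"))) // prints "good"
```

Spark's `NaiveBayes` does essentially this at scale, with the TF-IDF vector playing the role of the word counts.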
For Chinese word segmentation we use HanLP (official site: http://www.hanlp.com/). Add the Maven dependency:

```xml
<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.7.4</version>
</dependency>
```
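Once the jar is on the classpath, `HanLP.segment(text)` returns a `java.util.List[Term]`, where each `Term` carries the word and its part-of-speech "nature" tag. The filtering the pipeline applies after segmentation can be sketched in pure Scala; the `Term` case class below is a stand-in so the sketch runs without the HanLP jar, and the tags and stop words are illustrative:

```scala
// Stand-in for HanLP's Term (word + part-of-speech "nature"), so this
// sketch runs without the HanLP jar; tags and stop words are illustrative.
case class Term(word: String, nature: String)

val stopWords = Set("也", "的")

// Drop prepositions (p), particles (u), modal words (y), punctuation (w),
// then drop anything on the stop-word list
def keep(t: Term): Boolean =
  !Seq("p", "u", "y", "w").exists(p => t.nature.startsWith(p)) &&
  !stopWords.contains(t.word)

val segmented = Seq(
  Term("穿", "v"), Term("的", "u"), Term("舒服", "a"),
  Term("，", "w"), Term("服务", "n"), Term("很", "d"), Term("好", "a"))

val words = segmented.filter(keep).map(_.word)
println(words.mkString(" ")) // prints "穿 舒服 服务 很 好"
```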
```scala
package cn.ianlou.bayes

import java.util

import com.hankcs.hanlp.HanLP
import com.hankcs.hanlp.seg.common.Term
import org.apache.log4j.{Level, Logger}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.ml.feature.{HashingTF, IDF, IDFModel}
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

import scala.collection.mutable

/**
 * @date: 2020/2/23 22:04
 * @site: www.ianlou.cn
 * @author: lekko 六水
 * @qq: 496208110
 * @description: comment sentiment classification with HanLP + Naive Bayes
 */
object NLP_CommentClassify {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

    val spark: SparkSession = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Load the stop-word list, collect it to the driver, and broadcast it
    val stopWords: Dataset[String] = spark.read.textFile("userProfile/data/comment/stopwords")
    val sw: Set[String] = stopWords.collect().toSet
    val bc: Broadcast[Set[String]] = spark.sparkContext.broadcast(sw)

    // 1. Load the three sample sets and union them into one labeled dataset
    //    0.0 -> negative, 1.0 -> neutral, 2.0 -> positive
    val hp: Dataset[String] = spark.read.textFile("userProfile/data/comment/good")
    val zp: Dataset[String] = spark.read.textFile("userProfile/data/comment/general")
    val cp: Dataset[String] = spark.read.textFile("userProfile/data/comment/poor")

    val sample: Dataset[(Double, String)] =
      hp.map(e => (2.0, e)) union zp.map(e => (1.0, e)) union cp.map(e => (0.0, e))

    // 2. Segment each sample with HanLP and filter out noise words
    val sampleWcDF: DataFrame = sample.map(ds => {
      val label: Double = ds._1
      val str: String = ds._2

      // HanLP.segment returns a java.util.List; import the implicit
      // conversions so it can be used like a Scala collection
      import scala.collection.JavaConversions._
      val terms: util.List[Term] = HanLP.segment(str)

      val filtered: mutable.Buffer[Term] = terms
        .filter(term => {
          // A Term carries the word plus its part-of-speech "nature"
          !term.nature.startsWith("p") && // drop prepositions
          !term.nature.startsWith("u") && // drop particles
          !term.nature.startsWith("y") && // drop modal words
          !term.nature.startsWith("w")    // drop punctuation
        })
        .filter(term => !bc.value.contains(term.word))

      // Turn the remaining terms back into an array of words
      val wcArr: Array[String] = filtered.map(term => term.word).toArray
      (label, wcArr)
    }).toDF("label", "wcArr")

    /** A sample of the segmented output:
     * +-----+-----------------------------------------------------------------------------+
     * |label|wcArr                                                                        |
     * +-----+-----------------------------------------------------------------------------+
     * |2.0  |[不错, 家居, 用品, 推荐, 这家, 微, 店]                                       |
     * |2.0  |[物流, 也, 是, 很, 快, 价格, 实惠, 而且, 卖家, 服务, 态度, 很好]             |
     * |2.0  |[不说, 话]                                                                   |
     * |2.0  |[手机, 有, 一, 段, 时间, 总体, 感觉, 还, 可以, 就, 是, 指纹, 解锁, 感觉, 不, 太, 好] |
     * |2.0  |[穿, 身上, 挺, 舒适, 物流, 也, 快, 很, 满意]                                 |
     * |2.0  |[你好, 好看, 穿, 上, 很, 有, 型, 非常, 满意]                                 |
     * | ..  | ..........................                                                  |
     * +-----+-----------------------------------------------------------------------------+
     */

    // 3. Convert the word arrays into term-frequency vectors
    val tfBean: HashingTF = new HashingTF()
      .setNumFeatures(1000000) // length of the hashed TF vector
      .setInputCol("wcArr")
      .setOutputCol("tf_vec")
    val tfDF: DataFrame = tfBean.transform(sampleWcDF)

    // 4. Fit the IDF and produce the final TF-IDF features
    val idfBean: IDF = new IDF()
      .setInputCol("tf_vec")
      .setOutputCol("tfidf_vec")
    val tf_model: IDFModel = idfBean.fit(tfDF)
    val tf_idf: DataFrame = tf_model.transform(tfDF)

    // 5. Randomly split the feature vectors into a training set and a test
    //    set; the array positions match the weights: 0.9 -> train, 0.1 -> test
    val Array(train, test) = tf_idf.randomSplit(Array(0.9, 0.1))

    // 6. Train a model with the Naive Bayes algorithm
    val bayes: NaiveBayes = new NaiveBayes()
      .setLabelCol("label")
      .setFeaturesCol("tfidf_vec")
      .setSmoothing(1.0)
    val bayes_model: NaiveBayesModel = bayes.fit(train)

    // Optionally save the trained model
    // bayes_model.save("userProfile/data/comment/cmt_classfiy_model")

    // 7. (Re)load the model and predict on the test set
    // val import_model: NaiveBayesModel = NaiveBayesModel.load("userProfile/data/comment/cmt_classfiy_model")
    val predict: DataFrame = bayes_model.transform(test)
    predict.drop("wcArr", "tf_vec", "tfidf_vec", "probability")
      .show(20, false)

    /** Sample predictions:
     * +-----+-------------------------------------------------------------+----------+
     * |label|rawPrediction                                                |prediction|
     * +-----+-------------------------------------------------------------+----------+
     * |2.0  |[-747.637914564428,-731.2594545039569,-670.607711511326]     |2.0       |
     * |2.0  |[-272.37994460222103,-282.57209202292,-311.7303961540079]    |0.0       |
     * |2.0  |[-5767.454153488989,-5535.821898484342,-5800.9303308607905]  |1.0       |
     * |2.0  |[-2684.464068025356,-2708.624329110813,-2583.469344023015]   |2.0       |
     * |2.0  |[-277.9643926287218,-268.26311309671814,-227.48341719673542] |2.0       |
     * |2.0  |[-284.10668520829034,-274.7353536977782,-231.0692541159483]  |2.0       |
     * |2.0  |[-6215.519311631931,-6073.2560110828745,-6310.240365291973]  |1.0       |
     * |2.0  |[-6931.386427418773,-6714.489964773973,-6948.950297510344]   |1.0       |
     * |2.0  |[-4373.763017497593,-4413.6535762004105,-4422.23896893744]   |0.0       |
     * |2.0  |[-372.6616634591026,-370.30136899146345,-342.6704036063473]  |2.0       |
     * | ... | ........................                                    | ......   |
     * +-----+-------------------------------------------------------------+----------+
     */

    // 8. Evaluate the model's prediction accuracy
    val correctCnts: DataFrame = predict.selectExpr(
      "count(1) as total",
      "count(if(label = prediction,1,null)) as correct")
    correctCnts.show(10, false)

    /**
     * +------+-------+
     * |total |correct|
     * +------+-------+
     * |245733|204328 |
     * +------+-------+
     *
     * Accuracy: correct / total * 100% = 83.15%
     */

    // 9. To classify new comments, read the data to be predicted, then
    //    segment it, build the same features, and call the model to predict.
    //    The logic mirrors the steps above, so it is not written out here.

    spark.close()
  }
}
```
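The `HashingTF` stage above relies on the hashing trick: instead of building an explicit vocabulary, each word is hashed to a bucket index in a fixed-size vector. A minimal pure-Scala sketch of the idea, together with the smoothed IDF formula that Spark's `IDF` uses, log((N + 1) / (df + 1)); note Spark hashes the UTF-8 bytes with MurmurHash3 internally, so the exact indices here will differ from Spark's:

```scala
import scala.util.hashing.MurmurHash3

val numFeatures = 1000000 // same vector length as setNumFeatures above

// Hash a word to a non-negative bucket index in [0, numFeatures)
def bucket(word: String): Int = {
  val h = MurmurHash3.stringHash(word) % numFeatures
  if (h < 0) h + numFeatures else h
}

// Term-frequency "vector", kept sparse as index -> count
def tf(words: Seq[String]): Map[Int, Int] =
  words.groupBy(bucket).map { case (i, ws) => i -> ws.size }

// Smoothed inverse document frequency: log((N + 1) / (df + 1))
def idf(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))

val vec = tf(Seq("舒服", "舒服", "轻便"))
println(vec(bucket("舒服"))) // prints 2
```

A large `numFeatures` keeps hash collisions (two words sharing a bucket) rare, which is why the pipeline sets it to one million.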