Suppose we have log files recording how long netizens stayed on an online-shopping site over one weekend. Based on the business requirements, develop a Spark application that:
1. Collects, in near real time, information on female netizens whose cumulative online-shopping time exceeds two hours.
2. Reads the two weekend log files, in which the first column is the name, the second the gender, and the third the dwell time for that visit in minutes, with "," as the field separator.
Data:
log1.txt: Saturday dwell-time log
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
log2.txt: Sunday dwell-time log
LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
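Before wiring in streaming, it helps to see the core aggregation as a plain batch job over these two files. The following is a minimal sketch, not part of the original solution; the file paths and local master are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object FemaleTimeBatch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FemaleTimeBatch").setMaster("local[2]"))
    // Load both weekend logs (one record per line); paths are placeholders
    val lines = sc.textFile("log1.txt,log2.txt")
    val result = lines
      .filter(_.contains("female"))     // keep female records only
      .map { line =>
        val fields = line.split(',')
        (fields(0), fields(2).toInt)    // (name, minutes)
      }
      .reduceByKey(_ + _)               // total minutes per person
      .filter(_._2 > 120)               // keep totals over two hours
    result.collect().foreach(println)   // expect (CaiXuyu,300) and (FangBo,320)
    sc.stop()
  }
}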
Project pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.xu.sparktest1</groupId>
    <artifactId>sparktest1</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <spark.version>2.1.0</spark.version>
        <scala.version>2.11</scala.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
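With this POM in place, the project compiles and packages through the standard Maven lifecycle; nothing project-specific is assumed here:

mvn clean package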
The Spark Streaming application:

package com.xu

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{HashPartitioner, SparkConf}

object SparkStreamTest5 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SparkHomeWork2").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("WARN")
    ssc.checkpoint(".")

    // Kafka connection parameters
    val brokeList = "node01:9092,node02:9092,node03:9092"
    val zk = "node01:2181/kafka"
    val sourceTopic = "sparkhomework-test4"
    val consumerGroup = "sparkhomework2"
    val topicMap = sourceTopic.split(",").map((_, 1)).toMap

    // Receive messages from Kafka; each message is one flattened log line
    val lines = KafkaUtils.createStream(ssc, zk, consumerGroup, topicMap).map(_._2)
    // Split each message into records and keep only female netizens
    val results = lines.flatMap(_.split(" ")).filter(_.contains("female"))
    // Sum the dwell time per name within the current batch
    val femaleData: DStream[(String, Int)] = results.map { line =>
      val t = line.split(',')
      (t(0), t(2).toInt)
    }.reduceByKey(_ + _)

    // Accumulate totals across batches, then keep and print the female
    // netizens whose cumulative time exceeds two hours
    val date = femaleData.updateStateByKey(updateFunction,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
      .filter(line => line._2 > 120)
    date.print()

    ssc.start()
    ssc.awaitTermination()
  }

  // Fold each batch's new values into the running total for every key
  val updateFunction = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    iter.flatMap { case (x, y, z) =>
      Some(y.sum + z.getOrElse(0)).map(v => (x, v))
    }
  }
}
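The updateStateByKey call is the heart of the job: on each batch it hands the update function every key's new values together with that key's previous running total. A quick standalone sketch (plain Scala, no Spark required; the two hand-built "batches" are illustrative) of how the function accumulates:

package com.xu

object UpdateFunctionDemo {
  def main(args: Array[String]): Unit = {
    val f = SparkStreamTest5.updateFunction
    // Batch 1: CaiXuyu logs 50 + 50 minutes; no prior state yet
    val afterBatch1 = f(Iterator(("CaiXuyu", Seq(50, 50), None))).toMap
    println(afterBatch1)                    // Map(CaiXuyu -> 100)
    // Batch 2: four more 50-minute sessions; prior total of 100 carried over
    val afterBatch2 = f(Iterator(("CaiXuyu", Seq(50, 50, 50, 50), afterBatch1.get("CaiXuyu"))))
    println(afterBatch2.toList)             // List((CaiXuyu,300))
  }
}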
To start the Kafka cluster, ZooKeeper must be running first. Execute the following on all three servers:
cd /export/servers/zookeeper-3.4.5-cdh5.14.0
bin/zkServer.sh start
bin/zkServer.sh status
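On a healthy three-node ensemble, zkServer.sh status should report one leader and two followers; the exact output wording varies by version, but roughly:

Mode: leader      (on exactly one node)
Mode: follower    (on the other two)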
Then start a Kafka broker on each of the three machines:
cd /export/servers/kafka_2.11-0.10.0.0
nohup bin/kafka-server-start.sh config/server.properties 2>&1 &
To stop the brokers later:
cd /export/servers/kafka_2.11-0.10.0.0
bin/kafka-server-stop.sh
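To confirm the brokers came up and registered under the ZooKeeper chroot, one quick check (using the same node01:2181/kafka path as below) is to list the existing topics:

cd /export/servers/kafka_2.11-0.10.0.0
bin/kafka-topics.sh --list --zookeeper node01:2181/kafka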
Flatten each of the logs above onto a single line, with records separated by spaces, so that each log file can be pasted into the Kafka producer as one message:
LiuYang,female,20 YuanJing,male,10 GuoYijun,male,5 CaiXuyu,female,50 Liyuan,male,20 FangBo,female,50 LiuYang,female,20 YuanJing,male,10 GuoYijun,male,50 CaiXuyu,female,50 FangBo,female,60
LiuYang,female,20 YuanJing,male,10 CaiXuyu,female,50 FangBo,female,50 GuoYijun,male,5 CaiXuyu,female,50 Liyuan,male,20 CaiXuyu,female,50 FangBo,female,50 LiuYang,female,20 YuanJing,male,10 FangBo,female,50 GuoYijun,male,50 CaiXuyu,female,50 FangBo,female,60
cd /export/servers/kafka_2.11-0.10.0.0
-- Create the topic
bin/kafka-topics.sh --create --topic sparkhomework-test4 --replication-factor 1 --partitions 3 --zookeeper node01:2181/kafka
-- Start a console producer
bin/kafka-console-producer.sh --broker-list node01:9092,node02:9092,node03:9092 --topic sparkhomework-test4
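If the producer or consumer misbehaves, the topic's partition and replica layout can be double-checked with --describe (same assumed chroot path):

bin/kafka-topics.sh --describe --topic sparkhomework-test4 --zookeeper node01:2181/kafka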
Note: the ZooKeeper path for Kafka here is node01:2181/kafka; this chroot is configured in config/server.properties.
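Concretely, the setting that controls the chroot is zookeeper.connect. On this cluster it would plausibly read as follows (the full three-node host list is an assumption; only the /kafka suffix is confirmed above):

# config/server.properties
zookeeper.connect=node01:2181,node02:2181,node03:2181/kafka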
Run the Scala main method, then paste log1 and log2 (each as one line) into the producer console in turn:
Result: once both lines have been sent, the job should print the accumulated records that pass the two-hour filter, i.e. (CaiXuyu,300) and (FangBo,320); after log1 alone, no name has yet exceeded 120 minutes, so the batches print empty.
Note: if these records never appear after you enter the logs in the Linux console, the topic is misconfigured and the consumer is not actually connected.
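One way to narrow such a failure down is to consume the topic directly from the command line; if the messages show up here but not in the Spark job, re-check the topic name, consumer group, and ZooKeeper path passed to KafkaUtils.createStream:

bin/kafka-console-consumer.sh --zookeeper node01:2181/kafka --topic sparkhomework-test4 --from-beginning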