In the previous section we covered the following: the two ways of integrating and processing Kafka data for different Spark and Kafka versions, the corresponding versions, and how those versions have evolved.
The Receiver-based approach is implemented with the old Kafka high-level consumer API.
With all Receivers, the data received from Kafka is stored on Spark's Executors; under the hood it is written to the BlockManager, with a block generated every 200 ms by default (spark.streaming.blockInterval).
The jobs submitted by Spark Streaming then build BlockRDDs, which ultimately run as Spark Core tasks.
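For reference, here is a minimal sketch of the Receiver-based API from spark-streaming-kafka-0-8 (the old high-level consumer). The group id receiver_test is only a placeholder; the ZooKeeper address and topic follow the environment used later in this article.

package icu.wzk

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaReceiverSketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(2))
    // zkQuorum, groupId, and a map of topic -> number of receiver threads
    val stream = KafkaUtils.createStream(
      ssc,
      "h121.wzk.icu:2181",
      "receiver_test", // placeholder group id
      Map("spark_streaming_test01" -> 1)
    )
    // Elements are (key, value) pairs; print the values
    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()
  }
}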
For the Receiver approach, the following points need attention:
The Direct Approach integrates Kafka with Spark Streaming without a Receiver and is the one used most in production environments. Compared with the Receiver approach, it has the following characteristics:
The Spark Streaming integration with Kafka 0.10 is very similar to the 0.8 Direct approach: Kafka partitions correspond one-to-one to Spark RDD partitions, offsets and metadata can be accessed, and there is no significant difference in how the API is used.
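Since the 0.10 API exposes offsets and metadata, each batch's RDD can be cast to HasOffsetRanges to see where it starts and ends in every partition. A minimal sketch, assuming a dstream created with createDirectStream as in the full example later in this article:

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// Every KafkaRDD produced by createDirectStream carries its offset ranges
dstream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { o =>
    println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
  }
}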
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>${spark.version}</version>
</dependency>
Do not add org.apache.kafka dependencies such as kafka-clients manually: spark-streaming-kafka-0-10 already includes them, and mismatched versions are incompatible to varying degrees.
Use the kafka010 API to read data from Kafka. First, write a producer to generate some test data:
package icu.wzk

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

import java.util.Properties

object KafkaProducerTest {
  def main(args: Array[String]): Unit = {
    // Kafka connection parameters
    val brokers = "h121.wzk.icu:9092"
    val topic = "spark_streaming_test01"
    val prop = new Properties()
    prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
    prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
    // KafkaProducer
    val producer = new KafkaProducer[String, String](prop)
    for (i <- 1 to 1000) {
      val msg = new ProducerRecord[String, String](topic, i.toString, i.toString)
      // Send the message
      producer.send(msg)
      println(s"i = $i")
      Thread.sleep(100)
    }
    producer.close()
  }
}
i = 493
i = 494
i = 495
i = 496
i = 497
i = 498
i = 499
i = 500
i = 501
i = 502
i = 503
i = 504
A screenshot of the run:
On the server, check the topics currently in Kafka:
kafka-topics.sh --list --zookeeper h121.wzk.icu:2181
We can see that the topic spark_streaming_test01 has been created. Next, consume it with the kafka010 Direct API:
package icu.wzk

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaDStream1 {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val conf = new SparkConf()
      .setAppName("KafkaDStream1")
      .setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(2))
    val kafkaParams: Map[String, Object] = getKafkaConsumerParameters("wzkicu")
    val topics: Array[String] = Array("spark_streaming_test01")
    // Read data from Kafka
    val dstream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils
      .createDirectStream(
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))
    // Print the record count of each non-empty batch
    dstream.foreachRDD {
      (rdd, time) => if (!rdd.isEmpty()) {
        println(s"========== rdd.count = ${rdd.count()}, time = $time ============")
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }

  private def getKafkaConsumerParameters(groupId: String): Map[String, Object] = {
    Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "h121.wzk.icu:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.GROUP_ID_CONFIG -> groupId,
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest",
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (false: java.lang.Boolean)
    )
  }
}
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/wuzikang/.m2/repository/org/apache/spark/spark-unsafe_2.12/2.4.5/spark-unsafe_2.12-2.4.5.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
========== rdd.count = 1000, time = 1721721502000 ms ============
A screenshot of the run is shown below:
Start the KafkaProducer program again so that data keeps being written.
The console output then looks like this:
========== rdd.count = 1000, time = 1721721502000 ms ============
========== rdd.count = 9, time = 1721721710000 ms ============
========== rdd.count = 19, time = 1721721712000 ms ============
========== rdd.count = 19, time = 1721721714000 ms ============
========== rdd.count = 19, time = 1721721716000 ms ============
========== rdd.count = 20, time = 1721721718000 ms ============
========== rdd.count = 19, time = 1721721720000 ms ============
========== rdd.count = 19, time = 1721721722000 ms ============
========== rdd.count = 19, time = 1721721724000 ms ============
The run results are shown in the screenshot below:
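One follow-up note: the example above sets ENABLE_AUTO_COMMIT_CONFIG to false, so offsets are not committed automatically. With the 0.10 API they can be committed back to Kafka manually after each batch has been processed; a minimal sketch, reusing the dstream from KafkaDStream1:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Commit each batch's offsets back to Kafka once the batch has been processed
dstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process the batch here ...
    dstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
}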