A real interview question:
val dataKv: RDD[String] = sc.parallelize(List(
  "hello world", "hello spark", "hello world",
  "hello hadoop", "hello world", "hello world"
))
val words: RDD[String] = dataKv.flatMap(_.split(" "))
val kv: RDD[(String, Int)] = words.map((_, 1))
val res: RDD[(String, Int)] = kv.reduceByKey(_ + _)
/**
 * Pay attention to this step
 */
val res1: RDD[(String, Int)] = res.map(x => (x._1, x._2 * 10))
val result: RDD[(String, Iterable[Int])] = res1.groupByKey()
result.foreach(println)
while (true) { } // keep the application alive so the Spark web UI can be inspected
From the screenshot above (the job's DAG in the Spark web UI), we can see that two shuffles occur here: one caused by reduceByKey() and the other by groupByKey().
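You can also confirm this without the web UI. A quick sketch, reusing the result RDD from the code above: each ShuffledRDD entry in the printed lineage marks one shuffle boundary.

// Each "ShuffledRDD" line in the output corresponds to one shuffle:
// one from reduceByKey() and a second one from groupByKey().
println(result.toDebugString)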
Now run the same job with mapValues in place of map:

val dataKv: RDD[String] = sc.parallelize(List(
  "hello world", "hello spark", "hello world",
  "hello hadoop", "hello world", "hello world"
))
val words: RDD[String] = dataKv.flatMap(_.split(" "))
val kv: RDD[(String, Int)] = words.map((_, 1))
val res: RDD[(String, Int)] = kv.reduceByKey(_ + _)
/**
 * Pay attention to this step: the previous version used map(x => (x._1, x._2 * 10))
 */
val res1: RDD[(String, Int)] = res.mapValues(x => x * 10)
val result: RDD[(String, Iterable[Int])] = res1.groupByKey()
result.foreach(println)
while (true) { } // keep the application alive so the Spark web UI can be inspected

This time only one shuffle appears: reduceByKey() still shuffles, but groupByKey() reuses the existing partitioning and does not shuffle again.
Let's go back to the official documentation for these two operators and look for the difference:
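For reference, here are the relevant signatures and doc comments from the Spark source (quoted from memory; the exact wording may vary slightly between versions):

// RDD.scala
/** Return a new RDD by applying a function to all elements of this RDD. */
def map[U: ClassTag](f: T => U): RDD[U]

// PairRDDFunctions.scala
/**
 * Pass each value in the key-value pair RDD through a map function
 * without changing the keys; this also retains the original RDD's
 * partitioning.
 */
def mapValues[U](f: V => U): RDD[(K, U)]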
The diagram below is not strictly accurate, but it illustrates the difference: map() breaks the source RDD's partitioning relationship, because the function is free to modify the key, so the returned RDD carries no partitioner; mapValues() cannot touch the key, so the partition dependency (and the partitioner) is preserved.
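A minimal spark-shell sketch (assuming an existing SparkContext named sc) that makes this visible through each RDD's partitioner field:

// After reduceByKey, the RDD is hash-partitioned by key.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2))).reduceByKey(_ + _)
println(pairs.partitioner)                              // Some(HashPartitioner@...)
// map() may change the key, so Spark discards the partitioner.
println(pairs.map(x => (x._1, x._2 * 10)).partitioner)  // None
// mapValues() cannot change the key, so the partitioner is kept.
println(pairs.mapValues(_ * 10).partitioner)            // Some(HashPartitioner@...)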