
Spark: JOIN types and how the join strategy is chosen

Reference:

 

1. Spark join种类(>3种)及join选择依据 (Spark join types and join selection criteria)

https://blog.csdn.net/rlnLo2pNEfx9c/article/details/106066081

 

 

Internally, Spark implements JOIN in roughly the following three ways:

1. BroadcastHashJoin

2. ShuffledHashJoin

3. SortMergeJoin

 

 

1. BroadcastHashJoin

If you dig through the source, you will find that before Spark 1.6 BroadcastHashJoin was implemented with a plain Java HashMap. If you are interested, search the Spark 1.6 source for BroadcastHashJoin and HashedRelation.

The concrete implementation: based on table statistics, when the driver finds that a small table meets the broadcast condition, it collects that table to the driver, builds a HashedRelation from it, and then broadcasts it.

In essence, this is the same as broadcasting a HashMap yourself in Spark Streaming.

One point worth stressing: the maximum-row and maximum-bytes limits you may run into here are limits of the internal storage structure (HashedRelation), not the auto-broadcast threshold parameter that we configure.
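For reference, a broadcast hash join can also be requested explicitly from the DataFrame API with the broadcast() function. Below is a minimal, self-contained sketch with toy data and illustrative names; the physical plan printed by explain() should contain BroadcastHashJoin:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("bhj-sketch").getOrCreate()
import spark.implicits._

// A small dimension table and a larger fact table (toy data).
val small = Seq((1, "mm"), (2, "cs"), (3, "cc")).toDF("uid", "name")
val big   = (1 to 1000000).map(i => (i, i % 100)).toDF("uid", "salary")

// Ask the planner to broadcast the small side.
val joined = big.join(broadcast(small), Seq("uid"))
joined.explain()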

 

 

2. ShuffledHashJoin

BroadcastHashJoin is the strategy for joining a large table with a small table, broadcasting the entire small table. Often, though, neither table is small enough to broadcast or to hold in memory as a whole, yet once split into partitions each piece can fit in memory and be built into a HashedRelation. That is divide and conquer: repartition both tables on the join key with the same hash partitioner and the same number of partitions, so that rows with the same join key land in the same partition, and then run a local hash join inside each partition.
 

In other words, ShuffledHashJoin boils down to a few key steps:

  1. One of the two tables is relatively small, so that after being split it can be joined locally.
  2. Both tables use the same partitioner and the same number of partitions, partitioned by the join key; this confines each join key to a single partition, so no other partition is needed to complete the join.
  3. A hash table (HashedRelation) is built from each partition of the smaller table, and the local hash join runs against it; see the ShuffledHashJoinExec code.

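A minimal sketch of the per-partition build/probe idea, using plain Scala collections to stand in for one co-partitioned pair of partitions (the names buildSide and streamSide are illustrative, not Spark APIs):

// One co-partitioned pair of partitions: after the shuffle, rows that share
// a join key are guaranteed to sit in the same pair.
val buildSide  = Seq((1, "mm"), (2, "cs"), (3, "cc"))     // smaller side: (uid, name)
val streamSide = Seq((1, 62), (2, 76), (2, 40), (4, 33))  // larger side:  (uid, salary)

// Build phase: hash the smaller side into an in-memory map
// (Spark builds a HashedRelation here).
val hashTable: Map[Int, Seq[String]] =
  buildSide.groupBy(_._1).mapValues(_.map(_._2)).toMap

// Probe phase: stream the larger side and look every key up in the map
// (an inner join, for simplicity).
val joined: Seq[(Int, String, Int)] =
  for {
    (uid, salary) <- streamSide
    names         <- hashTable.get(uid).toSeq
    name          <- names
  } yield (uid, name, salary)

joined.foreach(println)  // (1,mm,62), (2,cs,76), (2,cs,40)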

 

 

3. SortMergeJoin

The two cases above both rely on the small table, or the medium table after partitioning, fitting in memory so that a local hash join can run. Keeping that data in memory is expensive, which is why joins so often end in OOM; "small" and "medium" are always relative to the memory available, and with unlimited memory none of this would be a concern.

So what about joining a large table with another large table? That is where SortMergeJoin comes in.

The basic SortMergeJoin procedure is:

  1. Repartition both tables with the same partitioner and the same number of partitions, so that identical keys land in the same partition.
  2. Within each partition, sort the data of both tables by the join key.
  3. Join the two sorted partitions: walk both ordered sequences, merge and emit when the join keys match, otherwise advance the side holding the smaller key.
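The merge step can be sketched with two key-sorted sequences (plain Scala, not Spark code; an inner join for simplicity, and the right side here has unique keys):

// Both sides are already sorted by key within the partition.
val left  = Seq((1, "a"), (3, "b"), (3, "c"), (5, "d"))  // (key, leftValue)
val right = Seq((1, 10), (2, 20), (3, 30), (6, 60))      // (key, rightValue)

// Walk both sorted sequences once: on equal keys emit the matches,
// otherwise advance the side holding the smaller key.
def mergeJoin(l: Seq[(Int, String)], r: Seq[(Int, Int)]): Seq[(Int, String, Int)] = {
  var i = 0
  var j = 0
  val out = scala.collection.mutable.ArrayBuffer.empty[(Int, String, Int)]
  while (i < l.length && j < r.length) {
    if (l(i)._1 == r(j)._1) {
      // emit every left row that shares this key with the current right row
      val key = l(i)._1
      var k = i
      while (k < l.length && l(k)._1 == key) {
        out += ((key, l(k)._2, r(j)._2))
        k += 1
      }
      j += 1
    } else if (l(i)._1 < r(j)._1) i += 1
    else j += 1
  }
  out.toSeq
}

mergeJoin(left, right).foreach(println)  // (1,a,10), (3,b,30), (3,c,30)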


 

 

4. How Spark chooses a JOIN strategy

 

1) With join hints (Spark 3.0+)

 

If the user supplies join hints in Spark SQL, Spark tries the hinted join strategy first.

 

BroadcastHashJoin

The hint syntax is:

-- BROADCAST, BROADCASTJOIN and MAPJOIN all express the broadcast hint

SELECT /*+ BROADCAST(r) */  *  FROM records r JOIN src s ON r.key = s.key

 

ShuffledHashJoin

The hint syntax is:

-- SHUFFLE_HASH expresses the ShuffledHashJoin hint

SELECT /*+ SHUFFLE_HASH(r) */  * FROM records r JOIN src s ON r.key = s.key

 

SortMergeJoin

The hint syntax is:

-- SHUFFLE_MERGE, MERGE and MERGEJOIN all express the SortMergeJoin hint

SELECT /*+ MERGEJOIN(r) */  * FROM records r JOIN src s ON r.key = s.key
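The same hints can also be attached through the DataFrame API via Dataset.hint (SHUFFLE_HASH and MERGE require Spark 3.0+). A minimal sketch, with toy records/src frames standing in for the tables used above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("hint-sketch").getOrCreate()
import spark.implicits._

val records = Seq((1, "r1"), (2, "r2")).toDF("key", "rvalue")
val src     = Seq((1, "s1"), (2, "s2")).toDF("key", "svalue")

// The same three join hints, expressed on the DataFrame side:
records.join(src.hint("BROADCAST"),    Seq("key")).explain()  // BroadcastHashJoin
records.join(src.hint("SHUFFLE_HASH"), Seq("key")).explain()  // ShuffledHashJoin
records.join(src.hint("MERGE"),        Seq("key")).explain()  // SortMergeJoin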

 

 

2) Without hints

 

Without hints, the default selection rules are as follows.

Step1

1. First check whether broadcast applies: if the join statistics show that a table's size is non-negative and no larger than the user-configured auto-broadcast threshold, broadcast it.

plan.stats.sizeInBytes >= 0 && plan.stats.sizeInBytes <= conf.autoBroadcastJoinThreshold

The threshold is the parameter spark.sql.autoBroadcastJoinThreshold. If both tables qualify for broadcasting, the smaller one is chosen.
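For example, the threshold can be changed, or auto-broadcast disabled, at runtime. A quick sketch, assuming spark is an active SparkSession (values are in bytes; -1 disables auto-broadcast):

// Raise the auto-broadcast threshold to 10 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

// Or disable automatic broadcasting entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")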

 

Step2

2. If broadcast does not apply, check whether ShuffledHashJoin can be used. First, the following parameter must be set to false (it defaults to true):

spark.sql.join.preferSortMergeJoin=false

 

Two further conditions must hold, based on the table statistics:

i. The table must be small enough that one partition's share fits in a local hash map, i.e. its size is below the broadcast threshold times the number of shuffle partitions:

plan.stats.sizeInBytes < conf.autoBroadcastJoinThreshold * conf.numShufflePartitions

ii. The table's size times 3 must still be no larger than the other table's size:

a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes

 

A table that satisfies both is suitable, after the shuffle, for building the per-partition local hash table.
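Put together, the check looks roughly like the sketch below. The helper names canBuildLocalHashMap and muchSmaller follow the Spark source (JoinSelection); the Stats and Conf case classes here are simplified stand-ins, not Spark classes:

// Simplified stand-ins for the planner's inputs.
final case class Stats(sizeInBytes: BigInt)
final case class Conf(autoBroadcastJoinThreshold: Long,
                      numShufflePartitions: Int,
                      preferSortMergeJoin: Boolean)

// Condition (i): one partition's share of the table should fit in a local hash map.
def canBuildLocalHashMap(plan: Stats, conf: Conf): Boolean =
  plan.sizeInBytes < BigInt(conf.autoBroadcastJoinThreshold) * conf.numShufflePartitions

// Condition (ii): the build side must be at least three times smaller than the other side.
def muchSmaller(a: Stats, b: Stats): Boolean =
  a.sizeInBytes * 3 <= b.sizeInBytes

// ShuffledHashJoin is considered only when preferSortMergeJoin is off and both conditions hold.
def canUseShuffledHashJoin(build: Stats, other: Stats, conf: Conf): Boolean =
  !conf.preferSortMergeJoin && canBuildLocalHashMap(build, conf) && muchSmaller(build, other)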

 

Step3

3. If neither broadcast nor ShuffledHashJoin applies, check whether SortMergeJoin can be used. The condition is simple: the join keys must be sortable.

def createSortMergeJoin() = {
  if (RowOrdering.isOrderable(leftKeys)) {
    Some(Seq(
      joins.SortMergeJoinExec(
        leftKeys
        , rightKeys
        , joinType
        , condition
        , planLater(left)
        , planLater(right))))
  } else {
    None
  }
}

This code lives in the SparkStrategies class, inside the JoinSelection object. The snippet below, from the same place, shows the order in which hinted strategies are resolved:

// Hinted strategies are tried in order: broadcast, sort-merge, shuffle-hash,
// shuffle-replicate nested loop, and finally the no-hint selection path.
createBroadcastHashJoin(hintToBroadcastLeft(hint), hintToBroadcastRight(hint))
  .orElse {
    if (hintToSortMergeJoin(hint)) createSortMergeJoin()
    else None
  }
  .orElse(createShuffleHashJoin(hintToShuffleHashLeft(hint), hintToShuffleHashRight(hint)))
  .orElse {
    if (hintToShuffleReplicateNL(hint)) createCartesianProduct()
    else None
  }
  .getOrElse(createJoinWithoutHint())

 

5. Support for equi-joins and non-equi joins

All three joins above are equi-joins. Earlier Spark versions supported only equi-joins, yet non-equi joins do show up in real business logic. Spark currently has two implementations for non-equi joins, and because of how they work they are indeed prone to OOM:

Broadcast nested loop join and shuffle-and-replicate nested loop join (Cartesian product).
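A typical non-equi join that ends up on one of these nested-loop plans is a range join. A sketch, assuming spark is an active SparkSession and events/ranges are hypothetical registered views:

// No equality predicate, so Spark plans a BroadcastNestedLoopJoin
// (if one side can be broadcast) or a Cartesian product.
val nonEqui = spark.sql(
  """SELECT e.uid, r.level
    |FROM   events e
    |JOIN   ranges r
    |  ON   e.score >= r.low AND e.score < r.high""".stripMargin)
nonEqui.explain()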

 

6. Test code (based on Spark 2.2.0)

 

The following code was written to test how the join strategy is chosen.

 

package com.spark.test.offline.spark_sql

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

/**
  * Created by szh on 2020/6/7.
  */
object SparkSQLStrategy {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf
    sparkConf
      .setAppName("Union data test")
      .setMaster("local[1]")
      // auto-broadcast threshold: 1 MB
      .set("spark.sql.autoBroadcastJoinThreshold", "1048576")
      .set("spark.sql.shuffle.partitions", "10")
      // allow ShuffledHashJoin to be considered
      .set("spark.sql.join.preferSortMergeJoin", "false")

    val spark = SparkSession.builder()
      .config(sparkConf)
      .getOrCreate()
    val sparkContext = spark.sparkContext
    sparkContext.setLogLevel("WARN")

    // userA: a small table with 5 rows
    val arrayA = Array(
      (1, "mm")
      , (2, "cs")
      , (3, "cc")
      , (4, "px")
      , (5, "kk")
    )
    val rddA = sparkContext
      .parallelize(arrayA)
    val rddADF = spark.createDataFrame(rddA).toDF("uid", "name")
    rddADF.createOrReplaceTempView("userA")
    spark.sql("CACHE TABLE userA")

    //--------------------------
    //--------------------------

    // userB: a large table with 1,000,000 rows
    val arrayB = new ArrayBuffer[(Int, String)]()
    val nameArr = Array[String]("sun", "zhen", "hua", "kk", "cc")
    for (i <- 1 to 1000000) {
      val id = i
      val name = nameArr(Random.nextInt(5))
      arrayB.+=((id, name))
    }
    val rddB = sparkContext.parallelize(arrayB)
    val rddBDF = spark.createDataFrame(rddB).toDF("uid", "name")
    rddBDF.createOrReplaceTempView("userB")

    // listA: a small table with 40 rows
    val arrListA = new ArrayBuffer[(Int, Int)]
    for (i <- 1 to 40) {
      val id = i
      val salary = Random.nextInt(100)
      arrListA.+=((id, salary))
    }
    spark
      .createDataFrame(arrListA).toDF("uid", "salary")
      .createOrReplaceTempView("listA")

    // listB: a large table with 4,000,000 rows
    val arrList = new ArrayBuffer[(Int, Int)]
    for (i <- 1 to 4000000) {
      val id = i
      val salary = Random.nextInt(100)
      arrList.+=((id, salary))
    }
    spark
      .createDataFrame(arrList).toDF("uid", "salary")
      .createOrReplaceTempView("listB")

    // Query 1: large userB LEFT JOIN small listA
    val resultBigDF = spark
      .sql("SELECT userB.uid, name, salary FROM userB LEFT JOIN listA ON userB.uid = listA.uid")
    resultBigDF.show()
    resultBigDF.explain(true)

    // Query 2: small userA LEFT JOIN small listA
    val resultSmallDF = spark
      .sql("SELECT userA.uid, name, salary FROM userA LEFT JOIN listA ON userA.uid = listA.uid")
    resultSmallDF.show()
    resultSmallDF.explain(true)

    // Query 3: large userB LEFT JOIN large listB
    val resultBigDF2 = spark
      .sql("SELECT userB.uid, name, salary FROM userB LEFT JOIN listb ON userB.uid = listB.uid")
    resultBigDF2.show()
    resultBigDF2.explain(true)

    // keep the application alive so the Spark UI can be inspected
    Thread
      .sleep(60 * 10 * 1000)
    sparkContext.stop()
  }
}

Job breakdown (Spark UI)

 

Output (the explain plans for each query are reproduced and analyzed per SQL below)

Query 1: userB LEFT JOIN listA

+---+----+------+
|uid|name|salary|
+---+----+------+
|  1| sun|    62|
|  2|  kk|    76|
|  3| sun|    64|
|  4|  kk|    33|
|  5|zhen|    20|
|  6| hua|    17|
|  7|  kk|     4|
|  8|  cc|    62|
|  9| sun|    97|
| 10| sun|    87|
| 11| hua|    71|
| 12|  kk|    42|
| 13| hua|    76|
| 14| sun|    93|
| 15|zhen|     7|
| 16|  kk|    59|
| 17| hua|    98|
| 18| sun|    88|
| 19|  cc|    49|
| 20|  cc|    62|
+---+----+------+
only showing top 20 rows

Query 2: userA LEFT JOIN listA

+---+----+------+
|uid|name|salary|
+---+----+------+
|  1|  mm|    62|
|  2|  cs|    76|
|  3|  cc|    64|
|  4|  px|    33|
|  5|  kk|    20|
+---+----+------+

20/06/08 00:50:40 WARN TaskSetManager: Stage 4 contains a task of very large size (160161 KB). The maximum recommended task size is 100 KB.
20/06/08 00:50:43 WARN TaskSetManager: Stage 5 contains a task of very large size (20512 KB). The maximum recommended task size is 100 KB.

Query 3: userB LEFT JOIN listB

+---+----+------+
|uid|name|salary|
+---+----+------+
| 22|zhen|    40|
| 32|zhen|    81|
| 60|  cc|    73|
| 90|  cc|    12|
| 92|zhen|    90|
| 95|  cc|    95|
|108|  cc|    49|
|123| hua|    44|
|128| sun|    50|
|144|zhen|    63|
|148|  cc|     2|
|153|  cc|    64|
|155|zhen|    88|
|167|  cc|    94|
|168| sun|    18|
|205|  kk|     6|
|209| hua|    78|
|229|  cc|    22|
|247| sun|    53|
|288|  cc|    94|
+---+----+------+
only showing top 20 rows

 

 

Per-SQL analysis

Here userA is a small table, userB is a large table, listA is a small table, and listB is a large table.

 

Query 1

val resultBigDF = spark
  .sql("SELECT userB.uid, name, salary FROM userB LEFT JOIN listA ON userB.uid = listA.uid")
resultBigDF.show()
resultBigDF.explain(true)

 

As the physical plan shows, userB LEFT JOIN listA uses a BroadcastHashJoin: listA is small enough to fall under the 1 MB auto-broadcast threshold, so it is broadcast and hashed.

== Parsed Logical Plan ==
'Project ['userB.uid, 'name, 'salary]
+- 'Join LeftOuter, ('userB.uid = 'listA.uid)
   :- 'UnresolvedRelation `userB`
   +- 'UnresolvedRelation `listA`

== Analyzed Logical Plan ==
uid: int, name: string, salary: int
Project [uid#58, name#59, salary#70]
+- Join LeftOuter, (uid#58 = uid#69)
   :- SubqueryAlias userb
   :  +- Project [_1#53 AS uid#58, _2#54 AS name#59]
   :     +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true) AS _2#54]
   :        +- ExternalRDD [obj#52]
   +- SubqueryAlias lista
      +- Project [_1#64 AS uid#69, _2#65 AS salary#70]
         +- LocalRelation [_1#64, _2#65]

== Optimized Logical Plan ==
Project [uid#58, name#59, salary#70]
+- Join LeftOuter, (uid#58 = uid#69)
   :- Project [_1#53 AS uid#58, _2#54 AS name#59]
   :  +- SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#54]
   :     +- ExternalRDD [obj#52]
   +- LocalRelation [uid#69, salary#70]

== Physical Plan ==
*Project [uid#58, name#59, salary#70]
+- *BroadcastHashJoin [uid#58], [uid#69], LeftOuter, BuildRight
   :- *Project [_1#53 AS uid#58, _2#54 AS name#59]
   :  +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#54]
   :     +- Scan ExternalRDDScan[obj#52]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- LocalTableScan [uid#69, salary#70]

 

 

 

Query 2

val resultSmallDF = spark
  .sql("SELECT userA.uid, name, salary FROM userA LEFT JOIN listA ON userA.uid = listA.uid")
resultSmallDF.show()
resultSmallDF.explain(true)

userA LEFT JOIN listA also uses a BroadcastHashJoin; here userA is read from the cached InMemoryRelation and listA is broadcast.

== Parsed Logical Plan ==
'Project ['userA.uid, 'name, 'salary]
+- 'Join LeftOuter, ('userA.uid = 'listA.uid)
   :- 'UnresolvedRelation `userA`
   +- 'UnresolvedRelation `listA`

== Analyzed Logical Plan ==
uid: int, name: string, salary: int
Project [uid#8, name#9, salary#70]
+- Join LeftOuter, (uid#8 = uid#69)
   :- SubqueryAlias usera
   :  +- Project [_1#3 AS uid#8, _2#4 AS name#9]
   :     +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true) AS _2#4]
   :        +- ExternalRDD [obj#2]
   +- SubqueryAlias lista
      +- Project [_1#64 AS uid#69, _2#65 AS salary#70]
         +- LocalRelation [_1#64, _2#65]

== Optimized Logical Plan ==
Project [uid#8, name#9, salary#70]
+- Join LeftOuter, (uid#8 = uid#69)
   :- InMemoryRelation [uid#8, name#9], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `userA`
   :     +- *Project [_1#3 AS uid#8, _2#4 AS name#9]
   :        +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#4]
   :           +- Scan ExternalRDDScan[obj#2]
   +- LocalRelation [uid#69, salary#70]

== Physical Plan ==
*Project [uid#8, name#9, salary#70]
+- *BroadcastHashJoin [uid#8], [uid#69], LeftOuter, BuildRight
   :- InMemoryTableScan [uid#8, name#9]
   :     +- InMemoryRelation [uid#8, name#9], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `userA`
   :           +- *Project [_1#3 AS uid#8, _2#4 AS name#9]
   :              +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#4]
   :                 +- Scan ExternalRDDScan[obj#2]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- LocalTableScan [uid#69, salary#70]

 

 

Query 3

val resultBigDF2 = spark
  .sql("SELECT userB.uid, name, salary FROM userB LEFT JOIN listb ON userB.uid = listB.uid")
resultBigDF2.show()
resultBigDF2.explain(true)

userB LEFT JOIN listB, a join between two large tables, uses SortMergeJoin: both sides are shuffled by uid, sorted, and then merged.

== Parsed Logical Plan ==
'Project ['userB.uid, 'name, 'salary]
+- 'Join LeftOuter, ('userB.uid = 'listB.uid)
   :- 'UnresolvedRelation `userB`
   +- 'UnresolvedRelation `listb`

== Analyzed Logical Plan ==
uid: int, name: string, salary: int
Project [uid#58, name#59, salary#81]
+- Join LeftOuter, (uid#58 = uid#80)
   :- SubqueryAlias userb
   :  +- Project [_1#53 AS uid#58, _2#54 AS name#59]
   :     +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true) AS _2#54]
   :        +- ExternalRDD [obj#52]
   +- SubqueryAlias listb
      +- Project [_1#75 AS uid#80, _2#76 AS salary#81]
         +- LocalRelation [_1#75, _2#76]

== Optimized Logical Plan ==
Project [uid#58, name#59, salary#81]
+- Join LeftOuter, (uid#58 = uid#80)
   :- Project [_1#53 AS uid#58, _2#54 AS name#59]
   :  +- SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#54]
   :     +- ExternalRDD [obj#52]
   +- LocalRelation [uid#80, salary#81]

== Physical Plan ==
*Project [uid#58, name#59, salary#81]
+- SortMergeJoin [uid#58], [uid#80], LeftOuter
   :- *Sort [uid#58 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(uid#58, 10)
   :     +- *Project [_1#53 AS uid#58, _2#54 AS name#59]
   :        +- *SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#53, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true) AS _2#54]
   :           +- Scan ExternalRDDScan[obj#52]
   +- *Sort [uid#80 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(uid#80, 10)
         +- LocalTableScan [uid#80, salary#81]

 
