
Spark Operators: Basic RDD Transformations – coalesce and repartition

1. coalesce

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]

This function repartitions an RDD. The first parameter is the target number of partitions; the second indicates whether to perform a shuffle, and defaults to false. When shuffle is true, records are redistributed across the new partitions using a hash-based assignment.

Tested in the spark-shell:

  scala> var data = sc.textFile("example.txt")
  data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[53] at textFile at :21
  scala> data.collect
  res1: Array[String] = Array(hello world, hello spark, hello hive, hi spark)
  scala> data.partitions.size
  res2: Int = 2  // RDD data has two partitions by default
  scala> var rdd1 = data.coalesce(1)
  rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[2] at coalesce at :23
  scala> rdd1.partitions.size
  res3: Int = 1  // rdd1 now has a single partition
  scala> var rdd1 = data.coalesce(4)
  rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[3] at coalesce at :23
  scala> rdd1.partitions.size
  res4: Int = 2  // if the target count exceeds the current partition count, shuffle must be true; otherwise the partition count is unchanged
  scala> var rdd1 = data.coalesce(4, true)
  rdd1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at :23
  scala> rdd1.partitions.size
  res5: Int = 4
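To build intuition for why a shuffle-free coalesce cannot increase the partition count, here is a minimal pure-Scala sketch (no Spark dependency; `coalesceNoShuffle` is a hypothetical helper, not Spark's actual implementation). Without a shuffle, coalesce can only group existing parent partitions together, never split or redistribute their records:

```scala
// Hypothetical model of a shuffle-free coalesce: each Vector[T] stands in
// for one partition. Parent partitions are only merged, never split, so the
// result can never have MORE partitions than the input (mirroring res4 above).
def coalesceNoShuffle[T](partitions: Vector[Vector[T]], numPartitions: Int): Vector[Vector[T]] =
  if (numPartitions >= partitions.length) {
    partitions // cannot grow without a shuffle: partition count stays unchanged
  } else {
    partitions.zipWithIndex
      .groupBy { case (_, idx) => idx % numPartitions } // assign parents to groups
      .toVector
      .sortBy(_._1)
      .map { case (_, group) => group.flatMap(_._1) }   // merge each group's records
  }

val parts = Vector(Vector(1, 2), Vector(3, 4), Vector(5, 6), Vector(7, 8))
val merged = coalesceNoShuffle(parts, 2)  // 4 partitions merged down to 2
val grown  = coalesceNoShuffle(parts, 8)  // request more: still 4 partitions
```

Because merging only concatenates whole parent partitions, no records move between executors, which is why `coalesce(n)` with `shuffle = false` is cheap but capped at the current partition count.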

2. repartition

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

This function is simply coalesce with the shuffle parameter set to true.

Tested in the spark-shell:

  scala> var rdd2 = data.repartition(1)
  rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at :23
  scala> rdd2.partitions.size
  res6: Int = 1
  scala> var rdd2 = data.repartition(4)
  rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at repartition at :23
  scala> rdd2.partitions.size
  res7: Int = 4
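The transcript above shows repartition growing the partition count, which coalesce alone could not do. A minimal pure-Scala sketch of why (again a hypothetical model, not Spark's code): with a full shuffle, every record is reassigned to a target partition, so any partition count is reachable.

```scala
// Hypothetical model of repartition(n), i.e. coalesce(n, shuffle = true):
// flatten all records, then hash every record to one of n target partitions.
def repartitionSim[T](partitions: Vector[Vector[T]], numPartitions: Int): Vector[Vector[T]] = {
  val all = partitions.flatten // a shuffle touches every record
  Vector.tabulate(numPartitions) { i =>
    // non-negative hash modulo picks the target partition for each record
    all.filter(x => ((x.hashCode % numPartitions) + numPartitions) % numPartitions == i)
  }
}

val parts = Vector(Vector(1, 2, 3), Vector(4, 5, 6))
val wider = repartitionSim(parts, 4) // 2 partitions redistributed into 4
```

The trade-off follows directly: repartition can both grow and shrink the partition count, but always pays the full shuffle cost, so when you only need to reduce partitions, `coalesce(n)` with the default `shuffle = false` is usually the cheaper choice.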

 
