Function definitions
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
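The three overloads differ only in how the result is partitioned: the zero-argument form reuses the parent RDD's partitioning or the default parallelism, the numPartitions form hashes keys into that many partitions, and the third accepts an explicit Partitioner. A minimal sketch of the latter two (assuming an existing SparkContext named sc; the variable names are illustrative):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
// Hash keys into 4 partitions
val grouped4 = pairs.groupByKey(4)
// Same effect, but with an explicit Partitioner instance
val groupedHash = pairs.groupByKey(new HashPartitioner(4))
println(grouped4.getNumPartitions) // 4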
groupByKey groups an RDD[(K, V)] by key, producing an RDD[(K, Iterable[V])]. It is conceptually similar to SQL's GROUP BY, for example MySQL's group_concat: it collects all values for each key rather than aggregating them.
groupByKey takes no aggregation function, so every value must be shuffled across the network. Compared with reduceByKey, which combines values on the map side before the shuffle, groupByKey is therefore more expensive.
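When the end goal is an aggregate such as a per-key sum, reduceByKey produces the same result with less shuffle traffic, because it combines values inside each partition before anything crosses the network. A minimal sketch of the contrast (again assuming an existing SparkContext named sc):

val scores = sc.parallelize(List(("zhangsan", 97), ("zhangsan", 87), ("lisi", 95)))
// groupByKey ships every raw value across the shuffle, then sums on the reducer side
val sumsViaGroup = scores.groupByKey().mapValues(_.sum)
// reduceByKey pre-aggregates per partition before shuffling
val sumsViaReduce = scores.reduceByKey(_ + _)
// Both yield (zhangsan,184) and (lisi,95)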
Example: grouping students' scores
import org.apache.spark.{SparkConf, SparkContext}

object GroupByKeyScala {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("groupByKey")
    val sc = new SparkContext(conf)
    val scoreDetails = sc.parallelize(List(("zhangsan", 97), ("zhangsan", 87), ("xiaoming", 75), ("lisi", 95), ("lisi", 88)))
    // Group by name: (name, (score1, score2, ...))
    val groupByKeyRDD = scoreDetails.groupByKey()
    groupByKeyRDD.collect.foreach(println)
    // Print in (name, score) form, one line per value
    groupByKeyRDD.collect.foreach { x =>
      val name = x._1
      val scores = x._2
      scores.foreach(score => println((name, score)))
    }
    sc.stop()
  }
}
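Running the Scala example first prints the grouped form, e.g. (zhangsan,CompactBuffer(97, 87)), then one (name, score) line per value; the order of keys can differ between runs because it depends on how the data is partitioned.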
The same example in Java:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class GroupByKeyJava {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("groupByKey");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // parallelizePairs builds a JavaPairRDD, which exposes groupByKey
        JavaPairRDD<String, Float> scoreDetails = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("zhangsan", 97F), new Tuple2<>("zhangsan", 87F),
                new Tuple2<>("xiaoming", 75F), new Tuple2<>("lisi", 95F), new Tuple2<>("lisi", 88F)));
        // Group scores by name: (name, [score1, score2, ...])
        JavaPairRDD<String, Iterable<Float>> grouped = scoreDetails.groupByKey();
        grouped.collect().forEach(System.out::println);
        sc.stop();
    }
}