rdd = sc.parallelize([('c01','张三'),('c02','李四'),('c02','王五'),('c01','赵六'),('c01','田七'),('c02','周八'),('c03','李九')])

Requirement: group the records by class (the key) and compute per-group statistics.

rdd.groupByKey().collect()

Result (each value is a lazy ResultIterable, not a list):
[
    ('c01', <pyspark.resultiterable.ResultIterable object at 0x7f09aced8b80>),
    ('c02', <pyspark.resultiterable.ResultIterable object at 0x7f09ace7f4f0>),
    ('c03', <pyspark.resultiterable.ResultIterable object at 0x7f09ace7f580>)
]

To see the grouped elements, convert each ResultIterable to a list with mapValues(list):

rdd.groupByKey().mapValues(list).collect()

Result:
[
    ('c01', ['张三', '赵六', '田七']),
    ('c02', ['李四', '王五', '周八']),
    ('c03', ['李九'])
]

How do we count how many records each group contains? Map each (key, list) pair to (key, length of list):

rdd.groupByKey().mapValues(list).map(lambda kv: (kv[0], len(kv[1]))).collect()

Result:
[('c01', 3), ('c02', 3), ('c03', 1)]
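If counting is all you need, materializing the full groups is unnecessary work: groupByKey() shuffles every value across the network. A minimal sketch of two lighter alternatives, assuming the same rdd as above (reduceByKey combines partial counts on each partition before the shuffle; countByKey is an action that returns the counts to the driver):

# Map each record to (key, 1), then sum the 1s per key;
# partial sums are merged map-side, so full value lists never move.
rdd.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b).collect()
# [('c01', 3), ('c02', 3), ('c03', 1)]  (order may vary)

# countByKey() returns the per-key counts to the driver as a dict-like object.
dict(rdd.countByKey())
# {'c01': 3, 'c02': 3, 'c03': 1}

Use countByKey() only when the number of distinct keys is small enough to fit on the driver; reduceByKey keeps the result distributed as an RDD.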