Basic RDD operations:
val lines=sc.parallelize(Array("hello","spark","hello","hello","!"))
lines.foreach(println)
val lines2=lines.map(word=>(word,1))
val lines3=lines.filter(word=>word.contains("hello"))
val inputs=sc.textFile("../../testfile/helloSpark")
val lines4=inputs.flatMap(line=>line.split(" "))
val rdd1=sc.parallelize(Array("coffee","coffee","panda","monkey","tea"))
val rdd2=sc.parallelize(Array("coffee","monkey","kitty"))
val rdd_distinct=rdd1.distinct()
val rdd_union=rdd1.union(rdd2)
val rdd_inter=rdd1.intersection(rdd2)
val rdd_sub=rdd1.subtract(rdd2)
val rdd=sc.parallelize(Array(1,2,3,3))
rdd.collect()
rdd.reduce((x,y)=>x+y)
rdd.take(3)
rdd.top(2)
val rdd=sc.textFile("../../testfile/helloSpark")
val rdd2=rdd.map(line=>(line.split(" ")(0),line))
val rdd3=sc.parallelize(Array((1,2),(3,4),(3,6)))
val rdd4=rdd3.reduceByKey((x,y)=>x+y)
rdd4.foreach(println)
Output: (1,2)
(3,10)
val rdd5=rdd3.groupByKey()
val rdd6=rdd3.keys
val rdd7=rdd3.values
val rdd8=rdd3.sortByKey()
rdd8.foreach(println)
Output (sorted by key):
(1,2)
(3,4)
(3,6)
val scores=sc.parallelize(Array(("jake",80.0),("jake",90.0),("jake",85.0),("mike",85.0),("mike",92.0),("mike",90.0)))
val score2=scores.combineByKey(
  score=>(1,score),
  (c1:(Int,Double),newScore:Double)=>(c1._1+1,c1._2+newScore),
  (c1:(Int,Double),c2:(Int,Double))=>(c1._1+c2._1,c1._2+c2._2))
val average=score2.map{case(name,(num,score))=>(name,score/num)}
combineByKey() is the most commonly used key-based aggregation function; it takes a createCombiner, a mergeValue, and a mergeCombiners function. Above, it builds a (count, sum) pair per key so the average can be computed.
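The per-key averaging logic that combineByKey performs can be sketched without a cluster using plain Scala collections (a rough equivalent for experimenting; the object and method names here are made up for illustration):

```scala
// Sketch: per-key (count, sum) accumulation and averaging,
// mirroring the combineByKey example above, on plain collections.
object AverageSketch {
  def averages(pairs: Seq[(String, Double)]): Map[String, Double] =
    pairs
      .groupBy(_._1)                                   // group scores by name
      .map { case (name, xs) =>
        // (count, sum) accumulator, like the combiner in combineByKey
        val (num, sum) = xs.foldLeft((0, 0.0)) {
          case ((n, s), (_, score)) => (n + 1, s + score)
        }
        name -> sum / num                              // average per key
      }

  def main(args: Array[String]): Unit = {
    val scores = Seq(("jake", 80.0), ("jake", 90.0), ("jake", 85.0),
                     ("mike", 85.0), ("mike", 92.0), ("mike", 90.0))
    println(averages(scores))  // jake -> 85.0, mike -> 89.0
  }
}
```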
WordCount example:
Load the input file into an RDD:
val textfile=sc.textFile("../../testfile/helloSpark")
Contents of helloSpark:
hello Spark
hello World
hello Spark !
MapReduce-style operations, using each word as the key and 1 as the value:
val wordcounts=textfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
Check how many times each word appears:
wordcounts.collect()
Output: res1: Array[(String, Int)] = Array((Spark,2), (World,1), (!,1), (hello,3))
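The same flatMap → map → reduceByKey pipeline can be checked locally with plain Scala collections, using groupBy plus a sum in place of the shuffle (the object name here is made up for illustration):

```scala
// Sketch: the WordCount pipeline above, reproduced on plain
// collections so it runs without a SparkContext.
object WordCountSketch {
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))          // split each line into words
      .map(word => (word, 1))         // pair each word with a count of 1
      .groupBy(_._1)                  // group pairs by word, like reduceByKey's shuffle
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }  // sum the counts

  def main(args: Array[String]): Unit = {
    val lines = Seq("hello Spark", "hello World", "hello Spark !")
    println(wordCounts(lines))  // hello -> 3, Spark -> 2, World -> 1, ! -> 1
  }
}
```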
Spark example:
val data=sc.parallelize(Array(1,2,3,4,5)) // turn data into an RDD
data.reduce(_ + _) // run a computation on the RDD: sum its elements
Output: res0: Int = 15