1) spark-shell version
In spark-shell, the SparkContext and SQLContext objects are already created for you.
2) Code:
./spark-shell --master spark://hdp-1:7077 --executor-memory 500m --total-executor-cores 1
var seq = Seq(("1","xiaoming",15),("2","xiaohong",20),("3","xiaoben",10))
Output: seq: Seq[(String, String, Int)] = List((1,xiaoming,15), (2,xiaohong,20), (3,xiaoben,10))
var rdd1 = sc.parallelize(seq)
Output: rdd1: org.apache.spark.rdd.RDD[(String, String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:26
Convert the current RDD to a DataFrame (the DataFrame stores the data together with its schema).
// default column names and types: _1: string, _2: string, _3: int
rdd1.toDF
val df = rdd1.toDF("id","name","age")
Output: df: org.apache.spark.sql.DataFrame = [id: string, name: string ... 1 more field]
_1 is the column name, and String is that column's data type.
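To make the naming rule concrete, here is a minimal sketch contrasting the two toDF calls (the dfDefault/dfNamed names are just illustrative):
// toDF with no arguments keeps the generated tuple names _1, _2, _3
val dfDefault = rdd1.toDF
dfDefault.columns  // Array(_1, _2, _3)
// toDF with arguments replaces them with the given column names
val dfNamed = rdd1.toDF("id", "name", "age")
dfNamed.columns    // Array(id, name, age)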
// View the data with the show operator; show is an action-type operator
df.show
Result:
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|xiaoming| 15|
|  2|xiaohong| 20|
|  3| xiaoben| 10|
+---+--------+---+
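show prints the first 20 rows by default; a small sketch of two common overloads that also exist in the DataFrame API:
df.show(2)      // print only the first 2 rows
df.show(false)  // print all rows without truncating long column values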
df.select("name").show
Result:
+--------+
|    name|
+--------+
|xiaoming|
|xiaohong|
| xiaoben|
+--------+
df.select("name","age").show
Result:
+--------+---+
|    name|age|
+--------+---+
|xiaoming| 15|
|xiaohong| 20|
| xiaoben| 10|
+--------+---+
// Conditional filtering
// The argument must be a string; the expression inside filter is also a string
df.select("name","age").filter("age > 10").show
Result:
+--------+---+
|    name|age|
+--------+---+
|xiaoming| 15|
|xiaohong| 20|
+--------+---+
df.select("name","age").filter(col("age") > 10).show
Result:
+--------+---+
|    name|age|
+--------+---+
|xiaoming| 15|
|xiaohong| 20|
+--------+---+
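For reference, the same filter can also be written with the $ column syntax, assuming the shell's auto-imported implicits (import sqlContext.implicits._) are in scope:
df.select("name","age").filter($"age" > 10).show  // same rows as above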
df.groupBy("age").count().show()
Result:
+---+-----+
|age|count|
+---+-----+
| 20|    1|
| 15|    1|
| 10|    1|
+---+-----+
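groupBy is not limited to count; the functions object (already in scope here, which is why col worked above) provides other aggregates. A minimal sketch over the whole DataFrame:
df.agg(avg("age"), max("age"), min("age")).show  // expected: avg 15.0, max 20, min 10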
df.printSchema
Result:
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
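Note the nullable flags: the tuple's Int field becomes a non-nullable integer column. As an alternative sketch, mapping the tuples to a case class lets toDF infer the column names from the field names (Person is just an illustrative name):
// with a case class, toDF infers column names from the fields
case class Person(id: String, name: String, age: Int)
val personDF = rdd1.map(t => Person(t._1, t._2, t._3)).toDF
personDF.printSchema  // same schema: id, name, age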
1. Registering the DataFrame as a (temporary) table
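A minimal sketch of that step (the table name t_person is illustrative):
df.registerTempTable("t_person")  // Spark 1.x API; Spark 2.x uses df.createOrReplaceTempView
sqlContext.sql("select name, age from t_person where age > 10").show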