赞
踩
前面,我们学习了Scala及Spark 的基本知识。今天开始进入Spark SQL的学习。
Spark SQL是Apache Spark的用于处理结构化数据的模块,它提供了一个编程抽象叫做DataFrame并且作为分布式SQL查询引擎的作用。Spark SQL一般用于处理离线数据。
DataFrame和DataSet 是Spark SQL的重点,Spark SQL对数据的操作基本都使用DataFrame和DataSet。
DataFrame在概念上可以理解为关系型数据库的表。
创建DataFrame的方式有多种,其中最简单的方式是基于带格式的数据源进行创建,如读取json格式的数据,就可以创建DataFrame了。
#读取一个json格式的文件,它返回的结果就是一个DataFrame scala> var df3 = spark.read.json("/root/emp.json") [Stage 0:> (0 + 0) / 2]20/01/08 11:04:07 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources df3: org.apache.spark.sql.DataFrame = [comm: string, deptno: bigint ... 6 more fields] #查看数据 scala> df3.show +----+------+-----+------+----------+---------+----+----+ |comm|deptno|empno| ename| hiredate| job| mgr| sal| +----+------+-----+------+----------+---------+----+----+ | | 20| 7369| SMITH|1980/12/17| CLERK|7902| 800| | 300| 30| 7499| ALLEN| 1981/2/20| SALESMAN|7698|1600| | 500| 30| 7521| WARD| 1981/2/22| SALESMAN|7698|1250| | | 20| 7566| JONES| 1981/4/2| MANAGER|7839|2975| |1400| 30| 7654|MARTIN| 1981/9/28| SALESMAN|7698|1250| | | 30| 7698| BLAKE| 1981/5/1| MANAGER|7839|2850| | | 10| 7782| CLARK| 1981/6/9| MANAGER|7839|2450| | | 20| 7788| SCOTT| 1987/4/19| ANALYST|7566|3000| | | 10| 7839| KING|1981/11/17|PRESIDENT| |5000| | 0| 30| 7844|TURNER| 1981/9/8| SALESMAN|7698|1500| | | 20| 7876| ADAMS| 1987/5/23| CLERK|7788|1100| | | 30| 7900| JAMES| 1981/12/3| CLERK|7698| 950| | | 20| 7902| FORD| 1981/12/3| ANALYST|7566|3000| | | 10| 7934|MILLER| 1982/1/23| CLERK|7782|1300| +----+------+-----+------+----------+---------+----+----+ scala> #查看表结构 scala> df3.printSchema root |-- comm: string (nullable = true) |-- deptno: long (nullable = true) |-- empno: long (nullable = true) |-- ename: string (nullable = true) |-- hiredate: string (nullable = true) |-- job: string (nullable = true) |-- mgr: string (nullable = true) |-- sal: long (nullable = true) scala> #查看员工的姓名和薪水 scala> df3.select("ename","sal").show +------+----+ | ename| sal| +------+----+ | SMITH| 800| | ALLEN|1600| | WARD|1250| | JONES|2975| |MARTIN|1250| | BLAKE|2850| | CLARK|2450| | SCOTT|3000| | KING|5000| |TURNER|1500| | ADAMS|1100| | JAMES| 950| | FORD|3000| |MILLER|1300| +------+----+ #查看员工的薪水,并且给员工统一涨薪500块钱。 scala> df3.select($"ename",$"sal"+500).show +------+-----------+ | ename|(sal + 500)| +------+-----------+ | SMITH| 1300| | ALLEN| 2100| | WARD| 1750| | JONES| 3475| |MARTIN| 1750| | BLAKE| 3350| | CLARK| 2950| | SCOTT| 3500| | KING| 5500| |TURNER| 2000| | ADAMS| 1600| | JAMES| 1450| | FORD| 3500| |MILLER| 1800| +------+-----------+ 查询薪水大于2000的员工 scala> df3.filter($"sal">2000).show +----+------+-----+-----+----------+---------+----+----+ |comm|deptno|empno|ename| hiredate| job| mgr| sal| +----+------+-----+-----+----------+---------+----+----+ | | 20| 7566|JONES| 1981/4/2| MANAGER|7839|2975| | | 30| 7698|BLAKE| 1981/5/1| MANAGER|7839|2850| | | 10| 7782|CLARK| 1981/6/9| MANAGER|7839|2450| | | 20| 7788|SCOTT| 1987/4/19| ANALYST|7566|3000| | | 10| 7839| KING|1981/11/17|PRESIDENT| |5000| | | 20| 7902| FORD| 1981/12/3| ANALYST|7566|3000| +----+------+-----+-----+----------+---------+----+----+ scala> #分组统计,统计每个部门的员工数 scala> df3.groupBy($"deptno").count.show +------+-----+ |deptno|count| +------+-----+ | 10| 3| | 30| 6| | 20| 5| +------+-----+ scala> 其他的例子,请参考API: http://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.Dataset
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。