
Spark SQL: Structured Data File Processing 01 — Structured Data Processing Experiment

Preparation

Install Hive and MySQL on Linux (refer to a separate installation tutorial for the detailed setup steps).

Mastering basic DataFrame operations

Creating DataFrame objects

Creating a DataFrame from structured data files

hdfs dfs -mkdir /user/root/sparksql
hdfs dfs -put /home/xwk/software/spark/examples/src/main/resources/users.parquet /user/root/sparksql
hdfs dfs -put /home/xwk/software/spark/examples/src/main/resources/people.json /user/root/sparksql
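You can verify the upload by listing the target directory:

hdfs dfs -ls /user/root/sparksql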
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
scala> val sqlContext=new SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@438e1537
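(In the Spark 1.x shell a SQLContext is normally pre-created and bound to sqlContext, so constructing one explicitly here is redundant but harmless.)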
scala> val dfUsers=sqlContext.read.load("hdfs://master/user/root/sparksql/users.parquet")
dfUsers: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string, favorite_numbers: array<int>]
scala> val dfPeople=sqlContext.read.json("hdfs://master/user/root/sparksql/people.json")
dfPeople: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
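The dedicated reader methods above are shorthand for the generic format/load API (load with no format defaults to Parquet); a minimal equivalent sketch for the same two files:

val dfUsers2 = sqlContext.read.format("parquet").load("hdfs://master/user/root/sparksql/users.parquet")
val dfPeople2 = sqlContext.read.format("json").load("hdfs://master/user/root/sparksql/people.json")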

Creating a DataFrame from an external database

This assumes the target database and table already exist, and that the MySQL JDBC driver is on the spark-shell classpath.
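For example, the shell can be launched with the connector attached (the jar path below is a placeholder for wherever your MySQL connector actually lives; on some Spark 1.x versions you may also need --driver-class-path):

spark-shell --jars /path/to/mysql-connector-java.jar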

scala> val url="jdbc:mysql://192.168.10.20:3306/hive"
url: String = jdbc:mysql://192.168.10.20:3306/hive

scala> val jdbcDF=sqlContext.read.format("jdbc").options(
     | Map("url"->url,
     | "user"->"root",
     | "password"->"123456",
     | "dbtable"->"DBS")).load()
jdbcDF: org.apache.spark.sql.DataFrame = [DB_ID: bigint, DESC: string, DB_LOCATION_URI: string, NAME: string, OWNER_NAME: string, OWNER_TYPE: string]
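DBS is one of the tables of Hive's metastore schema, which this MySQL database ("hive" in the JDBC URL) holds; that is why the resulting columns describe Hive databases.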

Creating a DataFrame from an RDD

scala> case class Person(name:String,age:Int)
defined class Person

scala> val data=sc.textFile("/user/root/sparksql/user.txt").map(_.split(","))
data: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[8] at map at <console>:28

scala> val people=data.map(p=>Person(p(0),p(1).trim.toInt)).toDF()
people: org.apache.spark.sql.DataFrame = [name: string, age: int]
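Note: toDF() comes from the SQLContext implicits, which the spark-shell imports automatically; in a standalone application you would add the import yourself. The example also assumes each line of user.txt is a comma-separated name,age pair (e.g. Tom,21).

import sqlContext.implicits._ // enables .toDF() on RDDs of case-class instances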

Creating a DataFrame from a Hive table

scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> val hiveContext=new HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@27f22d1a
scala> hiveContext.sql("use test")
res4: org.apache.spark.sql.DataFrame = [result: string]

scala> val people=hiveContext.sql("select * from students")
people: org.apache.spark.sql.DataFrame = [id: int, name: string, score: double, classes: string]
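Note that HiveContext only sees your existing Hive metastore if Hive's hive-site.xml is on Spark's classpath (typically copied into $SPARK_HOME/conf); otherwise Spark creates an empty local metastore. Once connected, the catalog can also be browsed from the shell:

hiveContext.sql("show tables").show()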

Viewing data in a DataFrame

Use SparkContext to read the movies.dat data and convert it to a DataFrame:

scala> case class Movie(movieId:Int,title:String,Genres:String)
defined class Movie

scala> val data=sc.textFile("hdfs://master/user/root/sparksql/movies.dat").map(_.split("::"))
data: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[20] at map at <console>:34

scala> val movies=data.map(m=>Movie(m(0).trim.toInt,m(1),m(2))).toDF()
movies: org.apache.spark.sql.DataFrame = [movieId: int, title: string, Genres: string]
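Each line of movies.dat is "::"-delimited, e.g. 1::Toy Story (1995)::Animation|Children's|Comedy, which is why the parser splits on "::".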

printSchema: print the schema

scala> movies.printSchema
root
 |-- movieId: integer (nullable = false)
 |-- title: string (nullable = true)
 |-- Genres: string (nullable = true)

show: view the data

[Screenshots of show() output omitted.] show() displays the first 20 rows by default and truncates long string values to 20 characters; show(n) limits the number of rows instead. The boolean variant controls character truncation rather than the row count, so with truncation disabled every value is displayed in full.
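A sketch of the calls those screenshots likely correspond to (the boolean truncate overloads of show exist in newer Spark 1.x releases):

movies.show()        // first 20 rows; long strings truncated to 20 characters
movies.show(5)       // first 5 rows only
movies.show(false)   // default row count, but values shown in full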

first/head/take/takeAsList: retrieve the first few rows

scala> movies.first // get the first row
res14: org.apache.spark.sql.Row = [1,Toy Story (1995),Animation|Children's|Comedy]

scala> movies.head(2) // get the first 2 rows
res15: Array[org.apache.spark.sql.Row] = Array([1,Toy Story (1995),Animation|Children's|Comedy], [2,Jumanji (1995),Adventure|Children's|Fantasy])

scala> movies.take(2) // get the first 2 rows
res16: Array[org.apache.spark.sql.Row] = Array([1,Toy Story (1995),Animation|Children's|Comedy], [2,Jumanji (1995),Adventure|Children's|Fantasy])

scala> movies.takeAsList(2) // get the first 2 rows as a java.util.List
res17: java.util.List[org.apache.spark.sql.Row] = [[1,Toy Story (1995),Animation|Children's|Comedy], [2,Jumanji (1995),Adventure|Children's|Fantasy]]

collect/collectAsList: retrieve all the data

movies.collect()
movies.collectAsList()
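Both methods pull the entire dataset back to the driver (collect as an Array, collectAsList as a java.util.List), so they should only be used on DataFrames known to be small; prefer show or take otherwise.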