Contents
(1) Configuring the Spark SQL CLI.
(2) Operating Hive tables from the Spark SQL CLI.
An airline has accumulated a large volume of member profiles and flight records. Detailed data for all passengers between 2012-04-01 and 2014-03-31 were extracted, 62,988 records in total, including attributes such as membership card number, enrollment date, and gender.
The raw data contain records where the fare revenue (sum_yr_1) is null, which may indicate customers with no flight records. There are also records where the fare revenue (sum_yr_1) is 0 and the average discount rate (avg_discount) is 0 while the total flight kilometers (seg_km_sum) is greater than 0; these may result from customers flying on free (zero-fare) tickets or redeeming points. The task is to count the null-value records of the three fields sum_yr_1, seg_km_sum and avg_discount and save the result to the null_count table, and to compute the minimum values of the same three fields and save the result to the min_count table.
(1) Check whether the Spark SQL CLI has been configured; if not, configure the Spark SQL CLI environment first.
(2) Start spark-sql.
(3) Create the database air, and create the table air_data_base in it; refer to the code below for the CREATE TABLE statement.
(4) Load the data into the air_data_base table.
(5) Use hiveContext.sql() to count the null-value records of the SUM_YR_1, SEG_KM_SUM and AVG_DISCOUNT fields and save the result to the null_count table.
(6) Use hiveContext.sql() to compute the minimum values of the SUM_YR_1, SEG_KM_SUM and AVG_DISCOUNT fields and save the result to the min_count table (a hiveContext setup sketch follows this list).
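The walkthrough below performs all of the work in the Hive CLI. For steps (2), (5) and (6), which mention hiveContext.sql(), the following is only a minimal sketch of the Spark side, assuming a Spark 1.x spark-shell built with Hive support and able to reach the same metastore (typically by copying hive-site.xml into Spark's conf directory):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // `sc` is the SparkContext provided by spark-shell
hiveContext.sql("use air")              // switch to the air database created in section II
hiveContext.sql("show tables").show()   // sanity check: air_data_base should be listed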
I. Start the Hive metastore service and log in to Hive with the Hive CLI client
1. Use the jps command to check whether the RunJar process exists
[root@node1 hive237]# jps
If it does not exist, start the metastore service (this launches the RunJar process):
[root@node1 bin]# nohup /myserver/hive237/bin/hive --service metastore &
2. Log in to Hive with the Hive CLI client (on node node1)
[root@node1 bin]# /myserver/hive237/bin/hive
II. Create the database and table in Hive
1. Create the database air
hive> create database air;
OK
Time taken: 9.871 seconds
hive> show databases;
OK
air
default
law
Time taken: 0.223 seconds, Fetched: 3 row(s)
2. Create the table in the air database
use air;
create table air_data_base(
member_no string,
ffp_date string,
first_flight_date string,
gender string,
ffp_tier int,
work_city string,
work_province string,
work_country string,
age int,
load_time string,
flight_count int,
bp_sum bigint,
ep_sum_yr_1 int,
ep_sum_yr_2 bigint,
sum_yr_1 bigint,
sum_yr_2 bigint,
seg_km_sum bigint,
weighted_seg_km double,
last_flight_date string,
avg_flight_count double,
avg_bp_sum double,
begin_to_first int,
last_to_end int,
avg_interval float,
max_interval int,
add_points_sum_yr_1 bigint,
add_points_sum_yr_2 bigint,
exchange_count int,
avg_discount float,
p1y_flight_count int,
l1y_flight_count int,
p1y_bp_sum bigint,
l1y_bp_sum bigint,
ep_sum bigint,
add_point_sum bigint,
eli_add_point_sum bigint,
l1y_eli_add_points bigint,
points_sum bigint,
l1y_points_sum float,
ration_l1y_flight_count float,
ration_p1y_flight_count float,
ration_p1y_bps float,
ration_l1y_bps float,
point_notflight int
) row format delimited fields terminated by ','
STORED AS TEXTFILE;
show tables;
3. Load the data into the air_data_base table
(1) Upload air_data_base.txt to the root directory on node1, then put it into the /user/root/ directory on HDFS (run this from a newly opened SSH session on node1):
[root@node1 ~]# hdfs dfs -put air_data_base.txt /user/root/
(2) Load the data
load data inpath '/user/root/air_data_base.txt' overwrite into table air_data_base;
hive> select count(*) from air_data_base;
III. Data analysis
1. Count the null-value records of the fare revenue in the observation window (SUM_YR_1), the total flight kilometers in the observation window (SEG_KM_SUM) and the average discount rate (AVG_DISCOUNT), and save the result to a table named null_count.
hive> create table null_count as
select * from
(select count(*) as sum_yr_1_null_count from air_data_base where sum_yr_1 is null) sum_yr_1_null,
(select count(*) as seg_km_sum_null_count from air_data_base where seg_km_sum is null) seg_km_sum_null,
(select count(*) as avg_discount_null_count from air_data_base where avg_discount is null) avg_discount_null;

hive> select * from null_count;
592 1 1
Time taken: 0.45 seconds, Fetched 1 row(s)
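For reference, the same null_count statistics written with hiveContext.sql(), as step (5) asks. This is a sketch only, assuming the hiveContext created in the earlier snippet and a Spark build whose HiveQL parser accepts the comma-separated derived tables used above:

hiveContext.sql("use air")
hiveContext.sql(
  """create table null_count as
    |select * from
    |(select count(*) as sum_yr_1_null_count from air_data_base where sum_yr_1 is null) sum_yr_1_null,
    |(select count(*) as seg_km_sum_null_count from air_data_base where seg_km_sum is null) seg_km_sum_null,
    |(select count(*) as avg_discount_null_count from air_data_base where avg_discount is null) avg_discount_null
  """.stripMargin)
hiveContext.sql("select * from null_count").show()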
2. Compute the minimum values of SUM_YR_1 (fare revenue), SEG_KM_SUM (total flight kilometers) and AVG_DISCOUNT (average discount rate) in the observation window of the air_data_base table, and save them to the min_count table.
hive> create table min_count as
select min(sum_yr_1) as sum_yr_1_min,
min(seg_km_sum) as seg_km_sum_min,
min(avg_discount) as avg_discount_min from air_data_base;

hive> select * from min_count;
0 368 0.0
Time taken: 0.367 seconds, Fetched 1 row(s)
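Likewise, step (6) via hiveContext.sql(), again only a sketch that assumes the hiveContext created earlier:

hiveContext.sql("use air")
hiveContext.sql(
  """create table min_count as
    |select min(sum_yr_1) as sum_yr_1_min,
    |       min(seg_km_sum) as seg_km_sum_min,
    |       min(avg_discount) as avg_discount_min
    |from air_data_base
  """.stripMargin)
hiveContext.sql("select * from min_count").show()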