赞
踩
首先将数据下载下来:
hdfs dfs -put /export/servers/part-r-00000 /JobData/output
先检查mysql是否启动用
mysql -uroot -p123456
进入hive的目录
cd /export/servers/apache-hive-1.2.1-bin/bin/
启动hive
./hive
1.创建数据仓库 jobdata
- create database jobdata;
-
- use jobdata;
2.创建事实表 ods_jobdata_origin
- CREATE TABLE ods_jobdata_origin(
-
- city string COMMENT '城市',
-
- salary array<String> COMMENT '薪资',
-
- company array<String> COMMENT '福利',
-
- kill array<String> COMMENT '技能')
-
- COMMENT '原始职位数据表'
-
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
-
- COLLECTION ITEMS TERMINATED BY '-'
-
- STORED AS TEXTFILE;
3.(1)将HDFS上的预处理数据导入到事实表中,命令如下:
LOAD DATA INPATH '/JobData/output/part-r-00000' OVERWRITE INTO TABLE ods_jobdata_origin;
(2)查看事实表中的数据,验证数据是否导入成功,命令如下:
select * from ods_jobdata_origin;
4.创建明细表ods_jobdata_detail用于存储事实表细化的数据内容
- CREATE TABLE ods_jobdata_detail(
-
- city string COMMENT '城市',
-
- salary array<String> COMMENT '薪资',
-
- company array<String> COMMENT '福利',
-
- kill array<String> COMMENT '技能',
-
- low_salary int COMMENT '低薪资',
-
- high_salary int COMMENT '高薪资',
-
- avg_salary double COMMENT '平均薪资')
-
- COMMENT '职位数据明细表'
-
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
-
- STORED AS TEXTFILE;
5.向明细表ods_jobdata_detail表中加载数据
- insert overwrite table ods_jobdata_detail
-
- select city,salary,company,kill,salary[0],salary[1],(salary[0]+salary[1])/2
-
- from ods_jobdata_origin;
6.对薪资字段内容进行扁平化处理,将处理结果储存到临时中间表t_ods_tmp_s
create table t_ods_tmp_salary as select explode(ojo.salary) from ods_jobdata_origin ojo;
7. 对t_ods_tmp_salary表的每一条数据进行泛化处理,将处理结果储存到中间表t_ods_tmp_salary_dist中,命令如下
- create table t_ods_tmp_salary_dist as
-
- select case when col>=0 and col<=5 then "0-5"
-
- when col>=6 and col<=10 then "6-10"
-
- when col>=11 and col<=15 then "11-15"
-
- when col>=16 and col<=20 then "16-20"
-
- when col>=21 and col<=25 then "21-25"
-
- when col>=26 and col<=30 then "26-30"
-
- when col>=31 and col<=35 then "31-35"
-
- when col>=36 and col<=40 then "36-40"
-
- when col>=41 and col<=45 then "41-45"
-
- when col>=46 and col<=50 then "46-50"
-
- when col>=51 and col<=55 then "51-55"
-
- when col>=56 and col<=60 then "56-60"
-
- when col>=61 and col<=65 then "61-65"
-
- when col>=66 and col<=70 then "66-70"
-
- when col>=71 and col<=75 then "71-75"
-
- when col>=76 and col<=80 then "76-80"
-
- when col>=81 and col<=85 then "81-85"
-
- when col>=86 and col<=90 then "86-90"
-
- when col>=91 and col<=95 then "91-95"
-
- when col>=96 and col<=100 then "96-100"
-
- when col>=101 then ">101" end from t_ods_tmp_salary;
8.对福利标签字段内容进行扁平化处理,将处理结果储存到临时中间表t_ods_tmp_company,命令如下:
create table t_ods_tmp_company as select explode(ojo.company) from ods_jobdata_origin ojo;
9.对技能标签字段内容进行扁平化处理,将处理结果储存到临时中间表t_ods_tmp_kill,命令如下:
create table t_ods_tmp_kill as select explode(ojo.kill) from ods_jobdata_origin ojo;
10.创建维度表t_ods_kill,用于储存技能标签的统计结果,命令如下
- create table t_ods_kill(
-
- every_kill String comment '技能标签',
-
- count int comment '词频')
-
- COMMENT '技能标签词频统计'
-
- ROW FORMAT DELIMITED
-
- fields terminated by ','
-
- STORED AS TEXTFILE;
11.创建维度表t_ods_company,用于储存福利标签的统计结果,命令如下
- create table t_ods_company(
-
- every_company String comment '福利标签',
-
- count int comment '词频')
-
- COMMENT '福利标签词频统计'
-
- ROW FORMAT DELIMITED
-
- fields terminated by ','
-
- STORED AS TEXTFILE;
12.创建维度表t_ods_salary,用于储存薪资分布的统计结果,命令如下
- create table t_ods_salary(
-
- every_partition String comment '薪资分布',
-
- count int comment '聚合统计')
-
- COMMENT '薪资分布聚合统计'
-
- ROW FORMAT DELIMITED
-
- fields terminated by ','
-
- STORED AS TEXTFILE;
13.创建维度表t_ods_city,用于储存城市的统计结果,命令如下
- create table t_ods_city(
-
- every_city String comment '城市',
-
- count int comment '词频')
-
- COMMENT '城市统计'
-
- ROW FORMAT DELIMITED
-
- fields terminated by ','
-
- STORED AS TEXTFILE;
1.通过对大数据相关职位分布的分析,使读者了解该职位在全国各城市的需求状况,命令如下
- insert overwrite table t_ods_city
-
- select city,count(1) from ods_jobdata_origin group by city;
2.查看维度表t_ods_city中的分析结果,命令如下
select * from t_ods_city sort by count desc;
3.分析大数据职位全国薪资分布情况,将分析结果储存在维度表t_ods_salary中,命令如下
- insert overwrite table t_ods_salary
-
- select `_c0`,count(1) from t_ods_tmp_salary_dist group by `_c0`;
4.查看维度表t_ods_salary中的分析结果,命令如下
select * from t_ods_salary sort by count desc;
5.分析薪资的平均值,命令如下
select avg(avg_salary) from ods_jobdata_detail;
6.分析薪资的众数,命令如下
select avg_salary,count(1) as cnt from ods_jobdata_detail group by avg_salary order by cnt desc limit 1;
7.分析薪资的中位数,命令如下
select percentile(cast(avg_salary as BIGINT), 0.5) from ods_jobdata_detail;
8.分析各城市平均薪资待遇,命令如下
select city,count(city),round(avg(avg_salary),2) as cnt from ods_jobdata_detail group by city order by cnt desc;
9.通过对公司福利字段进行分析,了解大数据职位相关公司对员工福利常用的标签都有哪些,命令如下
insert overwrite table t_ods_company select col,count(1) from t_ods_tmp_company group by col;
10.查看维度表t_ods_company中的分析结果,命令如下
select every_company,count from t_ods_company sort by count desc limit 10;
11. 通过对技能标签分析,了解要从事大数据相关工作需要掌握哪些技能,招聘公司比较重视哪些技能,命令如下
insert overwrite table t_ods_kill select col,count(1) from t_ods_tmp_kill group by col;
12.查看维度表t_ods_kill中的分析结果,命令如下
select every_kill,count from t_ods_kill sort by count desc limit 3;
之后你会在你的hive表中看到:
这样我们就把数据处理完了,时候我们需要通过sqoop工具将数据迁移至mysql中。我们下回来说!
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。