
Big Data: A Visualization Experiment on Lagou Job-Posting Data, a Beginner's Tutorial (Part 6: Processing the Data with Hive)


First, upload the preprocessed result file to HDFS:

hdfs dfs -put /export/servers/part-r-00000 /JobData/output

Before starting Hive, check that MySQL (which holds the Hive metastore) is running:

mysql -uroot -p123456

Change into Hive's bin directory:

cd /export/servers/apache-hive-1.2.1-bin/bin/

Start Hive:

./hive

I. Loading the Data into the Data Warehouse

1. Create the data warehouse (database) jobdata:

  create database jobdata;
  use jobdata;

2. Create the fact table ods_jobdata_origin:

  CREATE TABLE ods_jobdata_origin(
  city string COMMENT 'city',
  salary array<String> COMMENT 'salary range',
  company array<String> COMMENT 'benefits',
  kill array<String> COMMENT 'skills')
  COMMENT 'raw job data table'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '-'
  STORED AS TEXTFILE;
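The two delimiter clauses above mean each input line stores its four fields separated by commas, with the elements inside each array field separated by hyphens. A minimal Python sketch of how such a record gets split (the sample line is made up for illustration):

```python
def parse_record(line):
    """Split one text record the way the fact table's ROW FORMAT declares:
    fields separated by ',' and array elements separated by '-'."""
    city, salary, company, kill = line.strip().split(",")
    return {
        "city": city,
        "salary": salary.split("-"),   # becomes array<string> in Hive
        "company": company.split("-"),
        "kill": kill.split("-"),
    }

# A made-up sample line in the preprocessed format
record = parse_record("北京,15-25,五险一金-弹性工作,Hadoop-Hive-Spark")
print(record["salary"])  # ['15', '25']
```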

3. (1) Load the preprocessed data from HDFS into the fact table:

LOAD DATA INPATH '/JobData/output/part-r-00000' OVERWRITE INTO TABLE ods_jobdata_origin;

(2) Query the fact table to verify the data loaded successfully:

select * from ods_jobdata_origin;

4. Create the detail table ods_jobdata_detail to hold the fact table's data broken out into finer-grained columns:

  CREATE TABLE ods_jobdata_detail(
  city string COMMENT 'city',
  salary array<String> COMMENT 'salary range',
  company array<String> COMMENT 'benefits',
  kill array<String> COMMENT 'skills',
  low_salary int COMMENT 'low salary',
  high_salary int COMMENT 'high salary',
  avg_salary double COMMENT 'average salary')
  COMMENT 'job data detail table'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

5. Load data into the detail table ods_jobdata_detail:

  insert overwrite table ods_jobdata_detail
  select city,salary,company,kill,salary[0],salary[1],(salary[0]+salary[1])/2
  from ods_jobdata_origin;
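The query above pulls salary[0] and salary[1] out of the salary array as the low and high bounds and averages them. In Python terms (a sketch only; Hive performs the string-to-number coercion implicitly):

```python
def salary_stats(salary):
    """Mirror salary[0], salary[1] and (salary[0]+salary[1])/2 from the
    INSERT statement: derive low, high and average salary from the array."""
    low, high = int(salary[0]), int(salary[1])
    return low, high, (low + high) / 2

print(salary_stats(["15", "25"]))  # (15, 25, 20.0)
```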

6. Flatten the salary field and store the result in the temporary intermediate table t_ods_tmp_salary:

create table t_ods_tmp_salary as select explode(ojo.salary) from ods_jobdata_origin ojo;
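explode() turns each array into one output row per element. A rough Python equivalent of what this query does to the salary column (the sample rows are made up):

```python
# Two made-up origin rows; explode() emits one output row per array element
rows = [
    {"city": "北京", "salary": ["15", "25"]},
    {"city": "上海", "salary": ["10", "20"]},
]

flattened = [value for row in rows for value in row["salary"]]
print(flattened)  # ['15', '25', '10', '20']
```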

7. Generalize each value in t_ods_tmp_salary into a salary band, storing the result in the intermediate table t_ods_tmp_salary_dist:

  create table t_ods_tmp_salary_dist as
  select case when col>=0 and col<=5 then "0-5"
  when col>=6 and col<=10 then "6-10"
  when col>=11 and col<=15 then "11-15"
  when col>=16 and col<=20 then "16-20"
  when col>=21 and col<=25 then "21-25"
  when col>=26 and col<=30 then "26-30"
  when col>=31 and col<=35 then "31-35"
  when col>=36 and col<=40 then "36-40"
  when col>=41 and col<=45 then "41-45"
  when col>=46 and col<=50 then "46-50"
  when col>=51 and col<=55 then "51-55"
  when col>=56 and col<=60 then "56-60"
  when col>=61 and col<=65 then "61-65"
  when col>=66 and col<=70 then "66-70"
  when col>=71 and col<=75 then "71-75"
  when col>=76 and col<=80 then "76-80"
  when col>=81 and col<=85 then "81-85"
  when col>=86 and col<=90 then "86-90"
  when col>=91 and col<=95 then "91-95"
  when col>=96 and col<=100 then "96-100"
  when col>=101 then ">101" end from t_ods_tmp_salary;
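The CASE expression buckets each salary value into 5k-wide bands. A compact Python sketch of the same rule (note that, as written, a value strictly between two bands, e.g. 5.5, matches no WHEN branch and would come out NULL; the sketch therefore assumes integer salaries, which the preprocessed data provides):

```python
def salary_band(col):
    """Bucket an integer salary into the same bands as the CASE
    expression: 0-5, 6-10, 11-15, ..., 96-100, >101."""
    col = int(col)
    if col >= 101:
        return ">101"
    if 0 <= col <= 5:
        return "0-5"
    # Bands 6-10 through 96-100 all start at 6 plus a multiple of 5
    low = ((col - 6) // 5) * 5 + 6
    return f"{low}-{low + 4}"

print(salary_band(12))   # 11-15
print(salary_band(100))  # 96-100
print(salary_band(150))  # >101
```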

8. Flatten the benefits field and store the result in the temporary intermediate table t_ods_tmp_company:

create table t_ods_tmp_company as select explode(ojo.company) from ods_jobdata_origin ojo;

9. Flatten the skills field and store the result in the temporary intermediate table t_ods_tmp_kill:

create table t_ods_tmp_kill as select explode(ojo.kill) from ods_jobdata_origin ojo;

10. Create the dimension table t_ods_kill to store the skill-tag statistics:

  create table t_ods_kill(
  every_kill String comment 'skill tag',
  count int comment 'frequency')
  COMMENT 'skill tag frequency counts'
  ROW FORMAT DELIMITED
  fields terminated by ','
  STORED AS TEXTFILE;

11. Create the dimension table t_ods_company to store the benefit-tag statistics:

  create table t_ods_company(
  every_company String comment 'benefit tag',
  count int comment 'frequency')
  COMMENT 'benefit tag frequency counts'
  ROW FORMAT DELIMITED
  fields terminated by ','
  STORED AS TEXTFILE;

12. Create the dimension table t_ods_salary to store the salary-distribution statistics:

  create table t_ods_salary(
  every_partition String comment 'salary band',
  count int comment 'aggregate count')
  COMMENT 'salary distribution aggregates'
  ROW FORMAT DELIMITED
  fields terminated by ','
  STORED AS TEXTFILE;

13. Create the dimension table t_ods_city to store the per-city statistics:

  create table t_ods_city(
  every_city String comment 'city',
  count int comment 'frequency')
  COMMENT 'city counts'
  ROW FORMAT DELIMITED
  fields terminated by ','
  STORED AS TEXTFILE;

 

II. Analyzing the Data

1. Analyze the distribution of big-data positions to show how demand for the role varies across cities nationwide:

  insert overwrite table t_ods_city
  select city,count(1) from ods_jobdata_origin group by city;

2. View the results in the dimension table t_ods_city:

select * from t_ods_city sort by count desc;

3. Analyze the nationwide salary distribution and store the result in the dimension table t_ods_salary (the CASE expression in Part I, step 7 had no column alias, so Hive auto-named the column `_c0`):

  insert overwrite table t_ods_salary
  select `_c0`,count(1) from t_ods_tmp_salary_dist group by `_c0`;

4. View the results in the dimension table t_ods_salary:

select * from t_ods_salary sort by count desc;

5. Compute the mean salary:

select avg(avg_salary) from ods_jobdata_detail;

6. Compute the mode (the most common average salary):

select avg_salary,count(1) as cnt from ods_jobdata_detail group by avg_salary order by cnt desc limit 1;

7. Compute the median salary:

select percentile(cast(avg_salary as BIGINT), 0.5) from ods_jobdata_detail;
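Steps 5 through 7 compute the mean, mode, and median of avg_salary. The same three statistics in Python, for intuition (the sample values are made up; note the Hive query casts to BIGINT first because percentile() requires integer input):

```python
from statistics import mean, median, mode

# Made-up avg_salary values; the real ones come from ods_jobdata_detail
avg_salary = [10.0, 15.0, 15.0, 25.0, 60.0]

print(mean(avg_salary))    # 25.0  -> avg(avg_salary)
print(mode(avg_salary))    # 15.0  -> group by / order by cnt desc limit 1
print(median(avg_salary))  # 15.0  -> percentile(..., 0.5)
```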

8. Compute the average salary per city:

select city,count(city),round(avg(avg_salary),2) as cnt from ods_jobdata_detail group by city order by cnt desc;

9. Analyze the benefits field to see which perks companies hiring for big-data positions advertise most often:

insert overwrite table t_ods_company select col,count(1) from t_ods_tmp_company group by col;

10. View the results in the dimension table t_ods_company:

select every_company,count from t_ods_company sort by count desc limit 10;

11. Analyze the skill tags to see which skills big-data work requires and which ones employers value most:

insert overwrite table t_ods_kill select col,count(1) from t_ods_tmp_kill group by col;

12. View the results in the dimension table t_ods_kill:

select every_kill,count from t_ods_kill sort by count desc limit 3;

After running these queries, the tables above will all be visible in your Hive database.

That completes the data processing. Next, we need to migrate the results into MySQL with the Sqoop tool, which we will cover in the next part!

Statement: this article was contributed by a community member; when reposting, please credit the source: wpsshop blog.