Hive was developed and open-sourced by Facebook. It is a data warehouse tool built on Hadoop that maps structured data to database tables and provides HQL (Hive SQL) for querying, with the underlying data stored on HDFS.
The essence of Hive: it translates SQL statements into MapReduce jobs, so that users who are not familiar with MapReduce can conveniently use HQL to process and analyze the structured data on HDFS. In other words, Hive is a MapReduce computation framework built on top of HDFS.
Main use: offline data analysis, with far higher development efficiency than writing MapReduce jobs directly.
Problems faced when processing data directly with Hadoop MapReduce:
Using Hive:
User interfaces: CLI, JDBC/ODBC, and the web GUI.
Metadata store: usually a relational database such as MySQL or Derby.
Interpreter, compiler, optimizer, executor: perform lexical analysis, parsing, compilation, optimization, and query-plan generation for an HQL statement; the generated query plan is stored in HDFS and subsequently executed by MapReduce.
Hive uses HDFS to store the data and MapReduce to query and analyze it.
Hive is a data warehouse tool and has no cluster of its own; to submit Hive jobs it is enough to install Hive on the master node of the Hadoop cluster.
Hive is meant for offline analysis of massive data sets.
Hive is mainly for querying rather than deleting or updating data: by default it supports neither DELETE/UPDATE nor transactions, and it does not fully implement standard SQL.
An RDBMS with standard SQL, by contrast, supports the full set of CRUD operations and transactions, is feature-complete, and can serve online business workloads.
Data types supported by Hive: the primitive types (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, STRING, VARCHAR, CHAR, BOOLEAN, BINARY, TIMESTAMP, DATE) plus the complex types below.
Complex data types: ARRAY, MAP, and STRUCT.
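A minimal sketch of a table definition that combines these complex types (the table and column names here are made up for illustration):

-- hypothetical example combining ARRAY, MAP and STRUCT
CREATE TABLE person_info (
    name    STRING,
    hobbies ARRAY<STRING>,                       -- e.g. ["reading","hiking"]
    scores  MAP<STRING, INT>,                    -- e.g. {"math":90,"english":85}
    address STRUCT<city:STRING, street:STRING>
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '|'
    MAP KEYS TERMINATED BY ':';
-- elements are accessed as hobbies[0], scores['math'], address.city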
Table types in Hive: managed (internal) tables, external tables, partitioned tables, and bucketed tables; managed vs. external tables and partitioned tables are demonstrated below.
CREATE DATABASE test;
SHOW DATABASES;
use default;
CREATE TABLE student(classNo string, stuNo string, score int) row format delimited fields terminated by ',';
hadoop fs -ls /hive/warehouse
C01,N0101,82
C01,N0102,59
C01,N0103,65
C02,N0201,81
C02,N0202,82
C02,N0203,79
C03,N0301,56
C03,N0302,92
C03,N0306,72
load data local inpath '/tmp/student.txt' overwrite into table student;
This command copies the student.txt file into Hive's warehouse directory, which is set by the hive.metastore.warehouse.dir configuration property and defaults to /user/hive/warehouse. The OVERWRITE option makes Hive first delete all files under the student directory, and the file contents are then mapped onto the table. Hive does not reformat student.txt in any way, because Hive itself does not impose a particular storage format on the data.
select * from student;
select classNo,count(score) from student where score>=60 group by classNo;
The execution output shows that Hive turned the query into a MapReduce job and ran it through Hadoop; the job can be observed in the YARN web UI:
http://172.18.2.2:8088/cluster/
CREATE EXTERNAL TABLE student2 (classNo string, stuNo string, score int) row format delimited fields terminated by ',' location '/tmp/student2';
- Load data
load data local inpath '/root/tmp/student.txt' overwrite into table student2;
desc formatted student;
drop table student;
select * from student2;
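The sequence above illustrates the managed vs. external distinction: dropping the managed table student also deletes its files under the warehouse directory, whereas dropping an external table removes only the metadata and leaves the files at its LOCATION untouched. A quick way to check (a side note, not part of the original walkthrough):

-- the "Table Type" field in the output shows EXTERNAL_TABLE
desc formatted student2;
-- drop table student2;  would drop only the metadata; the files under /tmp/student2 stay on HDFS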
What is a partitioned table: a partitioned table stores its data in separate HDFS subdirectories according to the value of one or more partition columns, so queries that filter on those columns only have to scan the matching directories.
Creating a partitioned table
tom,4300
jerry,12000
mike,13000
jake,11000
rob,10000
create table employee (name string,salary bigint) partitioned by (date1 string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
show partitions employee;
alter table employee add if not exists partition(date1='2018-12-01');
load data local inpath '/root/tmp/employee.txt' into table employee partition(date1='2018-12-01');
If the same file is loaded again under the same name, no error is raised; Hive simply creates a copy named *_copy_1.txt.
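Once a partition is loaded, a query that filters on the partition column reads only the matching directory, which is the main benefit of partitioning (a small check, not in the original):

-- only the files under .../employee/date1=2018-12-01 are scanned
select name, salary from employee where date1 = '2018-12-01';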
For an external partitioned table, even when the partition directory structure already exists on HDFS, the partitions must still be added through HQL before the corresponding data becomes visible.
hadoop fs -mkdir /user/hive/warehouse/employee/date1=2018-12-04
hadoop fs -copyFromLocal /tmp/employee.txt /user/hive/warehouse/test.db/employee/date1=2018-12-04/employee.txt
At this point the table data has not changed when queried; the partition still has to be added through HQL:
alter table employee add if not exists partition(date1='2018-12-04');
Only after this does a query show the newly added data.
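As an alternative to adding every partition by hand, Hive can scan the directories under the table location and register all missing partitions in one command (a side note; the original only uses ALTER TABLE ... ADD PARTITION):

-- registers every date1=... directory found under the table location as a partition
msck repair table employee;
show partitions employee;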
Summary:
Dynamic partitioning: partitions (including the directory structure) are created automatically while the data is being written.
create table employee2 (name string,salary bigint) partitioned by (date1 string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
insert into table employee2 partition(date1) select name,salary,date1 from employee;
- Using dynamic partitions requires this parameter to be set (before running the INSERT above)
set hive.exec.dynamic.partition.mode=nonstrict;
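After the INSERT, the dynamically created partitions can be verified (a quick check, not in the original); one partition should appear for every distinct date1 value that existed in employee:

show partitions employee2;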
Hive has four types of operators: relational operators, arithmetic operators, logical operators, and operators on complex types.
(There is too much material to cover here; see the official Hive documentation.)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
UDF
When Hive's built-in functions cannot satisfy your processing needs, you can turn to user-defined functions (UDF: user-defined function).
A UDF behaves like a mapper: each input row is mapped to one output row.
A UDAF behaves like a reducer: a group of input rows is mapped to one (or more) output rows.
Whether a script acts as a mapper or a reducer (that is, as a UDF or a UDAF) depends on the Hive operator it is used with: placed in a plain SELECT (TRANSFORM) it is essentially a UDF; combined with DISTRIBUTE BY / CLUSTER BY it acts as a reducer, as in the sketch below.
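A rough sketch of the reducer case (the script name sum_salary.py and its logic are hypothetical, not from the original): CLUSTER BY groups and sorts the rows by key before they reach the script, so the script sees each group's rows consecutively and can aggregate them.

-- hypothetical reducer script: reads date1\tsalary lines sorted by date1 and prints date1\ttotal_salary
add file /root/tmp/sum_salary.py;
select transform(date1, salary) using 'python sum_salary.py' as (date1, total_salary)
from (select date1, salary from employee cluster by date1) t;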
UDF example (running a pre-built Java UDF)
hadoop fs -mkdir -p /user/hive/lib
cd /opt/hive/lib
hadoop fs -put hive-contrib-1.1.0-cdh5.14.2.jar /user/hive/lib/
hive
add jar hdfs:///user/hive/lib/hive-contrib-1.1.0-cdh5.14.2.jar;
CREATE TEMPORARY FUNCTION row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
select row_sequence(),* from employee;
CREATE FUNCTION row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence' using jar 'hdfs:///user/hive/lib/hive-contrib-1.1.0-cdh5.7.0.jar';
Prepare the example environment
CREATE table u(fname STRING,lname STRING);
insert into table u values('George','washington');
insert into table u values('George','bush');
insert into table u values('Bill','clinton');
insert into table u values('Bill','gates');
import sys

# Hive streams each row to stdin as tab-separated text:
# upper-case the last name and write the row back out, again tab-separated
for line in sys.stdin:
    line = line.strip()
    fname, lname = line.split('\t')
    l_name = lname.upper()
    print '\t'.join([fname, str(l_name)])
Add the file to Hive from HDFS
Upload the file to HDFS
hadoop fs -put udf.py /user/hive/lib/
add file hdfs:///user/hive/lib/udf.py;
-- alternatively, the script can be added directly from the local file system:
add file /root/tmp/udf1.py;
SELECT TRANSFORM(fname, lname) USING 'python udf1.py' AS (fname, l_name) FROM u;
Case study: based on user behavior and article tags, find the tags each user is most interested in (i.e., reads most often).
user_id article_id event_time
11,101,2018-12-01 06:01:10
22,102,2018-12-01 07:28:12
33,103,2018-12-01 07:50:14
11,104,2018-12-01 09:08:12
22,103,2018-12-01 13:37:12
33,102,2018-12-02 07:09:12
11,101,2018-12-02 18:42:12
35,105,2018-12-03 09:21:12
22,104,2018-12-03 16:42:12
77,103,2018-12-03 18:31:12
99,102,2018-12-04 00:04:12
33,101,2018-12-04 19:10:12
11,101,2018-12-05 09:07:12
35,102,2018-12-05 11:00:12
22,103,2018-12-05 12:11:12
77,104,2018-12-05 18:02:02
99,105,2018-12-05 20:09:11
- Article keywords data (article_id, url, key_words)
101,http://www.itcast.cn/1.html,kw8|kw1
102,http://www.itcast.cn/2.html,kw6|kw3
103,http://www.itcast.cn/3.html,kw7
104,http://www.itcast.cn/4.html,kw5|kw1|kw4|kw9
105,http://www.itcast.cn/5.html,
hadoop fs -mkdir /tmp/demo
hadoop fs -mkdir /tmp/demo/user_action
- User actions table
drop table if exists user_actions;
CREATE EXTERNAL TABLE user_actions(
    user_id STRING,
    article_id STRING,
    time_stamp STRING
)
ROW FORMAT delimited fields terminated by ','
LOCATION '/tmp/demo/user_action';
key_words array<STRING>: the array data type
COLLECTION ITEMS terminated BY '|': the array elements are separated by '|'
drop table if exists articles;
CREATE EXTERNAL TABLE articles(
    article_id STRING,
    url STRING,
    key_words array<STRING>
)
ROW FORMAT delimited fields terminated by ','
COLLECTION ITEMS terminated BY '|'
LOCATION '/tmp/demo/article_keywords';
select * from user_actions;
select * from articles;
Query each user's reading history, grouped by user
collect_set
select user_id,collect_set(article_id) from user_actions group by user_id;
11 ["101","104"]
22 ["102","103","104"]
33 ["103","102","101"]
35 ["105","102"]
77 ["103","104"]
99 ["102","105"]
select user_id,collect_list(article_id) from user_actions group by user_id;
11 ["101","104","101","101"]
22 ["102","103","104","103"]
33 ["103","102","101"]
35 ["105","102"]
77 ["103","104"]
99 ["102","105"]
select user_id,sort_array(collect_list(article_id)) as contents from user_actions group by user_id;
11 ["101","101","101","104"]
22 ["102","103","103","104"]
33 ["101","102","103"]
35 ["102","105"]
77 ["103","104"]
99 ["102","105"]
select explode(key_words) from articles;
select article_id,kw from articles lateral view explode(key_words) t as kw;
101 kw8
101 kw1
102 kw6
102 kw3
103 kw7
104 kw5
104 kw1
104 kw4
104 kw9
select article_id,kw from articles lateral view outer explode(key_words) t as kw;
# with outer
101 kw8
101 kw1
102 kw6
102 kw3
103 kw7
104 kw5
104 kw1
104 kw4
104 kw9
105 NULL
- Original data
101 http://www.itcast.cn/1.html ["kw8","kw1"]
102 http://www.itcast.cn/2.html ["kw6","kw3"]
103 http://www.itcast.cn/3.html ["kw7"]
104 http://www.itcast.cn/4.html ["kw5","kw1","kw4","kw9"]
105 http://www.itcast.cn/5.html []
select a.user_id, b.kw
from user_actions as a
left outer join (select article_id, kw from articles lateral view outer explode(key_words) t as kw) b
on (a.article_id = b.article_id)
order by a.user_id;
11 kw1
11 kw8
11 kw5
11 kw1
11 kw4
11 kw1
11 kw9
11 kw8
11 kw1
11 kw8
22 kw1
22 kw7
22 kw9
22 kw4
22 kw5
22 kw7
22 kw3
22 kw6
33 kw8
33 kw1
33 kw3
33 kw6
33 kw7
35 NULL
35 kw6
35 kw3
77 kw9
77 kw1
77 kw7
77 kw4
77 kw5
99 kw3
99 kw6
99 NULL
select a.user_id, b.kw, count(1) as weight
from user_actions as a
left outer join (select article_id, kw from articles lateral view outer explode(key_words) t as kw) b
on (a.article_id = b.article_id)
group by a.user_id, b.kw
order by a.user_id, weight desc;
11 kw1 4
11 kw8 3
11 kw5 1
11 kw9 1
11 kw4 1
22 kw7 2
22 kw9 1
22 kw1 1
22 kw3 1
22 kw4 1
22 kw5 1
22 kw6 1
33 kw3 1
33 kw8 1
33 kw7 1
33 kw6 1
33 kw1 1
35 NULL 1
35 kw3 1
35 kw6 1
77 kw1 1
77 kw4 1
77 kw5 1
77 kw7 1
77 kw9 1
99 NULL 1
99 kw3 1
99 kw6 1
concat returns the string produced by concatenating its arguments; if any argument is NULL, the result is NULL.
select concat(user_id,article_id) from user_actions;
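The NULL behaviour of concat can be checked directly (a small sanity check, not in the original):

-- returns abc
select concat('a', 'b', 'c');
-- returns NULL: any NULL argument makes the whole result NULL
select concat('a', cast(null as string), 'c');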
Syntax: CONCAT_WS(separator, str1, str2, ...)
CONCAT_WS() stands for CONCAT With Separator and is a special form of CONCAT(). The first argument is the separator for the remaining arguments and is placed between each pair of strings being joined. The separator can be a string or another argument. If the separator is NULL, the result is NULL.
select concat_ws(':',user_id,article_id) from user_actions;
select a.user_id, concat_ws(':', b.kw, cast(count(1) as string)) as kw_w
from user_actions as a
left outer join (select article_id, kw from articles lateral view outer explode(key_words) t as kw) b
on (a.article_id = b.article_id)
group by a.user_id, b.kw;
11 kw1:4
11 kw4:1
11 kw5:1
11 kw8:3
11 kw9:1
22 kw1:1
22 kw3:1
22 kw4:1
22 kw5:1
22 kw6:1
22 kw7:2
22 kw9:1
33 kw1:1
33 kw3:1
33 kw6:1
33 kw7:1
33 kw8:1
35 1
35 kw3:1
35 kw6:1
77 kw1:1
77 kw4:1
77 kw5:1
77 kw7:1
77 kw9:1
99 1
99 kw3:1
99 kw6:1
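The rows reading just "35 1" and "99 1" come from articles whose keyword is NULL: unlike concat, concat_ws skips NULL arguments, so only the count string survives, with no "kw:" prefix. A small check of that behaviour (an assumption consistent with the output above):

-- returns '1' rather than NULL, because concat_ws ignores the NULL argument
select concat_ws(':', cast(null as string), '1');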
select cc.user_id, concat_ws(',', collect_set(cc.kw_w))
from (
    select a.user_id, concat_ws(':', b.kw, cast(count(1) as string)) as kw_w
    from user_actions as a
    left outer join (select article_id, kw from articles lateral view outer explode(key_words) t as kw) b
    on (a.article_id = b.article_id)
    group by a.user_id, b.kw
) as cc
group by cc.user_id;
11 kw1:4,kw4:1,kw5:1,kw8:3,kw9:1
22 kw1:1,kw3:1,kw4:1,kw5:1,kw6:1,kw7:2,kw9:1
33 kw1:1,kw3:1,kw6:1,kw7:1,kw8:1
35 1,kw3:1,kw6:1
77 kw1:1,kw4:1,kw5:1,kw7:1,kw9:1
99 1,kw3:1,kw6:1
select cc.user_id, str_to_map(concat_ws(',', collect_set(cc.kw_w))) as wm
from (
    select a.user_id, concat_ws(':', b.kw, cast(count(1) as string)) as kw_w
    from user_actions as a
    left outer join (select article_id, kw from articles lateral view outer explode(key_words) t as kw) b
    on (a.article_id = b.article_id)
    group by a.user_id, b.kw
) as cc
group by cc.user_id;
11 {"kw1":"4","kw4":"1","kw5":"1","kw8":"3","kw9":"1"}
22 {"kw1":"1","kw3":"1","kw4":"1","kw5":"1","kw6":"1","kw7":"2","kw9":"1"}
33 {"kw1":"1","kw3":"1","kw6":"1","kw7":"1","kw8":"1"}
35 {"1":null,"kw3":"1","kw6":"1"}
77 {"kw1":"1","kw4":"1","kw5":"1","kw7":"1","kw9":"1"}
99 {"1":null,"kw3":"1","kw6":"1"}
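The odd {"1":null, ...} entries for users 35 and 99 follow from the previous step: the bare "1" contains no ':' separator, so str_to_map, which by default splits pairs on ',' and keys from values on ':', treats "1" as a key with a NULL value. A small illustration (the delimiters can also be passed explicitly):

-- returns {"1":null,"kw3":"1","kw6":"1"}
select str_to_map('1,kw3:1,kw6:1');
-- same call with the default delimiters spelled out
select str_to_map('1,kw3:1,kw6:1', ',', ':');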
create table user_kws as
select cc.user_id, str_to_map(concat_ws(',', collect_set(cc.kw_w))) as wm
from (
    select a.user_id, concat_ws(':', b.kw, cast(count(1) as string)) as kw_w
    from user_actions as a
    left outer join (select article_id, kw from articles lateral view outer explode(key_words) t as kw) b
    on (a.article_id = b.article_id)
    group by a.user_id, b.kw
) as cc
group by cc.user_id;
select user_id, wm['kw1'] from user_kws;
11 4
22 1
33 1
35 NULL
77 1
99 NULL
select user_id,map_keys(wm),map_values(wm) from user_kws;
11 ["kw1","kw4","kw5","kw8","kw9"] ["4","1","1","3","1"]
22 ["kw1","kw3","kw4","kw5","kw6","kw7","kw9"] ["1","1","1","1","1","2","1"]
33 ["kw1","kw3","kw6","kw7","kw8"] ["1","1","1","1","1"]
35 ["1","kw3","kw6"] [null,"1","1"]
77 ["kw1","kw4","kw5","kw7","kw9"] ["1","1","1","1","1"]
99 ["1","kw3","kw6"] [null,"1","1"]
select user_id,keyword,weight from user_kws lateral view explode(wm) t as keyword,weight;
11 kw1 4
11 kw4 1
11 kw5 1
11 kw8 3
11 kw9 1
22 kw1 1
22 kw3 1
22 kw4 1
22 kw5 1
22 kw6 1
22 kw7 2
22 kw9 1
33 kw1 1
33 kw3 1
33 kw6 1
33 kw7 1
33 kw8 1
35 1 NULL
35 kw3 1
35 kw6 1
77 kw1 1
77 kw4 1
77 kw5 1
77 kw7 1
77 kw9 1
99 1 NULL
99 kw3 1
99 kw6 1
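To answer the original question, the tag each user reads most, one more step can be added on top of user_kws: explode the map again, rank the keywords per user by weight, and keep the top one. A possible sketch using a window function (not part of the original; the NULL-keyword rows shown as key "1" above could be filtered out first if they should not count as tags):

select user_id, keyword, kw_weight
from (
    select user_id, keyword, cast(weight as int) as kw_weight,
           row_number() over (partition by user_id order by cast(weight as int) desc) as rn
    from user_kws lateral view explode(wm) t as keyword, weight
) ranked
where rn = 1;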