Getting Started with Hive (Part 3): Advanced Operations

Author: weixin_40725706 | 2024-05-20 10:23:41

0. Advanced Hive usage

1. Hive data types

1.1 Primitive data types

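Data type	Length	Notes
TinyInt	1-byte signed integer	-128 to 127
SmallInt	2-byte signed integer	-32768 to 32767
Int	4-byte signed integer	-2147483648 to 2147483647
BigInt	8-byte signed integer	-9223372036854775808 to 9223372036854775807
Boolean	Boolean	true or false
Float	Single-precision floating point
Double	Double-precision floating point
String	Character string
Timestamp	Point in time	Can be converted from an integer Unix timestamp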

1.2 Complex data types

The complex data types are array, map, and struct.

Type	Meaning	Example
Struct	Structure with named fields	struct('john','doe')
Map	Key-value pairs	map('f','m','l','n')
Array	Array	array('john','doe')

Example:
CREATE TABLE student(
  name STRING,
  favors ARRAY<STRING>,
  scores MAP<STRING, FLOAT>,
  address STRUCT<province:STRING, city:STRING, detail:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ';'
MAP KEYS TERMINATED BY ':';

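1. The name field is a primitive type. favors is an array and can hold many hobbies, scores is a map and can hold the grades for several courses, and address is a struct that stores the address details.
2. ROW FORMAT DELIMITED announces that the clauses following it define the field and element delimiters.
3. FIELDS TERMINATED BY sets the delimiter between fields (columns).
4. COLLECTION ITEMS TERMINATED BY sets the delimiter between elements: between the elements of an Array, the members of a Struct, and the key-value pairs of a Map.
5. MAP KEYS TERMINATED BY sets the delimiter between the key and the value inside a Map entry.
6. LINES TERMINATED BY sets the delimiter between rows (not used in the example above).
7. STORED AS TEXTFILE sets the file format in which the loaded data is stored (not used in the example above; TEXTFILE is the default).

Complex data types let a single table hold data that would otherwise be spread across several related tables.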
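
To make the role of each delimiter concrete, here is a minimal Java sketch that splits one line of text the way the student table definition above describes it. It only illustrates the delimiter hierarchy and is not Hive's actual parsing code; the class name StudentRowSketch and the sample row are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hive: splits one text row by the
// delimiters declared in the student table definition.
class StudentRowSketch {
    // FIELDS TERMINATED BY '\t' separates the four columns.
    static String[] fields(String row) {
        return row.split("\t", -1);
    }

    // COLLECTION ITEMS TERMINATED BY ';' separates array elements,
    // map entries and struct members.
    static List<String> items(String field) {
        return new ArrayList<>(Arrays.asList(field.split(";", -1)));
    }

    // MAP KEYS TERMINATED BY ':' separates each key from its value.
    static Map<String, Float> scores(String field) {
        Map<String, Float> result = new LinkedHashMap<>();
        for (String entry : items(field)) {
            String[] kv = entry.split(":", 2);
            result.put(kv[0], Float.parseFloat(kv[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample row: name, favors, scores, address.
        String row = "huangbo\tbook;music\tmath:90.5;english:80\tanhui;hefei;road 1;230000";
        String[] cols = fields(row);
        System.out.println("name    = " + cols[0]);
        System.out.println("favors  = " + items(cols[1]));
        System.out.println("scores  = " + scores(cols[2]));
        System.out.println("address = " + items(cols[3]));
    }
}
```

The three levels nest: a row is first cut into columns, a collection column is then cut into items, and each map item is finally cut into key and value.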

1.3 Worked examples

1.3.1 array

Create the tables:
create table person(name string,work_locations string)
row format delimited fields terminated by '\t';
create table person1(name string,work_locations array<string>)
row format delimited fields terminated by '\t'
collection items terminated by ',';
Data (the name and the locations are separated by a tab):
huangbo	beijing,shanghai,tianjin,hangzhou
xuzheng	changchu,chengdu,wuhan
wangbaoqiang	dalian,shenyang,jilin
Load the data:
load data local inpath '/home/hadoop/person.txt' into table person;
load data local inpath '/home/hadoop/person.txt' into table person1;
Queries (the index access only works on person1, where work_locations is an array):
select * from person1;
select name from person1;
select work_locations from person1;
select work_locations[0] from person1;

1.3.2 map

Create the table:
create table score(name string, scores map<string,int>)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';
Data (the name and the scores are separated by a tab):
huangbo	yuwen:80,shuxue:89,yingyu:95
xuzheng	yuwen:70,shuxue:65,yingyu:81
wangbaoqiang	yuwen:75,shuxue:100,yingyu:75
Load the data:
load data local inpath '/home/hadoop/score.txt' into table score;
Queries:
select * from score;
select name from score;
select scores from score;
select s.scores['yuwen'] from score s;

1.3.3 struct

Create the table:
create table structtable(id int,course struct<name:string,score:int>)
row format delimited fields terminated by '\t'
collection items terminated by ',';
Data (the id and the course are separated by a tab):
1	english,80
2	math,89
3	chinese,95
Load the data:
load data local inpath '/home/hadoop/structtable.txt' into table structtable;
Queries:
select * from structtable;
select id from structtable;
select course from structtable;
select t.course.name from structtable t;
select t.course.score from structtable t;
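
For readers more familiar with Java than HiveQL, the three access forms used in sections 1.3.1 to 1.3.3 can be compared with ordinary collection access. The sketch below is only an analogy in plain Java; the class ComplexAccessSketch and its nested Course class are hypothetical stand-ins, and the values mirror the sample rows above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java analogy for array, map and struct access; no Hive classes involved.
class ComplexAccessSketch {
    // Stand-in for struct<name:string,score:int>.
    static final class Course {
        final String name;
        final int score;

        Course(String name, int score) {
            this.name = name;
            this.score = score;
        }
    }

    public static void main(String[] args) {
        List<String> workLocations = Arrays.asList("beijing", "shanghai", "tianjin", "hangzhou");
        Map<String, Integer> scores = new LinkedHashMap<>();
        scores.put("yuwen", 80);
        scores.put("shuxue", 89);
        Course course = new Course("english", 80);

        System.out.println(workLocations.get(0)); // comparable to work_locations[0]
        System.out.println(scores.get("yuwen"));  // comparable to scores['yuwen']
        System.out.println(course.name);          // comparable to course.name
    }
}
```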

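2. Views

Like relational databases, Hive provides views, but they differ considerably from views in a relational database:

Hive views are logical views only and are not materialized (materialized views were added as a separate feature in Hive 3.0).
A view can only be queried; data cannot be loaded or inserted through it.
Creating a view stores only metadata; the view's underlying query runs only when the view itself is queried.

Create a view
create view view_name as select * from carss;
Inspect it
show tables;
desc view_name;
Drop a view
drop view view_name;
Use a view
create view view_name as select * from tablename where rank>3;
select count(distinct uid) from view_name;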

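3. Hive functions

3.1 Hive provides a very rich set of built-in functions

show functions;                  -- list the built-in functions
desc function abs;               -- show the description of a function
desc function extended concat;   -- show the extended description of a function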

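List of built-in functions

I. Relational operators:
1. Equality: =
2. NULL-safe equality: <=>
3. Inequality: <> and !=
4. Less than: <
5. Less than or equal: <=
6. Greater than: >
7. Greater than or equal: >=
8. Range comparison: BETWEEN ... AND ...
9. NULL check: IS NULL
10. Non-NULL check: IS NOT NULL
11. LIKE comparison: LIKE
12. Java regular-expression match: RLIKE
13. REGEXP operator: REGEXP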

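II. Arithmetic operators:
1. Addition: +
2. Subtraction: -
3. Multiplication: *
4. Division: /
5. Modulo: %
6. Bitwise AND: &
7. Bitwise OR: |
8. Bitwise XOR: ^
9. Bitwise NOT: ~

III. Logical operators:
1. Logical AND: AND, &&
2. Logical OR: OR, ||
3. Logical NOT: NOT, !

IV. Complex type constructors:
1. map
2. struct
3. named_struct
4. array
5. create_union

V. Complex type operators:
1. Array element access: A[n]
2. Map element access: M[key]
3. Struct field access: S.x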

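VI. Numeric functions:
1. Rounding: round(a)
2. Rounding to a given precision: round(a, d)
3. Round down: floor
4. Round up: ceil
5. Round up: ceiling
6. Random number: rand
7. Natural exponential: exp
8. Base-10 logarithm: log10
9. Base-2 logarithm: log2
10. Logarithm: log
11. Power: pow
12. Power: power
13. Square root: sqrt
14. Binary representation: bin
15. Hexadecimal representation: hex
16. Inverse of hex: unhex
17. Base conversion: conv
18. Absolute value: abs
19. Positive modulo: pmod
20. Sine: sin
21. Arcsine: asin
22. Cosine: cos
23. Arccosine: acos
24. positive
25. negative

VII. Collection functions:
1. Size of a map: size
2. Size of an array: size
3. Test whether an array contains an element: array_contains
4. All values of a map: map_values
5. All keys of a map: map_keys
6. Array sorting: sort_array

VIII. Type conversion functions:
1. Conversion to binary: binary
2. Casting between primitive types: cast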

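IX. Date functions:
1. Unix timestamp to date string: from_unixtime
2. Current Unix timestamp: unix_timestamp()
3. Date to Unix timestamp: unix_timestamp(date)
4. Date in a given format to Unix timestamp: unix_timestamp(date, pattern)
5. Date-time to date: to_date
6. Year of a date: year
7. Month of a date: month
8. Day of a date: day
9. Hour of a date: hour
10. Minute of a date: minute
11. Second of a date: second
12. Week of the year of a date: weekofyear
13. Difference between two dates in days: datediff
14. Add days to a date: date_add
15. Subtract days from a date: date_sub

X. Conditional functions:
1. If: if
2. First non-NULL value: COALESCE
3. Conditional expression: CASE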

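XI. String functions:
1. ASCII code of a character: ascii
2. Base64 encoding: base64
3. String concatenation: concat
4. String concatenation with a separator: concat_ws
5. Array to string with a separator: concat_ws
6. Format a number with a given number of decimal places: format_number
7. Substring from a start position: substr, substring
8. Substring from a start position with a length: substr, substring
9. Find a substring: instr
10. String length: length
11. Find a substring: locate
12. Formatted string: printf
13. String to map: str_to_map
14. Base64 decoding: unbase64(string str)
15. To upper case: upper, ucase
16. To lower case: lower, lcase
17. Trim spaces on both sides: trim
18. Trim leading spaces: ltrim
19. Trim trailing spaces: rtrim
20. Regular-expression replace: regexp_replace
21. Regular-expression extract: regexp_extract
22. URL parsing: parse_url
23. JSON parsing: get_json_object
24. String of spaces: space
25. Repeat a string: repeat
26. Left pad: lpad
27. Right pad: rpad
28. Split a string: split
29. Find in a comma-separated set: find_in_set
30. Sentence tokenizer: sentences
31. Top-K most frequent n-grams after tokenizing: ngrams
32. Top-K most frequent n-grams that follow a given context after tokenizing: context_ngrams

XII. Miscellaneous functions:
1. Call a Java method: java_method
2. Call a Java method: reflect
3. Hash value of a string: hash

XIII. XPath functions for parsing XML:
1. xpath
2. xpath_string
3. xpath_boolean
4. xpath_short, xpath_int, xpath_long
5. xpath_float, xpath_double, xpath_number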

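XIV. Aggregate functions (UDAF):
1. Count: count
2. Sum: sum
3. Average: avg
4. Minimum: min
5. Maximum: max
6. Population variance: var_pop
7. Sample variance: var_samp
8. Population standard deviation: stddev_pop
9. Sample standard deviation: stddev_samp
10. Exact percentile (0.5 gives the median): percentile(col, p)
11. Exact percentiles for several values at once: percentile(col, array(p1, p2, ...))
12. Approximate percentile: percentile_approx(col, p)
13. Approximate percentiles for several values at once: percentile_approx(col, array(p1, p2, ...))
14. Histogram: histogram_numeric
15. Collect values into a set, removing duplicates: collect_set
16. Collect values into a list, keeping duplicates: collect_list

XV. Table-generating functions (UDTF):
1. Split an array into multiple rows: explode(array)
2. Split a map into multiple rows: explode(map)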

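3.2 Hive user-defined functions

When Hive's built-in functions cannot meet a business requirement, you can write a user-defined function.

UDF (User-Defined Function): operates on a single row and produces a single value as output (for example, numeric and string functions).

UDAF (User-Defined Aggregate Function): takes multiple input rows and produces a single output row (for example, count and max).

UDTF (User-Defined Table-Generating Function): takes one input row and produces multiple output rows (for example, explode).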
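
The difference between the three kinds is mainly the shape of their input and output. The following plain-Java sketch mirrors those shapes with ordinary static methods; it is an analogy only and uses none of Hive's UDF, UDAF, or UDTF base classes. The class name FunctionKindsSketch and its method names are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy for the three function kinds; no Hive classes involved.
class FunctionKindsSketch {
    // UDF shape: one input value -> one output value.
    static String lower(String value) {
        return value == null ? null : value.toLowerCase();
    }

    // UDAF shape: many input rows -> one output value (here: the maximum).
    static int max(List<Integer> rows) {
        int best = Integer.MIN_VALUE;
        for (int v : rows) {
            best = Math.max(best, v);
        }
        return best;
    }

    // UDTF shape: one input row -> many output rows (like explode on an array).
    static List<String> explode(String[] array) {
        return Arrays.asList(array);
    }

    public static void main(String[] args) {
        System.out.println(lower("HiVe"));
        System.out.println(max(Arrays.asList(80, 89, 95)));
        System.out.println(explode(new String[] {"beijing", "shanghai"}));
    }
}
```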

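UDF development example

1. Write a class that extends UDF and provides one or more evaluate methods

package com.ghgj.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {
    // evaluate must be public; it can be overloaded for other argument types
    public String evaluate(String field) {
        // return NULL for NULL input instead of throwing a NullPointerException
        if (field == null) {
            return null;
        }
        return field.toLowerCase();
    }
}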

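2. Package the class into a jar and upload it to the server

3. Add the jar to Hive's classpath

add jar /home/hadoop/hive_jar/u
List the jars that have been added
list jar;

4. Create a temporary function bound to the class

create temporary function tolowercase as 'com.ghgj.hive.udf.ToLowerCase';

5. Use the function

select tolowercase(name),age from student;

Note: a temporary function exists only for the current session, so it must be recreated each time you re-enter the Hive shell.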

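4. Handling delimiters in Hive

How Hive reads data:

1. First, a concrete InputFormat implementation (default: org.apache.hadoop.mapred.TextInputFormat) reads the file and returns records one at a time. A record can be a physical line, or whatever your own logic defines as a "row".

2. Then, a concrete SerDe implementation (default: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) splits each returned record into fields.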
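
The two stages can be pictured with a small Java sketch. It is a simplified analogy, not the real TextInputFormat or LazySimpleSerDe code; the class ReadPipelineSketch, its method names, and the sample content are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Simplified analogy of Hive's two read stages; not the real Hive classes.
class ReadPipelineSketch {
    // Stage 1 (role of the InputFormat): cut the file content into records,
    // here one record per non-empty line.
    static List<String> records(String fileContent) {
        List<String> result = new ArrayList<>();
        for (String line : fileContent.split("\n")) {
            if (!line.isEmpty()) {
                result.add(line);
            }
        }
        return result;
    }

    // Stage 2 (role of the SerDe): cut one record into fields on a
    // single-character delimiter.
    static String[] fields(String record, char delimiter) {
        return record.split(Pattern.quote(String.valueOf(delimiter)), -1);
    }

    public static void main(String[] args) {
        // Made-up content using Ctrl+A (\u0001) as the field delimiter.
        String content = "01\u0001huangbo\n02\u0001xuzheng\n";
        for (String record : records(content)) {
            String[] f = fields(record, '\u0001');
            System.out.println(f[0] + " -> " + f[1]);
        }
    }
}
```

Because stage 2 in this sketch accepts only a single character, a delimiter such as || has to be handled either by a different stage 2 or by rewriting the record in stage 1.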

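By default Hive supports only single-byte field delimiters. If the delimiter in the data file is made up of several characters, as in the rows below, the default SerDe cannot split the fields correctly: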

01||huangbo

02||xuzheng

03||wangbaoqiang

4.1 Using RegexSerDe to extract fields with a regular expression

create table t_bi_reg (id string, name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties (
  'input.regex' = '(.*)\\|\\|(.*)',
  'output.format.string' = '%1$s %2$s'
)
stored as textfile;
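
To see what the input.regex pattern captures, the same expression can be tried directly with java.util.regex. This is a minimal sketch, assuming the whole line must match the pattern and that each capture group becomes one column; the class RegexFieldSketch is hypothetical and is not the SerDe itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: applies the table's regular expression with java.util.regex.
class RegexFieldSketch {
    // In HiveQL the pattern is written '(.*)\\|\\|(.*)'; the Java string
    // literal below denotes the same regex: (.*)\|\|(.*)
    static final Pattern INPUT_REGEX = Pattern.compile("(.*)\\|\\|(.*)");

    // Returns the capture groups as columns, or null when the line does not match.
    static String[] extract(String line) {
        Matcher m = INPUT_REGEX.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] columns = new String[m.groupCount()];
        for (int i = 0; i < columns.length; i++) {
            columns[i] = m.group(i + 1);
        }
        return columns;
    }

    public static void main(String[] args) {
        String[] cols = extract("01||huangbo");
        System.out.println("id=" + cols[0] + ", name=" + cols[1]);
    }
}
```

The backslashes are needed because | is the alternation operator in a regular expression, so a literal || must be escaped.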

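4.2 Solving the special-delimiter problem with a custom InputFormat

The idea is that, while the InputFormat reads each line, it replaces the multi-byte delimiter in the data with Hive's default delimiter (Ctrl+A, i.e. \001) or with another single-character substitute, so that the SerDe can then extract the fields using an ordinary single-byte delimiter.

This approach is mentioned for reference only.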
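
The core of that replacement step can be sketched in a few lines of Java. This shows only the string rewrite, under the assumption that || is the multi-character delimiter; a real solution would perform it inside a custom InputFormat and RecordReader, which are not shown. The class DelimiterRewriteSketch is a hypothetical name.

```java
// Illustration of the rewrite idea only; the surrounding InputFormat and
// RecordReader plumbing is intentionally left out.
class DelimiterRewriteSketch {
    static final String MULTI_CHAR_DELIMITER = "||";
    static final String HIVE_DEFAULT_DELIMITER = "\u0001"; // Ctrl+A

    // Replace every multi-character delimiter with the single-byte one.
    static String rewrite(String line) {
        return line.replace(MULTI_CHAR_DELIMITER, HIVE_DEFAULT_DELIMITER);
    }

    public static void main(String[] args) {
        String rewritten = rewrite("03||wangbaoqiang");
        // After the rewrite, an ordinary single-character split is enough.
        String[] fields = rewritten.split(HIVE_DEFAULT_DELIMITER, -1);
        System.out.println(fields[0] + " / " + fields[1]);
    }
}
```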
