从前慢现在也慢

这个屌丝很懒，什么也没留下！

热门标签

大数据-玩转数据-Hive应用小结_基于hive数据仓库的教育大数据分析平台的存在和总结

作者：从前慢现在也慢 | 2024-07-08 23:07:53

踩

基于hive数据仓库的教育大数据分析平台的存在和总结

说明

本文只为说明功能，除十二节外，案例数据并不连贯。

一、 HIVE是什么

HIVE是基于Hadoop的一个数据仓库工具，它是一个可以将Sql翻译为MR程序的工具；HIVE支持用户将HDFS上的文件映射为表结构，然后用户就可以输入SQL对这些表（HDFS上的文件）进行查询分析,HIVE将用户定义的库、表结构等信息存储到HIVE的元数据库（可以是本地derby，也可以是远程mysql）中。其优点是学习成本低，可以通过类SQL语句快速实现简单的MapReduce统计，不必开发专门的MapReduce应用，十分适合数据仓库的统计分析。

二、登录交互模式

我们假设已经启动了HDFS,Yarn

[root@hadoop1 ~]# jps
1991 NameNode
2122 DataNode
2556 Jps
2414 NodeManager
1
2
3
4
5

1、单机交互式

已经配置了HIVE环境变量(参考HIVE安装篇)

[root@hadoop1 ~]# hive
1

在这里插入图片描述

2、hive服务交互

后台启动运行hive服务

[root@hadoop1 hive]# nohup bin/hiveserver2 1>/dev/dull 2>&1 &
[1] 3034
1
2

查看服务是否启动（需要一点时间）

[root@hadoop1 hive]# netstat -nltp|grep 10000
1

如果看到10000的端口的服务说明服务已经启动
在这里插入图片描述
任一台安装了hive的客户端连接启动的服务

[root@hadoop1 hive]# bin/beeline -u jdbc:hive2://hadoop1:10000 -n hadoop
1

如果报错：

21/03/17 18:04:41 [main]: WARN jdbc.HiveConnection: Failed to connect to hadoop102:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://hadoop102:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: kuber is not allowed to impersonate kuber (state=08S01,code=0)
1
2

首先在./etc/hadoop/core-site.xml文件里面加上：

<property>
		<name>hadoop.proxyuser.kuber.hosts</name>
		<value>*</value>
</property>
<property>
		<name>hadoop.proxyuser.kuber.groups</name>
		<value>*</value>
</property>
1
2
3
4
5
6
7
8

然后重启hdfs和yarn，重启hiveserver,hiveserver2但是对我来说并没用。
随后，我在hive/conf/hive.site.xml里面加了下面一句：

<property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
</property>
1
2
3
4

重新启动服务登录

3、将hive作为命令shell运行

bin/hive -e “sql1;sql2;sql3;sql4”
事先将sql语句写入一个文件比如 q.hql ，然后用hive命令执行：

[root@hadoop1 ~]# hive -e "select count(1) from default.t_big24"
1

当前目录下创建q.hql

[root@hadoop1 ~]# hive -f q.hql
1

4、shell脚本中调用

编写一个shell脚本文件etl.sh

#!/bin/bash
hive -e "create table t_count_text(sex string,number int)"
hive -e "isnert into t_count_text select sex,count(1) from default.t_big24 group by sex"
1
2
3

运行shell 脚本

[root@hadoop1 ~]# sh etl.sh
1

三、HIVE的DDL语法

1、建库

登录hive

jdbc:hive2://hadoop1:10000> create database db1; 
1

hive就会在默认仓库路径 /user/hive/warehouse/下建一个文件夹： db1.db

2、建内部表

2.1创建表

jdbc:hive2://hadoop1:10000> use db1;
1

jdbc:hive2://hadoop1:10000>create table t_2(id int,name string,salary bigint,add string)
row format delimited
fields terminated by ',';
1
2
3

建表后，hive会在仓库目录中建一个表目录： /user/hive/warehouse/db1.db/t_test1，数据格式的分隔符是’,’，如果不指定，默认分割划是^A，用ctrl + v可以输入 ^符号，ctrl + a 可以输入A，linux下用cat命令是看不到这个默认分割符的。

2.2 添加列

jdbc:hive2://hadoop1:10000> alter table t_seq add columns(address string,age int);
1

2.3 全部替换

jdbc:hive2://hadoop1:10000> alter table t_seq replace columns(id int,name string,address string,age int);
1

2.4 修改已存在的列定义

jdbc:hive2://hadoop1:10000> alter table t_seq change userid uid string;
1

3、建外部表

jdbc:hive2://hadoop1:10000> create external table t_3(id int,name string,salary bigint,add string)
row format delimited
fields terminated by ','
location '/aa/bb';
1
2
3
4

4、内部表和外部表区别

内部表的目录由hive创建在默认的仓库目录下：/user/hive/warehouse/…
外部表的目录由用户建表时自己指定： location ‘/位置/’
drop一个内部表时，表的元信息和表数据目录都会被删除；
drop一个外部表时，只删除表的元信息，表的数据目录不会删除；
意义：通常，一个数据仓库系统，数据总有一个源头，而源头一般是别的应用系统产生的，其目录无定法，为了方便映射，就可以在hive中用外部表进行映射；并且，就算在hive中把这个表给drop掉，也不会删除源数据目录，也就不会影响到别的应用系统；

5、分区表

分区关键字 PARTITIONED BY

jdbc:hive2://hadoop1:10000> create table t_4(ip string,url string,staylong int)
partitioned by (day string) 
row format delimited
fields terminated by ',';
1
2
3
4

分区标识不能存在于表字段中。

6、修改表的分区

6.1 添加分区

jdbc:hive2://hadoop1:10000> alter table t_4 add partition(day='2017-04-10') partition(day='2017-04-11');
1

添加完成后，可以检查t_4的分区情况：

jdbc:hive2://hadoop1:10000> show partitions t_4;
1

然后，可以向新增的分区中导入数据：

jdbc:hive2://hadoop1:10000> load data local inpath '/root/weblog.3' into table t_4 partition(day='2017-04-10');
jdbc:hive2://hdp-nn-01:10000> select * from t_4 where day='2017-04-10';
1
2

–还可以使用insert

insert into table t_4 partition(day='2017-04-11')
select ip,url,staylong from t_4 where day='2017-04-08' and staylong>30;
1
2

6.2 删除分区

jdbc:hive2://hadoop1:10000> alter table t_4 drop partition(day='2017-04-11');
1

四、数据的导入

先把数据存放到HDFS上指定目录

hdfs dfs -mkdir /hive_operate
hdfs dfs -mkdir /hive_operate/movie_table
hdfs dfs -mkdir /hive_operate/rating_table

hdfs dfs -put movies.csv /hive_operate/movie_table
hdfs dfs -put ratings.csv /hive_operate/rating_table
1
2
3
4
5
6

1、本地导入数据hive表

将hive运行所在机器的本地磁盘上的文件导入表中

hive>load data local inpath '/root/weblog.1' into[overwrite] table t_1;
1

2 、将hdfs中的文件导入hive表

jdbc:hive2://hadoop1:10000> load data inpath '/user.data.2' into table t_1;
1

不加local关键字，则是从hdfs的路径中移动文件到表目录中；

3、从别的表查询数据后插入到一张新建表中

jdbc:hive2://hadoop1:10000> create table t_1_jz
as
select id,name from t_1;
1
2
3

4、从别的表查询数据后插入到一张已存在的表中

加入已存在一张表：可以先建好：

jdbc:hive2://hadoop1:10000> create table t_1_hd like t_1;
1

然后从t_1中查询一些数据出来插入到t_1_hd中：

jdbc:hive2://hadoop1:10000> insert into table t_1_hd
select 
id,name,add 
from t_1
where add='handong';
1
2
3
4
5

5、导入数据到不同的分区目录

jdbc:hive2://hadoop1:10000> load data local inpath '/root/weblog.1' into table t_4 partition(day='2017-04-08');
jdbc:hive2://hadoop1:10000> load data local inpath '/root/weblog.2' into table t_4 partition(day='2017-04-09');
1
2

五、数据的导出

1、将数据从hive的表中导出到hdfs的目录中

jdbc:hive2://hadoop1:10000> insert overwrite directory '/aa/bb'
select * from t_1 where add='jingzhou';
1
2

2 、将数据从hive的表中导出到本地磁盘目录中

jdbc:hive2://hadoop1:10000> insert overwrite local directory '/aa/bb'
select * from t_1 where add='jingzhou';
1
2

六、显示命令

show tables
show databases
show partitions
例子： show partitions t_4;
show functions – 显示hive中所有的内置函数
desc t_name; – 显示表定义
desc extended t_name; – 显示表定义的详细信息
desc formatted table_name;
– 显示表定义的详细信息，并且用比较规范的格式显示
show create table table_name – 显示建表语句

七、hive 中 DML 语句

同sql语句

八、HIVE的内置函数

1、时间处理函数

from_unixtime(21938792183,'yyyy-MM-dd HH:mm:ss')  -->   '2017-06-03 17:50:30'
1

select current_date from dual;
select current_timestamp from dual;

select unix_timestamp() from dual;
--1491615665

select unix_timestamp('2011-12-07 13:01:03') from dual;
--1323234063

select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss') from dual;
--1323234063

select from_unixtime(1323234063,'yyyy-MM-dd HH:mm:ss') from dual;

--获取日期、时间
select year('2011-12-08 10:03:01') from dual;
--2011
select year('2012-12-08') from dual;
--2012
select month('2011-12-08 10:03:01') from dual;
--12
select month('2011-08-08') from dual;
--8
select day('2011-12-08 10:03:01') from dual;
--8
select day('2011-12-24') from dual;
--24
select hour('2011-12-08 10:03:01') from dual;
--10
select minute('2011-12-08 10:03:01') from dual;
--3
select second('2011-12-08 10:03:01') from dual;
--1

--日期增减
select date_add('2012-12-08',10) from dual;
--2012-12-18

date_sub (string startdate, int days) : string
--例：
select date_sub('2012-12-08',10) from dual;
--2012-11-28
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42

2、类型转换函数

from_unixtime(cast('21938792183' as bigint),'yyyy-MM-dd HH:mm:ss')
1

3、字符串截取和拼接

substr("abcd",1,3)  -->   'abc'
concat('abc','def')  -->  'abcdef'
1
2

4、Json数据解析函数

get_json_object('{\"key1\":3333，\"key2\":4444}' , '$.key1')  -->  3333

json_tuple('{\"key1\":3333，\"key2\":4444}','key1','key2') as(key1,key2)  --> 3333, 4444
1
2
3

5、url解析函数

parse_url_tuple('http://www.edu360.cn/bigdata/baoming?userid=8888','HOST','PATH','QUERY','QUERY:userid')
--->     www.edu360.cn      /bigdata/baoming     userid=8888   8888
1
2

6、函数：explode 和 lateral view

可以将一个数组变成列
加入有一个表，其中的字段为array类型
表数据：
1,zhangsan,数学:语文:英语:生物
2,lisi,数学:语文
3,wangwu,化学:计算机:java编程

建表：

create table t_xuanxiu(uid string,name string,kc array<string>)
row format delimited
fields terminated by ','
collection items terminated by ':';
1
2
3
4

** explode效果示例：

select explode(kc) from t_xuanxiu where uid=1;
1

数学
语文
英语
生物
1
2
3
4

** lateral view 表生成函数

hive> select uid,name,tmp.* from t_xuanxiu 
    > lateral view explode(kc) tmp as course;
1
2

1       zhangsan        数学
1       zhangsan        语文
1       zhangsan        英语
1       zhangsan        生物
2       lisi    数学
2       lisi    语文
3       wangwu  化学
3       wangwu  计算机
3       wangwu  java编程
1
2
3
4
5
6
7
8
9

利用explode和lateral view 实现hive版的wordcount 有以下数据：
a b c d e f g
a b c
e f g a
b c d b

对数据建表：

create table t_juzi(line string) row format delimited;
1

导入数据：

load data local inpath '/root/words.txt' into table t_juzi;
1

select a.word,count(1) cnt
from 
(select tmp.* from t_juzi lateral view explode(split(line,' ')) tmp as word) a
group by a.word
order by cnt desc;
1
2
3
4
5

7、row_number() over() 函数

常用于求分组TOPN

有如下数据：
zhangsan,kc1,90
zhangsan,kc2,95
zhangsan,kc3,68
lisi,kc1,88
lisi,kc2,95
lisi,kc3,98

建表：

create table t_rowtest(name string,kcId string,score int)
row format delimited
fields terminated by ',';
1
2
3

导入数据：

利用row_number() over() 函数看下效果：

select *,row_number() over(partition by name order by score desc) as rank from t_rowtest;
1

从而，求分组topn就变得很简单了：

select name,kcid,score
from
(select *,row_number() over(partition by name order by score desc) as rank from t_rowtest) tmp
where rank<3;
1
2
3
4

create table t_rate_topn_uid
as
select uid,movie,rate,ts
from
(select *,row_number() over(partition by uid order by rate desc) as rank from t_rate) tmp
where rank<11;
1
2
3
4
5
6

九、自定义函数

略

十、hive中的复合数据类型

1、array

有如下数据：

战狼2,吴京:吴刚:龙母,2017-08-16
三生三世十里桃花,刘亦菲:痒痒,2017-08-20
普罗米修斯,苍老师:小泽老师:波多老师,2017-09-17
美女与野兽,吴刚:加藤鹰,2017-09-17
1
2
3
4

– 建表映射：

create table t_movie(movie_name string,actors array<string>,first_show date)
row format delimited fields terminated by ','
collection items terminated by ':';
1
2
3

– 导入数据

load data local inpath '/root/hivetest/actor.dat' into table t_movie;
load data local inpath '/root/hivetest/actor.dat.2' into table t_movie;
1
2

– 查询

select movie_name,actors[0],first_show from t_movie;
1

select movie_name,actors,first_show
from t_movie where array_contains(actors,'吴刚');
1
2

select movie_name
,size(actors) as actor_number
,first_show
from t_movie;
1
2
3
4

2、map

有如下数据：

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26
1
2
3
4

– 建表映射上述数据

create table t_family(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
1
2
3
4

– 导入数据

load data local inpath '/root/hivetest/fm.dat' into table t_family;
1

– 查出每个人的爸爸、姐妹

select id,name,family_members["father"] as father,family_members["sister"] as sister,age
from t_family;
1
2

– 查出每个人有哪些亲属关系

select id,name,map_keys(family_members) as relations,age
from  t_family;
1
2

– 查出每个人的亲人名字

select id,name,map_values(family_members) as relations,age
from  t_family;
1
2

– 查出每个人的亲人数量

select id,name,size(family_members) as relations,age
from  t_family;
1
2

– 查出所有拥有兄弟的人及他的兄弟是谁
– 方案1：一句话写完

select id,name,age,family_members['brother']
from t_family  where array_contains(map_keys(family_members),'brother');
1
2

– 方案2：子查询

select id,name,age,family_members['brother']
from
(select id,name,age,map_keys(family_members) as relations,family_members 
from t_family) tmp 
where array_contains(relations,'brother');
1
2
3
4
5

3、struct

假如有以下数据：

1,zhangsan,18:male:深圳
2,lisi,28:female:北京
3,wangwu,38:male:广州
4,赵六,26:female:上海
5,钱琪,35:male:杭州
6,王八,48:female:南京
1
2
3
4
5
6

– 建表映射上述数据

drop table if exists t_user;
create table t_user(id int,name string,info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';
1
2
3
4

– 导入数据

load data local inpath '/root/hivetest/user.dat' into table t_user;
1

– 查询每个人的id name和地址

select id,name,info.addr
from t_user;
1
2

十、WEB管理

http://192.168.80.2:50070/explorer.html#/user/hive/warehouse
在这里插入图片描述

十一 HIVE的存储文件格式

HIVE支持很多种文件格式： SEQUENCE FILE | TEXT FILE | PARQUET FILE | RC FILE

试验：先创建一张表t_seq，指定文件格式为sequencefile

create table t_seq(id int,name string,add string)
stored as sequencefile;
1
2

然后，往表t_seq中插入数据，hive就会生成sequence文件插入表目录中

insert into table t_seq
select * from t_1 where add='handong';
1
2

十二、一个日活、日增的分析案例

1 、需求分析

我们有一个 web 系统。
每天都产生数据。
求，日新：每天新来的用户。
求，日活：每天的活跃用户。

2 ，数据

log2017-09-15

192.168.33.6,hunter,2017-09-15 10:30:20,/a
192.168.33.7,hunter,2017-09-15 10:30:26,/b
192.168.33.6,jack,2017-09-15 10:30:27,/a
192.168.33.8,tom,2017-09-15 10:30:28,/b
192.168.33.9,rose,2017-09-15 10:30:30,/b
192.168.33.10,julia,2017-09-15 10:30:40,/c
1
2
3
4
5
6

log2017-09-16

192.168.33.16,hunter,2017-09-16 10:30:20,/a
192.168.33.18,jerry,2017-09-16 10:30:30,/b
192.168.33.26,jack,2017-09-16 10:30:40,/a
192.168.33.18,polo,2017-09-16 10:30:50,/b
192.168.33.39,nissan,2017-09-16 10:30:53,/b
192.168.33.39,nissan,2017-09-16 10:30:55,/a
192.168.33.39,nissan,2017-09-16 10:30:58,/c
192.168.33.20,ford,2017-09-16 10:30:54,/c
1
2
3
4
5
6
7
8

log2017-09-17

192.168.33.46,hunter,2017-09-17 10:30:21,/a
192.168.43.18,jerry,2017-09-17 10:30:22,/b
192.168.43.26,tom,2017-09-17 10:30:23,/a
192.168.53.18,bmw,2017-09-17 10:30:24,/b
192.168.63.39,benz,2017-09-17 10:30:25,/b
192.168.33.25,baval,2017-09-17 10:30:30,/c
192.168.33.10,julia,2017-09-17 10:30:40,/c
1
2
3
4
5
6
7

3 ，建表，分区表

create table web_log(ip string,uid string,access_time string,url string) 
partitioned by (dt string)
row format delimited fields terminated by ',';
1
2
3

4 ，导入数据

load data local inpath '/root/hivetest/log2017-09-15' into table web_log partition(dt='2017-09-15');
load data local inpath '/root/hivetest/log2017-09-16' into table web_log partition(dt='2017-09-16');
load data local inpath '/root/hivetest/log2017-09-17' into table web_log partition(dt='2017-09-17');
1
2
3

5 ，查看数据，查看分区

select * from web_log;
show partitions web_log;
1
2

6 、日活数据：建表

ip ：用户的 ip 地址，如果他用过很多 ip 来访问我们，我们就取出他的最早访问的那一条
uid ：用户 id
first_access ：如果用户来过很多次，我们记录第一次
url ：他访问了我们的哪个页面

sql ：

create table t_user_access_day(ip string,uid string,first_access string,url string) partitioned by(dt string);
1

7 、日活数据，查询：每个用户访问最早的一条 sql 的进化

select ip,uid,access_time,url from web_log;
1

select ip,uid,access_time,url from web_log where dt='2017-09-15';
1

select ip,uid,access_time,url, 
row_number() over(partition by uid order by access_time) as rn
from web_log 
where dt='2017-09-15';
1
2
3
4

select ip,uid,access_time,url 
from
(select ip,uid,access_time,url, 
row_number() over(partition by uid order by access_time) as rn
from web_log 
where dt='2017-09-15') tmp
where rn=1;
1
2
3
4
5
6
7

结果：

+----------------+---------+----------------------+------+--+
|       ip       |   uid   |     access_time      | url  |
+----------------+---------+----------------------+------+--+
| 192.168.33.6   | hunter  | 2017-09-15 10:30:20  | /a   |
| 192.168.33.6   | jack    | 2017-09-15 10:30:27  | /a   |
| 192.168.33.10  | julia   | 2017-09-15 10:30:40  | /c   |
| 192.168.33.9   | rose    | 2017-09-15 10:30:30  | /b   |
| 192.168.33.8   | tom     | 2017-09-15 10:30:28  | /b   |
+----------------+---------+----------------------+------+--+
1
2
3
4
5
6
7
8
9

8 ，将查询到的数据，存储到日活表

insert into table t_user_access_day partition(dt='2017-09-15')
select ip,uid,access_time,url 
from
(select ip,uid,access_time,url, 
row_number() over(partition by uid order by access_time) as rn
from web_log 
where dt='2017-09-15') tmp
where rn=1;
1
2
3
4
5
6
7
8

9 ，活跃用户总结

15 号活跃用户：

insert into table t_user_access_day partition(dt='2017-09-15')
select ip,uid,access_time,url 
from
(select ip,uid,access_time,url, 
row_number() over(partition by uid order by access_time) as rn
from web_log 
where dt='2017-09-15') tmp
where rn=1;
1
2
3
4
5
6
7
8

16 号活跃用户：

insert into table t_user_access_day partition(dt='2017-09-16')
select ip,uid,access_time,url 
from
(select ip,uid,access_time,url, 
row_number() over(partition by uid order by access_time) as rn
from web_log 
where dt='2017-09-16') tmp
where rn=1;
1
2
3
4
5
6
7
8

17 号活跃用户：

insert into table t_user_access_day partition(dt='2017-09-17')
select ip,uid,access_time,url 
from
(select ip,uid,access_time,url, 
row_number() over(partition by uid order by access_time) as rn
from web_log 
where dt='2017-09-17') tmp
where rn=1;
1
2
3
4
5
6
7
8

10 ，日新：思路

建历史表。
用今天的日活用户关联历史表。
日活有数据，历史表没有数据，就是当天的新用户，将数据插入到当天新增用户表。
查询过后，将当天的新用户，加入到历史表中。

11 ，日新：找到历史表中没有，日活有的用户 ( 日新 )

建表：历史用户表

 create table t_user_history(uid string);
1

建表：新用户表

create table t_user_new_day like t_user_access_day;
1

看一下，3 张表：
t_user_access_day 日活用户表
t_user_history 历史用户表
t_user_new_day 日新用户表
找出新用户：历史表没有，日活表有的数据

select a.*
from t_user_access_day a left join t_user_history b on a.uid=b.uid 
where a.dt='2017-09-15' and b.uid is null;
1
2
3

将这些数据，存储进日新表：

insert into table t_user_new_day partition(dt='2017-09-15') 
select a.ip,a.uid,a.first_access,a.url 
from t_user_access_day a left join t_user_history b on a.uid=b.uid 
where a.dt='2017-09-15' and b.uid is null;
1
2
3
4

将这些新用户插入历史表：

insert into t_user_history
select uid from t_user_new_day where dt='2017-09-15';
1
2

12 ，编写脚本

 vim rixin.sh
1

#!/bin/bash
day_str=`date -d '-1 day' +'%Y-%m-%d'`
echo "准备处理 $day_str 的数据......"

HQL_user_active_day="
insert into table sfl.t_user_active_day partition(day=\"$day_str\")
select ip,uid,access_time,url
from 
(select ip,uid,access_time,url,
row_number() over(partition by uid order by access_time) as rn
where day=\"$day_str\") tmp
where rn=1
"
echo "executing sql :"
echo $HQL_user_active_day
hive -e "$HQL_user_active_day"
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

十三、hive 系统参数

本地模式

set hive.exec.mode.local.auto=true
1

动态分区

set hive.exec.dynamic.partition.mode=nonstrict;
1

#指定开启分桶

set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
1
2

声明：本文内容由网友自发贡献，不代表【wpsshop博客】立场，版权归原作者所有，本站不承担相应法律责任。如您发现有侵权的内容，请联系我们。转载请注明出处：https://www.wpsshop.cn/w/从前慢现在也慢/article/detail/800644

大数据-玩转数据-Hive应用小结_基于hive数据仓库的教育大数据分析平台的存在和总结

说明

一、 HIVE是什么

二、登录交互模式

1、单机交互式

2、hive服务交互

3、将hive作为命令shell运行

4、shell脚本中调用

三、HIVE的DDL语法

1、建库

2、建内部表

2.1创建表

2.2 添加列

2.3 全部替换

2.4 修改已存在的列定义

3、建外部表

4、内部表和外部表区别

5、分区表

6、修改表的分区

6.1 添加分区

6.2 删除分区

四、数据的导入

1、本地导入数据hive表

2 、将hdfs中的文件导入hive表

3、 从别的表查询数据后插入到一张新建表中

4、 从别的表查询数据后插入到一张已存在的表中

5、导入数据到不同的分区目录

五、数据的导出

1、将数据从hive的表中导出到hdfs的目录中

2 、将数据从hive的表中导出到本地磁盘目录中

六 、显示命令

七、hive 中 DML 语句

八、HIVE的内置函数

1、时间处理函数

2、类型转换函数

3、字符串截取和拼接

4、Json数据解析函数

5、url解析函数

6、函数：explode 和 lateral view

7、row_number() over() 函数

九、 自定义函数

十、hive中的复合数据类型

1、array

2、map

3、struct

十、WEB管理

十一 HIVE的存储文件格式

十二、一个日活、日增的分析案例

1 、需求分析

2 ，数据

3 ，建表 ，分区表

4 ，导入数据

5 ，查看数据，查看分区

6 、日活数据 ： 建表

7 、日活数据，查询 ： 每个用户访问最早的一条 sql 的进化

8 ，将查询到的数据，存储到日活表

9 ，活跃用户总结

10 ，日新 ：思路

11 ，日新 ： 找到历史表中没有，日活有的用户 ( 日新 )

12 ，编写脚本

十三、hive 系统参数

3、从别的表查询数据后插入到一张新建表中

4、从别的表查询数据后插入到一张已存在的表中

六、显示命令

九、自定义函数

3 ，建表，分区表

6 、日活数据：建表

7 、日活数据，查询：每个用户访问最早的一条 sql 的进化

10 ，日新：思路

11 ，日新：找到历史表中没有，日活有的用户 ( 日新 )