菜鸟追梦旅行

这个屌丝很懒，什么也没留下！

热门标签

HIVE数据类型与函数__(1)_ data (2) inpath '1.txt' _(3)_ in

作者：菜鸟追梦旅行 | 2024-05-01 14:46:23

踩

___(1)___ data __(2)__ inpath '1.txt' ___(3)___ into __(4)___ db1.tbl1;

1.HIVE的数据类型

1.1基本数据类型

注意：Hive中没有long类型，是bigint。

1.2复杂数据类型

1.3 hive默认分隔符

2.HIVE对库的操作

2.1创建库

（1）简单方式：

create database t1;

show databases;

（2）创建库的时候带注释。

create database if not exists t2 comment 'learning hive';

（3）创建带属性的库。

create database if not exists t3 with dbproperties('creator'='hadoop','date'='2019-01-01');

2.2查看库

（1）最常用查看库方式。

show databases;

（2）显示数据库的详细属性信息。

desc database t3;

desc database extended t3;

（3）查看正在使用哪个库。

select current_database();

2.3删除库

默认情况下，hive 不允许删除包含表的数据库。需要使用cascade 关键字。

drop database if exists t3 cascade;

2.4切换库

切换到t2数据库。

use t2;

3.HIVE对表的操作

3.1内部表与外部表

（1）内部表

表目录hive会自动创建在默认的HDFS目录下

create table worker_1(id int,name string,salary bigint,addr string)

row format delimited

fields terminated by ‘,’;

查找命令：/user/hive/warehouse/t2.db/test_1

（2）外部表

创建的时候，需要使用external关键字，并指定表对应hdfs上的目录/aa/bb。

create external table test_2(id int,name string,salary bigint,addr string)

row format delimited

fields terminated by ‘,’

location ‘/test’;

drop一个内部表时，表的元信息和表数据目录都会被删除。

drop一个外部表时，只删除表的元信息，表的数据目录不会被删除。

外部表的意义

通常，一个数据仓库系统，数据总有一个源头，而源由一般是别的应用程序产生的，其目录无法确定，为了方便映射，就可以在hive中用外部表映射。并且，就算hive中把这个表删了，也不会删除数据目录，就不会影响到别的应用系统。

3.2数据的导入

（1）导入

将hive服务器运行所在节点的本地磁盘上的文件导入表中。

注意：

这里load的文件是在开启server的节点上。不是在客户端节点上。

我们编辑的数据如果有中文，必须得是UTF-8编码格式，否则数据会出现乱码现象。

load data local inpath '/opt/testdata/hive/test_1.txt’' into table test_1;

加overwrite可以实现覆盖，不加overwrite是追加到表后面。

load data local inpath ‘/opt/testdata/hive/test_1.txt’ overwrite into table test_1;

将hdfs上的文件导入表中。

上传文件到HDFS。

（2）从别的表查询数据后插入到一张新建的表中。表会自动生成。

create table test_3

as

select id,name,salary

from worker_2

where salary>=15000;

将数据从hive的表中导出到本地磁盘的目录中

insert overwrite local directory ‘‘/opt/testdata/hive/test_1.log’

select * from test_1;

3.3 hive的复杂数据类型

array、map、struct。

现有数据：

1huangboguangzhou,xianggang,shenzhena1:30,a2:20,a3:100beijing,112233,13522334455,500

2 xuzheng xianggang b2:50,b3:40 tianjin,223344,13644556677,600

3 wangbaoqiang beijing,zhejiang c1:200 chongqinjg,334455,15622334455,20

建表语句：

create table movie_info(

id int,

name string,

work_location array<string>,

piaofang map<string,bigint>,

address struct<location:string,zipcode:int,phone:string,value:int>)

row format delimited

fields terminated by " "

collection items terminated by ","

map keys terminated by ":" ;

导入数据：

load data local inpath "/opt/testdata/hive/movie_info.txt" into table movie_info;

查询语句：

array：select work_location[0] from movie_info;

map：select piaofang["a1"] from movie_info;

struct：select address.location from movie_info;

3.4 hive的文件存储格式

Hive支持多种文件格式：sequence file、text file、parquet file、rc file、orc file。

textfile为默认格式，存储方式为行存储。数据不做压缩，磁盘开销大，数据解析开销大。

SequenceFile是Hadoop API提供的一种二进制文件支持，其具有使用方便、可分割、可压缩的特点。 SequenceFile支持三种压缩选择：NONE, RECORD, BLOCK。 Record压缩率低，一般建议使用BLOCK压缩。

RC file一种行列存储相结合的存储方式。

ORCFile数据按照行分块，每个块按照列存储，其中每个块都存储有一个索引。每一块的默认大小为256MB。ORC是hive给出的新格式，属于RCFILE的升级版,性能有大幅度提升,而且数据可以压缩存储,压缩快快速列存取。

Parquet也是一种列式存储，同时具有很好的压缩性能；同时可以减少大量的表扫描和反序列化的时间。Snappy压缩方式。

创建seq表，对应的文件类型是sequencefile。

create table worker_seq(id int,name string)

stored as sequencefile;

将从别的表查询的数据放入到seq中

insert into worker_seq

select id,name from test_1;

将查询出来的数据直接使用sequencefile保存。

create table test_seq

stored as sequencefile

as

select * from test_1;

将查询出来的数据直接使用orc保存。

create table test_orc

stored as orc

as

select * from test_1;

将查询出来的数据直接使用parquet保存。

create table test_par

stored as parquet

as

select * from test_1;

3.5查看信息

新建表。

create table student(id int,name string,age int)

row format delimited

fields terminated by “,”;

查看表信息。

desc student;

查看表的详细信息。

desc extended student;

desc formatted student;

查看表的详细建表语句。

show create table student;

3.6修改表

修改表名。

alter table student rename to new_student;

修改字段。

增加一个字段：alter table new_student add columns (score int);

修改一个字段的定义：alter table new_student change name new_name string;

不支持删除字段。

3.7删除和清空表

删除表。

drop table new_student;

清空表。

truncate table student;

3.8 hive的分区表

（1）分区表的创建

建表

create table worker_4(id int,name string,salary bigint,addr string)

partitioned by (day string)

row format delimited

fields terminated by ‘,’;

注意：分区的字段一定不能在定义的字段里。

（2）导入数据

load data local inpath '/usr/datadir/worker_1.txt' into table worker_4 partition(day='01');

load data local inpath '/usr/datadir/worker_1.txt' into table worker_4 partition(day='02');

它会将day这个分区条件也当成了一个字段。

如果只查询test_4下day=01目录下的信息，可以用：

select * from test_4 where day=’01’;

（3）增删分区

查看分区信息。

show partitions test_4;

增加分区：

alter table test_4 add partition(day='03') partition(day='04');

通过加载数据实现添加分区：

load data local inpath '/opt/testdata/hive/test_3.txt' into table test _4 partition(day='05');

还可以使用insert实现分区：

insert into table test_4 partition(day='06')

select * from test_2 where salary>=5000;

删除分区：

alter table test_4 drop partition(day='06');

（4）动态分区

新建表。

create table student(id int,name string,sex string,age int,department string)

row format delimited fields terminated by ",";

load data local inpath '/opt/testdata/hive/student.txt' into table student;

把这一张表的内容直接插入到另一张表student_ptn_age中，并实现age为动态分区（不指定到底是哪种年龄，让系统自己分配决定）。

创建分区表。

create table student_ptn_age(id int,name string,sex string,department string)

partitioned by (age int);

插入数据，实现动态分区。

动态分区需要设置set hive.exec.dynamic.partition.mode=nonstrict;不然会报错。

insert overwrite table student_ptn_age partition(age)

select id,name,sex,department,age from student;

查询的分区字段要写在最后。

3.9 hive的分桶表

（1）分桶的概念

分桶是相对分区进行更细粒度的划分（数据取样更高效）。分桶将整个数据内容按照某列属性值的hash值进行区分，如要安装name属性分为3个桶，就是对name属性值的hash 值对3取摸，按照取模结果对数据分桶。如取模结果为0的数据记录存放到一个文件，取模为1的数据存放到一个文件，取模为2的数据存放到一个文件。

（2）分桶的操作

创建分桶表

create table student_bck(id int, name string)

clustered by (id) into 3 buckets #3个桶

row format delimited fields terminated by ",";

向桶中插入数据

insert overwrite table student_bck

select id,name from student;

查看存储信息

查看分桶数据

select * from student_bck tablesample(bucket 1 out of 3 on id);

tablesample (bucket x out of y on id);

x表示从哪个桶(x-1)开始，y代表分几个桶，也可以理解分x为分子，y为分母，及将表分为y份（桶），取第x份（桶）。

3.10视图

create view v_name(字段)

as

select * from t_student;

视图不能load数据，也不能insert。只能用来进行查询。

视图是一个逻辑的概念，并不是物理上存在的。

drop view v_name;

name，age，idcard，cardnum

create view v_name(字段)

as

select name,age from table;

3.11 hive的表关联操作

数据准备：

[root@hadoop01 datadir]# cat order.txt

112,皮鞋

114,耳机

116,可乐

121,鼠标

110,钢笔

[root@hadoop01 datadir]# cat goods.txt

114,130

116,5

112,500

110,50

119,800

创建表，导入数据：

create table t_order(orderid int,name string)

row format delimited

fields terminated by ",";

load data local inpath '/usr/datadir/order.txt' into table t_order;

create table t_goods(goodid int,price int)

row format delimited

fields terminated by ",";

load data local inpath '/opt/tastdata/hive/goods.txt' into table t_goods;

Hive中的join分为了内连接、左外连接、右外连接、全外连接。

内连接：inner join

select * from t_order inner join t_goodson orderid = goodid;

只会把相同关联条件匹配上的数据保留下来。

3.12 union和union all

union关联的时候会对数据进行去重，union all不会。

select * from test_1union

select * from test_2;

select * from test_1union all

select * from test_2;

4.HIVE的函数

（1）hive的内置函数

类型转换

cast(expr as <type>)

例如：cast(“1” as bigint)

select cast(money as bigint);

切割。

split(string str, string pat)

select split('nihao|hello|nice','\\|')

正则表达式截取字符串。

regexp_extract(string subject, string pattern, int index)

select regexp_extract('hello<B>nice</B>haha','<B>(.*)</B>',1)

select regexp_extract(字段名,正则表达式,索引)

将字符串前后出现的空格去掉。

trim(string A)

求指定列的聚合函数。

sum(col)

avg(col)

min(col)

max(col)

select subject,sum(score)from table_namegroup by subject

select中的字段，必须要在group by后面出现出行，或者用到聚合函数中。

拼接字符串。

concat(string A, string B...)

字符串的截取。

select substr('abcde',3,2)

炸裂函数。

select explode(split("nice|good|well","\\|"));

nice|good|well

NiceGoodWell

case when

打分、评级的时候。

数据准备：

tom,95

hua,90

hong,100

lele,85

kaka,70

kebi,60

ming,55

kang,78

lolo,93

create table student_level(name string,score int)

row format delimited fields terminated by ",";

load data local inpath '/usr/datadir/student_score.txt' into table student_level;

查看：select * from student_level;

select name,score,

case when score >= 90 then 'very good'

when score >= 80 and score <90 then 'double good'

when score >= 70 and score <80 then 'good'

when score >= 60 and score <70 then 'go on'

else 'zhencai'end levelfrom student_level;

炸裂函数实际应用

tom a,b,c

jim b,c,d

tony a,c,d

create table test1(name string,subject string)

row format delimited

fields terminated by " ";

load data local inpath '/usr/datadir/student_info.txt' into table test1;

select name,subfrom test1LATERAL VIEW explode(split(subject,','))temp as sub;

（2）日期处理函数

1.date_format函数（根据格式整理日期）

yyyy-MM-dd HH:mm:ss

hive (gmall)> select date_format('2020-03-05','yyyy-MM');

2020-03

2.date_add函数（加减日期）

hive (gmall)> select date_add('2020-03-05',-1);

2020-03-04

hive (gmall)> select date_add('2020-03-05',1);

2020-03-06

hive (gmall)> select date_sub('2020-03-05',1);

2020-03-04

3.next_day函数

（1）取当前天的下一个周一

hive (gmall)> select next_day('2020-03-05','MO');

2020-03-09

说明：星期一到星期日的英文（Monday，Tuesday、Wednesday、Thursday、Friday、Saturday、Sunday）

（2）取当前周的周一

hive (gmall)> select date_add(next_day('2020-03-05','MO'),-7);

2020-03-02

last_day函数（求当月最后一天日期）

hive (gmall)> select last_day('2020-03-05');

2020-03-31

1.处理json数据

现有json数据。

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}

{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}

{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}

{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}

{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}

{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}

{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}

{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}

{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}

大数据的数据来源：爬虫采集（python+scrapy）、前后端的埋点数据、业务数据。

（3）创建表，并load数据。

create table json(data string);

load data local inpath '/usr/datdadir/json.txt' into table json;

2.查询json数据。

select get_json_object(data,'$.movie') as movie,get_json_object(data,'$.rate') as ratefrom json;

（4）窗口函数

准备数据。

cookie1,2015-04-10,1

cookie1,2015-04-13,3

cookie1,2015-04-11,5

cookie1,2015-04-12,2

cookie1,2015-04-15,4

cookie1,2015-04-16,4

cookie1,2015-04-14,4

cookie2,2015-04-14,3

cookie2,2015-04-15,9

cookie2,2015-04-16,7

cookie2,2015-04-10,2

cookie2,2015-04-11,3

cookie2,2015-04-12,5

cookie2,2015-04-13,6

创建表。

create table cookie1(cookieid string, createtime string, pv int) row format delimited fields terminated by ',';

加载数据。

load data local inpath "/usr/datadir/cookie1.txt" into table cookie1;

sum(pv) over()

我们通过cookieid分组，createtime排序，求pv的和。求之前行到当前行的pv和。不加范围限定，默认也是这种。

sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1,

sum(pv) over (partition by cookieid order by createtime) as pv2,

如果只进行了分组，没有排序，会将分组内的所有数据进行求和。

sum(pv) over (partition by cookieid) as pv3,

求当前行与前3行的pv和。

sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4,

当前行的前3行到后2行。

sum(pv) over(partition by cookid order by createtime rows between 3 preceding and 2 following) as pv5,

当前行到最后行。

sum(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6,

代码：

select cookieid,createtime,

   pv,

   sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1,

   sum(pv) over (partition by cookieid order by createtime) as pv2,

   sum(pv) over (partition by cookieid) as pv3,

   sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4,

   sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and 2 following) as pv5,

   sum(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6

from cookie1;

（5）窗口分片

数据准备：

cookie1,2015-04-10,1

cookie1,2015-04-11,5

cookie1,2015-04-12,7

cookie1,2015-04-13,3

cookie1,2015-04-14,2

cookie1,2015-04-15,4

cookie1,2015-04-16,4

cookie2,2015-04-10,2

cookie2,2015-04-11,3

cookie2,2015-04-12,5

cookie2,2015-04-13,6

cookie2,2015-04-14,3

cookie2,2015-04-15,9

cookie2,2015-04-16,7

创建表。

create table cookie2(cookieid string, createtime string, pv int)

row format delimited

fields terminated by ',';

加载数据。

load data local inpath "/usr/datadir/cookie2.txt" into table cookie2;

查看数据。

以下不支持rows between

ntile(n) over()

按顺序将组内的数据分为几片，一般用来求前几分之几的数据。

ntile(2) over (partition by cookieid order by createtime) as rn1

ntile(3) over (partition by cookieid order by createtime) as rn2,

如果不加分区，会将所有数据分成多片。

ntile(4) over (order by createtime) as rn3

代码：

select cookieid,createtime,

  pv,

  ntile(2) over (partition by cookieid order by createtime) as rn1,

  ntile(3) over (partition by cookieid order by createtime) as rn2,

  ntile(4) over (order by createtime) as rn3

from cookie1

order by cookieid,createtime;

比如，统计一个cookie，pv数最多的前1/3的天。

create table cookie_temp

as

select cookieid, createtime, pv, ntile(3) over (partition by cookieid order by pv desc) as rn from cookie2;

结果：

我们取rn=1的就是pv最多的前三分之一

（6）窗口排序：

row_number() over()

分组排序，并记录名次，一般用来取前n名

row_number() over (partition by cookieid order by pv desc) as rn1

100 99 98 98 97 96

1，2，3，4，5，6

rank() over()

rank() over(partition by cookieid order by pv desc) as rn2

100 99 98 98 97 96

1，2，3，3，5，6

dense_rank() over()

dense_rank() over(partition by cookieid order by pv desc) as rn3

100 99 98 98 97 96

1，2，3，3，4，5

代码:

select cookieid,createtime,

pv,

rank() over (partition by cookieid order by pv desc) as rn1,

dense_rank() over (partition by cookieid order by pv desc) as rn2,

row_number() over (partition by cookieid order by pv desc) as rn3

from cookie2

where cookieid='cookie1';

（7）上下移动

数据准备。

cookie1,2015-04-10 10:00:02,url2

cookie1,2015-04-10 10:00:00,url1

cookie1,2015-04-10 10:03:04,url3

cookie1,2015-04-10 10:50:05,url6

cookie1,2015-04-10 11:00:00,url7

cookie1,2015-04-10 10:10:00,url4

cookie1,2015-04-10 10:50:01,url5

cookie2,2015-04-10 10:00:02,url22

cookie2,2015-04-10 10:00:00,url11

cookie2,2015-04-10 10:03:04,url33

cookie2,2015-04-10 10:50:05,url66

cookie2,2015-04-10 11:00:00,url77

cookie2,2015-04-10 10:10:00,url44

cookie2,2015-04-10 10:50:01,url55

创建表。

create table cookie3(cookieid string, createtime string, url string)

row format delimited fields terminated by ',';

加载数据。

load data local inpath "/usr/datadir/cookie3.txt" into table cookie3;

查看数据。

LAG(col,n,DEFAULT)

用于将当前列往上移n行

第一个参数为列名。

第二个参数为往上第n行（可选，默认为1）。

第三个参数为默认值（当往上第n行为NULL时候，取默认值，如不指定，则为NULL）。

select

  cookieid,

  createtime,

  url,

  row_number() over (partition by cookieid order by createtime) as rn,

  LAG(createtime,1,'1970-01-01 00:00:00') over (partition by cookieid order by createtime) as last_1_time,

  LAG(createtime,2) over (partition by cookieid order by createtime) as last_2_time

from cookie3;

与上面的相似，用于将当前列往下移n行。

select

  cookieid,

  createtime,

  url,

  row_number() over (partition by cookieid order by createtime) as rn,

  LEAD(createtime,1,'1970-01-01 00:00:00') over (partition by cookieid order by createtime) as next_1_time,

  LEAD(createtime,2) over (partition by cookieid order by createtime) as next_2_time

from cookie3;

（8）首尾值

FIRST_VALUE(url) over ()

分组排序后截至到当前行的第一个值。

FIRST_VALUE(url) over (partition by cookieid order by createtime desc) as last1

LAST_VALUE(url) over ()

分组排序后截至到当前行的最后一个值。

FIRST_VALUE(url) over (partition by cookieid order by createtime desc) as last2

代码：

select cookieid,createtime,url,

row_number() over (partition by cookieid order by createtime) as rn,

FIRST_VALUE(url) over (partition by cookieid order by createtime desc) as last1,

LAST_VALUE(url) over (partition by cookieid order by createtime desc) as last2

from cookie3;

5.自定义函数

当 Hive 提供的内置函数无法满足业务处理需要时，此时就可以考虑使用用户自定义函数。

UDF（user-defined function）作用于单个数据行，产生一个数据行作为输出。

UDAF（用户定义聚集函数 User- Defined Aggregation Funcation）：接收多个输入数据行，并产生一个输出数据行。类似于max、min。

UDTF（表格生成函数 User-Defined Table Functions）：接收一行输入，输出多行。类似于explode。

5.1自定义UDF

（1）创建Maven项目，并导入依赖（eclipse中也可将hive-exec的jar包复制进来然后build path）。

<dependency>

            <groupId>org.apache.hive</groupId>

            <artifactId>hive-exec</artifactId>

            <version>2.3.3</version>

<exclusions>

                <exclusion>

                    <groupId>jdk.tools</groupId>

                    <artifactId>jdk.tools</artifactId>

                </exclusion>

            </exclusions>

        </dependency>