Hive's data model: Hive ----> databases (...) ---> tables (...) ---> records (...)
Data Definition Language (DDL) is the subset of SQL used to create, drop, and modify the structure of objects inside a database, such as databases and tables.
The core DDL syntax consists of CREATE, ALTER, and DROP. DDL does not operate on the data inside tables.
Hive SQL (HQL) is largely similar to standard SQL; the two are mostly interchangeable.
Given how Hive is designed and used, the create syntax in HQL (especially create table) is the most important part of Hive DDL to master. Whether a table is created correctly directly determines whether the data files are mapped successfully, which in turn determines whether the data can later be analyzed with SQL. Put simply: without a table, or with a table that maps no data, what would you analyze with Hive?
Choosing the right direction often matters more than blind effort.
In Hive, the default database is called default, and its data is stored on HDFS under /user/hive/warehouse.
Databases created by users are stored under /user/hive/warehouse/database_name.db.
create database creates a new database:
COMMENT: a comment describing the database.
LOCATION: the HDFS storage location of the database; defaults to /user/hive/warehouse/dbname.db.
WITH DBPROPERTIES: specifies additional database properties.
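A minimal sketch combining these clauses (the database name, comment, location, and property below are hypothetical; every clause besides the name is optional):

```sql
-- Hypothetical example of create database with all optional clauses
CREATE DATABASE IF NOT EXISTS demo_db
COMMENT 'a demo database'
LOCATION '/user/hive/warehouse/demo_db.db'
WITH DBPROPERTIES ('createdBy'='root');
```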
- # Connect with the bin/beeline client
- [root@node1 ~]# /export/server/hive-3.1.3/bin/beeline
- SLF4J: Class path contains multiple SLF4J bindings.
- SLF4J: Found binding in [jar:file:/export/server/hive-3.1.3/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
- SLF4J: Found binding in [jar:file:/export/server/hadoop-3.3.6/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
- SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
- SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
- SLF4J: Class path contains multiple SLF4J bindings.
- SLF4J: Found binding in [jar:file:/export/server/hive-3.1.3/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
- SLF4J: Found binding in [jar:file:/export/server/hadoop-3.3.6/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
- SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
- SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
- Beeline version 3.1.3 by Apache Hive
- beeline> ! connect jdbc:hive2://node1:10000
- Connecting to jdbc:hive2://node1:10000
- Enter username for jdbc:hive2://node1:10000: root
- Enter password for jdbc:hive2://node1:10000:
- Connected to: Apache Hive (version 3.1.3)
- Driver: Hive JDBC (version 3.1.3)
- Transaction isolation: TRANSACTION_REPEATABLE_READ
- #create the database
- 0: jdbc:hive2://node1:10000> create database lwztest;
- INFO : Compiling command(queryId=root_20240227234021_e74ac3ed-eaf0-4ee1-a4a7-7322a1e81285): create database lwztest
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Semantic Analysis Completed (retrial = false)
- INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
- INFO : Completed compiling command(queryId=root_20240227234021_e74ac3ed-eaf0-4ee1-a4a7-7322a1e81285); Time taken: 0.023 seconds
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Executing command(queryId=root_20240227234021_e74ac3ed-eaf0-4ee1-a4a7-7322a1e81285): create database lwztest
- INFO : Starting task [Stage-0:DDL] in serial mode
- INFO : Completed executing command(queryId=root_20240227234021_e74ac3ed-eaf0-4ee1-a4a7-7322a1e81285); Time taken: 1.008 seconds
- INFO : OK
- INFO : Concurrency mode is disabled, not creating a lock manager
- No rows affected (1.093 seconds)
- 0: jdbc:hive2://node1:10000>
On HDFS, the databases are stored under the /user/hive path
- # Switch databases
- use database
Selects a specific database, switching which database the current session operates against.
- # Switch to the lwztest database
- 0: jdbc:hive2://node1:10000> use lwztest;
- INFO : Compiling command(queryId=root_20240227234852_0877050e-8aea-4d1b-a8d5-de023a4d15ae): use lwztest
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Semantic Analysis Completed (retrial = false)
- INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
- INFO : Completed compiling command(queryId=root_20240227234852_0877050e-8aea-4d1b-a8d5-de023a4d15ae); Time taken: 0.046 seconds
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Executing command(queryId=root_20240227234852_0877050e-8aea-4d1b-a8d5-de023a4d15ae): use lwztest
- INFO : Starting task [Stage-0:DDL] in serial mode
- INFO : Completed executing command(queryId=root_20240227234852_0877050e-8aea-4d1b-a8d5-de023a4d15ae); Time taken: 0.016 seconds
- INFO : OK
- INFO : Concurrency mode is disabled, not creating a lock manager
- No rows affected (0.08 seconds)
- 0: jdbc:hive2://node1:10000>
drop database: deletes a database. The default behavior is RESTRICT, meaning the database is dropped only if it is empty.
To drop a database that contains tables (a non-empty database), use CASCADE.
DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE];
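For instance (the database name demo_db is hypothetical):

```sql
-- Fails if demo_db still contains tables:
DROP DATABASE IF EXISTS demo_db RESTRICT;
-- Drops demo_db along with all of its tables:
DROP DATABASE IF EXISTS demo_db CASCADE;
```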
A database typically contains one or more tables. Each table is identified by a name.
Tables contain records (rows) of data.
- CREATE TABLE [IF NOT EXISTS][db_name.]table_name
- (col_name data_type [COMMENT col_comment],...)
- [COMMENT table_comment]
- [ROW FORMAT DELIMITED ...];
-
- The square brackets [] denote optional syntax.
- Clauses in a CREATE TABLE statement must appear in the same order as in the syntax tree.
Hive data types are the types of the columns in a table.
They fall into two broad categories: primitive data types and complex data types.
The most commonly used types are string and int.
The ROW FORMAT DELIMITED syntax specifies the delimiters between fields, so that Hive can read and parse the data correctly.
Put differently, only when the delimiters are specified correctly and the data parses successfully will any data show up in the table.
LazySimpleSerDe is Hive's default SerDe. It offers four sub-clauses that specify the delimiters between fields, between collection items, between map key-value pairs, and between rows.
These can be combined flexibly at table-creation time to match the shape of the data.
- ROW FORMAT DELIMITED
- [FIELDS TERMINATED BY char] -------------->delimiter between fields
- [COLLECTION ITEMS TERMINATED BY char]----->delimiter between collection items
- [MAP KEYS TERMINATED BY char]------------->delimiter between map key-value pairs
- [LINES TERMINATED BY char]---------------->delimiter between rows
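A hedged sketch that uses all four sub-clauses together (the table name, columns, and delimiter choices are hypothetical):

```sql
-- Hypothetical table mixing primitive and complex types,
-- e.g. a data line: 1<TAB>Alice<TAB>reading,coding<TAB>math:90,english:80
create table t_profile(
  id      int,
  name    string,
  hobbies array<string>,
  scores  map<string,int>
)
row format delimited
fields terminated by '\t'           -- between fields
collection items terminated by ','  -- between array/map elements
map keys terminated by ':'          -- between a map key and its value
lines terminated by '\n';           -- between rows
```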
Create the test data file archer.txt
- 1 后羿 5986 1784 396 336 remotely arche
- 2 马可波罗 5584 200 392 344 remotely arche
- 3 鲁班七号 5989 1756 396 400 remotely arche
- 4 李元芳 5725 1770 411 336 remotely arche
- 5 孙尚香 6014 1756 376 346 remotely arche
- 6 黄忠 5998 1784 396 336 remotely arche
- 7 狄仁杰 5710 1770 407 338 remotely arche
- 8 虞姬 5669 1770 396 336 remotely arche
- 9 成吉思汗 5799 1742 394 329 remotely arche
- 10 百里守约 5611 1784 410 329 remotely arche assassin
Table creation statement
- create table t_archer(
- id int comment 'ID',
- name string comment '英雄名称',
- hp_max int comment '最大生命',
- mp_max int comment '最大法力',
- attack_max int comment '最高物攻',
- defense_max int comment '最大物防',
- attack_range string comment '攻击范围',
- role_main string comment '主要定位',
- role_assist string comment '次要定位'
- )
- row format delimited
- fields terminated by '\t'; --the field delimiter is the tab character
After creating the table, upload the data file
- #Upload the file to HDFS under /user/hive/warehouse/lwztest.db/t_archer
- [root@node1 ~]# hadoop fs -put file:///export/archer.txt hdfs://node1:8020/user/hive/warehouse/lwztest.db/t_archer
- [root@node1 ~]#
View the table data in DataGrip
select * from t_archer;
If no row format clause is given at table-creation time, Hive uses the default delimiter.
The default delimiter is '\001', a special non-printing character with that ASCII value; it cannot be typed directly on a keyboard.
In the vim editor, press Ctrl+V followed by Ctrl+A to enter '\001'; it is displayed as ^A.
Some text editors display it as SOH.
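As a hedged sketch, writing '\001' explicitly in the DDL is equivalent to omitting the row format clause entirely (the table name is hypothetical):

```sql
-- Hypothetical: identical in effect to relying on Hive's default delimiter
create table t_student_explicit(
  id   int,
  name string,
  age  int
)
row format delimited
fields terminated by '\001';
```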
Test data student.txt
- 1张三10
- 2李四20
- 3王五30
Upload the file to HDFS
- [root@node1 export]# hadoop fs -put file:///export/student.txt hdfs://node1:8020/user/hive/warehouse/lwztest.db/t_student
- [root@node1 export]#
Table creation statement
- -- Create the table with the default delimiter \001, a non-printing character
- create table t_student(
- id int comment 'ID',
- name string comment '学生名称',
- age int comment '年龄'
- );
Query statement
select * from t_student;
The query results are as follows
If the delimiter in the data does not match the table definition, a SQL query returns nothing but NULLs: the mapping failed.
The show family of statements helps users query related information.
For example, the most common one, show tables, lists the tables in the current database.
- -- 1. List all databases; SCHEMAS and DATABASES are synonyms
- show databases;
- show schemas;
-
- -- 2. List all tables in the current database
- show tables;
- show tables [in database_name];--specify a particular database
-
- -- 3. Show a table's metadata
- desc formatted t_student;
The symptom: garbled Chinese comments, as shown in the figure below
Fixing garbled Chinese comments in Hive
- --Note: run the following SQL in MySQL to alter the tables holding Hive's metadata
- use hive;
- show tables;
-
- alter table hive.COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
- alter table hive.TABLE_PARAMS modify column PARAM_VALUE varchar(256) character set utf8;
- alter table hive.PARTITION_PARAMS modify column PARAM_VALUE varchar(256) character set utf8;
- alter table hive.PARTITION_KEYS modify column PKEY_COMMENT varchar(256) character set utf8;
- alter table hive.INDEX_PARAMS modify column PARAM_VALUE varchar(256) character set utf8;
After these changes, existing tables still show garbled comments; drop and recreate the table to resolve the problem.
- --drop the table
- drop table t_student;
After a table is created in Hive, a corresponding directory is created on HDFS, named after the table;
the parent path of that directory is controlled by the parameter hive.metastore.warehouse.dir, whose default value is /user/hive/warehouse;
wherever the path is, Hive can only map and parse the data once the data files are placed inside the table's directory;
the most primitive, brute-force approach is hadoop fs -put|-mv to move the data into the table directory directly;
however, Hive officially recommends the Load command to load data into tables.
The English word load means to load or to mount;
here, loading means moving the data file to the location corresponding to the Hive table; the move is a pure copy or move operation.
Pure copy/move means that when data is loaded into the table, Hive performs no transformation or processing of the data content whatsoever.
- LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename;
- Where is LOCAL?
- The local file system here means the Linux file system of the machine running the HiveServer2 service, not the local file system of the Hive client.
With LOCAL, the file path is looked up in the local file system.
A relative path is interpreted relative to the user's current working directory;
a full URI may also be given for a local file, e.g. file:///user/hive/project/data1
Without the LOCAL keyword:
if filepath is a full URI, that URI is used directly;
if no scheme is given, Hive uses the scheme from the Hadoop configuration parameter fs.default.name (normally HDFS).
- use lwztest;
- ------Hive SQL DML: LOAD data------
- --step1: create the tables
- --student_local, used to demonstrate loading from the local file system
- create table student_local(
- num int,
- name string,
- sex string,
- age int,
- dept string
- )
- row format delimited
- fields terminated by ',';
-
- --student_HDFS, used to demonstrate loading from HDFS
- create table student_HDFS(
- num int,
- name string,
- sex string,
- age int,
- dept string
- )
- row format delimited
- fields terminated by ',';
-
- --the beeline client is recommended: it shows the load progress logs
- --step2: load the data
- --from local: the data lives on the HS2 (node1) local file system; essentially a hadoop fs -put upload
- LOAD DATA LOCAL INPATH '/export/students.txt' INTO TABLE lwztest.student_local;
- --from HDFS: the data lives in the HDFS root directory; essentially a hadoop fs -mv move
- --first upload the data to HDFS: hadoop fs -put /export/students.txt /
- LOAD DATA INPATH '/students.txt' INTO TABLE lwztest.student_HDFS;
-
- students.txt contents
- 1,李勇,男,20,CS
- 2,王敏,女,22,IS
- 3,王一,女,19,CS
- 4,郑明,男,20,MA
1. Loading from local: a pure copy
- #connect the beeline client
- [root@node1 ~]# /export/server/hive-3.1.3/bin/beeline
- SLF4J: Class path contains multiple SLF4J bindings.
- SLF4J: Found binding in [jar:file:/export/server/hive-3.1.3/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
- SLF4J: Found binding in [jar:file:/export/server/hadoop-3.3.6/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
- SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
- SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
- SLF4J: Class path contains multiple SLF4J bindings.
- SLF4J: Found binding in [jar:file:/export/server/hive-3.1.3/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
- SLF4J: Found binding in [jar:file:/export/server/hadoop-3.3.6/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
- SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
- SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
- Beeline version 3.1.3 by Apache Hive
- beeline> ! connect jdbc:hive2://node1:10000
- Connecting to jdbc:hive2://node1:10000
- Enter username for jdbc:hive2://node1:10000: root
- Enter password for jdbc:hive2://node1:10000:
- Connected to: Apache Hive (version 3.1.3)
- Driver: Hive JDBC (version 3.1.3)
- Transaction isolation: TRANSACTION_REPEATABLE_READ
- # Load data from the local file system
- 0: jdbc:hive2://node1:10000> LOAD DATA LOCAL INPATH '/export/students.txt' INTO TABLE lwztest.student_local;
- INFO : Compiling command(queryId=root_20240303230810_d5bd2c94-3e72-490a-a8ea-728ac379d8ed): LOAD DATA LOCAL INPATH '/export/students.txt' INTO TABLE lwztest.student_local
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Semantic Analysis Completed (retrial = false)
- INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
- INFO : Completed compiling command(queryId=root_20240303230810_d5bd2c94-3e72-490a-a8ea-728ac379d8ed); Time taken: 0.079 seconds
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Executing command(queryId=root_20240303230810_d5bd2c94-3e72-490a-a8ea-728ac379d8ed): LOAD DATA LOCAL INPATH '/export/students.txt' INTO TABLE lwztest.student_local
- INFO : Starting task [Stage-0:MOVE] in serial mode
- INFO : Loading data to table lwztest.student_local from file:/export/students.txt
- INFO : Starting task [Stage-1:STATS] in serial mode
- INFO : Completed executing command(queryId=root_20240303230810_d5bd2c94-3e72-490a-a8ea-728ac379d8ed); Time taken: 0.35 seconds
- INFO : OK
- INFO : Concurrency mode is disabled, not creating a lock manager
- No rows affected (0.439 seconds)
- 0: jdbc:hive2://node1:10000> select * from student_local;
- INFO : Compiling command(queryId=root_20240303231007_8a3b8082-6984-4634-9698-a35a3fdf2fff): select * from student_local
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Semantic Analysis Completed (retrial = false)
- INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:student_local.num, type:int, comment:null), FieldSchema(name:student_local.name, type:string, comment:null), FieldSchema(name:student_local.sex, type:string, comment:null), FieldSchema(name:student_local.age, type:int, comment:null ), FieldSchema(name:student_local.dept, type:string, comment:null)], properties:null)
- INFO : Completed compiling command(queryId=root_20240303231007_8a3b8082-6984-4634-9698-a35a3fdf2fff); Time taken: 0.273 seconds
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Executing command(queryId=root_20240303231007_8a3b8082-6984-4634-9698-a35a3fdf2fff): select * from student_local
- INFO : Completed executing command(queryId=root_20240303231007_8a3b8082-6984-4634-9698-a35a3fdf2fff); Time taken: 0.0 seconds
- INFO : OK
- INFO : Concurrency mode is disabled, not creating a lock manager
- +--------------------+---------------------+--------------------+--------------------+---------------------+
- | student_local.num | student_local.name | student_local.sex | student_local.age | student_local.dept |
- +--------------------+---------------------+--------------------+--------------------+---------------------+
- | 1 | 李勇 | 男 | 20 | CS |
- | 2 | 王敏 | 女 | 22 | IS |
- | 3 | 王一 | 女 | 19 | CS |
- | 4 | 郑明 | 男 | 20 | MA |
- +--------------------+---------------------+--------------------+--------------------+---------------------+
- 4 rows selected (0.415 seconds)
- 0: jdbc:hive2://node1:10000>
2. Loading from HDFS: the data is moved directly into the table's directory
- # First query: student_hdfs is empty
- 0: jdbc:hive2://node1:10000> select * from student_hdfs;
- INFO : Compiling command(queryId=root_20240303232201_33548c54-8e0a-404f-a2e8-c2f0e7255ce4): select * from student_hdfs
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Semantic Analysis Completed (retrial = false)
- INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:student_hdfs.num, type:int, comment:null), FieldSchema(name:student_hdfs.name, type:string, comment:null), FieldSchema(name:student_hdfs.sex, type:string, comment:null), FieldSchema(name:student_hdfs.age, type:int, comment:null), FieldSchema(name:student_hdfs.dept, type:string, comment:null)], properties:null)
- INFO : Completed compiling command(queryId=root_20240303232201_33548c54-8e0a-404f-a2e8-c2f0e7255ce4); Time taken: 0.183 seconds
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Executing command(queryId=root_20240303232201_33548c54-8e0a-404f-a2e8-c2f0e7255ce4): select * from student_hdfs
- INFO : Completed executing command(queryId=root_20240303232201_33548c54-8e0a-404f-a2e8-c2f0e7255ce4); Time taken: 0.0 seconds
- INFO : OK
- INFO : Concurrency mode is disabled, not creating a lock manager
- +-------------------+--------------------+-------------------+-------------------+--------------------+
- | student_hdfs.num | student_hdfs.name | student_hdfs.sex | student_hdfs.age | student_hdfs.dept |
- +-------------------+--------------------+-------------------+-------------------+--------------------+
- +-------------------+--------------------+-------------------+-------------------+--------------------+
- No rows selected (0.196 seconds)
- # Load the students.txt data from HDFS
- 0: jdbc:hive2://node1:10000> LOAD DATA INPATH '/students.txt' INTO TABLE lwztest.student_HDFS;
- INFO : Compiling command(queryId=root_20240303232209_47aaa363-d26c-422f-93db-09b3515e4098): LOAD DATA INPATH '/students.txt' INTO TABLE lwztest.student_HDFS
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Semantic Analysis Completed (retrial = false)
- INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
- INFO : Completed compiling command(queryId=root_20240303232209_47aaa363-d26c-422f-93db-09b3515e4098); Time taken: 0.052 seconds
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Executing command(queryId=root_20240303232209_47aaa363-d26c-422f-93db-09b3515e4098): LOAD DATA INPATH '/students.txt' INTO TABLE lwztest.student_HDFS
- INFO : Starting task [Stage-0:MOVE] in serial mode
- INFO : Loading data to table lwztest.student_hdfs from hdfs://node1:8020/students.txt
- INFO : Starting task [Stage-1:STATS] in serial mode
- INFO : Completed executing command(queryId=root_20240303232209_47aaa363-d26c-422f-93db-09b3515e4098); Time taken: 0.284 seconds
- INFO : OK
- INFO : Concurrency mode is disabled, not creating a lock manager
- No rows affected (0.346 seconds)
- #query again: now there is data
- 0: jdbc:hive2://node1:10000> select * from student_hdfs;
- INFO : Compiling command(queryId=root_20240303232218_a3c37424-20fd-4c65-a4c1-e2f3e2f47e6f): select * from student_hdfs
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Semantic Analysis Completed (retrial = false)
- INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:student_hdfs.num, type:int, comment:null), FieldSchema(name:student_hdfs.name, type:string, comment:null), FieldSchema(name:student_hdfs.sex, type:string, comment:null), FieldSchema(name:student_hdfs.age, type:int, comment:null), FieldSchema(name:student_hdfs.dept, type:string, comment:null)], properties:null)
- INFO : Completed compiling command(queryId=root_20240303232218_a3c37424-20fd-4c65-a4c1-e2f3e2f47e6f); Time taken: 0.225 seconds
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Executing command(queryId=root_20240303232218_a3c37424-20fd-4c65-a4c1-e2f3e2f47e6f): select * from student_hdfs
- INFO : Completed executing command(queryId=root_20240303232218_a3c37424-20fd-4c65-a4c1-e2f3e2f47e6f); Time taken: 0.0 seconds
- INFO : OK
- INFO : Concurrency mode is disabled, not creating a lock manager
- +-------------------+--------------------+-------------------+-------------------+--------------------+
- | student_hdfs.num | student_hdfs.name | student_hdfs.sex | student_hdfs.age | student_hdfs.dept |
- +-------------------+--------------------+-------------------+-------------------+--------------------+
- | 1 | 李勇 | 男 | 20 | CS |
- | 2 | 王敏 | 女 | 22 | IS |
- | 3 | 王一 | 女 | 19 | CS |
- | 4 | 郑明 | 男 | 20 | MA |
- +-------------------+--------------------+-------------------+-------------------+--------------------+
- 4 rows selected (0.255 seconds)
- 0: jdbc:hive2://node1:10000>
Insert syntax
Hive's officially recommended way to load data: clean the data into structured files, then load them into the table with the Load syntax. This is more efficient.
You can also use the insert syntax to put data into a table; the most common pattern is inserting the result of a query into another table.
Example:
- #create table t_1
- 0: jdbc:hive2://node1:10000> create table t_1(id int,name string);
- INFO : Compiling command(queryId=root_20240304000029_a2c6ace7-4aca-463c-83a2-b09ef288cd2f): create table t_1(id int,name string)
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Semantic Analysis Completed (retrial = false)
- INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
- INFO : Completed compiling command(queryId=root_20240304000029_a2c6ace7-4aca-463c-83a2-b09ef288cd2f); Time taken: 0.026 seconds
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Executing command(queryId=root_20240304000029_a2c6ace7-4aca-463c-83a2-b09ef288cd2f): create table t_1(id int,name string)
- INFO : Starting task [Stage-0:DDL] in serial mode
- INFO : Completed executing command(queryId=root_20240304000029_a2c6ace7-4aca-463c-83a2-b09ef288cd2f); Time taken: 0.128 seconds
- INFO : OK
- INFO : Concurrency mode is disabled, not creating a lock manager
- No rows affected (0.171 seconds)
- #insert a single row
- 0: jdbc:hive2://node1:10000> insert into table t_1 values(1,"zhangsan");
- INFO : Compiling command(queryId=root_20240304000119_97a0e4ac-45cf-414c-b2d8-78e98532ca72): insert into table t_1 values(1,"zhangsan")
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Semantic Analysis Completed (retrial = false)
- INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:col1, type:int, comment:null), FieldSchema(name:col2, type:string, comment:null)], properties:null)
- INFO : Completed compiling command(queryId=root_20240304000119_97a0e4ac-45cf-414c-b2d8-78e98532ca72); Time taken: 0.571 seconds
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Executing command(queryId=root_20240304000119_97a0e4ac-45cf-414c-b2d8-78e98532ca72): insert into table t_1 values(1,"zhangsan")
- WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
- INFO : Query ID = root_20240304000119_97a0e4ac-45cf-414c-b2d8-78e98532ca72
- INFO : Total jobs = 3
- INFO : Launching Job 1 out of 3
- INFO : Starting task [Stage-1:MAPRED] in serial mode
- INFO : Number of reduce tasks determined at compile time: 1
- INFO : In order to change the average load for a reducer (in bytes):
- INFO : set hive.exec.reducers.bytes.per.reducer=<number>
- INFO : In order to limit the maximum number of reducers:
- INFO : set hive.exec.reducers.max=<number>
- INFO : In order to set a constant number of reducers:
- INFO : set mapreduce.job.reduces=<number>
- INFO : number of splits:1
- INFO : Submitting tokens for job: job_1709476925218_0001
- INFO : Executing with tokens: []
- INFO : The url to track the job: http://node1:8088/proxy/application_1709476925218_0001/
- INFO : Starting Job = job_1709476925218_0001, Tracking URL = http://node1:8088/proxy/application_1709476925218_0001/
- INFO : Kill Command = /export/server/hadoop-3.3.6/bin/mapred job -kill job_1709476925218_0001
- INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
- INFO : 2024-03-04 00:01:50,003 Stage-1 map = 0%, reduce = 0%
- INFO : 2024-03-04 00:02:02,509 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.45 sec
- INFO : 2024-03-04 00:02:08,744 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.49 sec
- INFO : MapReduce Total cumulative CPU time: 6 seconds 490 msec
- INFO : Ended Job = job_1709476925218_0001
- INFO : Starting task [Stage-7:CONDITIONAL] in serial mode
- INFO : Stage-4 is selected by condition resolver.
- INFO : Stage-3 is filtered out by condition resolver.
- INFO : Stage-5 is filtered out by condition resolver.
- INFO : Starting task [Stage-4:MOVE] in serial mode
- INFO : Moving data to directory hdfs://node1:8020/user/hive/warehouse/lwztest.db/t_1/.hive-staging_hive_2024-03-04_00-01-19_178_7643664026654055193-4/-ext-10000 from hdfs://node1:8020/user/hive/warehouse/lwztest.db/t_1/.hive-staging_hive_2024-03-04_00-01-19_178_7643664026654055193-4/-ext-10002
- INFO : Starting task [Stage-0:MOVE] in serial mode
- INFO : Loading data to table lwztest.t_1 from hdfs://node1:8020/user/hive/warehouse/lwztest.db/t_1/.hive-staging_hive_2024-03-04_00-01-19_178_7643664026654055193-4/-ext-10000
- INFO : Starting task [Stage-2:STATS] in serial mode
- INFO : MapReduce Jobs Launched:
- INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.49 sec HDFS Read: 15250 HDFS Write: 241 SUCCESS
- INFO : Total MapReduce CPU Time Spent: 6 seconds 490 msec
- INFO : Completed executing command(queryId=root_20240304000119_97a0e4ac-45cf-414c-b2d8-78e98532ca72); Time taken: 50.958 seconds
- INFO : OK
- INFO : Concurrency mode is disabled, not creating a lock manager
- No rows affected (51.537 seconds) #it took 51.537 seconds to insert one row
- 0: jdbc:hive2://node1:10000> select * from t_1;
- INFO : Compiling command(queryId=root_20240304000349_9b1c1d8d-3008-406a-ae56-86f5c06b37f4): select * from t_1
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Semantic Analysis Completed (retrial = false)
- INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:t_1.id, type:int, comment:null), FieldSchema(name:t_1.name, type:string, comment:null)], properties:null)
- INFO : Completed compiling command(queryId=root_20240304000349_9b1c1d8d-3008-406a-ae56-86f5c06b37f4); Time taken: 0.179 seconds
- INFO : Concurrency mode is disabled, not creating a lock manager
- INFO : Executing command(queryId=root_20240304000349_9b1c1d8d-3008-406a-ae56-86f5c06b37f4): select * from t_1
- INFO : Completed executing command(queryId=root_20240304000349_9b1c1d8d-3008-406a-ae56-86f5c06b37f4); Time taken: 0.001 seconds
- INFO : OK
- INFO : Concurrency mode is disabled, not creating a lock manager
- +---------+-----------+
- | t_1.id | t_1.name |
- +---------+-----------+
- | 1 | zhangsan |
- +---------+-----------+
- 1 row selected (0.2 seconds)
Inserting values directly with insert is slow in Hive and is generally not recommended.
insert+select means: the result returned by the following query is inserted into the specified table.
1. The number of columns returned by the query must match the number of columns in the target table.
2. If the queried data types do not match the target column types, Hive attempts a conversion, but the conversion is not guaranteed to succeed; values that fail to convert become NULL.
INSERT INTO TABLE tablename select_statement1 FROM from_statement;
- ------Hive SQL DML: INSERT data------
- --step1: create a source table student
- drop table if exists student;
- create table student(
- num int,
- name string,
- sex string,
- age int,
- dept string
- )
- row format delimited
- fields terminated by ',';
-
- --step2: load the data
- --from local: the data lives on the HS2 (node1) local file system; essentially a hadoop fs -put upload
- LOAD DATA LOCAL INPATH '/export/students.txt' INTO TABLE lwztest.student;
-
- --step3: create a target table with only two columns
- create table student_from_insert(
- sno int,
- sname string
- );
-
- --use insert+select to insert data into the new table
- insert into table student_from_insert select num,name from student;
-
- select * from student_from_insert;
The SELECT syntax tree
Where the query reads from is determined by table_reference after the FROM keyword. Table and column names are case-insensitive.
select_expr denotes a column returned by the query; there must be at least one select_expr.
ALL is the default and returns everything. DISTINCT removes duplicates. GROUP BY groups rows. ORDER BY sorts, ascending (ASC) by default or descending (DESC). LIMIT caps the number of rows a SELECT statement returns.
These behave the same as in standard SQL, so they are not covered in detail here.
- SELECT [ALL|DISTINCT] select_expr,select_expr,...
- FROM table_reference
- [WHERE where_condition]
- [GROUP BY col_list]
- [ORDER BY col_list]
- [LIMIT [offset,]rows];
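A hedged sketch combining several of these clauses against the student table used elsewhere in this article:

```sql
-- Distinct departments of students aged 20 or older, sorted descending, at most 3 rows
SELECT DISTINCT dept
FROM student
WHERE age >= 20
ORDER BY dept DESC
LIMIT 3;
```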
Aggregation
SQL has many built-in functions for counting and computing; the syntax is SELECT function(column) FROM table.
Aggregate functions include Count, Sum, Max, Min, and Avg.
Their defining property: no matter how many rows of input there are, an aggregation returns exactly one row, and that row is the aggregated result.
AVG(column) | returns the average of a column |
COUNT(column) | returns the number of rows in a column (NULL values excluded) |
COUNT(*) | returns the number of selected rows |
MAX(column) | returns the highest value of a column |
MIN(column) | returns the lowest value of a column |
SUM(column) | returns the sum of a column |
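A hedged sketch over the student table from the earlier sections; however many rows student holds, this returns exactly one row:

```sql
SELECT count(*) AS total,
       avg(age) AS avg_age,
       max(age) AS max_age,
       min(age) AS min_age
FROM student;
```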
HAVING
The HAVING clause was added to SQL because the WHERE keyword cannot be used together with aggregate functions.
HAVING filters the groups after grouping, and aggregate functions may appear in HAVING, because by that point where and group by have already run and the result set is determined.
HAVING vs. WHERE
having filters the data after grouping
where filters the data before grouping
aggregate functions may follow having
aggregate functions may not follow where
Execution order
Within a query, clauses execute in the order from > where > group (including aggregates) > having > order > select:
1. aggregate expressions execute before the having clause;
2. the where clause executes before the aggregate expressions.
- select state,sum(age) as cnts from t_1
- where c_date='2021-01-28'
- group by state
- having cnts>1000
- limit 2;
The join syntax exists to combine data from two or more tables based on relationships between their columns.
In Hive, the two most used and most important joins are inner join and left join.
- join_table:
- table_reference [INNER] JOIN table_factor [join_condition]
- | table_reference {LEFT}[OUTER] JOIN table_factor [join_condition]
-
- join_condition:
- ON expression
table_reference: the table name used in the join query.
table_factor: same as table_reference, a table name used in the join query.
join_condition: the condition the join is based on; to join across more than two tables, use the AND keyword.
INNER may be omitted: INNER JOIN == JOIN
The INNER JOIN relationship is shown in the figure below.
left join
left join, in full Left Outer Join; outer may be omitted, and left outer join is the older spelling.
The essence of left join is the left: the table to the left of the join keyword, the left table for short.
Informally: the join takes all rows of the left table as the baseline and matches the right table against them; every left-table row is returned, matched right-table rows are shown, and unmatched ones come back as NULL.
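A hedged sketch using two hypothetical tables, emp(id, name, dept_id) and dept(id, dept_name):

```sql
-- inner join: only employees whose dept_id matches a department row
select e.name, d.dept_name
from emp e inner join dept d on e.dept_id = d.id;

-- left join: every employee is returned; dept_name is NULL where no match exists
select e.name, d.dept_name
from emp e left join dept d on e.dept_id = d.id;
```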
Hive ships with many built-in functions to satisfy different user needs and make SQL faster to write:
1. use show functions to list all currently available functions;
2. use describe function extended funcname to see how a function is used.
- --list all currently available functions
- show functions;
-
- --see how a function is used
- describe function extended count;
Hive functions fall into two broad categories: built-in functions and user-defined functions (UDFs):
1. built-in functions include numeric functions, date functions, string functions, collection functions, conditional functions, and more;
2. user-defined functions are divided into three kinds by their input/output row counts: UDF, UDAF, and UDTF.
Classification standard for user-defined functions
By the number of input and output rows:
UDF (User-Defined Functions): ordinary functions, one row in, one row out
UDAF (User-Defined Aggregation Functions): aggregate functions, many rows in, one row out
UDTF (User-Defined Table-Generating Functions): table-generating functions, one row in, many rows out
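The same one-in/one-out, many-in/one-out, one-in/many-out pattern also describes Hive's built-in functions; a hedged sketch:

```sql
-- UDF-style: one row in, one row out
select round(3.7);
-- UDAF-style: many rows in, one row out (student table from earlier)
select sum(age) from student;
-- UDTF-style: one row in, many rows out (one output row per array element)
select explode(array('a','b','c'));
```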
Built-in functions
Built-in functions are implemented by Hive and ready to use out of the box.
By application area, built-in functions fall into eight broad types: numeric functions, date functions, string functions, collection functions, conditional functions, and so on.
- ------String Functions------
- select length("lwztest");
- select reverse("lwztest");
- --concatenation
- select concat("lwz","test");
- --concatenation with a separator: concat_ws(separator,[string | array(string)]+)
- select concat_ws('.','www',array('com','lwz','cn'));
- --substring: substr(str,pos[,len]) or substring(str,pos[,len])
- select substr("lwztest",-2); --pos is a 1-based index; a negative pos counts from the end
- select substr("lwztest",2);
- --string splitting: split(str,regex)
- --split divides a string and returns an array; elements are accessed by index, starting at 0
- select split('apache hive',' ');
- select split('apache hive',' ')[0];
- select split('apache hive',' ')[1];
-
- ------Date Functions------
- --current date: current_date
- select current_date();
- --current UNIX timestamp: unix_timestamp
- select unix_timestamp();
- --date string to UNIX timestamp: unix_timestamp
- select unix_timestamp("2011-12-07 13:01:03");
- --date string in a given format to UNIX timestamp: unix_timestamp
- select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss');
- --UNIX timestamp to date: from_unixtime
- select from_unixtime(1618238391);
- select from_unixtime(0,'yyyy-MM-dd HH:mm:ss');
- --date difference: datediff; dates must look like 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'
- select datediff('2012-12-08','2012-05-09');
- --date addition: date_add
- select date_add('2012-02-28',10);
- --date subtraction: date_sub
- select date_sub('2012-01-01',10);
-
- ------Mathematical Functions------
- --rounding: round returns the integer part of a double (round half up)
- select round(3.1415926);
- --rounding to a given precision: round(double a, int d) returns a double with precision d
- select round(3.1415926,4);
- --random number: rand returns a different random number in the range 0 to 1 on each call
- select rand();
- --seeded random number: rand(int seed) yields a stable random sequence
- select rand(3);
-
- ------Conditional Functions------
- --uses the student table created in the earlier section
- select * from student limit 3;
- --if: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
- select if(1=2,100,200);
- select if(sex ='男','M','W') from student limit 3;
- --null-replacement: nvl(T value, T default_value)
- select nvl("allen","lwztest");
- select nvl(null,"lwztest");
- --conditional conversion: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
- select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end;
- select case sex when '男' then 'male' else 'female' end from student limit 3;
Remember: your present results were decided by your past, and the effort you put in now pays off in the future.
Only continuous learning brings continuous improvement. Keep honing your skills; learning accompanies us for life.
Born as small as an ant, aspire like a swan; with a life as thin as paper, keep an unyielding heart.
While the outcome is unsettled, you and I may both be dark horses; and were it settled, who says I cannot turn it around?
Work hard: opportunity favors the prepared. Otherwise, when opportunity arrives and ability is lacking, you can only watch it slip away.