Author: Han Xinzi @ShowMeAI
Tutorial: http://www.showmeai.tech/tutorials/84
This article: http://www.showmeai.tech/article-detail/171
Notice: all rights reserved; to reprint, please contact the platform and the author and cite the source.
Hive is one of the most important tool platforms in the big data ecosystem: it is a key component of offline computing and is commonly used for data warehouse construction, where companies run SQL over big data to produce statistics and reports. This article walks through how to set up, configure, and use Hive.
The installation process follows the official guide: https://cwiki.apache.org/confluence/display/Hive/GettingStarted.
As the documentation suggests, download the latest release from http://www.apache.org/dyn/closer.cgi/hive/; this article uses Hive 3.1.3 as the example.
Unpack the archive on a machine that already has a Hadoop environment:
root@ubuntu:~/bigdata# ll
total 20
drwxr-xr-x 27 root root 4096 Sep 30 07:24 azkaban/
drwxr-xr-x 2 root root 4096 Oct 13 08:36 demo/
drwxr-xr-x 12 1001 1001 4096 Sep 30 09:43 hadoop-3.3.0/
drwxr-xr-x 11 root root 4096 Oct 13 07:58 hive-3.1.3/
drwxr-xr-x 32 work work 4096 Aug 28 07:28 spark-3.0.1/
root@ubuntu:~/bigdata# pwd
/root/bigdata
Next, add Hive's bin directory to the PATH:
root@ubuntu:~/bigdata# cat /etc/profile.d/java.sh
export PATH=/usr/local/jdk/bin:/root/bigdata/hadoop-3.3.0/bin:/root/bigdata/spark-3.0.1/bin:/root/bigdata/hive-3.1.3/bin:${PATH}
Upgrade the guava dependency to the version that ships with Hadoop:
mv lib/guava-19.0.jar lib/guava-19.0.jar.bk
ln -s /root/bigdata/hadoop-3.3.0/share/hadoop/hdfs/lib/guava-27.0-jre.jar /root/bigdata/hive-3.1.3/lib/guava-27.0-jre.jar
The first command backs up the guava jar bundled with Hive, and the second symlinks in the newer version that ships with Hadoop. Keeping this version consistent is one of the keys to getting Hive to run at all.
Hive's metastore service is deployed on its own and keeps its data in MySQL. On Ubuntu, Oracle's MySQL can be installed with apt:
apt-get install mysql-server
If logging in with mysql -h localhost -u root -p is rejected with access denied, locate and delete the special localhost rule in MySQL's user table:

delete from user where User='root' and Host='localhost';
FLUSH PRIVILEGES;
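Note, as an alternative not in the original walkthrough: on MySQL 5.7+/8.x the root@localhost account often uses the auth_socket plugin rather than a password, so switching it to password authentication also resolves the error without deleting the row (the password below is a placeholder):

ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'xxxxx';
FLUSH PRIVILEGES;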
Next, create the hive database:
create database hive;
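Optionally (not part of the original steps), a dedicated MySQL account can be created for the metastore instead of reusing root; the user name and password below are placeholders, and hive-site.xml's ConnectionUserName/ConnectionPassword would then need to match:

CREATE USER 'hive'@'%' IDENTIFIED BY 'xxxxx';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';
FLUSH PRIVILEGES;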
Then download the MySQL JDBC driver with wget and put the jar into Hive's lib directory (the Hive metastore service uses this JDBC driver to read and write metadata in MySQL):
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.21/mysql-connector-java-8.0.21.jar
mv mysql-connector-java-8.0.21.jar lib
Hive automatically loads the conf/hive-site.xml configuration file. The distribution ships a template, conf/hive-default.xml.template, holding the default values Hive falls back to when hive-site.xml cannot be found; use it as a reference when filling in hive-site.xml. Below is a configured sample (only the required items are set):
root@ubuntu:~/bigdata/hive-3.1.3# cat conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
    <description>Whether to print the names of the columns in query output.</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>xxxxx</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
</configuration>
The key settings are:

hive.cli.print.header: have the Hive CLI print table headers, which makes query results easier to read
javax.jdo.option.ConnectionURL: the JDBC URL of the database, here the hive database in MySQL
javax.jdo.option.ConnectionDriverName: the JDBC driver class name; note that it changed in the 8.x MySQL driver
javax.jdo.option.ConnectionUserName: the MySQL user name
javax.jdo.option.ConnectionPassword: the MySQL password
hive.metastore.uris: the address the metastore service listens on

First run Hive's schema tool to create the metadata tables in MySQL:
bin/schematool -dbType mysql -initSchema
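If the initialization succeeded, the metadata tables are visible from the MySQL side (a quick sanity check, not in the original text):

mysql> use hive;
mysql> show tables;    -- expect metastore tables such as DBS, TBLS and VERSION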
Then start the metastore service:
nohup hive --service metastore &
The service listens on localhost:9083. In production the host must be reachable from other servers, because the clients of the metastore service are not necessarily on the same machine.
From now on, running the hive command will automatically connect to the metastore service according to hive-site.xml. Run it as a test:
root@ubuntu:~/bigdata/hive-3.1.3# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/bigdata/hive-3.1.3/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/bigdata/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = f0d4bf60-d85f-456a-98fb-e904d50f5242

Logging initialized using configuration in jar:file:/root/bigdata/hive-3.1.3/lib/hive-common-3.1.3.jar!/hive-log4j2.properties Async: true
Hive Session ID = 959e0cda-f8eb-4fc1-b798-cb5175e735d2
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
database_name
default
Time taken: 0.516 seconds, Fetched: 2 row(s)
After hive starts, the built-in default database is listed. The CLI also warns that Hive is still using MR as its execution engine: Hive recommends moving SQL workloads to Spark or Tez, and MR support will eventually be removed entirely.
For now it is fine to keep using MR. Hive finds the hadoop and yarn configuration automatically through the hadoop command, and the SQL ultimately runs as MR jobs on yarn.
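The engine can be inspected or switched per session through the hive.execution.engine property (shown here as a sketch; spark and tez require extra setup not covered in this article):

hive> set hive.execution.engine;       -- print the current engine, mr by default
hive> set hive.execution.engine=tez;   -- switch this session to tez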
That is the complete Hive setup; you can now happily start using Hive.
The case study below analyzes data from a video website, computing a variety of metrics to support management decisions.
We compute the regular metrics and various TopN rankings for YouTube video data:
Video table
Field | Meaning | Description |
---|---|---|
video id | unique video id | an 11-character string |
uploader | video uploader | user name of the uploading user, string |
age | video age | integer number of days the video has been on the platform |
category | video category | the categories assigned when the video was uploaded |
length | video length | video length as an integer |
views | view count | number of times the video has been watched |
rate | video rating | out of 5 points |
ratings | traffic | video traffic, integer |
comments | comment count | integer number of comments on the video |
related ids | related video ids | ids of related videos, at most 20 |
User table
Field | Meaning | Type |
---|---|---|
uploader | uploader user name | string |
videos | number of uploaded videos | int |
friends | number of friends | int |
ETL the raw data
Looking at the raw data, a video can belong to several categories, separated by an & with whitespace on both sides, and it can likewise have several related videos, separated by \t. To make fields with multiple sub-elements easy to work with during analysis, the data is first cleaned and reshaped: categories stay separated by & with the surrounding whitespace stripped, and the multiple related video ids are joined with & as well.
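The tutorial performs this cleaning in an upstream ETL job whose output the load statements below point at, but the two reshaping rules are easy to illustrate with self-contained HiveQL expressions (all values made up):

-- categories: strip the whitespace around the & separators
SELECT regexp_replace('People & Blogs', '\\s*&\\s*', '&');   -- People&Blogs
-- related video ids: join multiple ids with &
SELECT concat_ws('&', array('id1', 'id2', 'id3'));           -- id1&id2&id3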
There are three core tasks:
create the original tables youtubevideo_ori and youtubevideo_user_ori
create the ORC tables youtubevideo_orc and youtubevideo_user_orc
load the cleaned data into the original tables, then copy it into the ORC tables
--create the youtubevideo_ori table:
create table youtubevideo_ori(
    videoId string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile;

--create the youtubevideo_user_ori table:
create table youtubevideo_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited fields terminated by "\t"
stored as textfile;

--create the youtubevideo_orc table:
create table youtubevideo_orc(
    videoId string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as orc;

--create the youtubevideo_user_orc table:
create table youtubevideo_user_orc(
    uploader string,
    videos int,
    friends int)
row format delimited fields terminated by "\t"
stored as orc;
Load data into youtubevideo_ori:
load data inpath "/output/youtube_video" into table youtubevideo_ori;
Load data into youtubevideo_user_ori:
load data inpath "/youtube_video/user" into table youtubevideo_user_ori;
Populate youtubevideo_orc:
insert overwrite table youtubevideo_orc select * from youtubevideo_ori;
Populate youtubevideo_user_orc:
insert into table youtubevideo_user_orc select * from youtubevideo_user_ori;
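A quick way to confirm the loads (not in the original text) is to count the rows that landed in the ORC tables:

select count(*) from youtubevideo_orc;
select count(*) from youtubevideo_user_orc;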
Task: the Top 10 most-viewed videos. Idea: a global sort on the views field with order by, limited to the first 10 rows.
SELECT videoid, uploader, age, category, length, views, rate, ratings, comments
FROM youtubevideo_orc
ORDER BY views DESC
LIMIT 10;

-- approach 2: a window function
SELECT *
FROM (SELECT videoid, age, category, length, views,
             Row_number() OVER(ORDER BY views DESC) AS rn
      FROM youtubevideo_orc) t
WHERE t.rn <= 10;
Task: the 10 hottest video categories. Idea: count how many videos each category contains, then show the 10 categories containing the most videos.
SELECT category_name AS category,
Count(t1.videoid) AS hot
FROM (
SELECT videoid,
category_name
FROM youtubevideo_orc lateral view explode(category) t_catetory as category_name) t1
GROUP BY t1.category_name
ORDER BY hot DESC limit 10;
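For readers new to it, lateral view explode turns each element of an array column into a row of its own, duplicating the other columns; e.g. a row (videoid='v1', category=['People','Blogs']) becomes the two rows ('v1','People') and ('v1','Blogs'). A small query against the table above shows the effect:

SELECT videoid, category_name
FROM youtubevideo_orc
LATERAL VIEW explode(category) t_category AS category_name
LIMIT 5;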
Task: the categories of the Top 20 most-viewed videos. Idea: take the Top 20 videos by views first, then explode their categories and count videos per category:
SELECT category_name AS category,
Count(t2.videoid) AS hot_with_views
FROM (
SELECT videoid,
category_name
FROM (
SELECT *
FROM youtubevideo_orc
ORDER BY views DESC limit 20) t1 lateral VIEW explode(category) t_catetory as category_name) t2
GROUP BY category_name
ORDER BY hot_with_views DESC;
Task: the Top 10 most-viewed videos in a given category (Music). Idea: flatten the category array into a dedicated category table first, then filter by category and sort by views:
--create the category table:
CREATE TABLE youtubevideo_category (
    videoid STRING,
    uploader STRING,
    age INT,
    categoryid STRING,
    length INT,
    views INT,
    rate FLOAT,
    ratings INT,
    comments INT,
    relatedid ARRAY<string>)
row format delimited fields terminated BY "\t"
collection items terminated BY "&"
stored AS orc;

--populate the category table:
INSERT INTO table youtubevideo_category
SELECT videoid, uploader, age, categoryid, length, views, rate, ratings, comments, relatedid
FROM youtubevideo_orc lateral view explode(category) catetory AS categoryid;

--Top 10 of the Music category (any other category works the same way)
SELECT videoid, views
FROM youtubevideo_category
WHERE categoryid = "Music"
ORDER BY views DESC
LIMIT 10;

-- approach 2: a window function (note the comparison is case-sensitive, so "Music", not "music")
SELECT *
FROM (SELECT videoid, age, categoryid, length, views,
             Row_number() OVER(ORDER BY views DESC) AS rn
      FROM youtubevideo_category
      WHERE categoryid = "Music") t
WHERE t.rn <= 10;
Task: the Top 10 videos by ratings within every category. Idea: on the category table, rank with Row_number() partitioned by categoryid and ordered by ratings, keeping ranks up to 10:
SELECT *
FROM (SELECT videoid,
age,
categoryid,
length,
ratings,
Row_number()
OVER(
partition BY categoryid
ORDER BY ratings DESC) AS rn
FROM youtubevideo_category)t
WHERE t.rn <= 10;
Task: for the 10 uploaders with the most uploaded videos, the Top 20 of their videos by views. Idea: rank uploaders by video count first, then join back to the video table:
--step 1: the 10 uploaders with the most uploaded videos
SELECT *
FROM youtubevideo_user_orc
ORDER BY videos DESC
LIMIT 10;

--step 2: join their videos and rank by views
SELECT t2.videoid,
       t2.uploader,
       t2.views,
       t2.ratings,
       t1.videos,
       t1.friends
FROM (SELECT *
      FROM youtubevideo_user_orc
      ORDER BY videos DESC
      LIMIT 10) t1
JOIN youtubevideo_orc t2
  ON t1.uploader = t2.uploader
ORDER BY views DESC
LIMIT 20;