
Hive Big Data Project Practice

With the Hadoop and Hive environments in place, we can start doing database work with Hive. Hive provides HQL, a SQL-like language, and the basic workflow is similar to MySQL; the main difference is that aggregation queries in Hive are executed as MapReduce jobs on the underlying Hadoop cluster.

Below, taking a game company's analytics needs (games, users, and so on) as an example, we use Hive to compute statistics such as daily game activity and user usage.

(1) Generating the data

Since real data from a game company is hard to obtain, we generate it ourselves with Python scripts. Hadoop and Hive are installed on CentOS, which ships with Python 2.7, so the scripts can be written and run directly on CentOS.

User data: simulate 1000 users.

import random

def getUser():
    location_list = ['BJ', 'SH', 'TJ', 'GZ', 'SZ']
    # file name matches the 'load data local inpath' statement used later
    fd = open('userinfo.txt', 'w+')
    for i in range(1000):
        userid = str(1000 + i)
        age = str(random.randrange(10, 40))
        area = random.choice(location_list)
        user_money = str(i)
        str_tmp = userid + ',' + age + ',' + area + ',' + user_money + '\n'
        fd.write(str_tmp)
    fd.close()

if __name__ == '__main__':
    getUser()

Game metadata: simulate 4 games:

def getGame():
    # game names match the query results shown later in the article
    game_list = ['LandLord', 'Buffle', 'Farm', 'PuQ']
    fd = open('gameinfo.txt', 'w+')
    for i in range(4):
        gameid = str(i)
        gamename = game_list[i]
        str_tmp = gameid + ',' + gamename + '\n'
        fd.write(str_tmp)
    fd.close()

if __name__ == '__main__':
    getGame()

Play-time data: simulate 10 days of per-user play-time records:

import os
import random
import datetime

def gameTime():
    game_list = [0, 1, 2, 3]
    time_list = [10, 15, 20, 20, 50, 60, 90]
    for j in range(10):
        fdate = (datetime.datetime.now() + datetime.timedelta(days=j)).strftime('%Y-%m-%d')
        os.makedirs('gametime/{}'.format(fdate))  # one directory per day
        fd = open('gametime/{}/gametime_{}.txt'.format(fdate, fdate), 'w+')
        for i in range(1000):
            userid = str(1000 + i)
            gameid = str(random.choice(game_list))
            gametime = str(random.choice(time_list))
            str_tmp = fdate + ',' + userid + ',' + gameid + ',' + gametime + '\n'
            fd.write(str_tmp)
        fd.close()

if __name__ == '__main__':
    gameTime()

Spending data: simulate each user's daily spend over the same 10 days:

import os
import random
import datetime

def userFee():
    game_list = [0, 1, 2, 3]
    money_list = [10, 15, 30, 27, 55, 66, 90]
    for j in range(10):
        fdate = (datetime.datetime.now() + datetime.timedelta(days=j)).strftime('%Y-%m-%d')
        os.makedirs('userfee/{}'.format(fdate))  # one directory per day
        fd = open('userfee/{}/userfee_{}.txt'.format(fdate, fdate), 'w+')
        for i in range(1000):
            userid = str(1000 + i)
            gameid = str(random.choice(game_list))
            fee = str(random.choice(money_list))
            str_tmp = fdate + ',' + userid + ',' + gameid + ',' + fee + '\n'
            fd.write(str_tmp)
        fd.close()

if __name__ == '__main__':
    userFee()
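The four functions can be saved as standalone scripts and run with the system Python 2.7 (the file names below are only illustrative, not from the original article); each one writes plain comma-separated text files, which is exactly the format the Hive tables defined below expect:

[hadoop@master ~]$ python gen_user.py        # writes userinfo.txt
[hadoop@master ~]$ python gen_game.py        # writes gameinfo.txt
[hadoop@master ~]$ python gen_gametime.py    # writes gametime/<date>/gametime_<date>.txt
[hadoop@master ~]$ python gen_userfee.py     # writes userfee/<date>/userfee_<date>.txt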

(2) Storing the data in Hive

The data generated above lives in plain files; next we move it into Hadoop. From there it is used through Hive, by defining tables over it and loading or inserting data.

First create the following HDFS directories, which will serve as the Hive table locations.

[hadoop@master ~]$ hdfs dfs -mkdir /stat
[hadoop@master ~]$ hdfs dfs -mkdir /stat/data/
[hadoop@master ~]$ hdfs dfs -mkdir /stat/data/gameinfo
[hadoop@master ~]$ hdfs dfs -mkdir /stat/data/gametime
[hadoop@master ~]$ hdfs dfs -mkdir /stat/data/userinfo
[hadoop@master ~]$ hdfs dfs -mkdir /stat/data/userfee

Then enter the hive shell and create a stat database to hold the generated data, plus an analysis database to hold the results of the Hive analysis.

hive> create database stat;
hive> create database analysis;

Next, create the four tables gameinfo, gametime, userinfo, and userfee:

hive> use stat;
hive> create table if not exists gameinfo (fgameid int, fgamename string)
    > row format delimited fields terminated by ','
    > location '/stat/data/gameinfo';
hive> create table if not exists userinfo (fuserid int, fage int, farea string, fmoney int)
    > row format delimited fields terminated by ','
    > location '/stat/data/userinfo';
hive> create table if not exists gametime (fdate string, fuserid int, fgameid int, fgametime int)
    > partitioned by (dt string)
    > row format delimited fields terminated by ','
    > location '/stat/data/gametime';
hive> create table if not exists userfee (fdate string, fuserid int, fgameid int, ffee int)
    > partitioned by (dt string)
    > row format delimited fields terminated by ','
    > location '/stat/data/userfee';

Note that the columns must line up with the fields generated in step (1) so that the data loads correctly, and that the f-prefixed names (fdate, fuserid, fgameid, fgamename, and so on) are the ones the analysis queries below refer to; a plain date column name would also clash with a Hive reserved word. gametime and userfee are partitioned tables: each day's data is stored separately, so the dt partition key is declared when the tables are created. Since each table is given an explicit location, data just needs to land in (or be loaded into) that HDFS directory.

With the tables and columns defined, the data can be loaded.

Load userinfo and gameinfo into Hive as follows:

hive> load data local inpath 'userinfo.txt' into table stat.userinfo;
hive> load data local inpath 'gameinfo.txt' into table stat.gameinfo;

For the two partitioned tables, partitions are registered with alter table ... add:

hive> alter table stat.gametime add if not exists partition (dt = '2020-02-09')
    > location '/stat/data/gametime/2020-02-09/';
hive> alter table stat.gametime add if not exists partition (dt = '2020-02-10')
    > location '/stat/data/gametime/2020-02-10/';
hive> alter table stat.gametime add if not exists partition (dt = '2020-02-11')
    > location '/stat/data/gametime/2020-02-11/';
hive> alter table stat.gametime add if not exists partition (dt = '2020-02-12')
    > location '/stat/data/gametime/2020-02-12/';
hive> alter table stat.gametime add if not exists partition (dt = '2020-02-13')
    > location '/stat/data/gametime/2020-02-13/';
hive> alter table stat.gametime add if not exists partition (dt = '2020-02-14')
    > location '/stat/data/gametime/2020-02-14/';
hive> alter table stat.gametime add if not exists partition (dt = '2020-02-15')
    > location '/stat/data/gametime/2020-02-15/';
hive> alter table stat.gametime add if not exists partition (dt = '2020-02-16')
    > location '/stat/data/gametime/2020-02-16/';
hive> alter table stat.gametime add if not exists partition (dt = '2020-02-17')
    > location '/stat/data/gametime/2020-02-17/';
hive> alter table stat.userfee add if not exists partition (dt = '2020-02-09')
    > location '/stat/data/userfee/2020-02-09/';
hive> alter table stat.userfee add if not exists partition (dt = '2020-02-10')
    > location '/stat/data/userfee/2020-02-10/';
hive> alter table stat.userfee add if not exists partition (dt = '2020-02-11')
    > location '/stat/data/userfee/2020-02-11/';
hive> alter table stat.userfee add if not exists partition (dt = '2020-02-12')
    > location '/stat/data/userfee/2020-02-12/';
hive> alter table stat.userfee add if not exists partition (dt = '2020-02-13')
    > location '/stat/data/userfee/2020-02-13/';
hive> alter table stat.userfee add if not exists partition (dt = '2020-02-14')
    > location '/stat/data/userfee/2020-02-14/';
hive> alter table stat.userfee add if not exists partition (dt = '2020-02-15')
    > location '/stat/data/userfee/2020-02-15/';
hive> alter table stat.userfee add if not exists partition (dt = '2020-02-16')
    > location '/stat/data/userfee/2020-02-16/';
hive> alter table stat.userfee add if not exists partition (dt = '2020-02-17')
    > location '/stat/data/userfee/2020-02-17/';

This puts the generated play-time and spending data into HDFS. As you can see, driving the two partitioned tables from the hive shell by hand is clumsy: the statements differ only in the date. It would be much more efficient to pass the date in as a variable from a script. That requires starting the hiveserver2 process and going through JDBC or beeline; in particular, managing the Hive database from scripts via the Java API or Python remote access is far more efficient.
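If hiveserver2 is running (assumed here on master:10000), both the upload of each day's file and the partition registration can be scripted instead of typed by hand. The article does not show the step that copies the daily gametime/userfee files into their HDFS partition directories, but it has to happen at some point, so the sketch below includes it. This is only an illustration of the scripted approach, not the exact procedure used in the article:

# -*- coding: utf-8 -*-
# Sketch: upload each day's file and register the matching partition through
# HiveServer2, instead of running every ALTER TABLE by hand in the hive shell.
# Assumes HiveServer2 on master:10000 and the local file layout produced by the
# generation scripts above (gametime/<date>/gametime_<date>.txt, likewise userfee).
import subprocess

dates = ['2020-02-%02d' % d for d in range(9, 18)]   # 2020-02-09 .. 2020-02-17
jdbc_url = 'jdbc:hive2://master:10000'

for table in ['gametime', 'userfee']:
    for dt in dates:
        hdfs_dir = '/stat/data/{0}/{1}'.format(table, dt)
        # copy the day's local file into the partition directory on HDFS
        subprocess.check_call(['hdfs', 'dfs', '-mkdir', '-p', hdfs_dir])
        subprocess.check_call(['hdfs', 'dfs', '-put', '-f',
                               '{0}/{1}/{0}_{1}.txt'.format(table, dt), hdfs_dir])
        # register the partition through HiveServer2 via beeline
        hql = ("alter table stat.{0} add if not exists partition (dt='{1}') "
               "location '{2}';".format(table, dt, hdfs_dir))
        subprocess.check_call(['beeline', '-u', jdbc_url, '-n', 'hadoop', '-e', hql])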

At this point the HDFS layout (shown as a screenshot in the original post) has one directory per table under /stat/data, with one dated subdirectory per partition.

Back in the hive shell, check the loaded data:

hive> select * from stat.gameinfo;
OK
0 LandLord
1 Buffle
2 Farm
3 PuQ
Time taken: 0.212 seconds, Fetched: 4 row(s)
hive> select * from stat.gametime where dt='2020-02-09' limit 1;
OK
2020-02-09 1000 1 60 2020-02-09
Time taken: 0.498 seconds, Fetched: 1 row(s)

(3) Statistical analysis with Hive

The four business tables above need to be analyzed to extract useful patterns. For example: joining the game table with the play-time table gives each game's daily activity (how many users played each game each day, which can feed recommendations of the most popular games); joining the user table with the spending table gives spend broken down by age, i.e. which age groups the games appeal to most; joining the game table with the spending table gives each game's revenue; and so on. HQL entered in the hive shell for this kind of work (aggregations, join conditions, and the like) is automatically handed to the underlying MapReduce engine for execution. The results go into the second database, analysis.
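The insert overwrite statements below write into analysis.gameactive and analysis.gameuserage, but the article never shows how those target tables are created. A minimal sketch of DDL that matches the column lists and partition key used in the queries that follow (the author's actual definitions may differ):

hive> use analysis;
hive> create table if not exists gameactive (fdate string, fgamename string, fcount bigint)
    > partitioned by (dt string)
    > row format delimited fields terminated by ',';
hive> create table if not exists gameuserage (fdate string, fgametime bigint, fage int)
    > partitioned by (dt string)
    > row format delimited fields terminated by ',';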

First, the HQL that computes daily game activity, written here as a string so it can later be used from a script:

hql = '''
insert overwrite table analysis.gameactive partition (dt='2020-02-17')
select gt.fdate as fdate, gi.fgamename as fgamename, count(gt.fuserid) as fcount
from stat.gametime gt, stat.gameinfo gi
where gt.fgameid = gi.fgameid and fdate = '2020-02-17'
group by gt.fdate, gi.fgamename
'''

When more tables are involved, the join can also be written out explicitly:

insert overwrite table analysis.gameactive partition (dt='2020-02-17')
select gt.fdate as fdate, gi.fgamename as fgamename, count(gt.fuserid) as fcount
from stat.gametime gt
left join stat.gameinfo gi
on gt.fgameid = gi.fgameid
where fdate = '2020-02-17'
group by gt.fdate, gi.fgamename;

This version uses a left join: left join ... on ..., where the on clause gives the join condition, followed by a where clause with the filter condition. When using group by, remember that every non-aggregated column in the select list must also appear in the group by clause.

The same statement can be entered directly at the hive shell prompt:

hive> insert overwrite table analysis.gameactive partition (dt='2020-02-17')
    > select gt.fdate as fdate, gi.fgamename as fgamename, count(gt.fuserid) as fcount
    > from stat.gametime gt, stat.gameinfo gi
    > where gt.fgameid = gi.fgameid and fdate = '2020-02-17'
    > group by gt.fdate, gi.fgamename;
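Rather than editing the date by hand for every day, the same statement can be driven from a script, passing the date through beeline's --hivevar substitution. This is a sketch (again assuming hiveserver2 on master:10000), not the article's exact procedure:

# -*- coding: utf-8 -*-
# Sketch: run the daily game-activity aggregation once per generated date.
# beeline substitutes ${hivevar:dt} with the value passed via --hivevar.
import subprocess

hql = """
insert overwrite table analysis.gameactive partition (dt='${hivevar:dt}')
select gt.fdate as fdate, gi.fgamename as fgamename, count(gt.fuserid) as fcount
from stat.gametime gt
left join stat.gameinfo gi on gt.fgameid = gi.fgameid
where gt.fdate = '${hivevar:dt}'
group by gt.fdate, gi.fgamename;
"""

for d in range(9, 19):                               # the ten generated days
    dt = '2020-02-%02d' % d
    subprocess.check_call(['beeline', '-u', 'jdbc:hive2://master:10000', '-n', 'hadoop',
                           '--hivevar', 'dt=' + dt, '-e', hql])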

After the statement has been run for each of the 10 days of generated data (changing the date by hand each time, or with a loop like the one above), the results can be queried:

hive> select * from analysis.gameactive;
OK
2020-02-09 Buffle 268 2020-02-09
2020-02-09 Farm 221 2020-02-09
2020-02-09 LandLord 256 2020-02-09
2020-02-09 PuQ 255 2020-02-09
2020-02-10 Buffle 223 2020-02-10
2020-02-10 Farm 258 2020-02-10
2020-02-10 LandLord 255 2020-02-10
2020-02-10 PuQ 264 2020-02-10
2020-02-11 Buffle 235 2020-02-11
2020-02-11 Farm 259 2020-02-11
2020-02-11 LandLord 246 2020-02-11
2020-02-11 PuQ 260 2020-02-11
2020-02-12 Buffle 253 2020-02-12
2020-02-12 Farm 241 2020-02-12
2020-02-12 LandLord 266 2020-02-12
2020-02-12 PuQ 240 2020-02-12
2020-02-13 Buffle 222 2020-02-13
2020-02-13 Farm 273 2020-02-13
2020-02-13 LandLord 252 2020-02-13
2020-02-13 PuQ 253 2020-02-13
2020-02-14 Buffle 253 2020-02-14
2020-02-14 Farm 257 2020-02-14
2020-02-14 LandLord 245 2020-02-14
2020-02-14 PuQ 245 2020-02-14
2020-02-15 Buffle 251 2020-02-15
2020-02-15 Farm 239 2020-02-15
2020-02-15 LandLord 250 2020-02-15
2020-02-15 PuQ 260 2020-02-15
2020-02-16 Buffle 261 2020-02-16
2020-02-16 Farm 263 2020-02-16
2020-02-16 LandLord 231 2020-02-16
2020-02-16 PuQ 245 2020-02-16
2020-02-17 Buffle 242 2020-02-17
2020-02-17 Farm 223 2020-02-17
2020-02-17 LandLord 261 2020-02-17
2020-02-17 PuQ 274 2020-02-17
2020-02-18 Buffle 271 2020-02-18
2020-02-18 Farm 253 2020-02-18
2020-02-18 LandLord 226 2020-02-18
2020-02-18 PuQ 250 2020-02-18
Time taken: 0.523 seconds, Fetched: 40 row(s)

So for the whole 10-day window we now have, for each day, the number of players of each of the 4 games.

The age-group statistic is computed the same way, with this HQL:

hive> insert overwrite table analysis.gameuserage partition (dt='2020-02-11')
    > select gt.fdate as fdate, sum(gt.fgametime) as fgametime, ui.fage as fage
    > from stat.gametime gt, stat.userinfo ui
    > where gt.fuserid = ui.fuserid and fdate = '2020-02-11'
    > group by ui.fage, gt.fdate;

After running this for each day in turn, the query results look like the following (first column: date; second: total play time; third: user age; since the generated ages range from 10 to 39, the result is the total play time for each of those 30 age groups):

2020-02-18 1695 10 2020-02-18
2020-02-18 1350 11 2020-02-18
2020-02-18 1325 12 2020-02-18
2020-02-18 990 13 2020-02-18
2020-02-18 1355 14 2020-02-18
2020-02-18 1420 15 2020-02-18
2020-02-18 905 16 2020-02-18
2020-02-18 1140 17 2020-02-18
2020-02-18 1580 18 2020-02-18
2020-02-18 1085 19 2020-02-18
2020-02-18 1350 20 2020-02-18
2020-02-18 1525 21 2020-02-18
2020-02-18 1285 22 2020-02-18
2020-02-18 1105 23 2020-02-18
2020-02-18 1035 24 2020-02-18
2020-02-18 1185 25 2020-02-18
2020-02-18 975 26 2020-02-18
2020-02-18 1625 27 2020-02-18
2020-02-18 1370 28 2020-02-18
2020-02-18 1485 29 2020-02-18
2020-02-18 930 30 2020-02-18
2020-02-18 1390 31 2020-02-18
2020-02-18 1250 32 2020-02-18
2020-02-18 1005 33 2020-02-18
2020-02-18 1250 34 2020-02-18
2020-02-18 1280 35 2020-02-18
2020-02-18 970 36 2020-02-18
2020-02-18 1170 37 2020-02-18
2020-02-18 1015 38 2020-02-18
2020-02-18 1015 39 2020-02-18
Time taken: 0.446 seconds, Fetched: 270 row(s)

(4) Using Sqoop

Sqoop is a bridge tool for exchanging data between an RDBMS and HDFS. Installation is fairly simple in practice, although I could never get the Sqoop 1.99 download to pass verification no matter which environment variables and dependencies I set, so I fell back to Sqoop 1.4.

1. Installation and configuration

The Sqoop 1.4 tarball can be downloaded from any of the mirrors and extracted on CentOS; since the extracted directory name is long, rename it:

[hadoop@master ~]$ tar -xvf sqoop-1.4.6-cdh5.14.0.tar.gz
[hadoop@master ~]$ mv sqoop-1.4.6-cdh5.14.0 sqoop1.4.6

Then set the environment variables:

[hadoop@master ~]$ vi ~/.bashrc

Add the SQOOP_HOME path (and run source ~/.bashrc afterwards so it takes effect):

export SQOOP_HOME=/home/hadoop/sqoop1.4.6
export PATH=$PATH:$SQOOP_HOME/bin

Next go into Sqoop's conf directory and set up sqoop-env.sh:

[hadoop@master conf]$ mv sqoop-env-template.sh sqoop-env.sh
[hadoop@master conf]$ vi sqoop-env.sh

In that file, point Sqoop at the Hadoop and Hive installations:

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/home/hadoop/hadoop-3.1.2

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/home/hadoop/hadoop-3.1.2

#set the path to where bin/hbase is available
#export HBASE_HOME=

#Set the path to where bin/hive is available
export HIVE_HOME=/home/hadoop/hive-3.1.2-bin

#Set the path for where zookeper config dir is
#export ZOOCFGDIR=

Since HBase and ZooKeeper are not used this time, their paths are left unset.

That completes the environment configuration. Two more jars need to be copied into Sqoop's lib directory: the MySQL JDBC driver and a Java JSON library, both downloadable from the usual sources:

[hadoop@master lib]$ ll java-json.jar
-rw-r--r--. 1 hadoop hadoop 84697 Feb 13 14:35 java-json.jar
[hadoop@master lib]$ ll mysql-connector-java-8.0.16.jar
-rw-r--r--. 1 hadoop hadoop 2293144 Feb 13 12:05 mysql-connector-java-8.0.16.jar

2. Getting started

Sqoop's syntax looks verbose, but it is largely formulaic; follow the pattern and it runs fine.

First, test the installation by running sqoop help from the home directory:

[hadoop@master ~]$ sqoop help
Warning: /home/hadoop/sqoop1.4.6/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/hadoop/sqoop1.4.6/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/sqoop1.4.6/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/hadoop/sqoop1.4.6/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
2020-02-13 14:56:14,436 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.14.0
usage: sqoop COMMAND [ARGS]
Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
See 'sqoop help COMMAND' for information on a specific command.

As the help output shows, Sqoop's main tools are import and export: import copies a MySQL table into HDFS, and export writes an HDFS directory back into MySQL as a table. There is also create-hive-table, which creates a Hive table directly from a MySQL table definition.
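For example, create-hive-table builds a Hive table whose column definitions mirror a MySQL table; a hypothetical invocation (not run in this article) would look like this:

[hadoop@master ~]$ sqoop create-hive-table --connect jdbc:mysql://master:3306/sqoop --username root --password Root-123 --table gameactive --hive-table stat.gameactive_from_mysql --fields-terminated-by ','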

As an example, first create a database named sqoop in MySQL and a table named gameactive inside it, then insert two rows, as shown below.
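The article does not show the MySQL DDL for this database and table; a minimal sketch, with column names and types mirroring the getData table defined later:

mysql> create database sqoop;
mysql> use sqoop;
mysql> create table gameactive(fdate varchar(64), fgamename varchar(64), fcount int);

With the table in place, insert the two rows and check them: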

mysql> select * from gameactive;
Empty set (0.01 sec)
mysql> insert into gameactive values('2020-02-13','puke',100),('2020-02-13','dizhu',150);
Query OK, 2 rows affected (0.14 sec)
Records: 2  Duplicates: 0  Warnings: 0
mysql> select * from gameactive;
+------------+-----------+--------+
| fdate      | fgamename | fcount |
+------------+-----------+--------+
| 2020-02-13 | puke      |    100 |
| 2020-02-13 | dizhu     |    150 |
+------------+-----------+--------+
2 rows in set (0.00 sec)

Then run a Sqoop job that connects MySQL and Hadoop and copies this table into HDFS:

[hadoop@master ~]$ sqoop import --connect jdbc:mysql://master:3306/sqoop --username root --password Root-123 --table gameactive --target-dir  /stat/test --delete-target-dir --num-mappers 1 --fields-terminated-by ','

The command breaks down as follows:

sqoop import \    use the import tool

--connect jdbc:mysql://master:3306/sqoop    connect to MySQL over JDBC: host master, port 3306, database sqoop

--username root    MySQL user name

--password Root-123    MySQL password

--table gameactive    the table to read from the sqoop database

--target-dir /stat/test    HDFS directory to write the table contents into

--delete-target-dir    if the target directory already exists, delete it first

--num-mappers 1    number of map tasks to use for the transfer, here 1

--fields-terminated-by ','    separate fields with commas

Running the command launches a MapReduce job on Hadoop:

[hadoop@master ~]$ sqoop import --connect jdbc:mysql://master:3306/sqoop --username root --password Root-123 --table gameactive --target-dir /stat/test --delete-target-dir --num-mappers 1 --fields-terminated-by ','
Warning: /home/hadoop/sqoop1.4.6/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/hadoop/sqoop1.4.6/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/sqoop1.4.6/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/hadoop/sqoop1.4.6/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
2020-02-13 14:38:17,349 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.14.0
2020-02-13 14:38:17,386 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2020-02-13 14:38:17,526 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2020-02-13 14:38:17,529 INFO tool.CodeGenTool: Beginning code generation
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
2020-02-13 14:38:18,426 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `gameactive` AS t LIMIT 1
2020-02-13 14:38:18,525 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `gameactive` AS t LIMIT 1
2020-02-13 14:38:18,534 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoop/hadoop-3.1.2
Note: /tmp/sqoop-hadoop/compile/3acb0490ad7cb85df80adb8d2b955e47/gameactive.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2020-02-13 14:38:20,418 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/3acb0490ad7cb85df80adb8d2b955e47/gameactive.jar
2020-02-13 14:38:21,378 INFO tool.ImportTool: Destination directory /stat/test is not present, hence not deleting.
2020-02-13 14:38:21,378 WARN manager.MySQLManager: It looks like you are importing from mysql.
2020-02-13 14:38:21,378 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
2020-02-13 14:38:21,378 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
2020-02-13 14:38:21,378 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
2020-02-13 14:38:21,386 INFO mapreduce.ImportJobBase: Beginning import of gameactive
2020-02-13 14:38:21,387 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2020-02-13 14:38:21,456 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2020-02-13 14:38:21,491 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2020-02-13 14:38:21,768 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.58.159:8032
2020-02-13 14:38:22,841 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1581564243812_0001
2020-02-13 14:38:54,794 INFO db.DBInputFormat: Using read commited transaction isolation
2020-02-13 14:38:55,013 INFO mapreduce.JobSubmitter: number of splits:1
2020-02-13 14:38:55,632 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1581564243812_0001
2020-02-13 14:38:55,634 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-02-13 14:38:56,023 INFO conf.Configuration: resource-types.xml not found
2020-02-13 14:38:56,023 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2020-02-13 14:38:56,742 INFO impl.YarnClientImpl: Submitted application application_1581564243812_0001
2020-02-13 14:38:56,923 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1581564243812_0001/
2020-02-13 14:38:56,924 INFO mapreduce.Job: Running job: job_1581564243812_0001
2020-02-13 14:39:34,699 INFO mapreduce.Job: Job job_1581564243812_0001 running in uber mode : false
2020-02-13 14:39:34,702 INFO mapreduce.Job: map 0% reduce 0%
2020-02-13 14:39:48,428 INFO mapreduce.Job: map 100% reduce 0%
2020-02-13 14:39:49,482 INFO mapreduce.Job: Job job_1581564243812_0001 completed successfully
2020-02-13 14:39:49,615 INFO mapreduce.Job: Counters: 32
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=234015
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=41
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=21786
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=10893
        Total vcore-milliseconds taken by all map tasks=10893
        Total megabyte-milliseconds taken by all map tasks=22308864
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=404
        CPU time spent (ms)=2520
        Physical memory (bytes) snapshot=124497920
        Virtual memory (bytes) snapshot=3604873216
        Total committed heap usage (bytes)=40763392
        Peak Map Physical memory (bytes)=124497920
        Peak Map Virtual memory (bytes)=3604873216
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=41
2020-02-13 14:39:49,631 INFO mapreduce.ImportJobBase: Transferred 41 bytes in 88.1291 seconds (0.4652 bytes/sec)
2020-02-13 14:39:49,646 INFO mapreduce.ImportJobBase: Retrieved 2 records.

When the job finishes, the result can be viewed in the web UI or directly with HDFS commands under /stat/test:

[hadoop@master ~]$ hdfs dfs -ls /stat/test
Found 2 items
-rw-r--r-- 1 hadoop supergroup  0 2020-02-13 14:39 /stat/test/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 41 2020-02-13 14:39 /stat/test/part-m-00000
[hadoop@master ~]$ hdfs dfs -cat /stat/test/part-m-00000
2020-02-13,puke,100
2020-02-13,dizhu,150

The contents match the rows created in MySQL exactly, so the MySQL-to-HDFS import works.

Going the other way, loading HDFS data into MySQL, uses the export tool.

First create a table named getData on the MySQL side:

mysql> use sqoop
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> create table getData(fdate varchar(64),fgamename varchar(64),fcount int);
Query OK, 0 rows affected (0.78 sec)
mysql> show tables;
+-----------------+
| Tables_in_sqoop |
+-----------------+
| gameactive      |
| getData         |
+-----------------+
2 rows in set (0.00 sec)

Back on the Sqoop side, export the data previously imported from MySQL into HDFS (stored under /stat/test) back into MySQL:

[hadoop@master ~]$ sqoop export --connect jdbc:mysql://master:3306/sqoop --username root --password Root-123 --table getData -m 1 --export-dir '/stat/test' --fields-terminated-by ',' 

Flag by flag:

sqoop export \    use the export tool

--connect jdbc:mysql://master:3306/sqoop    connect to MySQL over JDBC: host master, port 3306, database sqoop

--username root    MySQL user name

--password Root-123    MySQL password

--table getData    the MySQL table to write into

--export-dir '/stat/test'    the HDFS directory whose contents are exported to MySQL

-m 1    short form of --num-mappers 1: use one map task

--fields-terminated-by ','    fields in the HDFS files are comma-separated

When the job finishes, query MySQL:

mysql> select * from getData;
+------------+-----------+--------+
| fdate      | fgamename | fcount |
+------------+-----------+--------+
| 2020-02-13 | puke      |    100 |
| 2020-02-13 | dizhu     |    150 |
+------------+-----------+--------+
2 rows in set (0.04 sec)

The data is now back in MySQL, completing the round trip between HDFS and MySQL.

 
