
[Hadoop Big Data]: Importing and Exporting Hive Data


Hive, the data warehouse tool of the Hadoop ecosystem, lets you run MapReduce jobs through SQL-style queries, which makes it well suited to full-scan analysis over large datasets.

This article walks through how to import and export data from the Hive CLI.


Importing data

Method 1: load directly from the local filesystem

On my machine there is a file test1.txt with three columns per row, separated by '\t':

[root@localhost conf]# cat /usr/tmp/test1.txt
1 a1 b1
2 a2 b2
3 a3 b3
4 a4 b4

Create the table:

> create table test1(a string,b string,c string)
> row format delimited
> fields terminated by '\t'
> stored as textfile;

Load the data:

load data local inpath '/usr/tmp/test1.txt' overwrite into table test1;

Here LOCAL INPATH indicates that the path is on the local filesystem,
and OVERWRITE means the loaded data replaces whatever the table already contains.
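
Conversely, if you omit OVERWRITE, the new file's rows are appended to the table's existing contents:

load data local inpath '/usr/tmp/test1.txt' into table test1;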

Method 2: load from a file on HDFS

First, upload the data to HDFS:

hadoop fs -put /usr/tmp/test1.txt /test1.txt

View test1.txt from within the Hive CLI (the dfs command runs HDFS operations inside the same session):

hive> dfs -cat /test1.txt;
1 a1 b1
2 a2 b2
3 a3 b3
4 a4 b4

Create the table the same way as before (here it is named test2). The load command differs slightly, dropping the LOCAL keyword:

load data inpath '/test1.txt' overwrite into table test2;
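
One caveat: unlike the LOCAL variant, loading from an HDFS path moves the source file into the table's warehouse directory rather than copying it, so /test1.txt no longer exists at its original location afterwards. You can confirm from the Hive CLI:

hive> dfs -ls /test1.txt;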

Method 3: insert the results of a query with INSERT INTO

First define the target table; this time create it with a partition column:

hive> create table test3(a string,b string,c string) partitioned by (d string) row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.109 seconds
hive> describe test3;
OK
a string
b string
c string
d string
# Partition Information
# col_name data_type comment
d string
Time taken: 0.071 seconds, Fetched: 9 row(s)

Insert query results directly into a specific partition:

hive> insert into table test3 partition(d='aaaaaa') select * from test2;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160823212718_9cfdbea4-42fa-4267-ac46-9ac2c357f944
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-23 21:27:21,621 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local1550375778_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:8020/user/hive/warehouse/test.db/test3/d=aaaaaa/.hive-staging_hive_2016-08-23_21-27-18_739_4058721562930266873-1/-ext-10000
Loading data to table test.test3 partition (d=aaaaaa)
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 248 HDFS Write: 175 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 3.647 seconds

Check the result with a query:

hive> select * from test3;
OK
1 a1 b1 aaaaaa
2 a2 b2 aaaaaa
3 a3 b3 aaaaaa
4 a4 b4 aaaaaa
Time taken: 0.264 seconds, Fetched: 4 row(s)
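
Note that INSERT INTO appends to the partition, so running it twice duplicates the rows. To replace a partition's contents instead, use INSERT OVERWRITE:

insert overwrite table test3 partition(d='aaaaaa') select * from test2;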

PS: you can also load data through dynamic partitioning (a sketch of the required setup follows the directory listing below):

insert into table test4 partition(c) select * from test2;

Each partition is stored as a directory named after the partition-column value:

hive> dfs -ls /user/hive/warehouse/test.db/test4/;
Found 4 items
drwxr-xr-x - root supergroup 0 2016-08-23 21:33 /user/hive/warehouse/test.db/test4/c=b1
drwxr-xr-x - root supergroup 0 2016-08-23 21:33 /user/hive/warehouse/test.db/test4/c=b2
drwxr-xr-x - root supergroup 0 2016-08-23 21:33 /user/hive/warehouse/test.db/test4/c=b3
drwxr-xr-x - root supergroup 0 2016-08-23 21:33 /user/hive/warehouse/test.db/test4/c=b4
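
A minimal sketch of what the dynamic-partition insert above requires. The test4 DDL is an assumption on my part (the post never shows it), and Hive normally needs dynamic partitioning switched on first:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- assumed DDL: the last column of the SELECT populates the partition column c
create table test4(a string, b string) partitioned by (c string)
row format delimited fields terminated by '\t' stored as textfile;

insert into table test4 partition(c) select * from test2;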

Method 4: create a table directly from a query (CTAS)

Create the table straight from the query:

hive> create table test5 as select * from test4;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160823213944_03672168-bc56-43d7-aefb-cac03a6558bf
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-23 21:39:46,030 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local855333165_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:8020/user/hive/warehouse/test.db/.hive-staging_hive_2016-08-23_21-39-44_259_5484795730585321098-1/-ext-10002
Moving data to directory hdfs://localhost:8020/user/hive/warehouse/test.db/test5
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 600 HDFS Write: 466 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.184 seconds

Check the result:

hive> select * from test5;
OK
1 a1 b1
2 a2 b2
3 a3 b3
4 a4 b4
Time taken: 0.147 seconds, Fetched: 4 row(s)
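
CTAS can also reshape the result along the way, e.g. picking columns, filtering rows, or choosing a storage format. A sketch (the table name test6 and the filter value are illustrative):

create table test6
row format delimited fields terminated by '\t'
stored as textfile
as select a, b from test4 where c = 'b1';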

Exporting data

Export to a local file

Run the export command:

hive> insert overwrite local directory '/usr/tmp/export' select * from test1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160823221655_05b05863-6273-4bdd-aad2-e80d4982425d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-23 22:16:57,028 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local8632460_0005
Moving data to local directory /usr/tmp/export
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 794 HDFS Write: 498 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.569 seconds
hive>

Inspect the contents on the local filesystem:

[root@localhost export]# ll
total 4
-rw-r--r--. 1 root root 32 Aug 23 22:16 000000_0
[root@localhost export]# cat 000000_0
1a1b1
2a2b2
3a3b3
4a4b4
[root@localhost export]# pwd
/usr/tmp/export
[root@localhost export]#
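
The columns look glued together because, without an explicit ROW FORMAT clause, a directory export uses Hive's default non-printing \001 (Ctrl-A) field delimiter. Since Hive 0.11 you can set the delimiter on the export itself; a sketch:

insert overwrite local directory '/usr/tmp/export'
row format delimited fields terminated by '\t'
select * from test1;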

Export to HDFS

hive> insert overwrite directory '/usr/tmp/test' select * from test1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160823214217_e8c71bb9-a147-4518-8353-81f9adc54183
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-23 21:42:19,257 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local628523792_0004
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:8020/usr/tmp/test/.hive-staging_hive_2016-08-23_21-42-17_778_6818164305996247644-1/-ext-10000
Moving data to directory /usr/tmp/test
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 730 HDFS Write: 498 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.594 seconds

The export succeeded. Inspect the exported files on HDFS; note that the target path is a directory, so the first cat fails:

hive> dfs -cat /usr/tmp/test;
cat: `/usr/tmp/test': Is a directory
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null
hive> dfs -ls /usr/tmp/test;
Found 1 items
-rwxr-xr-x 3 root supergroup 32 2016-08-23 21:42 /usr/tmp/test/000000_0
hive> dfs -cat /usr/tmp/test/000000_0;
1a1b1
2a2b2
3a3b3
4a4b4
hive>
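
To pull the exported file back down to the local filesystem, the usual HDFS commands apply (the local filename here is arbitrary):

hadoop fs -get /usr/tmp/test/000000_0 /usr/tmp/test_export.txt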

Export to another table

This mirrors the query-based import shown earlier. Since test3 is partitioned by d, the insert needs a partition clause:

insert into table test3 partition(d='aaaaaa') select * from test1;
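
Beyond INSERT ... SELECT, Hive also provides dedicated EXPORT/IMPORT statements (Hive 0.8+) that copy a table's data together with its metadata, which is handy for moving tables between clusters. A sketch, with arbitrary paths and target name:

export table test1 to '/tmp/hive_export/test1';
import table test1_copy from '/tmp/hive_export/test1';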