当前位置:   article > 正文

HIVE常用命令之MSCK REPAIR TABLE_mask repair table

mask repair table

目录

MSCK REPAIR TABLE 命令是做啥的

例子

后续


 

MSCK REPAIR TABLE 命令是做啥的

MSCK REPAIR TABLE命令主要是用来解决通过hdfs dfs -put或者hdfs api写入hive分区表的数据在hive中无法被查询到的问题。

我们知道hive有个服务叫metastore,这个服务主要是存储一些元数据信息,比如数据库名,表名或者表的分区等等信息。如果不是通过hive的insert等插入语句,很多分区信息在metastore中是没有的,如果插入分区数据量很多的话,你用 ALTER TABLE table_name ADD PARTITION 一个个分区添加十分麻烦。这时候MSCK REPAIR TABLE就派上用场了。只需要运行MSCK REPAIR TABLE命令,hive就会去检测这个表在hdfs上的文件,把没有写入metastore的分区信息写入metastore。

例子

我们先创建一个分区表,然后往其中的一个分区插入一条数据,在查看分区信息

  1. CREATE TABLE repair_test (col_a STRING) PARTITIONED BY (par STRING);
  2. INSERT INTO TABLE repair_test PARTITION(par="partition_1") VALUES ("test");
  3. SHOW PARTITIONS repair_test;



查看分区信息的结果如下

  1. 0: jdbc:hive2://localhost:10000> show partitions repair_test;
  2. INFO : Compiling command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21): show partitions repair_test
  3. INFO : Semantic Analysis Completed
  4. INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
  5. INFO : Completed compiling command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21); Time taken: 0.029 seconds
  6. INFO : Executing command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21): show partitions repair_test
  7. INFO : Starting task [Stage-0:DDL] in serial mode
  8. INFO : Completed executing command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21); Time taken: 0.017 seconds
  9. INFO : OK
  10. +------------------+--+
  11. | partition |
  12. +------------------+--+
  13. | par=partition_1 |
  14. +------------------+--+
  15. 1 row selected (0.073 seconds)
  16. 0: jdbc:hive2://localhost:10000>



然后我们通过hdfs的put命令手动创建一个数据

  1. [ericsson@h3cnamenode1 pcc]$ echo "123123" > test.txt
  2. [ericsson@h3cnamenode1 pcc]$ hdfs dfs -mkdir -p /user/hive/warehouse/test.db/repair_test/par=partition_2/
  3. [ericsson@h3cnamenode1 pcc]$ hdfs dfs -put -f test.txt /user/hive/warehouse/test.db/repair_test/par=partition_2/
  4. [ericsson@h3cnamenode1 pcc]$ hdfs dfs -ls -R /user/hive/warehouse/test.db/repair_test
  5. drwxrwxrwt - ericsson hive 0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1
  6. drwxrwxrwt - ericsson hive 0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/.hive-staging_hive_2018-08-10_17-45-59_029_1594310228554990949-1
  7. drwxrwxrwt - ericsson hive 0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/.hive-staging_hive_2018-08-10_17-45-59_029_1594310228554990949-1/-ext-10000
  8. -rwxrwxrwt 3 ericsson hive 5 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/000000_0
  9. drwxr-xr-x - ericsson hive 0 2018-08-10 17:57 /user/hive/warehouse/test.db/repair_test/par=partition_2
  10. -rw-r--r-- 3 ericsson hive 7 2018-08-10 17:57 /user/hive/warehouse/test.db/repair_test/par=partition_2/test.txt
  11. [ericsson@h3cnamenode1 pcc]$

 

这时候我们查询分区信息,发现partition_2这个分区并没有加入到hive中

  1. 0: jdbc:hive2://localhost:10000> show partitions repair_test;
  2. INFO : Compiling command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79): show partitions repair_test
  3. INFO : Semantic Analysis Completed
  4. INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
  5. INFO : Completed compiling command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79); Time taken: 0.029 seconds
  6. INFO : Executing command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79): show partitions repair_test
  7. INFO : Starting task [Stage-0:DDL] in serial mode
  8. INFO : Completed executing command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79); Time taken: 0.02 seconds
  9. INFO : OK
  10. +------------------+--+
  11. | partition |
  12. +------------------+--+
  13. | par=partition_1 |
  14. +------------------+--+
  15. 1 row selected (0.079 seconds)
  16. 0: jdbc:hive2://localhost:10000>



运行MSCK REPAIR TABLE 命令后再查询分区信息,可以看到通过put命令放入的分区已经可以查询了

  1. 0: jdbc:hive2://localhost:10000> MSCK REPAIR TABLE repair_test;
  2. INFO : Compiling command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f): MSCK REPAIR TABLE repair_test
  3. INFO : Semantic Analysis Completed
  4. INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
  5. INFO : Completed compiling command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f); Time taken: 0.004 seconds
  6. INFO : Executing command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f): MSCK REPAIR TABLE repair_test
  7. INFO : Starting task [Stage-0:DDL] in serial mode
  8. INFO : Completed executing command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f); Time taken: 0.138 seconds
  9. INFO : OK
  10. No rows affected (0.154 seconds)
  11. 0: jdbc:hive2://localhost:10000> show partitions repair_test;
  12. INFO : Compiling command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25): show partitions repair_test
  13. INFO : Semantic Analysis Completed
  14. INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
  15. INFO : Completed compiling command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25); Time taken: 0.045 seconds
  16. INFO : Executing command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25): show partitions repair_test
  17. INFO : Starting task [Stage-0:DDL] in serial mode
  18. INFO : Completed executing command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25); Time taken: 0.016 seconds
  19. INFO : OK
  20. +------------------+--+
  21. | partition |
  22. +------------------+--+
  23. | par=partition_1 |
  24. | par=partition_2 |
  25. +------------------+--+
  26. 2 rows selected (0.088 seconds)
  27. 0: jdbc:hive2://localhost:10000> select * from repair_test;
  28. INFO : Compiling command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38): select * from repair_test
  29. INFO : Semantic Analysis Completed
  30. INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:repair_test.col_a, type:string, comment:null), FieldSchema(name:repair_test.par, type:string, comment:null)], properties:null)
  31. INFO : Completed compiling command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38); Time taken: 0.059 seconds
  32. INFO : Executing command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38): select * from repair_test
  33. INFO : Completed executing command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38); Time taken: 0.001 seconds
  34. INFO : OK
  35. +--------------------+------------------+--+
  36. | repair_test.col_a | repair_test.par |
  37. +--------------------+------------------+--+
  38. | test | partition_1 |
  39. | 123123 | partition_2 |
  40. +--------------------+------------------+--+
  41. 2 rows selected (0.121 seconds)
  42. 0: jdbc:hive2://localhost:10000>





 

后续

后面发生了更有意思的事情。大致情况是很多人以为alter table drop partition只能删除一个分区的数据,结果用hdfs dfs -rmr 删除hive分区表的hdfs文件。这就导致了一个问题hdfs上的文件虽然删除了,但是hive metastore中的原信息没有删除。如果用show parttions table_name 这些分区信息还在,需要把这些分区原信息清除。

后来我想看看MSCK REPAIR TABLE这个命令能否删除已经不存在hdfs上的表分区信息,发现不行,我去jira查了下,发现Fix Version/s: 3.0.0, 2.4.0, 3.1.0 这几个版本的hive才支持这个功能。但由于我们的hive版本是1.1.0-cdh5.11.0, 这个方法无法使用。

附上官网的链接
Recover Partitions (MSCK REPAIR TABLE)

 

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/我家小花儿/article/detail/719453
推荐阅读
相关标签
  

闽ICP备14008679号