
[Hive] MSCK REPAIR TABLE: fixing dynamic partitions whose data cannot be queried

MSCK REPAIR

0. Background

A Hive table was created with dynamic partitions, files were copied into the partition directories by hand or Spark wrote data into them directly, and the relevant parameters were set, yet the data in those partitions still cannot be queried. In this situation you need to run the MSCK REPAIR TABLE command to refresh the partition metadata in the Hive metastore.
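
A minimal sketch of the typical sequence, assuming the table is named my_table and that the "parameters" above refer to the usual dynamic-partition settings (an assumption; the original does not name them):

  -- Usual dynamic-partition settings (assumed; the article does not name them)
  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;

  -- After data lands in the partition directories outside of Hive,
  -- register the new partitions with the metastore:
  MSCK REPAIR TABLE my_table;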

1. Purpose

The MSCK REPAIR TABLE command is mainly used to solve the problem where data written into a Hive partitioned table via hdfs dfs -put or the HDFS API cannot be queried in Hive.

Hive has a service called the metastore, which stores metadata such as database names, table names, and table partitions. If data is not loaded through Hive statements such as INSERT, many partitions will be missing from the metastore, and when a large number of partitions are involved, adding them one by one with ALTER TABLE table_name ADD PARTITION is tedious. This is where MSCK REPAIR TABLE comes in: run it once, and Hive scans the table's directories on HDFS and writes any partitions that are missing from the metastore into the metastore.
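
As a sketch of the contrast, using the repair_test table from the example below (the LOCATION path matches that example and is otherwise illustrative):

  -- One partition at a time (tedious when there are hundreds):
  ALTER TABLE repair_test ADD IF NOT EXISTS PARTITION (par='partition_2')
    LOCATION '/user/hive/warehouse/test.db/repair_test/par=partition_2';

  -- Or let Hive scan HDFS and register every missing partition in one pass:
  MSCK REPAIR TABLE repair_test;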

2. Example

First create a partitioned table, insert one row into one of its partitions, and then check the partition information:

  CREATE TABLE repair_test (col_a STRING) PARTITIONED BY (par STRING);
  INSERT INTO TABLE repair_test PARTITION(par="partition_1") VALUES ("test");
  SHOW PARTITIONS repair_test;

The partition listing looks like this:

  0: jdbc:hive2://localhost:10000> show partitions repair_test;
  INFO : Compiling command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21): show partitions repair_test
  INFO : Semantic Analysis Completed
  INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
  INFO : Completed compiling command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21); Time taken: 0.029 seconds
  INFO : Executing command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21): show partitions repair_test
  INFO : Starting task [Stage-0:DDL] in serial mode
  INFO : Completed executing command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21); Time taken: 0.017 seconds
  INFO : OK
  +------------------+--+
  |    partition     |
  +------------------+--+
  | par=partition_1  |
  +------------------+--+
  1 row selected (0.073 seconds)
  0: jdbc:hive2://localhost:10000>

Next, we manually create a data file and put it into a new partition directory with the hdfs put command:

  [ericsson@h3cnamenode1 pcc]$ echo "123123" > test.txt
  [ericsson@h3cnamenode1 pcc]$ hdfs dfs -mkdir -p /user/hive/warehouse/test.db/repair_test/par=partition_2/
  [ericsson@h3cnamenode1 pcc]$ hdfs dfs -put -f test.txt /user/hive/warehouse/test.db/repair_test/par=partition_2/
  [ericsson@h3cnamenode1 pcc]$ hdfs dfs -ls -R /user/hive/warehouse/test.db/repair_test
  drwxrwxrwt - ericsson hive 0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1
  drwxrwxrwt - ericsson hive 0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/.hive-staging_hive_2018-08-10_17-45-59_029_1594310228554990949-1
  drwxrwxrwt - ericsson hive 0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/.hive-staging_hive_2018-08-10_17-45-59_029_1594310228554990949-1/-ext-10000
  -rwxrwxrwt 3 ericsson hive 5 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/000000_0
  drwxr-xr-x - ericsson hive 0 2018-08-10 17:57 /user/hive/warehouse/test.db/repair_test/par=partition_2
  -rw-r--r-- 3 ericsson hive 7 2018-08-10 17:57 /user/hive/warehouse/test.db/repair_test/par=partition_2/test.txt
  [ericsson@h3cnamenode1 pcc]$

If we now list the partitions, we find that partition_2 has not been added to Hive:

  0: jdbc:hive2://localhost:10000> show partitions repair_test;
  INFO : Compiling command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79): show partitions repair_test
  INFO : Semantic Analysis Completed
  INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
  INFO : Completed compiling command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79); Time taken: 0.029 seconds
  INFO : Executing command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79): show partitions repair_test
  INFO : Starting task [Stage-0:DDL] in serial mode
  INFO : Completed executing command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79); Time taken: 0.02 seconds
  INFO : OK
  +------------------+--+
  |    partition     |
  +------------------+--+
  | par=partition_1  |
  +------------------+--+
  1 row selected (0.079 seconds)
  0: jdbc:hive2://localhost:10000>

After running MSCK REPAIR TABLE and listing the partitions again, the partition loaded via the put command is now visible, and its data can be queried:

  0: jdbc:hive2://localhost:10000> MSCK REPAIR TABLE repair_test;
  INFO : Compiling command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f): MSCK REPAIR TABLE repair_test
  INFO : Semantic Analysis Completed
  INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
  INFO : Completed compiling command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f); Time taken: 0.004 seconds
  INFO : Executing command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f): MSCK REPAIR TABLE repair_test
  INFO : Starting task [Stage-0:DDL] in serial mode
  INFO : Completed executing command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f); Time taken: 0.138 seconds
  INFO : OK
  No rows affected (0.154 seconds)
  0: jdbc:hive2://localhost:10000> show partitions repair_test;
  INFO : Compiling command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25): show partitions repair_test
  INFO : Semantic Analysis Completed
  INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
  INFO : Completed compiling command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25); Time taken: 0.045 seconds
  INFO : Executing command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25): show partitions repair_test
  INFO : Starting task [Stage-0:DDL] in serial mode
  INFO : Completed executing command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25); Time taken: 0.016 seconds
  INFO : OK
  +------------------+--+
  |    partition     |
  +------------------+--+
  | par=partition_1  |
  | par=partition_2  |
  +------------------+--+
  2 rows selected (0.088 seconds)
  0: jdbc:hive2://localhost:10000> select * from repair_test;
  INFO : Compiling command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38): select * from repair_test
  INFO : Semantic Analysis Completed
  INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:repair_test.col_a, type:string, comment:null), FieldSchema(name:repair_test.par, type:string, comment:null)], properties:null)
  INFO : Completed compiling command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38); Time taken: 0.059 seconds
  INFO : Executing command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38): select * from repair_test
  INFO : Completed executing command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38); Time taken: 0.001 seconds
  INFO : OK
  +--------------------+------------------+--+
  | repair_test.col_a  | repair_test.par  |
  +--------------------+------------------+--+
  | test               | partition_1      |
  | 123123             | partition_2      |
  +--------------------+------------------+--+
  2 rows selected (0.121 seconds)
  0: jdbc:hive2://localhost:10000>

3. Caveats

The rough situation is this: many people assume that ALTER TABLE ... DROP PARTITION can only drop one partition's data at a time, so instead they use hdfs dfs -rmr to delete the HDFS files of a Hive partitioned table. This causes a problem: the files on HDFS are gone, but the corresponding metadata in the Hive metastore has not been removed. SHOW PARTITIONS table_name still lists those partitions, so the stale partition metadata has to be cleaned up.
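
On a Hive version where MSCK cannot remove stale partitions (see the next paragraph), a sketch of the manual cleanup, assuming the dangling partition is par='partition_2':

  -- Drop the stale partition metadata by hand; IF EXISTS avoids an error
  -- if it has already been removed. Multiple partitions can be listed
  -- in one statement, separated by commas.
  ALTER TABLE repair_test DROP IF EXISTS PARTITION (par='partition_2');
  SHOW PARTITIONS repair_test;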

I then tested whether MSCK REPAIR TABLE could also remove partition metadata for directories that no longer exist on HDFS, and found that it could not. Checking JIRA, the fix versions are 3.0.0, 2.4.0, and 3.1.0, so only those Hive releases support this feature. Since our Hive version is 1.1.0-cdh5.11.0, this approach was not available to us.
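
On one of those newer releases, a sketch of the same cleanup via MSCK (syntax per the documentation quoted below; not something we could run on 1.1.0-cdh5.11.0):

  -- Remove metastore entries for partitions whose directories are gone from HDFS
  MSCK REPAIR TABLE repair_test DROP PARTITIONS;

  -- Or add missing partitions and drop stale ones in a single pass
  MSCK REPAIR TABLE repair_test SYNC PARTITIONS;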

The relevant excerpt from the official documentation is attached below.

Recover Partitions (MSCK REPAIR TABLE)

Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (say by using hadoop fs -put command) or removed from HDFS, the metastore (and hence Hive) will not be aware of these changes to partition information unless the user runs ALTER TABLE table_name ADD/DROP PARTITION commands on each of the newly added or removed partitions, respectively.
However, users can run a metastore check command with the repair table option:
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
which will update metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. The default option for the MSCK command is ADD PARTITIONS. With this option, it will add any partitions that exist on HDFS but not in the metastore to the metastore. The DROP PARTITIONS option will remove from the metastore partition information for partitions that have already been removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS. See HIVE-874 and HIVE-17824 for more details. When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch wise to avoid OOME (Out of Memory Error). By setting a batch size via the property hive.msck.repair.batch.size, it can run in batches internally. The default value of the property is zero, which means it processes all the partitions at once. The MSCK command without the REPAIR option can be used to find details about metadata mismatches with the metastore.
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE table_name RECOVER PARTITIONS;
Starting with Hive 1.3, MSCK will throw exceptions if directories with disallowed characters in partition values are found on HDFS. Use hive.msck.path.validation setting on the client to alter this behavior; "skip" will simply skip the directories. "ignore" will try to create partitions anyway (old behavior). This may or may not work.
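
A sketch of the tuning settings mentioned above (the batch size of 1000 is an arbitrary illustrative value):

  -- Process untracked partitions in batches to avoid running out of memory
  -- (the default of 0 handles everything in one go).
  SET hive.msck.repair.batch.size=1000;

  -- Skip partition directories whose values contain disallowed characters
  -- instead of throwing an exception (Hive 1.3+ behavior).
  SET hive.msck.path.validation=skip;

  MSCK REPAIR TABLE repair_test;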

HIVE-17824 is the ticket that adds the ability for MSCK REPAIR to clean up partition metadata in the metastore for partitions that no longer exist on HDFS.
