HBase Permanent RIT (Region-In-Transition) Problem: A Troubleshooting Record

Environment:
CDH 6.2.0 (HBase 2.1.0)
Kylin 2.6.3

HBase permanent RIT (Region-In-Transition) problem: an unclean shutdown corrupted and lost HBase table data, leaving a large number of regions stuck Offline and unable to come back online.

Problem 1: when starting HBase, the RegionServer web UI stays stuck on the "The RegionServer is initializing!" page.

Troubleshooting and resolution

Check the HBase logs. The same errors appear in the HBase log over and over; digging deeper shows the regions are stuck in RIT.

# Check HDFS safe mode status
hdfs dfsadmin -safemode get

Safe mode is already off. If it is still on, turn it off first with the command below.

# Leave HDFS safe mode
hdfs dfsadmin -safemode leave
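The two safe-mode commands can be combined so that `leave` only runs when the cluster is actually in safe mode. A minimal sketch, using a hard-coded sample string in place of real `hdfs dfsadmin -safemode get` output:

```shell
# Sample string standing in for: status=$(hdfs dfsadmin -safemode get)
status="Safe mode is ON"
if printf '%s' "$status" | grep -q 'ON'; then
  # Real use would run: hdfs dfsadmin -safemode leave
  action="leave"
else
  action="none"
fi
echo "safe mode action: $action"
```

On a real cluster, replace the sample assignment with the actual `dfsadmin` call and run `leave` in place of setting `action`.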


# View the DFS status report
hdfs dfsadmin -report

No blocks are reported missing, which is odd. Next, check HDFS for corrupt files.

# Check for corrupt files and the current HDFS replica counts
hdfs fsck /
# or, to also print block locations:
hdfs fsck / -locations

Still nothing abnormal. The key counters in the fsck summary are:

Under replicated blocks: blocks with fewer replicas than the configured replication factor
Blocks with corrupt replicas: blocks that have at least one corrupt replica
Missing blocks: blocks with no remaining replicas
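These counters can be pulled out of the fsck summary mechanically. A small sketch driven by an illustrative sample summary (in practice, pipe `hdfs fsck /` through the same awk):

```shell
# Illustrative sample of the fsck summary section, not real cluster output.
summary='Under replicated blocks:        12
Blocks with corrupt replicas:   0
Missing blocks:                 0'
# Split each line on ":" and strip spaces from the numeric field.
under=$(printf '%s\n' "$summary"   | awk -F: '/Under replicated blocks/ {gsub(/ /,"",$2); print $2}')
corrupt=$(printf '%s\n' "$summary" | awk -F: '/corrupt replicas/        {gsub(/ /,"",$2); print $2}')
missing=$(printf '%s\n' "$summary" | awk -F: '/Missing blocks/          {gsub(/ /,"",$2); print $2}')
echo "under=$under corrupt=$corrupt missing=$missing"
```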

Core repair step 1 (tentative):

Reset the replication factor on existing files to repair under-replicated blocks

hadoop fs -setrep -R 3 /

For blocks that have lost some but not all replicas (Under replicated blocks), this command regenerates the full three replicas from the one or two that survive, recovering the lost copies.
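Rather than re-replicating the whole tree, you can target only the files fsck flags. A sketch over a made-up sample line (the path and block id are invented for illustration; real input would come from `hdfs fsck /`):

```shell
# Hypothetical fsck output line; the path and block id are invented.
fsck_line='/hbase/data/default/T1/abc/cf/f1:  Under replicated BP-1:blk_1073741825_1001. Target Replicas is 3 but found 1 live replica(s).'
# The file path is everything before the first colon.
bad_path=$(printf '%s\n' "$fsck_line" | grep 'Under replicated' | cut -d: -f1)
echo "would run: hadoop fs -setrep 3 $bad_path"
```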

Core repair step 2:

Delete corrupt files

hdfs fsck / -delete

Running this command (several times if needed) removes, for blocks whose replicas are all lost (Missing blocks) or corrupt, both the metadata on the NameNode and the corrupt files themselves.
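Because `-delete` is destructive, it is worth listing what will be removed first; HDFS provides `hdfs fsck / -list-corruptfileblocks` for that. The sketch below parses a made-up sample of its block-id/path output (the KYLIN_X table name is invented):

```shell
# Hypothetical sample of `hdfs fsck / -list-corruptfileblocks` output
# (block id, whitespace, file path). The table name KYLIN_X is invented.
corrupt_list='blk_1073741830	/hbase/data/default/KYLIN_X/region1/cf/f2'
doomed=$(printf '%s\n' "$corrupt_list" | awk '{print $2}')
echo "file that -delete would remove: $doomed"
```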

Note: in the end, this still did not resolve the problem.

Because the problem was never fully resolved earlier, something worse has now happened.

The number of regions stuck in RIT surged to 117.

Check HBase data consistency

sudo -u hbase hbase hbck

Log excerpt:

Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /hbase/data/default/KYLIN_CNJJRE3KX1/eb35470f15e4bb228262a54169d92c63/.regioninfo
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
	at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:735)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:415)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)

	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
	at org.apache.hadoop.ipc.Client.call(Client.java:1445)
	at org.apache.hadoop.ipc.Client.call(Client.java:1355)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy9.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:304)
	at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:853)
	... 21 more
20/04/22 12:55:42 WARN util.HBaseFsck: Unable to read .tableinfo from hdfs://nameservice1/hbase
org.apache.hadoop.hbase.TableInfoMissingException: No table descriptor file under hdfs://nameservice1/hbase/data/default/KYLIN_HWXQTTYU05
	at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:557)
	at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:545)
	at org.apache.hadoop.hbase.util.HBaseFsck.loadHdfsRegionInfos(HBaseFsck.java:1392)
	at org.apache.hadoop.hbase.util.HBaseFsck.onlineConsistencyRepair(HBaseFsck.java:745)
	at org.apache.hadoop.hbase.util.HBaseFsck.onlineHbck(HBaseFsck.java:836)
	at org.apache.hadoop.hbase.util.HBaseFsck.exec(HBaseFsck.java:5154)
	at org.apache.hadoop.hbase.util.HBaseFsck$HBaseFsckTool.run(HBaseFsck.java:4947)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
	at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:4935)
ERROR: Unable to read .tableinfo from hdfs://nameservice1/hbase/KYLIN_HWXQTTYU05

20/04/22 12:55:46 INFO util.HBaseFsck: Checking and fixing region consistency
ERROR: Region { meta => KYLIN_IC4ZTNA3D5,,1577784211982.0d116f627cb9ca1b3b9f42992e5ab536., hdfs => hdfs://nameservice1/hbase/data/default/KYLIN_IC4ZTNA3D5/0d116f627cb9ca1b3b9f42992e5ab536, deployed => , replicaId => 0 } not deployed on any region server.

......
ERROR: Region { meta => KYLIN_4SQ5XIVLWU,,1580047544478.96f18011aabf07e2e490450d70a0ca20., hdfs => hdfs://nameservice1/hbase/data/default/KYLIN_4SQ5XIVLWU/96f18011aabf07e2e490450d70a0ca20, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => KYLIN_7QLMJ32LLH,,1567588165728.88c950931a99a85753cd4f87e43eca48., hdfs => hdfs://nameservice1/hbase/data/default/KYLIN_7QLMJ32LLH/88c950931a99a85753cd4f87e43eca48, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: KYLIN_CNJJRE3KX1 has dangling table state tableName=KYLIN_CNJJRE3KX1, state=ENABLED
20/04/22 12:55:49 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_UQPUCHTR5M
......
ERROR: There is a hole in the region chain between  and .  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_E21JMZ27KX
Summary:
Table KYLIN_UQPUCHTR5M is okay.
    Number of regions: 0
    Deployed on: 
......
Table KYLIN_9B9SYOGJ7C is okay.
    Number of regions: 0
    Deployed on: 
Table KYLIN_E21JMZ27KX is okay.
    Number of regions: 0
    Deployed on: 
Table KYLIN_ULZF9NZ36M is okay.
    Number of regions: 1
    Deployed on:  node5,16020,1587527823160
787 inconsistencies detected.
Status: INCONSISTENT
20/04/22 12:55:49 INFO zookeeper.ZooKeeper: Session: 0x3719fbb6659010c closed
20/04/22 12:55:49 INFO zookeeper.ClientCnxn: EventThread shut down
20/04/22 12:55:49 INFO client.ConnectionImplementation: Closing master protocol: MasterService
20/04/22 12:55:49 INFO zookeeper.ZooKeeper: Session: 0x1719fbb42bd011a closed
20/04/22 12:55:49 INFO zookeeper.ClientCnxn: EventThread shut down
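With 787 inconsistencies it helps to save the report to a file and count the errors instead of eyeballing them. A sketch over a two-line sample of the report (in practice: `sudo -u hbase hbase hbck > hbck.log 2>&1`, then grep the file):

```shell
# Two illustrative lines standing in for a saved hbck report.
report='ERROR: Region { meta => T1,,1.abc., deployed => , replicaId => 0 } not deployed on any region server.
Status: INCONSISTENT'
# Count lines that begin with ERROR, matching hbck's own summary count.
error_count=$(printf '%s\n' "$report" | grep -c '^ERROR')
echo "inconsistencies: $error_count"
```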

Clearly the HBase data is inconsistent. Unfortunately, this environment is CDH 6.2 (HBase 2.1), where most of the hbck repair options can no longer be run, since hbck1's repair functionality was removed in HBase 2.x; see https://developer.aliyun.com/article/738179?spm=a2c6h.12873581.0.0.2d33187akpUNcd
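On HBase 2.x, the replacement for hbck1's repair options is HBCK2 from hbase-operator-tools, whose `assigns` command can push a stuck region back online. The sketch below only builds the command line from one of the "not deployed" errors above; the jar path is an assumption for your install, and the command is printed rather than executed:

```shell
# One "not deployed" error taken from the hbck report above.
err='ERROR: Region { meta => KYLIN_IC4ZTNA3D5,,1577784211982.0d116f627cb9ca1b3b9f42992e5ab536., deployed => , replicaId => 0 } not deployed on any region server.'
# The encoded region name is the 32-character hex id inside the region name.
region=$(printf '%s\n' "$err" | grep -oE '[0-9a-f]{32}')
# Print, rather than run, the HBCK2 assigns command (jar path is assumed).
echo "hbase hbck -j /opt/hbase-hbck2.jar assigns $region"
```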

To be continued.

My solution to the problem described in this post is here: https://blog.csdn.net/qq_36933797/article/details/105729051
