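Environment used in this article:
CDH 6.2.0 (HBase 2.1.0)
Kylin 2.6.3
HBase permanent RIT (Region-In-Transition) problem: an abnormal shutdown left HBase tables corrupted or lost, and a large number of regions are stuck in the Offline state and cannot come online.
Troubleshooting and resolution approach
Check the HBase logs
The same error shows up many times in the HBase logs; digging further shows that the affected regions are already stuck in RIT.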
# Check HDFS safe mode
hdfs dfsadmin -safemode get
Safe mode was already off. If it is still on in your cluster, turn it off first with the command below.
# Leave HDFS safe mode
hdfs dfsadmin -safemode leave
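The check-then-leave step above can be scripted. The following is a minimal sketch, assuming the command prints a line containing "Safe mode is ON" or "Safe mode is OFF" (one line per NameNode on an HA cluster); `safemode_state` is a hypothetical helper name, not part of Hadoop.

```shell
#!/bin/sh
# Classify the text printed by `hdfs dfsadmin -safemode get`, read from stdin.
# Prints ON, OFF, or UNKNOWN. safemode_state is a hypothetical helper.
safemode_state() {
    state=UNKNOWN
    # The `|| [ -n "$line" ]` part keeps a final line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            *"Safe mode is ON"*)  state=ON; break ;;  # any NameNode still in safe mode wins
            *"Safe mode is OFF"*) state=OFF ;;
        esac
    done
    echo "$state"
}

# Usage on a cluster node (not executed here):
#   if [ "$(hdfs dfsadmin -safemode get | safemode_state)" = "ON" ]; then
#       hdfs dfsadmin -safemode leave
#   fi
```

Treating any "ON" line as decisive is a deliberate choice: on an HA pair it is safer to report safe mode as active if either NameNode says so.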
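# Show the DFS status report
hdfs dfsadmin -report
No blocks are missing, which is odd. Next, check whether HDFS has any corrupt files.
# Check for corrupt files and the current HDFS replica counts
hdfs fsck /
or, to also list every file with its blocks and their locations:
hdfs fsck / -files -blocks -locations
Still nothing abnormal.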
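Under replicated blocks: number of blocks that have fewer replicas than the configured replication factor
Blocks with corrupt replicas: number of blocks that have at least one corrupt replica
Missing blocks: number of blocks with no replica left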
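To read these three counters without scanning the whole report by eye, a small filter can pull them out. This is a sketch under the assumption that the report prints each counter as `<label>: <number>` on its own line, possibly indented; the exact layout differs between Hadoop versions, and `report_counter` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the value that follows "<label>: " in `hdfs dfsadmin -report` text
# read from stdin. $1 is the label exactly as it appears before the colon.
# report_counter is a hypothetical helper, not a Hadoop command.
report_counter() {
    awk -v label="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)          # drop indentation
            prefix = label ": "
            if (index(line, prefix) == 1) {   # the label must start the line
                print substr(line, length(prefix) + 1)
                exit                          # first match only
            }
        }'
}

# Usage on a cluster node (not executed here):
#   hdfs dfsadmin -report > /tmp/dfs-report.txt
#   report_counter "Under replicated blocks"      < /tmp/dfs-report.txt
#   report_counter "Blocks with corrupt replicas" < /tmp/dfs-report.txt
#   report_counter "Missing blocks"               < /tmp/dfs-report.txt
```

Matching on the label plus `": "` keeps longer labels that merely begin with the same words, such as a "Missing blocks (...)" variant, from being picked up by mistake.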
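Core repair step 1:
hadoop fs -setrep -R 3 /
For blocks that are under-replicated (Under replicated blocks), this command lets HDFS re-replicate from the remaining one or two replicas until there are three again, which restores the lost replicas.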
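Core repair step 2:
hdfs fsck / -delete
Running this command several times permanently removes, on the NameNode, the metadata and the corrupted files for blocks whose replicas are all lost (Missing blocks) or corrupt.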
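The two repair steps apply to different symptoms, so it helps to decide from the counters which one is relevant before running anything. The following is a rough sketch of that decision; `suggest_repair` is a hypothetical helper that only prints the command to consider and never runs it, and the two numbers are assumed to come from the DFS status report.

```shell
#!/bin/sh
# suggest_repair UNDER_REPLICATED MISSING
# Prints the repair command that matches each non-zero counter.
# Hypothetical helper: it only echoes commands, it does not execute them.
suggest_repair() {
    under=$1
    missing=$2
    if [ "$under" -gt 0 ]; then
        # Replicas can be rebuilt from the surviving copies.
        echo "hadoop fs -setrep -R 3 /"
    fi
    if [ "$missing" -gt 0 ]; then
        # No replica left: only deleting the affected files clears the error.
        echo "hdfs fsck / -delete"
    fi
    if [ "$under" -eq 0 ] && [ "$missing" -eq 0 ]; then
        echo "nothing to repair at the HDFS level"
    fi
}

# Example: suggest_repair 3 0   prints the setrep command only.
```

When both counters are zero, neither command has anything to act on, which suggests looking for the cause above the HDFS layer.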
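Note: in the end, the problem was still not resolved.
Because it was never fully fixed, a more serious problem has now appeared.
The number of regions stuck in RIT has jumped to 117.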
sudo -u hbase hbase hbck
Log excerpt:
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /hbase/data/default/KYLIN_CNJJRE3KX1/eb35470f15e4bb228262a54169d92c63/.regioninfo
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
    at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:735)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:415)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
    at org.apache.hadoop.ipc.Client.call(Client.java:1445)
    at org.apache.hadoop.ipc.Client.call(Client.java:1355)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy9.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:304)
    at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:853)
    ... 21 more
20/04/22 12:55:42 WARN util.HBaseFsck: Unable to read .tableinfo from hdfs://nameservice1/hbase
org.apache.hadoop.hbase.TableInfoMissingException: No table descriptor file under hdfs://nameservice1/hbase/data/default/KYLIN_HWXQTTYU05
    at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:557)
    at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableDescriptorFromFs(FSTableDescriptors.java:545)
    at org.apache.hadoop.hbase.util.HBaseFsck.loadHdfsRegionInfos(HBaseFsck.java:1392)
    at org.apache.hadoop.hbase.util.HBaseFsck.onlineConsistencyRepair(HBaseFsck.java:745)
    at org.apache.hadoop.hbase.util.HBaseFsck.onlineHbck(HBaseFsck.java:836)
    at org.apache.hadoop.hbase.util.HBaseFsck.exec(HBaseFsck.java:5154)
    at org.apache.hadoop.hbase.util.HBaseFsck$HBaseFsckTool.run(HBaseFsck.java:4947)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
    at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:4935)
ERROR: Unable to read .tableinfo from hdfs://nameservice1/hbase/KYLIN_HWXQTTYU05
20/04/22 12:55:46 INFO util.HBaseFsck: Checking and fixing region consistency
ERROR: Region { meta => KYLIN_IC4ZTNA3D5,,1577784211982.0d116f627cb9ca1b3b9f42992e5ab536., hdfs => hdfs://nameservice1/hbase/data/default/KYLIN_IC4ZTNA3D5/0d116f627cb9ca1b3b9f42992e5ab536, deployed => , replicaId => 0 } not deployed on any region server.
......
ERROR: Region { meta => KYLIN_4SQ5XIVLWU,,1580047544478.96f18011aabf07e2e490450d70a0ca20., hdfs => hdfs://nameservice1/hbase/data/default/KYLIN_4SQ5XIVLWU/96f18011aabf07e2e490450d70a0ca20, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => KYLIN_7QLMJ32LLH,,1567588165728.88c950931a99a85753cd4f87e43eca48., hdfs => hdfs://nameservice1/hbase/data/default/KYLIN_7QLMJ32LLH/88c950931a99a85753cd4f87e43eca48, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: KYLIN_CNJJRE3KX1 has dangling table state tableName=KYLIN_CNJJRE3KX1, state=ENABLED
20/04/22 12:55:49 INFO util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially.
ERROR: There is a hole in the region chain between and . You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_UQPUCHTR5M
......
ERROR: There is a hole in the region chain between and . You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table KYLIN_E21JMZ27KX
Summary:
Table KYLIN_UQPUCHTR5M is okay.
    Number of regions: 0
    Deployed on:
......
Table KYLIN_9B9SYOGJ7C is okay.
    Number of regions: 0
    Deployed on:
Table KYLIN_E21JMZ27KX is okay.
    Number of regions: 0
    Deployed on:
Table KYLIN_ULZF9NZ36M is okay.
    Number of regions: 1
    Deployed on: node5,16020,1587527823160
787 inconsistencies detected.
Status: INCONSISTENT
20/04/22 12:55:49 INFO zookeeper.ZooKeeper: Session: 0x3719fbb6659010c closed
20/04/22 12:55:49 INFO zookeeper.ClientCnxn: EventThread shut down
20/04/22 12:55:49 INFO client.ConnectionImplementation: Closing master protocol: MasterService
20/04/22 12:55:49 INFO zookeeper.ZooKeeper: Session: 0x1719fbb42bd011a closed
20/04/22 12:55:49 INFO zookeeper.ClientCnxn: EventThread shut down
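With several hundred inconsistencies, the raw hbck output is hard to take in at a glance. The following is a minimal sketch that condenses it into one line, assuming the output was captured with one message per line, as hbck prints it; `hbck_summary` is a hypothetical helper name, and the patterns are taken from the messages in the excerpt above.

```shell
#!/bin/sh
# Condense `hbase hbck` output (read from stdin) into a single summary line.
# hbck_summary is a hypothetical helper; it only reads text and changes nothing.
hbck_summary() {
    awk '
        # Regions that exist in meta/HDFS but are not served by any RegionServer
        /^ERROR: Region .* not deployed on any region server/ { undeployed++ }
        # Every ERROR line, including the ones counted above
        /^ERROR:/ { errors++ }
        # e.g. "787 inconsistencies detected."
        /inconsistencies detected\./ { reported = $1 }
        # e.g. "Status: INCONSISTENT"
        /^Status:/ { status = $2 }
        END {
            printf "errors=%d undeployed=%d reported=%s status=%s\n",
                   errors, undeployed, reported, status
        }'
}

# Usage on a cluster node (not executed here):
#   sudo -u hbase hbase hbck 2>&1 | hbck_summary
```

The `errors` count covers only the ERROR lines present in the captured text, so it can differ from the total that hbck itself reports when part of the output has been trimmed.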
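Clearly the HBase data has become inconsistent. Unfortunately I am on CDH 6.2, which is a real pain here: it ships HBase 2.x, where many of the hbck repair options cannot be run. For details see https://developer.aliyun.com/article/738179?spm=a2c6h.12873581.0.0.2d33187akpUNcd.
To be continued.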