- When starting the RegionServer on one of the hosts from Ambari, it fails to start, and logs are produced under /var/log/hbase/ on that host.
- New content in hbase-hbase-regionserver-XXX.log:
- Wed Nov 30 10:06:22 CST 2016 Starting regionserver on XXX
- core file size (blocks, -c) 0
- data seg size (kbytes, -d) unlimited
- scheduling priority (-e) 0
- file size (blocks, -f) unlimited
- pending signals (-i) 1031397
- max locked memory (kbytes, -l) 64
- max memory size (kbytes, -m) unlimited
- open files (-n) 10000
- pipe size (512 bytes, -p) 8
- POSIX message queues (bytes, -q) 819200
- real-time priority (-r) 0
- stack size (kbytes, -s) 10240
- cpu time (seconds, -t) unlimited
- max user processes (-u) 16000
- virtual memory (kbytes, -v) unlimited
- file locks (-x) unlimited
-
- In addition, a crash log file hs_err_pid16030.log is produced.
Solution:
The hs_err_pid* file itself suggests the fix: run ulimit -c unlimited on the host, then restart the RegionServer from the Ambari page. It started successfully and the alert cleared.
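A minimal sketch of applying the fix; the persistence step via /etc/security/limits.conf is an assumption, not part of the original notes:
- # Check the current core-dump limit (0 means core files are disabled)
- ulimit -c
- # Allow unlimited core files in the current shell, then restart the
- # RegionServer from Ambari while this setting is in effect
- ulimit -c unlimited
- # Assumed persistence step: keep the limit across re-logins for the
- # hbase service user
- echo 'hbase  soft  core  unlimited' >> /etc/security/limits.conf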
- Metrics in Ambari fail to load; /var/log/ambari-server/ambari-server.log on host 34 shows the following error:
-
- 29 Nov 2016 16:18:18,337 ERROR [pool-9-thread-319] BaseProvider:240 - Caught exception getting JMX metrics :
- Connection refused, skipping same exceptions for next 5 minutes
- java.net.ConnectException: Connection refused
Solution:
The problem appeared the day before and was not fixed at the time; the next day the cluster recovered on its own. The likely cause was clock skew between the cluster hosts: once the cluster time was synchronized, metrics collection returned to normal automatically.
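A quick way to spot skew of this kind (the hostnames below are placeholders):
- # Print each host's epoch-second clock side by side; a spread of more
- # than a few seconds suggests the hosts are not being kept in sync
- for h in host1 host2 host3; do echo -n "$h: "; ssh "$h" date +%s; done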
- 2017-04-18 14:23:25,008 WARN datanode.DataNode (DataNode.java:checkStorageLocations(2439)) - Invalid dfs.datanode.data.dir /data1/hadoop/hdfs/data :
- org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: /data1/hadoop/hdfs/data
- ...
- 2017-04-18 14:23:26,751 WARN common.Storage (BlockPoolSliceStorage.java:loadBpStorageDirectories(221)) - Failed to analyze storage directories for block pool BP-1071526479-192.16.10.34-1472798972660
- java.io.IOException: BlockPoolSliceStorage.recoverTransitionRead: attempt to load an used block storage: /data2/hadoop/hdfs/data/current/BP-1071526479-192.16.10.34-1472798972660
- at
-
Directory is not writable: /data1/hadoop/hdfs/data
/dev/sdb1 ext4 3.6T 137G 3.3T 4% /data1
One of the disks had failed.
Solution: in hdfs-site, set the property
dfs.datanode.failed.volumes.tolerated to 1 and restart the DataNode (a sketch of the change is below); then repair the failed disk.
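One way to make the change, assuming Ambari's bundled config script (the cluster name "mycluster" and the admin credentials are placeholders; the same edit can be made on the HDFS config page in the Ambari UI):
- /var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin \
-   set localhost mycluster hdfs-site \
-   "dfs.datanode.failed.volumes.tolerated" "1"
- # Then restart the DataNode from the Ambari page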
- 1. Kill the processes using /data1
- Find the processes holding /data1 with fuser, then kill them:
- [root@XXX ~]# fuser -m /data1
- /data1: 4036m 26457c
- [root@XXX ~]# kill -9 26457
- [root@XXX ~]# fuser -m /data1
- /data1: 4036m
- [root@XXX ~]# kill -9 4036
-
- 2. Unmount the disk
- umount /data1
-
- 3. Reformat the disk
- mkfs.ext4 /dev/sdb1
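The remaining steps are assumed rather than taken from the original notes: remount the disk and recreate the DataNode data directory with the usual ownership before restarting the DataNode.
- # 4. Remount the freshly formatted disk (assumes /etc/fstab still maps
- #    /dev/sdb1 to /data1)
- mount /data1
- # 5. Recreate the data directory; hdfs:hadoop is the customary owner on
- #    HDP clusters and is an assumption here
- mkdir -p /data1/hadoop/hdfs/data
- chown -R hdfs:hadoop /data1/hadoop/hdfs/data
- # 6. Restart the DataNode from the Ambari page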
- DataNode Health Summary
- DataNode Health: [Live=18, Stale=0, Dead=1]
Solution:
The dead node responded to ping, but SSH sessions exited immediately after connecting, presumably because the number of client connections was too high; raise the connection limit in the configuration file.
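The original notes do not say which file; a plausible sketch, assuming the limit being hit is sshd's (both parameter names are assumptions about this cluster, not something recorded at the time):
- # In /etc/ssh/sshd_config raise the connection-related limits, e.g.:
- #   MaxSessions 100
- #   MaxStartups 100:30:200
- # Then reload sshd (use "service sshd reload" on pre-systemd hosts)
- systemctl reload sshd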
Once SSH access was back, restore the node's ambari-agent:
- 1. Check whether the agent is running (here it was not):
- sudo ambari-agent status
-
- 2. Start the agent:
- sudo ambari-agent start
- 2017-08-03 09:20:59,089 INFO [regionserver/XXX/192.16.10.25:16020]
- regionserver.HRegionServer: STOPPED:
- Unhandled: org.apache.hadoop.hbase.ClockOutOfSyncException:
- Server XXX,16020,1501723257281 has been rejected;
- Reported time is too far out of sync with master.
- Time difference of 224557ms > max allowed of 30000ms
Solution:
Synchronize the cluster time so the RegionServer's clock is back within the allowed skew (30000 ms here) of the HMaster's.
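A minimal sketch of a one-off resync; ntp.example.com is a placeholder for the site NTP server, and for a lasting fix ntpd or chronyd should run on every host:
- # Step the clock against the NTP server, then save it to the hardware clock
- ntpdate ntp.example.com
- hwclock -w
- # Afterwards restart the RegionServer from Ambari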
- Increase the following two property values (in mapred-site):
- mapreduce.map.memory.mb
- mapreduce.reduce.memory.mb
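For illustration only; the values are assumptions and should be sized to your containers, again using Ambari's bundled config script with placeholder cluster name and credentials:
- /var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin \
-   set localhost mycluster mapred-site "mapreduce.map.memory.mb" "2048"
- /var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin \
-   set localhost mycluster mapred-site "mapreduce.reduce.memory.mb" "4096"
- # Keep the in-container heap (mapreduce.map.java.opts /
- # mapreduce.reduce.java.opts) below these limits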
- 04 Aug 2017 09:02:07,516 ERROR [qtp-ambari-client-986626] MetricsRequestHelper:114 - Error getting timeline metrics : Read timed out
- 04 Aug 2017 09:02:07,517 ERROR [qtp-ambari-client-986626] MetricsRequestHelper:121 - Error getting timeline metrics : Read timed out Can not connect to collector, socket error.
- 04 Aug 2017 09:11:52,441 ERROR [pool-9-thread-58891] BaseProvider:240 - Caught exception getting JMX metrics : Connection refused, skipping same exceptions for next 5 minutes
- java.net.ConnectException: Connection refused
- at java.net.PlainSocketImpl.socketConnect(Native Method)
- at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
Solution:
- a) Restart ambari-agent on the data host:
- ambari-agent restart
-
- b) Restart ambari-server on the master1 host:
- ambari-server restart
- Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
- [2017-04-11 10:52:25,713] {bash_operator.py:77} INFO - at java.util.Arrays.copyOf(Arrays.java:3332)
- [2017-04-11 10:52:25,713] {bash_operator.py:77} INFO - at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
- [2017-04-11 10:52:25,713] {bash_operator.py:77} INFO - at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
- [2017-04-11 10:52:25,713] {bash_operator.py:77} INFO - at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
- [2017-04-11 10:52:25,714] {bash_operator.py:77} INFO - at java.lang.StringBuffer.append(StringBuffer.java:369)
- [2017-04-11 10:52:25,714] {bash_operator.py:77} INFO - at java.io.BufferedReader.readLine(BufferedReader.java:370)
- [2017-04-11 10:52:25,714] {bash_operator.py:77} INFO - at java.io.BufferedReader.readLine(BufferedReader.java:389)
- [2017-04-11 10:52:25,714] {bash_operator.py:77} INFO - at jline.console.history.FileHistory.load(FileHistory.java:69)
- [2017-04-11 10:52:25,714] {bash_operator.py:77} INFO - at jline.console.history.FileHistory.load(FileHistory.java:55)
- [2017-04-11 10:52:25,714] {bash_operator.py:77} INFO - at jline.console.history.FileHistory.<init>(FileHistory.java:44)
- [2017-04-11 10:52:25,714] {bash_operator.py:77} INFO - at org.apache.hive.beeline.BeeLine.getConsoleReader(BeeLine.java:873)
- [2017-04-11 10:52:25,715] {bash_operator.py:77} INFO - at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:780)
- [2017-04-11 10:52:25,715] {bash_operator.py:77} INFO - at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:485)
- [2017-04-11 10:52:25,715] {bash_operator.py:77} INFO - at org.apache.hive.beeline.BeeLine.main(BeeLine.java:468)
- [2017-04-11 10:52:25,715] {bash_operator.py:77} INFO - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
- [2017-04-11 10:52:25,715] {bash_operator.py:77} INFO - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
- [2017-04-11 10:52:25,715] {bash_operator.py:77} INFO - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
- [2017-04-11 10:52:25,715] {bash_operator.py:77} INFO - at java.lang.reflect.Method.invoke(Method.java:497)
- [2017-04-11 10:52:25,715] {bash_operator.py:77} INFO - at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
- [2017-04-11 10:52:25,715] {bash_operator.py:77} INFO - at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
- [2017-04-11 10:52:26,055] {bash_operator.py:80} INFO - Command exited with return code 0
Solution:
Research and testing showed the cause to be an oversized ~/.beeline/history file; after deleting it, Beeline ran successfully.
https://issues.apache.org/jira/browse/HIVE-10836
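A minimal check-and-clean sketch, run as the user that launches Beeline:
- # A history file of a few hundred MB is enough to exhaust the heap when
- # jline's FileHistory.load reads it line by line
- ls -lh ~/.beeline/history
- # Remove it; Beeline recreates the file on the next run
- rm ~/.beeline/history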