ambari-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMBARI-10111) Tune GC settings for HBase daemon
Date Wed, 18 Mar 2015 02:38:38 GMT

    [ https://issues.apache.org/jira/browse/AMBARI-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366522#comment-14366522 ]

Hudson commented on AMBARI-10111:
---------------------------------

FAILURE: Integrated in Ambari-trunk-Commit #2064 (See [https://builds.apache.org/job/Ambari-trunk-Commit/2064/])
AMBARI-10111. Tune GC settings for HBase daemon. (swagle) (swagle: http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=8599d19d93a774fa248d97400b7b6270c1d677fd)
* ambari-server/src/main/resources/common-services/AMBARI_METRICS/0.1.0/package/scripts/params.py
* ambari-server/src/test/python/stacks/2.2/common/test_stack_advisor.py
* ambari-server/src/main/resources/stacks/HDP/2.0.6/services/stack_advisor.py
* ambari-server/src/main/resources/common-services/AMBARI_METRICS/0.1.0/configuration/ams-hbase-env.xml
* ambari-server/src/main/resources/common-services/AMBARI_METRICS/0.1.0/package/scripts/hbase.py


> Tune GC settings for HBase daemon
> ---------------------------------
>
>                 Key: AMBARI-10111
>                 URL: https://issues.apache.org/jira/browse/AMBARI-10111
>             Project: Ambari
>          Issue Type: Task
>          Components: ambari-metrics
>    Affects Versions: 2.0.0
>            Reporter: Siddharth Wagle
>            Assignee: Siddharth Wagle
>             Fix For: 2.0.0
>
>
> After examining the HBase logs and GC logs, we found the following:
> 1. The RS stopped because ZK was unreachable. According to the logs, the cause is a long GC pause:
> {code}
> 2015-03-17 15:34:33,810 WARN  [M:0;perf-400-10mar-b-ams-1:48488.oldLogCleaner] util.Sleeper: We slept 78463ms instead of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-03-17 15:34:33,810 WARN  [RS:0;perf-400-10mar-b-ams-1:38900.compactionChecker] util.Sleeper: We slept 26609ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-03-17 15:34:33,810 WARN  [RS:0;perf-400-10mar-b-ams-1:38900.periodicFlusher] util.Sleeper: We slept 26608ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> {code}
> 2. The GC logs look clean until a sudden, unexpected event, shown here:
> {code}
> 2015-03-17T15:34:10.908+0000: 47483.945: [GC2015-03-17T15:34:10.908+0000: 47483.945: [ParNew: 613440K->68096K(613440K), 0.0645970 secs] 10961428K->10444812K(12514816K), 0.0648180 secs] [Times: user=0.34 sys=0.01, real=0.06 secs]
> 2015-03-17T15:34:11.525+0000: 47484.562: [GC2015-03-17T15:34:11.525+0000: 47484.562: [ParNew: 613424K->68096K(613440K), 0.0585420 secs] 10956776K->10438129K(12514816K), 0.0590000 secs] [Times: user=0.31 sys=0.02, real=0.06 secs]
> 2015-03-17T15:34:12.133+0000: 47485.170: [GC2015-03-17T15:34:12.134+0000: 47485.170: [ParNew (promotion failed): 613346K->613346K(613440K), 0.1492450 secs]2015-03-17T15:34:12.283+0000: 47485.320: [CMS2015-03-17T15:34:28.879+0000: 47501.916: [CMS-concurrent-sweep: 33.271/35.389 secs] [Times: user=144.69 sys=3.76, real=35.39 secs]
>  (concurrent mode failure): 10365250K->4753155K(11901376K), 21.5240950 secs] 10955450K->4753155K(12514816K), [CMS Perm : 41519K->41519K(69388K)], 21.6736450 secs] [Times: user=21.93 sys=0.02, real=21.67 secs]
> 2015-03-17T15:34:34.584+0000: 47507.621: [GC2015-03-17T15:34:34.584+0000: 47507.621: [ParNew: 545344K->59138K(613440K), 0.0552630 secs] 5298499K->4812293K(12514816K), 0.0554990 secs] [Times: user=0.32 sys=0.02, real=0.06 secs]
> {code}
> A possible cause is that the GCE VM paused for more than a minute, leading to the unexpected event.
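
The manual reading of the GC log above could be automated roughly as follows. This is only a sketch, not part of the committed change: the function name `scan_gc_lines`, the one-second threshold, and the trimmed sample lines (timestamps dropped) are assumptions made for illustration. It flags lines carrying the `promotion failed` or `concurrent mode failure` markers, and it ignores the `real=` time of CMS concurrent phases, since those run alongside the application and are not pauses.

```python
import re

# Failure markers seen in the ParNew/CMS excerpt quoted in the issue.
FAILURE_MARKERS = ("promotion failed", "concurrent mode failure")
# A concurrent CMS phase and its timing block; not a stop-the-world pause.
CONCURRENT_RE = re.compile(r"\[CMS-concurrent-[^\]]*\]\s*\[Times:[^\]]*\]")
# Wall-clock time reported in a [Times: ...] block, e.g. "real=21.67 secs".
REAL_RE = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def scan_gc_lines(lines, pause_threshold_secs=1.0):
    """Return (line_index, markers, longest_pause_secs) for suspicious lines.

    pause_threshold_secs is a hypothetical cutoff chosen for this sketch.
    """
    findings = []
    for index, line in enumerate(lines):
        markers = [m for m in FAILURE_MARKERS if m in line]
        # Drop concurrent-phase timings before looking for pause durations.
        pauses = [float(p) for p in REAL_RE.findall(CONCURRENT_RE.sub("", line))]
        longest = max(pauses) if pauses else 0.0
        if markers or longest >= pause_threshold_secs:
            findings.append((index, markers, longest))
    return findings


# Sample lines trimmed from the excerpt in the issue (timestamps removed).
sample = [
    "[ParNew: 613440K->68096K(613440K), 0.0645970 secs] "
    "10961428K->10444812K(12514816K), 0.0648180 secs] "
    "[Times: user=0.34 sys=0.01, real=0.06 secs]",
    "[ParNew (promotion failed): 613346K->613346K(613440K), 0.1492450 secs] "
    "[CMS [CMS-concurrent-sweep: 33.271/35.389 secs] "
    "[Times: user=144.69 sys=3.76, real=35.39 secs]",
    " (concurrent mode failure): 10365250K->4753155K(11901376K), 21.5240950 secs] "
    "10955450K->4753155K(12514816K), [CMS Perm : 41519K->41519K(69388K)], "
    "21.6736450 secs] [Times: user=21.93 sys=0.02, real=21.67 secs]",
]

for index, markers, longest in scan_gc_lines(sample):
    print(index, markers, longest)
```

The ordinary ParNew collection (0.06 s) is not reported, while the two lines describing the failed promotion and the resulting 21.67 s full collection are.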
> *Recommendations*:
> - Tune GC settings for the HBase daemon.
> - Create a canary script to check the health of the system and restart the underlying daemons if needed. A separate Jira will be created for this, targeting 2.1.
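
The canary recommendation could take a shape like the sketch below. The issue does not describe how the canary should work, so everything here is hypothetical: the function `run_canary`, its parameters, and the rule of restarting only after several consecutive failed probes. The health probe and the restart action are passed in as callables, because the real checks and restart commands are not specified in the issue.

```python
import time


def run_canary(check_health, restart, max_failures=3, interval_secs=0.0,
               sleep=time.sleep):
    """Probe until healthy, or restart after max_failures consecutive failures.

    check_health: callable returning True when the daemon looks healthy.
    restart: callable that restarts the daemon (hypothetical hook).
    Returns True if a restart was triggered, False otherwise.
    """
    failures = 0
    while failures < max_failures:
        if check_health():
            return False
        failures += 1
        if failures < max_failures:
            # Wait before probing again so a short stall is not
            # mistaken for a dead daemon.
            sleep(interval_secs)
    restart()
    return True


# Stub probe that always fails, standing in for a real health check.
restarted = run_canary(lambda: False, lambda: print("restarting daemon"))
print(restarted)
```

Requiring several consecutive failures is one way to avoid restarting a daemon that is merely in the middle of a long collection; the right count and interval would depend on the deployment.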



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
