hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jinsong Hu" <jinsong...@hotmail.com>
Subject Re: YouAreDeadException with hbase
Date Fri, 16 Jul 2010 20:30:27 GMT
I already implemented all these configs before my test. I checked 
gc-hbase.log,

I see a GC failure which looks very suspecious:

5515.974: [GC 5515.974: [ParNew: 19120K->2112K(19136K), 0.8344240 secs] 
2016283K
->2007324K(2069308K) icms_dc=100 , 0.8345620 secs] [Times: user=0.08 
sys=0.00, r
eal=0.83 secs]

5517.355: [GC 5517.355: [ParNew (promotion failed): 18113K->18113K(19136K), 
0.77
00840 secs]5518.125: [CMS5649.813: [CMS-concurrent-mark: 171.151/310.961 
secs] [
Times: user=95.87 sys=3.06, real=310.97 secs]
 (concurrent mode failure): 2009649K->572240K(2054592K), 280.2155930 secs] 
20233
25K->572240K(2073728K), [CMS Perm : 18029K->17976K(30064K)] icms_dc=100 , 
281.03
57280 secs] [Times: user=4.55 sys=4.07, real=281.05 secs]

5798.909: [GC [1 CMS-initial-mark: 572240K(2054592K)] 573896K(2092928K), 
0.05792
20 secs] [Times: user=0.01 sys=0.00, real=0.08 secs]


the concurrent mode failure, and  281.05 seconds of GC time, looks like the 
culprit for the problem.
I just wonder how to resolve this issue.

Jimmy.


--------------------------------------------------
From: "Ryan Rawson" <ryanobjc@gmail.com>
Sent: Friday, July 16, 2010 12:57 PM
To: <user@hbase.apache.org>
Subject: Re: YouAreDeadException with hbase

> Sometimes the GC can chain multiple medium pauses into one large
> pause.  I've seen this before where there are 2 long pauses back to
> back and the result was a 50second+ pause.
>
> This article talks a lot about GC performance and tuning, check it out:
> http://wiki.apache.org/hadoop/PerformanceTuning
>
> -ryan
>
> On Fri, Jul 16, 2010 at 11:55 AM, Jinsong Hu <jinsong_hu@hotmail.com> 
> wrote:
>> Yes, the root cause seems to be the gap of 4 minutes between
>> 2010-07-16 05:49:26,805   and 2010-07-16 05:53:23,476 .  but I checked GC
>> gc-hbase.log and don't see
>> 4 minute gap in gc. I just wonder what could cause this large gap.
>>
>> I also wonder if there is a configuration that I can do to avoid this 
>> long
>> pause, or get around the problem
>> cause by this long pause.
>>
>> Jimmy
>>
>> --------------------------------------------------
>> From: "Stack" <stack@duboce.net>
>> Sent: Friday, July 16, 2010 11:44 AM
>> To: <user@hbase.apache.org>
>> Subject: Re: YouAreDeadException with hbase
>>
>>> You'll see this if the server reports to the master after the master
>>> has ruled it 'dead'.
>>>
>>> Here is the code that produces the exception:
>>>
>>>   if (!isDead(serverName)) return;
>>>   String message = "Server " + what + " rejected; currently processing " 
>>> +
>>>     serverName + " as dead server";
>>>   LOG.debug(message);
>>>   throw new YouAreDeadException(message);
>>>
>>> Servers are on the 'dead' list if zk reports their session has
>>> expired.  The master moves then to cleanup after the dead server and
>>> process its logs.  If during this cleanup time the server reports in,
>>> master will return the youaredead exception.
>>>
>>> Usually the RS has lost its zk session but has yet to realize it.
>>>
>>> St.Ack
>>>
>>> On Thu, Jul 15, 2010 at 11:52 PM, Jinsong Hu <jinsong_hu@hotmail.com>
>>> wrote:
>>>>
>>>> Hi, There:
>>>>  I got some YouAreDeadException  with hbase. what can cause it ? I do
>>>> notice
>>>> between 5:49 to 5:53 ,
>>>> for 4 minutes, there is no log. This doesn't look like GC issue as I
>>>> checked
>>>> the GC log, the longest GC
>>>> is only 9.6 seconds.
>>>>
>>>> Jimmy.
>>>>
>>>>
>>>> 2010-07-16 05:49:26,805 DEBUG
>>>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Ca
>>>> che Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB
>>>> (425113472),
>>>> Max=4
>>>> 08.775MB (428631648), Counts: Blocks=1, Access=2178914, Hit=1034,
>>>> Miss=2177880,
>>>> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.04745483165606856%, Miss
>>>> Ratio=99.95
>>>> 254278182983%, Evicted/Run=NaN
>>>> 2010-07-16 05:53:23,476 DEBUG
>>>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Ca
>>>> che Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB
>>>> (425113472),
>>>> Max=4
>>>> 08.775MB (428631648), Counts: Blocks=1, Access=2178915, Hit=1035,
>>>> Miss=2177880,
>>>> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.04750070511363447%, Miss
>>>> Ratio=99.95
>>>> 250105857849%, Evicted/Run=NaN
>>>>
>>>> ....
>>>> 2010-07-16 05:53:26,171 INFO org.apache.zookeeper.ClientCnxn: Client
>>>> session
>>>> tim
>>>> ed out, have not heard from server in 240540ms for sessionid
>>>> 0x329c88039b0006c,
>>>> closing socket connection and attempting reconnect
>>>> 2010-07-16 05:53:27,333 INFO org.apache.zookeeper.ClientCnxn: Opening
>>>> socket
>>>> con
>>>> nection to server t-zookeeper2.cloud.ppops.net/10.110.24.57:2181
>>>> 2010-07-16 05:53:27,334 INFO org.apache.zookeeper.ClientCnxn: Socket
>>>> connection
>>>> established to t-zookeeper2.cloud.ppops.net/10.110.24.57:2181, 
>>>> initiating
>>>> sessio
>>>> n
>>>> 2010-07-16 05:53:27,335 INFO org.apache.zookeeper.ClientCnxn: Unable to
>>>> reconnec
>>>> t to ZooKeeper service, session 0x329c88039b0006c has expired, closing
>>>> socket co
>>>> nnection
>>>> 2010-07-16 05:53:27,896 INFO org.apache.zookeeper.ClientCnxn: Client
>>>> session
>>>> tim
>>>> ed out, have not heard from server in 240520ms for sessionid
>>>> 0x129c87a7f98007a,
>>>> closing socket connection and attempting reconnect
>>>>
>>>>
>>>> 2010-07-16 05:53:39,090 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer
>>>> : Aborting region server
>>>> serverName=m0002028.ppops.net,60020,1279237223465,
>>>> load
>>>> =(requests=952, regions=21, usedHeap=575, maxHeap=2043): Unhandled
>>>> exception
>>>> org.apache.hadoop.hbase.YouAreDeadException:
>>>> org.apache.hadoop.hbase.YouAreDeadE
>>>> xception: Server REPORT rejected; currently processing
>>>> m0002028.ppops.net,60020,
>>>> 1279237223465 as dead server
>>>>      at
>>>> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManage
>>>> r.java:217)
>>>>      at
>>>> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(Serve
>>>> rManager.java:271)
>>>>      at
>>>> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.jav
>>>> a:684)
>>>>      at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>>>>      at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
>>>> sorImpl.java:25)
>>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>>      at
>>>> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>>>>      at
>>>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:
>>>> 919)
>>>>
>>>>      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>>> Method)
>>>>
>>>
>>
> 

Mime
View raw message