hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jinsong Hu" <jinsong...@hotmail.com>
Subject Re: YouAreDeadException with hbase
Date Fri, 16 Jul 2010 20:50:56 GMT
I was doing stress testing, so the load is not small. But I purposely 
limited the data rate on client
side so load is not big either. using "iostat -x 5" and I can see there are 
lots of situations that
the CPU goes to very high level and stay there for long time. but then it 
ultimately go down.
It is highly possible that during a certain period the CPU was too busy and 
the GC process
was starved for CPU.

 I researched this failure and found

an excellent thread talking about GC:

http://forums.sun.com/thread.jspa?threadID=698490

that is more detailed than http://wiki.apache.org/hadoop/PerformanceTuning .
I will do some tunning and see if it helps following the posted config 
there.

Jimmy.

--------------------------------------------------
From: "Ryan Rawson" <ryanobjc@gmail.com>
Sent: Friday, July 16, 2010 1:35 PM
To: <user@hbase.apache.org>
Subject: Re: YouAreDeadException with hbase

> These 2 lines are different GC collections:
>
> 5517.355: [GC 5517.355: [ParNew (promotion failed): 
> 18113K->18113K(19136K), 0.77
> 00840 secs]
> 5518.125: [CMS5649.813: [CMS-concurrent-mark: 171.151/310.961 secs]
> [Times: user=95.87 sys=3.06, real=310.97 secs]
> (concurrent mode failure): 2009649K->572240K(2054592K), 280.2155930 secs] 
> 20233
> 25K->572240K(2073728K), [CMS Perm : 18029K->17976K(30064K)] icms_dc=100 , 
> 281.03
> 57280 secs] [Times: user=4.55 sys=4.07, real=281.05 secs]
>
> It's a little hard to read that, it looks like the CMS concurrent mark
> took 310 seconds then failed, then we got a 281 second real time
> pause, but interestingly enough the user and system time is fairly
> low.
>
> How loaded are these machines?  You need to be giving enough
> uncontended CPU time to hbase.
>
>
> On Fri, Jul 16, 2010 at 1:30 PM, Jinsong Hu <jinsong_hu@hotmail.com> 
> wrote:
>> I already implemented all these configs before my test. I checked
>> gc-hbase.log,
>>
>> I see a GC failure which looks very suspecious:
>>
>> 5515.974: [GC 5515.974: [ParNew: 19120K->2112K(19136K), 0.8344240 secs]
>> 2016283K
>> ->2007324K(2069308K) icms_dc=100 , 0.8345620 secs] [Times: user=0.08
>> sys=0.00, r
>> eal=0.83 secs]
>>
>> 5517.355: [GC 5517.355: [ParNew (promotion failed): 
>> 18113K->18113K(19136K),
>> 0.77
>> 00840 secs]5518.125: [CMS5649.813: [CMS-concurrent-mark: 171.151/310.961
>> secs] [
>> Times: user=95.87 sys=3.06, real=310.97 secs]
>> (concurrent mode failure): 2009649K->572240K(2054592K), 280.2155930 secs]
>> 20233
>> 25K->572240K(2073728K), [CMS Perm : 18029K->17976K(30064K)] icms_dc=100 ,
>> 281.03
>> 57280 secs] [Times: user=4.55 sys=4.07, real=281.05 secs]
>>
>> 5798.909: [GC [1 CMS-initial-mark: 572240K(2054592K)] 573896K(2092928K),
>> 0.05792
>> 20 secs] [Times: user=0.01 sys=0.00, real=0.08 secs]
>>
>>
>> the concurrent mode failure, and  281.05 seconds of GC time, looks like 
>> the
>> culprit for the problem.
>> I just wonder how to resolve this issue.
>>
>> Jimmy.
>>
>>
>> --------------------------------------------------
>> From: "Ryan Rawson" <ryanobjc@gmail.com>
>> Sent: Friday, July 16, 2010 12:57 PM
>> To: <user@hbase.apache.org>
>> Subject: Re: YouAreDeadException with hbase
>>
>>> Sometimes the GC can chain multiple medium pauses into one large
>>> pause.  I've seen this before where there are 2 long pauses back to
>>> back and the result was a 50second+ pause.
>>>
>>> This article talks a lot about GC performance and tuning, check it out:
>>> http://wiki.apache.org/hadoop/PerformanceTuning
>>>
>>> -ryan
>>>
>>> On Fri, Jul 16, 2010 at 11:55 AM, Jinsong Hu <jinsong_hu@hotmail.com>
>>> wrote:
>>>>
>>>> Yes, the root cause seems to be the gap of 4 minutes between
>>>> 2010-07-16 05:49:26,805   and 2010-07-16 05:53:23,476 .  but I checked 
>>>> GC
>>>> gc-hbase.log and don't see
>>>> 4 minute gap in gc. I just wonder what could cause this large gap.
>>>>
>>>> I also wonder if there is a configuration that I can do to avoid this
>>>> long
>>>> pause, or get around the problem
>>>> cause by this long pause.
>>>>
>>>> Jimmy
>>>>
>>>> --------------------------------------------------
>>>> From: "Stack" <stack@duboce.net>
>>>> Sent: Friday, July 16, 2010 11:44 AM
>>>> To: <user@hbase.apache.org>
>>>> Subject: Re: YouAreDeadException with hbase
>>>>
>>>>> You'll see this if the server reports to the master after the master
>>>>> has ruled it 'dead'.
>>>>>
>>>>> Here is the code that produces the exception:
>>>>>
>>>>>  if (!isDead(serverName)) return;
>>>>>  String message = "Server " + what + " rejected; currently processing

>>>>> "
>>>>> +
>>>>>    serverName + " as dead server";
>>>>>  LOG.debug(message);
>>>>>  throw new YouAreDeadException(message);
>>>>>
>>>>> Servers are on the 'dead' list if zk reports their session has
>>>>> expired.  The master moves then to cleanup after the dead server and
>>>>> process its logs.  If during this cleanup time the server reports in,
>>>>> master will return the youaredead exception.
>>>>>
>>>>> Usually the RS has lost its zk session but has yet to realize it.
>>>>>
>>>>> St.Ack
>>>>>
>>>>> On Thu, Jul 15, 2010 at 11:52 PM, Jinsong Hu <jinsong_hu@hotmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi, There:
>>>>>>  I got some YouAreDeadException  with hbase. what can cause it ?
I do
>>>>>> notice
>>>>>> between 5:49 to 5:53 ,
>>>>>> for 4 minutes, there is no log. This doesn't look like GC issue as
I
>>>>>> checked
>>>>>> the GC log, the longest GC
>>>>>> is only 9.6 seconds.
>>>>>>
>>>>>> Jimmy.
>>>>>>
>>>>>>
>>>>>> 2010-07-16 05:49:26,805 DEBUG
>>>>>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Ca
>>>>>> che Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB
>>>>>> (425113472),
>>>>>> Max=4
>>>>>> 08.775MB (428631648), Counts: Blocks=1, Access=2178914, Hit=1034,
>>>>>> Miss=2177880,
>>>>>> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.04745483165606856%, Miss
>>>>>> Ratio=99.95
>>>>>> 254278182983%, Evicted/Run=NaN
>>>>>> 2010-07-16 05:53:23,476 DEBUG
>>>>>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Ca
>>>>>> che Stats: Sizes: Total=3.355194MB (3518176), Free=405.4198MB
>>>>>> (425113472),
>>>>>> Max=4
>>>>>> 08.775MB (428631648), Counts: Blocks=1, Access=2178915, Hit=1035,
>>>>>> Miss=2177880,
>>>>>> Evictions=0, Evicted=0, Ratios: Hit Ratio=0.04750070511363447%, Miss
>>>>>> Ratio=99.95
>>>>>> 250105857849%, Evicted/Run=NaN
>>>>>>
>>>>>> ....
>>>>>> 2010-07-16 05:53:26,171 INFO org.apache.zookeeper.ClientCnxn: Client
>>>>>> session
>>>>>> tim
>>>>>> ed out, have not heard from server in 240540ms for sessionid
>>>>>> 0x329c88039b0006c,
>>>>>> closing socket connection and attempting reconnect
>>>>>> 2010-07-16 05:53:27,333 INFO org.apache.zookeeper.ClientCnxn: Opening
>>>>>> socket
>>>>>> con
>>>>>> nection to server t-zookeeper2.cloud.ppops.net/10.110.24.57:2181
>>>>>> 2010-07-16 05:53:27,334 INFO org.apache.zookeeper.ClientCnxn: Socket
>>>>>> connection
>>>>>> established to t-zookeeper2.cloud.ppops.net/10.110.24.57:2181,
>>>>>> initiating
>>>>>> sessio
>>>>>> n
>>>>>> 2010-07-16 05:53:27,335 INFO org.apache.zookeeper.ClientCnxn: Unable

>>>>>> to
>>>>>> reconnec
>>>>>> t to ZooKeeper service, session 0x329c88039b0006c has expired, 
>>>>>> closing
>>>>>> socket co
>>>>>> nnection
>>>>>> 2010-07-16 05:53:27,896 INFO org.apache.zookeeper.ClientCnxn: Client
>>>>>> session
>>>>>> tim
>>>>>> ed out, have not heard from server in 240520ms for sessionid
>>>>>> 0x129c87a7f98007a,
>>>>>> closing socket connection and attempting reconnect
>>>>>>
>>>>>>
>>>>>> 2010-07-16 05:53:39,090 FATAL
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer
>>>>>> : Aborting region server
>>>>>> serverName=m0002028.ppops.net,60020,1279237223465,
>>>>>> load
>>>>>> =(requests=952, regions=21, usedHeap=575, maxHeap=2043): Unhandled
>>>>>> exception
>>>>>> org.apache.hadoop.hbase.YouAreDeadException:
>>>>>> org.apache.hadoop.hbase.YouAreDeadE
>>>>>> xception: Server REPORT rejected; currently processing
>>>>>> m0002028.ppops.net,60020,
>>>>>> 1279237223465 as dead server
>>>>>>     at
>>>>>> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManage
>>>>>> r.java:217)
>>>>>>     at
>>>>>> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(Serve
>>>>>> rManager.java:271)
>>>>>>     at
>>>>>> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.jav
>>>>>> a:684)
>>>>>>     at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>>>>>>     at
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
>>>>>> sorImpl.java:25)
>>>>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>     at
>>>>>> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>>>>>>     at
>>>>>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:
>>>>>> 919)
>>>>>> 

Mime
View raw message