Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of saint.ack@gmail.com
 designates 209.85.214.46 as permitted sender)
MIME-Version: 1.0
Sender: saint.ack@gmail.com
In-Reply-To: <513A34D0.80507@psafe.com>
References: <513A0758.3090507@psafe.com>
	<CALte62wk7h2Mg8hj2-pc-+6gnvXJx-A6o6S4EP3MVdjOgozqgg@mail.gmail.com>
	<CAAT7MkoTwPJt4=YerAMaCCTok9QZXTuHXyWKisMZDuNO-cydKg@mail.gmail.com>
	<CADcMMgER-mtEW9V00PaApneY-M+yZ6V-Fa5SNXr+CCTYCFqi7A@mail.gmail.com>
	<513A34D0.80507@psafe.com>
Date: Fri, 8 Mar 2013 14:02:16 -0800
Message-ID: 
 <CADcMMgFrsdVCBwoXdKY-i21RVxwJpVeJpP_-hJciOwYJwDF5mA@mail.gmail.com>
Subject: Re: RegionServers Crashing every hour in production env
From: Stack <stack@duboce.net>
To: Hbase-User <user@hbase.apache.org>
Content-Type: multipart/alternative; boundary=f46d041c46aaf857cd04d770f98b

--f46d041c46aaf857cd04d770f98b
Content-Type: text/plain; charset=UTF-8

On Fri, Mar 8, 2013 at 10:58 AM, Pablo Musa <pablo@psafe.com> wrote:

>> 0.94 currently doesn't support hadoop 2.0
>> Can you deploy hadoop 1.1.1 instead?
>>
>
> I am using cdh4.2.0 which uses this version as default installation.
> I think it will be a problem for me to deploy 1.1.1 because I would need to
> "upgrade" the whole cluster with 70TB of data (backup everything, go
> offline, etc.).
>
> Is there a problem with using cdh4.2.0?
> Should I send my email to the cdh list?
>
>
That combo should be fine.


>> You Full GC'ing around this time?
>>
>
> The GC log shows it took a long time. However, it does not make sense
> for GC to be the cause, since the same amount of data was cleaned before
> and AFTER in just 0.01 secs!
>
>
If JVM is full GC'ing, the application is stopped.


>
> [Times: user=0.08 sys=137.62, real=137.62 secs]
>
> Besides, the whole time was spent in system time. That is what is bugging me.
>
>
The below does not look like a full GC (it is a ParNew collection), but it
is a long pause in system time, enough to kill your zk session.
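
If it helps to hunt for more of these, here is a rough Python sketch that
scans a GC log for the "[Times: ...]" suffix and reports pauses longer than
a threshold. The function names and the 30 second threshold in the usage
note are my own placeholders, not anything from HBase or the JVM; set the
threshold to whatever your zk session timeout actually is. It assumes each
log entry sits on one line, as in the original log file rather than the
wrapped copy in this mail.

```python
import re

# Matches the "[Times: user=... sys=..., real=... secs]" suffix printed at
# the end of each collection entry in the GC log.
TIMES_RE = re.compile(
    r"\[Times: user=(?P<user>[\d.]+)\s+sys=(?P<sys>[\d.]+),"
    r"\s+real=(?P<real>[\d.]+) secs\]"
)


def find_long_pauses(log_text, threshold_secs):
    """Return (user, sys, real) for every entry whose real time exceeds threshold_secs."""
    pauses = []
    for match in TIMES_RE.finditer(log_text):
        user = float(match.group("user"))
        sys_time = float(match.group("sys"))
        real = float(match.group("real"))
        if real > threshold_secs:
            pauses.append((user, sys_time, real))
    return pauses


def mostly_system_time(user, sys_time, real):
    """Heuristic (an assumption, not a JVM rule): sys exceeds user and is at least half of real."""
    return real > 0 and sys_time > user and sys_time >= 0.5 * real
```

Calling find_long_pauses(open("gc.log").read(), 30.0) on your log (the file
name is hypothetical) and then mostly_system_time() on each hit would
separate pauses spent in the kernel from pauses spent collecting.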

You swapping?
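
One way to check, sketched in Python and assuming a Linux box: the kernel
keeps cumulative pswpin/pswpout page counters in /proc/vmstat, so two
samples taken some time apart show whether pages moved to or from swap in
between. The helper names below are made up for illustration.

```python
def swap_counters(vmstat_text):
    """Extract the cumulative pswpin/pswpout counters from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before_text, after_text):
    """Pages swapped in and out between two /proc/vmstat samples."""
    before = swap_counters(before_text)
    after = swap_counters(after_text)
    return {name: after[name] - before.get(name, 0) for name in after}


def read_vmstat(path="/proc/vmstat"):
    """Read the raw counter file; Linux only."""
    with open(path) as handle:
        return handle.read()
```

Take one read_vmstat() sample, wait through a period that includes a long
pause, take another, and pass both to swap_delta(). Non-zero deltas mean
the box was swapping during that window.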

Hardware is good?

St.Ack


>  ...
>
>
> 1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 secs] 275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
>
> 1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 secs] 269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 sys=0.01, real=0.00 secs]
>
> 1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620 secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: user=0.08 sys=137.62, real=137.62 secs]
>
> 1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 secs] 287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
>
> 1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 secs] 283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
>
>
> I really appreciate you guys helping me to find out what is wrong.
>
> Thanks,
> Pablo
>
>
>
> On 03/08/2013 02:11 PM, Stack wrote:
>
>> What RAM says.
>>
>> 2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client
>> session timed out, have not heard from server in 159348ms for sessionid
>> 0x13d3c4bcba600a7, closing socket connection and attempting reconnect
>>
>> You Full GC'ing around this time?
>>
>> Put up your configs in a place where we can take a look?
>>
>> St.Ack
>>
>>
>> On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan
>> <ramkrishna.s.vasudevan@gmail.com> wrote:
>>
>>> I think it is with your GC config.  What is your heap size?  What is the
>>> data that you pump in and how much is the block cache size?
>>>
>>> Regards
>>> Ram
>>>
>>> On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>
>>>> 0.94 currently doesn't support hadoop 2.0
>>>>
>>>> Can you deploy hadoop 1.1.1 instead ?
>>>>
>>>> Are you using 0.94.5 ?
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <pablo@psafe.com> wrote:
>>>>
>>>>> Hey guys,
>>>>> as I mentioned in an email a long time ago, the RSs in my cluster did
>>>>> not get along and crashed 3 times a day. I tried a lot of the options
>>>>> we discussed in those emails, but they did not solve the problem. As I
>>>>> was using an old version of hadoop, I thought that was the problem.
>>>>>
>>>>> So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to
>>>>> hadoop 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
>>>>>
>>>>> Unfortunately the RSs did not stop crashing, and it got worse! Now they
>>>>> crash every hour, and sometimes, when the RS that holds .ROOT. crashes,
>>>>> the whole cluster gets stuck in transition and everything stops working.
>>>>> In this case I need to clean the zookeeper znodes and restart the master
>>>>> and the RSs.
>>>>> To avoid this, I am running in production with only ONE RS and a
>>>>> monitoring script that checks every minute whether the RS is ok and
>>>>> restarts it if not.
>>>>> * This setup does not get the cluster stuck.
>>>>>
>>>>> This is driving me crazy, but I really can't find a solution for the
>>>>> cluster.
>>>>> I tracked all logs from the start time 16:49 on all the interesting
>>>>> nodes (zoo, namenode, master, rs, dn2, dn9, dn10) and copied here what I
>>>>> think is useful.
>>>>>
>>>>> There are some strange errors in DATANODE2, such as an error copying a
>>>>> block to itself.
>>>>>
>>>>> The gc log points to a GC timeout. However, it is very weird that the RS
>>>>> spends so much time in GC when in the other cases it takes 0.001 sec.
>>>>> Besides, the time is spent in sys, which makes me think the problem
>>>>> might be somewhere else.
>>>>>
>>>>> I know that it is a bunch of logs, and that it is very difficult to
>>>>> find the problem without much context. But I REALLY need some help. If
>>>>> not the solution, then at least what I should read, where I should look,
>>>>> or which cases I should monitor.
>>>>>
>>>>> Thank you very much,
>>>>> Pablo Musa
>>>>>
>>>>>
>

--f46d041c46aaf857cd04d770f98b--