hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: RegionServers Crashing every hour in production env
Date Fri, 08 Mar 2013 16:01:44 GMT
HBase 0.94 currently doesn't support Hadoop 2.0.

Can you deploy Hadoop 1.1.1 instead?

Are you using 0.94.5?


On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <pablo@psafe.com> wrote:

> Hey guys,
> as I mentioned in an email a long time ago, the RSs in my cluster did not
> get along and crashed 3 times a day. I tried many of the options we
> discussed in those emails, but none of them solved the problem. Since I was
> using an old version of hadoop, I thought that was the cause.
> So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 to
> hadoop 2.0.0 - hbase 0.94 - zookeeper 3.4.5.
> Unfortunately the RSs did not stop crashing, and worse! Now they crash
> every hour, and sometimes, when the RS that holds -ROOT- crashes, the whole
> cluster gets stuck in transition and everything stops working.
> In this case I need to clean the zookeeper znodes and restart the master
> and the RSs.
> To avoid this case I am running production with only ONE RS and a
> monitoring script that checks every minute whether the RS is ok and, if
> not, restarts it.
> * This case does not get the cluster stuck.
> This is driving me crazy, but I really can't find a solution for the
> cluster.
> I tracked all logs from the start time, 16:49, on all the interesting nodes
> (zoo, namenode, master, rs, dn2, dn9, dn10) and copied here what I think is
> useful.
> There are some strange errors in DATANODE2, such as an error copying a
> block to itself.
> The gc log points to a GC timeout. However, it is very weird that the RS
> spends so much time in GC here while in the other cases it takes 0.001 sec.
> Besides, the time is spent in sys, which makes me think the problem might
> be somewhere else.
> I know that it is a bunch of logs, and that it is very difficult to find
> the problem without much context. But I REALLY need some help. If not the
> solution, then at least what I should read, where I should look, or which
> cases I should monitor.
> Thank you very much,
> Pablo Musa
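
The GC observation in the quoted message (a long pause whose time is mostly in sys rather than user) can be checked mechanically across a whole gc log. This is a minimal sketch, not part of the original thread. It assumes HotSpot-style log lines that end in "[Times: user=... sys=..., real=... secs]"; the function name sys_dominated_pauses and the 1-second threshold are hypothetical choices for illustration.

```python
import re

# Matches the HotSpot suffix "[Times: user=0.01 sys=12.30, real=12.50 secs]".
TIMES_RE = re.compile(
    r"\[Times: user=(?P<user>\d+\.\d+) sys=(?P<sys>\d+\.\d+), "
    r"real=(?P<real>\d+\.\d+) secs\]"
)


def sys_dominated_pauses(lines, min_real_secs=1.0):
    """Return (line_number, user, sys, real) for long pauses where sys > user.

    min_real_secs is an assumed threshold; tune it to the ZooKeeper
    session timeout that the RegionServer must stay under.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = TIMES_RE.search(line)
        if match is None:
            continue  # not a GC timing line
        user = float(match.group("user"))
        sys_time = float(match.group("sys"))
        real = float(match.group("real"))
        if real >= min_real_secs and sys_time > user:
            hits.append((number, user, sys_time, real))
    return hits
```

To use it, pass the lines of the RegionServer gc log, for example sys_dominated_pauses(open(path)) where path is wherever the log is written. Pauses flagged this way are commonly investigated as OS-level stalls (swapping is a frequent suspect) rather than as heap tuning problems, though the log alone does not prove the cause.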