hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Trying to understand HBase/ZooKeeper Logs
Date Wed, 03 Mar 2010 18:15:09 GMT

Grep your master log for "Received report from unknown server". If you
find it, you have DNS flapping. That may explain why you see a "new
instance": in this case it would be the master registering the same
region server a second or third time. The patch in this jira fixes
this issue
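
For anyone who wants to script that check, here is a minimal sketch in Python. It only looks for the message text quoted above; the function names and the log path in the usage comment are hypothetical, and the rest of the log line layout is not assumed.

```python
MARKER = "Received report from unknown server"


def find_unknown_server_reports(lines):
    """Return every line that contains the marker message, newline stripped."""
    return [line.rstrip("\n") for line in lines if MARKER in line]


def scan_log(path):
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as handle:
        return find_unknown_server_reports(handle)


# Usage (the path is a placeholder, adjust to your install):
#   for hit in scan_log("/path/to/hbase-master.log"):
#       print(hit)
```

If the same region server shows up repeatedly in the hits, that is consistent with the master re-registering it, as described above.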


On Wed, Mar 3, 2010 at 9:28 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>> Date: Wed, 3 Mar 2010 09:17:06 -0800
>> From: phunt@apache.org
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> [SNIP]
>> There are a few issues involved with the ping time:
>> 1) the network (obv :-) )
>> 2) the zk server - if the server is highly loaded the pings may take
>> longer. The heartbeat is also a "health check" that the client is doing
>> against the server (as much as it is a "health check" for the server
>> that the client is still live). The HB is routed "all the way" through
>> the ZK server, ie through the processing pipeline. So if the server were
>> stalled it would not respond immediately (vs say reading the HB at the
>> thread that reads data from the client). You can see the min/max/avg
>> request latencies on the zk server by using the "stat" 4letter word. See
>> the ZK admin docs on this http://bit.ly/dglVld
>> 3) the zk client - clients can only process HB responses if they are
>> running. Say the JVM GC runs in blocking mode, this will block all
>> client threads (incl the zk client thread) and the HB response will sit
>> until the GC is finished. This is why HBase RSs typically use very very
>> large (from our, zk, perspective) session timeouts.
>> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
>> I can't directly shed light on this (ie what's the problem in hbase that
>> could cause your issue). I'll let jd/stack comment on that.
>> Patrick
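
The "stat" four-letter word mentioned in point 2 can also be queried from a script. The following is a minimal sketch in Python, assuming the server answers on its client port and that the reply contains a "Latency min/avg/max:" line; the function names, host, and port are placeholders, not part of any ZooKeeper client library.

```python
import re
import socket


def zk_four_letter(host, port, word="stat", timeout=5.0):
    """Send a four-letter word over a plain TCP socket and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_latency(stat_text):
    """Return (min, avg, max) from the latency line, or None if it is absent."""
    number = r"(\d+(?:\.\d+)?)"
    pattern = r"Latency min/avg/max:\s*" + number + "/" + number + "/" + number
    match = re.search(pattern, stat_text)
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())


# Usage (host and port are placeholders):
#   print(parse_latency(zk_four_letter("zk-host.example", 2181)))
```

A large max compared with avg would point at occasional server-side stalls rather than a steadily slow server.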
> Thanks for the quick response.
> I'm trying to track down the issue of why we're getting a lot of 'partial' failures.
> Unfortunately this is currently a lot like watching a pot boil. :-(
> What I am calling a 'partial failure' is that the region servers are spawning second
> or even third instances where only the last one appears to be live.
> From what I can tell, a spike of network activity causes one of the
> processes to think that something is wrong and spawn a new instance.
> Is this a good description?
> Because some of the failures occur late at night with no load on the system, I suspect
> that we have issues with the network but I can't definitively say.
> Which process is the most sensitive to network latency issues?
> Sorry, still relatively new to HBase and I'm trying to track down a nasty issue that
> causes HBase to fail on an almost regular basis. I think it's a networking issue, but I can't
> be sure.
> Thx
> -Mike
