hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Trying to understand HBase/ZooKeeper Logs
Date Wed, 03 Mar 2010 18:15:09 GMT
Michael,

Grep your master log for "Received report from unknown server". If
you find it, it means that you have DNS flapping. This may explain
why you see a "new instance", which in this case would be the master
registering the region server a second or third time. The patch in
this JIRA fixes the issue:
https://issues.apache.org/jira/browse/HBASE-2174
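
For anyone who prefers a script to a one-off grep, here is a minimal sketch of the check described above. It only looks for the literal message text quoted in this mail; the log path on the command line is whatever your installation uses, and the function name is made up for illustration.

```python
import sys

# Literal message text quoted in the mail above.
MARKER = "Received report from unknown server"


def find_unknown_server_reports(lines):
    """Return (line_number, text) for every line containing MARKER."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if MARKER in line
    ]


# Usage (the path is an assumption, pass your own master log):
#   python find_reports.py /path/to/hbase-master.log
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log:
        hits = find_unknown_server_reports(log)
    for number, text in hits:
        print(f"{number}: {text}")
    print(f"{len(hits)} matching line(s)")
```

Any matches suggest the master saw a report from a server name it did not recognise, which is the symptom described above.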

J-D

On Wed, Mar 3, 2010 at 9:28 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>
>
>
>> Date: Wed, 3 Mar 2010 09:17:06 -0800
>> From: phunt@apache.org
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
> [SNIP]
>> There are a few issues involved with the ping time:
>>
>> 1) the network (obv :-) )
>> 2) the zk server - if the server is highly loaded the pings may take
>> longer. The heartbeat is also a "health check" that the client is doing
>> against the server (as much as it is a "health check" for the server
>> that the client is still live). The HB is routed "all the way" through
>> the ZK server, ie through the processing pipeline. So if the server were
>> stalled it would not respond immediately (vs say reading the HB at the
>> thread that reads data from the client). You can see the min/max/avg
>> request latencies on the zk server by using the "stat" 4letter word. See
>> the ZK admin docs on this http://bit.ly/dglVld
>> 3) the zk client - clients can only process HB responses if they are
>> running. Say the JVM GC runs in blocking mode, this will block all
>> client threads (incl the zk client thread) and the HB response will sit
>> until the GC is finished. This is why HBase RSs typically use very very
>> large (from our, zk, perspective) session timeouts.
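
The "stat" four-letter word mentioned in point 2 can also be queried from a script. The following is a rough sketch, not an official client: it assumes the server listens on the default client port 2181, that the reply contains a line of the form "Latency min/avg/max: a/b/c" (as I recall from ZooKeeper's stat output), and that four-letter words are enabled on your server. The host name and function names are hypothetical.

```python
import re
import socket

# Assumed shape of the latency line in the "stat" reply.
LATENCY_RE = re.compile(r"Latency min/avg/max:\s*([\d.]+)/([\d.]+)/([\d.]+)")


def zk_four_letter(host, port=2181, command=b"stat", timeout=5.0):
    """Send a four-letter word to a ZooKeeper server and return the reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_latency(stat_text):
    """Return (min, avg, max) request latency as floats, or None if absent."""
    match = LATENCY_RE.search(stat_text)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


# Usage (host name is a placeholder):
#   print(parse_latency(zk_four_letter("zk1.example.com")))
```

A max latency that is large compared with the session timeout would point at the server or network side rather than at client GC pauses.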
>>
>> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
>>
>> I can't directly shed light on this (ie what's the problem in hbase that
>> could cause your issue). I'll let jd/stack comment on that.
>>
>> Patrick
>>
>
> Thanks for the quick response.
>
> I'm trying to track down the issue of why we're getting a lot of 'partial' failures.
> Unfortunately this is currently a lot like watching a pot boil. :-(
>
> What I am calling a 'partial failure' is that the region servers are spawning second
> or even third instances where only the last one appears to be live.
>
> From what I can tell, there's a spike of network activity that causes one of the
> processes to think that there is something wrong and spawn a new instance.
>
> Is this a good description?
>
> Because some of the failures occur late at night with no load on the system, I suspect
> that we have issues with the network but I can't definitively say.
>
> Which process is the most sensitive to network latency issues?
>
> Sorry, still relatively new to HBase and I'm trying to track down a nasty issue that
> causes HBase to fail on an almost regular basis. I think it's a networking issue, but I
> can't be sure.
>
> Thx
>
> -Mike
