hbase-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: Trying to understand HBase/ZooKeeper Logs
Date Wed, 03 Mar 2010 18:32:25 GMT
Also check the ZK server logs and see if you notice any session 
expirations (esp during this timeframe). "grep -i expir <zk server logs>"
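A minimal Python sketch of the same check, for cases where the matches need line numbers or further filtering. It only mirrors the case-insensitive "expir" match of the grep above; the function name and the sample lines in the usage note are hypothetical, not real ZK log output.

```python
import re
from typing import Iterable, List, Tuple

# Case-insensitive match, equivalent in spirit to `grep -i expir`.
EXPIR = re.compile(r"expir", re.IGNORECASE)


def find_expirations(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines that mention an expiration."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if EXPIR.search(line)
    ]
```

Passing an open log file (for example `find_expirations(open(path, errors="replace"))`) yields the matching lines with their positions, which makes it easier to compare them against the timeframe in question.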

Patrick

Jean-Daniel Cryans wrote:
> Michael,
> 
> Grep your master log for "Received report from unknown server" and if
> you do find it, it means that you have DNS flapping. This may explain
> why you see a "new instance" which in this case would be the master
> registering the region server a second or third time. The patch in
> this JIRA fixes the issue:
> https://issues.apache.org/jira/browse/HBASE-2174
> 
> J-D
> 
> On Wed, Mar 3, 2010 at 9:28 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>>
>>
>>> Date: Wed, 3 Mar 2010 09:17:06 -0800
>>> From: phunt@apache.org
>>> To: hbase-user@hadoop.apache.org
>>> Subject: Re: Trying to understand HBase/ZooKeeper Logs
>> [SNIP]
>>> There are a few issues involved with the ping time:
>>>
>>> 1) the network (obv :-) )
>>> 2) the zk server - if the server is highly loaded the pings may take
>>> longer. The heartbeat is also a "health check" that the client is doing
>>> against the server (as much as it is a "health check" for the server
>>> that the client is still live). The HB is routed "all the way" through
>>> the ZK server, ie through the processing pipeline. So if the server were
>>> stalled it would not respond immediately (vs say reading the HB at the
>>> thread that reads data from the client). You can see the min/max/avg
>>> request latencies on the zk server by using the "stat" 4letter word. See
>>> the ZK admin docs on this http://bit.ly/dglVld
>>> 3) the zk client - clients can only process HB responses if they are
>>> running. Say the JVM GC runs in blocking mode, this will block all
>>> client threads (incl the zk client thread) and the HB response will sit
>>> until the GC is finished. This is why HBase RSs typically use very very
>>> large (from our, zk, perspective) session timeouts.
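The "stat" check mentioned in point 2 can be scripted. The following is a rough sketch, not an official client: it assumes the server listens on the usual client port 2181, that it permits the "stat" command, and that the reply contains a line of the form "Latency min/avg/max: a/b/c". The function names are hypothetical.

```python
import re
import socket
from typing import Optional, Tuple

# Assumed reply format: "Latency min/avg/max: 0/3/250" (values may be decimals).
LATENCY = re.compile(
    r"Latency min/avg/max:\s*(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)"
)


def parse_latency(stat_output: str) -> Optional[Tuple[float, float, float]]:
    """Extract (min, avg, max) request latency from a "stat" reply, if present."""
    match = LATENCY.search(stat_output)
    if match is None:
        return None
    low, avg, high = (float(group) for group in match.groups())
    return (low, avg, high)


def query_stat(host: str, port: int = 2181, timeout: float = 5.0) -> str:
    """Send the "stat" four-letter word and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling `parse_latency(query_stat("zk-host"))` periodically (the host name is a placeholder) gives a simple way to see whether the max latency spikes around the time a region server re-registers.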
>>>
>>> 50ms is not long btw. I believe that RS are using >> 30sec timeouts.
>>>
>>> I can't directly shed light on this (i.e. what problem in HBase
>>> could cause your issue). I'll let jd/stack comment on that.
>>>
>>> Patrick
>>>
>> Thanks for the quick response.
>>
>> I'm trying to track down the issue of why we're getting a lot of 'partial' failures.
>> Unfortunately this is currently a lot like watching a pot boil. :-(
>>
>> What I am calling a 'partial failure' is that the region servers are spawning second
>> or even third instances where only the last one appears to be live.
>>
>> From what I can tell, there's a spike of network activity that causes one
>> of the processes to think that there is something wrong and spawn a new instance.
>>
>> Is this a good description?
>>
>> Because some of the failures occur late at night with no load on the system, I suspect
>> that we have issues with the network but I can't definitively say.
>>
>> Which process is the most sensitive to network latency issues?
>>
>> Sorry, still relatively new to HBase and I'm trying to track down a nasty issue that
>> causes HBase to fail on an almost regular basis. I think it's a networking issue, but I can't
>> be sure.
>>
>> Thx
>>
>> -Mike
