hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Regionserver problems because of datanode timeouts
Date Tue, 09 Mar 2010 18:06:49 GMT
Prior to region server failures, do you see anything in your Munin graphs?

J-D

On Tue, Mar 9, 2010 at 2:02 AM, Ferdy <ferdy.galema@kalooga.com> wrote:
> Hi,
>
> @Michael
> You must be referring to https://issues.apache.org/jira/browse/HBASE-2174.
> This might very well be the problem. I will look into this patch.
>
> @J-D
> We fine-tuned our configuration to the point that every process (both daemons
> and job tasks) should have enough RAM and swapping should never occur. I can
> verify that this indeed is the case. (We're using Munin; it does a fine
> job of monitoring performance basics.)
>
> To sum it up:
> - HBase regionservers are having trouble even when there is no direct load
>   on HBase (but there are other non-HBase Hadoop jobs running)
> - all Hadoop and HBase processes use incremental garbage collection options
> - no swapping ever occurs
> - we can circumvent the problem by using very long timeouts
>
> I have a strong feeling it's network-related, because our non-hbase hadoop
> jobs do generate a lot of DNS requests.
>
> Ferdy
>
> Michael Segel wrote:
>>
>> This looks similar to the problem we were having with 'flaky dns'.
>> I just got the patched sources built. (rpmbuild was barfing on a lot of
>> little things that had to be tweaked....) We put it in place last night and
>> so far we're Ok.
>>
>> This morning our IS guys were in and looked at their logs from over the
>> weekend. It looks like HBase was querying a bunch of different DNS servers
>> on both IPv6 and IPv4.
>> Not sure if this was causing the error or making HBase think our DNS was
>> 'flaky'. (I don't know because I can't control the network or DNS.)
>>
>> If you look back on this thread, you should be able to find the patches.
>>
>> HTH
>>
>> -Mike
>>
>>
>>
>>>
>>> Date: Mon, 8 Mar 2010 10:00:30 -0800
>>> Subject: Re: Regionserver problems because of datanode timeouts
>>> From: jdcryans@apache.org
>>> To: hbase-user@hadoop.apache.org
>>>
>>> Hey,
>>>
>>> Sorry I forgot to answer last month, your mail slipped through the others.
>>>
>>> So you were saying that HBase would be failing even if not used, but
>>> as you said the cluster itself is used for heavy IO jobs. This is a
>>> problem for the default hbase configurations as you can see, but I'm
>>> also wondering if you are swapping a lot with all those concurrent tasks
>>> running on your nodes? If so, it could explain the "random" failures
>>> easily.
>>>
>>> J-D
>>>
>>> On Mon, Mar 8, 2010 at 3:04 AM, Ferdy <ferdy.galema@kalooga.com> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> Perhaps this is of use to anyone else:
>>>>
>>>> We tried all hbase versions up to 0.20.3 but it did not seem to make a
>>>> difference for our problem. However, when using the following properties
>>>> the problem seems to be solved. The properties simply apply ridiculously
>>>> long timeouts, but that doesn't bother us since we only use hbase for
>>>> offline processing.
>>>>
>>>> <!--
>>>> in hbase-site.xml:
>>>> -->
>>>>
>>>>  <property>
>>>>  <name>hbase.zookeeper.property.tickTime</name>
>>>>  <value>20000</value>
>>>>  <description>Property from ZooKeeper's config zoo.cfg.
>>>>  The number of milliseconds of each tick.  See
>>>>  zookeeper.session.timeout description.
>>>>  </description>
>>>>  </property>
>>>>
>>>>  <property>
>>>>  <name>hbase.zookeeper.property.initLimit</name>
>>>>  <value>20</value>
>>>>  <description>Property from ZooKeeper's config zoo.cfg.
>>>>  The number of ticks that the initial synchronization phase can take.
>>>>  </description>
>>>>  </property>
>>>>
>>>>  <property>
>>>>  <name>hbase.zookeeper.property.syncLimit</name>
>>>>  <value>20</value>
>>>>  <description>Property from ZooKeeper's config zoo.cfg.
>>>>  The number of ticks that can pass between sending a request and getting
>>>>  an acknowledgment.
>>>>  </description>
>>>>  </property>
>>>>
>>>>  <property>
>>>>  <name>zookeeper.session.timeout</name>
>>>>  <value>400000</value>
>>>>  <description>ZooKeeper session timeout.
>>>>    HBase passes this to the zk quorum as suggested maximum time for a
>>>>    session.  See
>>>>    http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
>>>>    "The client sends a requested timeout, the server responds with the
>>>>    timeout that it can give the client. The current implementation
>>>>    requires that the timeout be a minimum of 2 times the tickTime
>>>>    (as set in the server configuration) and a maximum of 20 times
>>>>    the tickTime." Set the zk ticktime with
>>>>    hbase.zookeeper.property.tickTime.
>>>>    In milliseconds.
>>>>  </description>
>>>>  </property>
>>>>
>>>>  <property>
>>>>  <name>hbase.regionserver.lease.period</name>
>>>>  <value>400000</value>
>>>>  <description>HRegion server lease period in milliseconds. Default is
>>>>    60 seconds. Clients must report in within this period else they are
>>>>    considered dead.</description>
>>>>  </property>
>>>>
>>>>  <property>
>>>>  <name>dfs.socket.timeout</name>
>>>>  <value>400000</value>
>>>>  </property>
>>>>
>>>>
>>>> <!--
>>>> in hdfs-site.xml:
>>>> -->
>>>>
>>>>  <property>
>>>>  <name>dfs.socket.timeout</name>
>>>>  <value>400000</value>
>>>>  </property>
>>>>
>>>>
>>
>>
>
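
For anyone reusing the values quoted above: the zookeeper.session.timeout description says the granted timeout must fall between 2 and 20 times the tickTime. With tickTime at 20000 ms, the upper bound is 20 * 20000 = 400000 ms, which is exactly the session timeout Ferdy requests. The sketch below only illustrates that arithmetic. The function name is hypothetical, the clamping is an assumption based on the quoted description rather than on the ZooKeeper source, and the 3000 ms tick in the second call is an arbitrary value chosen for comparison.

```python
def negotiated_session_timeout(requested_ms, tick_ms, min_ticks=2, max_ticks=20):
    """Clamp a requested session timeout to [min_ticks, max_ticks] * tick_ms.

    Hypothetical helper: the 2x and 20x factors come from the quoted
    zookeeper.session.timeout description; the clamping itself is an
    assumption about how the server applies those bounds.
    """
    lower = min_ticks * tick_ms
    upper = max_ticks * tick_ms
    return max(lower, min(requested_ms, upper))


# Values from the thread: tickTime=20000, requested timeout=400000.
print(negotiated_session_timeout(400000, 20000))

# Same request with an arbitrary, much shorter tick (assumed for illustration).
print(negotiated_session_timeout(400000, 3000))
```

Under these assumptions, requesting 400000 ms without also raising hbase.zookeeper.property.tickTime would be cut down to 20 ticks, which may be why the two settings are changed together in the quoted configuration.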
