Mailing-List: contact hbase-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-user@hadoop.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
Message-ID: <4BACF4BB.8010502@apache.org>
Date: Fri, 26 Mar 2010 10:54:03 -0700
From: Patrick Hunt <phunt@apache.org>
User-Agent: Thunderbird 2.0.0.24 (X11/20100317)
MIME-Version: 1.0
To: hbase-user@hadoop.apache.org
Subject: Re: Cannot open filename Exceptions
References: <f7a6f8651003150340i4ca03bd3gb9131044e28bdc2e@mail.gmail.com>
	 <f7a6f8651003182343y67ebf197n8135432e3d29cbef@mail.gmail.com>
	 <7c962aed1003230043h3c6baa36yc258e7b0932c9326@mail.gmail.com>
	 <f7a6f8651003232042k560013aeu511534b30263f1e4@mail.gmail.com>
	 <7c962aed1003232212j518b1bcfj4b9de6fabe5d6f56@mail.gmail.com>
	 <f7a6f8651003240243r523aab9bgd0cdcf5f8cd887a@mail.gmail.com>
	 <f7a6f8651003242001r1dc1da2cq18dadf269aec2c17@mail.gmail.com>
	 <31a243e71003242016h5922c98bh435983daa356e0a2@mail.gmail.com>
	 <f7a6f8651003250107k61d0b25fvbb8d91f32e103945@mail.gmail.com>
	 <31a243e71003250932i3ac29bbdr5d61feaef4e1948@mail.gmail.com>
 <f7a6f8651003251847p2ee49711n4a9d8cd8735d4de8@mail.gmail.com>
In-Reply-To: <f7a6f8651003251847p2ee49711n4a9d8cd8735d4de8@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Zheng Lv wrote:
>   I didn't change the tick value, and I will do it right now. But I wanna
> know why the timeout value can only be 20 times bigger than ticktime, can
> you tell me?

The limit is mainly there to keep users from shooting themselves in the
foot. Typically (we've not seen any case other than HBase where this has
been necessary) you want low timeouts, in the 5-10 second range, perhaps
30 seconds on the outside. This results in sessions being cleaned up
quickly, and in general allows clients to be very responsive to
failures. As HBase region servers are affected by limitations in the Sun
GC, we've had to force a larger timeout than would normally be used, via
a workaround. In ZooKeeper 3.3 we added new configuration options
specific to this HBase use case (there is now a parameter to control the
max timeout limit). In 3.3 we've also added log messages on
server/client to log the negotiated timeout, and the client API provides
programmatic access to the negotiated timeout.

There are other reasons why we have min/max limits on the negotiated
timeout, in particular to limit memory use on the server. There is state
associated with each session, and we do not want this to grow too large;
having a max timeout limit effectively helps to cap it.
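
To make the arithmetic concrete, here is a rough sketch of the clamping
described in the docs quoted below. It is written in Python for brevity
and is not the ZooKeeper server code; the function names and the
min_ms/max_ms arguments are made up for illustration, with max_ms
standing in for the new 3.3 parameter mentioned above.

```python
def negotiate_timeout(requested_ms, tick_ms, min_ms=None, max_ms=None):
    """Clamp a requested session timeout to the server's allowed range.

    Defaults follow the quoted docs: a minimum of 2 times the tick time
    and a maximum of 20 times the tick time. min_ms and max_ms are
    hypothetical overrides for those two limits.
    """
    lower = 2 * tick_ms if min_ms is None else min_ms
    upper = 20 * tick_ms if max_ms is None else max_ms
    if lower > upper:
        raise ValueError("min timeout exceeds max timeout")
    # Values below the range are raised to the minimum,
    # values above it are cut down to the maximum.
    return max(lower, min(requested_ms, upper))


def min_tick_for(timeout_ms, multiplier=20):
    """Smallest tick (ms) whose default max limit admits timeout_ms."""
    return -(-timeout_ms // multiplier)  # ceiling division


if __name__ == "__main__":
    # Requested 240000 with the default tick of 3000: capped at 3000*20.
    print(negotiate_timeout(240000, 3000))
    # Tick of 6000 allows the 2 minute timeout suggested in the thread.
    print(negotiate_timeout(120000, 6000))
    # Tick needed for 240000 to survive with the default multiplier.
    print(min_tick_for(240000))
```

With the default tick of 3000 a requested 240000 comes back as 60000,
which is why the larger session timeout had no visible effect.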

Patrick

> 
> 2010/3/26 Jean-Daniel Cryans <jdcryans@apache.org>
> 
>> 4 CPUs seems ok, unless you are running 2-3 MR tasks at the same time.
>>
>> So your value for the timeout is 240000, but did you change the tick
>> time? The GC pause you got seemed to last almost a minute which, if
>> you did not change the tick value, matches 3000*20 (disregard your
>> session timeout).
>>
>> J-D
>>
>> On Thu, Mar 25, 2010 at 1:07 AM, Zheng Lv <lvzheng19800619@gmail.com>
>> wrote:
>>> Hello J-D,
>>>  Thank you for your reply first.
>>>  >How many CPUs do you have?
>>>  Every server has 2 Dual-Core cpus.
>>>  >Are you swapping?
>>>  Now I'm not sure about it with our monitor tools, but now we have
>>> written a script to record vmstat log every 2 seconds. If something
>>> wrong happen again, we can take it.
>>>  >Also, if you are currently only using this system to batch load
>>>  >data or as an analytics backend, you probably want to set the timeout
>>>  >higher:
>>>  But our value of this property is already 240000.
>>>
>>>  We will try to optimize our garbage collector and we will see what will
>>> happen.
>>>  Thanks again, J-D,
>>>    LvZheng
>>>
>>> 2010/3/25 Jean-Daniel Cryans <jdcryans@apache.org>
>>>
>>>> 2010-03-24 11:33:52,331 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 54963ms, ten times longer than scheduled: 3000
>>>>
>>>> You had an important garbage collector pause (aka pause of the world
>>>> in java-speak) and your region server's session with zookeeper expired
>>>> (it literally stopped responding for too long, so long it was
>>>> considered dead). Are you swapping? How many CPUs do you have? If you
>>>> are slowing down the garbage collecting process, it will take more
>>>> time.
>>>>
>>>> Also, if you are currently only using this system to batch load
>>>> data or as an analytics backend, you probably want to set the timeout
>>>> higher:
>>>>
>>>>  <property>
>>>>    <name>zookeeper.session.timeout</name>
>>>>    <value>60000</value>
>>>>    <description>ZooKeeper session timeout.
>>>>      HBase passes this to the zk quorum as suggested maximum time for a
>>>>      session.  See
>>>>      http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
>>>>      "The client sends a requested timeout, the server responds with the
>>>>      timeout that it can give the client. The current implementation
>>>>      requires that the timeout be a minimum of 2 times the tickTime
>>>>      (as set in the server configuration) and a maximum of 20 times
>>>>      the tickTime." Set the zk ticktime with
>>>>      hbase.zookeeper.property.tickTime.
>>>>      In milliseconds.
>>>>    </description>
>>>>  </property>
>>>>
>>>> This value can only be 20 times bigger than this:
>>>>
>>>>  <property>
>>>>    <name>hbase.zookeeper.property.tickTime</name>
>>>>    <value>3000</value>
>>>>    <description>Property from ZooKeeper's config zoo.cfg.
>>>>    The number of milliseconds of each tick.  See
>>>>    zookeeper.session.timeout description.
>>>>    </description>
>>>>  </property>
>>>>
>>>>
>>>> So you could set tick to 6000, timeout to 120000 for a 2min timeout.
>>>>
>