From: Patrick Hunt
Date: Fri, 26 Mar 2010 10:54:03 -0700
To: hbase-user@hadoop.apache.org
Subject: Re: Cannot open filename Exceptions
Message-ID: <4BACF4BB.8010502@apache.org>

Zheng Lv wrote:
> I didn't change the tick value, and I will do it right now. But I wanna
> know why the timeout value can only be 20 times bigger than ticktime, can
> you tell me?

The limit is mainly there to keep users from shooting themselves in the
foot. Typically (we've not seen any case other than HBase where this has
been necessary) you want low timeouts, in the 5-10 second range, perhaps
30 seconds on the outside. This results in sessions being cleaned up
quickly, and in general allows clients to be very responsive to failures.

As HBase RS are affected by limitations in the Sun GC, we've had to force a
larger timeout than would normally be used, via a workaround. In ZooKeeper
3.3 we added new configuration options specific to this HBase use case
(there is now a parameter to control the max timeout limitation). In 3.3
we've also added log messages on server/client to log the negotiated
timeout, and the client API provides programmatic access to the negotiated
timeout.

There are other reasons why we have min/max limits on the negotiated
timeout, in particular to limit memory use on the server. There is state
associated with each session, and we do not want this to grow too large;
having a max timeout limit effectively helps to cap it.

Patrick

>
> 2010/3/26 Jean-Daniel Cryans
>
>> 4 CPUs seems ok, unless you are running 2-3 MR tasks at the same time.
>>
>> So your value for the timeout is 240000, but did you change the tick
>> time? The GC pause you got seemed to last almost a minute which, if
>> you did not change the tick value, matches 3000*20 (disregard your
>> session timeout).
>>
>> J-D
>>
>> On Thu, Mar 25, 2010 at 1:07 AM, Zheng Lv
>> wrote:
>>> Hello J-D,
>>> Thank you for your reply first.
>>> >How many CPUs do you have?
>>> Every server has 2 Dual-Core cpus.
>>> >Are you swapping?
>>> Now I'm not sure about it with our monitor tools, but now we have written
>>> a script to record vmstat log every 2 seconds. If something wrong happen
>>> again, we can take it.
>>> >Also if the only you are using this system currently to batch load
>>> >data or as an analytics backend, you probably want to set the timeout
>>> >higher:
>>> But our value of this property is already 240000.
>>>
>>> We will try to optimize our garbage collector and we will see what will
>>> happen.
>>> Thanks again, J-D,
>>> LvZheng
>>>
>>> 2010/3/25 Jean-Daniel Cryans
>>>
>>>> 2010-03-24 11:33:52,331 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 54963ms, ten times longer than scheduled: 3000
>>>>
>>>> You had an important garbage collector pause (aka pause of the world
>>>> in java-speak) and your region server's session with zookeeper expired
>>>> (it literally stopped responding for too long, so long it was
>>>> considered dead). Are you swapping? How many CPUs do you have? If you
>>>> are slowing down the garbage collecting process, it will take more
>>>> time.
>>>>
>>>> Also if the only you are using this system currently to batch load
>>>> data or as an analytics backend, you probably want to set the timeout
>>>> higher:
>>>>
>>>>   <property>
>>>>     <name>zookeeper.session.timeout</name>
>>>>     <value>60000</value>
>>>>     <description>ZooKeeper session timeout.
>>>>       HBase passes this to the zk quorum as suggested maximum time for a
>>>>       session. See
>>>>       http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
>>>>       "The client sends a requested timeout, the server responds with the
>>>>       timeout that it can give the client. The current implementation
>>>>       requires that the timeout be a minimum of 2 times the tickTime
>>>>       (as set in the server configuration) and a maximum of 20 times
>>>>       the tickTime." Set the zk ticktime with
>>>>       hbase.zookeeper.property.tickTime.
>>>>       In milliseconds.
>>>>     </description>
>>>>   </property>
>>>>
>>>> This value can only be 20 times bigger than this:
>>>>
>>>>   <property>
>>>>     <name>hbase.zookeeper.property.tickTime</name>
>>>>     <value>3000</value>
>>>>     <description>Property from ZooKeeper's config zoo.cfg.
>>>>       The number of milliseconds of each tick. See
>>>>       zookeeper.session.timeout description.
>>>>     </description>
>>>>   </property>
>>>>
>>>> So you could set tick to 6000, timeout to 120000 for a 2min timeout.
>>>>
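
The timeout arithmetic discussed in this thread can be sketched in a few
lines of Python. This is only an illustration, not ZooKeeper's actual code:
the function name negotiated_timeout and its parameters are hypothetical,
and it assumes the server simply clamps the requested session timeout to
between 2 and 20 times the tickTime, as the quoted property description
states.

```python
def negotiated_timeout(requested_ms, tick_ms, min_ticks=2, max_ticks=20):
    """Clamp a requested session timeout to [min_ticks, max_ticks] * tick.

    Hypothetical helper: it mirrors the rule quoted in the thread, not the
    ZooKeeper server implementation itself.
    """
    lower = min_ticks * tick_ms   # smallest timeout the server would grant
    upper = max_ticks * tick_ms   # largest timeout the server would grant
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    # Default tick of 3000 ms: a 240000 ms request is capped at 20 * 3000.
    print(negotiated_timeout(240000, 3000))
    # Tick raised to 6000 ms: a 120000 ms request fits under 20 * 6000.
    print(negotiated_timeout(120000, 6000))
```

Under these assumptions, a requested 240000 ms with the default 3000 ms tick
is capped at 60000 ms, which is consistent with the roughly one-minute pause
J-D pointed out. With the tick raised to 6000 ms, a request of 120000 ms is
granted in full.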