Message-ID: <4AF85080.9090902@gmail.com>
Date: Mon, 09 Nov 2009 09:25:20 -0800
From: elsif
To: hbase-user@hadoop.apache.org
Subject: Re: HBase Exceptions on version 0.20.1

The larger issue here is that any hbase cluster will reach this tipping point at some point in its lifetime as more and more data is added. We need to have a graceful method to put the cluster into safe mode until more resources can be added or the load on the cluster has been reduced. We cannot allow hbase to run itself into the ground causing data loss or corruption under any circumstances.

Andrew Purtell wrote:
> You should consider provisioning more nodes to get beyond this ceiling you encountered.
>
> DFS write latency spikes from 3 seconds to 6 seconds, to 15! Flushing cannot happen fast enough to avoid an OOME. Possibly there was even insufficient CPU to GC.
> The log entries you highlighted indicate the load you are exerting on your current cluster needs to be spread out over more resources than currently allocated.
>
> This:
>
>> 2009-11-06 09:15:37,144 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 286007ms, ten times longer than scheduled: 10000
>
> indicates a thread that wanted to sleep for 10 seconds was starved for CPU for 286 seconds. Obviously Zookeeper timeouts and resulting HBase process shutdowns, missed DFS heartbeats possibly resulting in spurious declaration of dead datanodes, and other serious problems will result from this.
>
> Did your systems start to swap?
>
> When region servers shut down, the master notices this and splits their HLogs into per region reconstruction logs. These are the "oldlogfile.log" files. The master log will shed light on why this particular reconstruction log was botched. Would have happened at the master. The region server probably did do a clean shutdown. I suspect DFS was in extremis due to overloading so the split failed. The checksum error indicates incomplete write at the OS level. Did a datanode crash?
>
> HBASE-1956 is about making the DFS latency metric exportable via the Hadoop metrics layer, perhaps via Ganglia. Write latency above 1 or 2 seconds is a warning. Anything above 5 seconds is an alarm. It's a good indication that an overloading condition is in progress.
>
> The Hadoop stack, being pre 1.0, has some rough edges. Response to overloading is one of them. For one thing, HBase could be better about applying backpressure to writing clients when the system is under stress. We will get there. HBASE-1956 is a start.
>
> - Andy
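
For anyone wanting to watch for the same symptoms, here is a rough sketch of the two checks described in the quoted message: the write latency thresholds and the "slept ten times longer than scheduled" condition. It is illustration only and not HBase code. The function names are made up, the 2-second warning cut-off is an assumption (the message says "above 1 or 2 seconds"), and the strict comparison in the sleep check is a guess based on the wording of the log line.

```python
# Hypothetical monitoring helpers; none of these names exist in HBase or Hadoop.

# Assumption: warn at the upper end of the "1 or 2 seconds" range from the message.
WARN_SECONDS = 2.0
# From the message: anything above 5 seconds is an alarm.
ALARM_SECONDS = 5.0


def classify_write_latency(seconds: float) -> str:
    """Map one DFS write latency sample to "ok", "warning", or "alarm"."""
    if seconds > ALARM_SECONDS:
        return "alarm"
    if seconds > WARN_SECONDS:
        return "warning"
    return "ok"


def overslept(scheduled_ms: int, actual_ms: int, factor: int = 10) -> bool:
    """True when a thread slept more than `factor` times its scheduled period.

    Assumption: a strict comparison, modelled on the wording of the log line.
    """
    return actual_ms > factor * scheduled_ms


if __name__ == "__main__":
    # The latency spikes mentioned in the thread: 3, 6, and 15 seconds.
    for sample in (3, 6, 15):
        print(sample, classify_write_latency(sample))
    # The Sleeper warning from the log: 286007 ms slept, 10000 ms scheduled.
    print(overslept(10000, 286007))
```

With these cut-offs the 3-second spike would count as a warning and the 6- and 15-second spikes as alarms, which matches the reading that the cluster was already overloaded before the region servers went down.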