Date: Sun, 28 Jun 2009 22:22:25 -0700
Subject: Re: Region servers going down frequently (0.20 alpha)
From: Ryan Rawson
To: hbase-user@hadoop.apache.org

Can you post more of the regionserver logs prior to the crash? You can use pastebin.com if you'd like...

-ryan

On Sun, Jun 28, 2009 at 10:12 PM, Murali Krishna. P wrote:
> Hi Andrew,
> Thanks for looking into this.
> I tried adding 3 nodes to the zoo.cfg, but it threw an error saying the 'myid' file is missing. Now even if I go back to my old config, it still throws the error :(
>
> Thanks,
> Murali Krishna
>
> ________________________________
> From: Andrew Purtell
> To: hbase-user@hadoop.apache.org
> Sent: Sunday, 28 June, 2009 10:47:12 PM
> Subject: Re: Region servers going down frequently (0.20 alpha)
>
> Hello,
>
> As a first step, deploy Zookeeper quorum peers on all of your nodes and
> list all peers in the zoo.cfg files of your Zookeeper install and HBase:
>
>  server.1=node1:2888:3888
>  server.2=node2:2888:3888
>  server.3=node3:2888:3888
>
> Are you running mapreduce tasks as well as otherwise what you have described
> below?
>
> Do you see any messages in the master or region server logs along the lines
> of "we slept for NNNNNNms, wanted NNNNms"? How much RAM do these nodes have?
> Do you have host level metrics running? If not, consider watching this with
> Ganglia, or, in this case, since the cluster is so small, three terminals
> running top or atop. After 20, 30 minutes, is all available RAM full and are
> the nodes going in to swap?
>
>   - Andy
>
> ________________________________
> From: Murali Krishna. P
> To: hbase-user@hadoop.apache.org
> Sent: Sunday, June 28, 2009 8:23:27 AM
> Subject: Region servers going down frequently (0.20 alpha)
>
> Hi,
> I am repeatedly running into this issue where all the region servers try to restart but fail to come up. All the region servers seem to hit the same kind of exception, which causes this state.
>
> My cluster is as follows:
> node1: Master, NN, DN, RS, TT, XX
> node2: Zookeeper, JT, DN, RS, TT, XX
> node3: DN, RS, TT, XX
>
> where XX is my own hbase client with around 150 threads writing to a common table.
>
> The setup works fine for some time and then goes down (after 20, 30 mins). Here is the sequence in the region server logs:
>
>    * RS gets a zookeeper event: Got ZooKeeper event, state: Disconnected, type: None, path: null
>    * RS retries 'processing message', gets LeaseStillHeldException:
> 2009-06-28 02:14:17,013 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 1)
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>    * After 10 retries, gets another zookeeper event: Got ZooKeeper event, state: Expired, type: None, path: null
> 2009-06-28 02:14:17,751 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
> 2009-06-28 02:14:17,751 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Restarting Region Server
>    * Decides to restart the region server, but logs errors like this:
> 2009-06-28 02:14:17,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 280 on 60020, call exists([B@75880048, row=724b330295375ad0ba68fa85325381, maxVersions=1, timeRange=[0,9223372036854775807), families=ALL) from 69.147.127.248:48945: error: java.io.IOException: Server not running, aborting
>    * Above might be happening because client 'XX' is still trying to write? Finally it closes the region server and tries to restart, but gets the following exception:
> 2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
> 2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Runs every 10000000ms
> 2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
> 2009-06-28 02:14:27,032 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
> java.lang.NullPointerException
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
>        at java.lang.Thread.run(Thread.java:619)
> 2009-06-28 02:14:27,110 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Unhandled exception. Aborting...
> java.io.IOException: Region server startup failed
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:832)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:751)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
>        at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.NullPointerException
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
>        ... 2 more
> 2009-06-28 02:14:27,122 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=9, stores=10, storefiles=20, storefileIndexSize=0, memcacheSize=52, usedHeap=170, maxHeap=1995, blockCacheSize=49971560, blockCacheFree=28440, blockCacheCount=765, blockCacheHitRatio=94
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
> 2009-06-28 02:14:27,136 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: 0.0.0.0:60020
>
> The region server dies after that. All 3 region servers die like this and I have to start them manually. But after 10-15 minutes, they run into the same state again. Please help me find the root cause of this.
>
> Thanks,
> Murali Krishna
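
On the 'myid' error mentioned in the thread: in a replicated setup, ZooKeeper expects each quorum peer to have a file named myid in its dataDir, containing only that peer's number from the server.N lines in zoo.cfg. The following is a minimal Python sketch of that convention, assuming a standalone ZooKeeper install. The function names (parse_servers, write_myid) and the data directory you pass in are hypothetical, not part of ZooKeeper or HBase.

```python
import re
from pathlib import Path

# Matches lines such as "server.2=node2:2888:3888" (format taken from the thread).
SERVER_LINE = re.compile(r"^\s*server\.(\d+)\s*=\s*([^:\s]+):\d+:\d+\s*$")


def parse_servers(zoo_cfg_text):
    """Return {hostname: server_id} for every server.N line in zoo.cfg text."""
    servers = {}
    for line in zoo_cfg_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            servers[match.group(2)] = int(match.group(1))
    return servers


def write_myid(zoo_cfg_text, hostname, data_dir):
    """Write <data_dir>/myid holding the id that zoo.cfg assigns to hostname.

    data_dir is an assumption here: it should be whatever dataDir is set to
    in the zoo.cfg of the node this runs on.
    """
    servers = parse_servers(zoo_cfg_text)
    if hostname not in servers:
        raise ValueError(f"{hostname} has no server.N entry in zoo.cfg")
    path = Path(data_dir) / "myid"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{servers[hostname]}\n")
    return path
```

Run once per node with that node's own hostname, so node1 ends up with a myid containing 1, node2 with 2, and node3 with 3. Whether the HBase-managed quorum in 0.20 alpha reads the same file location is not confirmed in this thread.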
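
For pulling the relevant events out of a longer region server log before posting it, something like the following Python sketch may help. It only looks for strings that appear in the log excerpts quoted above, plus a loose "we slept" match based on Andy's paraphrase; the exact wording of that message in 0.20 alpha is an assumption. The labels and the timeline function are hypothetical names chosen for illustration.

```python
import re

# Timestamp prefix as it appears in the quoted logs: "2009-06-28 02:14:17,013 ".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")

# (label, pattern) pairs; the first pattern that matches a line wins.
MARKERS = [
    ("zk-disconnected", re.compile(r"Got ZooKeeper event, state: Disconnected")),
    ("zk-expired", re.compile(r"ZooKeeper session expired|state: Expired")),
    ("lease-still-held", re.compile(r"LeaseStillHeldException")),
    # Assumption: loose match on the paraphrased "we slept for NNNms" message.
    ("long-pause", re.compile(r"we slept", re.IGNORECASE)),
    ("failed-init", re.compile(r"Failed init")),
    ("abort", re.compile(r"aborting server at")),
]


def timeline(lines):
    """Return [(timestamp, label)] for each line that matches a marker.

    Lines without their own timestamp (exception names, stack frames)
    inherit the timestamp of the most recent timestamped line.
    """
    events = []
    last_timestamp = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            last_timestamp = match.group(1)
        for label, pattern in MARKERS:
            if pattern.search(line):
                events.append((last_timestamp, label))
                break
    return events
```

Calling timeline(open(path)) on each node's region server log and comparing the first zk-disconnected or long-pause timestamps across nodes should show whether the three servers lose their ZooKeeper sessions at about the same moment.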