From: Andrew Purtell
To: hbase-user@hadoop.apache.org
Date: Sun, 28 Jun 2009 23:44:57 -0700 (PDT)
Subject: Re: Region servers going down frequently (0.20 alpha)

Hi,

Configuring 'myid' files is part of the ZooKeeper setup process. Are you aware of the ZooKeeper setup instructions here:
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html ?

From: http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkMulitServerSetup

"For reliable ZooKeeper service, you should deploy ZooKeeper in a cluster known as an ensemble. [...] Here are the steps to setting a server that will be part of an ensemble. These steps should be performed on every host in the ensemble: ..."

   - Andy

________________________________
From: Murali Krishna. P
To: hbase-user@hadoop.apache.org
Sent: Sunday, June 28, 2009 10:12:02 PM
Subject: Re: Region servers going down frequently (0.20 alpha)

Hi Andrew,

Thanks for looking into this. I tried adding 3 nodes to the zoo.cfg, but it threw an error saying the 'myid' file is missing.
Now even if I go back to my old config, it still throws the error :(

Thanks,
Murali Krishna

________________________________
From: Andrew Purtell
To: hbase-user@hadoop.apache.org
Sent: Sunday, 28 June, 2009 10:47:12 PM
Subject: Re: Region servers going down frequently (0.20 alpha)

Hello,

As a first step, deploy ZooKeeper quorum peers on all of your nodes and list all peers in the zoo.cfg files of your ZooKeeper install and HBase:

    server.1=node1:2888:3888
    server.2=node2:2888:3888
    server.3=node3:2888:3888

Are you running mapreduce tasks in addition to what you have described below?

Do you see any messages in the master or region server logs along the lines of "we slept for NNNNNNms, wanted NNNNms"?

How much RAM do these nodes have?

Do you have host-level metrics running? If not, consider watching this with Ganglia or, since the cluster is so small, three terminals running top or atop. After 20 to 30 minutes, is all available RAM full and are the nodes going into swap?

   - Andy

________________________________
From: Murali Krishna. P
To: hbase-user@hadoop.apache.org
Sent: Sunday, June 28, 2009 8:23:27 AM
Subject: Region servers going down frequently (0.20 alpha)

Hi,

I am repeatedly running into an issue where all the region servers try to restart but fail to come up. All the region servers seem to hit the same kind of exception, which causes this state. My cluster is as follows:

node1: Master, NN, DN, RS, TT, XX
node2: ZooKeeper, JT, DN, RS, TT, XX
node3: DN, RS, TT, XX

where XX is my own HBase client with around 150 threads writing to a common table. The setup works fine for some time and then goes down (after 20 to 30 minutes). Here is the sequence in the region server logs:

* RS gets a ZooKeeper event:
    Got ZooKeeper event, state: Disconnected, type: None, path: null
* RS retries 'processing message', gets LeaseStillHeldException:
    2009-06-28 02:14:17,013 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 1)
    org.apache.hadoop.hbase.Leases$LeaseStillHeldException
* After 10 retries, gets another ZooKeeper event:
    Got ZooKeeper event, state: Expired, type: None, path: null
    2009-06-28 02:14:17,751 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
    2009-06-28 02:14:17,751 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Restarting Region Server
* Decides to restart the region server, but logs many errors like this:
    2009-06-28 02:14:17,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 280 on 60020, call exists([B@75880048, row=724b330295375ad0ba68fa85325381, maxVersions=1, timeRange=[0,9223372036854775807), families=ALL) from 69.147.127.248:48945: error: java.io.IOException: Server not running, aborting
* The above might be happening because client 'XX' is still trying to write?

Finally it closes the region server and tries to restart, but gets the following exception:

2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Runs every 10000000ms
2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
2009-06-28 02:14:27,032 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
java.lang.NullPointerException
    at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
    at java.lang.Thread.run(Thread.java:619)
2009-06-28 02:14:27,110 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Unhandled exception. Aborting...
java.io.IOException: Region server startup failed
    at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:832)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:751)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
    ... 2 more
2009-06-28 02:14:27,122 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=9, stores=10, storefiles=20, storefileIndexSize=0, memcacheSize=52, usedHeap=170, maxHeap=1995, blockCacheSize=49971560, blockCacheFree=28440, blockCacheCount=765, blockCacheHitRatio=94
2009-06-28 02:14:27,131 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
2009-06-28 02:14:27,136 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: 0.0.0.0:60020

The region server dies after that. All 3 region servers die like this, and I have to start them manually. But after 10-15 minutes, they run into the same state again. Please help me find the root cause of this.

Thanks,
Murali Krishna
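
The replies in this thread turn on two pieces of ZooKeeper ensemble setup: the server.N lines that every peer's zoo.cfg must list, and the 'myid' file whose absence caused the reported error. The following is a minimal sketch of how the two relate, not an official setup tool. It assumes the three hosts and ports quoted in the thread; the helper names (server_lines, write_myid) are hypothetical, and the data directory is whatever dataDir your own zoo.cfg names. Each peer's myid file lives in that dataDir and holds only the N from that peer's own server.N line.

```python
from pathlib import Path

# Server ids mapped to the hosts named in the thread.
PEERS = {1: "node1", 2: "node2", 3: "node3"}


def server_lines(peers, quorum_port=2888, election_port=3888):
    """Build the server.N lines that every peer's zoo.cfg should list."""
    return [
        f"server.{sid}={host}:{quorum_port}:{election_port}"
        for sid, host in sorted(peers.items())
    ]


def write_myid(data_dir, server_id):
    """Write the myid file for one peer (hypothetical helper).

    data_dir is assumed to be the dataDir configured in that peer's
    zoo.cfg. The file holds only the peer's own server id.
    """
    path = Path(data_dir) / "myid"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{server_id}\n")
    return path


if __name__ == "__main__":
    # Print the lines to paste into zoo.cfg on all three nodes.
    print("\n".join(server_lines(PEERS)))
```

On node2, for example, one would call write_myid with that node's dataDir and the id 2, so that the file content matches the server.2 line.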