hbase-user mailing list archives

From "Murali Krishna. P" <muralikpb...@yahoo.com>
Subject Re: Region servers going down frequently (0.20 alpha)
Date Mon, 29 Jun 2009 06:22:07 GMT
Hi,
I managed to bring up the cluster after resetting zoo.cfg to have only one server. But only
one of the region servers is running now. The others are not starting, saying address already
in use '0.0.0.0/60020', but there is no old process running.

I have attached the region server log of the one that is running. It seems to be stable so far.
I have 4G of memory and the usage is close to 4G now.
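
One way to double-check the "address already in use" report is to try binding the port directly. This is only a minimal Python sketch, assuming the region server port 60020 from the error above; `port_is_free` is a hypothetical helper name and not part of HBase.

```python
import socket

def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            # No SO_REUSEADDR here, so any existing holder of the port makes bind fail.
            s.bind((host, port))
        except OSError:
            return False
        return True
```

Calling `port_is_free(60020)` on each node that refuses to start would show whether something on that host still holds the port, even if no old region server process is visible.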

 Thanks,
Murali Krishna




________________________________
From: Ryan Rawson <ryanobjc@gmail.com>
To: hbase-user@hadoop.apache.org
Sent: Monday, 29 June, 2009 10:52:25 AM
Subject: Re: Region servers going down frequently (0.20 alpha)

Can you post more of the regionserver logs prior to the crash?

you can use pastebin.com if you'd like...

-ryan

On Sun, Jun 28, 2009 at 10:12 PM, Murali Krishna. P <muralikpbhat@yahoo.com> wrote:
> Hi Andrew,
>  Thanks for looking into this.
> I tried adding 3 nodes to zoo.cfg, but it threw an error saying the 'myid' file is missing.
> Now even if I go back to my old config, it still throws the error :(
>
>  Thanks,
> Murali Krishna
>
>
>
>
> ________________________________
> From: Andrew Purtell <apurtell@apache.org>
> To: hbase-user@hadoop.apache.org
> Sent: Sunday, 28 June, 2009 10:47:12 PM
> Subject: Re: Region servers going down frequently (0.20 alpha)
>
> Hello,
>
> As a first step, deploy Zookeeper quorum peers on all of your nodes and
> list all peers in the zoo.cfg files of your Zookeeper install and HBase:
>
>  server.1=node1:2888:3888
>  server.2=node2:2888:3888
>  server.3=node3:2888:3888
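
With more than one `server.N` entry, each ZooKeeper peer also expects a `myid` file in its `dataDir` containing just its own N. The following is a rough Python sketch of that mapping, assuming the three-node list above; `myid_for_host` is a hypothetical helper, and the actual `dataDir` path comes from your own zoo.cfg.

```python
import re

# The quorum list from the message above.
ZOO_CFG = """\
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
"""

def myid_for_host(cfg_text, hostname):
    """Return N from the 'server.N=host:port:port' line matching hostname, or None."""
    for line in cfg_text.splitlines():
        m = re.match(r"\s*server\.(\d+)\s*=\s*([^:\s]+):", line)
        if m and m.group(2) == hostname:
            return int(m.group(1))
    return None

# The number returned for a node is what goes into <dataDir>/myid on that node.
```

For example, `myid_for_host(ZOO_CFG, "node2")` picks out the id for node2 from the list.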
>
> Are you running MapReduce tasks in addition to what you have described
> below?
>
> Do you see any messages in the master or region server logs along the lines
> of "we slept for NNNNNNms, wanted NNNNms"? How much RAM do these nodes have?
> Do you have host-level metrics running? If not, consider watching this with
> Ganglia or, in this case, since the cluster is so small, three terminals
> running top or atop. After 20 or 30 minutes, is all available RAM full and are
> the nodes going into swap?
>
>   - Andy
>
>
>
>
> ________________________________
> From: Murali Krishna. P <muralikpbhat@yahoo.com>
> To: hbase-user@hadoop.apache.org
> Sent: Sunday, June 28, 2009 8:23:27 AM
> Subject: Region servers going down frequently (0.20 alpha)
>
> Hi,
> I am repeatedly running into an issue where all the region servers try to restart
> but fail to come up. All the region servers seem to hit the same kind of exception,
> which causes this state.
>
> My cluster is as follows:
> node1 : Master, NN, DN, RS, TT, XX
> node2: Zookeeper, JT, DN, RS, TT, XX
> node3: DN, RS, TT, XX
>
> where XX is my own HBase client with around 150 threads writing to a common table.
>
> The setup works fine for some time and then goes down (after 20 to 30 minutes). Here is
> the sequence in the region server logs:
>
>    * RS gets a ZooKeeper event: Got ZooKeeper event, state: Disconnected, type: None, path: null
>    * RS retries 'Processing message' and gets LeaseStillHeldException:
> 2009-06-28 02:14:17,013 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 1)
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>    * After 10 retries, gets another ZooKeeper event: Got ZooKeeper event, state: Expired, type: None, path: null
> 2009-06-28 02:14:17,751 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
> 2009-06-28 02:14:17,751 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Restarting Region Server
>    * Decides to restart the region server, but logs errors like this:
> 2009-06-28 02:14:17,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 280 on 60020, call exists([B@75880048, row=724b330295375ad0ba68fa85325381, maxVersions=1, timeRange=[0,9223372036854775807), families=ALL) from 69.147.127.248:48945: error: java.io.IOException: Server not running, aborting
>    * The above might be happening because client 'XX' is still trying to write? Finally it closes
> the region server and tries to restart, but gets the following exception:
> 2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
> 2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Runs every 10000000ms
> 2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
> 2009-06-28 02:14:27,032 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
> java.lang.NullPointerException
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
>        at java.lang.Thread.run(Thread.java:619)
> 2009-06-28 02:14:27,110 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Unhandled exception. Aborting...
> java.io.IOException: Region server startup failed
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:832)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:751)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
>        at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.NullPointerException
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
>        ... 2 more
> 2009-06-28 02:14:27,122 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=9, stores=10, storefiles=20, storefileIndexSize=0, memcacheSize=52, usedHeap=170, maxHeap=1995, blockCacheSize=49971560, blockCacheFree=28440, blockCacheCount=765, blockCacheHitRatio=94
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
> 2009-06-28 02:14:27,136 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: 0.0.0.0:60020
>
> The region server dies after that. All 3 region servers die like this, and I have
> to start them manually. But after 10-15 minutes, they run into the same state
> again. Please help me find the root cause of this.
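
To see how often each region server goes from Disconnected to Expired, the ZooKeeper events can be pulled out of the log. This is a small Python sketch, assuming log lines shaped like the ones quoted in this message; `zk_states` is a hypothetical helper, and the pattern may need adjusting for other HBase versions.

```python
import re

# Matches the event lines quoted above, e.g.
# "Got ZooKeeper event, state: Expired, type: None, path: null"
EVENT_RE = re.compile(r"Got ZooKeeper event, state: (\w+), type: (\w+)")

def zk_states(lines):
    """Return the ZooKeeper session states in the order they appear in the log."""
    states = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            states.append(m.group(1))
    return states

# Sample lines copied from the log excerpt in this message.
sample = [
    "Got ZooKeeper event, state: Disconnected, type: None, path: null",
    "2009-06-28 02:14:17,013 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 1)",
    "Got ZooKeeper event, state: Expired, type: None, path: null",
]
```

Passing an open log file instead of `sample` works the same way, since the function only iterates over lines.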
>
> Thanks,
> Murali Krishna
