Mailing-List: user@hbase.apache.org
Date: Wed, 5 Jun 2013 16:24:22 +0530
Subject: Re: HMaster and HRegionServer going down
From: Vimal Jain
To: user@hbase.apache.org

Yes, I did check those. But I am not sure whether those parameter settings
are the issue, as there are some other exceptions in the logs
("DFSOutputStream ResponseProcessor exception", etc.).

On Wed, Jun 5, 2013 at 4:19 PM, Ted Yu wrote:

> There are a few tips under:
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>
> Can you check?
>
> Thanks
>
> On Jun 5, 2013, at 2:05 AM, Vimal Jain wrote:
>
> > I don't think so, as I don't find any issues in the datanode logs.
> > Also, there are a lot of exceptions like "session expired" and
> > "slept more than configured time". What are these?
> >
> > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu wrote:
> >
> >> Because your data node 192.168.20.30 broke down, which led to the
> >> RS going down.
> >>
> >> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain wrote:
> >>
> >>> Here are the complete logs:
> >>>
> >>> http://bin.cakephp.org/saved/103001 - HRegion
> >>> http://bin.cakephp.org/saved/103000 - HMaster
> >>> http://bin.cakephp.org/saved/103002 - Datanode
> >>>
> >>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain wrote:
> >>>
> >>>> Hi,
> >>>> I have set up HBase in pseudo-distributed mode.
> >>>> It was working fine for 6 days, but suddenly this morning both
> >>>> HMaster and HRegionServer processes went down.
> >>>> I checked the logs of both Hadoop and HBase.
> >>>> Please help here.
> >>>> Here are the snippets:
> >>>>
> >>>> *Datanode logs:*
> >>>> 2013-06-05 05:12:51,436 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_1597245478875608321_2818 java.io.EOFException: while trying to read 2347 bytes
> >>>> 2013-06-05 05:12:51,442 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1597245478875608321_2818 received exception java.io.EOFException: while trying to read 2347 bytes
> >>>> 2013-06-05 05:12:51,442 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.20.30:50010, storageID=DS-1816106352-192.168.20.30-50010-1369314076237, infoPort=50075, ipcPort=50020):DataXceiver
> >>>> java.io.EOFException: while trying to read 2347 bytes
> >>>>
> >>>> *HRegion logs:*
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694929ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_1597245478875608321_2818 java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333 remote=/192.168.20.30:50010]
> >>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 11695345ms instead of 10000000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1597245478875608321_2818 bad datanode[0] 192.168.20.30:50010
> >>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> >>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>> 2013-06-05 05:12:51,110 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
> >>>> java.io.IOException: Reflection
> >>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>> 2013-06-05 05:12:51,180 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
> >>>> java.io.IOException: Reflection
> >>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>> 2013-06-05 05:12:51,183 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
> >>>> java.io.IOException: Reflection
> >>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>> 2013-06-05 05:12:51,184 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close failure! error count=1
> >>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hbase.rummycircle.com,60020,1369877672964: regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001 received expired from ZooKeeper, aborting
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
> >>>> 2013-06-05 05:12:52,621 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker interrupted while waiting for task, exiting: java.lang.InterruptedException
> >>>> java.io.InterruptedIOException: Aborting compaction of store cfp_info in region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e. because user requested stop.
> >>>> 2013-06-05 05:12:53,425 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>> 2013-06-05 05:12:55,426 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>> 2013-06-05 05:12:59,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>> 2013-06-05 05:13:07,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>> 2013-06-05 05:13:07,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 3 retries
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 : java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> >>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>>
> >>>> *HMaster logs:*
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4702394ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988731ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988726ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4698291ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694502ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694492ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4695589ms instead of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
> >>>> 2013-06-05 05:12:52,465 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> >>>> 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster: Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal error:
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>> 2013-06-05 05:12:53,970 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> >>>> 2013-06-05 05:12:55,476 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> >>>> 2013-06-05 05:12:56,981 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting for region servers count to settle; checked in 1, slept for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is running.
> >>>> 2013-06-05 05:12:57,019 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964; java.io.EOFException
> >>>> 2013-06-05 05:17:52,302 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting] installed = 19 but only 0 done
> >>>> 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster: master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received expired from ZooKeeper, aborting
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>> java.io.IOException: Giving up after tries=1
> >>>> Caused by: java.lang.InterruptedException: sleep interrupted
> >>>> 2013-06-05 05:17:52,381 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> >>>> java.lang.RuntimeException: HMaster Aborted
> >>>>
> >>>> --
> >>>> Thanks and Regards,
> >>>> Vimal Jain
> >>>
> >>> --
> >>> Thanks and Regards,
> >>> Vimal Jain
> >
> > --
> > Thanks and Regards,
> > Vimal Jain

--
Thanks and Regards,
Vimal Jain
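The Sleeper warnings quoted above ("We slept 4694929ms instead of 3000ms") are the key clue in this thread: when a stop-the-world GC pause (or heavy swapping) stalls the JVM for longer than the ZooKeeper session timeout, ZooKeeper expires the server's session and the RegionServer aborts itself by design. A minimal sketch of that arithmetic, assuming a hypothetical zookeeper.session.timeout of 180000 ms (the actual default depends on the HBase version; this is not HBase code):

```java
// Sketch only: a ZooKeeper session expires when the client's heartbeats
// stop for longer than the negotiated session timeout, which is exactly
// what a long stop-the-world GC pause causes.
public class SessionExpiryCheck {

    // True when a pause of pauseMs would outlive the ZooKeeper session.
    static boolean wouldExpire(long pauseMs, long sessionTimeoutMs) {
        return pauseMs > sessionTimeoutMs;
    }

    public static void main(String[] args) {
        long pauseMs = 4_694_929L;        // from the Sleeper warning in the logs (~78 minutes)
        long sessionTimeoutMs = 180_000L; // assumed zookeeper.session.timeout
        System.out.println(wouldExpire(pauseMs, sessionTimeoutMs)); // prints "true"
    }
}
```

With a pause of roughly 78 minutes against a session timeout measured in minutes, the expiry (and the resulting RegionServer and HMaster aborts) is the expected outcome; the useful next step is diagnosing why the JVM paused that long (GC logs, swap activity), not raising the timeout.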