From: Ted Yu
Subject: Re: HMaster and HRegionServer going down
Date: Wed, 5 Jun 2013 03:49:04 -0700
To: user@hbase.apache.org

There are a few tips under:
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

Can you check?

Thanks

On Jun 5, 2013, at 2:05 AM, Vimal Jain wrote:

> I don't think so, as I don't find any issues in the data node logs.
> Also, there are a lot of exceptions like "session expired" and "slept more
> than configured time". What are these?
>
> On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu wrote:
>
>> Because your data node 192.168.20.30 broke down, which leads to the RS
>> going down.
>>
>> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain wrote:
>>
>>> Here are the complete logs:
>>>
>>> http://bin.cakephp.org/saved/103001 - HRegionServer
>>> http://bin.cakephp.org/saved/103000 - HMaster
>>> http://bin.cakephp.org/saved/103002 - Datanode
>>>
>>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain wrote:
>>>
>>>> Hi,
>>>> I have set up HBase in pseudo-distributed mode.
>>>> It was working fine for 6 days, but suddenly this morning both HMaster
>>>> and HRegionServer processes went down.
>>>> I checked the logs of both Hadoop and HBase.
>>>> Please help here.
>>>> Here are the snippets:
>>>>
>>>> *Datanode logs:*
>>>> 2013-06-05 05:12:51,436 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>> Exception in receiveBlock for block blk_1597245478875608321_2818
>>>> java.io.EOFException: while trying to read 2347 bytes
>>>> 2013-06-05 05:12:51,442 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>> writeBlock blk_1597245478875608321_2818 received exception
>>>> java.io.EOFException: while trying to read 2347 bytes
>>>> 2013-06-05 05:12:51,442 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>> DatanodeRegistration(192.168.20.30:50010,
>>>> storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
>>>> infoPort=50075, ipcPort=50020):DataXceiver
>>>> java.io.EOFException: while trying to read 2347 bytes
>>>>
>>>> *HRegion logs:*
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4694929ms instead of 3000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
>>>> DFSOutputStream ResponseProcessor exception for block
>>>> blk_1597245478875608321_2818 java.net.SocketTimeoutException: 63000 millis
>>>> timeout while waiting for channel to be ready for read.
>>>> ch : java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333
>>>> remote=/192.168.20.30:50010]
>>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 11695345ms instead of 10000000ms, this is likely due to a long
>>>> garbage collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>> Recovery for block blk_1597245478875608321_2818 bad datanode[0]
>>>> 192.168.20.30:50010
>>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>> while syncing
>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
>>>> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>>>> 2013-06-05 05:12:51,110 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog:
>>>> Could not sync. Requesting close of hlog
>>>> java.io.IOException: Reflection
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>> 2013-06-05 05:12:51,180 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog:
>>>> Could not sync. Requesting close of hlog
>>>> java.io.IOException: Reflection
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>> 2013-06-05 05:12:51,183 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog:
>>>> Failed close of HLog writer
>>>> java.io.IOException: Reflection
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>> 2013-06-05 05:12:51,184 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
>>>> Riding over HLog close failure!
>>>> error count=1
>>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> ABORTING region server hbase.rummycircle.com,60020,1369877672964:
>>>> regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001
>>>> received expired from ZooKeeper, aborting
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> RegionServer abort: loaded coprocessors are: []
>>>> 2013-06-05 05:12:52,621 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker:
>>>> SplitLogWorker interrupted while waiting for task, exiting:
>>>> java.lang.InterruptedException
>>>> java.io.InterruptedIOException: Aborting compaction of store cfp_info in
>>>> region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
>>>> because user requested stop.
>>>> 2013-06-05 05:12:53,425 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>>>> Possibly transient ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for
>>>> /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>> 2013-06-05 05:12:55,426 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>>>> Possibly transient ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for
>>>> /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>> 2013-06-05 05:12:59,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>>>> Possibly transient ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for
>>>> /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>> 2013-06-05 05:13:07,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>>>> Possibly transient ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for
>>>> /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>> 2013-06-05 05:13:07,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>>>> ZooKeeper delete failed after 3 retries
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for
>>>> /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception
>>>> closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/
>>>> hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
>>>> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>>>>
>>>> *HMaster logs:*
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4702394ms instead of 10000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4988731ms instead of 300000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4988726ms instead of 300000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4698291ms instead of 10000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4694502ms instead of 1000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4694492ms instead of 1000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4695589ms instead of 60000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster:
>>>> Master server abort: loaded coprocessors are: []
>>>> 2013-06-05 05:12:52,465 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>> Waiting for region servers count to settle; currently checked in 1, slept
>>>> for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
>>>> ms, interval of 1500 ms.
>>>> 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster:
>>>> Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal
>>>> error:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>> 2013-06-05 05:12:53,970 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>> Waiting for region servers count to settle; currently checked in 1, slept
>>>> for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
>>>> ms, interval of 1500 ms.
>>>> 2013-06-05 05:12:55,476 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>> Waiting for region servers count to settle; currently checked in 1, slept
>>>> for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
>>>> ms, interval of 1500 ms.
>>>> 2013-06-05 05:12:56,981 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>> Finished waiting for region servers count to settle; checked in 1, slept
>>>> for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is
>>>> running.
>>>> 2013-06-05 05:12:57,019 INFO org.apache.hadoop.hbase.catalog.CatalogTracker:
>>>> Failed verification of -ROOT-,,0 at
>>>> address=hbase.rummycircle.com,60020,1369877672964; java.io.EOFException
>>>> 2013-06-05 05:17:52,302 WARN org.apache.hadoop.hbase.master.SplitLogManager:
>>>> error while splitting logs in
>>>> [hdfs://192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting]
>>>> installed = 19 but only 0 done
>>>> 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster:
>>>> master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received
>>>> expired from ZooKeeper, aborting
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>> java.io.IOException: Giving up after tries=1
>>>> Caused by: java.lang.InterruptedException: sleep interrupted
>>>> 2013-06-05 05:17:52,381 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine:
>>>> Failed to start master
>>>> java.lang.RuntimeException: HMaster Aborted
>>>>
>>>> --
>>>> Thanks and Regards,
>>>> Vimal Jain
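
The Sleeper warnings in the logs above come down to one simple check: a periodic thread measures how long its sleep actually took versus how long it was configured to sleep, and a large overshoot is taken as evidence of a long JVM/GC pause. Here is a minimal sketch of that idea (this is NOT HBase's actual Sleeper code, and the 4x warning threshold is an assumption for illustration, not HBase's real cutoff):

```python
def pause_warning(expected_ms, actual_ms, factor=4):
    # If the measured sleep wildly overshoots the configured period,
    # report a likely GC/VM pause in the style of the log lines above.
    # `factor` is an illustrative threshold, not HBase's actual value.
    if actual_ms > expected_ms * factor:
        return ("We slept %dms instead of %dms, this is likely due to a "
                "long garbage collecting pause" % (actual_ms, expected_ms))
    return None

# Values taken from the HRegion log above: roughly a 78-minute stall
# on a 3-second period, which clearly trips the check.
print(pause_warning(3000, 4694929))
# Ordinary scheduling jitter stays quiet:
print(pause_warning(3000, 3100))
```

A pause of this size (over an hour) will outlive any reasonable ZooKeeper session timeout, so the session expiry and region-server abort follow directly from it; the book section linked above focuses on reducing such pauses (heap sizing and GC tuning) rather than only raising `zookeeper.session.timeout`.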