hbase-user mailing list archives

From Vimal Jain <vkj...@gmail.com>
Subject Re: HMaster and HRegionServer going down
Date Wed, 05 Jun 2013 12:10:13 GMT
Yes, that's true.
There are errors in all 3 logs (DataNode, HMaster and HRegionServer) during
the same period, but I am unable to deduce the exact cause.
Can you please help me identify the problem?
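
One way to compare the three logs over the same period is to merge them into
a single timeline ordered by timestamp. A minimal sketch in Python, assuming
the log4j-style "YYYY-MM-DD HH:MM:SS,mmm LEVEL" prefix seen in the snippets
quoted below; the function name and the SAMPLE dictionary are made up for
illustration, and the sample lines are shortened copies of lines from this
thread:

```python
import re
from datetime import datetime

# Matches the timestamp and level at the start of a log4j-style line.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+)")


def merged_timeline(logs):
    """Merge several logs into one list sorted by timestamp.

    logs maps a source name (hypothetical labels such as "datanode")
    to an iterable of log lines. Lines without a leading timestamp
    (stack traces, continuation lines) are skipped.
    """
    events = []
    for source, lines in logs.items():
        for line in lines:
            m = TS.match(line)
            if m:
                ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
                events.append((ts, source, m.group(2), line.strip()))
    events.sort(key=lambda e: e[0])
    return events


# Shortened lines taken from the snippets in this thread.
SAMPLE = {
    "datanode": [
        "2013-06-05 05:12:51,442 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataXceiver",
        "java.io.EOFException: while trying to read 2347 bytes",
    ],
    "master": [
        "2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort",
    ],
    "regionserver": [
        "2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694929ms instead of 3000ms",
    ],
}

if __name__ == "__main__":
    for ts, source, level, _ in merged_timeline(SAMPLE):
        print(ts.time(), source, level)
```

Reading the merged output top to bottom shows which process logged a problem
first, which is easier than flipping between three files.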

So far I suspect the following:
I have a 1 GB heap (the default) allocated to each of the 3 processes, i.e.
Master, RegionServer and ZooKeeper.
Both the Master and the RegionServer spent a long time in GC (as inferred
from log lines like "We slept 4694929ms instead of 3000ms").
Because of this, the ZooKeeper sessions of both the Master and the
RegionServer timed out, and hence both went down.
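
The "slept ... instead of ..." warnings in the logs quoted below can be
turned into numbers, which makes it easier to compare the pause with the
ZooKeeper session timeout. A hedged sketch in Python: the regular expression
matches the Sleeper wording from the quoted logs, while the function names
and the 180000 ms timeout are assumptions for illustration (check
zookeeper.session.timeout in hbase-site.xml for the value actually in
effect on your cluster):

```python
import re

# Wording taken from the org.apache.hadoop.hbase.util.Sleeper warnings
# quoted in this thread (on unwrapped log lines).
SLEEPER = re.compile(r"We slept (\d+)ms instead of (\d+)ms")

# Assumption, not read from any config: adjust to your own setting.
ASSUMED_ZK_SESSION_TIMEOUT_MS = 180_000


def sleeper_overshoots(text):
    """Return (slept_ms, expected_ms, overshoot_ms) per Sleeper warning."""
    result = []
    for slept, expected in SLEEPER.findall(text):
        slept, expected = int(slept), int(expected)
        result.append((slept, expected, slept - expected))
    return result


def exceeding_timeout(text, timeout_ms=ASSUMED_ZK_SESSION_TIMEOUT_MS):
    """Keep only the warnings whose overshoot is longer than the timeout."""
    return [o for o in sleeper_overshoots(text) if o[2] > timeout_ms]


# Two warnings copied (shortened) from the RegionServer and Master logs.
SAMPLE = (
    "2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: "
    "We slept 4694929ms instead of 3000ms\n"
    "2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: "
    "We slept 4702394ms instead of 10000ms\n"
)

if __name__ == "__main__":
    for slept, expected, over in exceeding_timeout(SAMPLE):
        print(f"slept {slept} ms, expected {expected} ms, "
              f"overshoot {over / 60000:.1f} min")
```

For the first RegionServer warning this gives an overshoot of 4691929 ms,
roughly 78 minutes, which is far longer than the assumed session timeout.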

I am new to HBase, so my findings may not be correct.
I want to be 100% sure before increasing the heap for both the Master and
the RegionServer (to around 2 GB each) to solve this.
For now I have restarted the cluster with the default heap (1 GB).



On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:

> There are errors in your data node log, and their timestamps match the
> errors in the RS log.
>
> --Send from my Sony mobile.
> On Jun 5, 2013 5:06 PM, "Vimal Jain" <vkjk89@gmail.com> wrote:
>
> > I don't think so, as I don't find any issues in the data node logs.
> > Also, there are a lot of exceptions like "session expired" and "slept more
> > than configured time". What are these?
> >
> >
> > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
> >
> > > Because your data node 192.168.20.30 broke down, which led to the RS
> > > going down.
> > >
> > >
> > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vkjk89@gmail.com> wrote:
> > >
> > > > Here is the complete log:
> > > >
> > > > http://bin.cakephp.org/saved/103001 - Hregion
> > > > http://bin.cakephp.org/saved/103000 - Hmaster
> > > > http://bin.cakephp.org/saved/103002 - Datanode
> > > >
> > > >
> > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vkjk89@gmail.com> wrote:
> > > >
> > > > > > Hi,
> > > > > > I have set up HBase in pseudo-distributed mode.
> > > > > > It was working fine for 6 days, but suddenly this morning both the
> > > > > > HMaster and HRegionServer processes went down.
> > > > > > I checked the logs of both Hadoop and HBase.
> > > > > > Please help.
> > > > > > Here are the snippets:
> > > > >
> > > > > *Datanode logs:*
> > > > > > 2013-06-05 05:12:51,436 INFO
> > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
> > > > > > for block blk_1597245478875608321_2818 java.io.EOFException: while trying
> > > > > > to read 2347 bytes
> > > > > > 2013-06-05 05:12:51,442 INFO
> > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > > > > > blk_1597245478875608321_2818 received exception java.io.EOFException: while
> > > > > > trying to read 2347 bytes
> > > > > > 2013-06-05 05:12:51,442 ERROR
> > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > > > DatanodeRegistration(192.168.20.30:50010,
> > > > > > storageID=DS-1816106352-192.168.20.30-50010-1369314076237, infoPort=50075,
> > > > > > ipcPort=50020):DataXceiver
> > > > > > java.io.EOFException: while trying to read 2347 bytes
> > > > >
> > > > >
> > > > > *HRegion logs:*
> > > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > > > slept 4694929ms instead of 3000ms, this is likely due to a long garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > > > DFSOutputStream ResponseProcessor exception  for block
> > > > > > blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000 millis
> > > > > > timeout while waiting for channel to be ready for read. ch :
> > > > > > java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333
> > > > > > remote=/192.168.20.30:50010]
> > > > > > 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > > > slept 11695345ms instead of 10000000ms, this is likely due to a long
> > > > > > garbage collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > > > > > Recovery for block blk_1597245478875608321_2818 bad datanode[0]
> > > > > > 192.168.20.30:50010
> > > > > > 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error while
> > > > > > syncing
> > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > > > > Aborting...
> > > > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > > > 2013-06-05 05:12:51,110 FATAL
> > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting
> > > > > > close of hlog
> > > > > > java.io.IOException: Reflection
> > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > 2013-06-05 05:12:51,180 FATAL
> > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting
> > > > > > close of hlog
> > > > > > java.io.IOException: Reflection
> > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > 2013-06-05 05:12:51,183 ERROR
> > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
> > > > > > java.io.IOException: Reflection
> > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > 2013-06-05 05:12:51,184 WARN
> > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close
> > > > > > failure! error count=1
> > > > > > 2013-06-05 05:12:52,557 FATAL
> > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> > > > > > hbase.rummycircle.com,60020,1369877672964:
> > > > > > regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001
> > > > > > received expired from ZooKeeper, aborting
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired
> > > > > > 2013-06-05 05:12:52,557 FATAL
> > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
> > > > > > loaded coprocessors are: []
> > > > > > 2013-06-05 05:12:52,621 INFO
> > > > > > org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
> > > > > > interrupted while waiting for task, exiting: java.lang.InterruptedException
> > > > > > java.io.InterruptedIOException: Aborting compaction of store cfp_info in
> > > > > > region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> > > > > > because user requested stop.
> > > > > > 2013-06-05 05:12:53,425 WARN
> > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> > > > > > ZooKeeper exception:
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired for
> > > > > > /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> > > > > > 2013-06-05 05:12:55,426 WARN
> > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> > > > > > ZooKeeper exception:
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired for
> > > > > > /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> > > > > > 2013-06-05 05:12:59,427 WARN
> > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> > > > > > ZooKeeper exception:
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired for
> > > > > > /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> > > > > > 2013-06-05 05:13:07,427 WARN
> > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> > > > > > ZooKeeper exception:
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired for
> > > > > > /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> > > > > > 2013-06-05 05:13:07,427 ERROR
> > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete
> > > > > > failed after 3 retries
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired for
> > > > > > /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> > > > > >     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> > > > > >     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception
> > > > > > closing file
> > > > > > /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
> > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > > > > Aborting...
> > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > > > > Aborting...
> > > > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > >
> > > > >
> > > > > *HMaster logs:*
> > > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > > > slept 4702394ms instead of 10000ms, this is likely due to a long garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > > > slept 4988731ms instead of 300000ms, this is likely due to a long garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > > > slept 4988726ms instead of 300000ms, this is likely due to a long garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > > > slept 4698291ms instead of 10000ms, this is likely due to a long garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > > > slept 4694502ms instead of 1000ms, this is likely due to a long garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > > > slept 4694492ms instead of 1000ms, this is likely due to a long garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > > > slept 4695589ms instead of 60000ms, this is likely due to a long garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster:
> > > > > > Master server abort: loaded coprocessors are: []
> > > > > > 2013-06-05 05:12:52,465 INFO org.apache.hadoop.hbase.master.ServerManager:
> > > > > > Waiting for region servers count to settle; currently checked in 1, slept
> > > > > > for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
> > > > > > ms, interval of 1500 ms.
> > > > > > 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster:
> > > > > > Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal
> > > > > > error:
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired
> > > > > > 2013-06-05 05:12:53,970 INFO org.apache.hadoop.hbase.master.ServerManager:
> > > > > > Waiting for region servers count to settle; currently checked in 1, slept
> > > > > > for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
> > > > > > ms, interval of 1500 ms.
> > > > > > 2013-06-05 05:12:55,476 INFO org.apache.hadoop.hbase.master.ServerManager:
> > > > > > Waiting for region servers count to settle; currently checked in 1, slept
> > > > > > for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
> > > > > > ms, interval of 1500 ms.
> > > > > > 2013-06-05 05:12:56,981 INFO org.apache.hadoop.hbase.master.ServerManager:
> > > > > > Finished waiting for region servers count to settle; checked in 1, slept
> > > > > > for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is
> > > > > > running.
> > > > > > 2013-06-05 05:12:57,019 INFO
> > > > > > org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of
> > > > > > -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
> > > > > > java.io.EOFException
> > > > > > 2013-06-05 05:17:52,302 WARN
> > > > > > org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs
> > > > > > in
> > > > > > [hdfs://192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting]
> > > > > > installed = 19 but only 0 done
> > > > > > 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster:
> > > > > > master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received
> > > > > > expired from ZooKeeper, aborting
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired
> > > > > > java.io.IOException: Giving up after tries=1
> > > > > > Caused by: java.lang.InterruptedException: sleep interrupted
> > > > > > 2013-06-05 05:17:52,381 ERROR
> > > > > > org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> > > > > > java.lang.RuntimeException: HMaster Aborted
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks and Regards,
> > > > > Vimal Jain
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks and Regards,
> > > > Vimal Jain
> > > >
> > >
> >
> >
> >
> > --
> > Thanks and Regards,
> > Vimal Jain
> >
>



-- 
Thanks and Regards,
Vimal Jain
