hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Kent <Robert.K...@inps.co.uk>
Subject RE: HBase Regionserver randomly dies
Date Mon, 01 Dec 2014 16:17:56 GMT
> From: Ted Yu [yuzhihong@gmail.com]
> Sent: 01 December 2014 15:31
> To: user@hbase.apache.org
> Subject: Re: HBase Regionserver randomly dies

> Can you check namenode log around the time 'Failed to close inode' error
> was thrown ?

> Thanks

Here are the errors from the logs:

2014-11-29 21:12:59,277 WARN  [Thread-125058] hdfs.DFSClient (DFSOutputStream.java:run(639))
- DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicat
ed to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s)
are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3027)
[snip]
2014-11-29 21:12:59,306 WARN  [regionserver60020.logRoller] hdfs.DFSClient (DFSOutputStream.java:flushOrSync(2007))
- Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,359 WARN  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(566))
- pre-sync failed
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,459 INFO  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588))
- Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295456389
with entries=23544, filesize=121.9 M; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
2014-11-29 21:12:59,488 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140))
- Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,489 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140))
- Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,490 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog (FSHLog.java:run(1255))
- Error while AsyncSyncer sync, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,491 FATAL [regionserver60020-WAL.AsyncSyncer1] wal.FSHLog (FSHLog.java:run(1255))
- Error while AsyncSyncer sync, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,506 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140))
- Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,507 FATAL [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog (FSHLog.java:run(1255))
- Error while AsyncSyncer sync, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,524 WARN  [Thread-125063] hdfs.DFSClient (DFSOutputStream.java:run(639))
- DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,525 WARN  [regionserver60020.logRoller] hdfs.DFSClient (DFSOutputStream.java:flushOrSync(2007))
- Error while syncing
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
[snip]
2014-11-29 21:12:59,538 WARN  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(779))
- Riding over HLog close failure! error count=1
2014-11-29 21:12:59,539 INFO  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588))
- Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295578898
with entries=5, filesize=0; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
[snip]
2014-11-29 21:12:59,736 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140))
- Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,743 WARN  [Thread-125064] hdfs.DFSClient (DFSOutputStream.java:run(639))
- DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.ProtobufLogWriter (ProtobufLogWriter.java:writeWALTrailer(157))
- Got IOException while writing trailer
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,745 ERROR [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(776))
- Failed close of HLog writer
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
[snip]
2014-11-29 21:12:59,746 WARN  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:cleanupCurrentWriter(779))
- Riding over HLog close failure! error count=2
2014-11-29 21:12:59,748 INFO  [regionserver60020.logRoller] wal.FSHLog (FSHLog.java:rollWriter(588))
- Rolled WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579489
with entries=5, filesize=0; new WAL /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
2014-11-29 21:12:59,751 ERROR [regionserver60020-WAL.AsyncWriter] wal.FSHLog (FSHLog.java:run(1140))
- Error while AsyncWriter write, request close of hlog 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579737
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1492)
[snip]
2014-11-29 21:12:59,767 FATAL [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:abort(1865))
- ABORTING region server extras1.ci.local,60020,1417171049368: Failed log close in log roller
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: #1417295579737
[snip]
2014-11-29 21:12:59,768 FATAL [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:abort(1873))
- RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
2014-11-29 21:13:00,522 INFO  [regionserver60020.logRoller] regionserver.HRegionServer (HRegionServer.java:stop(1798))
- STOPPED: Failed log close in log roller
[snip]
2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases (Leases.java:close(147))
- regionserver60020 closing leases
2014-11-29 21:13:02,132 INFO  [regionserver60020] regionserver.Leases (Leases.java:close(150))
- regionserver60020 closed leases
2014-11-29 21:13:02,179 INFO  [regionserver60020] regionserver.ReplicationSource (ReplicationSource.java:terminate(860))
- Closing source Indexer_mhsaudit because: Region server is closing
2014-11-29 21:13:02,179 INFO  [regionserver60020] client.HConnectionManager$HConnectionImplementation
(HConnectionManager.java:closeZooKeeperWatcher(1837)) - Closing zookeeper sessionid=0x149f0f13a300434
2014-11-29 21:13:02,180 INFO  [regionserver60020] zookeeper.ZooKeeper (ZooKeeper.java:close(684))
- Session: 0x149f0f13a300434 closed
2014-11-29 21:13:02,272 INFO  [regionserver60020-EventThread] zookeeper.ClientCnxn (ClientCnxn.java:run(512))
- EventThread shut down
2014-11-29 21:13:02,356 INFO  [regionserver60020] zookeeper.ZooKeeper (ZooKeeper.java:close(684))
- Session: 0x149f0f13a300431 closed
2014-11-29 21:13:02,356 INFO  [regionserver60020-EventThread] zookeeper.ClientCnxn (ClientCnxn.java:run(512))
- EventThread shut down
2014-11-29 21:13:02,356 INFO  [regionserver60020] regionserver.HRegionServer (HRegionServer.java:run(1058))
- stopping server extras1.ci.local,60020,1417171049368; zookeeper connection closed.
2014-11-29 21:13:02,417 INFO  [regionserver60020] regionserver.HRegionServer (HRegionServer.java:run(1061))
- regionserver60020 exiting
2014-11-29 21:13:02,494 ERROR [main] regionserver.HRegionServerCommandLine (HRegionServerCommandLine.java:start(70))
- Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
        at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
        at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2467)
2014-11-29 21:13:02,715 INFO  [Thread-10] regionserver.ShutdownHook (ShutdownHook.java:run(111))
- Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@511a1546
2014-11-29 21:13:02,716 INFO  [Thread-10] regionserver.ShutdownHook (ShutdownHook.java:run(120))
- Starting fs shutdown hook thread.
2014-11-29 21:13:02,717 ERROR [Thread-125066] hdfs.DFSClient (DFSClient.java:closeAllFilesBeingWritten(911))
- Failed to close inode 32621
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/WALs/extras1.ci.local,60020,1417171049368/extras1.ci.local%2C60020%2C1417171049368.1417295579753
could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s)
running and no node(s) are excluded in this operation.


On Mon, Dec 1, 2014 at 4:10 AM, Robert Kent <Robert.Kent@inps.co.uk> wrote:
> > Looks like an HDFS issue. Are you sure your HDFS is working fine?
>
> HDFS appears to be working correctly - HBase will process requests
> properly and everything appears to work correctly for hours/days, until the
> regionserver randomly falls over.  If there were HDFS issues I would expect
> to see these during normal operation, but I don't.
>
Mime
View raw message