hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: NPE on ProtobufLogWriter.sync()
Date Wed, 17 Feb 2016 06:43:53 GMT
What @Enis said: your NN seems to be in standby mode. You have HA NN? You
failed over? You need to clear standby state? Run fsck?
St.Ack
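The check being suggested here can be sketched as follows. This is only a sketch: `nn1`/`nn2` are hypothetical HA service IDs (the real ones come from `dfs.ha.namenodes.<nameservice>` in hdfs-site.xml), and the helper just encodes the decision you would make after running `hdfs haadmin -getServiceState` against each NN.

```shell
# Decide whether the HA pair is healthy from the two reported NN states,
# i.e. the output of `hdfs haadmin -getServiceState nn1` / `... nn2`
# (nn1/nn2 are hypothetical service IDs).
check_ha() {
  s1="$1"; s2="$2"
  if [ "$s1" = "active" ] || [ "$s2" = "active" ]; then
    echo "ok: one NN is active"
  else
    # Matches the symptom in this thread: every request answered by a standby.
    echo "problem: no active NN (consider hdfs haadmin -failover, then hdfs fsck /)"
  fi
}

check_ha active standby    # healthy HA pair
check_ha standby standby   # the failure mode suspected here
```

If both NNs really do report standby, a manual failover followed by an fsck is the usual recovery sequence.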

On Tue, Feb 16, 2016 at 6:32 PM, Pedro Gandola <pedro.gandola@gmail.com>
wrote:

> Hi Enis,
>
> At the time the problem happened, the active NN log did not have any errors
> or exceptions; however, the standby node had the following:
>
> > 2016-02-16 14:21:51,039 INFO ha.EditLogTailer (EditLogTailer.java:doTailEdits(238)) - Loaded 80 edits starting from txid 4916546
> > 2016-02-16 14:22:37,633 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> > 2016-02-16 14:22:37,635 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
> > 2016-02-16 14:22:37,766 INFO ipc.Server (Server.java:run(2151)) - IPC Server handler 196 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from 10.5.2.11:54684 Call#22619 Retry#1: org.apache.hadoop.ipc.StandbyException: *Operation category WRITE is not supported in state standby*
> > 2016-02-16 14:22:37,783 INFO ipc.Server (Server.java:run(2151)) - IPC Server handler 161 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getAdditionalDatanode from 10.5.2.11:54684 Call#22620 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
>
>
> But the standby node log is full of messages like this:
>
> > 2016-02-17 02:18:44,688 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
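A quick way to triage messages like these is to tally them by operation category: READ rejections on a standby are often just monitoring probes, while WRITE rejections mean a client believed this NN was active. A minimal sketch over sample lines; the real input would be the standby NN log file, whose path is deployment-specific.

```shell
# Tally standby-rejection messages by operation category. The sample lines
# stand in for the real NameNode log; pipe the actual log through the same filter.
printf '%s\n' \
  'WARN namenode.FSNamesystem - Operation category READ is not supported in state standby' \
  'WARN namenode.FSNamesystem - Operation category READ is not supported in state standby' \
  'INFO ipc.Server - Operation category WRITE is not supported in state standby' \
  | grep -o 'Operation category [A-Z]*' | sort | uniq -c | sort -rn
# READ should dominate if the noise is only monitoring hitting the standby.
```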
>
>
> Is this normal, or shall I run an fsck? I'm trusting Ambari, which says
> that HDFS is totally OK :).
>
> Thanks
> Cheers
> Pedro
>
> On Tue, Feb 16, 2016 at 11:10 PM, Enis Söztutar <enis.soz@gmail.com>
> wrote:
>
> > > Operation category WRITE is not supported in state standby
> > This looks to be coming from a NN that is in standby state (or safe mode?).
> > Did you check whether the underlying HDFS is healthy? Is this an HA
> > configuration, and does hbase-site.xml contain the correct NN configuration?
> >
> > Enis
> >
> >
> > On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola <pedro.gandola@gmail.com>
> > wrote:
> >
> > > Just lost another RegionServer... The cluster was very stable until
> > > yesterday; losing 2 region servers in less than 24 hours suggests
> > > something might be wrong in my configuration.
> > >
> > > Any insights?
> > >
> > > Thank you
> > > Pedro
> > >
> > > *(rs log)*
> > >
> > > > 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> > > > java.io.IOException: Bad response ERROR for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode DatanodeInfoWithStorage[10.5.1.117:50010,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> > > > 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod] retry.RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after sleeping for 1261ms.
> > > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby
> > > > at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > > at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > > > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457)
> > > > at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992)
> > > > at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
> > > > at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > > at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > > at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
> > > > at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590)
> > > > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > at com.sun.proxy.$Proxy17.renewLease(Unknown Source)
> > > > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > > at com.sun.proxy.$Proxy18.renewLease(Unknown Source)
> > > > at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892)
> > > > at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
> > > > at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
> > > > at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
> > > > at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
> > > > at java.lang.Thread.run(Thread.java:745)
> > > > 2016-02-16 14:22:37,753 WARN  [DataStreamer for file /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508 block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] hdfs.DFSClient: Error Recovery for block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 in pipeline DatanodeInfoWithStorage[10.5.2.11:50010,DS-92794e46-b126-4aa8-bfd9-a8867cac42e4,DISK], DatanodeInfoWithStorage[10.5.1.117:50010,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK], DatanodeInfoWithStorage[10.5.2.57:50010,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]: bad datanode DatanodeInfoWithStorage[10.5.1.117:50010,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > > 2016-02-16 14:22:37,769 INFO  [DataStreamer for file /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508 block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] retry.RetryInvocationHandler: Exception while invoking getAdditionalDatanode of class ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/10.5.2.248:8020. Trying to fail over immediately.
> > > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> > > > at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > > at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > > > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3326)
> > > > at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> > > > at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> > > > at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > > at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> > > > at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> > > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> > > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> > > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> > > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> > > > 2016-02-16 14:22:37,781 WARN  [DataStreamer for file /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508 block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937] hdfs.DFSClient: DataStreamer Exception
> > > > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/hbase/data/oldWALs/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508 (inode 641732): File is not open for writing. Holder DFSClient_NONMAPREDUCE_-127840820_1 does not have any open files.
> > > > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3445)
> > > > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3347)
> > > > at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> > > > at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> > > > at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > > at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> > > > at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> > > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> > > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> > > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> > > > at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> > > > 2016-02-16 14:22:38,134 INFO  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn: Opening socket connection to server zookeeper3.localdomain/10.5.2.166:2181. Will not attempt to authenticate using SASL (unknown error)
> > > > 2016-02-16 14:22:38,137 INFO  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn: Socket connection established to zookeeper3.localdomain/10.5.2.166:2181, initiating session
> > > > 2016-02-16 14:22:38,144 INFO  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x25278d348e90c8d has expired, closing socket connection
> > > > 2016-02-16 14:22:38,144 FATAL [main-EventThread] regionserver.HRegionServer: ABORTING region server hbase-rs6.localdomain,16020,1455560986600: regionserver:16020-0x25278d348e90c8d, quorum=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181, baseZNode=/hbase regionserver:16020-0x25278d348e90c8d received expired from ZooKeeper, aborting
> > > > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> > > > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:606)
> > > > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
> > > > at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > > > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
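The abort in the log quote above is a ZooKeeper session expiry, which usually points at a long GC pause or network stall on the RS rather than at HDFS itself. The relevant knob is `zookeeper.session.timeout` in hbase-site.xml; the fragment below is illustrative only (written to a temp file so the extraction is runnable), with the 30000 ms value matching the negotiated timeout visible elsewhere in this thread.

```shell
# Illustrative hbase-site.xml fragment; the real file lives in your HBase conf dir.
cat > /tmp/hbase-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>30000</value>
  </property>
</configuration>
EOF
# Extract the configured session timeout in milliseconds.
grep -o '[0-9][0-9]*' /tmp/hbase-site-sample.xml   # prints 30000
```

If GC pauses approach that timeout, the RS will be expired and aborted exactly as shown.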
> > >
> > >
> > > On Tue, Feb 16, 2016 at 9:32 AM, Pedro Gandola <pedro.gandola@gmail.com> wrote:
> > >
> > > > Just missed one of your questions.
> > > >
> > > >> Were you trying to restart it and it wouldn't come up? It kept doing the NPE on each restart?
> > > >> Or it happened once and killed the regionserver?
> > > >
> > > >
> > > > I received an alarm for the RS, and when I went there I saw the cluster
> > > > not processing data, so I tried to restart the RS. Everything went back
> > > > to normal, but the RS crashed again, so I decided to remove it completely
> > > > until I know more about the problem. Now, looking at the restart output,
> > > > I see that the problem might be something else.
> > > >
> > > > *(log.out) *
> > > >
> > > >> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748 Killed                  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1
> > > >> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or directory
> > > >
> > > >
> > > > *(log after restart)*
> > > >
> > > >> 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client environment:os.name=Linux
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client environment:os.arch=amd64
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client environment:os.version=3.10.0-327.3.1.el7.x86_64
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client environment:user.name=hbase
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client environment:user.home=/home/hbase
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client environment:user.dir=/home/hbase
> > > >> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating client connection, connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > >> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating client connection, connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > >> 2016-02-15 22:21:23,754 INFO  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: Opening socket connection to server zookeeper2.localdomain/10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown error)
> > > >> 2016-02-15 22:21:23,755 INFO  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: Opening socket connection to server zookeeper2.localdomain/10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown error)
> > > >> 2016-02-15 22:21:23,767 INFO  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: Socket connection established to zookeeper2.localdomain/10.5.1.105:2181, initiating session
> > > >> 2016-02-15 22:21:23,767 INFO  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: Socket connection established to zookeeper2.localdomain/10.5.1.105:2181, initiating session
> > > >> 2016-02-15 22:21:23,777 INFO  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: Session establishment complete on server zookeeper2.localdomain/10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout = 30000
> > > >> 2016-02-15 22:21:23,777 INFO  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn: Session establishment complete on server zookeeper2.localdomain/10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout = 30000
> > > >>
> > > >
> > > > I don't see any exceptions during the restart, but for some reason the
> > > > process got killed, and I don't see any SIGTERM or OOM in the Linux journal.
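For archive readers: the bare `Killed` status in log.out, with no SIGTERM in the journal, is the classic signature of an external kill, and the kernel OOM killer is the usual suspect. It logs a recognizable pattern, so `journalctl -k` or `dmesg` output is worth grepping; the sample line below is fabricated for illustration (it reuses the PID from log.out, but that correlation is an assumption).

```shell
# Fabricated sample of a kernel OOM-kill message; in practice, pipe the output
# of `journalctl -k` or `dmesg` through the same grep.
sample='Out of memory: Kill process 20748 (java) score 901 or sacrifice child'
echo "$sample" | grep -c 'Kill process'   # prints 1 when an OOM kill is present
```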
> > > >
> > > > Cheers
> > > > Pedro
> > > >
> > > >
> > > > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <pedro.gandola@gmail.com> wrote:
> > > >
> > > >> Hi St.Ack,
> > > >>
> > > >> Thank you for your help. I have attached ~5 min worth of data from
> > > >> before the crash.
> > > >>
> > > >> I restarted the cluster to update some configurations; after that, I
> > > >> moved some regions around to balance the cluster, and just to ensure
> > > >> data locality I ran a major compaction. After that I connected my app,
> > > >> and 2 hours later this happened. So at the time of the crash I was not
> > > >> performing any operation on the cluster; it was running normally.
> > > >>
> > > >> Thank you,
> > > >> Cheers
> > > >> Pedro
> > > >>
> > > >> On Tue, Feb 16, 2016 at 2:01 AM, Stack <stack@duboce.net> wrote:
> > > >>
> > > >>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <pedro.gandola@gmail.com> wrote:
> > > >>>
> > > >>> > Hi Guys,
> > > >>> >
> > > >>> > One of my region servers got into a state where it was unable to
> > > >>> > start and the cluster was not receiving traffic for some time:
> > > >>>
> > > >>>
> > > >>>
> > > >>> Were you trying to restart it and it wouldn't come up? It kept doing
> > > >>> the NPE on each restart?
> > > >>>
> > > >>> Or it happened once and killed the regionserver?
> > > >>>
> > > >>>
> > > >>>
> > > >>> *(master log)*
> > > >>> >
> > > >>> > > 2016-02-15 22:04:33,186 ERROR [PriorityRpcServer.handler=4,queue=0,port=16000] master.MasterRpcServices: Region server hbase-rs9.localdomain,16020,1455560991134 reported a fatal error:
> > > >>> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> > > >>> > > Cause:
> > > >>> > > java.io.IOException: java.lang.NullPointerException
> > > >>> > > at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > >>> > > at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > >>> > > at java.lang.Thread.run(Thread.java:745)
> > > >>> > > Caused by: java.lang.NullPointerException
> > > >>> > > at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > >>> > > ... 2 more
> > > >>> > > 2016-02-15 22:05:45,678 WARN [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1] master.CatalogJanitor: Failed scan of catalog table
> > > >>> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=351, exceptions:
> > > >>> > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0
> > > >>> > > at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > > >>> > > at
> > > >>> >
> > > >>>
> > > >>> You running with read replicas enabled?
> > > >>>
> > > >>>
> > > >>> ....
> > > >>>
> > > >>>
> > > >>>
> > > >>> *(regionserver log)*
> > > >>>
> > > >>>
> > > >>> Anything before this log message about rolling the WAL?
> > > >>>
> > > >>>
> > > >>> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, request close of wal
> > > >>> > java.io.IOException: java.lang.NullPointerException
> > > >>> > at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > >>> > at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > >>> > at java.lang.Thread.run(Thread.java:745)
> > > >>> > Caused by: java.lang.NullPointerException
> > > >>> > at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > >>> > ... 2 more
> > > >>> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234 ms, current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK], DatanodeInfoWithStorage[10.5.2.95:50010,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK], DatanodeInfoWithStorage[10.5.2.57:50010,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> > > >>> > 2016-02-15 22:04:33,043 FATAL [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] regionserver.HRegionServer: ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> > > >>> > java.io.IOException: java.lang.NullPointerException
> > > >>> > at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > >>> > at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > >>> > at java.lang.Thread.run(Thread.java:745)
> > > >>> > Caused by: java.lang.NullPointerException
> > > >>> > at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > >>> > ... 2 more
> > > >>> > 2016-02-15 22:04:33,043 FATAL [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.regionserver.LocalIndexMerger, org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator, org.apache.phoenix.coprocessor.SequenceRegionObserver, org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.hadoop.hbase.regionserver.LocalIndexSplitter, org.apache.phoenix.coprocessor.ServerCachingEndpointImpl, org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint, org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > > >>> > 2016-02-15 22:04:33,136 INFO
> > > >>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > > >>> > regionserver.HRegionServer: Dump of metrics as JSON on abort:
{
> > > >>> >   "beans" : [ {
> > > >>> >     "name" : "java.lang:type=Memory",
> > > >>> >     "modelerType" : "sun.management.MemoryImpl",
> > > >>> >     "Verbose" : true,
> > > >>> >     "HeapMemoryUsage" : {
> > > >>> >       "committed" : 15032385536,
> > > >>> >       "init" : 15032385536,
> > > >>> >       "max" : 15032385536,
> > > >>> >       "used" : 3732843792
> > > >>> >     },
> > > >>> >     "ObjectPendingFinalizationCount" : 0,
> > > >>> >     "NonHeapMemoryUsage" : {
> > > >>> >       "committed" : 104660992,
> > > >>> >       "init" : 2555904,
> > > >>> >       "max" : -1,
> > > >>> >       "used" : 103018984
> > > >>> >     },
> > > >>> >     "ObjectName" : "java.lang:type=Memory"
> > > >>> >   } ],
> > > >>> >   "beans" : [ {
> > > >>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> > > >>> >     "modelerType" : "RegionServer,sub=IPC",
> > > >>> >     "tag.Context" : "regionserver",
> > > >>> >     "tag.Hostname" : "hbase-rs9"
> > > >>> >   } ],
> > > >>> >   "beans" : [ {
> > > >>> >     "name" :
> > > "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> > > >>> >     "modelerType" : "RegionServer,sub=Replication",
> > > >>> >     "tag.Context" : "regionserver",
> > > >>> >     "tag.Hostname" : "hbase-rs9"
> > > >>> >   } ],
> > > >>> >   "beans" : [ {
> > > >>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
> > > >>> >     "modelerType" : "RegionServer,sub=Server",
> > > >>> >     "tag.Context" : "regionserver",
> > > >>> >     "tag.Hostname" : "hbase-rs9"
> > > >>> >   } ]
> > > >>> > }
> > > >>> > 2016-02-15 22:04:33,194 INFO  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller] regionserver.HRegionServer: STOPPED: IOE in log roller
> > > >>>
> > > >>> Looks like we were in the log roller, closing the WAL to put up a new
> > > >>> one, but a sync was trying to run at the same time on the stream being
> > > >>> pulled out from under it.... NPE... which caused the server abort.
> > > >>>
> > > >>> ...
> > > >>>
> > > >>>
> > > >>> Any idea what happened? Today I moved some regions around to balance
> > > >>> the cluster, ran a major compaction after that, and added more threads
> > > >>> to run small and large compactions; could this be related?
> > > >>>
> > > >>>
> > > >>>
> > > >>> Yeah, pass more of the regionserver log from before the NPE... to see
> > > >>> what was going on at the time. I don't think the above changes are related.
> > > >>>
> > > >>>
> > > >>>
> > > >>> I see that in the current branch, *ProtobufLogWriter:176* already
> > > >>> contains a try..catch:
> > > >>>
> > > >>> try {
> > > >>> >   return this.output.getPos();
> > > >>> > } catch (NullPointerException npe) {
> > > >>> >   // Concurrent close...
> > > >>> >   throw new IOException(npe);
> > > >>> > }
> > > >>>
> > > >>>
> > > >>> Yeah. Looks like the thought is an NPE could happen here because the
> > > >>> stream is being closed out from under us by another thread (the syncers
> > > >>> are off on their own, just worried about syncing... this is just to get
> > > >>> the NPE up out of the syncer thread and into the general WAL
> > > >>> subsystem). It looks like this caused the regionserver abort. I would
> > > >>> think it should just provoke a WAL roll, not an abort. Put up more RS
> > > >>> log please.
> > > >>>
> > > >>>
> > > >>>
> > > >>> > But it would be nice to understand the root cause of this error and
> > > >>> > whether there is any misconfiguration on my side.
> > > >>> >
> > > >>> >
> > > >>> Not your config, I'd say (if your config brought it on, it's a bug).
> > > >>>
> > > >>>
> > > >>>
> > > >>> > *Version: *HBase 1.1.2
> > > >>> >
> > > >>> >
> > > >>>
> > > >>> St.Ack
> > > >>>
> > > >>> P.S. It looks like a gentleman had a similar issue at the end of last year:
> > > >>> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
> > > >>> No response then.
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> > Thank you
> > > >>> > Cheers
> > > >>> > Pedro
> > > >>> >
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>
