hbase-user mailing list archives

From Pedro Gandola <pedro.gand...@gmail.com>
Subject Re: NPE on ProtobufLogWriter.sync()
Date Tue, 16 Feb 2016 14:51:41 GMT
Just lost another RegionServer... The cluster was very stable until
yesterday; losing 2 region servers in less than 24 hours suggests something
might be wrong in my configuration.

Any insights?

Thank you
Pedro

*(rs log)*

> 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block
> BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
> BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> java.io.IOException: Bad response ERROR for block
> BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode
> DatanodeInfoWithStorage[10.5.1.117:50010
> ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod]
> retry.RetryInvocationHandler: Exception while invoking renewLease of class
> ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/
> 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after
> sleeping for 1261ms.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category WRITE is not supported in state standby
> at
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590)
> at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.renewLease(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> at com.sun.proxy.$Proxy18.renewLease(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892)
> at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
> at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
> at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
> at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
> at java.lang.Thread.run(Thread.java:745)
> 2016-02-16 14:22:37,753 WARN  [DataStreamer for file
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> hdfs.DFSClient: Error Recovery for block
> BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 in pipeline
> DatanodeInfoWithStorage[10.5.2.11:50010,DS-92794e46-b126-4aa8-bfd9-a8867cac42e4,DISK],
> DatanodeInfoWithStorage[10.5.1.117:50010,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK],
> DatanodeInfoWithStorage[10.5.2.57:50010,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]:
> bad datanode DatanodeInfoWithStorage[10.5.1.117:50010
> ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> 2016-02-16 14:22:37,769 INFO  [DataStreamer for file
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> retry.RetryInvocationHandler: Exception while invoking
> getAdditionalDatanode of class ClientNamenodeProtocolTranslatorPB over
> hbase-ms5.localdomain/10.5.2.248:8020. Trying to fail over immediately.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category READ is not supported in state standby
> at
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3326)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> 2016-02-16 14:22:37,781 WARN  [DataStreamer for file
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> hdfs.DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /apps/hbase/data/oldWALs/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> (inode 641732): File is not open for writing. Holder
> DFSClient_NONMAPREDUCE_-127840820_1 does not have any open files.
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3445)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3347)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> 2016-02-16 14:22:38,134 INFO
>  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> Opening socket connection to server zookeeper3.localdomain/10.5.2.166:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2016-02-16 14:22:38,137 INFO
>  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> Socket connection established to zookeeper3.localdomain/10.5.2.166:2181,
> initiating session
> 2016-02-16 14:22:38,144 INFO
>  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> Unable to reconnect to ZooKeeper service, session 0x25278d348e90c8d has
> expired, closing socket connection
> 2016-02-16 14:22:38,144 FATAL [main-EventThread]
> regionserver.HRegionServer: ABORTING region server
> hbase-rs6.localdomain,16020,1455560986600:
> regionserver:16020-0x25278d348e90c8d,
> quorum=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181,
> baseZNode=/hbase regionserver:16020-0x25278d348e90c8d received expired from
> ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:606)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
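
The abort itself is the ZooKeeper session expiring: the restart log further
down shows a negotiated timeout of 30000 ms, and as far as I understand a
session only expires when the RS cannot heartbeat within that window (e.g. a
long GC pause, or threads stuck on HDFS during the NameNode failover above).
As a minimal sketch, assuming only the standard HBaseConfiguration API, this is
how one could double-check which session timeout the RS actually picks up from
its classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ZkSessionTimeoutCheck {
  public static void main(String[] args) {
    // Loads hbase-default.xml + hbase-site.xml the same way the RS does.
    Configuration conf = HBaseConfiguration.create();
    // Fallback of -1 just means "not set anywhere on the classpath".
    int timeoutMs = conf.getInt("zookeeper.session.timeout", -1);
    System.out.println("zookeeper.session.timeout = " + timeoutMs + " ms");
  }
}

Note the ZooKeeper servers can still cap the negotiated value via their own
maxSessionTimeout, so the effective timeout may be lower than what
hbase-site.xml asks for.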


On Tue, Feb 16, 2016 at 9:32 AM, Pedro Gandola <pedro.gandola@gmail.com>
wrote:

> Just missed one of your questions.
>
> Were you trying to restart it and it wouldn't come up? It kept doing the
>> NPE on each restart?
>> Or it happened once and killed the regionserver?
>
>
> I received an alarm for the RS; when I went to look, the cluster was not
> processing data, so I tried to restart the RS. Everything went back to
> normal, but the RS crashed again, so I decided to remove it completely until
> I know more about the problem. Now, looking at the restart output, I see that
> the problem might be something else.
>
> *(log.out) *
>
>> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748
>> Killed                  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase
>> --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1
>> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or
>> directory
>
>
> *(log after restart)*
>
>> 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:java.io.tmpdir=/tmp
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:java.compiler=<NA>
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:os.name=Linux
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:os.arch=amd64
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:os.version=3.10.0-327.3.1.el7.x86_64
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:user.name=hbase
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:user.home=/home/hbase
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:user.dir=/home/hbase
>> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
>> client connection,
>> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
>> sessionTimeout=30000
>> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
>> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
>> client connection,
>> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
>> sessionTimeout=30000
>> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
>> 2016-02-15 22:21:23,754 INFO
>>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
>> Opening socket connection to server zookeeper2.localdomain/
>> 10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown
>> error)
>> 2016-02-15 22:21:23,755 INFO
>>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
>> Opening socket connection to server zookeeper2.localdomain/
>> 10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown
>> error)
>> 2016-02-15 22:21:23,767 INFO
>>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
>> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
>> initiating session
>> 2016-02-15 22:21:23,767 INFO
>>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
>> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
>> initiating session
>> 2016-02-15 22:21:23,777 INFO
>>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
>> Session establishment complete on server zookeeper2.localdomain/
>> 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout =
>> 30000
>> 2016-02-15 22:21:23,777 INFO
>>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
>> Session establishment complete on server zookeeper2.localdomain/
>> 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout =
>> 30000
>>
>
> I don't see any exception during the restart, but for some reason the
> process got killed, and I don't see any SIGTERM or OOM entry in the Linux
> journal.
>
> Cheers
> Pedro
>
>
> On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <pedro.gandola@gmail.com>
> wrote:
>
>> Hi St.Ack,
>>
>> Thank you for your help. I have attached ~5 min worth of data before the
>> crash.
>>
>> I restarted the cluster to update some configurations; after that I moved
>> some regions around to balance the cluster, and just to ensure data
>> locality I ran a major compaction. After that I connected my app, and 2
>> hours later this happened. So by the time of the crash I was not
>> performing any operation on the cluster; it was running normally.
>>
>> Thank you,
>> Cheers
>> Pedro
>>
>> On Tue, Feb 16, 2016 at 2:01 AM, Stack <stack@duboce.net> wrote:
>>
>>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <pedro.gandola@gmail.com>
>>> wrote:
>>>
>>> > Hi Guys,
>>> >
>>> > One of my region servers got into a state where it was unable to start
>>> and
>>> > the cluster was not receiving traffic for some time:
>>>
>>>
>>>
>>> Were you trying to restart it and it wouldn't come up? It kept doing the
>>> NPE on each restart?
>>>
>>> Or it happened once and killed the regionserver?
>>>
>>>
>>>
>>> *(master log)*
>>> >
>>> > > 2016-02-15 22:04:33,186 ERROR
>>> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
>>> > master.MasterRpcServices:
>>> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a
>>> fatal
>>> > > error:
>>> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134:
>>> IOE in
>>> > > log roller
>>> > > Cause:
>>> > > java.io.IOException: java.lang.NullPointerException
>>> > > at
>>> > >
>>> >
>>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
>>> > > at
>>> > >
>>> >
>>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
>>> > > at java.lang.Thread.run(Thread.java:745)
>>> > > Caused by: java.lang.NullPointerException
>>> > > at
>>> > >
>>> >
>>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
>>> > > ... 2 more
>>> > > 2016-02-15 22:05:45,678 WARN
>>> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
>>> > > master.CatalogJanitor: Failed scan of catalog table
>>> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> after
>>> > > attempts=351, exceptions:
>>> > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
>>> > > callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta'
>>> at
>>> > > region=hbase:meta,,1.1588230740,
>>> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
>>> > >
>>> >
>>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
>>> > > at
>>> >
>>>
>>> You running with read replicas enabled?
>>>
>>>
>>> ....
>>>
>>>
>>>
>>> *(regionserver log)*
>>>
>>>
>>> Anything before this log message about rolling the WAL?
>>>
>>>
>>> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing,
>>> request
>>> > close of wal
>>> > java.io.IOException: java.lang.NullPointerException
>>> > at
>>> >
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
>>> > at
>>> >
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
>>> > at java.lang.Thread.run(Thread.java:745)
>>> > Caused by: java.lang.NullPointerException
>>> > at
>>> >
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
>>> > ... 2 more
>>> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234
>>> ms,
>>> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
>>> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
>>> > DatanodeInfoWithStorage[10.5.2.95:50010
>>> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
>>> > DatanodeInfoWithStorage[10.5.2.57:50010
>>> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
>>> > 2016-02-15 22:04:33,043 FATAL
>>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>>> > regionserver.HRegionServer: ABORTING region server
>>> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
>>> > java.io.IOException: java.lang.NullPointerException
>>> > at
>>> >
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
>>> > at
>>> >
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
>>> > at java.lang.Thread.run(Thread.java:745)
>>> > Caused by: java.lang.NullPointerException
>>> > at
>>> >
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
>>> > ... 2 more
>>> > 2016-02-15 22:04:33,043 FATAL
>>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>>> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors
>>> are:
>>> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
>>> > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
>>> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
>>> > org.apache.phoenix.coprocessor.ScanRegionObserver,
>>> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
>>> > org.apache.phoenix.hbase.index.Indexer,
>>> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
>>> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
>>> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
>>> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
>>> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
>>> > 2016-02-15 22:04:33,136 INFO
>>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>>> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
>>> >   "beans" : [ {
>>> >     "name" : "java.lang:type=Memory",
>>> >     "modelerType" : "sun.management.MemoryImpl",
>>> >     "Verbose" : true,
>>> >     "HeapMemoryUsage" : {
>>> >       "committed" : 15032385536,
>>> >       "init" : 15032385536,
>>> >       "max" : 15032385536,
>>> >       "used" : 3732843792
>>> >     },
>>> >     "ObjectPendingFinalizationCount" : 0,
>>> >     "NonHeapMemoryUsage" : {
>>> >       "committed" : 104660992,
>>> >       "init" : 2555904,
>>> >       "max" : -1,
>>> >       "used" : 103018984
>>> >     },
>>> >     "ObjectName" : "java.lang:type=Memory"
>>> >   } ],
>>> >   "beans" : [ {
>>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
>>> >     "modelerType" : "RegionServer,sub=IPC",
>>> >     "tag.Context" : "regionserver",
>>> >     "tag.Hostname" : "hbase-rs9"
>>> >   } ],
>>> >   "beans" : [ {
>>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
>>> >     "modelerType" : "RegionServer,sub=Replication",
>>> >     "tag.Context" : "regionserver",
>>> >     "tag.Hostname" : "hbase-rs9"
>>> >   } ],
>>> >   "beans" : [ {
>>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
>>> >     "modelerType" : "RegionServer,sub=Server",
>>> >     "tag.Context" : "regionserver",
>>> >     "tag.Hostname" : "hbase-rs9"
>>> >   } ]
>>> > }
>>> > 2016-02-15 22:04:33,194 INFO
>>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>>> > regionserver.HRegionServer: STOPPED: IOE in log roller
>>>
>>> Looks like we were in the log roller, closing the WAL to put up a new one,
>>> but a sync was trying to run at the same time and the stream was torn down
>>> under it... NPE... which caused the server abort.
>>>
>>> ...
>>>
>>>
>>> Any idea what happened? Today I moved some regions around to balance the
>>> cluster, ran a major compaction after that, and added more threads to
>>> run small and large compactions; could this be related?
>>>
>>>
>>>
>>> Yeah, pass more of the regionserver log from before the NPE... to see what
>>> was going on at the time. I don't think the above changes are related.
>>>
>>>
>>>
>>> I see that in the current branch the class *ProtobufLogWriter* (line 176)
>>> already contains a try..catch:
>>>
>>> try {
>>> >   return this.output.getPos();
>>> > } catch (NullPointerException npe) {
>>> >   // Concurrent close...
>>> >   throw new IOException(npe);
>>> > }
>>>
>>>
>>> Yeah. Looks like the thought is that an NPE could happen here because the
>>> stream is being closed out from under us by another thread (the syncers are
>>> off on their own, just worried about syncing); this is just to get the NPE
>>> up out of the syncer thread and into the general WAL subsystem. It looks
>>> like this is what caused the regionserver abort. I would think it should
>>> just provoke the WAL roll, not an abort. Put up more RS log please.
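
As a rough standalone illustration of the race described above (made-up class
names only; this is not the actual FSHLog / ProtobufLogWriter code), one
thread tears the output stream down while another is still inside sync(), so
the dereference NPEs and is rethrown wrapped in an IOException:

import java.io.IOException;

public class WalRollRaceSketch {
  // Stands in for the writer's output stream; cleared on close.
  private volatile StringBuilder output = new StringBuilder();

  long sync() throws IOException {
    try {
      // Analogue of this.output.getPos(): may race with close() below.
      return output.length();
    } catch (NullPointerException npe) {
      // Concurrent close happened underneath us.
      throw new IOException(npe);
    }
  }

  void close() {
    output = null; // the "roller" side tearing the stream down
  }

  public static void main(String[] args) throws Exception {
    WalRollRaceSketch wal = new WalRollRaceSketch();
    Thread syncer = new Thread(() -> {
      try {
        while (true) { wal.sync(); }
      } catch (IOException ioe) {
        // In the thread above, the open question is why this bubbles up as a
        // regionserver abort rather than just forcing a WAL roll.
        System.out.println("sync failed: " + ioe);
      }
    });
    syncer.start();
    Thread.sleep(10);
    wal.close(); // concurrent close while the syncer is still running
    syncer.join();
  }
}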
>>>
>>>
>>>
>>> > But it would be nice to understand the root cause of this error and
>>> > whether there is any misconfiguration on my side.
>>> >
>>> >
>>> Not your config I'd say (if your config brought it on, it's a bug).
>>>
>>>
>>>
>>> > *Version: *HBase 1.1.2
>>> >
>>> >
>>>
>>> St.Ack
>>>
>>> P.S. It looks like a gentleman had a similar issue at the end of last year:
>>>
>>> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
>>> No response then.
>>>
>>>
>>>
>>>
>>> > Thank you
>>> > Cheers
>>> > Pedro
>>> >
>>>
>>
>>
>
