hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pedro Gandola <pedro.gand...@gmail.com>
Subject NPE on ProtobufLogWriter.sync()
Date Mon, 15 Feb 2016 23:07:39 GMT
Hi Guys,

One of my region servers got into a state where it was unable to start and
the cluster was not receiving traffic for some time:

*(master log)*

> 2016-02-15 22:04:33,186 ERROR
> [PriorityRpcServer.handler=4,queue=0,port=16000] master.MasterRpcServices:
> Region server hbase-rs9.localdomain,16020,1455560991134 reported a fatal
> error:
> ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE in
> log roller
> Cause:
> java.io.IOException: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more 2016-02-15 22:05:45,678 WARN
> [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> master.CatalogJanitor: Failed scan of catalog table
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=351, exceptions:
> Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
> callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta' at
> region=hbase:meta,,1.1588230740,
> hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:195)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
> at
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
> at
> org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
> at
> org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
> at
> org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
> at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
> at
> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
> at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
> at
> org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:168)
> at
> org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:120)
> at
> org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:221)
> at
> org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:102)
> at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.SocketTimeoutException: callTimeout=60000,
> callDuration=68222: row '' on table 'hbase:meta' at
> region=hbase:meta,,1.1588230740,
> hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
> at
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
> ... 3 more
> Caused by:
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException:
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server
> hbase-rs9.localdomain,16020,1455560991134 aborting
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:980)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2185)
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> at java.lang.Thread.run(Thread.java:745) at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
> at
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:325)
> at
> org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:380)
> at
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:199)
> at
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:346)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:320)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
> ... 4 more


*(regionserver log)*

> 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, request
> close of wal
> java.io.IOException: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more
> 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234 ms,
> current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> DatanodeInfoWithStorage[10.5.2.95:50010,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> DatanodeInfoWithStorage[10.5.2.57:50010
> ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> 2016-02-15 22:04:33,043 FATAL
> [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: ABORTING region server
> hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> java.io.IOException: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more
> 2016-02-15 22:04:33,043 FATAL
> [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: RegionServer abort: loaded coprocessors are:
> [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> org.apache.phoenix.coprocessor.SequenceRegionObserver,
> org.apache.phoenix.coprocessor.ScanRegionObserver,
> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> org.apache.phoenix.hbase.index.Indexer,
> org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> 2016-02-15 22:04:33,136 INFO
>  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: Dump of metrics as JSON on abort: {
>   "beans" : [ {
>     "name" : "java.lang:type=Memory",
>     "modelerType" : "sun.management.MemoryImpl",
>     "Verbose" : true,
>     "HeapMemoryUsage" : {
>       "committed" : 15032385536,
>       "init" : 15032385536,
>       "max" : 15032385536,
>       "used" : 3732843792
>     },
>     "ObjectPendingFinalizationCount" : 0,
>     "NonHeapMemoryUsage" : {
>       "committed" : 104660992,
>       "init" : 2555904,
>       "max" : -1,
>       "used" : 103018984
>     },
>     "ObjectName" : "java.lang:type=Memory"
>   } ],
>   "beans" : [ {
>     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
>     "modelerType" : "RegionServer,sub=IPC",
>     "tag.Context" : "regionserver",
>     "tag.Hostname" : "hbase-rs9"
>   } ],
>   "beans" : [ {
>     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
>     "modelerType" : "RegionServer,sub=Replication",
>     "tag.Context" : "regionserver",
>     "tag.Hostname" : "hbase-rs9"
>   } ],
>   "beans" : [ {
>     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
>     "modelerType" : "RegionServer,sub=Server",
>     "tag.Context" : "regionserver",
>     "tag.Hostname" : "hbase-rs9"
>   } ]
> }
> 2016-02-15 22:04:33,194 INFO
>  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: STOPPED: IOE in log roller
> 2016-02-15 22:04:33,194 INFO
>  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.LogRoller: LogRoller exiting.
> 2016-02-15 22:04:33,195 INFO  [regionserver/hbase-rs9.localdomain/
> 10.5.2.169:16020] regionserver.SplitLogWorker: Sending interrupt to stop
> the worker thread
> 2016-02-15 22:04:33,197 INFO  [regionserver/hbase-rs9.localdomain/
> 10.5.2.169:16020] regionserver.HRegionServer: Stopping infoServer
> 2016-02-15 22:04:33,197 INFO  [SplitLogWorker-hbase-rs9:16020]
> regionserver.SplitLogWorker: SplitLogWorker interrupted. Exiting.


Any idea what happened? Today, I moved some regions around to balance the
cluster and I ran a major compaction after that and I added more threads to
run small and large compactions, could this be related?

I see that in the current branch the class *ProtobufLogWriter:176* already
contains try..catch

try {
>   return this.output.getPos();
> } catch (NullPointerException npe) {
>   // Concurrent close...
>   throw new IOException(npe);
> }


But I would be nice to understand the root cause of this error and if there
is any misconfiguration from my side.

*Version: *HBase 1.1.2

Thank you
Cheers
Pedro

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message