hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stack <st...@duboce.net>
Subject Re: NPE on ProtobufLogWriter.sync()
Date Tue, 16 Feb 2016 02:01:12 GMT
On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <pedro.gandola@gmail.com>
wrote:

> Hi Guys,
>
> One of my region servers got into a state where it was unable to start and
> the cluster was not receiving traffic for some time:



Were you trying to restart it and it wouldn't come up? It kept doing the
NPE on each restart?

Or it happened once and killed the regionserver?



*(master log)*
>
> > 2016-02-15 22:04:33,186 ERROR
> > [PriorityRpcServer.handler=4,queue=0,port=16000]
> master.MasterRpcServices:
> > Region server hbase-rs9.localdomain,16020,1455560991134 reported a fatal
> > error:
> > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE in
> > log roller
> > Cause:
> > java.io.IOException: java.lang.NullPointerException
> > at
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > at
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.lang.NullPointerException
> > at
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > ... 2 more 2016-02-15 22:05:45,678 WARN
> > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> > master.CatalogJanitor: Failed scan of catalog table
> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> > attempts=351, exceptions:
> > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
> > callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta' at
> > region=hbase:meta,,1.1588230740,
> > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> >
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > at
>

You running with read replicas enabled?


....



*(regionserver log)*


Anything before this log message about rolling the WAL?


> 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, request
> close of wal
> java.io.IOException: java.lang.NullPointerException
> at
>
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
>
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
>
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more
> 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234 ms,
> current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> DatanodeInfoWithStorage[10.5.2.95:50010
,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> DatanodeInfoWithStorage[10.5.2.57:50010
> ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> 2016-02-15 22:04:33,043 FATAL
> [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: ABORTING region server
> hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> java.io.IOException: java.lang.NullPointerException
> at
>
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
>
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
>
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more
> 2016-02-15 22:04:33,043 FATAL
> [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: RegionServer abort: loaded coprocessors are:
> [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> org.apache.phoenix.coprocessor.SequenceRegionObserver,
> org.apache.phoenix.coprocessor.ScanRegionObserver,
> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> org.apache.phoenix.hbase.index.Indexer,
> org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> 2016-02-15 22:04:33,136 INFO
>  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: Dump of metrics as JSON on abort: {
>   "beans" : [ {
>     "name" : "java.lang:type=Memory",
>     "modelerType" : "sun.management.MemoryImpl",
>     "Verbose" : true,
>     "HeapMemoryUsage" : {
>       "committed" : 15032385536,
>       "init" : 15032385536,
>       "max" : 15032385536,
>       "used" : 3732843792
>     },
>     "ObjectPendingFinalizationCount" : 0,
>     "NonHeapMemoryUsage" : {
>       "committed" : 104660992,
>       "init" : 2555904,
>       "max" : -1,
>       "used" : 103018984
>     },
>     "ObjectName" : "java.lang:type=Memory"
>   } ],
>   "beans" : [ {
>     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
>     "modelerType" : "RegionServer,sub=IPC",
>     "tag.Context" : "regionserver",
>     "tag.Hostname" : "hbase-rs9"
>   } ],
>   "beans" : [ {
>     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
>     "modelerType" : "RegionServer,sub=Replication",
>     "tag.Context" : "regionserver",
>     "tag.Hostname" : "hbase-rs9"
>   } ],
>   "beans" : [ {
>     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
>     "modelerType" : "RegionServer,sub=Server",
>     "tag.Context" : "regionserver",
>     "tag.Hostname" : "hbase-rs9"
>   } ]
> }
> 2016-02-15 22:04:33,194 INFO
>  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: STOPPED: IOE in log roller

Looks like we were in log roller, closing the WAL to put up a new one but
the a sync was trying to run at same time but stream undo under it....
NPE... which caused server abort.

...


Any idea what happened? Today, I moved some regions around to balance the
cluster and I ran a major compaction after that and I added more threads to
run small and large compactions, could this be related?



Yeah, pass more of the regionserver log from before the NPE... to see what
was going on at the time.  I don't think the above changes related.



I see that in the current branch the class *ProtobufLogWriter:176* already
contains try..catch

try {
>   return this.output.getPos();
> } catch (NullPointerException npe) {
>   // Concurrent close...
>   throw new IOException(npe);
> }


Yeah. Looks like the thought is an NPE could happen here because the stream
is being closed out from under us by another thread (the syncers are off on
their own just worried about syncing... This is just to get the NPE up out
of the syncer thread and up into the general WAL subsystem.).  It looks
like this caused the regionserver abort. I would think it should just
provoke the WAL roll, not an abort. Put up more RS log please.



> But I would be nice to understand the root cause of this error and if there
> is any misconfiguration from my side.
>
>
Not your config I'd say (if your config brought it on, its a bug).



> *Version: *HBase 1.1.2
>
>

St.Ack

P.S. It looks like a gentleman had similar issue end of last year:
http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
No response then.




> Thank you
> Cheers
> Pedro
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message