hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pedro Gandola <pedro.gand...@gmail.com>
Subject Re: NPE on ProtobufLogWriter.sync()
Date Wed, 17 Feb 2016 01:57:14 GMT
Hi St.Ack

I attached the log in my previous response.

Thanks
Cheers

On Tue, Feb 16, 2016 at 11:51 PM, Stack <stack@duboce.net> wrote:

> On Tue, Feb 16, 2016 at 1:32 AM, Pedro Gandola <pedro.gandola@gmail.com>
> wrote:
>
> > Just missed one of your questions.
> >
> > Were you trying to restart it and it wouldn't come up? It kept doing the
> > > NPE on each restart?
> > > Or it happened once and killed the regionserver?
> >
> >
> > I received an alarm for the RS and I went there I saw the cluster not
> > processing data and I tried to restart the RS. Everything went to normal
> > but the RS crashed again then I decided to remove it completely until
> know
> > more about the problem. Now looking into the restarting code I see that
> the
> > problem might be other.
> >
> > *(log.out) *
> >
> > > /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214:
> 20748
> > > Killed                  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase
> > > --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT}
> 2>&1
> > > cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or
> > > directory
> >
> >
> > *(log after restart)*
> >
> > > 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> > >
> >
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:java.io.tmpdir=/tmp
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:java.compiler=<NA>
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:os.name=Linux
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:os.arch=amd64
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:os.version=3.10.0-327.3.1.el7.x86_64
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:user.name=hbase
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:user.home=/home/hbase
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:user.dir=/home/hbase
> > > 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > > client connection,
> > >
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > > sessionTimeout=30000
> > > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > > client connection,
> > >
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > > sessionTimeout=30000
> > > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > 2016-02-15 22:21:23,754 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Opening socket connection to server zookeeper2.localdomain/
> > 10.5.1.105:2181.
> > > Will not attempt to authenticate using SASL (unknown error)
> > > 2016-02-15 22:21:23,755 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Opening socket connection to server zookeeper2.localdomain/
> > 10.5.1.105:2181.
> > > Will not attempt to authenticate using SASL (unknown error)
> > > 2016-02-15 22:21:23,767 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Socket connection established to zookeeper2.localdomain/
> 10.5.1.105:2181,
> > > initiating session
> > > 2016-02-15 22:21:23,767 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Socket connection established to zookeeper2.localdomain/
> 10.5.1.105:2181,
> > > initiating session
> > > 2016-02-15 22:21:23,777 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Session establishment complete on server zookeeper2.localdomain/
> > > 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout =
> > 30000
> > > 2016-02-15 22:21:23,777 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Session establishment complete on server zookeeper2.localdomain/
> > > 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout =
> > 30000
> > >
> >
> > I don't see any exception during the restart but for some reason the
> > process got killed and I don't see any SIGTERM or OOM on linux journal.
> >
> >
> Is that the last of the log? Just like that?
> St.Ack
>
>
> > Cheers
> > Pedro
> >
> >
> > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <pedro.gandola@gmail.com>
> > wrote:
> >
> > > Hi St.Ack,
> > >
> > > Thank you for your help. I have attached ~5 min worth of data before
> the
> > > crash.
> > >
> > > I restarted the cluster to update some configurations after that I
> moved
> > > some regions around to balance the cluster and just to ensure the data
> > > locality I ran a major compaction. After that I connected my App and 2
> > > hours later  this happened. So by the time of the crash I was not
> > > performing any operation over the cluster it was running normally.
> > >
> > > Thank you,
> > > Cheers
> > > Pedro
> > >
> > > On Tue, Feb 16, 2016 at 2:01 AM, Stack <stack@duboce.net> wrote:
> > >
> > >> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <
> pedro.gandola@gmail.com
> > >
> > >> wrote:
> > >>
> > >> > Hi Guys,
> > >> >
> > >> > One of my region servers got into a state where it was unable to
> start
> > >> and
> > >> > the cluster was not receiving traffic for some time:
> > >>
> > >>
> > >>
> > >> Were you trying to restart it and it wouldn't come up? It kept doing
> the
> > >> NPE on each restart?
> > >>
> > >> Or it happened once and killed the regionserver?
> > >>
> > >>
> > >>
> > >> *(master log)*
> > >> >
> > >> > > 2016-02-15 22:04:33,186 ERROR
> > >> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
> > >> > master.MasterRpcServices:
> > >> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported
a
> > >> fatal
> > >> > > error:
> > >> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134:
> > IOE
> > >> in
> > >> > > log roller
> > >> > > Cause:
> > >> > > java.io.IOException: java.lang.NullPointerException
> > >> > > at
> > >> > >
> > >> >
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > >> > > at
> > >> > >
> > >> >
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > >> > > at java.lang.Thread.run(Thread.java:745)
> > >> > > Caused by: java.lang.NullPointerException
> > >> > > at
> > >> > >
> > >> >
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > >> > > ... 2 more 2016-02-15 22:05:45,678 WARN
> > >> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> > >> > > master.CatalogJanitor: Failed scan of catalog table
> > >> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > after
> > >> > > attempts=351, exceptions:
> > >> > > Mon Feb 15 22:05:45 UTC 2016, null,
> java.net.SocketTimeoutException:
> > >> > > callTimeout=60000, callDuration=68222: row '' on table
> 'hbase:meta'
> > at
> > >> > > region=hbase:meta,,1.1588230740,
> > >> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0
at
> > >> > >
> > >> >
> > >>
> >
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > >> > > at
> > >> >
> > >>
> > >> You running with read replicas enabled?
> > >>
> > >>
> > >> ....
> > >>
> > >>
> > >>
> > >> *(regionserver log)*
> > >>
> > >>
> > >> Anything before this log message about rolling the WAL?
> > >>
> > >>
> > >> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing,
> > >> request
> > >> > close of wal
> > >> > java.io.IOException: java.lang.NullPointerException
> > >> > at
> > >> >
> > >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > >> > at
> > >> >
> > >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > >> > at java.lang.Thread.run(Thread.java:745)
> > >> > Caused by: java.lang.NullPointerException
> > >> > at
> > >> >
> > >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > >> > ... 2 more
> > >> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost:
> 234
> > >> ms,
> > >> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
> > >> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> > >> > DatanodeInfoWithStorage[10.5.2.95:50010
> > >> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> > >> > DatanodeInfoWithStorage[10.5.2.57:50010
> > >> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> > >> > 2016-02-15 22:04:33,043 FATAL
> > >> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > >> > regionserver.HRegionServer: ABORTING region server
> > >> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> > >> > java.io.IOException: java.lang.NullPointerException
> > >> > at
> > >> >
> > >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > >> > at
> > >> >
> > >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > >> > at java.lang.Thread.run(Thread.java:745)
> > >> > Caused by: java.lang.NullPointerException
> > >> > at
> > >> >
> > >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > >> > ... 2 more
> > >> > 2016-02-15 22:04:33,043 FATAL
> > >> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > >> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors
> > are:
> > >> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> > >> >
> > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> > >> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
> > >> > org.apache.phoenix.coprocessor.ScanRegionObserver,
> > >> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> > >> > org.apache.phoenix.hbase.index.Indexer,
> > >> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> > >> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> > >> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> > >> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> > >> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > >> > 2016-02-15 22:04:33,136 INFO
> > >> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > >> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> > >> >   "beans" : [ {
> > >> >     "name" : "java.lang:type=Memory",
> > >> >     "modelerType" : "sun.management.MemoryImpl",
> > >> >     "Verbose" : true,
> > >> >     "HeapMemoryUsage" : {
> > >> >       "committed" : 15032385536,
> > >> >       "init" : 15032385536,
> > >> >       "max" : 15032385536,
> > >> >       "used" : 3732843792
> > >> >     },
> > >> >     "ObjectPendingFinalizationCount" : 0,
> > >> >     "NonHeapMemoryUsage" : {
> > >> >       "committed" : 104660992,
> > >> >       "init" : 2555904,
> > >> >       "max" : -1,
> > >> >       "used" : 103018984
> > >> >     },
> > >> >     "ObjectName" : "java.lang:type=Memory"
> > >> >   } ],
> > >> >   "beans" : [ {
> > >> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> > >> >     "modelerType" : "RegionServer,sub=IPC",
> > >> >     "tag.Context" : "regionserver",
> > >> >     "tag.Hostname" : "hbase-rs9"
> > >> >   } ],
> > >> >   "beans" : [ {
> > >> >     "name" :
> "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> > >> >     "modelerType" : "RegionServer,sub=Replication",
> > >> >     "tag.Context" : "regionserver",
> > >> >     "tag.Hostname" : "hbase-rs9"
> > >> >   } ],
> > >> >   "beans" : [ {
> > >> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
> > >> >     "modelerType" : "RegionServer,sub=Server",
> > >> >     "tag.Context" : "regionserver",
> > >> >     "tag.Hostname" : "hbase-rs9"
> > >> >   } ]
> > >> > }
> > >> > 2016-02-15 22:04:33,194 INFO
> > >> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > >> > regionserver.HRegionServer: STOPPED: IOE in log roller
> > >>
> > >> Looks like we were in log roller, closing the WAL to put up a new one
> > but
> > >> the a sync was trying to run at same time but stream undo under it....
> > >> NPE... which caused server abort.
> > >>
> > >> ...
> > >>
> > >>
> > >> Any idea what happened? Today, I moved some regions around to balance
> > the
> > >> cluster and I ran a major compaction after that and I added more
> threads
> > >> to
> > >> run small and large compactions, could this be related?
> > >>
> > >>
> > >>
> > >> Yeah, pass more of the regionserver log from before the NPE... to see
> > what
> > >> was going on at the time.  I don't think the above changes related.
> > >>
> > >>
> > >>
> > >> I see that in the current branch the class *ProtobufLogWriter:176*
> > already
> > >> contains try..catch
> > >>
> > >> try {
> > >> >   return this.output.getPos();
> > >> > } catch (NullPointerException npe) {
> > >> >   // Concurrent close...
> > >> >   throw new IOException(npe);
> > >> > }
> > >>
> > >>
> > >> Yeah. Looks like the thought is an NPE could happen here because the
> > >> stream
> > >> is being closed out from under us by another thread (the syncers are
> off
> > >> on
> > >> their own just worried about syncing... This is just to get the NPE up
> > out
> > >> of the syncer thread and up into the general WAL subsystem.).  It
> looks
> > >> like this caused the regionserver abort. I would think it should just
> > >> provoke the WAL roll, not an abort. Put up more RS log please.
> > >>
> > >>
> > >>
> > >> > But I would be nice to understand the root cause of this error and
> if
> > >> there
> > >> > is any misconfiguration from my side.
> > >> >
> > >> >
> > >> Not your config I'd say (if your config brought it on, its a bug).
> > >>
> > >>
> > >>
> > >> > *Version: *HBase 1.1.2
> > >> >
> > >> >
> > >>
> > >> St.Ack
> > >>
> > >> P.S. It looks like a gentleman had similar issue end of last year:
> > >>
> > >>
> >
> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
> > >> No response then.
> > >>
> > >>
> > >>
> > >>
> > >> > Thank you
> > >> > Cheers
> > >> > Pedro
> > >> >
> > >>
> > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message