hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: HBase resgionServer crashed with no gc detected
Date Thu, 20 Oct 2016 13:43:05 GMT
I haven't found any more clues in the latest log.
I noticed that DEBUG logging was not turned on.

Please keep monitoring and get back to the list if you encounter another
region server crash in the future.
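
For that monitoring, one rough approach is to scan the region server log for
long silent stretches between consecutive timestamps. The sketch below is
only a heuristic: it assumes the log4j-style timestamp format seen in the log
excerpt quoted further down in this thread, and the function name and the
10-second default threshold are hypothetical choices. A gap can also mean the
server was simply idle, so it narrows down where to look rather than proving
a JVM pause.

```python
import re
import sys
from datetime import datetime

# Matches the leading timestamp of a log line such as
# "2016-10-12 23:48:13,591 INFO  [main-SendThread(Node1:2181)] ..."
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def find_gaps(lines, threshold_s=10.0):
    """Yield (previous_ts, current_ts, gap_seconds) for silent stretches."""
    prev = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # stack-trace or continuation line without a timestamp
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap >= threshold_s:
                yield prev, ts, gap
        prev = ts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_gaps.py regionServer.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for before, after, seconds in find_gaps(log):
            print(f"{seconds:8.1f}s of silence between {before} and {after}")
```

Comparing any reported gaps against the timestamps in the GC log shows
whether a stall lines up with a collection or has some other cause.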

On Thu, Oct 20, 2016 at 3:07 AM, who.cat <who.cat@qq.com> wrote:

> Thanks, Ted. I uploaded another log:
> https://github.com/eswidy/waterspider/tree/master/rscase/rs-more.log
> Following your advice, I increased the tickTime and it works well at
> present. Maybe the problem is caused by bad I/O: I found the CPU I/O idle
> was always more than 70% during the heavy load. But would that make the
> JVM pause?
>
>
>
>
> ------------------ Original ------------------
> From:  "Ted Yu";<yuzhihong@gmail.com>;
> Send time: Thursday, Oct 20, 2016 10:27 AM
> To: "user@hbase.apache.org"<user@hbase.apache.org>;
>
> Subject:  Re: HBase resgionServer crashed with no gc detected
>
>
>
> Your zookeeper.session.timeout is set to 90000 but tickTime=2000.
> The maximum session timeout is bounded by 20 times tickTime.
>
> Please increase the tickTime in zoo.cfg.
>
> I don't see any region server log prior to 18:14:14,928.
>
> On Wed, Oct 19, 2016 at 7:13 PM, who.cat <who.cat@qq.com> wrote:
>
> > OK. I have posted more detailed RS and GC logs and the ZK and HBase configs:
> > https://github.com/eswidy/waterspider/tree/master/rscase
> > Thanks
> >
> >
> >
> >
> > ------------------ Original ------------------
> > From:  "Ted Yu";<yuzhihong@gmail.com>;
> > Date:  Oct 20, 2016
> > To:  "user@hbase.apache.org"<user@hbase.apache.org>;
> >
> > Subject:  Re: HBase resgionServer crashed with no gc detected
> >
> >
> >
> > There was one 25 second pause before the abort.
> >
> > Can you pastebin your hbase-site.xml (and zookeeper configs) ?
> >
> > Do you have more of the region server log (prior to 18:14:14,928) ?
> >
> > Thanks
> >
> > On Wed, Oct 19, 2016 at 6:01 PM, who.cat <who.cat@qq.com> wrote:
> >
> > > I've uploaded the file to GitHub; the URL is:
> > > https://github.com/eswidy/waterspider/blob/master/regionServer.log
> > >
> > > Thanks so much.
> > >
> > >
> > >
> > >
> > > ------------------ Original ------------------
> > > From:  "Ted Yu";<yuzhihong@gmail.com>;
> > > Date:  Oct 19, 2016
> > > To:  "user@hbase.apache.org"<user@hbase.apache.org>;
> > >
> > > Subject:  Re: HBase resgionServer crashed with no gc detected
> > >
> > >
> > >
> > > The log file was not delivered by the mailing list.
> > >
> > > Consider using pastebin or third party site.
> > >
> > > On Tue, Oct 18, 2016 at 10:38 PM, who.cat <who.cat@qq.com> wrote:
> > >
> > > > Thanks. Yes, I had not turned on DEBUG logging; I am trying it now.
> > > > I also suspected that heavy CPU load was the cause, but when I
> > > > checked, the highest CPU utilization (user) was 60%.
> > > > My region server GC parameters are:
> > > > export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:{{log_dir}}/gc.log-`date +'%Y%m%d%H%M'`"
> > > > The 10/12 log was rolled. I got the same crash log yesterday (10/18).
> > > > Details are in the attachment 'regionServer.log'; the JVM pause is at
> > > > "2016-10-17 18:44:07,232" in line 82.
> > > > Thanks so much.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > ------------------ Original ------------------
> > > > From:  "Ted Yu";<yuzhihong@gmail.com>;
> > > > Send time: Wednesday, Oct 19, 2016 11:17 AM
> > > > To: "user@hbase.apache.org"<user@hbase.apache.org>;
> > > > Subject:  Re: HBase resgionServer crashed with no gc detected
> > > >
> > > > Can you show more of the region server log prior to 23:48:13
> > > > (including the pause)?
> > > >
> > > > Was the region server under heavy load during the pause ?
> > > >
> > > > Consider turning on DEBUG logging if you haven't.
> > > >
> > > > Please also share GC parameters.
> > > >
> > > > Thanks
> > > >
> > > > On Tue, Oct 18, 2016 at 7:58 PM, who.cat <who.cat@qq.com> wrote:
> > > >
> > > > > Hi all:
> > > > > I have an HDP big data cluster with 4 nodes, created by Ambari;
> > > > > the HBase version is 1.1.2.
> > > > > While running YCSB as a benchmark, the RegionServer instance or the
> > > > > HMaster instance crashes, and its log shows:
> > > > >
> > > > > ---------------------log start ---------------------
> > > > > 2016-10-12 23:48:13,591 INFO  [main-SendThread(Node1:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x157b7f5f0bc0005, likely server has closed socket, closing socket connection and attempting reconnect
> > > > > 2016-10-12 23:48:13,595 INFO  [HBase-Metrics2-1] impl.MetricsSinkAdapter: Sink timeline started
> > > > > 2016-10-12 23:48:13,606 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
> > > > > 2016-10-12 23:48:13,606 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started
> > > > > 2016-10-12 23:48:14,496 INFO  [main-SendThread(Node4:2181)] zookeeper.ClientCnxn: Opening socket connection to server Node4/1.1.6.104:2181. Will not attempt to authenticate using SASL (unknown error)
> > > > > 2016-10-12 23:48:14,506 INFO  [main-SendThread(Node4:2181)] zookeeper.ClientCnxn: Socket connection established to Node4/1.17.6.104:2181, initiating session
> > > > > 2016-10-12 23:48:14,517 INFO  [main-SendThread(Node4:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x157b7f5f0bc0005 has expired, closing socket connection
> > > > > 2016-10-12 23:48:14,517 FATAL [main-EventThread] regionserver.HRegionServer: ABORTING region server node1,16020,1476260847716: regionserver:16020-0x157b7f5f0bc0005, quorum=node2:2181,node1:2181,node4:2181, baseZNode=/hbase-unsecure regionserver:16020-0x157b7f5f0bc0005 received expired from ZooKeeper, aborting
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> > > > >         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
> > > > >         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
> > > > >         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
> > > > >         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> > > > > 2016-10-12 23:48:14,518 FATAL [main-EventThread] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint]
> > > > > ---------------------log end---------------------
> > > > >
> > > > > After checking the log, it appears that the region server JVM paused
> > > > > for a long time, the ZooKeeper client could not send heartbeats, and
> > > > > the session timed out, as the reference guide describes:
> > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > So I read the log in detail to find the Java GC events, but no full
> > > > > GC occurred.
> > > > > I also found the same symptom in the DataNode instance.
> > > > >
> > > > > The node OS is CentOS 7, so I suspected the kernel futex bug, but
> > > > > after checking, that bug is already fixed in my OS.
> > > > > Is there any other factor besides Java GC that could cause this
> > > > > problem?
> > > > > Has anyone run into the same problem? Any ideas?
> > > > > Thank you.
> > > >
> > > >
> > >
> >
>
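
A note on the tickTime advice in this thread: the arithmetic can be sketched
as below. The 20x upper bound is the one stated in the thread; the 2x lower
bound is ZooKeeper's documented default for minSessionTimeout and is an
addition here, as are the function names, which are hypothetical. The sketch
assumes minSessionTimeout and maxSessionTimeout are not set explicitly in
zoo.cfg.

```python
import math

# ZooKeeper defaults: the server clamps a client's requested session timeout
# to the range [2 * tickTime, 20 * tickTime] unless minSessionTimeout or
# maxSessionTimeout is set explicitly in zoo.cfg (assumed unset here).
MIN_TICKS = 2
MAX_TICKS = 20


def negotiated_timeout_ms(requested_ms, tick_time_ms):
    """Return the session timeout the server would grant, in milliseconds."""
    lower = MIN_TICKS * tick_time_ms
    upper = MAX_TICKS * tick_time_ms
    return max(lower, min(requested_ms, upper))


def min_tick_time_ms(requested_ms):
    """Smallest tickTime for which the requested timeout is not capped."""
    return math.ceil(requested_ms / MAX_TICKS)


if __name__ == "__main__":
    # Values from this thread: zookeeper.session.timeout=90000, tickTime=2000
    print(negotiated_timeout_ms(90000, 2000))  # 40000
    print(min_tick_time_ms(90000))             # 4500
```

With the values from the thread, the requested 90000 ms is capped at 40000 ms,
and a tickTime of at least 4500 would be needed for the full 90 seconds to
take effect. Setting maxSessionTimeout in zoo.cfg is another way to raise the
cap without changing tickTime.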
