hbase-user mailing list archives

From 伍照坤 <tonywu...@gmail.com>
Subject Re: Balance to dead region server?
Date Wed, 09 Sep 2015 04:04:58 GMT
Hi, Ted

Thanks for your time on this issue.

Correct, the dead region server never came back to life, and nothing in the
HMaster log indicates that it did. But the HMaster still started to balance
regions onto the dead region server.
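
The guess from the original report quoted below (the dead server is still in the master's in-memory server list, and its region count of zero makes it the preferred target) can be illustrated with a toy model. This is a minimal sketch, not HBase's balancer: `plan_moves`, the server names `rs-a`, `rs-b`, `rs-dead`, and the even-out-the-counts rule are all assumptions made for illustration.

```python
def plan_moves(assignments, known_servers):
    """Toy balancer (hypothetical, not HBase code).

    assignments:   dict mapping region name -> server currently hosting it
    known_servers: servers the balancer believes are alive
    Returns a list of (region, src, dest) moves that evens out region counts.
    """
    load = {server: [] for server in known_servers}
    for region, server in sorted(assignments.items()):
        load[server].append(region)

    moves = []
    while True:
        src = max(load, key=lambda s: len(load[s]))
        dest = min(load, key=lambda s: len(load[s]))
        # Stop once the busiest and idlest servers differ by at most one region.
        if len(load[src]) - len(load[dest]) <= 1:
            return moves
        region = load[src].pop()
        load[dest].append(region)
        moves.append((region, src, dest))


# Six regions spread evenly over the two servers that are really alive.
assignments = {
    "r1": "rs-a", "r2": "rs-a", "r3": "rs-a",
    "r4": "rs-b", "r5": "rs-b", "r6": "rs-b",
}

# Stale view: the dead server is still listed and carries zero regions.
stale_plan = plan_moves(assignments, ["rs-dead", "rs-a", "rs-b"])
# Correct view: only the live servers are listed.
fresh_plan = plan_moves(assignments, ["rs-a", "rs-b"])

for region, src, dest in stale_plan:
    print(f"balance hri={region}, src={src}, dest={dest}")
```

With the stale list every planned move targets `rs-dead`; with the correct list no moves are planned. That would be consistent with the problem disappearing after the HMaster restart, if the restart cleared a stale entry.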

I will enable DEBUG logging in the future.
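
For the redaction Ted asks for in the quoted message below, a consistent renaming can be scripted. This is a minimal sketch, assuming host names can be captured by a regular expression; `redact_servers`, the `serverN` aliases, and the pattern are hypothetical choices, not an HBase tool.

```python
import re


def redact_servers(text, host_pattern):
    """Replace every host matched by host_pattern with a stable alias.

    host_pattern must contain one capturing group that identifies the host;
    the same host always maps to the same alias, and no two hosts share one.
    Returns the redacted text and the host -> alias mapping.
    """
    regex = re.compile(host_pattern)
    aliases = {}

    def replace(match):
        host = match.group(1)
        if host not in aliases:
            aliases[host] = f"server{len(aliases) + 1}"
        return aliases[host]

    return regex.sub(replace, text), aliases


# Pattern is an assumption: short host name, optionally followed by the domain.
PATTERN = r"\b(e3ecmrhdp\d+)(?:\.mercury\.corp)?\b"

sample = (
    "src=e3ecmrhdp33.mercury.corp,60020,1438626881418, "
    "dest=e3ecmrhdp24.mercury.corp,60020,1438626879309 "
    "dead server: e3ecmrhdp24"
)
redacted, mapping = redact_servers(sample, PATTERN)
print(redacted)
```

Keeping the returned mapping private lets the log owner translate aliases back to real hosts when answering follow-up questions.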


On Tuesday, September 8, 2015, Ted Yu <yuzhihong@gmail.com> wrote:

> This was the first occurrence of balancing onto e3ecmrhdp24 :
>
> 2015-09-03 18:00:31,137 INFO org.apache.hadoop.hbase.master.HMaster:
> balance
> hri=ecitem:IM_ItemBase,69,1440541971138.93a12ec8a63d6954e0432e8b9d7c0922.,
> src=e3ecmrhdp33.mercury.corp,60020,1438626881418,
> dest=e3ecmrhdp24.mercury.corp,60020,1438626879309
>
> Prior to the above, there was no indication that e3ecmrhdp24 came back to
> life, because it didn't.
>
> I noticed that DEBUG logging was off. Is it possible to turn on DEBUG
> logging ?
>
> BTW please redact server names in the logs you upload in the future (e.g.
> you can call e3ecmrhdp24 X as long as all occurrences of e3ecmrhdp24 are
> called X but no other server is called X).
>
> Cheers
>
> On Tue, Sep 8, 2015 at 6:14 PM, 伍照坤 <tonywutao@gmail.com> wrote:
>
> > Hi, Ted
> >
> > Thanks, I uploaded the logs as a tar.gz to Dropbox:
> > https://www.dropbox.com/s/czes89w5r3rr1wa/hbase-log.tar.gz?dl=0
> >
> >
> > The dead server's name: e3ecmrhdp24
> >
> > It looks like the master started to balance regions onto the dead node
> > after I truncated another table.
> >
> > ------------------------------
> > 2015-09-03 17:57:28,689 INFO org.apache.hadoop.hbase.master.HMaster:
> > Client=tw79//172.16.31.133 truncate ecitem:IVT_ItemInventory
> >
> >
> > -----------------
> >
> > 2015-09-08 17:47 GMT-07:00 Ted Yu <yuzhihong@gmail.com>:
> >
> > > Can you pastebin more of the master log after 15:29:33,856 w.r.t.
> > > e3ecmrhdp24 ?
> > >
> > > I wonder how master thought e3ecmrhdp24 became live again.
> > >
> > > On Tue, Sep 8, 2015 at 5:37 PM, 伍照坤 <tonywutao@gmail.com> wrote:
> > >
> > > > Hi, Ted
> > > >
> > > > Thanks for the reply.
> > > >
> > > > Here is the master log from when it shut down this region server. The
> > > > region server never started again.
> > > > -----------------------
> > > > 2015-09-03 15:29:33,738 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [e3ecmrhdp24.mercury.corp,60020,1441316616368]
> > > > 2015-09-03 15:29:33,848 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for e3ecmrhdp24.mercury.corp,60020,1441316616368 before assignment; region count=0
> > > > 2015-09-03 15:29:33,851 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [e3ecmrhdp24.mercury.corp,60020,1441316616368]
> > > > 2015-09-03 15:29:33,853 INFO org.apache.hadoop.hbase.master.SplitLogManager: hdfs://nameservice1/hbase/WALs/e3ecmrhdp24.mercury.corp,60020,1441316616368-splitting is empty dir, no logs to split
> > > > 2015-09-03 15:29:33,853 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 0 logs in [hdfs://nameservice1/hbase/WALs/e3ecmrhdp24.mercury.corp,60020,1441316616368-splitting] for [e3ecmrhdp24.mercury.corp,60020,1441316616368]
> > > > 2015-09-03 15:29:33,855 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 0 bytes in 0 log files in [hdfs://nameservice1/hbase/WALs/e3ecmrhdp24.mercury.corp,60020,1441316616368-splitting] in 2ms
> > > > 2015-09-03 15:29:33,856 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 0 region(s) that e3ecmrhdp24.mercury.corp,60020,1441316616368 was carrying (and 0 regions(s) that were opening on this server)
> > > > 2015-09-03 15:29:33,856 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of e3ecmrhdp24.mercury.corp,60020,1441316616368
> > > > 2015-09-03 15:29:36,399 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache: totalSize=417.02 KB, freeSize=395.54 MB, max=395.95 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=269245, evicted=0, evictedPerRun=0.0
> > > >
> > > >
> > > > 2015-09-08 17:25 GMT-07:00 Ted Yu <yuzhihong@gmail.com>:
> > > >
> > > > > > Can you pastebin the master log snippet regarding the dead server?
> > > > >
> > > > >
> > > > >
> > > > > > > On Sep 8, 2015, at 5:16 PM, 伍照坤 <tonywutao@gmail.com> wrote:
> > > > > >
> > > > > > Hi, Guys
> > > > > >
> > > > > > > I encountered a serious problem in production: the HMaster
> > > > > > > schedules lots of balance jobs to a dead node.
> > > > > >
> > > > > > > Environment: hbase-1.0.0-cdh5.4.0, hadoop-2.6.0-cdh5.4.0,
> > > > > > > zookeeper-3.4.5-cdh5.4.0
> > > > > >
> > > > > > > The region server e3ecmrhdp24 has been dead since 09/03/2015.
> > > > > > > I checked ZooKeeper /hbase/rs and the HBase Web UI, and this
> > > > > > > server is listed as a dead node.
> > > > > > >
> > > > > > > But the HMaster still schedules lots of balance jobs to
> > > > > > > e3ecmrhdp24 after this region server died.
> > > > > > >
> > > > > > > The balance job runs every 5 minutes and schedules 60000+ region
> > > > > > > balances onto this dead region server.
> > > > > > >
> > > > > > > #1 The balancer on the HMaster schedules a region to balance to
> > > > > > > e3ecmrhdp24.
> > > > > > > #2 After 1 second, the HMaster assigns this region to another
> > > > > > > region server.
> > > > > > >
> > > > > > > My guess:
> > > > > > > #1 e3ecmrhdp24 is still a live node in the HMaster's memory.
> > > > > > > #2 The number of regions on e3ecmrhdp24 is less than the balance
> > > > > > > ratio, so the balancer always schedules regions to this dead server.
> > > > > > >
> > > > > > > After I restarted the HMaster, the problem was gone.
> > > > > > >
> > > > > > > This looks like a critical bug in HBase. Any hints?
> > > > > >
> > > > > >
> > > > > >
> > > > > > ​
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Sincerely.
伍 涛 | Tony Wu
