hbase-user mailing list archives

From 伍照坤 <tonywu...@gmail.com>
Subject Re: Balance to dead region server?
Date Wed, 09 Sep 2015 01:14:15 GMT
Hi, Ted

Thanks. I uploaded the logs as a tar.gz archive to Dropbox:
https://www.dropbox.com/s/czes89w5r3rr1wa/hbase-log.tar.gz?dl=0


The dead server's name is e3ecmrhdp24.

It looks like the master started to balance regions to the dead node
after I truncated another table.

------------------------------
2015-09-03 17:57:28,689 INFO org.apache.hadoop.hbase.master.HMaster:
Client=tw79//172.16.31.133 truncate ecitem:IVT_ItemInventory


-----------------
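
In case it helps anyone digging through the attached master log, here is a rough Python sketch for pulling out only the entries that mention the dead server after the shutdown handling finished at 15:29:33,856. It assumes each log entry starts with a log4j-style timestamp like the ones quoted in this thread; the function name `entries_after` and the file name in the usage comment are my own placeholders, not anything from HBase.

```python
import re
from datetime import datetime

# Assumption: every log entry begins with a timestamp such as
# "2015-09-03 15:29:33,856"; lines without one (e.g. stack traces)
# are treated as continuations of the previous entry.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")
TS_FMT = "%Y-%m-%d %H:%M:%S,%f"


def _wanted(entry, stamp, host, cutoff):
    """True if the buffered entry is later than cutoff and mentions host."""
    return stamp is not None and stamp > cutoff and any(host in l for l in entry)


def entries_after(lines, host, after):
    """Yield entries mentioning `host` that are stamped strictly after `after`."""
    cutoff = datetime.strptime(after, TS_FMT)
    entry, stamp = [], None
    for raw in lines:
        line = raw.rstrip("\n")
        m = TS_RE.match(line)
        if m:
            # A new timestamp closes the previous entry.
            if _wanted(entry, stamp, host, cutoff):
                yield "\n".join(entry)
            entry, stamp = [line], datetime.strptime(m.group(1), TS_FMT)
        elif entry:
            entry.append(line)
    # Flush the last buffered entry.
    if _wanted(entry, stamp, host, cutoff):
        yield "\n".join(entry)


# Usage (the file name is a placeholder for whatever the archive contains):
# with open("hbase-master.log", encoding="utf-8", errors="replace") as f:
#     for e in entries_after(f, "e3ecmrhdp24", "2015-09-03 15:29:33,856"):
#         print(e)
```

Matching on the short host name also catches the fully qualified form (e3ecmrhdp24.mercury.corp,60020,...), so balancer and assignment lines should both show up if they name the server.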

2015-09-08 17:47 GMT-07:00 Ted Yu <yuzhihong@gmail.com>:

> Can you pastebin more of the master log after 15:29:33,856 w.r.t.
> e3ecmrhdp24 ?
>
> I wonder how master thought e3ecmrhdp24 became live again.
>
> On Tue, Sep 8, 2015 at 5:37 PM, 伍照坤 <tonywutao@gmail.com> wrote:
>
> > Hi, Ted
> >
> > Thanks for reply.
> >
> > Here is the master log from when it shut down this region server. The
> > server never started again.
> > -----------------------
> > 2015-09-03 15:29:33,738 INFO
> > org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
> > ephemeral node deleted, processing expiration
> > [e3ecmrhdp24.mercury.corp,60020,1441316616368]
> > 2015-09-03 15:29:33,848 INFO
> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
> > logs for e3ecmrhdp24.mercury.corp,60020,1441316616368 before assignment;
> > region count=0
> > 2015-09-03 15:29:33,851 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers
> > [e3ecmrhdp24.mercury.corp,60020,1441316616368]
> > 2015-09-03 15:29:33,853 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager:
> > hdfs://nameservice1/hbase/WALs/e3ecmrhdp24.mercury.corp,60020,1441316616368-splitting
> > is empty dir, no logs to split
> > 2015-09-03 15:29:33,853 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: started splitting 0 logs in
> > [hdfs://nameservice1/hbase/WALs/e3ecmrhdp24.mercury.corp,60020,1441316616368-splitting]
> > for [e3ecmrhdp24.mercury.corp,60020,1441316616368]
> > 2015-09-03 15:29:33,855 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more
> > than or equal to) 0 bytes in 0 log files in
> > [hdfs://nameservice1/hbase/WALs/e3ecmrhdp24.mercury.corp,60020,1441316616368-splitting]
> > in 2ms
> > 2015-09-03 15:29:33,856 INFO
> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 0
> > region(s) that e3ecmrhdp24.mercury.corp,60020,1441316616368 was carrying
> > (and 0 regions(s) that were opening on this server)
> > 2015-09-03 15:29:33,856 INFO
> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished
> > processing of shutdown of e3ecmrhdp24.mercury.corp,60020,1441316616368
> > 2015-09-03 15:29:36,399 INFO
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: totalSize=417.02 KB,
> > freeSize=395.54 MB, max=395.95 MB, blockCount=0, accesses=0, hits=0,
> > hitRatio=0, cachingAccesses=0, cachingHits=0,
> > cachingHitsRatio=0,evictions=269245, evicted=0, evictedPerRun=0.0
> >
> >
> > 2015-09-08 17:25 GMT-07:00 Ted Yu <yuzhihong@gmail.com>:
> >
> > > Can you pastebin master log snippet with regard to the dead server ?
> > >
> > >
> > >
> > > > On Sep 8, 2015, at 5:16 PM, 伍照坤 <tonywutao@gmail.com> wrote:
> > > >
> > > > Hi, guys
> > > >
> > > > I encountered a serious problem in production: the HMaster schedules
> > > > lots of balance jobs to a dead node.
> > > >
> > > > Environment: hbase-1.0.0-cdh.4.0, hadoop-2.6.0-cdh5.4.0,
> > > > zookeeper-3.4.5-cdh5.4.0
> > > >
> > > > The region server e3ecmrhdp24 has been dead since 09/03/2015.
> > > > I checked ZooKeeper /hbase/rs and the HBase web UI; both show this
> > > > server as a dead node.
> > > >
> > > > But the HMaster still schedules lots of balance jobs to e3ecmrhdp24
> > > > after this region server died.
> > > >
> > > > The balance job runs every 5 minutes and schedules 60000+ region
> > > > balance operations on this dead region server.
> > > >
> > > > #1 The balancer on the HMaster schedules a region to balance to
> > > > e3ecmrhdp24.
> > > > #2 After 1 second, the HMaster assigns this region to another region
> > > > server.
> > > >
> > > > My guess:
> > > > #1 e3ecmrhdp24 is still a live node in HMaster memory.
> > > > #2 The number of regions on e3ecmrhdp24 is less than the balance
> > > > ratio, so the balancer always schedules regions to this dead server.
> > > >
> > > > After I restarted the HMaster, the problem was gone.
> > > >
> > > > This looks like a critical bug in HBase. Any hints?
> > > >
> > >
> >
>
