hbase-user mailing list archives

From "Ramkrishna.S.Vasudevan" <ramkrishna.vasude...@huawei.com>
Subject RE: one RegionServer crashed and the whole cluster was blocked
Date Thu, 18 Oct 2012 12:15:22 GMT
>   For 1, I knew the cluster began to split log and recover the data on
> the crashed RegionServer, will the recovery operation block all the
> requests from the client side?


Ideally it should not.  But if your client was sending data to regions
that were hosted on the dead RegionServer at that time, those client
requests will not be served until the regions are back online on some
other region server after log splitting.
Client requests going to other region servers should ideally keep working.
Did you look at the thread dumps on the other RSs at that time? They should
give some clue.

>   For 2, Is there any solution to reduce the recovery time?
The recovery time depends on the amount of data and particularly on the size
of the HLog files.  By default every HLog file is of size 256MB.
In 0.94.0 a good number of changes have gone in to make recovery faster
in terms of HLog splitting.


> 3.       I have set hbase.regionserver.restart.on.zk.expire to true,
> but it does not work.
I am not sure how the code handles this property.  I will check this
part.
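
For reference, a minimal sketch of how this setting would normally appear in
hbase-site.xml.  It assumes the property is set on every RegionServer and
that the RegionServers are restarted afterwards; it only shows the form of
the entry and does not confirm that 0.94.0 acts on it.

```xml
<!-- hbase-site.xml on each RegionServer (assumed location) -->
<property>
  <!-- property name as given in the original question -->
  <name>hbase.regionserver.restart.on.zk.expire</name>
  <value>true</value>
</property>
```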

Regards
Ram



> -----Original Message-----
> From: 张磊 [mailto:zhanglei@youku.com]
> Sent: Thursday, October 18, 2012 5:01 PM
> To: user@hbase.apache.org
> Subject: one RegionServer crashed and the whole cluster was blocked
> 
> Hi, All
> 
>   One of the RegionServers in our company’s cluster crashed. At this
> time, I found:
> 
> 1.       All the RegionServers stopped handling requests from the
> client side (requestsPerSecond=0 on the master-status UI page).
> 
> 2.       It took about 12-15 minutes to recover.
> 
> 3.       I have set hbase.regionserver.restart.on.zk.expire to true,
> but it does not work.
> 
>   For 1, I know the cluster began to split the log and recover the data
> on the crashed RegionServer. Will the recovery operation block all the
> requests from the client side?
> 
>   For 2, is there any way to reduce the recovery time?
> 
>   For 3, I checked the log and found a “session is timeout” exception,
> maybe because a full GC caused the session to time out. But why does
> hbase.regionserver.restart.on.zk.expire not work? My HBase version is
> 0.94.0.
> 
> 
> 
>   Thanks for any suggestions and feedback!
> 
> 
> 
> Fowler Zhang
> 
> 


