From: "Ramkrishna.S.Vasudevan" <ramkrishna.vasudevan@huawei.com>
To: <user@hbase.apache.org>
Subject: RE: one RegionServer crashed and the whole cluster was blocked
Date: Thu, 18 Oct 2012 17:45:22 +0530

>   For 1, I knew the cluster began to split log and recover the data on the
> crashed RegionServer, will the recovery operation block all the requests
> from the client side?


Ideally it should not.  But if your client was generating data for regions
that were hosted on the dead RegionServer at that time, those client requests
will not be served until the regions come back online on some other region
server after log splitting.
Any client requests going to other region servers should ideally keep working.
Did you look at the thread dumps on the other RSs at that time? That should
give some clue.

>   For 2, Is there any solution to reduce the recovery time?
The recovery time depends on the amount of data, and particularly on the size
of the HLog files.  By default every HLog file is 256MB in size.
In 0.94.0 a good number of changes have gone in to make recovery faster
in terms of HLog splitting.


> 3.       I have set hbase.regionserver.restart.on.zk.expire to true, but it
> does not work.
I am not very sure how the code works with this property.  I will check this
part.
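
For reference, a minimal sketch of how that setting would typically look in
hbase-site.xml.  This assumes the property was set in that file on the
RegionServers; the property name and value are the ones quoted above, and
nothing here says how the code acts on it.

```xml
<!-- hbase-site.xml fragment (assumed location of the setting) -->
<property>
  <name>hbase.regionserver.restart.on.zk.expire</name>
  <value>true</value>
</property>
```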

Regards
Ram


> -----Original Message-----
> From: 张磊 [mailto:zhanglei@youku.com]
> Sent: Thursday, October 18, 2012 5:01 PM
> To: user@hbase.apache.org
> Subject: one RegionServer crashed and the whole cluster was blocked
>
> Hi, All
>
>   One of the RegionServers in our company's cluster crashed. At this
> time, I found:
>
> 1.       All the RegionServers stopped handling requests from the client
> side (requestsPerSecond=0 on the master-status UI page).
>
> 2.       It takes about 12-15 minutes to recover.
>
> 3.       I have set hbase.regionserver.restart.on.zk.expire to true, but it
> does not work.
>
>   For 1, I knew the cluster began to split log and recover the data on the
> crashed RegionServer, will the recovery operation block all the requests
> from the client side?
>
>   For 2, Is there any solution to reduce the recovery time?
>
>   For 3, I checked the log and found a "session is timeout" exception,
> maybe because of a full GC that made the session time out. But why does
> hbase.regionserver.restart.on.zk.expire not work? My HBase version is
> 0.94.0.
>
>   Thanks for any suggestions and feedback!
>
> Fowler Zhang
>