From: "Ramkrishna.S.Vasudevan"
Subject: RE: one RegionServer crashed and the whole cluster was blocked
Date: Thu, 18 Oct 2012 17:45:22 +0530

> For 1, I knew the cluster began to split log and recover the data on
> the crashed RegionServer, will the recovery operation block all the
> requests from the client side?

Ideally it should not. But if your client was generating data for the
regions that were offline at that time, those client requests will not
be served until the regions are back online on some other region server
after log splitting. Any client requests going to other region servers
should ideally keep working.

Did you look at the thread dumps on the other RSs at that time? They
should give some clue.

> For 2, Is there any solution to reduce the recovery time?

The recovery time depends on the amount of data, and particularly on the
size of the HLog files. By default every HLog file is of size 256MB.
In 0.94.0 a good number of changes have gone in to make recovery faster
in terms of HLog splitting.

> 3. I have set hbase.regionserver.restart.on.zk.expire to true, but it
> does not work.

I am not very sure how the code works with this property. I will check
this part.

Regards
Ram

> -----Original Message-----
> From: 张磊 [mailto:zhanglei@youku.com]
> Sent: Thursday, October 18, 2012 5:01 PM
> To: user@hbase.apache.org
> Subject: one RegionServer crashed and the whole cluster was blocked
>
> Hi, All
>
> One of the RegionServers in our company’s cluster crashed. At that
> time, I found:
>
> 1. All the RegionServers stopped handling requests from the client
>    side (requestsPerSecond=0 on the master-status UI page).
>
> 2. It takes about 12-15 minutes to recover.
>
> 3. I have set hbase.regionserver.restart.on.zk.expire to true, but it
>    does not work.
>
> For 1, I knew the cluster began to split log and recover the data on
> the crashed RegionServer, will the recovery operation block all the
> requests from the client side?
>
> For 2, Is there any solution to reduce the recovery time?
>
> For 3, I checked the log and found a “session is timeout” exception;
> maybe a full GC happened and the session timed out. But why does
> hbase.regionserver.restart.on.zk.expire not work? My HBase version is
> 0.94.0.
>
> Thanks for any suggestions and feedback!
>
> Fowler Zhang
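
For reference on point 3, the property from the original message would normally be set in hbase-site.xml on each RegionServer. The fragment below is only a sketch: the first property and its value come from the thread, while zookeeper.session.timeout and the 180000 ms value are assumptions added for illustration, since the original poster suspects a full GC pause outlasted the ZooKeeper session. It is not a confirmed fix for the behaviour described.

```xml
<!-- hbase-site.xml fragment (sketch only; not a verified fix) -->
<configuration>
  <!-- From the thread: the setting the original poster reports enabling -->
  <property>
    <name>hbase.regionserver.restart.on.zk.expire</name>
    <value>true</value>
  </property>
  <!-- Assumption, not from the thread: session timeout in milliseconds.
       The value is illustrative, not a recommendation. -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>180000</value>
  </property>
</configuration>
```

Note that the ZooKeeper server can cap the session timeout a client asks for (its maxSessionTimeout setting), so the effective timeout may be lower than the value configured here.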