Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of zhoushuaifeng@huawei.com
 designates 119.145.14.64 as permitted sender)
Date: Tue, 21 Dec 2010 11:08:05 +0800
From: Zhou Shuaifeng <zhoushuaifeng@huawei.com>
Subject: Re: all regionserver shutdown after close hdfs datanode
In-reply-to: <4D0F7A43.9070007@1and1.ro>
To: user@hbase.apache.org
Cc: yanlijun@huawei.com, syang@huawei.com
Message-id: <010001cba0bc$4b8256c0$e2870440$@com>
MIME-version: 1.0
Content-type: text/plain; charset=gb2312
Content-language: zh-cn
Content-transfer-encoding: quoted-printable
Thread-index: AcugXSObUULdRyceQs2viTJFf/yhqQAW/f4Q
References: <00be01cb9ffc$8e36eb90$aaa4c2b0$@com> <4D0F7A43.9070007@1and1.ro>

Hi,
I checked the log. The master did not cause the regionserver shutdown;
the regionserver shut down because its log rolling failed.

According to the log, the error occurred in the write pipeline. But why is
HDFS not able to select another good datanode when one datanode in the
pipeline is unavailable?


The log:
2010-12-20 09:15:41,769 FATAL org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with ioe:

java.io.IOException: Error Recovery for block blk_1292656843439_2494096
failed  because recovery from primary datanode 167.6.5.17:50010 failed 6
times.  Pipeline was 167.6.5.17:50010. Aborting...
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3249)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2654)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2837)
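
Note that the message lists a single datanode as the whole pipeline, and it is the same node whose recovery failed. To make that easier to spot in other logs, here is a minimal sketch (my own helper, not HBase or HDFS code) that pulls the block id, primary datanode, retry count, and pipeline members out of the exception text. The class name PipelineError is made up, the pattern is written only against the message quoted above, and the comma separator for a multi-node pipeline is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipelineError {
    // Written against the message quoted above; other HDFS versions may word it differently.
    private static final Pattern MESSAGE = Pattern.compile(
        "block (blk_\\S+) failed\\s+because recovery from primary datanode (\\S+)"
            + " failed (\\d+) times\\.\\s+Pipeline was (.+?)\\. Aborting");

    final String block;
    final String primaryDatanode;
    final int attempts;
    final List<String> pipeline;

    private PipelineError(String block, String primaryDatanode,
                          int attempts, List<String> pipeline) {
        this.block = block;
        this.primaryDatanode = primaryDatanode;
        this.attempts = attempts;
        this.pipeline = pipeline;
    }

    /** Returns the parsed fields, or null if the text does not match the pattern. */
    static PipelineError parse(String text) {
        // Collapse line wraps and repeated spaces before matching.
        Matcher m = MESSAGE.matcher(text.replaceAll("\\s+", " "));
        if (!m.find()) {
            return null;
        }
        // Assumption: several pipeline members would be separated by commas.
        List<String> members = Arrays.asList(m.group(4).split(",\\s*"));
        return new PipelineError(m.group(1), m.group(2),
            Integer.parseInt(m.group(3)), members);
    }

    public static void main(String[] args) {
        PipelineError e = parse(
            "java.io.IOException: Error Recovery for block blk_1292656843439_2494096\n"
            + "failed  because recovery from primary datanode 167.6.5.17:50010 failed 6\n"
            + "times.  Pipeline was 167.6.5.17:50010. Aborting...");
        System.out.println(e.block + " primary=" + e.primaryDatanode
            + " attempts=" + e.attempts + " pipeline=" + e.pipeline);
    }
}
```

A pipeline of size 1 whose only member equals the primary datanode means the writer had no other node left in that pipeline to fall back to.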

The corresponding code in the regionserver:
        LOG.fatal("Log rolling failed with ioe: ",
          RemoteExceptionHandler.checkIOException(ex));
        server.checkFileSystem();
        // Abort if we get here.  We probably won't recover an IOE. HBASE-1132
        server.abort();

The abort() code:
  public void abort() {
    this.abortRequested = true;
    this.reservedSpace.clear();
    LOG.info("Dump of metrics: " + this.metrics.toString());
    stop();
  }

The corresponding log:
2010-12-20 09:15:41,777 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=9.666667, regions=1512, stores=1512, storefiles=5833,
storefileIndexSize=1833, memstoreSize=2941, compactionQueueSize=1228,
usedHeap=6849, maxHeap=8165, blockCacheSize=14047672,
blockCacheFree=1698276936, blockCacheCount=0, blockCacheHitRatio=0,
fsReadLatency=0, fsWriteLatency=59, fsSyncLatency=0
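
For comparing dumps like this across regionservers, a minimal sketch (not HBase code) that splits a "Dump of metrics" line into key/value pairs may help. The class name MetricsDump and the method parse are made up for illustration, and values are kept as strings because the dump mixes integers and decimals.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetricsDump {
    /** Splits "k1=v1, k2=v2" into an ordered map; entries without '=' are skipped. */
    static Map<String, String> parse(String line) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String pair : line.split(",")) {
            String p = pair.trim();
            int eq = p.indexOf('=');
            if (eq > 0) {
                out.put(p.substring(0, eq), p.substring(eq + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A shortened copy of the dump quoted above.
        String dump = "request=9.666667, regions=1512, stores=1512, storefiles=5833, "
            + "compactionQueueSize=1228, usedHeap=6849, maxHeap=8165";
        Map<String, String> m = parse(dump);
        System.out.println("regions=" + m.get("regions")
            + " compactionQueueSize=" + m.get("compactionQueueSize"));
    }
}
```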


Zhou Shuaifeng(Frank)
HUAWEI TECHNOLOGIES CO.,LTD.


-----Original Message-----
From: Daniel Iancu [mailto:daniel.iancu@1and1.ro]
Sent: December 20, 2010 23:46
To: user@hbase.apache.org
Subject: Re: all regionserver shutdown after close hdfs datanode

Hi Zhou,
You should check whether the HMaster is still up. If not, check its logs:
if for some reason the HMaster thinks HDFS is not available, it will
shut down the HBase cluster.
Regards,
Daniel

On 12/20/2010 06:15 AM, Zhou Shuaifeng wrote:
> Hi,
>
>
>
> I have a cluster of 8 HDFS datanodes and 8 HBase regionservers. When I
> shut down one node (a PC running one datanode and one regionserver),
> all HBase regionservers shut down after a while.
>
> The other 7 HDFS datanodes are OK.
>
>
>
> I don't think this is reasonable. HBase is a distributed system that
> should tolerate some abnormal nodes. So what is the matter? Is there any
> configuration that can solve this problem, or is it a bug?
>
>
>
> Thanks and best Regards.
>
>
>
> Zhou

-- 
Daniel Iancu
Java Developer,Web Components Romania
1&1 Internet Development srl.
18 Mircea Eliade St
Sect 1, Bucharest
RO Bucharest, 012015
www.1and1.ro
Phone:+40-031-223-9081
Email:daniel.iancu@1and1.ro
IM:diancu@united.domain