From: Ted Yu
Subject: Re: hbase cannot normally start regionserver in the environment of big data.
Date: Fri, 7 Nov 2014 05:28:04 -0800
To: "user@hbase.apache.org"
Cc: user

Please pastebin the log from the region server around the time it became dead.

What HBase / Hadoop versions are you using?

Anything interesting in the master log?

Thanks

On Nov 7, 2014, at 4:57 AM, Jean-Marc Spaggiari wrote:

> Hi,
>
> Have you checked that your Hadoop is running fine? Have you checked that
> the network between your servers is fine too?
>
> JM
>
> 2014-11-07 5:22 GMT-05:00 hankedang@sina.cn:
>
>> I've deployed a "2+4" cluster which had been running normally for a
>> long time. The cluster holds more than 40T of data. When I deliberately
>> shut down the HBase service and try to restart it, the regionservers die.
>>
>> The regionserver log shows that all the regions are opened, but the
>> datanode logs contain WARN and ERROR entries. Below are the logs in detail:
>>
>> 2014-11-07 14:47:21,584 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.230.63.12:50010, dest: /10.230.63.9:39405, bytes: 4696, op: HDFS_READ, cliID: DFSClient_hb_rs_salve1,60020,1415342303886_-2037622978_29, offset: 31996928, srvID: bb0032a3-1170-4a34-b85b-e2cfa0d56cb2, blockid: BP-1731746090-10.230.63.3-1406195669990:blk_1078709392_4968828, duration: 7978822
>> 2014-11-07 14:47:21,596 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: exception:
>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.230.63.12:50010 remote=/10.230.63.11:41511]
>>     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
>>     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:712)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:479)
>>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
>>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
>>     at java.lang.Thread.run(Thread.java:744)
>> 2014-11-07 14:47:21,599 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.230.63.12:50010, dest: /10.230.63.11:41511, bytes: 726528, op: HDFS_READ, cliID: DFSClient_hb_rs_salve3,60020,1415342303807_1094119849_29, offset: 0, srvID: bb0032a3-1170-4a34-b85b-e2cfa0d56cb2, blockid: BP-1731746090-10.230.63.3-1406195669990:blk_1078034913_4294168, duration: 480190668115
>> 2014-11-07 14:47:21,599 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.230.63.12, datanodeUuid=bb0032a3-1170-4a34-b85b-e2cfa0d56cb2, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=cluster12;nsid=395652542;c=0):Got exception while serving BP-1731746090-10.230.63.3-1406195669990:blk_1078034913_4294168 to /10.230.63.11:41511
>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.230.63.12:50010 remote=/10.230.63.11:41511]
>>     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
>>     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:712)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:479)
>>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
>>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
>>     at java.lang.Thread.run(Thread.java:744)
>> 2014-11-07 14:47:21,600 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: salve4:50010:DataXceiver error processing READ_BLOCK operation src: /10.230.63.11:41511 dest: /10.230.63.12:50010
>>
>> I personally think this happens during the load-on-open stage, when the
>> disk IO of the cluster is very high and the pressure is huge.
>>
>> I wonder what causes the read errors while the HFiles are being read,
>> and what leads to the timeout. Are there any solutions that can throttle
>> the load-on-open speed and reduce the pressure on the cluster?
>>
>> I need help!
>>
>> Thanks!
>>
>> hankedang@sina.cn
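
P.S. The 480000 millis in the exceptions above is the stock DataNode write
timeout, and each regionserver opens regions through a fixed-size executor,
so two configuration knobs may be worth experimenting with. This is only a
sketch: the values below are guesses to tune for your cluster, and you
should confirm both property names are honored by your Hadoop / HBase
versions before relying on them.

In hdfs-site.xml, give slow readers more headroom before the DataNode
gives up on the socket:

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <!-- illustrative value: double the 480000 ms (8 min) default -->
    <value>960000</value>
  </property>

In hbase-site.xml, reduce how many regions a regionserver opens
concurrently (the default executor uses 3 threads), which should lower
the disk IO spike during startup at the cost of a slower open phase:

  <property>
    <name>hbase.regionserver.executor.openregion.threads</name>
    <!-- illustrative value: serialize region opens on each server -->
    <value>1</value>
  </property>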