From: Nicolas Liochon <nkeywal@gmail.com>
Date: Mon, 25 Feb 2013 11:07:27 +0100
Subject: Re: Datanodes shutdown and HBase's regionservers not working
To: user@hadoop.apache.org

I agree. Then for HDFS, ... the first thing to check is the network, I would say.

On Mon, Feb 25, 2013 at 10:46 AM, Davey Yan wrote:
> Thanks for the reply, Nicolas.
>
> My question: What can lead to a shutdown of all of the datanodes?
> I believe that the regionservers will be OK if the HDFS is OK.
>
>
> On Mon, Feb 25, 2013 at 5:31 PM, Nicolas Liochon <nkeywal@gmail.com> wrote:
> > Ok, what's your question?
> > When you say the datanode went down, was it the datanode processes, or the
> > machines, with both the datanodes and the regionservers?
> >
> > The NameNode pings its datanodes every 3 seconds. However, it will
> > internally mark the datanodes as dead only after 10:30 minutes (even if in
> > the GUI you have "no answer for x minutes").
> > HBase monitoring is done by ZooKeeper. By default, a regionserver is
> > considered dead after 180s with no answer; before that, it is considered
> > live.
> > When you stop a regionserver, it tries to flush its data to the disk (i.e.
> > HDFS, i.e. the datanodes). That's why, if you have no datanodes, or if a
> > high ratio of your datanodes are dead, it can't shut down. Connection
> > refused & socket timeouts come from the fact that before the 10:30 minutes
> > HDFS does not declare the nodes as dead, so HBase tries to use them (and,
> > obviously, fails).
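The 10:30 figure quoted above falls out of the NameNode's dead-node expiry formula, 2 x the recheck interval plus 10 x the heartbeat interval, with the Hadoop 1.x defaults of 5 minutes (`heartbeat.recheck.interval`) and 3 seconds (`dfs.heartbeat.interval`). A minimal sketch of the arithmetic:

```python
# Dead-node detection with Hadoop 1.x defaults (sketch; the two
# properties are heartbeat.recheck.interval and dfs.heartbeat.interval).
heartbeat_interval_s = 3       # datanode heartbeats every 3 seconds
recheck_interval_s = 5 * 60    # namenode recheck interval, 300 s by default

# The namenode declares a datanode dead once it has been silent for
# 2 * recheck + 10 * heartbeat.
expire_s = 2 * recheck_interval_s + 10 * heartbeat_interval_s
print(expire_s)                                          # 630 seconds
print(f"{expire_s // 60}:{expire_s % 60:02d} minutes")   # 10:30 minutes
```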
> > Note that there is now an intermediate state for HDFS datanodes, called
> > "stale": the datanode is used only if you have to (i.e. it's the only
> > datanode with a block replica you need). It will be documented in HBase
> > for the 0.96 release. But if all your datanodes are down, it won't
> > change much.
> >
> > Cheers,
> >
> > Nicolas
> >
> >
> > On Mon, Feb 25, 2013 at 10:10 AM, Davey Yan <davey.yan@gmail.com> wrote:
> >>
> >> Hey guys,
> >>
> >> We have a cluster with 5 nodes (1 NN and 4 DNs) that has been running for
> >> more than 1 year, and it works fine.
> >> But the datanodes got shut down twice in the last month.
> >>
> >> When the datanodes got shut down, all of them became "Dead Nodes" in
> >> the NN web admin UI (http://ip:50070/dfshealth.jsp),
> >> but the regionservers of HBase were still live in the HBase web
> >> admin (http://ip:60010/master-status); of course, they were zombies.
> >> All of the JVM processes were still running, including
> >> hmaster/namenode/regionserver/datanode.
> >>
> >> When the datanodes got shut down, the load (from the "top" command) on
> >> the slaves became very high, more than 10, higher than during normal
> >> running.
> >> From the "top" command, we saw that the datanode and regionserver
> >> processes were consuming CPU.
> >>
> >> We could not stop the HBase or Hadoop cluster through the normal
> >> commands (stop-*.sh / *-daemon.sh stop *).
> >> So we stopped the datanodes and regionservers with kill -9 PID; then the
> >> load on the slaves returned to the normal level, and we started the
> >> cluster again.
> >>
> >>
> >> Log of the NN at the shutdown point (all of the DNs were removed):
> >> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.152:50010
> >> 2013-02-22 11:10:02,278 INFO org.apache.hadoop.hdfs.StateChange:
> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.149:50010
> >> 2013-02-22 11:10:02,693 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.149:50010
> >> 2013-02-22 11:10:02,693 INFO org.apache.hadoop.hdfs.StateChange:
> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.150:50010
> >> 2013-02-22 11:10:03,004 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.150:50010
> >> 2013-02-22 11:10:03,004 INFO org.apache.hadoop.hdfs.StateChange:
> >> BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.148:50010
> >> 2013-02-22 11:10:03,339 INFO org.apache.hadoop.net.NetworkTopology:
> >> Removing a node: /default-rack/192.168.1.148:50010
> >>
> >>
> >> Logs in the DNs indicated there were many IOException and
> >> SocketTimeoutException:
> >> 2013-02-22 11:02:52,354 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(192.168.1.148:50010,
> >> storageID=DS-970284113-117.25.149.160-50010-1328074119937,
> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> java.io.IOException: Interrupted receiveBlock
> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:577)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
> >>         at java.lang.Thread.run(Thread.java:662)
> >> 2013-02-22 11:03:44,823 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(192.168.1.148:50010,
> >> storageID=DS-970284113-117.25.149.160-50010-1328074119937,
> >> infoPort=50075, ipcPort=50020):Got exception while serving
> >> blk_-1985405101514576650_247001 to /192.168.1.148:
> >> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> >> for channel to be ready for write. ch :
> >> java.nio.channels.SocketChannel[connected local=/192.168.1.148:50010
> >> remote=/192.168.1.148:48654]
> >>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> >>         at java.lang.Thread.run(Thread.java:662)
> >> 2013-02-22 11:09:42,294 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(192.168.1.148:50010,
> >> storageID=DS-970284113-117.25.149.160-50010-1328074119937,
> >> infoPort=50075, ipcPort=50020):DataXceiver
> >> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> >> for channel to be ready for write.
> >> ch :
> >> java.nio.channels.SocketChannel[connected local=/192.168.1.148:50010
> >> remote=/192.168.1.148:37188]
> >>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
> >>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> >>         at java.lang.Thread.run(Thread.java:662)
> >> 2013-02-22 11:12:41,892 INFO
> >> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
> >> succeeded for blk_-2674357249542194287_43419
> >>
> >>
> >> Here is our env:
> >> hadoop 1.0.3
> >> hbase 0.94.1 (snappy enabled)
> >>
> >> java version "1.6.0_31"
> >> Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
> >> Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
> >>
> >> # ulimit -a
> >> core file size          (blocks, -c) 0
> >> data seg size           (kbytes, -d) unlimited
> >> scheduling priority             (-e) 20
> >> file size               (blocks, -f) unlimited
> >> pending signals                 (-i) 16382
> >> max locked memory       (kbytes, -l) 64
> >> max memory size         (kbytes, -m) unlimited
> >> open files                      (-n) 32768
> >> pipe size            (512 bytes, -p) 8
> >> POSIX message queues     (bytes, -q) 819200
> >> real-time priority              (-r) 0
> >> stack size              (kbytes, -s) 8192
> >> cpu time               (seconds, -t) unlimited
> >> max user processes              (-u) 32768
> >> virtual memory          (kbytes, -v) unlimited
> >> file locks                      (-x) unlimited
> >>
> >> # uname -a
> >> Linux ubuntu6401 2.6.32-33-server #70-Ubuntu SMP Thu Jul 7 22:28:30
> >> UTC 2011 x86_64 GNU/Linux
> >>
> >>
> >> # free(master)
> >>              total       used       free     shared    buffers     cached
> >> Mem:      24732936    8383708   16349228          0     490584    2580356
> >> -/+ buffers/cache:    5312768   19420168
> >> Swap:     72458232          0   72458232
> >>
> >>
> >> # free(slaves)
> >>              total       used       free     shared    buffers     cached
> >> Mem:      24733000   22824276    1908724          0     862556   15303304
> >> -/+ buffers/cache:    6658416   18074584
> >> Swap:     72458232        264   72457968
> >>
> >>
> >> Some important conf:
> >> core-site.xml
> >>         <property>
> >>                 <name>io.file.buffer.size</name>
> >>                 <value>65536</value>
> >>         </property>
> >>
> >> hdfs-site.xml
> >>         <property>
> >>                 <name>dfs.block.size</name>
> >>                 <value>134217728</value>
> >>         </property>
> >>         <property>
> >>                 <name>dfs.datanode.max.xcievers</name>
> >>                 <value>4096</value>
> >>         </property>
> >>         <property>
> >>                 <name>dfs.support.append</name>
> >>                 <value>true</value>
> >>         </property>
> >>         <property>
> >>                 <name>dfs.replication</name>
> >>                 <value>2</value>
> >>         </property>
> >>
> >>
> >> Hope you can help us.
> >> Thanks in advance.
> >>
> >>
> >>
> >> --
> >> Davey Yan
> >
> >
>
> --
> Davey Yan
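The "stale" datanode state mentioned above is, to my understanding, opt-in and configured in hdfs-site.xml on releases that include HDFS-3703 (i.e. newer than the Hadoop 1.0.3 used here); a sketch of the relevant properties, hedged as an assumption about the exact release line:

```
<!-- hdfs-site.xml: make the NameNode deprioritize "stale" datanodes
     for reads (assumes a release that includes HDFS-3703) -->
<property>
        <name>dfs.namenode.avoid.read.stale.datanode</name>
        <value>true</value>
</property>
<property>
        <name>dfs.namenode.stale.datanode.interval</name>
        <!-- mark a datanode stale after 30 s without a heartbeat -->
        <value>30000</value>
</property>
```

With this set, a datanode that has missed heartbeats for 30 seconds is avoided for reads long before the 10:30-minute dead mark, which is exactly the window where the "connection refused" and socket-timeout errors above occur.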