hadoop-common-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Hadoop data nodes failing to start
Date Wed, 08 Apr 2009 13:29:51 GMT
Kevin,

I'm glad it worked for you.

We talked a bit about 5114 yesterday. Any chance of trying the 0.18 branch
on that same cluster without the socket timeout workaround?

Thx,

J-D

On Wed, Apr 8, 2009 at 9:24 AM, Kevin Eppinger
<keppinger@adknowledge.com> wrote:
> FYI: Problem fixed. It was apparently a timeout condition present in 0.18.3 that only
> popped up when the additional nodes were added. The solution was to put the following
> entry in hadoop-site.xml:
>
> <property>
>   <name>dfs.datanode.socket.write.timeout</name>
>   <value>0</value>
> </property>
>
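As a sanity check (a hypothetical helper, not something from the thread itself), you can confirm the entry actually landed in hadoop-site.xml with a few lines of Python; a value of "0" disables the datanode's socket write timeout entirely:

```python
import xml.etree.ElementTree as ET

# Sample hadoop-site.xml contents; in practice, read the real file from conf/.
config_xml = """
<configuration>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>
</configuration>
"""

def get_property(xml_text, name):
    """Return the <value> for a named Hadoop property, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

print(get_property(config_xml, "dfs.datanode.socket.write.timeout"))  # prints: 0
```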
> Thanks to 'jdcryans' and 'digarok' from IRC for the help.
>
> -kevin
>
> -----Original Message-----
> From: Kevin Eppinger [mailto:keppinger@adknowledge.com]
> Sent: Tuesday, April 07, 2009 1:05 PM
> To: core-user@hadoop.apache.org
> Subject: Hadoop data nodes failing to start
>
> Hello everyone-
>
> So I have a 5-node cluster that I've been running for a few weeks with no problems.
> Today I decided to add nodes and double its size to 10. After doing all the setup
> and starting the cluster, I discovered that four of the 10 nodes had failed to
> start up. Specifically, the data nodes didn't start; the task trackers seemed to
> start fine. Thinking it was something I did incorrectly with the expansion, I
> reverted to the 5-node configuration, but I'm experiencing the same problem...
> with only 2 of 5 nodes starting correctly. Here is what I'm seeing in the
> hadoop-*-datanode*.log files:
>
> 2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting Periodic block scanner.
> 2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 9269 blocks got processed in 1128 msecs
> 2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.131.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: Exiting due to:java.nio.channels.ClosedSelectorException
>        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
>        at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
>        at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
>        at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
>        at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
>        at java.lang.Thread.run(Thread.java:619)
>
> After this the data node shuts down. This same message is appearing on all the
> failed nodes. Help!
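For anyone hitting the same symptom, one quick way to find which nodes died this way is to grep the datanode logs for the exception. Below is a self-contained demo that recreates the pattern in a scratch directory (the log naming is an assumption; in real life, point grep at your HADOOP_LOG_DIR):

```shell
# Demo: fake two datanode logs in a scratch dir, one showing the failure.
logdir=$(mktemp -d)
printf 'ERROR DataXceiveServer: Exiting due to:java.nio.channels.ClosedSelectorException\n' \
  > "$logdir/hadoop-hadoop-datanode-node1.log"
printf 'INFO DataNode: Starting Periodic block scanner.\n' \
  > "$logdir/hadoop-hadoop-datanode-node2.log"

# -l prints only the filenames that match, one per affected node's log.
grep -l 'ClosedSelectorException' "$logdir"/hadoop-*-datanode-*.log
```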
>
> -kevin
>
