hadoop-common-user mailing list archives

From: Foss User <foss...@gmail.com>
Subject: Re: After a node goes down, I can't run jobs
Date: Sun, 05 Apr 2009 09:52:51 GMT
On Sun, Apr 5, 2009 at 3:18 PM, Foss User <fossist@gmail.com> wrote:
> I have a Hadoop cluster of 5 nodes: (1) Namenode (2) Job tracker (3)
> First slave (4) Second slave (5) Client from which I submit jobs
>
> I brought node no. 4 (the second slave) down by running:
>
> bin/hadoop-daemon.sh stop datanode
> bin/hadoop-daemon.sh stop tasktracker
>
> After this I tried running my word count job again and I got this error:
>
> fossist@hadoop-client:~/mcr-wordcount$ hadoop jar
> dist/mcr-wordcount-0.1.jar com.fossist.examples.WordCountJob
> /fossist/inputs /fossist/output7
>
> 09/04/05 15:13:03 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the same.
> 09/04/05 15:13:03 INFO hdfs.DFSClient: Exception in
> createBlockOutputStream java.io.IOException: Bad connect ack with
> firstBadLink 192.168.1.5:50010
> 09/04/05 15:13:03 INFO hdfs.DFSClient: Abandoning block
> blk_-6478273736277251749_1034
> 09/04/05 15:13:09 INFO hdfs.DFSClient: Exception in
> createBlockOutputStream java.net.ConnectException: Connection refused
> 09/04/05 15:13:09 INFO hdfs.DFSClient: Abandoning block
> blk_-7054779688981181941_1034
> 09/04/05 15:13:15 INFO hdfs.DFSClient: Exception in
> createBlockOutputStream java.net.ConnectException: Connection refused
> 09/04/05 15:13:15 INFO hdfs.DFSClient: Abandoning block
> blk_-6231549606860519001_1034
> 09/04/05 15:13:21 INFO hdfs.DFSClient: Exception in
> createBlockOutputStream java.io.IOException: Bad connect ack with
> firstBadLink 192.168.1.5:50010
> 09/04/05 15:13:21 INFO hdfs.DFSClient: Abandoning block
> blk_-7060117896593271410_1034
> 09/04/05 15:13:27 WARN hdfs.DFSClient: DataStreamer Exception:
> java.io.IOException: Unable to create new block.
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>
> 09/04/05 15:13:27 WARN hdfs.DFSClient: Error Recovery for block
> blk_-7060117896593271410_1034 bad datanode[1] nodes == null
> 09/04/05 15:13:27 WARN hdfs.DFSClient: Could not get block locations.
> Source file "/tmp/hadoop-hadoop/mapred/system/job_200904042051_0011/job.jar"
> - Aborting...
> java.io.IOException: Bad connect ack with firstBadLink 192.168.1.5:50010
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>
> Note that 192.168.1.5 is the Hadoop slave where I stopped the datanode
> and tasktracker. This is a serious concern for me: if I cannot run
> jobs after a single node goes down, the purpose of the cluster is
> defeated.
>
> Could someone please help me understand whether this is an error on
> my part or a problem in Hadoop? Is there any way to avoid it?
>
> Please note that I can still read all my data in the 'inputs'
> directory using commands like:
>
> fossist@hadoop-client:~/mcr-wordcount$ hadoop dfs -cat /fossist/inputs/input1.txt
>
> Please help.
>

Here is an update. After waiting for some time (I don't know exactly
how long), the namenode web page on port 50070 showed the downed node
as a 'dead node', and I was able to run jobs again as before. Does this
mean that Hadoop takes a while to accept that a node is dead?
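
Incidentally, if I am reading the command help right, the same status
should also be visible from the command line; something like

bin/hadoop dfsadmin -report

run against the cluster should list each datanode and whether the
namenode still considers it live, which would be easier to script than
watching the web page.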

Is this good design? In the first five minutes or so, while Hadoop is
in denial that the node is dead, all new jobs fail. Is there a way I,
as a user, can tell Hadoop to use the other available nodes during
this denial period?
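
From what I can tell from the FSNamesystem source, the namenode only
declares a datanode dead after about 2 * heartbeat.recheck.interval +
10 * dfs.heartbeat.interval, which with the default values (300000 ms
and 3 s) works out to roughly 10.5 minutes. If that reading is correct,
I suppose the window could be shortened by overriding the recheck
interval in hadoop-site.xml, e.g.:

<property>
  <!-- guesswork on my part: lower the default 300000 ms so dead
       datanodes are noticed sooner; I have not tested side effects -->
  <name>heartbeat.recheck.interval</name>
  <value>60000</value>
</property>

but I would like to hear whether that is a sane thing to do.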
