Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of amitkabraiiit@gmail.com
 designates 209.85.219.47 as permitted sender)
MIME-Version: 1.0
From: Amit Kabra <amitkabraiiit@gmail.com>
Date: Sun, 20 Apr 2014 15:27:49 +0530
Message-ID: 
 <CA+MNucdazsUTdxSTA-=-JN86mmOEvqPyAfjZ=k9y8AMDB8u=Yw@mail.gmail.com>
Subject: All datanodes are bad. Aborting ...
To: user@hadoop.apache.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hello,

I am facing an issue where, while running a MapReduce job (terasort),
I see a few tasks failing with the error "All datanodes
10.230.229.76:50010 are bad."
The job still finishes successfully, since the failed tasks are spawned
again, but I have seen these tasks fail multiple times.

Test Setup :
=========

Terasort for 300 GB, number of reducers: 100, number of maps: 2400.
The cluster runs fine (before / after the error).
12-datanode setup (only ZooKeeper / HDFS / MapReduce running)
Map/Reduce container memory: 5 GB
Only a few full GCs (2-3), each with less than 0.02 sec pause time.


Debugging:
========

Error on console:

14/04/19 10:17:09 [main] INFO  mapreduce.Job(1425): Task Id :
attempt_1397901041097_0001_r_000025_0, Status : FAILED

Error: java.io.IOException: All datanodes 10.230.229.76:50010 are bad. Aborting…
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:960)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:780)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
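
To see whether the same datanode keeps showing up across failed attempts, the
console output could be tallied with something like the rough sketch below.
The two regexes and the function name summarize_failures are my own
assumptions, modelled only on the console lines quoted above; they are not
part of any Hadoop tool, and the datanode pattern assumes a single node is
listed in the message.

```python
import re
from collections import Counter

# Assumed patterns, based only on the console lines quoted in this mail.
ATTEMPT_RE = re.compile(r"Task Id :\s*(attempt_\w+), Status : FAILED")
BAD_DN_RE = re.compile(r"All datanodes (\S+) are bad")


def summarize_failures(console_text):
    """Return (failed attempt ids, Counter of datanodes named in 'are bad' errors)."""
    # The console wraps the attempt id onto the next line, so collapse
    # all whitespace runs into single spaces before matching.
    flat = " ".join(console_text.split())
    attempts = ATTEMPT_RE.findall(flat)
    bad_nodes = Counter(BAD_DN_RE.findall(flat))
    return attempts, bad_nodes


sample = """14/04/19 10:17:09 [main] INFO  mapreduce.Job(1425): Task Id :
attempt_1397901041097_0001_r_000025_0, Status : FAILED
Error: java.io.IOException: All datanodes 10.230.229.76:50010 are bad. Aborting...
"""

if __name__ == "__main__":
    attempts, bad_nodes = summarize_failures(sample)
    print(attempts)
    print(bad_nodes.most_common())
```

If one address dominates the counts, that would point at a single datanode
rather than a cluster-wide problem.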


Node Manager / Resource Manager logs show no errors.


Datanode / NameNode show the following error:


Though the job has finished, I can't access it, since it gives the
following: "Not Found: job_1397901041097_0001". On debugging, I found
that this could be because of the following lines in the Resource Manager
log, though I am not sure why this is happening:


19-Apr-2014 10:26:12  [1086033982@qtp-802928030-12] INFO
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet[327] - sfdc
is accessing unchecked
http://blitzhbase02-mnds1-1-crd.eng.sfdc.net:19888/jobhistory/job/job_1397901041097_0001/mapreduce/job/job_1397901041097_0001
which is the app master GUI of application_1397901041097_0001 owned by
sfdc

19-Apr-2014 10:32:18  [501790067@qtp-802928030-22] INFO
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet[327] - sfdc
is accessing unchecked
http://blitzhbase02-mnds1-1-crd.eng.sfdc.net:19888/jobhistory/job/job_1397901041097_0001/mapreduce/job/job_1397901041097_0001
which is the app master GUI of application_1397901041097_0001 owned by
sfdc

19-Apr-2014 10:33:43  [Delegation Token Canceler] INFO
org.apache.hadoop.hdfs.DFSClient[898] - Cancelling
HDFS_DELEGATION_TOKEN token 235 for sfdc on
ha-hdfs:crd-dev-blitzhbase02


Has anyone seen this before? Any input on this would be helpful.

Note: Sometimes I also see these errors, which seem to be a known
issue. I restarted the cluster for this one.

Amit.