hadoop-hdfs-user mailing list archives

From Amit Kabra <amitkabrai...@gmail.com>
Subject All datanodes are bad. Aborting ...
Date Sun, 20 Apr 2014 09:57:49 GMT
Hello,

I am facing an issue where, while running a MapReduce job (terasort),
I see a few tasks failing with the error "All datanodes
10.230.229.76:50010 are bad."
The job still finishes successfully because the failed tasks are
respawned, but I have seen these tasks fail multiple times.

Test Setup :
=========

Terasort for 300 GB, number of reducers: 100, number of maps: 2400.
Cluster is running fine (before / after the error).
12-datanode setup (only ZooKeeper / HDFS / MapReduce running).
Map/Reduce container memory: 5 GB.
Only a few full GCs (2-3), each with less than 0.02 sec pause time.



Debugging:
========

Error on console:

14/04/19 10:17:09 [main] INFO  mapreduce.Job(1425): Task Id :
attempt_1397901041097_0001_r_000025_0, Status : FAILED

Error: java.io.IOException: All datanodes 10.230.229.76:50010 are bad. Aborting…

    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:960)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:780)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
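
To check whether the failures always point at the same datanode, the console output can be tallied per address. The following is only a rough sketch, assuming the console output has been saved as text and that the messages look exactly like the ones quoted above; the function name tally_bad_datanodes and both regular expressions are my own and not part of Hadoop.

```python
import re
from collections import Counter

# Assumed formats, taken from the console output quoted in this message.
ATTEMPT_RE = re.compile(r"(attempt_\d+_\d+_[mr]_\d+_\d+), Status : FAILED")
BAD_DN_RE = re.compile(r"All datanodes (.+?) are bad")


def tally_bad_datanodes(lines):
    """Return (failed attempt ids, Counter of datanode address -> error count)."""
    attempts = []
    counts = Counter()
    for line in lines:
        m = ATTEMPT_RE.search(line)
        if m:
            attempts.append(m.group(1))
        m = BAD_DN_RE.search(line)
        if m:
            # The message may list several addresses; split on commas/whitespace.
            for addr in re.split(r"[,\s]+", m.group(1).strip()):
                if addr:
                    counts[addr] += 1
    return attempts, counts
```

Calling tally_bad_datanodes(open(path)) on a saved console log (path being wherever the output was captured) returns the failed attempt ids and a per-datanode count, which shows whether one node or several are being reported as bad.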


The Node Manager / Resource Manager logs show no errors.


Datanode / NameNode  shows following error


Though the job has finished, I can't access it, since the job page
gives "Not Found: job_1397901041097_0001". On debugging, I found that
this could be because of the following lines in the Resource Manager
log, though I am not sure why this is happening:


19-Apr-2014 10:26:12  [1086033982@qtp-802928030-12] INFO
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet[327] - sfdc
is accessing unchecked
http://blitzhbase02-mnds1-1-crd.eng.sfdc.net:19888/jobhistory/job/job_1397901041097_0001/mapreduce/job/job_1397901041097_0001
which is the app master GUI of application_1397901041097_0001 owned by
sfdc

19-Apr-2014 10:32:18  [501790067@qtp-802928030-22] INFO
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet[327] - sfdc
is accessing unchecked
http://blitzhbase02-mnds1-1-crd.eng.sfdc.net:19888/jobhistory/job/job_1397901041097_0001/mapreduce/job/job_1397901041097_0001
which is the app master GUI of application_1397901041097_0001 owned by
sfdc

19-Apr-2014 10:33:43  [Delegation Token Canceler] INFO
org.apache.hadoop.hdfs.DFSClient[898] - Cancelling
HDFS_DELEGATION_TOKEN token 235 for sfdc on
ha-hdfs:crd-dev-blitzhbase02



Has anyone seen this before? Any input would be helpful.

Note: Sometimes I also see these errors, which seem to be a known
issue. I restarted the cluster for this one.

Amit.
