Hello,
I am running into an issue where, while running a MapReduce job (terasort),
a few tasks fail with the error "All datanodes
10.230.229.76:50010 are bad."
The job still finishes successfully because the failed tasks are
re-attempted, but I have seen these tasks fail multiple times.
Test Setup :
=========
Terasort for 300 GB, number of reducers: 100, number of maps: 2400.
The cluster runs fine both before and after the error.
12 datanode setup (only ZooKeeper / HDFS / MapReduce running)
Map/Reduce container memory: 5 GB
Only a few full GCs (2-3), each with less than 0.02 sec pause time.
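For reference, here is a rough sketch of the per-task data volumes these numbers imply. It assumes that 1 GB means 2**30 bytes, that the input is split evenly across maps, and that the keys partition evenly across reducers; none of this was measured on the cluster. The variable names are only illustrative.

```python
# Figures taken from the test setup above.
# Assumption: "GB" means 2**30 bytes and data is spread evenly across tasks.
input_bytes = 300 * 2**30
num_maps = 2400
num_reducers = 100

# Average input handled by one map task, in MB.
per_map_mb = input_bytes / num_maps / 2**20
# Average data handled (and written to HDFS) by one reducer, in GB.
per_reducer_gb = input_bytes / num_reducers / 2**30

print(f"input per map task: {per_map_mb:.0f} MB")
print(f"data per reducer:   {per_reducer_gb:.0f} GB")
```

Under those assumptions each map reads about 128 MB and each reducer writes about 3 GB, so the failing reduce attempts are the ones doing the large HDFS writes.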
Debugging:
========
Error on console:
14/04/19 10:17:09 [main] INFO mapreduce.Job(1425): Task Id :
attempt_1397901041097_0001_r_000025_0, Status : FAILED
Error: java.io.IOException: All datanodes 10.230.229.76:50010 are bad. Aborting…
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:960)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:780)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
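To check whether the failures always point at the same datanode, the console output can be summarized with a small script. This is a minimal sketch, assuming the console output has been saved as text and that the messages look like the ones above; the function name `summarize` and the `sample` list are hypothetical and only for illustration.

```python
import re
from collections import Counter

# Matches the datanode address in "All datanodes <ip:port> are bad".
BAD_DN = re.compile(r"All datanodes (\S+) are bad")
# Matches task attempt ids such as attempt_1397901041097_0001_r_000025_0.
ATTEMPT = re.compile(r"attempt_\d+_\d+_[mr]_\d+_\d+")


def summarize(lines):
    """Return (failures per datanode, list of (attempt id, datanode) pairs)."""
    per_node = Counter()
    pairs = []
    last_attempt = None  # most recent attempt id seen before an error line
    for line in lines:
        attempt = ATTEMPT.search(line)
        if attempt:
            last_attempt = attempt.group(0)
        bad = BAD_DN.search(line)
        if bad:
            per_node[bad.group(1)] += 1
            pairs.append((last_attempt, bad.group(1)))
    return per_node, pairs


# Sample lines copied from the console output above.
sample = [
    "14/04/19 10:17:09 [main] INFO mapreduce.Job(1425): Task Id :",
    "attempt_1397901041097_0001_r_000025_0, Status : FAILED",
    "Error: java.io.IOException: All datanodes 10.230.229.76:50010 are bad. Aborting",
]
per_node, pairs = summarize(sample)
for node, count in per_node.items():
    print(node, count)
```

If one address dominates the counts, that datanode's own log around the failure timestamps is the first place to look.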
The NodeManager and ResourceManager logs show no errors.
The DataNode / NameNode logs show the following error
Though the job has finished, I can't access it, since I get the
following: "Not Found: job_1397901041097_0001". While debugging, I found
that this could be related to the following lines in the ResourceManager
log, though I am not sure why they appear:
19-Apr-2014 10:26:12 [1086033982@qtp-802928030-12] INFO
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet[327] - sfdc
is accessing unchecked
http://blitzhbase02-mnds1-1-crd.eng.sfdc.net:19888/jobhistory/job/job_1397901041097_0001/mapreduce/job/job_1397901041097_0001
which is the app master GUI of application_1397901041097_0001 owned by
sfdc
19-Apr-2014 10:32:18 [501790067@qtp-802928030-22] INFO
org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet[327] - sfdc
is accessing unchecked
http://blitzhbase02-mnds1-1-crd.eng.sfdc.net:19888/jobhistory/job/job_1397901041097_0001/mapreduce/job/job_1397901041097_0001
which is the app master GUI of application_1397901041097_0001 owned by
sfdc
19-Apr-2014 10:33:43 [Delegation Token Canceler] INFO
org.apache.hadoop.hdfs.DFSClient[898] - Cancelling
HDFS_DELEGATION_TOKEN token 235 for sfdc on
ha-hdfs:crd-dev-blitzhbase02
Has anyone seen this before? Any input would be helpful.
Note: Sometimes I also see these errors, which seem to be a known
issue. I restarted the cluster for this one.
Amit.