hadoop-mapreduce-user mailing list archives

From Trevor <tre...@scurrilous.com>
Subject MRv2 jobs fail when run with more than one slave
Date Tue, 17 Jul 2012 21:24:22 GMT
Hi all,

I recently upgraded from CDH4b2 (0.23.1) to CDH4 (2.0.0). Now for some
strange reason, my MRv2 jobs (TeraGen, specifically) fail if I run with
more than one slave. For every slave except the one running the Application
Master, I get the following failed tasks and warnings repeatedly:

12/07/13 14:21:55 INFO mapreduce.Job: Running job: job_1342207265272_0001
12/07/13 14:22:17 INFO mapreduce.Job: Job job_1342207265272_0001 running in uber mode : false
12/07/13 14:22:17 INFO mapreduce.Job:  map 0% reduce 0%
12/07/13 14:22:46 INFO mapreduce.Job:  map 1% reduce 0%
12/07/13 14:22:52 INFO mapreduce.Job:  map 2% reduce 0%
12/07/13 14:22:55 INFO mapreduce.Job:  map 3% reduce 0%
12/07/13 14:22:58 INFO mapreduce.Job:  map 4% reduce 0%
12/07/13 14:23:04 INFO mapreduce.Job:  map 5% reduce 0%
12/07/13 14:23:07 INFO mapreduce.Job:  map 6% reduce 0%
12/07/13 14:23:07 INFO mapreduce.Job: Task Id : attempt_1342207265272_0001_m_000004_0, Status : FAILED
12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000004_0&filter=stdout
12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000004_0&filter=stderr
12/07/13 14:23:08 INFO mapreduce.Job: Task Id : attempt_1342207265272_0001_m_000003_0, Status : FAILED
12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000003_0&filter=stdout
...
12/07/13 14:25:12 INFO mapreduce.Job:  map 25% reduce 0%
12/07/13 14:25:12 INFO mapreduce.Job: Job job_1342207265272_0001 failed with state FAILED due to:
...
                Failed map tasks=19
                Launched map tasks=31
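
To see which attempts failed and which host and port the client tried to fetch task logs from, the output above can be summarized with a small script. This is only a sketch: the sample text is copied from the log above with the mail line wrapping undone, and the function name `summarize` and the two regular expressions are illustrative, not part of Hadoop.

```python
import re

# Sample client output copied from the log above (line wrapping undone).
SAMPLE_LOG = """\
12/07/13 14:23:07 INFO mapreduce.Job: Task Id : attempt_1342207265272_0001_m_000004_0, Status : FAILED
12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000004_0&filter=stdout
12/07/13 14:23:08 INFO mapreduce.Job: Task Id : attempt_1342207265272_0001_m_000003_0, Status : FAILED
"""

# Hypothetical patterns written against the message format shown above.
FAILED_RE = re.compile(r"Task Id : (attempt_\w+), Status : FAILED")
URL_RE = re.compile(r"for URL: http://([^/:\s]+):(\d+)/")


def summarize(log_text):
    """Return (failed attempt IDs in order, sorted unique (host, port) pairs)."""
    failed = FAILED_RE.findall(log_text)
    endpoints = sorted(set(URL_RE.findall(log_text)))
    return failed, endpoints


if __name__ == "__main__":
    failed, endpoints = summarize(SAMPLE_LOG)
    print("failed attempts:", failed)
    print("task log endpoints:", endpoints)
```

Running it over the full client output would show whether every failed attempt maps to a non-AM slave.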

The HTTP 400 error appears to be generated by the ShuffleHandler, which is
configured to run on port 8080 of the slaves, and doesn't understand that
URL. What I've been able to piece together so far is that /tasklog is
handled by the TaskLogServlet, which is part of the TaskTracker. However,
isn't this an MRv1 class that shouldn't even be running in my
configuration? Also, the TaskTracker appears to run on port 50060, so I
don't know where port 8080 is coming from.
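
The URL from the warning can be taken apart, and optionally re-requested, to confirm what is answering on port 8080. This is a rough sketch using only the Python standard library; the helper names `describe` and `probe` are hypothetical, and `probe` needs network access to the slave, so it is defined here but not called.

```python
from urllib.error import HTTPError
from urllib.parse import parse_qs, urlsplit
from urllib.request import urlopen

# URL copied from the WARN line in the client log.
URL = ("http://perfgb0n0:8080/tasklog?plaintext=true"
       "&attemptid=attempt_1342207265272_0001_m_000004_0&filter=stdout")


def describe(url):
    """Split the URL into host, port, path and single-valued query parameters."""
    parts = urlsplit(url)
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"host": parts.hostname, "port": parts.port,
            "path": parts.path, "query": query}


def probe(url, timeout=5.0):
    """Return the HTTP status code for url; error responses such as 400 are returned, not raised."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code


if __name__ == "__main__":
    print(describe(URL))
```

Calling `probe(URL)` from the client machine should reproduce the 400 if the listener on that port does not serve `/tasklog`.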

This warning could be a red herring, but it seems related to the job
failure, even though the job still makes progress on the slave running the
AM. The NodeManager logs on the AM and non-AM slaves look fairly similar,
and I don't see any errors in the non-AM logs.

Another strange data point: These failures occur running the slaves on ARM
systems. Running the slaves on x86 with the same configuration works. I'm
using the same tarball on both, which means that the native-hadoop library
isn't loaded on ARM. The master/client is the same x86 system in both
scenarios. All nodes are running Ubuntu 12.04.

Thanks for any guidance,
Trevor
