hadoop-common-user mailing list archives

From "Joshi, Shrinivas" <Shrinivas.Jo...@amd.com>
Subject RE: Map Reduce "Child Error" task failure
Date Tue, 21 Aug 2012 19:21:16 GMT
Hi Matt,

You are most probably seeing https://issues.apache.org/jira/browse/MAPREDUCE-2374

There is a single-line fix for this issue; see the latest patch attached to that JIRA
entry.

-Shrinivas

-----Original Message-----
From: Matt Kennedy [mailto:stinkymatt@gmail.com] 
Sent: Tuesday, August 21, 2012 2:15 PM
To: user@hadoop.apache.org
Subject: Map Reduce "Child Error" task failure

I'm encountering a sporadic error while running MapReduce jobs; it shows up in the console
output as follows:

12/08/21 14:56:05 INFO mapred.JobClient: Task Id :
attempt_201208211430_0001_m_003538_0, Status : FAILED
java.lang.Throwable: Child Error
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 126.
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

12/08/21 14:56:05 WARN mapred.JobClient: Error reading task output http://<hostname_removed>:50060/tasklog?plaintext=true&attemptid=attempt_201208211430_0001_m_003538_0&filter=stdout
12/08/21 14:56:05 WARN mapred.JobClient: Error reading task output http://<hostname_removed>:50060/tasklog?plaintext=true&attemptid=attempt_201208211430_0001_m_003538_0&filter=stderr
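For context, exit status 126 is the conventional POSIX shell status for "command found but could not be executed" (e.g. permission denied or "text file busy" on the task launcher script, as MAPREDUCE-2374 describes). A minimal sketch reproducing that status (the script path is made up for illustration):

```shell
# A valid shell script that the TaskTracker-style launcher cannot execute
# because the execute bit is missing -- the shell then exits with 126.
printf '#!/bin/sh\necho hi\n' > /tmp/demo_task.sh   # hypothetical script name
chmod a-x /tmp/demo_task.sh                          # strip execute permission
/tmp/demo_task.sh 2>/dev/null                        # launch fails
status=$?
echo "$status"    # 126, the same nonzero status TaskRunner reports
```

The same status would appear if the taskjvm.sh file were briefly unwritable or busy when the child JVM was launched.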

The conditions look exactly like those described in:
https://issues.apache.org/jira/browse/MAPREDUCE-4003

Unfortunately, this issue is marked as closed as of Apache Hadoop version 1.0.3, but that's
the version in which I'm running into it.

There does seem to be a correlation between the frequency of these errors and the number of
concurrent map tasks being executed, but the cluster's hardware resources do not appear to
be near their limits. I assume some knob somewhere is maladjusted and causing this error,
but I haven't found it.

I did find this discussion
(https://groups.google.com/a/cloudera.org/d/topic/cdh-user/NlhvHapf3pk/discussion)
on the CDH users list describing the exact same problem; the advice there was to increase
the value of the mapred.child.ulimit setting. However, I initially had this value unset,
which, if my research is correct, should mean the limit is unlimited. I then set the value
to 3 GB (3x my setting for mapred.map.child.java.opts), and it still did not resolve the
problem. Finally, out of frustration, I added a zero at the end, making the value 31457280
(the setting's unit is KB), which is 30 GB. I'm still having the problem.
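For reference, the values tried above would be set in mapred-site.xml roughly like this (a sketch, assuming the standard Hadoop 1.x property names; the limit is in kilobytes, so 3 GB = 3 * 1024 * 1024 = 3145728):

```xml
<!-- mapred-site.xml: cap virtual memory of child task processes.
     Unit is kilobytes; leaving the property unset means no limit. -->
<property>
  <name>mapred.child.ulimit</name>
  <value>3145728</value> <!-- 3 GB; 31457280 would be the 30 GB attempt -->
</property>
```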

Is anybody else seeing this issue or have an idea for a workaround?
Right now my workaround is to set the allowed failures very high before a TaskTracker is
blacklisted, but this has the unintended side effect of taking a very long time to evict
legitimately broken TaskTrackers. If this error is indicative of some other configuration
problem, I'd like to try to resolve it.

Ideas? Or should I re-open the JIRA?

Thank you for your time,
Matt


