hadoop-common-user mailing list archives

From "Feng, Ao" <aof...@amazon.com>
Subject RE: Task process exit with nonzero status of 1
Date Fri, 09 Oct 2009 17:47:21 GMT
I probably know what the problem is, as we are encountering the same issue on our prod cluster.
Every once in a while jobs start failing on the same task trackers, and the only error message
is this exit status 1.

Go to the userlogs directory on the host where your tasks fail, and check whether there are 31,999
directories, all named attempt_... Once you get to that point, the JVM runs out of
file descriptors as it tries to create the 32,000th one. I confirmed that cleaning up the userlogs
directory solves the problem... temporarily.
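
To check how close a host is to that point, a minimal sketch along these lines counts the attempt_* directories. The function names and the warning margin are assumptions for illustration; 31,999 is simply the figure observed above, not a documented constant.

```python
import os

def count_attempt_dirs(userlogs_dir):
    """Count subdirectories of userlogs_dir whose names start with 'attempt_'."""
    count = 0
    with os.scandir(userlogs_dir) as entries:
        for entry in entries:
            # Only real directories count; plain files and symlinks are skipped.
            if entry.name.startswith("attempt_") and entry.is_dir(follow_symlinks=False):
                count += 1
    return count

def near_limit(count, limit=31999, margin=1000):
    """True when count is within margin of limit (both defaults are assumptions)."""
    return count >= limit - margin
```

Calling count_attempt_dirs on the userlogs directory of a failing tasktracker and passing the result to near_limit gives a quick yes/no check that could run from cron.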

So my questions are:

1. Where is the 32,000 limit imposed, and how do we change it?
2. In the Hadoop configuration, is there any parameter to specify when to automatically delete
those old user logs (like after a month)?
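
Until there is a built-in answer to question 2, an external cleanup is one workaround. The sketch below is an assumption-laden illustration, not a Hadoop feature: the function names and the 30-day default are hypothetical, and it judges age by directory modification time only.

```python
import os
import shutil
import time

def find_stale_attempt_dirs(userlogs_dir, max_age_days=30, now=None):
    """Return sorted paths of attempt_* directories not modified in max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    with os.scandir(userlogs_dir) as entries:
        for entry in entries:
            if not entry.name.startswith("attempt_"):
                continue
            if not entry.is_dir(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(entry.path)
    return sorted(stale)

def remove_dirs(paths, dry_run=True):
    """Delete the given directories; with dry_run=True only print them."""
    for path in paths:
        if dry_run:
            print("would remove", path)
        else:
            shutil.rmtree(path)
```

Running it with dry_run=True first shows what would be deleted before anything is actually removed.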

As we are running 0.19, I was hoping 0.20 had fixed it, but it seems that is not the case. Can
someone file a bug report/feature request if there is no elegant solution at this time?

Thanks,
Ao

-----Original Message-----
From: Marc Limotte [mailto:mlimotte@feeva.com] 
Sent: Thursday, September 24, 2009 11:24 AM
To: common-user@hadoop.apache.org
Subject: RE: Task process exit with nonzero status of 1

Hi Todd.

No userlogs seem to be created.  I'm guessing that's because the map task never actually starts.

I don't see any other errors in the tasktracker log, other than the one I put in the first
message ("java.io.IOException: Task process exit with nonzero status of 1...").  I've included
the output from one of the nodes' tasktracker logs below.

Any other suggestions?

Marc

2009-09-24 18:15:36,955 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200909221656_0006_m_000003_0 task's state:UNASSIGNED
2009-09-24 18:15:36,959 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200909221656_0006_m_000003_0
2009-09-24 18:15:36,960 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_200909221656_0006_m_000003_0
2009-09-24 18:15:37,483 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_200909221656_0006_m_-1451805198
2009-09-24 18:15:37,483 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_200909221656_0006_m_-1451805198 spawned.
2009-09-24 18:15:37,511 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_200909221656_0006_m_-1451805198 exited. Number of tasks it ran: 0
2009-09-24 18:15:37,512 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200909221656_0006_m_000003_0 Child Error
java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
2009-09-24 18:15:40,518 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200909221656_0006_m_000003_0 done; removing files.
2009-09-24 18:15:40,519 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
2009-09-24 18:15:42,964 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200909221656_0006_r_000001_0 task's state:UNASSIGNED
2009-09-24 18:15:42,964 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200909221656_0006_r_000001_0
2009-09-24 18:15:42,964 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_200909221656_0006_r_000001_0
2009-09-24 18:15:43,000 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_200909221656_0006_r_788502072
2009-09-24 18:15:43,000 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_200909221656_0006_r_788502072 spawned.
2009-09-24 18:15:43,026 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_200909221656_0006_r_788502072 exited. Number of tasks it ran: 0
2009-09-24 18:15:43,026 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200909221656_0006_r_000001_0 Child Error
java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
2009-09-24 18:15:46,034 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200909221656_0006_r_000001_0 done; removing files.
2009-09-24 18:15:46,039 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
2009-09-24 18:16:34,022 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200909221656_0006_m_000002_1 task's state:UNASSIGNED
2009-09-24 18:16:34,022 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200909221656_0006_m_000002_1
2009-09-24 18:16:34,022 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_200909221656_0006_m_000002_1
2009-09-24 18:16:34,060 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_200909221656_0006_m_-2120349138
2009-09-24 18:16:34,060 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_200909221656_0006_m_-2120349138 spawned.
2009-09-24 18:16:34,086 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_200909221656_0006_m_-2120349138 exited. Number of tasks it ran: 0
2009-09-24 18:16:34,087 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200909221656_0006_m_000002_1 Child Error
java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
2009-09-24 18:16:37,094 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200909221656_0006_m_000002_1 done; removing files.
2009-09-24 18:16:37,095 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
2009-09-24 18:16:40,032 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200909221656_0006_r_000000_1 task's state:UNASSIGNED
2009-09-24 18:16:40,032 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200909221656_0006_r_000000_1
2009-09-24 18:16:40,032 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_200909221656_0006_r_000000_1
2009-09-24 18:16:40,057 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_200909221656_0006_r_-1417908695
2009-09-24 18:16:40,057 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_200909221656_0006_r_-1417908695 spawned.
2009-09-24 18:16:40,084 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200909221656_0006_r_000000_1 Child Error
java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
2009-09-24 18:16:40,084 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_200909221656_0006_r_-1417908695 exited. Number of tasks it ran: 0
2009-09-24 18:16:43,091 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200909221656_0006_r_000000_1 done; removing files.
2009-09-24 18:16:43,092 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
2009-09-24 18:17:07,057 INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_200909221656_0006
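
The timestamps in this log show each child JVM exiting 26 to 28 ms after it is spawned, which fits the guess that the task never starts. A rough sketch for pulling those lifetimes out of a tasktracker log follows; it assumes each entry sits on a single line, as in the log file itself before mail wrapping, and the function and pattern names are hypothetical.

```python
import re
from datetime import datetime, timedelta

STAMP = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
SPAWNED = re.compile(STAMP + r" INFO \S+JvmManager: JVM Runner (\S+) spawned\.")
EXITED = re.compile(STAMP + r" INFO \S+JvmManager: JVM : (\S+) exited\.")

def parse_stamp(text):
    """Parse a log4j-style timestamp such as '2009-09-24 18:15:37,483'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")

def jvm_lifetimes_ms(lines):
    """Map each JVM ID to the whole milliseconds between its spawn and exit entries."""
    spawned = {}
    lifetimes = {}
    for line in lines:
        match = SPAWNED.match(line)
        if match:
            spawned[match.group(2)] = parse_stamp(match.group(1))
            continue
        match = EXITED.match(line)
        if match and match.group(2) in spawned:
            delta = parse_stamp(match.group(1)) - spawned[match.group(2)]
            lifetimes[match.group(2)] = delta // timedelta(milliseconds=1)
    return lifetimes
```

A JVM that lives only a few tens of milliseconds most likely failed during startup rather than inside user code, which narrows where to look.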


-----Original Message-----
From: Todd Lipcon [mailto:todd@cloudera.com]
Sent: Thursday, September 24, 2009 10:19 AM
To: common-user@hadoop.apache.org
Subject: Re: Task process exit with nonzero status of 1

Hi Marc,

Exit status 1 usually means some kind of controlled exit by the mapreduce
child task. Things like JVM crashes usually are indicated by other exit
codes (134 seems to be the code most commonly reported).
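
For reference, 134 fits a common Unix convention in which a status above 128 means 128 plus the number of the signal that killed the process, so 134 would be signal 6 (SIGABRT on Linux). A small sketch of that reading, assuming the status follows the convention (the function name is made up for illustration):

```python
def classify_exit_status(status):
    """Interpret a process exit status using the 128 + signal-number convention."""
    if status == 0:
        return "success"
    if status > 128:
        # Assumed convention: the process was terminated by signal (status - 128).
        return "killed by signal %d" % (status - 128)
    return "controlled exit"
```

Under this reading, status 1 is a deliberate exit by the child process, while 134 points at an abort.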

If you look at the stderr and stdout from your task (in the userlogs/
directory on the task tracker that ran them) do you see any output?
Additionally, is there anything in the logs for the task tracker itself?
That log is hadoop.log.dir/hadoop-<username>-tasktracker*log

If that log is pretty long, try grepping for WARN, ERROR, or Exception
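
Where grep is not handy, the same filter can be sketched in a few lines of Python. The function name is hypothetical, and the log path is whatever hadoop.log.dir resolves to on your node.

```python
import re

# The three markers suggested above; matching is case-sensitive, like plain grep.
PATTERN = re.compile(r"WARN|ERROR|Exception")

def interesting_lines(lines):
    """Return the lines that mention WARN, ERROR, or Exception, without newlines."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

Passing an open file object for the tasktracker log works, since the function only iterates over lines.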

-Todd

On Thu, Sep 24, 2009 at 9:57 AM, Marc Limotte <mlimotte@feeva.com> wrote:

> Thanks for the suggestion, Edward. I only upgraded the JVM after the
> problem occurred to see if it would help, but it made no difference.
>
> Marc
>
> -----Original Message-----
> From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
> Sent: Thursday, September 24, 2009 7:50 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Task process exit with nonzero status of 1
>
> On Wed, Sep 23, 2009 at 2:06 PM, Marc Limotte <mlimotte@feeva.com> wrote:
> > I'm seeing this error when I try to run my job.
> >
> > java.io.IOException: Task process exit with nonzero status of 1.
> >    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
> >
> > From what I can find by doing some Google searches, this means the mapred
> task JVM has crashed.  Not many suggestions about what to do about it.  Some
> suggestions about increasing max heap.  I tried that, although I don't think
> that's the issue because it's not a particularly memory intensive process
> and I've even tried it with a super small input data set of only a few
> records.  Still see the same issue.
> >
> > Can't find anything else in the logs.  I don't think my task even
> started, because there are no user logs created at all. Seems to fail during
> Job Setup.
> >
> > A little more background.  This job was working fine for weeks, running
> hourly, and then failed on Saturday morning and hasn't worked since.
>  Obviously, I looked for something that changed at that point, but no one
> was working at that time... can't find anything that changed.  I tried the
> job with different input data sets, doesn't seem to matter, unless I run it
> with no data at all.  The job does run with no input data, but if I have
> even a few input records it fails; it doesn't seem to matter which records.  I
> suspected some corruption in HDFS, but I was able to extract the data from
> HDFS (hadoop dfs -get ...) and the data looks ok.  I also copied this data
> set to our TEST cluster and ran the job there... and it WORKED!
> >
> > Ran one of our other jobs and it failed as well, so it doesn't seem to be
> job specific either; looks like every job fails the same way.
> >
> > Did a complete reboot of the cluster; no impact.
> >
> > We're using Hadoop 0.20.0, and Java 1.6 update 16 on CentOS 5.2 64bit.
> >
> > Any suggestions on what could be wrong or where to look for more
> information would be appreciated.
> >
> >
> >
> > Marc Limotte
> > Feeva Technology
> >
> > PRIVATE AND CONFIDENTIAL - NOTICE TO RECIPIENT: THIS E-MAIL IS MEANT FOR
> ONLY THE INTENDED RECIPIENT OF THE TRANSMISSION, AND MAY BE A COMMUNICATION
> PRIVILEGE BY LAW. IF YOU RECEIVED THIS E-MAIL IN ERROR, ANY REVIEW, USE,
> DISSEMINATION, DISTRIBUTION, OR COPYING OF THIS EMAIL IS STRICTLY
> PROHIBITED. PLEASE NOTIFY US IMMEDIATELY OF THE ERROR BY RETURN E-MAIL AND
> PLEASE DELETE THIS MESSAGE FROM YOUR SYSTEM.
> >
> Just a shot in the dark....
>
> Did you update Java recently?
>
>
> http://www.koopman.me/2009/04/hadoop-0183-could-not-create-the-java-virtual-machine/
>
