hadoop-common-user mailing list archives

From "Jinsong Hu" <jinsong...@hotmail.com>
Subject Re: Task process exit with nonzero status of 126 . all mapreduce job fails
Date Wed, 13 Oct 2010 18:24:47 GMT
A few more clues:

I noticed that each task tracker has several thousand failed jobs, but
strangely, none of them has been put on the blacklist.

I restarted the jobtracker and found that new jobs submitted to the cluster
still fail with the same error.

I then restarted all the task trackers, and the cluster recovered.

Jimmy.
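
A note on the exit status itself: by POSIX shell convention, 126 means a
command was found but could not be executed (for example, the file lacks
execute permission). The thread does not establish that this is what happened
on these task trackers; the sketch below only reproduces the convention in
isolation. It assumes a POSIX shell at /bin/sh, and the script name
launch_task.sh is hypothetical, not a Hadoop file.

```python
import os
import shlex
import stat
import subprocess
import tempfile


def exit_status_of_non_executable_script() -> int:
    """Ask /bin/sh to run a script that exists but has no execute bit.

    Returns the exit status reported by the shell.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical script name, used only for this illustration.
        script = os.path.join(workdir, "launch_task.sh")
        with open(script, "w") as handle:
            handle.write("#!/bin/sh\nexit 0\n")
        # Read/write for the owner only: the file is found but not executable.
        os.chmod(script, stat.S_IRUSR | stat.S_IWUSR)
        # Invoke the script by path through the shell, so the shell itself
        # reports the failure instead of Python raising PermissionError.
        result = subprocess.run(
            ["/bin/sh", "-c", shlex.quote(script)],
            stderr=subprocess.DEVNULL,
        )
        return result.returncode


if __name__ == "__main__":
    print(exit_status_of_non_executable_script())
```

If a task launcher reports the same status, permissions on the launch script
and on the directories it lives in are one thing worth checking on an affected
node, alongside the task's own stderr.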

--------------------------------------------------
From: "Jinsong Hu" <jinsong_hu@hotmail.com>
Sent: Wednesday, October 13, 2010 10:56 AM
To: <common-user@hadoop.apache.org>
Subject: Task process exit with nonzero status of 126 . all mapreduce job 
fails

> Hi there,
> I am running the CDH3b2 distribution of Hadoop, and it has been working for
> several weeks. Last night all tasks began to fail; not a single job has
> finished successfully.
>
> Here is the relevant information: the jobtracker .out file is empty.
> I suspected an out-of-memory error, because "top" shows RES at 2G while my
> command line only gives it 1G, but I searched the log and found no such
> error message.
>
>
> Can anybody give me a clue about how to find the cause and how to fix this?
>
> Jimmy
>
>
> Here is the record from the jobtracker:
>
> g job_201010011833_72867
> 2010-10-13 17:00:55,003 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201010011833_72860_r_000001_0: java.lang.Throwable: Child Error
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:472)
> Caused by: java.io.IOException: Task process exit with nonzero status of 126.
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:459)
>
> [hadoop@m0002041 logs]$ grep attempt_201010011833_63031_m_000020_3 hadoop-hadoop-jobtracker-m0002041.log
> 2010-10-13 00:04:56,361 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201010011833_63031_m_000020_3' to tip task_201010011833_63031_m_000020, for tracker 'tracker_m0002014.ppops.net:localhost.localdomain/127.0.0.1:40961'
> 2010-10-13 00:04:59,365 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201010011833_63031_m_000020_3: java.lang.Throwable: Child Error
> 2010-10-13 00:05:02,423 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201010011833_63031_m_000020_3' from 'tracker_m0002014.ppops.net:localhost.localdomain/127.0.0.1:40961'
>
>
>
> Here is the result from the tasktracker:
>
> [root@m0002014 logs]# grep attempt_201010011833_63031_m_000020_3 hadoop-hadoop-tasktracker-m0002014.log
> 2010-10-13 00:04:56,362 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201010011833_63031_m_000020_3 task's state:UNASSIGNED
> 2010-10-13 00:04:56,362 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201010011833_63031_m_000020_3
> 2010-10-13 00:04:56,362 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_201010011833_63031_m_000020_3
> 2010-10-13 00:04:56,449 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201010011833_63031_m_000020_3Child Error
> 2010-10-13 00:04:59,455 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201010011833_63031_m_000020_3 done; removing files.
>
>
>
>
> 2010-10-13 00:04:56,449 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201010011833_63031_m_000020_3Child Error
> java.io.IOException: Task process exit with nonzero status of 126.
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:459)
> 2010-10-13 00:04:56,449 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201010011833_63031_m_-772401408 exited. Number of tasks it ran: 0
>
>
> Here is the relevant startup command:
>
> hadoop   11874     1  6 Oct01 ?        18:09:24 /usr/java/latest/bin/java -Xmx1000m -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+UseConcMarkSweepGC -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:+UseCompressedOops -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -Dcom.sun.management.jmxremote -Xmx4G -Dcom.sun.management.jmxremote.port=8008 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/home/hadoop/hadoop/logs/gc-jobtracker.log -Dhadoop.log.dir=/home/hadoop/hadoop/logs -Dhadoop.log.f
>
> top result:
>
> PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 1874 hadoop    23   0 4826m 2.0g  10m S 10.0 25.3   1089:29 java
>
>
>
> top -H result:
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAN
> 11973 hadoop    15   0 4826m 2.0g  10m S  8.3 25.3  35:08.46 java
> 11974 hadoop    15   0 4826m 2.0g  10m S  0.7 25.3  35:18.06 java
> 12029 hadoop    16   0 4826m 2.0g  10m S  0.3 25.3   3:21.17 java
> 12538 hadoop    16   0 4826m 2.0g  10m S  0.3 25.3   7:18.12 java
> 12539 hadoop    15   0 4826m 2.0g  10m S  0.3 25.3   0:59.08 java
> 11874 hadoop    18   0 4826m 2.0g  10m S  0.0 25.3   0:00.00 java
> 11906 hadoop    23   0 4826m 2.0g  10m S  0.0 25.3   0:01.16 java
> 11907 hadoop    16   0 4826m 2.0g  10m S  0.0 25.3   4:55.39 java
> 11908 hadoop    16   0 4826m 2.0g  10m S  0.0 25.3   4:55.71 java
>
>
> Here is the error count trend:
>
> [hadoop@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log | wc
> 113559 1135590 17374527
> [hadoop@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-12 | wc
> 111368 1113680 17039304
> [hadoop@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-11 | wc
>   5163   51638  790076
> [hadoop@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-10 | wc
>     29     316    4850
> [hadoop@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-09 | wc
>     35     412    7492
> [hadoop@m0002041 logs]$ grep -i error hadoop-hadoop-jobtracker-m0002041.log.2010-10-08 | wc
>     26     346    5088
>
>
>
>
> 
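
The per-day counts in the quoted message come from running grep -i error on
each rotated jobtracker log and piping to wc. A minimal sketch of the same
tally in one pass is shown here; the function names count_error_lines and
error_trend are illustrative, and the only assumption about the logs is that
the rotated files share a common filename prefix, as they do in the commands
above.

```python
import re
from pathlib import Path

# Case-insensitive match, mirroring `grep -i error`.
ERROR_PATTERN = re.compile(r"error", re.IGNORECASE)


def count_error_lines(log_path):
    """Count lines containing 'error' in any letter case (like grep -i | wc -l)."""
    with open(log_path, errors="replace") as handle:
        return sum(1 for line in handle if ERROR_PATTERN.search(line))


def error_trend(log_dir, prefix):
    """Map each log file whose name starts with `prefix` to its error-line count."""
    counts = {}
    for path in sorted(Path(log_dir).glob(prefix + "*")):
        counts[path.name] = count_error_lines(path)
    return counts


if __name__ == "__main__":
    # Directory and prefix taken from the commands in the quoted message.
    trend = error_trend(
        "/home/hadoop/hadoop/logs", "hadoop-hadoop-jobtracker-m0002041.log"
    )
    for name, count in trend.items():
        print(f"{count:>8}  {name}")
```

Printing the counts side by side makes the jump between consecutive days easy
to spot, which helps narrow down when the failures started.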
