hadoop-hdfs-user mailing list archives

From java8964 java8964 <java8...@hotmail.com>
Subject java.io.IOException: Task process exit with nonzero status of -1
Date Thu, 15 Aug 2013 15:58:24 GMT
This is a 4-node Hadoop cluster running on CentOS 6.3 with Oracle JDK 1.6.0_43 (64-bit). Each
node has 32G of memory and is configured for a maximum of 8 mapper tasks and 4 reducer tasks. The Hadoop version
is 1.0.4.
The cluster is set up on DataStax DSE 3.0.2, which uses Cassandra CFS as the underlying DFS instead
of HDFS with a NameNode. I understand this kind of setup is not really tested with Hadoop
MR, but my guess is that the error in the subject line is not related to it.
I am running a simple MR job that partitions 700G of data in 600 files by DATE. The MR logic is
very straightforward, but in the staging environment described above, a lot of reducers failed
with the error in the subject line. I want to know the reason and fix it.
1) There is no log entry related to this error in the reducer task attempt log in the user log directory.
The only related entry is in system.log, which is generated by the Cassandra process:
    INFO [JVM Runner jvm_201308141528_0003_r_625176200 spawned.] 2013-08-15 07:28:59,326 JvmManager.java (line 510) JVM : jvm_201308141528_0003_r_625176200 exited with exit code -1. Number of tasks it ran: 0
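To see how many child JVMs exited this way, the JvmManager lines can be pulled out of system.log with a small script. This is only a sketch: the regular expression is modeled on the single line quoted above, and the function name find_jvm_exits is made up for illustration.

```python
import re

# Pattern modeled on the one JvmManager line quoted in this message;
# other log layouts may need a different expression.
EXIT_RE = re.compile(
    r"JVM : (jvm_\d+_\d+_[mr]_-?\d+) exited with exit code (-?\d+)\. "
    r"Number of tasks it ran: (\d+)"
)

def find_jvm_exits(lines):
    """Return (jvm_id, exit_code, tasks_run) for every matching log line."""
    results = []
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            results.append(
                (match.group(1), int(match.group(2)), int(match.group(3)))
            )
    return results

# The log line from this message, joined back into one line.
sample = (
    "INFO [JVM Runner jvm_201308141528_0003_r_625176200 spawned.] "
    "2013-08-15 07:28:59,326 JvmManager.java (line 510) JVM : "
    "jvm_201308141528_0003_r_625176200 exited with exit code -1. "
    "Number of tasks it ran: 0"
)
print(find_jvm_exits([sample]))
```

Passing an open file handle for system.log instead of the sample list would give one tuple per exited JVM, which makes it easy to see whether only reducers ("_r_") are affected.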
2) I believe this error is related to system resources, but I could not find anything on Google
that points to the root cause. From the log, I believe the reducer task's JVM terminated or crashed,
but I don't know why.
3) I checked the limits of the user the process is running as. Here is the output; I didn't
spot any obvious problems.
    -bash-4.1$ ulimit -a
    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 256589
    max locked memory       (kbytes, -l) unlimited
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 400000
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 10240
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 32768
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited
4) Since this is a new cluster, not many Hadoop settings have been changed from their
default values. I did run the reducers with '-mx2048m' to set the JVM heap size to 2G, because
the first time the reducers failed with an OOM error. From searching around, it looks like people recommend
setting "mapred.child.ulimit" to 3x the heap size, which would be around 6G in this case. I
can give that a try, but on the nodes the virtual memory is set to unlimited for the user the
process runs as, so I am not sure this will really fix it.
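For the 3x figure, the arithmetic can be sketched as follows. This assumes "mapred.child.ulimit" is expressed in kilobytes, which is how I read the Hadoop 1.x documentation; the function name child_ulimit_kb and the factor of 3 are just the rule of thumb mentioned above, not anything official.

```python
def child_ulimit_kb(heap_mb, factor=3):
    """Virtual memory limit in KB for a child JVM with a heap of heap_mb MB.

    Assumption: the limit is given in kilobytes and the rule of thumb is
    'factor' times the heap size.
    """
    return heap_mb * 1024 * factor

# 2G heap, as set with -mx2048m
limit_kb = child_ulimit_kb(2048)
print(limit_kb)                # 6291456
print(limit_kb / 1024 / 1024)  # 6.0 (GB)
```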
5) Another possibility I found on Google is that the child process returns -1 when it fails
to write to the user logs, because Linux EXT3 has a limit on how many files/directories can
be created under one folder (32k?). But my system is using EXT4, and not many
MR jobs have run so far.
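To rule this out, the number of subdirectories under the user log directory can be counted and compared with the limit. The sketch below makes two assumptions: the limit constant is the commonly cited EXT3 figure of 31998 subdirectories per directory, and the directory to check is passed on the command line, since the actual userlogs path depends on the installation.

```python
import os
import sys

# Commonly cited EXT3 limit on subdirectories in one directory (assumption
# for comparison only; EXT4 allows more).
EXT3_SUBDIR_LIMIT = 31998

def count_subdirs(path):
    """Count the immediate subdirectories of path, not following symlinks."""
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_subdirs.py <path-to-userlogs-directory>
    total = count_subdirs(sys.argv[1])
    print("%d subdirectories (EXT3 limit would be %d)" % (total, EXT3_SUBDIR_LIMIT))
```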
6) I am really not sure what the root cause is, as exit code -1 could mean a lot of things.
Can anyone here give me more hints, or any help with debugging this issue in
my environment? Is there any Hadoop or JVM setting I can use to dump more info/logs
about why the JVM terminated at runtime with exit code -1?