hadoop-common-user mailing list archives

From Greg Langmead <glangm...@sdl.com>
Subject Problem identifying cause of a failed job
Date Tue, 16 Nov 2010 22:50:17 GMT
Newbie alert.

I have a Pig script I tested on small data and am now running it on a larger
data set (85 GB). My cluster is two machines right now, each with 16 cores
and 32 GB of RAM. I configured Hadoop to have 15 tasktrackers on each of
these nodes. One of them is the namenode, the other is the secondary
namenode. I'm using Pig 0.7.0 and Hadoop 0.20.2 with Java 1.6.0_18 on Linux
Fedora Core 12, 64-bit.

My Pig job starts, and eventually a reduce task fails. I'd like to find out
why. Here's what I know:

The web UI lists the failed reduce tasks and shows this error:

java.io.IOException: Task process exit with nonzero status of 134.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
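
As far as I can tell, exit status 134 is 128 + 6, i.e. the child JVM died on
SIGABRT, which I gather is what HotSpot raises after hitting a fatal error
like the SIGSEGV below. A quick bash sanity check of that arithmetic (my own
aside, not something from the Hadoop logs):

  $ kill -l 134   # bash maps exit codes above 128 back to the signal name
  ABRT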

The userlog userlogs/attempt_201011151350_0001_r_000063_0/stdout says this:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ff74158463c, pid=27109, tid=140699912791824
#
# JRE version: 6.0_18-b07
# Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode linux-amd64 )
[thread 140699484784400 also had an error]
# Problematic frame:
# V  [libjvm.so+0x62263c]
#
# An error report file with more information is saved as:
# /tmp/hadoop-hadoop/mapred/local/taskTracker/jobcache/job_201011151350_0001/attempt_201011151350_0001_r_000063_0/work/hs_err_pid27109.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

My mapred-site.xml already includes this:

<property>
<name>keep.failed.task.files</name>
<value>true</value>
</property>
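
(Aside: I believe 0.20 also has a pattern-based variant,
keep.task.files.pattern, in case keeping files for every failed task proves
too heavy. Something like the following is my guess at how it would look for
just this reducer; the regex value is hypothetical and untested:)

<property>
<name>keep.task.files.pattern</name>
<value>.*_r_000063.*</value>
</property>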

So I was hoping that the file hs_err_pid27109.log would exist, but it
doesn't. I made sure to check the /tmp dir on both tasktrackers. In fact
there is no directory

  jobcache/job_201011151350_0001/attempt_201011151350_0001_r_000063_0

only

  jobcache/job_201011151350_0001/attempt_201011151350_0001_r_000063_0.cleanup

I'd like to find the source of the segfault; can anyone point me in the
right direction?

Of course let me know if you need more information!

Greg Langmead | Senior Research Scientist | SDL Language Weaver | (t) +1 310 437 7300
