From: Jean-Marc Spaggiari <jean-marc@spaggiari.org>
Date: Sat, 25 May 2013 13:14:04 -0400
Subject: Re: Child Error
To: user@hadoop.apache.org

Hi Jim,

Will you be able to do the same test with Oracle JDK 1.6 instead of OpenJDK
1.7 to see if it makes a difference?

JM

2013/5/25 Jim Twensky <jim.twensky@gmail.com>

> Hi Jean, thanks for replying. I'm using Java 1.7.0_21 on Ubuntu. Here is
> the output:
>
> $ java -version
> java version "1.7.0_21"
> OpenJDK Runtime Environment (IcedTea 2.3.9) (7u21-2.3.9-0ubuntu0.12.10.1)
> OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
>
> I don't get any OOME errors, and this error happens on random nodes, not a
> particular one. Usually all tasks running on a particular node fail and
> that node gets blacklisted. However, the same node works just fine during
> the next or previous jobs. Can it be a problem with the ssh keys? What else
> can cause the IOException with the "failure to login" message? I've been
> digging into this for two days but I'm almost clueless.
>
> Thanks,
> Jim
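One quick thing to try on a node while it is misbehaving: the "failure to
login" in the original stack trace further down comes from the JDK's
UnixLoginModule, which fails with "invalid null input: name" when the
native UID-to-username lookup returns null. Here is a minimal sketch to
exercise that lookup directly (UnixPrincipal and UnixLoginModule are the
classes named in the trace; UnixSystem is the JDK helper UnixLoginModule
uses for the lookup, and the class name LoginCheck is just illustrative):

-----------------------------------------------------------------------------------------------------------
import com.sun.security.auth.UnixPrincipal;
import com.sun.security.auth.module.UnixSystem;

public class LoginCheck {
    public static void main(String[] args) {
        // Same kind of lookup UnixLoginModule performs: a native getpwuid()-style call.
        UnixSystem unix = new UnixSystem();
        long uid = unix.getUid();
        String name = unix.getUsername();   // null if the UID has no passwd/NSS entry
        System.out.println("uid=" + uid + " username=" + name);
        // UnixPrincipal rejects a null name with the same NullPointerException
        // ("invalid null input: name") seen in the Child stack trace.
        new UnixPrincipal(name);
        System.out.println("UnixPrincipal created fine");
    }
}
-----------------------------------------------------------------------------------------------------------

If username comes back null while tasks are failing on that node, the local
user lookup itself is breaking intermittently, which would explain the
login failure without involving the ssh keys at all.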
>
>
> On Fri, May 24, 2013 at 10:32 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> Hi Jim,
>>
>> Which JVM are you using?
>>
>> I don't think you have any memory issue. Otherwise you would have got
>> some OOME...
>>
>> JM
>>
>>
>> 2013/5/24 Jim Twensky <jim.twensky@gmail.com>
>>
>>> Hi again, in addition to my previous post, I was able to get some error
>>> logs from the task tracker/data node this morning, and it looks like it
>>> might be a jetty issue:
>>>
>>> 2013-05-23 19:59:20,595 WARN org.apache.hadoop.mapred.TaskLog: Failed to
>>> retrieve stdout log for task: attempt_201305231647_0007_m_001096_0
>>> java.io.IOException: Owner 'jim' for path
>>> /var/tmp/jim/hadoop-logs/userlogs/job_201305231647_0007/attempt_201305231647_0007_m_001096_0/stdout
>>> did not match expected owner '10929'
>>>   at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:177)
>>>   at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:117)
>>>   at org.apache.hadoop.mapred.TaskLog$Reader.<init>(TaskLog.java:455)
>>>   at org.apache.hadoop.mapred.TaskLogServlet.printTaskLog(TaskLogServlet.java:81)
>>>   at org.apache.hadoop.mapred.TaskLogServlet.doGet(TaskLogServlet.java:296)
>>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>>   at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>>>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>>>   at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:848)
>>>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>>   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>>
>>>
>>> I am wondering if I am hitting MAPREDUCE-2389. If so, how do I downgrade
>>> my jetty version? Should I just replace the jetty jar file in the lib
>>> directory with an earlier version and restart my cluster?
>>>
>>> Thank you.
>>>
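For what it's worth, the check that produces that warning boils down to a
string comparison between the log file's owner and the owner the servlet
expects. A rough sketch of the comparison (this is not Hadoop's actual
SecureIOUtils code; the path and the expected value '10929' are copied from
the log line above):

-----------------------------------------------------------------------------------------------------------
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class OwnerCheck {
    // Compare the on-disk owner of a task log with the owner the daemon expects.
    static void checkOwner(Path path, String expectedOwner) throws IOException {
        String actualOwner = Files.getOwner(path).getName();   // resolves to a name, e.g. "jim"
        if (!actualOwner.equals(expectedOwner)) {
            // A numeric expected owner such as "10929" can never match a
            // resolved user name such as "jim", so the check always fails.
            throw new IOException("Owner '" + actualOwner + "' for path " + path
                    + " did not match expected owner '" + expectedOwner + "'");
        }
    }

    public static void main(String[] args) throws IOException {
        checkOwner(Paths.get("/var/tmp/jim/hadoop-logs/userlogs/job_201305231647_0007/"
                + "attempt_201305231647_0007_m_001096_0/stdout"), "10929");
    }
}
-----------------------------------------------------------------------------------------------------------

So before downgrading jetty, it may be worth asking why the expected owner
shows up as a numeric UID rather than a user name on that node; it looks
like the same name/UID resolution question as the "failure to login" error.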
>>>
>>> On Thu, May 23, 2013 at 7:14 PM, Jim Twensky <jim.twensky@gmail.com> wrote:
>>>
>>>> Hello, I have a 20-node Hadoop cluster where each node has 8GB memory
>>>> and an 8-core processor. I sometimes get the following error on a
>>>> random basis:
>>>>
>>>> -----------------------------------------------------------------------------------------------------------
>>>> Exception in thread "main" java.io.IOException: Exception reading file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
>>>> 	at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>>>> 	at org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>>> 	at org.apache.hadoop.mapred.Child.main(Child.java:92)
>>>> Caused by: java.io.IOException: failure to login
>>>> 	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>>>> 	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>>> 	at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>>> 	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>>> 	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>>> 	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>>> 	at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>>> 	... 2 more
>>>> Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
>>>> 	at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
>>>> 	at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
>>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>
>>>> ......
>>>> -----------------------------------------------------------------------------------------------------------
>>>>
>>>> This does not always happen, but I see a pattern: when the intermediate
>>>> data is larger, it tends to occur more frequently. In the web log, I can
>>>> see the following:
>>>>
>>>> java.lang.Throwable: Child Error
>>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>>> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>>>
>>>> From what I read online, a possible cause is when there is not enough
>>>> memory for all the JVMs. My mapred-site.xml is set up to allocate 1100MB
>>>> for each child, and the maximum numbers of map and reduce tasks are each
>>>> set to 3 - so 6600MB for the child JVMs + (500MB * 2) for the data node
>>>> and task tracker (as I set HADOOP_HEAP to 500 MB). I feel like memory is
>>>> not the cause, but I haven't been able to rule it out so far.
>>>> In case it helps, here are the relevant sections of my mapred-site.xml:
>>>>
>>>> -----------------------------------------------------------------------------------------------------------
>>>>     <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>     <value>3</value>
>>>>
>>>>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>     <value>3</value>
>>>>
>>>>     <name>mapred.child.java.opts</name>
>>>>     <value>-Xmx1100M -ea -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/soner</value>
>>>>
>>>>     <name>mapred.reduce.parallel.copies</name>
>>>>     <value>5</value>
>>>>
>>>>     <name>tasktracker.http.threads</name>
>>>>     <value>80</value>
>>>> -----------------------------------------------------------------------------------------------------------
>>>>
>>>> My jobs still complete most of the time, though they occasionally fail,
>>>> and I'm really puzzled at this point. I'd appreciate any help or ideas.
>>>>
>>>> Thanks
>>>
>>
>
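As a sanity check on the memory math in that last message, the worst-case
budget works out like this (a quick arithmetic sketch; the 8GB node size,
heap settings and task limits are the ones quoted above, and -Xmx bounds
only the Java heap, so each JVM's real footprint is somewhat larger):

-----------------------------------------------------------------------------------------------------------
public class MemoryBudget {
    public static void main(String[] args) {
        int childHeapMb = 1100;        // -Xmx1100M per task JVM
        int maxMaps = 3;               // mapred.tasktracker.map.tasks.maximum
        int maxReduces = 3;            // mapred.tasktracker.reduce.tasks.maximum
        int daemonHeapMb = 2 * 500;    // data node + task tracker (HADOOP_HEAP set to 500 MB)
        int nodeTotalMb = 8 * 1024;    // 8GB per node

        int worstCaseMb = childHeapMb * (maxMaps + maxReduces) + daemonHeapMb;
        System.out.println("Worst-case Hadoop heap: " + worstCaseMb + " MB of "
                + nodeTotalMb + " MB, leaving " + (nodeTotalMb - worstCaseMb)
                + " MB for the OS, per-JVM overhead and page cache");
    }
}
-----------------------------------------------------------------------------------------------------------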