From: Jean-Marc Spaggiari <jean-marc@spaggiari.org>
Date: Sat, 25 May 2013 13:14:04 -0400
Subject: Re: Child Error
To: user@hadoop.apache.org

Hi Jim,

Will you be able to do the same test with Oracle JDK 1.6 instead of OpenJDK
1.7 to see if it makes a difference?

JM

2013/5/25 Jim Twensky <jim.twensky@gmail.com>

> Hi Jean, thanks for replying. I'm using Java 1.7.0_21 on Ubuntu. Here is
> the output:
>
> $ java -version
> java version "1.7.0_21"
> OpenJDK Runtime Environment (IcedTea 2.3.9) (7u21-2.3.9-0ubuntu0.12.10.1)
> OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
>
> I don't get any OOME errors, and this error happens on random nodes, not a
> particular one. Usually all tasks running on a particular node fail and
> that node gets blacklisted. However, the same node works just fine during
> the next or previous jobs. Can it be a problem with the ssh keys? What else
> can cause the IOException with the "failure to login" message? I've been
> digging into this for two days but I'm almost clueless.
>
> Thanks,
> Jim
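One quick thing to try on a node while it is misbehaving: the "failure to
login" in the original stack trace further down comes from the JDK's
UnixLoginModule, which fails with "invalid null input: name" when the
native UID-to-username lookup returns null. Here is a minimal sketch to
exercise that lookup directly (UnixPrincipal and UnixLoginModule are the
classes named in the trace; UnixSystem is the JDK helper UnixLoginModule
uses for the lookup, and the class name LoginCheck is just illustrative):

-----------------------------------------------------------------------------------------------------------
import com.sun.security.auth.UnixPrincipal;
import com.sun.security.auth.module.UnixSystem;

public class LoginCheck {
    public static void main(String[] args) {
        // Same kind of lookup UnixLoginModule performs: a native getpwuid()-style call.
        UnixSystem unix = new UnixSystem();
        long uid = unix.getUid();
        String name = unix.getUsername();   // null if the UID has no passwd/NSS entry
        System.out.println("uid=" + uid + " username=" + name);
        // UnixPrincipal rejects a null name with the same NullPointerException
        // ("invalid null input: name") seen in the Child stack trace.
        new UnixPrincipal(name);
        System.out.println("UnixPrincipal created fine");
    }
}
-----------------------------------------------------------------------------------------------------------

If username comes back null while tasks are failing on that node, the local
user lookup itself is breaking intermittently, which would explain the
login failure without involving the ssh keys at all.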
>
>
> On Fri, May 24, 2013 at 10:32 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> Hi Jim,
>>
>> Which JVM are you using?
>>
>> I don't think you have any memory issue. Otherwise you would have got
>> some OOME...
>>
>> JM
>>
>>
>> 2013/5/24 Jim Twensky <jim.twensky@gmail.com>
>>
>>> Hi again, in addition to my previous post, I was able to get some error
>>> logs from the task tracker/data node this morning, and it looks like it
>>> might be a jetty issue:
>>>
>>> 2013-05-23 19:59:20,595 WARN org.apache.hadoop.mapred.TaskLog: Failed to
>>> retrieve stdout log for task: attempt_201305231647_0007_m_001096_0
>>> java.io.IOException: Owner 'jim' for path
>>> /var/tmp/jim/hadoop-logs/userlogs/job_201305231647_0007/attempt_201305231647_0007_m_001096_0/stdout
>>> did not match expected owner '10929'
>>>   at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:177)
>>>   at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:117)
>>>   at org.apache.hadoop.mapred.TaskLog$Reader.<init>(TaskLog.java:455)
>>>   at org.apache.hadoop.mapred.TaskLogServlet.printTaskLog(TaskLogServlet.java:81)
>>>   at org.apache.hadoop.mapred.TaskLogServlet.doGet(TaskLogServlet.java:296)
>>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>>   at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>>>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>>>   at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:848)
>>>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>>   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>>
>>>
>>> I am wondering if I am hitting MAPREDUCE-2389. If so, how do I downgrade
>>> my jetty version? Should I just replace the jetty jar file in the lib
>>> directory with an earlier version and restart my cluster?
>>>
>>> Thank you.
>>>
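For what it's worth, the check that produces that warning boils down to a
string comparison between the log file's owner and the owner the servlet
expects. A rough sketch of the comparison (this is not Hadoop's actual
SecureIOUtils code; the path and the expected value '10929' are copied from
the log line above):

-----------------------------------------------------------------------------------------------------------
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class OwnerCheck {
    // Compare the on-disk owner of a task log with the owner the daemon expects.
    static void checkOwner(Path path, String expectedOwner) throws IOException {
        String actualOwner = Files.getOwner(path).getName();   // resolves to a name, e.g. "jim"
        if (!actualOwner.equals(expectedOwner)) {
            // A numeric expected owner such as "10929" can never match a
            // resolved user name such as "jim", so the check always fails.
            throw new IOException("Owner '" + actualOwner + "' for path " + path
                    + " did not match expected owner '" + expectedOwner + "'");
        }
    }

    public static void main(String[] args) throws IOException {
        checkOwner(Paths.get("/var/tmp/jim/hadoop-logs/userlogs/job_201305231647_0007/"
                + "attempt_201305231647_0007_m_001096_0/stdout"), "10929");
    }
}
-----------------------------------------------------------------------------------------------------------

So before downgrading jetty, it may be worth asking why the expected owner
shows up as a numeric UID rather than a user name on that node; it looks
like the same name/UID resolution question as the "failure to login" error.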
>>>
>>> On Thu, May 23, 2013 at 7:14 PM, Jim Twensky <jim.twensky@gmail.com> wrote:
>>>
>>>> Hello, I have a 20-node Hadoop cluster where each node has 8GB memory
>>>> and an 8-core processor. I sometimes get the following error on a
>>>> random basis:
>>>>
>>>> -----------------------------------------------------------------------------------------------------------
>>>> Exception in thread "main" java.io.IOException: Exception reading file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
>>>> 	at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>>>> 	at org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>>> 	at org.apache.hadoop.mapred.Child.main(Child.java:92)
>>>> Caused by: java.io.IOException: failure to login
>>>> 	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>>>> 	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>>> 	at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>>> 	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>>> 	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>>> 	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>>> 	at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>>> 	... 2 more
>>>> Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
>>>> 	at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
>>>> 	at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
>>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>
>>>> ......
>>>> -----------------------------------------------------------------------------------------------------------
>>>>
>>>> This does not always happen, but I see a pattern: when the intermediate
>>>> data is larger, it tends to occur more frequently. In the web log, I can
>>>> see the following:
>>>>
>>>> java.lang.Throwable: Child Error
>>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>>> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>>>
>>>> From what I read online, a possible cause is when there is not enough
>>>> memory for all the JVMs. My mapred-site.xml is set up to allocate 1100MB
>>>> for each child, and the maximum numbers of map and reduce tasks are each
>>>> set to 3 - so 6600MB for the child JVMs + (500MB * 2) for the data node
>>>> and task tracker (as I set HADOOP_HEAP to 500 MB). I feel like memory is
>>>> not the cause, but I haven't been able to rule it out so far.
>>>> In case it helps, here are the relevant sections of my mapred-site.xml:
>>>>
>>>> -----------------------------------------------------------------------------------------------------------
>>>>     <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>     <value>3</value>
>>>>
>>>>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>     <value>3</value>
>>>>
>>>>     <name>mapred.child.java.opts</name>
>>>>     <value>-Xmx1100M -ea -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/soner</value>
>>>>
>>>>     <name>mapred.reduce.parallel.copies</name>
>>>>     <value>5</value>
>>>>
>>>>     <name>tasktracker.http.threads</name>
>>>>     <value>80</value>
>>>> -----------------------------------------------------------------------------------------------------------
>>>>
>>>> My jobs still complete most of the time, though they occasionally fail,
>>>> and I'm really puzzled at this point. I'd appreciate any help or ideas.
>>>>
>>>> Thanks
>>>
>>
>
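As a sanity check on the memory math in that last message, the worst-case
budget works out like this (a quick arithmetic sketch; the 8GB node size,
heap settings and task limits are the ones quoted above, and -Xmx bounds
only the Java heap, so each JVM's real footprint is somewhat larger):

-----------------------------------------------------------------------------------------------------------
public class MemoryBudget {
    public static void main(String[] args) {
        int childHeapMb = 1100;        // -Xmx1100M per task JVM
        int maxMaps = 3;               // mapred.tasktracker.map.tasks.maximum
        int maxReduces = 3;            // mapred.tasktracker.reduce.tasks.maximum
        int daemonHeapMb = 2 * 500;    // data node + task tracker (HADOOP_HEAP set to 500 MB)
        int nodeTotalMb = 8 * 1024;    // 8GB per node

        int worstCaseMb = childHeapMb * (maxMaps + maxReduces) + daemonHeapMb;
        System.out.println("Worst-case Hadoop heap: " + worstCaseMb + " MB of "
                + nodeTotalMb + " MB, leaving " + (nodeTotalMb - worstCaseMb)
                + " MB for the OS, per-JVM overhead and page cache");
    }
}
-----------------------------------------------------------------------------------------------------------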