hadoop-common-user mailing list archives

From Jim Twensky <jim.twen...@gmail.com>
Subject Re: Child Error
Date Sat, 25 May 2013 16:32:34 GMT
Hi Jean, thanks for replying. I'm using Java 1.7.0_21 on Ubuntu. Here is
the output:

$ java -version
java version "1.7.0_21"
OpenJDK Runtime Environment (IcedTea 2.3.9) (7u21-2.3.9-0ubuntu0.12.10.1)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

I don't get any OOME errors, and this error happens on random nodes, not a
particular one. Usually all tasks running on a particular node fail and
that node gets blacklisted. However, the same node works just fine during
the next or previous jobs. Could it be a problem with the SSH keys? What else
can cause the IOException with the "failure to login" message? I've been
digging into this for two days but I'm almost clueless.
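
In case it is useful, here is a minimal, hypothetical sketch of my own (not
Hadoop code) of the lookup that appears to be failing: UnixLoginModule builds
a UnixPrincipal from the user name the OS reports for the process, and if that
lookup comes back null on a node (say, a transient NSS/LDAP hiccup resolving
the task's UID), it fails with exactly the "invalid null input: name"
NullPointerException from the trace quoted below.

// Hypothetical diagnostic; UnixSystem and UnixPrincipal are internal OpenJDK
// classes, so this is only a sketch of what the login module does, not Hadoop code.
public class WhoDoesTheJvmThinkIAm {
    public static void main(String[] args) {
        // The name the JVM picked up at startup:
        System.out.println("user.name           = " + System.getProperty("user.name"));
        // The name the OS reports for this process's UID; if this comes back null,
        // new UnixPrincipal(name) fails with "invalid null input: name".
        com.sun.security.auth.module.UnixSystem unix = new com.sun.security.auth.module.UnixSystem();
        System.out.println("UnixSystem username = " + unix.getUsername());
        System.out.println("UnixSystem uid      = " + unix.getUid());
    }
}

Running this under the same account the TaskTracker uses to launch children
should show whether the name lookup ever comes back empty on the bad nodes.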

Thanks,
Jim




On Fri, May 24, 2013 at 10:32 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:

> Hi Jim,
>
> Which JVM are you using?
>
> I don't think you have a memory issue; otherwise you would have gotten some
> OOME...
>
> JM
>
>
> 2013/5/24 Jim Twensky <jim.twensky@gmail.com>
>
>> Hi again, in addition to my previous post, I was able to get some error
>> logs from the TaskTracker/DataNode this morning, and it looks like it might
>> be a Jetty issue:
>>
>> 2013-05-23 19:59:20,595 WARN org.apache.hadoop.mapred.TaskLog: Failed to
>> retrieve stdout log for task: attempt_201305231647_0007_m_001096_0
>> java.io.IOException: Owner 'jim' for path
>> /var/tmp/jim/hadoop-logs/userlogs/job_201305231647_0007/attempt_201305231647_0007_m_001096_0/stdout
>> did not match expected owner '10929'
>>   at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:177)
>>   at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:117)
>>   at org.apache.hadoop.mapred.TaskLog$Reader.<init>(TaskLog.java:455)
>>   at org.apache.hadoop.mapred.TaskLogServlet.printTaskLog(TaskLogServlet.java:81)
>>   at org.apache.hadoop.mapred.TaskLogServlet.doGet(TaskLogServlet.java:296)
>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>   at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>>   at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:848)
>>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>
>>
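>> Just to illustrate what that owner check amounts to, here is a rough
>> standalone sketch of my own (not Hadoop's SecureIOUtils): the servlet
>> compares the owner the filesystem reports for the log file against the owner
>> it expects for the task. In my case the expected side is the string '10929',
>> which looks like a bare numeric UID, while stat() reports the name 'jim'.
>>
>> // Hypothetical standalone check; the path and expected owner are illustrative.
>> import java.nio.file.*;
>>
>> public class OwnerCheck {
>>     public static void main(String[] args) throws Exception {
>>         Path log = Paths.get(args[0]);                  // e.g. the task's stdout log
>>         String expected = args[1];                      // e.g. "jim" or "10929"
>>         String actual = Files.getOwner(log).getName();  // what the filesystem reports
>>         if (!actual.equals(expected)) {
>>             throw new java.io.IOException("Owner '" + actual + "' for path " + log
>>                 + " did not match expected owner '" + expected + "'");
>>         }
>>         System.out.println("owner matches: " + actual);
>>     }
>> }
>>
>> So a node where the UID-to-name mapping is flaky would fail this comparison
>> even though the file is owned by the right account.
>>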
>> I am wondering if I am hitting MAPREDUCE-2389
>> (https://issues.apache.org/jira/browse/MAPREDUCE-2389). If so, how do I
>> downgrade my Jetty version? Should I just replace the Jetty jar file in the
>> lib directory with an earlier version and restart my cluster?
>>
>> Thank you.
>>
>>
>>
>>
>> On Thu, May 23, 2013 at 7:14 PM, Jim Twensky <jim.twensky@gmail.com> wrote:
>>
>>> Hello, I have a 20-node Hadoop cluster where each node has 8GB of memory
>>> and an 8-core processor. I sometimes get the following error, seemingly at
>>> random:
>>>
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------------
>>>
>>> Exception in thread "main" java.io.IOException: Exception reading file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
>>> 	at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>>> 	at org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>> 	at org.apache.hadoop.mapred.Child.main(Child.java:92)
>>> Caused by: java.io.IOException: failure to login
>>> 	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>>> 	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>> 	at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>> 	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>> 	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>> 	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>> 	at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>> 	... 2 more
>>> Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
>>> 	at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
>>> 	at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>
>>> ......
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------------
>>>
>>> This does not always happen, but I see a pattern: when the intermediate
>>> data is larger, it tends to occur more frequently. In the web log, I can
>>> see the following:
>>>
>>> java.lang.Throwable: Child Error
>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>>
>>> From what I read online, a possible cause is not having enough memory for
>>> all the JVMs. My mapred-site.xml is set up to allocate 1100MB for each child,
>>> and the maximum numbers of map and reduce tasks are both set to 3, so that is
>>> 6 x 1100MB = 6600MB for the child JVMs plus 2 x 500MB for the DataNode and
>>> TaskTracker (I set HADOOP_HEAPSIZE to 500MB), roughly 7600MB of the 8GB on
>>> each node. I feel like memory is not the cause, but I haven't been able to
>>> rule it out so far.
>>> In case it helps, here are the relevant sections of my mapred-site.xml
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------------
>>>
>>>     <property>
>>>       <name>mapred.tasktracker.map.tasks.maximum</name>
>>>       <value>3</value>
>>>     </property>
>>>
>>>     <property>
>>>       <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>       <value>3</value>
>>>     </property>
>>>
>>>     <property>
>>>       <name>mapred.child.java.opts</name>
>>>       <value>-Xmx1100M -ea -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/soner</value>
>>>     </property>
>>>
>>>     <property>
>>>       <name>mapred.reduce.parallel.copies</name>
>>>       <value>5</value>
>>>     </property>
>>>
>>>     <property>
>>>       <name>tasktracker.http.threads</name>
>>>       <value>80</value>
>>>     </property>
>>>
>>> -----------------------------------------------------------------------------------------------------------
>>>
>>> My jobs still complete most of the time, though they occasionally fail,
>>> and I'm really puzzled at this point. I'd appreciate any help or ideas.
>>>
>>> Thanks
>>>
>>
>>
>
