hadoop-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Child Error
Date Tue, 28 May 2013 14:24:41 GMT
That's strange.

So now, each time you run it, does it fail with the exception below? Or is
it sometimes working and sometimes failing?

Also, can you clear your tmp directory and make sure there is enough space
in it before you retry?
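
If you want a quick programmatic check of the space on that volume, here is
a minimal sketch (the /var/tmp/jim path below is taken from your stack
trace; adjust it to whatever mapred.local.dir points at on your nodes):

import java.io.File;

// Prints usable vs. total space on the volume holding the task tracker's
// local directory, to show whether the jobToken files could be hitting a
// full disk.
public class TmpSpace {
    public static void main(String[] args) {
        File dir = new File("/var/tmp/jim/hadoop-jim/mapred/local");
        System.out.printf("usable: %d MB of %d MB%n",
                dir.getUsableSpace() >> 20, dir.getTotalSpace() >> 20);
    }
}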

JM

2013/5/27 Jim Twensky <jim.twensky@gmail.com>

> Hi Jean,
>
> I switched to Oracle JDK 1.6 as you suggested and ran a job successfully
> this afternoon that lasted about 3 hours - this job normally produced
> errors with OpenJDK. I then stopped the cluster, restarted it, and tried
> running the same job, but I got the same "failure to login" error with
> the Oracle JDK. This is really weird and unusual. I am pasting the stack
> trace below. It occurred on 3 different nodes out of 20. Any suggestions?
>
> Exception in thread "main" java.io.IOException: Exception reading
> file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305262239_0002/jobToken
>
>     at
> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>     at
> org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>     at org.apache.hadoop.mapred.Child.main(Child.java:92)
> Caused by: java.io.IOException: failure to login
>     at
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>     at
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>     at
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>     at
> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>     ... 2 more
> Caused by: javax.security.auth.login.LoginException:
> java.lang.NullPointerException: invalid null input: name
>     at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:53)
>     at
> com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:114)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at javax.security.auth.login.LoginContext.invoke(LoginContext.java:769)
>     at
> javax.security.auth.login.LoginContext.access$000(LoginContext.java:186)
>     at javax.security.auth.login.LoginContext$5.run(LoginContext.java:706)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at
> javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:703)
>     at javax.security.auth.login.LoginContext.login(LoginContext.java:575)
>     at
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:482)
>
>     at
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>     at
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>     at
> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>     at
> org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>     at org.apache.hadoop.mapred.Child.main(Child.java:92)
>
>     at javax.security.auth.login.LoginContext.invoke(LoginContext.java:872)
>     at
> javax.security.auth.login.LoginContext.access$000(LoginContext.java:186)
>     at javax.security.auth.login.LoginContext$5.run(LoginContext.java:706)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at
> javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:703)
>     at javax.security.auth.login.LoginContext.login(LoginContext.java:575)
>     at
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:482)
>
>
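> In case it helps narrow this down: the root "invalid null input: name"
> NPE comes from UnixLoginModule failing to resolve a user name for the
> process UID. Here is a minimal diagnostic sketch of that same lookup
> (hedged: UnixSystem is a Sun/Oracle JDK-internal class, so treat this as
> a probe only, not something to ship):
>
> import com.sun.security.auth.module.UnixSystem;
>
> // Performs the UID -> user name lookup (getpwuid via native code) that
> // UnixLoginModule relies on. A null name here reproduces the exact
> // "invalid null input: name" failure in the trace above.
> public class WhoAmICheck {
>     public static void main(String[] args) {
>         UnixSystem unix = new UnixSystem();
>         System.out.println("uid  = " + unix.getUid());
>         System.out.println("name = " + unix.getUsername()); // null => login fails
>     }
> }
>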
> On Sat, May 25, 2013 at 12:14 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> Hi Jim,
>>
>> Will you be able to do the same test with Oracle JDK 1.6 instead of
>> OpenJDK 1.7 to see if it makes a difference?
>>
>> JM
>>
>>
>> 2013/5/25 Jim Twensky <jim.twensky@gmail.com>
>>
>>> Hi Jean, thanks for replying. I'm using Java 1.7.0_21 on Ubuntu. Here is
>>> the output:
>>>
>>> $ java -version
>>> java version "1.7.0_21"
>>> OpenJDK Runtime Environment (IcedTea 2.3.9) (7u21-2.3.9-0ubuntu0.12.10.1)
>>> OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
>>>
>>> I don't get any OOME errors, and this error happens on random nodes, not
>>> a particular one. Usually all tasks running on a given node fail and
>>> that node gets blacklisted; however, the same node works just fine during
>>> the next or previous jobs. Could it be a problem with the SSH keys? What
>>> else can cause the IOException with the "failure to login" message? I've
>>> been digging into this for two days, but I'm almost clueless.
>>>
>>> Thanks,
>>> Jim
>>>
>>> On Fri, May 24, 2013 at 10:32 PM, Jean-Marc Spaggiari <
>>> jean-marc@spaggiari.org> wrote:
>>>
>>>> Hi Jim,
>>>>
>>>> Which JVM are you using?
>>>>
>>>> I don't think you have a memory issue; otherwise you would have got
>>>> some OOMEs...
>>>>
>>>> JM
>>>>
>>>>
>>>> 2013/5/24 Jim Twensky <jim.twensky@gmail.com>
>>>>
>>>>> Hi again. In addition to my previous post, I was able to get some
>>>>> error logs from the task tracker/data node this morning, and it looks
>>>>> like it might be a Jetty issue:
>>>>>
>>>>> 2013-05-23 19:59:20,595 WARN org.apache.hadoop.mapred.TaskLog: Failed
>>>>> to retrieve stdout log for task: attempt_201305231647_0007_m_001096_0
>>>>> java.io.IOException: Owner 'jim' for path
>>>>> /var/tmp/jim/hadoop-logs/userlogs/job_201305231647_0007/attempt_201305231647_0007_m_001096_0/stdout
>>>>> did not match expected owner '10929'
>>>>>   at
>>>>> org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:177)
>>>>>   at
>>>>> org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:117)
>>>>>   at org.apache.hadoop.mapred.TaskLog$Reader.<init>(TaskLog.java:455)
>>>>>   at
>>>>> org.apache.hadoop.mapred.TaskLogServlet.printTaskLog(TaskLogServlet.java:81)
>>>>>   at
>>>>> org.apache.hadoop.mapred.TaskLogServlet.doGet(TaskLogServlet.java:296)
>>>>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>>>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>>>>   at
>>>>> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>>>>>   at
>>>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>>>>>   at
>>>>> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:848)
>>>>>   at
>>>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>>>>   at
>>>>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>>>>
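>>>>> The mismatch above also looks suspicious on its own: the check in
>>>>> SecureIOUtils compares the log file's owner against an expected user,
>>>>> and here the expected side is the raw UID string '10929' rather than a
>>>>> name, which again points at a UID-to-name lookup going wrong. Here is a
>>>>> small probe of how a node resolves that file's owner (hedged: it uses
>>>>> the java.nio.file API, so it needs a Java 7 JVM to run):
>>>>>
>>>>> import java.nio.file.Files;
>>>>> import java.nio.file.Path;
>>>>> import java.nio.file.Paths;
>>>>>
>>>>> // Prints the owner of a task log file as the JVM resolves it. If a
>>>>> // name like 'jim' comes back on some runs and a bare UID on others,
>>>>> // the user database lookup on that node is flaky.
>>>>> public class OwnerCheck {
>>>>>     public static void main(String[] args) throws Exception {
>>>>>         Path log = Paths.get(args[0]); // pass the stdout path from the error
>>>>>         System.out.println("owner = " + Files.getOwner(log).getName());
>>>>>     }
>>>>> }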
>>>>>
>>>>> I am wondering if I am hitting MAPREDUCE-2389
>>>>> (https://issues.apache.org/jira/browse/MAPREDUCE-2389). If so, how do I
>>>>> downgrade my Jetty version? Should I just replace the Jetty jar file in
>>>>> the lib directory with an earlier version and restart my cluster?
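>>>>>
>>>>> Before swapping jars, it may be worth confirming which Jetty the task
>>>>> tracker actually loads. A quick sketch, assuming Jetty 6 (getVersion()
>>>>> is a static method on org.mortbay.jetty.Server there); run it with the
>>>>> Hadoop lib jars on the classpath:
>>>>>
>>>>> // Prints the version string of whichever Jetty jar is first on the
>>>>> // classpath.
>>>>> public class JettyVersionCheck {
>>>>>     public static void main(String[] args) {
>>>>>         System.out.println(org.mortbay.jetty.Server.getVersion());
>>>>>     }
>>>>> }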
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>> On Thu, May 23, 2013 at 7:14 PM, Jim Twensky <jim.twensky@gmail.com>wrote:
>>>>>
>>>>>> Hello, I have a 20-node Hadoop cluster where each node has 8GB of
>>>>>> memory and an 8-core processor. I sometimes get the following error
>>>>>> on a random basis:
>>>>>>
>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>
>>>>>> Exception in thread "main" java.io.IOException: Exception reading
>>>>>> file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
>>>>>> 	at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>>>>>> 	at org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>>>>> 	at org.apache.hadoop.mapred.Child.main(Child.java:92)
>>>>>> Caused by: java.io.IOException: failure to login
>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>>>>>> 	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>>>>> 	at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>>>>> 	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>>>>> 	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>>>>> 	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>>>>> 	at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>>>>> 	... 2 more
>>>>>> Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException:
>>>>>> invalid null input: name
>>>>>> 	at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
>>>>>> 	at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
>>>>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>
>>>>>> ......
>>>>>>
>>>>>>
>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>
>>>>>> This does not always happen, but I see a pattern: when the
>>>>>> intermediate data is larger, it tends to occur more frequently. In
>>>>>> the web log, I can see the following:
>>>>>>
>>>>>> java.lang.Throwable: Child Error
>>>>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>>>>> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>>>>>> 	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>>>>>
>>>>>> From what I read online, a possible cause is not having enough memory
>>>>>> for all the JVMs. My mapred-site.xml is set up to allocate 1100MB for
>>>>>> each child, and the maximum numbers of map and reduce tasks are each
>>>>>> set to 3 - so 6600MB for the child JVMs, plus (500MB * 2) for the
>>>>>> data node and task tracker (as I set HADOOP_HEAPSIZE to 500MB). I
>>>>>> feel like memory is not the cause, but I haven't been able to avoid
>>>>>> the error so far.
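>>>>>>
>>>>>> (Tallying heap alone under those settings - and note that -Xmx bounds
>>>>>> only the Java heap, so each process's real footprint is somewhat
>>>>>> larger:
>>>>>>
>>>>>>   6 child JVMs x 1100 MB                = 6600 MB
>>>>>>   datanode + tasktracker at 500 MB each = 1000 MB
>>>>>>   total heap                            = 7600 MB on an 8 GB node,
>>>>>>
>>>>>> which leaves little headroom for the OS.)
>>>>>>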
>>>>>> In case it helps, here are the relevant sections of my mapred-site.xml:
>>>>>>
>>>>>>
>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>
>>>>>>     <property>
>>>>>>       <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>>>       <value>3</value>
>>>>>>     </property>
>>>>>>
>>>>>>     <property>
>>>>>>       <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>>>       <value>3</value>
>>>>>>     </property>
>>>>>>
>>>>>>     <property>
>>>>>>       <name>mapred.child.java.opts</name>
>>>>>>       <value>-Xmx1100M -ea -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/soner</value>
>>>>>>     </property>
>>>>>>
>>>>>>     <property>
>>>>>>       <name>mapred.reduce.parallel.copies</name>
>>>>>>       <value>5</value>
>>>>>>     </property>
>>>>>>
>>>>>>     <property>
>>>>>>       <name>tasktracker.http.threads</name>
>>>>>>       <value>80</value>
>>>>>>     </property>
>>>>>>
>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>
>>>>>> My jobs still complete most of the time, though they occasionally
>>>>>> fail, and I'm really puzzled at this point. I'd appreciate any help
>>>>>> or ideas.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
