hadoop-mapreduce-issues mailing list archives

From "Vinod Kumar Vavilapalli (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-3333) MR AM for sort-job going out of memory
Date Tue, 08 Nov 2011 10:26:51 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3333:
-----------------------------------------------

    Attachment: MAPREDUCE-3333-20111108.txt

Tracked this down finally, with lots of help from Karam.

What was happening was that, since MAPREDUCE-3256, we create one connection per container to
a NodeManager, and this per-container connection wasn't being closed after use. Soon, the number
of threads created by Hadoop RPC (per connection) reaches the ulimit on the node's number of
processes, and Java beautifully describes this as an out-of-memory error.
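
To illustrate the failure mode, here is a minimal sketch that does not use any Hadoop
classes. FakeConnection is a hypothetical stand-in for an RPC connection that owns one
reader thread; the class and method names are invented for this example. It compares
launching containers without closing the connection against closing it after each call.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class ConnectionLeakSketch {

    /** Hypothetical stand-in for an RPC connection that owns one reader thread. */
    static final class FakeConnection implements AutoCloseable {
        private final CountDownLatch stop = new CountDownLatch(1);
        private final Thread reader;

        FakeConnection(String containerId, AtomicInteger liveThreads) {
            liveThreads.incrementAndGet();
            reader = new Thread(() -> {
                try {
                    stop.await();              // parked until close() is called
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    liveThreads.decrementAndGet();
                }
            }, "reader-" + containerId);
            reader.setDaemon(true);            // leaked threads must not block JVM exit
            reader.start();
        }

        void startContainer() {
            // Placeholder for the real remote call.
        }

        @Override
        public void close() {
            stop.countDown();
            try {
                reader.join();                 // wait until the reader thread is gone
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Leaky pattern: one connection per container, never closed. */
    public static int launchLeaky(int containers) {
        AtomicInteger live = new AtomicInteger();
        for (int i = 0; i < containers; i++) {
            new FakeConnection("c" + i, live).startContainer();
        }
        return live.get();
    }

    /** Same work, but each connection is closed as soon as the call returns. */
    public static int launchAndClose(int containers) {
        AtomicInteger live = new AtomicInteger();
        for (int i = 0; i < containers; i++) {
            try (FakeConnection conn = new FakeConnection("c" + i, live)) {
                conn.startContainer();
            }
        }
        return live.get();
    }

    public static void main(String[] args) {
        System.out.println("leaky:  " + launchLeaky(50) + " threads still alive");
        System.out.println("closed: " + launchAndClose(50) + " threads still alive");
    }
}
```

In the leaky variant the live thread count grows with the number of containers, which is
how a per-process thread limit can eventually be hit; in the closing variant it stays at zero.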

I put in an "RPC.stopProxy(obj)" call a couple of days back, but that didn't work because
of the multiple layers of RPC in Yarn. It's time somebody cleaned up that mess.
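
As a rough illustration of how layering can defeat a generic stop call, consider the
sketch below. It is not the actual Yarn code and not a description of the attached patch;
RawProxy, ClientWrapper and genericStop are hypothetical names, and the assumption that
the helper only recognises the innermost layer is made purely for the example.

```java
public class LayeredProxySketch {

    /** Hypothetical low-level proxy that owns the real connection. */
    static final class RawProxy {
        private boolean open = true;
        boolean isOpen() { return open; }
        void close() { open = false; }
    }

    /** Hypothetical protocol-level wrapper around the raw proxy. */
    static final class ClientWrapper {
        private final RawProxy raw = new RawProxy();
        boolean connectionOpen() { return raw.isOpen(); }
        /** Forwards the shutdown to the layer that actually holds the connection. */
        void close() { raw.close(); }
    }

    /** Stand-in for a generic stop helper that only recognises the raw layer. */
    static void genericStop(Object proxy) {
        if (proxy instanceof RawProxy) {
            ((RawProxy) proxy).close();
        }
        // Any other object is ignored, so passing a wrapper closes nothing.
    }

    /** Hands the wrapper to the generic helper and reports whether the connection is still open. */
    public static boolean openAfterGenericStop() {
        ClientWrapper client = new ClientWrapper();
        genericStop(client);
        return client.connectionOpen();
    }

    /** Closes through the wrapper and reports whether the connection is still open. */
    public static boolean openAfterWrapperClose() {
        ClientWrapper client = new ClientWrapper();
        client.close();
        return client.connectionOpen();
    }

    public static void main(String[] args) {
        System.out.println("after genericStop(wrapper): open=" + openAfterGenericStop());
        System.out.println("after wrapper.close():      open=" + openAfterWrapperClose());
    }
}
```

The point of the sketch is only that a stop call has to reach the layer that holds the
connection; handing it an outer wrapper that it does not understand leaves the connection
(and its threads) alive.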

The attached patch should (finally) fix this. I cannot add any automated tests; I am testing
only on a big cluster, where this is consistently reproducible.

                
> MR AM for sort-job going out of memory
> --------------------------------------
>
>                 Key: MAPREDUCE-3333
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3333
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>         Attachments: MAPREDUCE-3333-20111102.txt, MAPREDUCE-3333-20111108.txt
>
>
> [~Karams] just found this. The usual sort job on a 350-node cluster hung due to OutOfMemory
> and eventually failed after an hour instead of the usual 20-odd minutes.
> {code}
> 2011-11-02 11:40:36,438 ERROR [ContainerLauncher #258] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1320233407485_0002_01_001434 : java.lang.reflect.UndeclaredThrowableException
>         at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.startContainer(ContainerManagerPBClientImpl.java:88)
>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:290)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> Caused by: com.google.protobuf.ServiceException: java.io.IOException: Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details : local host is: "gsbl91281.blue.ygrid.yahoo.com/98.137.101.189"; destination host is: ""gsbl91525.blue.ygrid.yahoo.com":45450;
>         at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:139)
>         at $Proxy20.startContainer(Unknown Source)
>         at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.startContainer(ContainerManagerPBClientImpl.java:81)
>         ... 4 more
> Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details : local host is: "gsbl91281.blue.ygrid.yahoo.com/98.137.101.189"; destination host is: ""gsbl91525.blue.ygrid.yahoo.com":45450;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:655)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1089)
>         at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:136)
>         ... 6 more
> Caused by: java.io.IOException: Couldn't set up IO streams
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:621)
>         at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:205)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1195)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1065)
>         ... 7 more
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:597)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:614)
>         ... 10 more
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
