reef-dev mailing list archives

From "Mariia Mykhailova (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1250) Memory leak in Evaluators
Date Mon, 02 May 2016 23:53:12 GMT

    [ https://issues.apache.org/jira/browse/REEF-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267759#comment-15267759 ]

Mariia Mykhailova commented on REEF-1250:
-----------------------------------------

We have an issue which might be caused by this. A long-running driver requests thousands of
evaluators but promptly returns most of them, since it needs only around 250 (which have to
be on distinct machines, so most of the evaluators granted are not suitable). The REEF AM
then crashes with {{java.lang.OutOfMemoryError}} after about an hour of running.

How can we verify that this is the actual root cause? And how do we want to fix this leak
(whether or not it is the cause)? Is it safe for us to forget about evaluators which have been returned?

{noformat}
Application failed due to:
Java heap space
With stack trace:
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:84)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:338)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1965)
at org.apache.reef.runtime.yarn.driver.UploaderToJobFolder.uploadToJobFolder(UploaderToJobfolder.java:73)
at org.apache.reef.runtime.yarn.driver.EvaluatorSetupHelper.getResources(EvaluatorSetupHelper.java:116)
at org.apache.reef.runtime.yarn.driver.YARNResourceLaunchHandler.onNext(YARNResourceLaunchHandler.java:83)
at org.apache.reef.runtime.yarn.driver.YARNResourceLaunchHandler.onNext(YARNResourceLaunchHandler.java:47)
at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceLaunch(EvaluatorManager.java:474)
at org.apache.reef.runtime.common.driver.evaluator.AllocatedEvaluatorImpl.resourceBuildAndLaunch(AllocatedEvaluatorImpl.java:251)
at org.apache.reef.runtime.common.driver.evaluator.AllocatedEvaluatorImpl.launchWithConfigurationString(AllocatedEvaluatorImpl.java:236)
at org.apache.reef.runtime.common.driver.evaluator.AllocatedEvaluatorImpl.submitContextAndTask(AllocatedEvaluatorImpl.java:170)
at org.apache.reef.javabridge.AllocatedEvaluatorBridge.submitContextAndTaskString(AllocatedEvaluatorBridge.java:71)
at org.apache.reef.javabridge.NativeInterop.clrSystemAllocatedEvaluatorHandlerOnNext(Native Method)
at org.apache.reef.javabridge.generic.JobDriver.submitEvaluator(JobDriver.java:238)
at org.apache.reef.javabridge.generic.JobDriver.access$600(JobDriver.java:66)
at org.apache.reef.javabridge.generic.JobDriver$AllocatedEvaluatorHandler.onNext(JobDriver.java:347)
at org.apache.reef.javabridge.generic.JobDriver$AllocatedEvaluatorHandler.onNext(JobDriver.java:341)
at org.apache.reef.runtime.common.utils.BroadCastEventHandler.onNext(BroadCastEventHandler.java:40)
at org.apache.reef.util.ExceptionHandlingEventHandler.onNext(ExceptionHandlingEventHandler.java:46)
at org.apache.reef.runtime.common.utils.DispatchingEStage$1.onNext(DispatchingEStage.java:68)
at org.apache.reef.runtime.common.utils.DispatchingEStage$1.onNext(DispatchingEStage.java:65)
at org.apache.reef.wake.impl.ThreadPoolStage$1.run(ThreadPoolStage.java:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745) 
{noformat}

> Memory leak in Evaluators
> -------------------------
>
>                 Key: REEF-1250
>                 URL: https://issues.apache.org/jira/browse/REEF-1250
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF Driver
>            Reporter: Markus Weimer
>            Priority: Minor
>
> In {{Evaluators}}, we keep track of all the Evaluators that ever existed, including the
> ones that have failed or been returned. For very long running Drivers, this is a memory leak.
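
For illustration only, here is a minimal sketch of what "forgetting" returned evaluators could look like. It is not REEF's actual {{Evaluators}} class: the name `EvaluatorRegistry`, its methods, and the use of a plain `Object` as the tracked value are all hypothetical. It only shows the general idea that a registry which removes entries when an evaluator is closed or fails stays bounded by the number of live evaluators rather than by the number ever allocated.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry of live evaluators; NOT REEF's actual Evaluators class. */
final class EvaluatorRegistry {

  // Only evaluators that are currently alive are kept, keyed by evaluator id.
  private final Map<String, Object> live = new ConcurrentHashMap<>();

  /** Start tracking a newly allocated evaluator. */
  void onAllocated(final String id, final Object manager) {
    if (live.putIfAbsent(id, manager) != null) {
      throw new IllegalStateException("Evaluator already tracked: " + id);
    }
  }

  /** Stop tracking an evaluator that was returned or has failed. */
  void onClosed(final String id) {
    live.remove(id);
  }

  /** Number of evaluators currently tracked. */
  int size() {
    return live.size();
  }

  public static void main(final String[] args) {
    final EvaluatorRegistry registry = new EvaluatorRegistry();
    // Allocate many evaluators but keep only the first 250, as in the scenario above.
    for (int i = 0; i < 10000; i++) {
      final String id = "evaluator-" + i;
      registry.onAllocated(id, new Object());
      if (i >= 250) {
        registry.onClosed(id);
      }
    }
    System.out.println(registry.size()); // prints 250
  }
}
```

Whether dropping an entry is actually safe depends on whether messages for that evaluator id can still arrive after it has been returned, which this sketch does not address.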



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
