flink-user mailing list archives

From Ufuk Celebi <...@apache.org>
Subject Re: loss of TaskManager
Date Thu, 25 Feb 2016 11:35:06 GMT
Hey Chris!

I think that giving the full amount of memory to Flink leads to the TM
process being killed by the OS. Can you check the OS logs to see whether
the OOM killer shut it down? You should be able to see this in the
system logs.
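For example, on most Linux distributions the kernel log records OOM-killer activity. A quick way to check is to grep `dmesg` (or `/var/log/kern.log` / `/var/log/messages`) for it; the sample lines below stand in for real kernel-log content, and the hostname and PID are made up for illustration:

```shell
# Sample kernel-log lines as the OOM killer would produce them
# (on a live node, inspect `dmesg` output instead of this sample).
sample_log='Feb 25 09:38:10 node17 kernel: Out of memory: Kill process 12345 (java) score 987 or sacrifice child
Feb 25 09:38:10 node17 kernel: Killed process 12345 (java) total-vm:29360128kB, anon-rss:27000000kB'

# The same grep works directly against the live log:
#   dmesg | grep -i 'killed process'
echo "$sample_log" | grep -i 'killed process'
```

If a `Killed process … (java)` line appears around the time the TaskManager vanished, the OS (not Flink) terminated the JVM.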

– Ufuk
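A common mitigation is to leave headroom between the TaskManager heap and the machine's physical memory, so the OS, JVM overhead, and network buffers are not squeezed out. Sketched as a hypothetical flink-conf.yaml fragment for a 28 GB node (the numbers are illustrative, not a tested recommendation):

```yaml
# Give the TaskManager ~22 GB instead of the full 28 GB, leaving
# several GB for the OS, JVM metaspace/threads, and network buffers:
taskmanager.heap.mb: 22528

# Fraction of the heap reserved for Flink's managed memory; lowering it
# leaves more free heap for user code and serialization buffers:
taskmanager.memory.fraction: 0.5
```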

On Thu, Feb 25, 2016 at 11:24 AM, Boden, Christoph
<christoph.boden@tu-berlin.de> wrote:
> Dear Flink Community,
> I am trying to fit a support vector machine classifier using the CoCoA implementation
> provided in flink/ml/classification/ on a data set of moderate size (400k data points,
> 2000 features, approx. 12GB) on a cluster of 25 nodes with 28 GB memory each - and each
> worker node is awarded the full 28GB in taskmanager.heap.mb.
> With the standard configuration I constantly run into different versions of JVM heap
> space OutOfMemory errors, e.g.:
> com.esotericsoftware.kryo.KryoException: java.io.IOException: Failed to serialize
> element. Serialized size (> 276647402 bytes) exceeds JVM heap space - Serialization
> trace: data (org.apache.flink.ml.math.SparseVector) ...
> As changing the DOP did not alter anything, I significantly reduced
> taskmanager.memory.fraction. With this I now (reproducibly) run into the following
> problem.
> After running for a while, the job fails with the following error:
> java.lang.Exception: The slot in which the task was executed has been released. Probably
> loss of TaskManager @ host slots - URL: akka.tcp://flink@url2/user/taskmanager
>         at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153)
>         at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>         at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>         at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>         at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>         at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
>         at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>         at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
>         at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
>         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>         at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>         at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>         at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>         at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>         at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> However, the taskmanager in question does not show any error or exception in its log.
> The last log entry is:
> 2016-02-25 09:38:12,543 INFO  org.apache.flink.runtime.iterative.task.IterationIntermediatePactTask
>  - finishing iteration [2]: Combine (Reduce at org.apache.flink.ml.classification.SVM$$anon$25$$anonfun$6.apply(SVM.scala:392))
> I am somewhat puzzled as to what could be the cause of this. Any help, or pointers to
> appropriate documentation, would be greatly appreciated.
> I'll try increasing the heartbeat intervals next, but would still like to understand
> what goes wrong here.
> Best regards,
> Christoph
