flink-user mailing list archives

From Ravinder Kaur <neetu0...@gmail.com>
Subject Re: OutofMemoryError: Java heap space & Loss of Taskmanager
Date Tue, 15 Mar 2016 09:54:11 GMT
Hello Ufuk,

Yes, the same WordCount program is being run.

Kind Regards,
Ravinder Kaur

On Tue, Mar 15, 2016 at 10:45 AM, Ufuk Celebi <uce@apache.org> wrote:

> What do you mean by iteration in this context? Are you repeatedly
> running the same WordCount program for streaming and batch
> respectively?
>
> – Ufuk
>
> On Tue, Mar 15, 2016 at 10:22 AM, Till Rohrmann <trohrmann@apache.org>
> wrote:
> > Hi Ravinder,
> >
> > could you tell us what's written in the taskmanager log of the failing
> > taskmanager? There should be some kind of failure message explaining why
> > the taskmanager stopped working.
> >
> > Moreover, given that you have 64 GB of main memory, you could easily give
> > 50GB as heap memory to each taskmanager.
> >
> > Cheers,
> > Till
> >
> > On Tue, Mar 15, 2016 at 9:48 AM, Ravinder Kaur <neetu0404@gmail.com> wrote:
> >>
> >> Hello All,
> >>
> >> I'm running a simple word count example using the quickstart package
> >> from Flink (0.10.1), on an input dataset of 500MB. This dataset is a set
> >> of randomly generated words of length 8.
> >>
> >> Cluster Configuration:
> >>
> >> Number of machines: 7
> >> Total cores : 25
> >> Memory on each: 64GB
> >>
> >> I'm interested in comparing the performance of Batch and Stream modes,
> >> so I'm running the WordCount example for a number of iterations (max 10)
> >> on datasets of sizes ranging between 100MB and 50GB consisting of random
> >> words of length 4 and 8.
> >>
> >> When I ran the experiments in Batch mode all iterations ran fine, but
> >> now I'm stuck in Streaming mode at this:
> >>
> >> Caused by: java.lang.OutOfMemoryError: Java heap space
> >>         at java.util.HashMap.resize(HashMap.java:580)
> >>         at java.util.HashMap.addEntry(HashMap.java:879)
> >>         at java.util.HashMap.put(HashMap.java:505)
> >>         at org.apache.flink.runtime.state.AbstractHeapKvState.update(AbstractHeapKvState.java:98)
> >>         at org.apache.flink.streaming.api.operators.StreamGroupedReduce.processElement(StreamGroupedReduce.java:59)
> >>         at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:166)
> >>         at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
> >>         at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:218)
> >>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
> >>         at java.lang.Thread.run(Thread.java:745)
> >>
> >> I investigated and found 2 possible solutions: (1) increasing
> >> taskmanager.heap.mb and (2) reducing taskmanager.memory.fraction.
> >>
> >> Therefore I set taskmanager.heap.mb: 1024 and
> >> taskmanager.memory.fraction: 0.5 (default 0.7).
> >>
> >> When I ran the example with this setting I lose taskmanagers one by one
> >> during the job execution, with the following cause:
> >>
> >> Caused by: java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 831a72dad6fbb533b193820f45bdc5bc @ vm-10-155-208-138 - 4 slots - URL: akka.tcp://flink@10.155.208.138:42222/user/taskmanager
> >>         at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153)
> >>         at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
> >>         at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
> >>         at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
> >>         at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
> >>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)
> >>         at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> >>         at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
> >>         at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> >>         at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
> >>         at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
> >>         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> >>         at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
> >>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> >>         at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
> >>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> >>         at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
> >>         at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
> >>         at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
> >>         at akka.actor.ActorCell.invoke(ActorCell.scala:486)
> >>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> >>         at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> >>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> >>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> >>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> >>         ... 2 more
> >>
> >>
> >> When I look at the results generated at each taskmanager, they are
> >> fine. The logs also don't show any cause for the job to get cancelled.
> >>
> >>
> >> Could anyone kindly guide me here?
> >>
> >> Kind Regards,
> >> Ravinder Kaur.
> >
> >
>
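
The OutOfMemoryError quoted above is thrown while AbstractHeapKvState puts an entry into a HashMap, which suggests the streaming reduce keeps one heap entry per distinct word. The following is only a back-of-the-envelope sketch of how many distinct keys the 500MB dataset might contain. It assumes a 26-letter alphabet, one separator byte per word, and uniformly random words; none of these are stated in the thread, and the class and method names are hypothetical.

```java
public class KeyedStateEstimate {

    // Number of words in a file of fileBytes bytes, assuming each word is
    // wordLength single-byte characters followed by one separator byte.
    static long wordCount(long fileBytes, int wordLength) {
        return fileBytes / (wordLength + 1);
    }

    // Number of possible distinct words over an alphabet of alphabetSize symbols.
    static double keySpace(int alphabetSize, int wordLength) {
        return Math.pow(alphabetSize, wordLength);
    }

    // Expected number of distinct keys after drawing n words uniformly at random
    // from a key space of size k: k * (1 - (1 - 1/k)^n).
    // expm1/log1p keep the result accurate when 1/k is tiny.
    static double expectedDistinct(double k, long n) {
        return k * -Math.expm1(n * Math.log1p(-1.0 / k));
    }

    public static void main(String[] args) {
        long fileBytes = 500L * 1024 * 1024; // the 500MB input from the thread
        int alphabet = 26;                   // assumption: lowercase a-z only

        for (int len : new int[] {4, 8}) {
            long n = wordCount(fileBytes, len);
            double k = keySpace(alphabet, len);
            System.out.printf(
                "length %d: %d words, key space %.0f, expected distinct keys %.0f%n",
                len, n, k, expectedDistinct(k, n));
        }
    }
}
```

Under these assumptions the 8-letter dataset yields roughly 58 million distinct keys, against at most 456,976 for 4-letter words, so nearly every 8-letter word would add a new entry to the keyed state. How much heap each entry costs depends on the JVM and the state representation, so the sketch deliberately stops at key counts.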
