flink-user mailing list archives

From Ufuk Celebi <...@apache.org>
Subject Re: OutofMemoryError: Java heap space & Loss of Taskmanager
Date Tue, 15 Mar 2016 09:45:03 GMT
What do you mean by iteration in this context? Are you repeatedly
running the same WordCount program for streaming and batch
respectively?

– Ufuk

On Tue, Mar 15, 2016 at 10:22 AM, Till Rohrmann <trohrmann@apache.org> wrote:
> Hi Ravinder,
>
> could you tell us what's written in the taskmanager log of the failing
> taskmanager? It should contain some indication of why the taskmanager
> stopped working.
>
> Moreover, given that you have 64 GB of main memory, you could easily give
> 50GB as heap memory to each taskmanager.
>
> Cheers,
> Till
>
> On Tue, Mar 15, 2016 at 9:48 AM, Ravinder Kaur <neetu0404@gmail.com> wrote:
>>
>> Hello All,
>>
>> I'm running a simple word count example using the quickstart package from
>> Flink (0.10.1) on an input dataset of 500MB. The dataset consists of
>> randomly generated words of length 8.
>>
>> Cluster Configuration:
>>
>> Number of machines: 7
>> Total cores : 25
>> Memory on each: 64GB
>>
>> I'm interested in comparing the performance of Batch and Stream modes,
>> so I'm running the WordCount example for a number of iterations (max 10)
>> on datasets ranging in size from 100MB to 50GB, consisting of random words
>> of length 4 and 8.
>>
>> In Batch mode all iterations ran fine, but in Streaming mode I'm now
>> stuck at this:
>>
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>         at java.util.HashMap.resize(HashMap.java:580)
>>         at java.util.HashMap.addEntry(HashMap.java:879)
>>         at java.util.HashMap.put(HashMap.java:505)
>>         at
>> org.apache.flink.runtime.state.AbstractHeapKvState.update(AbstractHeapKvState.java:98)
>>         at
>> org.apache.flink.streaming.api.operators.StreamGroupedReduce.processElement(StreamGroupedReduce.java:59)
>>         at
>> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:166)
>>         at
>> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
>>         at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:218)
>>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
>>         at java.lang.Thread.run(Thread.java:745)
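>>
One likely cause (an assumption on my part, not confirmed from the logs): a keyed streaming reduce such as StreamGroupedReduce, backed by AbstractHeapKvState, keeps one state entry per distinct key in an in-heap map, so heap usage grows with the number of distinct words rather than with the input size. A minimal plain-Java sketch of that growth pattern (hypothetical illustration, not Flink API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not Flink code: a keyed reduce keeps one
// state entry per distinct key in an in-heap map, so heap usage grows
// with the number of distinct words in the input, not with its size.
public class KeyedStateGrowth {
    public static void main(String[] args) {
        Map<String, Long> state = new HashMap<>();
        String[] stream = {"aaaa", "bbbb", "aaaa", "cccc", "bbbb"};
        for (String word : stream) {
            // merge the new element into the per-key running count
            state.merge(word, 1L, Long::sum);
        }
        // One entry per distinct key stays on the heap for the job's lifetime.
        System.out.println(state.size()); // prints 3
    }
}
```

With 500MB of random 8-character words the distinct-key space is large, which would explain the HashMap.resize frames at the top of the trace.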
>>
>> I investigated and found two suggested fixes: (1) increasing
>> taskmanager.heap.mb and (2) reducing taskmanager.memory.fraction.
>>
>> Therefore I set taskmanager.heap.mb: 1024 and taskmanager.memory.fraction:
>> 0.5 (default 0.7)
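>>
For reference, those two settings go in conf/flink-conf.yaml; a sketch with the values just mentioned (the keys are the ones named above, the comments are my reading of them):

```yaml
# conf/flink-conf.yaml -- illustrative values, adjust to the machine
taskmanager.heap.mb: 1024          # total JVM heap per taskmanager, in MB
taskmanager.memory.fraction: 0.5   # share of heap given to Flink's managed memory
```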
>>
>> When I ran the example with this setting, I lose taskmanagers one by one
>> during job execution, with the following cause:
>>
>> Caused by: java.lang.Exception: The slot in which the task was executed
>> has been released. Probably loss of TaskManager
>> 831a72dad6fbb533b193820f45bdc5bc @ vm-10-155-208-138 - 4 slots - URL:
>> akka.tcp://flink@10.155.208.138:42222/user/taskmanager
>>         at
>> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153)
>>         at
>> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>>         at
>> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>>         at
>> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
>>         at
>> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
>>         at
>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)
>>         at
>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>         at
>> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>>         at
>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>         at
>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>         at
>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>         at
>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>         at
>> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>         at
>> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>         at
>> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>         at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>>         at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>>         at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>         at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>         at
>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>         at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>         ... 2 more
>>
>>
>> The results generated at each taskmanager look fine, and the logs don't
>> show any cause for the job to be cancelled.
>>
>>
>> Could anyone kindly guide me here?
>>
>> Kind Regards,
>> Ravinder Kaur.
>
>
