flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ravinder Kaur <neetu0...@gmail.com>
Subject Re: OutofMemoryError: Java heap space & Loss of Taskmanager
Date Tue, 15 Mar 2016 10:17:47 GMT
Hi Till,

Log of JobManager

09:55:31,574 WARN  org.apache.hadoop.util.NativeCodeLoader
      - Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
09:55:31,742 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -
--------------------------------------------------------------------------------
09:55:31,742 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -  Starting JobManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015
@ 12:41:12 CET)
09:55:31,742 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -  Current user: flink
09:55:31,742 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.95-b01
09:55:31,743 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -  Maximum heap size: 246 MiBytes
09:55:31,743 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
09:55:31,745 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -  Hadoop version: 2.7.0
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -  JVM Options:
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -     -Xms256m
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -     -Xmx256m
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -     -XX:MaxPermSize=256m
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -
-Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-vm-10-155-208-156.cloud.mwn.de.log
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -
-Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -
-Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -  Program Arguments:
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -     --configDir
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -     /home/flink/flink-0.10.1/conf
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -     --executionMode
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -     cluster
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -     --streamingMode
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -     streaming
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -  Classpath:
/home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:::
09:55:31,746 INFO  org.apache.flink.runtime.jobmanager.JobManager
     -
--------------------------------------------------------------------------------
09:55:31,924 INFO  org.apache.flink.runtime.jobmanager.JobManager
     - Loading configuration from /home/flink/flink-0.10.1/conf
09:55:31,941 INFO  org.apache.flink.runtime.jobmanager.JobManager
     - Staring JobManager without high-availability
09:55:31,950 INFO  org.apache.flink.runtime.jobmanager.JobManager
     - Starting JobManager on 10.155.208.156:6123 with execution mode
CLUSTER and streaming mode STREAMING
09:55:32,039 INFO  org.apache.flink.runtime.jobmanager.JobManager
     - Security is not enabled. Starting non-authenticated JobManager.
09:55:32,039 INFO  org.apache.flink.runtime.jobmanager.JobManager
     - Starting JobManager
09:55:32,040 INFO  org.apache.flink.runtime.jobmanager.JobManager
     - Starting JobManager actor system at 10.155.208.156:6123
09:55:32,483 INFO  akka.event.slf4j.Slf4jLogger
     - Slf4jLogger started
09:55:32,564 INFO  Remoting
     - Starting remoting
09:55:32,730 INFO  Remoting
     - Remoting started; listening on addresses :[akka.tcp://
flink@10.155.208.156:6123]
09:55:32,731 INFO  org.apache.flink.runtime.jobmanager.JobManager
     - Starting JobManger web frontend
09:55:32,761 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
      - Using directory /tmp/flink-web-6cd96e7e-62be-4301-9376-c98528bd58b8
for the web interface files
09:55:32,762 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
      - Serving job manager log from
/home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-vm-10-155-208-156.cloud.mwn.de.log
09:55:32,762 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
      - Serving job manager stdout from
/home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-vm-10-155-208-156.cloud.mwn.de.out
09:55:33,040 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
      - Web frontend listening at 0:0:0:0:0:0:0:0:8081
09:55:33,041 INFO  org.apache.flink.runtime.jobmanager.JobManager
     - Starting JobManager actor
09:55:33,046 INFO  org.apache.flink.runtime.blob.BlobServer
     - Created BLOB server storage directory
/tmp/blobStore-28cbb318-efc7-4a8b-85a0-1ea6539f1ba8
09:55:33,047 INFO  org.apache.flink.runtime.blob.BlobServer
     - Started BLOB server at 0.0.0.0:59504 - max concurrent requests: 50 -
max backlog: 1000
09:55:33,057 INFO  org.apache.flink.runtime.jobmanager.JobManager
     - Starting JobManager at akka.tcp://
flink@10.155.208.156:6123/user/jobmanager.
09:55:33,060 INFO  org.apache.flink.runtime.jobmanager.JobManager
     - JobManager akka.tcp://flink@10.155.208.156:6123/user/jobmanager was
granted leadership with leader session ID None.
09:55:33,063 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist
      - Started memory archivist akka://flink/user/archive
09:55:33,064 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
      - Starting with JobManager akka.tcp://
flink@10.155.208.156:6123/user/jobmanager on port 8081
09:55:33,064 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever
      - New leader reachable under akka.tcp://
flink@10.155.208.156:6123/user/jobmanager:null.
09:55:34,013 INFO  org.apache.flink.runtime.instance.InstanceManager
      - Registered TaskManager at vm-10-155-208-156 (akka.tcp://
flink@10.155.208.156:33728/user/taskmanager) as
9b665ef16d88314bf37f816b5afdfe79. Current number of registered hosts is 1.
Current number of alive task slots is 3.
09:55:34,735 INFO  org.apache.flink.runtime.instance.InstanceManager
      - Registered TaskManager at vm-10-155-208-157 (akka.tcp://
flink@10.155.208.157:33728/user/taskmanager) as
fc8b661bb0ad5568855c8c2d6a0029f2. Current number of registered hosts is 2.
Current number of alive task slots is 6.
09:55:35,394 INFO  org.apache.flink.runtime.instance.InstanceManager
      - Registered TaskManager at vm-10-155-208-158 (akka.tcp://
flink@10.155.208.158:33728/user/taskmanager) as
c12e0ca481aaa4bdd6fa6be4cfc8335e. Current number of registered hosts is 3.
Current number of alive task slots is 9.
09:55:36,105 INFO  org.apache.flink.runtime.instance.InstanceManager
      - Registered TaskManager at slave3 (akka.tcp://
flink@10.155.208.135:37219/user/taskmanager) as
6e023a6417743d1e5d67410dc76824b8. Current number of registered hosts is 4.
Current number of alive task slots is 13.
09:55:36,542 INFO  org.apache.flink.runtime.instance.InstanceManager
      - Registered TaskManager at vm-10-155-208-137 (akka.tcp://
flink@10.155.208.137:35052/user/taskmanager) as
c0098ff87ca76c958ca18a4e7d30e24e. Current number of registered hosts is 5.
Current number of alive task slots is 17.
09:55:37,422 INFO  org.apache.flink.runtime.instance.InstanceManager
      - Registered TaskManager at slave2 (akka.tcp://
flink@10.155.208.136:50432/user/taskmanager) as
d62208a25ce877757aeb72fe4b6530fc. Current number of registered hosts is 6.
Current number of alive task slots is 21.
09:55:38,351 INFO  org.apache.flink.runtime.instance.InstanceManager
      - Registered TaskManager at vm-10-155-208-138 (akka.tcp://
flink@10.155.208.138:49653/user/taskmanager) as
b0c2566fcc0e2baf7e2605e96e5a6b9c. Current number of registered hosts is 7.
Current number of alive task slots is 25.
09:56:45,448 WARN  akka.remote.ReliableDeliverySupervisor
     - Association with remote system [akka.tcp://flink@10.155.208.156:33728]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
09:56:46,013 WARN  akka.remote.ReliableDeliverySupervisor
     - Association with remote system [akka.tcp://flink@10.155.208.157:33728]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
09:56:46,630 WARN  akka.remote.ReliableDeliverySupervisor
     - Association with remote system [akka.tcp://flink@10.155.208.158:33728]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
09:56:47,259 WARN  akka.remote.ReliableDeliverySupervisor
     - Association with remote system [akka.tcp://flink@10.155.208.135:37219]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
09:56:47,788 WARN  akka.remote.ReliableDeliverySupervisor
     - Association with remote system [akka.tcp://flink@10.155.208.137:35052]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
09:56:48,333 WARN  akka.remote.ReliableDeliverySupervisor
     - Association with remote system [akka.tcp://flink@10.155.208.136:50432]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
09:56:48,629 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
      - Removing web root dir
/tmp/flink-web-6cd96e7e-62be-4301-9376-c98528bd58b8
09:56:48,635 INFO  org.apache.flink.runtime.blob.BlobServer
     - Stopped BLOB server at 0.0.0.0:59504

and One of the TaskManagers

09:55:36,694 WARN  org.apache.hadoop.util.NativeCodeLoader
      - Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
09:55:36,918 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -
--------------------------------------------------------------------------------
09:55:36,918 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015
@ 12:41:12 CET)
09:55:36,919 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -  Current user: flink
09:55:36,919 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
09:55:36,919 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -  Maximum heap size: 990 MiBytes
09:55:36,919 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -  Hadoop version: 2.7.0
09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -  JVM Options:
09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -     -XX:+UseConcMarkSweepGC
09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -     -XX:+CMSClassUnloadingEnabled
09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -     -Xms1024M
09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -     -Xmx1024M
09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -     -XX:MaxDirectMemorySize=8388607T
09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -     -XX:MaxPermSize=256m
09:55:36,922 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -
-Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-138.cloud.mwn.de.log
09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -
-Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -
-Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -  Program Arguments:
09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -     --configDir
09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -     /home/flink/flink-0.10.1/conf
09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -     --streamingMode
09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -     streaming
09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -  Classpath:
/home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
09:55:36,923 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     -
--------------------------------------------------------------------------------
09:55:36,928 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Maximum number of open file descriptors is 4096
09:55:36,955 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Loading configuration from /home/flink/flink-0.10.1/conf
09:55:37,040 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Security is not enabled. Starting non-authenticated TaskManager.
09:55:37,071 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils
     - Trying to select the network interface and address to use by
connecting to the leading JobManager.
09:55:37,072 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils
     - TaskManager will try to connect for 10000 milliseconds before
falling back to heuristics
09:55:37,075 INFO  org.apache.flink.runtime.net.ConnectionUtils
     - Retrieved new target address /10.155.208.156:6123.
09:55:37,084 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - TaskManager will use hostname/address 'vm-10-155-208-138.cloud.mwn.de'
(10.155.208.138) for communication.
09:55:37,085 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Starting TaskManager in streaming mode STREAMING
09:55:37,085 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Starting TaskManager actor system at 10.155.208.138:0
09:55:37,531 INFO  akka.event.slf4j.Slf4jLogger
     - Slf4jLogger started
09:55:37,587 INFO  Remoting
     - Starting remoting
09:55:37,774 INFO  Remoting
     - Remoting started; listening on addresses :[akka.tcp://
flink@10.155.208.138:49653]
09:55:37,782 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Starting TaskManager actor
09:55:37,798 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig
      - NettyConfig [server address:
vm-10-155-208-138.cloud.mwn.de/10.155.208.138, server port: 32798, memory
segment size (bytes): 32768, transport type: NIO, number of server threads:
0 (use Netty's default), number of client threads: 0 (use Netty's default),
server connect backlog: 0 (use Netty's default), client connect timeout
(sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
09:55:37,803 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Messages between TaskManager and JobManager have a max timeout of
100000 milliseconds
09:55:37,811 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Temporary file directory '/tmp': total 4 GB, usable 0 GB (0.00%
usable)
09:55:37,848 INFO
 org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated
64 MB for network buffer pool (number of memory segments: 2048, bytes per
segment: 32768).
09:55:37,955 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Using 0.5 of the currently free heap space for Flink managed heap
memory (455 MB).
09:55:37,978 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
     - I/O manager uses directory
/tmp/flink-io-3b11098e-a3ea-4a8a-8ea4-c3f1c5b13d6f for spill files.
09:55:37,986 INFO  org.apache.flink.runtime.filecache.FileCache
     - User file cache uses directory
/tmp/flink-dist-cache-516dd09a-1dfe-46eb-b50b-b6e24b6e9fad
09:55:38,146 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Starting TaskManager actor at akka://flink/user/taskmanager#56985599.
09:55:38,146 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - TaskManager data connection information:
vm-10-155-208-138.cloud.mwn.de (dataPort=32798)
09:55:38,147 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - TaskManager has 4 task slot(s).
09:55:38,148 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Memory usage stats: [HEAP: 100/990/990 MB, NON HEAP: 24/37/304 MB
(used/committed/max)]
09:55:38,151 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Trying to register at JobManager akka.tcp://
flink@10.155.208.156:6123/user/jobmanager (attempt 1, timeout: 500
milliseconds)
09:55:38,301 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Successful registration at JobManager (akka.tcp://
flink@10.155.208.156:6123/user/jobmanager), starting network stack and
library cache.
09:55:38,479 INFO  org.apache.flink.runtime.io.network.netty.NettyClient
      - Successful initialization (took 55 ms).
09:55:38,533 INFO  org.apache.flink.runtime.io.network.netty.NettyServer
      - Successful initialization (took 54 ms). Listening on SocketAddress /
10.155.208.138:32798.
09:55:38,534 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Determined BLOB server address to be /10.155.208.156:59504. Starting
BLOB cache.
09:55:38,536 INFO  org.apache.flink.runtime.blob.BlobCache
      - Created BLOB cache storage directory
/tmp/blobStore-8e88302d-3303-4c80-8613-f0be13911fb2
09:56:48,371 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
     - I/O manager removed spill file directory
/tmp/flink-io-3b11098e-a3ea-4a8a-8ea4-c3f1c5b13d6f

I also found that 4 of the deamons were not stopped after the cluster was
stopped, though the JM claimed it had stopped these TM deamons.

I have another question from you previous comment as to why is it necessary
to allocate 50 GB of memory to taskmanager.heap.mb? Is the value I provided
not sufficient for the job? If this is so, how was I able to run the
previous examples with input datasets as large as 50GB error-free? It would
help if you could provide some explanation here?

Thank you.

Kind Regards,
Ravinder Kaur




On Tue, Mar 15, 2016 at 11:01 AM, Till Rohrmann <trohrmann@apache.org>
wrote:

> Hi Ravinder,
>
> this should not be the relevant log extract. The log says that the TM is
> started on port 49653 and the JM log says that the TM on port 42222 is
> lost. Would you mind to share the complete JM and TM logs with us?
>
> Cheers,
> Till
>
> On Tue, Mar 15, 2016 at 10:54 AM, Ravinder Kaur <neetu0404@gmail.com>
> wrote:
>
>> Hello Ufuk,
>>
>> Yes, the same WordCount program is being run.
>>
>> Kind Regards,
>> Ravinder Kaur
>>
>> On Tue, Mar 15, 2016 at 10:45 AM, Ufuk Celebi <uce@apache.org> wrote:
>>
>>> What do you mean with iteration in this context? Are you repeatedly
>>> running the same WordCount program for streaming and batch
>>> respectively?
>>>
>>> – Ufuk
>>>
>>> On Tue, Mar 15, 2016 at 10:22 AM, Till Rohrmann <trohrmann@apache.org>
>>> wrote:
>>> > Hi Ravinder,
>>> >
>>> > could you tell us what's written in the taskmanager log of the failing
>>> > taskmanager? There should be some kind of failure why the taskmanager
>>> > stopped working.
>>> >
>>> > Moreover, given that you have 64 GB of main memory, you could easily
>>> give
>>> > 50GB as heap memory to each taskmanager.
>>> >
>>> > Cheers,
>>> > Till
>>> >
>>> > On Tue, Mar 15, 2016 at 9:48 AM, Ravinder Kaur <neetu0404@gmail.com>
>>> wrote:
>>> >>
>>> >> Hello All,
>>> >>
>>> >> I'm running a simple word count example using the quickstart package
>>> from
>>> >> the Flink(0.10.1), on an input dataset of 500MB. This dataset is a
>>> set of
>>> >> randomly generated words of length 8.
>>> >>
>>> >> Cluster Configuration:
>>> >>
>>> >> Number of machines: 7
>>> >> Total cores : 25
>>> >> Memory on each: 64GB
>>> >>
>>> >> I'm interested in the performance measure between Batch and Stream
>>> modes
>>> >> and so I'm running WordCount example with number of iteration (max
>>> 10) on
>>> >> datasets of sizes ranging between 100MB and 50GB consisting of random
>>> words
>>> >> of length 4 and 8.
>>> >>
>>> >> While I ran the experiments in Batch mode all iterations ran fine,
>>> but now
>>> >> I'm stuck in the Streaming mode at this
>>> >>
>>> >> Caused by: java.lang.OutOfMemoryError: Java heap space
>>> >>         at java.util.HashMap.resize(HashMap.java:580)
>>> >>         at java.util.HashMap.addEntry(HashMap.java:879)
>>> >>         at java.util.HashMap.put(HashMap.java:505)
>>> >>         at
>>> >>
>>> org.apache.flink.runtime.state.AbstractHeapKvState.update(AbstractHeapKvState.java:98)
>>> >>         at
>>> >>
>>> org.apache.flink.streaming.api.operators.StreamGroupedReduce.processElement(StreamGroupedReduce.java:59)
>>> >>         at
>>> >>
>>> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:166)
>>> >>         at
>>> >>
>>> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
>>> >>         at
>>> >>
>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:218)
>>> >>         at
>>> org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
>>> >>         at java.lang.Thread.run(Thread.java:745)
>>> >>
>>> >> I investigated found 2 solutions. (1) Increasing the
>>> taskmanager.heap.mb
>>> >> and (2) Reducing the taskmanager.memory.fraction
>>> >>
>>> >> Therefore I set taskmanager.heap.mb: 1024 and
>>> taskmanager.memory.fraction:
>>> >> 0.5 (default 0.7)
>>> >>
>>> >> When I ran the example with this setting I loose taskmanagers one by
>>> one
>>> >> during the job execution with the following cause
>>> >>
>>> >> Caused by: java.lang.Exception: The slot in which the task was
>>> executed
>>> >> has been released. Probably loss of TaskManager
>>> >> 831a72dad6fbb533b193820f45bdc5bc @ vm-10-155-208-138 - 4 slots - URL:
>>> >> akka.tcp://flink@10.155.208.138:42222/user/taskmanager
>>> >>         at
>>> >>
>>> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153)
>>> >>         at
>>> >>
>>> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
>>> >>         at
>>> >>
>>> org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
>>> >>         at
>>> >> org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
>>> >>         at
>>> >>
>>> org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
>>> >>         at
>>> >>
>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)
>>> >>         at
>>> >>
>>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>> >>         at
>>> >>
>>> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>>> >>         at
>>> >>
>>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>> >>         at
>>> >>
>>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>> >>         at
>>> >>
>>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>> >>         at
>>> >> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>> >>         at
>>> >>
>>> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>> >>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>> >>         at
>>> >>
>>> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
>>> >>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>> >>         at
>>> >>
>>> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>> >>         at
>>> akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>>> >>         at
>>> akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>>> >>         at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>>> >>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>> >>         at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>> >>         at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>> >>         at
>>> >> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> >>         at
>>> >>
>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> >>         ... 2 more
>>> >>
>>> >>
>>> >> While I look at the results generated at each taskmanager, they are
>>> fine.
>>> >> The logs also don't show any causes for the the job to get cancelled.
>>> >>
>>> >>
>>> >> Could anyone kindly guide me here?
>>> >>
>>> >> Kind Regards,
>>> >> Ravinder Kaur.
>>> >
>>> >
>>>
>>
>>
>

Mime
View raw message