flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kien Truong <duckientru...@gmail.com>
Subject Re: Task Manager detached under load
Date Sat, 20 Jan 2018 11:57:02 GMT
Hi,

You should enable and check your garbage collection log.

We've encountered case where Task Manager disassociated due to long GC 
pause.


Regards,

Kien

On 1/20/2018 1:27 AM, ashish pok wrote:
> Hi All,
>
> We have hit some load related issues and was wondering if any one has 
> some suggestions. We are noticing task managers and job managers being 
> detached from each other under load and never really sync up again. As 
> a result, Flink session shows 0 slots available for processing. Even 
> though, apps are configured to restart it isn't really helping as 
> there are no slots available to run the apps.
>
>
> Here are excerpt from logs that seemed relevant. (I am trimming out 
> rest of the logs for brevity)
>
> *Job Manager:*
> 2018-01-19 12:38:00,423 INFO 
> org.apache.flink.runtime.jobmanager.JobManager   -  Starting 
> JobManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
>
> 2018-01-19 12:38:00,792 INFO 
> org.apache.flink.runtime.jobmanager.JobManager   -  Maximum heap size: 
> 16384 MiBytes
> 2018-01-19 12:38:00,794 INFO 
> org.apache.flink.runtime.jobmanager.JobManager     -  Hadoop version: 
> 2.6.5
> 2018-01-19 12:38:00,794 INFO 
> org.apache.flink.runtime.jobmanager.JobManager     -  JVM Options:
> 2018-01-19 12:38:00,794 INFO 
> org.apache.flink.runtime.jobmanager.JobManager     -     -Xms16384m
> 2018-01-19 12:38:00,794 INFO 
> org.apache.flink.runtime.jobmanager.JobManager     -     -Xmx16384m
> 2018-01-19 12:38:00,795 INFO 
> org.apache.flink.runtime.jobmanager.JobManager     -     -XX:+UseG1GC
>
> 2018-01-19 12:38:00,908 INFO 
> org.apache.flink.configuration.GlobalConfiguration     - Loading 
> configuration property: jobmanager.rpc.port, 6123
> 2018-01-19 12:38:00,908 INFO 
> org.apache.flink.configuration.GlobalConfiguration     - Loading 
> configuration property: jobmanager.heap.mb, 16384
>
>
> 2018-01-19 12:53:34,671 WARN  akka.remote.RemoteWatcher               
>                      - Detected unreachable: 
> [akka.tcp://flink@<jm-host>:37840]
> 2018-01-19 12:53:34,676 INFO 
> org.apache.flink.runtime.jobmanager.JobManager   - Task manager 
> akka.tcp://flink@<jm-host>:37840/user/taskmanager terminated.
>
> -- So once Flink session boots up, we are hitting it with pretty heavy 
> load, which typically results in the WARN above
>
> *Task Manager:*
> 2018-01-19 12:38:01,002 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager -  Starting 
> TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
> 2018-01-19 12:38:01,367 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager   -  Hadoop version: 
> 2.6.5
> 2018-01-19 12:38:01,367 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager   -  JVM Options:
> 2018-01-19 12:38:01,367 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager   -     -Xms16384M
> 2018-01-19 12:38:01,367 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager   -     -Xmx16384M
> 2018-01-19 12:38:01,367 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager   -    
>  -XX:MaxDirectMemorySize=8388607T
> 2018-01-19 12:38:01,367 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager   -     -XX:+UseG1GC
>
> 2018-01-19 12:38:01,392 INFO 
> org.apache.flink.configuration.GlobalConfiguration     - Loading 
> configuration property: jobmanager.rpc.port, 6123
> 2018-01-19 12:38:01,392 INFO 
> org.apache.flink.configuration.GlobalConfiguration     - Loading 
> configuration property: jobmanager.heap.mb, 16384
>
>
> 2018-01-19 12:54:48,626 WARN  akka.remote.RemoteWatcher               
>                      - Detected unreachable: 
> [akka.tcp://flink@<jm-host>:6123]
> 2018-01-19 12:54:48,690 INFO  akka.remote.Remoting                     
>               - Quarantined address [akka.tcp://flink@<jm-host>:6123] 
> is still unreachable or has not been restarted. Keeping it quarantined.
> 018-01-19 12:54:48,774 WARN  akka.remote.Remoting                     
>                 - Tried to associate with unreachable remote address 
> [akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms, 
> all messages to this address will be delivered to dead letters. 
> Reason: [The remote system has a UID that has been quarantined. 
> Association aborted.]
> 2018-01-19 12:54:48,833 WARN  akka.remote.Remoting                     
>                 - Tried to associate with unreachable remote address 
> [akka.tcp://flink@<tm-host>:6123]. Address is now gated for 5000 ms, 
> all messages to this address will be delivered to dead letters. 
> Reason: [The remote system has quarantined this system. No further 
> associations to the remote system are possible until this system is 
> restarted.]
> <bunch of ERRORs on operations not shutdown properly - assuming 
> because JM is unreachable>
>
> 2018-01-19 12:56:51,244 INFO 
> org.apache.flink.runtime.taskmanager.TaskManager       - Trying to 
> register at JobManager akka.tcp://flink@<jm-host>:6123/user/jobmanager 
> (attempt 10, timeout: 30000 milliseconds)
> 2018-01-19 12:56:51,253 WARN  akka.remote.Remoting                     
>                   - Tried to associate with unreachable remote address 
> [akka.tcp://flink@<jm-host>:6123]. Address is now gated for 5000 ms, 
> all messages to this address will be delivered to dead letters. 
> Reason: [The remote system has quarantined this system. No further 
> associations to the remote system are possible until this system is 
> restarted.]
>
> So bottom line is, JM and TM couldn't communicate under load, which is 
> obviously not good. I tried to bump up akka.tcp.timeout as well but it 
> didnt help either. So my question here is after all processing is 
> halted and there is no new data being picked up, shouldn't this 
> environment self-heal? Any other things I can be looking at other than 
> extending timeouts?
>
> Thanks,
>
> Ashish
>
>
>
>
>
>

Mime
View raw message