flink-user mailing list archives

From Li Peng <li.p...@doordash.com>
Subject Re: Task-manager kubernetes pods take a long time to terminate
Date Thu, 30 Jan 2020 20:50:45 GMT
Hi Yun,

I'm currently specifying that specific RPC address in my kubernetes charts
for convenience. Should I be generating a new one for every deployment?

And yes, I am deleting the pods using those commands. I'm just noticing
that the task-manager termination process is short-circuited by the
registration timeout check, so instead of terminating quickly, the
task-manager waits 5 minutes for the timeout before terminating. I expected
it to just terminate without going through that registration timeout. Is
there a way to configure that?


On Thu, Jan 30, 2020 at 8:53 AM Yun Tang <myasuka@live.com> wrote:

> Hi Li,
> Why do you still use 'job-manager' as the jobmanager.rpc.address for the
> second new cluster? If you use another RPC address, the previous task managers
> would not try to register with the old one.
> Take the Flink documentation [1] for k8s as an example. You can list or
> delete all pods like:
> kubectl get/delete pods -l app=flink
> By the way, the default registration timeout is 5 min [2]; task managers
> that cannot register with the JM will shut themselves down after 5 minutes.
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html#session-cluster-resource-definitions
> [2]
> https://github.com/apache/flink/blob/7e1a0f446e018681cb537dd936ae54388b5a7523/flink-core/src/main/java/org/apache/flink/configuration/TaskManagerOptions.java#L158
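> As a rough sketch of how that timeout could be shortened: assuming the
> option behind [2] is taskmanager.registration.timeout (default 5 min), an
> entry like the one below in flink-conf.yaml may reduce the wait. The 30 s
> value is only an illustrative assumption, not a recommendation.
>     # flink-conf.yaml (illustrative value; tune for your cluster)
>     taskmanager.registration.timeout: 30 s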
> Best
> Yun Tang
> ------------------------------
> *From:* Li Peng <li.peng@doordash.com>
> *Sent:* Thursday, January 30, 2020 9:24
> *To:* user <user@flink.apache.org>
> *Subject:* Task-manager kubernetes pods take a long time to terminate
> Hey folks, I'm deploying a Flink cluster via kubernetes, and starting each
> task manager with taskmanager.sh. I noticed that when I tell kubectl to
> delete the deployment, the job-manager pod usually terminates very quickly,
> but any task-manager that doesn't get terminated before the job-manager
> usually gets stuck in this loop:
> 2020-01-29 09:18:47,867 INFO
>  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not
> resolve ResourceManager address akka.tcp://flink@job-manager:6123/user/resourcemanager,
> retrying in 10000 ms: Could not connect to rpc endpoint under address
> akka.tcp://flink@job-manager:6123/user/resourcemanager
> It then does this for about 10 minutes(?), and then shuts down. If I'm
> deploying a new cluster, this pod will try to register itself with the new
> job manager before terminating later. This isn't a troubling issue as far as
> I can tell, but I find it annoying that I sometimes have to force delete
> the pods.
> Any easy ways to just have the task managers terminate gracefully and
> quickly?
> Thanks,
> Li
