flink-user mailing list archives

From Patrick Lucas <patr...@data-artisans.com>
Subject Re: Flink HA Zookeeper Connection Timeout
Date Tue, 14 Nov 2017 11:13:02 GMT
Hi Sathya,

Here are two JIRA issues that may be related: FLINK-5996
<https://issues.apache.org/jira/browse/FLINK-5996>, FLINK-7021
<https://issues.apache.org/jira/browse/FLINK-7021>

Are there any logs from your ZK cluster that may be of use? Since you're on
Kubernetes, do you have liveness/readiness probes on ZK, and if so, do they
show any problems? For example, a failed readiness probe could result in the
pod temporarily being dropped from the K8s Service, resulting in a timeout
from Flink.
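
As a rough way to check each ZooKeeper server from inside the cluster, here
is a minimal Python sketch that sends ZooKeeper's "ruok" four-letter command
and expects "imok" back. The function name zk_is_ok is made up for this
example, and it assumes four-letter commands are enabled on your servers
(newer ZooKeeper versions may require whitelisting them).

```python
import socket


def zk_is_ok(host, port=2181, timeout=2.0):
    """Return True if the ZooKeeper server answers 'ruok' with 'imok'.

    zk_is_ok is a hypothetical helper name, not part of any library.
    Any connection error, DNS failure or timeout is treated as "not ok".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")   # ZooKeeper four-letter health command
            reply = sock.recv(16)   # a healthy server replies b"imok"
    except OSError:
        return False
    return reply == b"imok"
```

Calling zk_is_ok("zookeeper-0") for each pod hostname around the time Flink
logs a timeout may show whether one particular server is unreachable.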

Actually, it's probably a good idea to avoid using a Service altogether
with ZooKeeper in Kubernetes and address the pods directly. For this you
could use a StatefulSet
<https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/>
which gives you hostnames like zookeeper-0, zookeeper-1 etc., avoiding the
indirection of a Service and allowing the client library to do its own
failure resolution since it knows where to find each ZooKeeper.
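
To illustrate, here is a small Python sketch that builds the per-pod
connection string you would put in Flink's high-availability.zookeeper.quorum
setting. It assumes a StatefulSet named "zookeeper" with a headless Service
named "zookeeper-headless" in the "system" namespace; the Service name and
the helper name zookeeper_quorum are made up for this example, so adjust
them to your deployment.

```python
def zookeeper_quorum(replicas,
                     statefulset="zookeeper",
                     headless_service="zookeeper-headless",  # assumed name
                     namespace="system",
                     port=2181):
    """Build a comma-separated host:port list, one entry per StatefulSet pod."""
    hosts = [
        f"{statefulset}-{i}.{headless_service}.{namespace}.svc.cluster.local:{port}"
        for i in range(replicas)
    ]
    return ",".join(hosts)


# Print the line as it would appear in flink-conf.yaml for three ZooKeeper pods.
print("high-availability.zookeeper.quorum: " + zookeeper_quorum(3))
```

With every server listed explicitly, the ZooKeeper client can move on to
another server itself when one of them stops responding.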

--
Patrick Lucas

On Wed, Nov 8, 2017 at 4:02 AM, Sathya Hariesh Prakash (sathypra) <
sathypra@cisco.com> wrote:

> Hi, we’re currently testing Flink HA and running into a ZooKeeper timeout
> issue. Error log below.
>
> *Is there a production checklist, or any information on the Flink HA
> parameters that I need to pay attention to?*
>
> Any pointers would really help. Please let me know if any additional
> information is needed. Thanks!
>
> NOTE: I see multiple connection timeout messages, with different elapsed
> times.
>
> {
>    "timeMillis":1510095254557,
>    "thread":"Curator-Framework-0",
>    "level":"ERROR",
>    "loggerName":"org.apache.flink.shaded.org.apache.curator.ConnectionState",
>    "message":"Connection timed out for connection string (zookeeper.system.svc.cluster.local:2181) and timeout (15000) / elapsed (15004)",
>    "thrown":{
>       "commonElementCount":0,
>       "localizedMessage":"KeeperErrorCode = ConnectionLoss",
>       "message":"KeeperErrorCode = ConnectionLoss",
>       "name":"org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException",
>       "extendedStackTrace":[
>          {
>             "class":"org.apache.flink.shaded.org.apache.curator.ConnectionState",
>             "method":"checkTimeouts",
>             "file":"ConnectionState.java",
>             "line":197,
>             "exact":true,
>             "location":"flink-runtime_2.10-1.2.jar",
>             "version":"1.2"
>          },
>          {
>             "class":"org.apache.flink.shaded.org.apache.curator.ConnectionState",
>             "method":"getZooKeeper",
>             "file":"ConnectionState.java",
>             "line":87,
>             "exact":true,
>             "location":"flink-runtime_2.10-1.2.jar",
>             "version":"1.2"
>          },
>          {
>             "class":"org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient",
>             "method":"getZooKeeper",
>             "file":"CuratorZookeeperClient.java",
>             "line":115,
>             "exact":true,
>             "location":"flink-runtime_2.10-1.2.jar",
>             "version":"1.2"
>          },
>          {
>             "class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl",
>             "method":"performBackgroundOperation",
>             "file":"CuratorFrameworkImpl.java",
>             "line":806,
>             "exact":true,
>             "location":"flink-runtime_2.10-1.2.jar",
>             "version":"1.2"
>          },
>          {
>             "class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl",
>             "method":"backgroundOperationsLoop",
>             "file":"CuratorFrameworkImpl.java",
>             "line":792,
>             "exact":true,
>             "location":"flink-runtime_2.10-1.2.jar",
>             "version":"1.2"
>          },
>          {
>             "class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl",
>             "method":"access$300",
>             "file":"CuratorFrameworkImpl.java",
>             "line":62,
>             "exact":true,
>             "location":"flink-runtime_2.10-1.2.jar",
>             "version":"1.2"
>          },
>          {
>             "class":"org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4",
>             "method":"call",
>             "file":"CuratorFrameworkImpl.java",
>             "line":257,
>             "exact":true,
>             "location":"flink-runtime_2.10-1.2.jar",
>             "version":"1.2"
>          },
>          {
>             "class":"java.util.concurrent.FutureTask",
>             "method":"run",
>             "file":"FutureTask.java",
>             "line":266,
>             "exact":true,
>             "location":"?",
>             "version":"1.8.0_66"
>          },
>          {
>             "class":"java.util.concurrent.ThreadPoolExecutor",
>             "method":"runWorker",
>             "file":"ThreadPoolExecutor.java",
>             "line":1142,
>             "exact":true,
>             "location":"?",
>             "version":"1.8.0_66"
>          },
>          {
>             "class":"java.util.concurrent.ThreadPoolExecutor$Worker",
>             "method":"run",
>             "file":"ThreadPoolExecutor.java",
>             "line":617,
>             "exact":true,
>             "location":"?",
>             "version":"1.8.0_66"
>          },
>          {
>             "class":"java.lang.Thread",
>             "method":"run",
>             "file":"Thread.java",
>             "line":745,
>             "exact":true,
>             "location":"?",
>             "version":"1.8.0_66"
>          }
>       ]
>    },
>    "endOfBatch":false,
>    "loggerFqcn":"org.apache.logging.slf4j.Log4jLogger",
>    "threadId":258,
>    "threadPriority":5
> }
>
