curator-user mailing list archives

From Andy Grove <andy.gr...@codefutures.com>
Subject Re: Curator client fails to connect if any one of my zookeeper instances is down (UnknownHostException)
Date Tue, 25 Jun 2013 20:07:11 GMT
After failing to reproduce this issue locally, I enabled trace logging, re-tested on
Amazon EC2, and now have further information.

The issue seems to be specific to java.net.UnknownHostException.

The first error I see is:

     [java]  2013-06-25 19:59:47,883 ERROR c.n.c.f.i.CuratorFrameworkImpl [main] Background exception was not retry-able or retry gave up
     [java]  java.net.UnknownHostException: ec2-107-21-126-93.compute-1.amazonaws.com
     [java] 	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
     [java] 	at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
     [java] 	at java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
     [java] 	at java.net.InetAddress.getAllByName0(InetAddress.java:1154)
     [java] 	at java.net.InetAddress.getAllByName(InetAddress.java:1084)
     [java] 	at java.net.InetAddress.getAllByName(InetAddress.java:1020)
     [java] 	at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
     [java] 	at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
     [java] 	at com.netflix.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:27)

This error is not propagated to the application code, so when the application tries to perform
an operation on the Curator client, I get the following log output in a loop:

BEGIN LOOP

[java]  2013-06-25 20:00:03,131 ERROR c.n.c.ConnectionState [main] Connection timed out for connection string (10.96.214.121:8090,ec2-107-21-126-93.compute-1.amazonaws.com:8090,10.112.81.128:8090) and timeout (15000) / elapsed (15272)
[java]  org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss

...

[java] 2013-06-25 20:00:03,134 TRACE c.d.n.ZKClient$1 [main] addCount() connections-timed-out
[java]  2013-06-25 20:00:03,135 DEBUG c.n.c.RetryLoop [main] Retry-able exception received
[java]  org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss

....

[java] 2013-06-25 20:04:47,518 TRACE c.d.n.ZKClient$1 [main] addCount() retries-allowed
[java]  2013-06-25 20:04:47,519 DEBUG c.n.c.RetryLoop [main] Retrying operation

END LOOP
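
In the first stack trace above, the UnknownHostException is thrown from the StaticHostProvider constructor while it resolves the hosts in the connect string, so a single unresolvable name appears to prevent the ZooKeeper handle from being created at all. One possible workaround is to drop unresolvable hosts from the connect string before passing it to CuratorFrameworkFactory.newClient. This is a minimal sketch, not something Curator or ZooKeeper provides: ConnectStringFilter, resolves, filter and filterResolvable are hypothetical names, and the parsing assumes plain "host:port" entries with no chroot suffix and no IPv6 literals.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper; not part of Curator or ZooKeeper.
final class ConnectStringFilter {

    // Returns true if the host name (or IP literal) can be resolved right now.
    static boolean resolves(String host) {
        try {
            InetAddress.getAllByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Keeps only the "host:port" entries whose host part passes hostCheck.
    // Assumes no chroot suffix and no IPv6 literals in the connect string.
    static String filter(String connectString, Predicate<String> hostCheck) {
        List<String> kept = new ArrayList<>();
        for (String entry : connectString.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            String host = colon >= 0 ? trimmed.substring(0, colon) : trimmed;
            if (hostCheck.test(host)) {
                kept.add(trimmed);
            }
        }
        return String.join(",", kept);
    }

    // Convenience wrapper that uses a real name lookup for the check.
    static String filterResolvable(String connectString) {
        return filter(connectString, ConnectStringFilter::resolves);
    }

    public static void main(String[] args) {
        String connectString = "10.96.214.121:8090,"
                + "ec2-107-21-126-93.compute-1.amazonaws.com:8090,"
                + "10.112.81.128:8090";
        // Stub check so the demo does not depend on DNS; real code would
        // call filterResolvable(connectString) instead.
        Predicate<String> stub = host -> !host.startsWith("ec2-");
        System.out.println(filter(connectString, stub));
    }
}
```

The filtered string would then be passed to newClient in place of the original. The trade-off is that a host dropped at startup stays out of that client's connect string even if its name becomes resolvable later, so this only sidesteps the startup failure.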

On Jun 25, 2013, at 11:16 AM, Andy Grove <andy.grove@codefutures.com> wrote:

> Hi,
> 
> I'm using the following code to connect to my zookeeper instances:
> 
>             client = CuratorFrameworkFactory.newClient(connectString, sessionTimeout, connectTimeout, new ExponentialBackoffRetry(1000, 3));
> 
> I have three hosts; let's call them host1, host2 and host3. If all hosts are running, then
> everything works as expected.
> 
> If host1 is down (server shut down), then all operations on the Curator client fail and
> I see errors like this:
> 
> ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (host1:8090,host2:8090,host3:8090) and timeout (15000) / elapsed (15310)
> 
> It doesn't matter what order I specify the hosts in; I always get these errors, and my
> operation eventually fails with:
> 
>      [java] Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>      [java] 	at com.netflix.curator.ConnectionState.getZooKeeper(ConnectionState.java:101)
>      [java] 	at com.netflix.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:107)
>      [java] 	at com.netflix.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:445)
>      [java] 	at com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:171)
>      [java] 	at com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:160)
>      [java] 	at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:106)
>      [java] 	at com.netflix.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:156)
>      [java] 	at com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:147)
>      [java] 	at com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:35)
>      [java] 	at com.dbshards.nameserver.ZKClient.createPath(ZKClient.java:406)
> 
> I would expect Curator/ZooKeeper to retry this operation with host2 or host3 after an error
> connecting to host1, but this is not the case. I even have a retry loop in my code that tries
> the operation 10 times, and it fails every time if host1 is in the connect string.
> 
> I'm hoping I'm missing something obvious here. Any help would be appreciated.
> 
> Thanks,
> 
> Andy.
> 
> --
> Andy Grove
> VP, R&D
> CodeFutures Corporation
> 
> Share Nothing, Shard Everything!
> http://www.dbshards.com
> 
> 
> 
> 

