lucene-solr-user mailing list archives

From Jessica Mallet <mewmewb...@gmail.com>
Subject zookeeper reconnect failure
Date Fri, 28 Mar 2014 20:27:43 GMT
Hi,

First off, a disclaimer: this is probably very much an edge case. However,
since it happened to us, I would like some advice on how best to handle
this failure scenario.

Basically, we had a network issue where we temporarily lost connectivity
and DNS. The ZooKeeper client properly triggered the watcher. However, when
it tried to reconnect, the following exception was thrown:

2014-03-27 17:24:46,882 ERROR [main-EventThread] SolrException.java (line 121) :java.net.UnknownHostException: <host name (scrubbed)>: Name or service not known
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1211)
        at java.net.InetAddress.getAllByName(InetAddress.java:1127)
        at java.net.InetAddress.getAllByName(InetAddress.java:1063)
        at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
        at org.apache.solr.common.cloud.SolrZooKeeper.<init>(SolrZooKeeper.java:41)
        at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:53)
        at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:147)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

From reading the code, it looks like there are no further attempts to
reconnect to ZooKeeper after this exception, so the node is left in a bad
state and will not recover on its own. (Please correct me if I'm reading
this wrong.) Thinking about it, this is probably fair, since normally you
wouldn't expect retries to fix an "unknown host" error, even though in our
case they would have. Still, I'm wondering what we should do to handle this
situation if it happens again in the future.
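
For illustration, here is a minimal sketch of the kind of retry that would
have helped in our case: call the connect step again, with a growing delay,
but only when it fails with UnknownHostException. The class and method names
(ReconnectRetry, retryOnUnknownHost) and all parameters are hypothetical.
This is not how Solr's DefaultConnectionStrategy behaves today, just one
possible approach using only the JDK.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Solr or ZooKeeper.
public class ReconnectRetry {

    /**
     * Runs the given connect step until it succeeds or maxAttempts is used up.
     * Only UnknownHostException triggers a retry; any other exception is
     * propagated immediately. The delay doubles after each failed attempt,
     * capped at maxDelayMs.
     */
    public static <T> T retryOnUnknownHost(Callable<T> connect,
                                           int maxAttempts,
                                           long initialDelayMs,
                                           long maxDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return connect.call();
            } catch (UnknownHostException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last DNS failure
                }
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
    }
}
```

The callable would wrap whatever constructs the new client (in the trace
above, that is the SolrZooKeeper constructor called from reconnect()).
Whether blocking the ZooKeeper event thread while sleeping is acceptable is
a separate question, and a real fix would probably need to hand the retry
off to another thread.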

Any advice is appreciated.

Thanks,
Jessica
