lucene-solr-user mailing list archives

From Jessica Mallet <mewmewb...@gmail.com>
Subject Re: zookeeper reconnect failure
Date Tue, 01 Apr 2014 18:14:33 GMT
Filed: https://issues.apache.org/jira/browse/SOLR-5945
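
For reference, the brief retry that Mark suggests in the quoted reply below could look roughly like the following. This is only a minimal sketch and not Solr's actual code: the `ReconnectRetry` class, the `Attempt` interface, and the `maxAttempts` / `initialDelayMs` parameters are hypothetical names made up for illustration. The assumption is that the reconnect attempt (the part that currently throws `UnknownHostException` in the stack trace) can be wrapped as a single callable step.

```java
import java.net.UnknownHostException;

// Hypothetical helper, not part of Solr or ZooKeeper.
final class ReconnectRetry {

    /** Hypothetical hook: one connection attempt that may fail on a DNS lookup. */
    public interface Attempt<T> {
        T run() throws UnknownHostException;
    }

    /**
     * Runs the attempt up to maxAttempts times, sleeping between failures
     * and doubling the delay each time. Rethrows the last
     * UnknownHostException if every attempt fails.
     */
    public static <T> T retryOnUnknownHost(Attempt<T> attempt, int maxAttempts, long initialDelayMs)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        UnknownHostException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.run();
            } catch (UnknownHostException e) {
                last = e;
                if (i < maxAttempts) {
                    // Give a transient DNS outage some time to clear.
                    Thread.sleep(delayMs);
                    delayMs *= 2;
                }
            }
        }
        throw last;
    }
}
```

A caller would pass the reconnect step as a lambda, for example `ReconnectRetry.retryOnUnknownHost(() -> InetAddress.getAllByName(host), 5, 1000L)`. Keeping the number of attempts small matches the "at least briefly" idea: a short DNS blip is ridden out, while a genuinely wrong host name still fails quickly.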


On Tue, Apr 1, 2014 at 11:10 AM, Jessica Mallet <mewmewball@gmail.com> wrote:

> Will do Mark. Thanks!
>
>
> On Sun, Mar 30, 2014 at 1:29 PM, Mark Miller <markrmiller@gmail.com> wrote:
>
>> We don't currently retry, but I don't think it would hurt much if we did
>> - at least briefly.
>>
>> If you want to file a JIRA issue, that would be the best way to get it in
>> a future release.
>>
>> --
>> Mark Miller
>> about.me/markrmiller
>>
>> On March 28, 2014 at 5:40:47 PM, Michael Della Bitta (michael.della.bitta@appinions.com) wrote:
>>
>> Hi, Jessica,
>>
>> We've had a similar problem when DNS resolution of our Hadoop task nodes
>> has failed. They tend to take a dirt nap until you fix the problem
>> manually. Are you experiencing this in AWS as well?
>>
>> I'd say the two things to do are to poll the node state via HTTP using a
>> monitoring tool so you get an immediate notification of the problem, and to
>> install some sort of caching server like nscd if you expect to have DNS
>> resolution failures regularly.
>>
>>
>>
>> Michael Della Bitta
>>
>> Applications Developer
>>
>> o: +1 646 532 3062
>>
>> appinions inc.
>>
>> "The Science of Influence Marketing"
>>
>> 18 East 41st Street
>>
>> New York, NY 10017
>>
>> t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
>> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
>> w: appinions.com <http://www.appinions.com/>
>>
>>
>> On Fri, Mar 28, 2014 at 4:27 PM, Jessica Mallet <mewmewball@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > First off, I'd like to give a disclaimer that this is probably a very
>> > edge-case issue. However, since it happened to us, I would like to get some
>> > advice on how to best handle this failure scenario.
>> >
>> > Basically, we had a network issue where we temporarily lost connectivity
>> > and DNS. The ZooKeeper client properly triggered the watcher. However, when
>> > trying to reconnect, the following exception is thrown:
>> >
>> > 2014-03-27 17:24:46,882 ERROR [main-EventThread] SolrException.java (line 121) :java.net.UnknownHostException: <host name (scrubbed)>: Name or service not known
>> >     at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>> >     at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
>> >     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1258)
>> >     at java.net.InetAddress.getAllByName0(InetAddress.java:1211)
>> >     at java.net.InetAddress.getAllByName(InetAddress.java:1127)
>> >     at java.net.InetAddress.getAllByName(InetAddress.java:1063)
>> >     at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
>> >     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
>> >     at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
>> >     at org.apache.solr.common.cloud.SolrZooKeeper.<init>(SolrZooKeeper.java:41)
>> >     at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:53)
>> >     at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:147)
>> >     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>> >     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>> >
>> > I tried to look at the code and it seems that there'd be no further retries
>> > to connect to Zookeeper, and the node is basically left in a bad state and
>> > will not recover on its own. (Please correct me if I'm reading this wrong.)
>> > Thinking about it, this is probably fair, since normally you wouldn't
>> > expect retries to fix an "unknown host" issue--even though in our case it
>> > would have--but I'm wondering what we should do to handle this situation if
>> > it happens again in the future.
>> >
>> > Any advice is appreciated.
>> >
>> > Thanks,
>> > Jessica
>> >
>>
>
>
