lucene-dev mailing list archives

From "Dennis Gove (JIRA)" <>
Subject [jira] [Commented] (SOLR-8599) Errors in construction of SolrZooKeeper cause Solr to go into an inconsistent state
Date Mon, 22 Feb 2016 20:26:18 GMT


Dennis Gove commented on SOLR-8599:

I have somewhat of an interesting situation at hand here.

As part of this patch a test is added to ConnectionManagerTest which forces a DNS failure
on the zookeeper connection by attempting to connect to "BADADDRESS" and then fixing it after
5 seconds. This shows that the change Keith put in ConnectionManager will continually try
to make a connection until it can. It's a good test and it exercises the bug and fix perfectly.
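
A minimal sketch of the retry behaviour the test exercises, assuming a factory that fails while the address is bad and succeeds once it is fixed. The names `RetryUntilConnectedSketch`, `ClientFactory` and `connectWithRetry` are hypothetical and are not taken from the patch; unlike the behaviour described above, this version also caps the number of attempts.

```java
import java.net.UnknownHostException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; not the code in ConnectionManager or the SOLR-8599 patch. */
final class RetryUntilConnectedSketch {

    /** Stand-in for constructing a client such as SolrZooKeeper; may throw. */
    interface ClientFactory<T> {
        T create() throws Exception;
    }

    /**
     * Calls factory.create() until it succeeds, sleeping retryDelayMs between
     * attempts. Gives up after maxAttempts and reports the last failure.
     */
    static <T> T connectWithRetry(ClientFactory<T> factory, int maxAttempts, long retryDelayMs) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return factory.create();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(retryDelayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    throw new IllegalStateException("interrupted while retrying", ie);
                }
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // The first two calls fail like a DNS lookup failure; the third "fixes" the address.
        String client = connectWithRetry(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new UnknownHostException("BADADDRESS");
            }
            return "connected";
        }, 10, 10L);
        System.out.println(client + " after " + calls.get() + " attempts");
    }
}
```

Capping the attempts is a design choice of this sketch only; passing a very large `maxAttempts` approximates "keep trying until it can".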

However, the outcome of the test depends on my ISP. I've run the test under 5 scenarios and
only 3 of them pass.

1. Connected to my corporate network
In this scenario the test passes perfectly as it should.

2. Connected to no network (i.e., wifi card turned off)
In this scenario the test passes perfectly as it should.

3. Connected to my home network backed by Verizon FIOS
In this scenario the test hangs. Upon further investigation I found that it is stuck in an
"infinite" loop in ConnectionManager::waitForConnected. It is effectively infinite because,
while there is a timeout, that timeout is Long.MAX_VALUE. The loop waits until the client is
either connected or closed, and neither of those conditions is ever hit. But why? We're trying
to hit "BADADDRESS" and clearly that is a DNS lookup failure. Oh no no no, not according to
Verizon. See, Verizon instead says "Oh, you must've typed something in wrong, so instead of
returning a DNS failure let me return a redirect to a search page - you clearly want this
search page". It appears that because of this redirection a connection is never made, nor is
it ever closed. Hence, loop forever.

4. Connected to my personal wifi hotspot backed by T-Mobile
Same issue as seen with Verizon FIOS, though a T-Mobile specific search page. 

5. Connected to a hotspot through my iPhone backed by Verizon Wireless
In this scenario the test passes perfectly as it should.
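
The hang in scenarios 3 and 4 comes down to a wait loop whose exit conditions never become true and whose timeout is effectively unbounded. A minimal sketch of the same kind of loop with a finite timeout, assuming simple `connected` and `closed` flags guarded by a monitor; the class `BoundedWaitSketch` and its methods are hypothetical and do not reproduce ConnectionManager's actual code.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a bounded "wait until connected or closed" loop. */
final class BoundedWaitSketch {
    private boolean connected;
    private boolean closed;

    synchronized void markConnected() { connected = true; notifyAll(); }

    synchronized void markClosed() { closed = true; notifyAll(); }

    /**
     * Waits until the state is connected or closed, but no longer than timeoutMs.
     * Returns true only if connected; false on timeout, close, or interrupt.
     */
    synchronized boolean awaitConnected(long timeoutMs) {
        final long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        final long start = System.nanoTime();
        while (!connected && !closed) {
            long remaining = timeoutNanos - (System.nanoTime() - start);
            if (remaining <= 0) {
                return false; // neither condition was hit in time: stop instead of hanging
            }
            try {
                // Wait at least 1 ms, because wait(0) would mean "wait with no timeout".
                wait(Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remaining)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return connected;
    }

    public static void main(String[] args) {
        // Nobody ever marks this one connected or closed, as in the hanging scenarios.
        BoundedWaitSketch stuck = new BoundedWaitSketch();
        System.out.println("stuck: " + stuck.awaitConnected(100)); // prints "stuck: false"

        // Here another thread reports the connection, so the wait ends early.
        BoundedWaitSketch ok = new BoundedWaitSketch();
        new Thread(ok::markConnected).start();
        System.out.println("ok: " + ok.awaitConnected(5_000)); // prints "ok: true"
    }
}
```

With a timeout of Long.MAX_VALUE milliseconds the first call would never return, which matches the hang described above.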

Note that this difference is *only* seen when a DNS lookup failure is in play. If I change
the bad address to "http://BADADDRESS" then it fails instead because "//BADADDRESSIS" is said
to be an invalid path string. Technically this is testing a slightly different case but I'm
comfortable calling it the same test because the issue being corrected is a failure to make
a connection during the construction of SolrZooKeeper and a malformed url fails just the same.
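
A rough illustration of why a URL-style address is rejected as a path rather than as a host. It assumes the client treats everything from the first '/' in the connect string as a chroot path and rejects paths with an empty node name; that is an assumption about ZooKeeper's connect-string handling, and `ChrootSplitSketch` with its methods is hypothetical, not ZooKeeper's own parser.

```java
/** Hypothetical sketch; mimics an assumed chroot split, not ZooKeeper's real parser. */
final class ChrootSplitSketch {

    /** Returns the suffix starting at the first '/', or null if there is no chroot. */
    static String chrootOf(String connectString) {
        int slash = connectString.indexOf('/');
        if (slash < 0) {
            return null; // plain host list such as "BADADDRESS"
        }
        String chroot = connectString.substring(slash);
        return chroot.length() == 1 ? null : chroot; // a lone "/" means no chroot
    }

    /** True if the path contains an empty node name, i.e. two consecutive slashes. */
    static boolean hasEmptyNode(String path) {
        return path.contains("//");
    }

    public static void main(String[] args) {
        System.out.println(chrootOf("BADADDRESS"));          // prints "null"
        System.out.println(chrootOf("http://BADADDRESS"));   // prints "//BADADDRESS"
        System.out.println(hasEmptyNode("//BADADDRESS"));    // prints "true"
        System.out.println(chrootOf("localhost:2181/solr")); // prints "/solr"
    }
}
```

Under that assumption the failure happens before any DNS lookup is attempted, so it does not depend on how the ISP answers unknown host names.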

> Errors in construction of SolrZooKeeper cause Solr to go into an inconsistent state
> -----------------------------------------------------------------------------------
>                 Key: SOLR-8599
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Keith Laban
>         Attachments: SOLR-8599.patch, SOLR-8599.patch
> We originally saw this happen due to a DNS exception (see stack trace below). However, any
> exception thrown in the constructor of SolrZooKeeper or its parent class, ZooKeeper, will
> cause DefaultConnectionStrategy to fail to update the zookeeper client. Once it gets into
> this state, it will not try to connect again until the process is restarted. The node itself
> will also respond successfully to query requests, but not to update requests.
> Two things should be addressed here:
> 1) Fix the error handling and issue some number of retries
> 2) If we are stuck in a state like this stop responding to all requests 
> {code}
> 2016-01-23 13:49:20.222 ERROR ConnectionManager [main-EventThread] -
HOSTNAME: unknown error
> at Method)
> at$2.lookupAllHostAddr(
> at
> at
> at
> at
> at org.apache.zookeeper.client.StaticHostProvider.<init>(
> at org.apache.zookeeper.ZooKeeper.<init>(
> at org.apache.zookeeper.ZooKeeper.<init>(
> at<init>(
> at
> at
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(
> at org.apache.zookeeper.ClientCnxn$
> 2016-01-23 13:49:20.222 INFO ConnectionManager [main-EventThread] - Connected:false
> 2016-01-23 13:49:20.222 INFO ClientCnxn [main-EventThread] - EventThread shut down
> {code}
