lucene-dev mailing list archives

From "Hoss Man (JIRA)" <>
Subject [jira] [Updated] (SOLR-11484) Possible bug with CloudSolrClient directedUpdates & cached collection state -- TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete
Date Tue, 17 Oct 2017 17:12:00 GMT


Hoss Man updated SOLR-11484:
    Attachment: SOLR-11484.patch

[~noble.paul]:  IIUC you're saying -- at a broader level -- that you think this is a bug in
CloudSolrClient, and not a mistake in the affected test(s), correct?

In that case, I'm attaching a patch with a test explicitly targeted at this caching problem
in CloudSolrClient that fails reliably (for me) 100% of the time, and I'll re-word the summary/description
to more specifically describe the underlying problem.

That said: I don't really understand all the possible code paths well enough to be confident
of your suggested fix:

* is adding {{RouteException}} to the {{wasCommError}} logic safe?
** it's not clear to me that every possible reason for a {{RouteException}} is actually a
"communication error" that should trigger all the affected downstream logic that comes with
setting that boolean
* is setting {{wasCommError}} enough to actually fix this bug?
** It doesn't seem like setting that is sufficient, because I don't think the {{directUpdate()}}
code paths will (currently) attempt a retry even if {{true==wasCommError}}  ?
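
To make the second question concrete, here is a minimal, self-contained sketch of the difference between merely noticing the 404 and actually refreshing the cached state and resending. It is not the CloudSolrClient code and does not use SolrJ: {{FakeCluster}}, {{CachingClient}}, {{retryOnNotFound}} and the node URLs are all hypothetical names invented for illustration, and the real {{directUpdate()}} logic may differ substantially.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StaleStateRetrySketch {

    /** Hypothetical stand-in for the cluster: collection name -> live replica URLs. */
    static class FakeCluster {
        final Map<String, List<String>> liveReplicas = new HashMap<>();

        /** Returns 200 if the URL is a live replica of the collection, else 404. */
        int send(String collection, String url) {
            List<String> live = liveReplicas.get(collection);
            return (live != null && live.contains(url)) ? 200 : 404;
        }
    }

    /** Hypothetical client that routes updates using a cached copy of the state. */
    static class CachingClient {
        final FakeCluster cluster;
        final boolean retryOnNotFound; // assumption: a switch for this sketch only
        final Map<String, List<String>> cachedState = new HashMap<>();

        CachingClient(FakeCluster cluster, boolean retryOnNotFound) {
            this.cluster = cluster;
            this.retryOnNotFound = retryOnNotFound;
        }

        /** Returns the cached replica list, loading it from the cluster on first use. */
        List<String> state(String collection) {
            return cachedState.computeIfAbsent(collection, c -> new ArrayList<>(
                cluster.liveReplicas.getOrDefault(c, Collections.<String>emptyList())));
        }

        /** Sends directly to the first cached replica; optionally refreshes and retries once. */
        int directUpdate(String collection) {
            List<String> replicas = state(collection);
            int status = replicas.isEmpty() ? 404 : cluster.send(collection, replicas.get(0));
            if (status == 404 && retryOnNotFound) {
                cachedState.remove(collection);            // drop the stale entry
                List<String> fresh = state(collection);    // reload from the cluster
                if (!fresh.isEmpty()) {
                    status = cluster.send(collection, fresh.get(0));
                }
            }
            return status;
        }
    }

    /** Runs the scenario and returns the status of the update sent after the change. */
    static int run(boolean retryOnNotFound) {
        FakeCluster cluster = new FakeCluster();
        cluster.liveReplicas.put("c1", Arrays.asList("http://node1/c1_replica1"));
        CachingClient client = new CachingClient(cluster, retryOnNotFound);

        client.directUpdate("c1"); // first request caches the collection state

        // The collection changes: the old core is gone, a new one exists elsewhere.
        cluster.liveReplicas.put("c1", Arrays.asList("http://node2/c1_replica2"));

        return client.directUpdate("c1"); // routed with the now-stale cache
    }

    public static void main(String[] args) {
        System.out.println("without refresh+retry: " + run(false));
        System.out.println("with refresh+retry:    " + run(true));
    }
}
```

In this toy model the update after the collection change fails with a 404 unless the client both invalidates its cached state and resends, which is the behaviour I'm not sure the {{directUpdate()}} code paths have even when {{wasCommError}} is set.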


> Possible bug with CloudSolrClient directedUpdates & cached collection state -- TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>                 Key: SOLR-11484
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>         Attachments: SOLR-11484.patch, jenkins.thetaphi.20662.txt
> {{TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete}} seems to fail
with non-trivial frequency, so I grabbed the logs from a recent failure and started trying
to follow along with the actions to figure out what exactly is happening....
> {noformat}
>    [junit4] ERROR   20.3s J1 | TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete
>    [junit4]    > Throwable #1: org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at Expected mime type application/octet-stream but got text/html. <html>
>    [junit4]    > <head>
>    [junit4]    > <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
>    [junit4]    > <title>Error 404 </title>
> {noformat}
> The crux of this failure appears to be a genuine bug in how CloudSolrClient uses its
cached ClusterState info when doing (direct) updates.  The key bits seem to be:
> * CloudSolrClient does _something_ (update,query,etc...) with a collection causing the
current cluster state for the collection to be cached
> * The actual collection changes such that a Solr node/core no longer exists as part of
the collection
> * CloudSolrClient is asked to process an UpdateRequest which triggers the code paths
for the {{directUpdate()}} method -- which attempts to route the updates directly to a replica
of the appropriate shard using the (cached) collection state info
> * CloudSolrClient (may) attempt to send that UpdateRequest to a node/core that doesn't
exist, getting a 404 -- which does not (seem to) trigger a state refresh, or retry to find
a correct URL to resend the update to.
> Details to follow in comment....
