lucene-dev mailing list archives

From "Noble Paul (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SOLR-11484) CloudSolrClient's cache of collection clusterstate can cause RouteExceptions when attempting directUpdates after collection modifications
Date Fri, 27 Oct 2017 05:47:01 GMT

     [ https://issues.apache.org/jira/browse/SOLR-11484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul resolved SOLR-11484.
-------------------------------
       Resolution: Fixed
    Fix Version/s: master (8.0)
                   7.2

> CloudSolrClient's cache of collection clusterstate can cause RouteExceptions when attempting directUpdates after collection modifications
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11484
>                 URL: https://issues.apache.org/jira/browse/SOLR-11484
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Noble Paul
>             Fix For: 7.2, master (8.0)
>
>         Attachments: SOLR-11484.patch, SOLR-11484.patch, jenkins.thetaphi.20662.txt
>
>
> This was discovered while auditing jenkins failures from {{TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete}} (where a test explicitly deletes and then recreates a collection with the same name), but as noted in a comment below, SOLR-11392 is another example of non-obvious test failures that can pop up because of this bug.
> In practice, it can affect any CloudSolrClient user after changes have been made to a collection (adding or moving replicas, etc.).
> ----
> Original jira notes...
> {{TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete}} seems to fail with non-trivial frequency, so I grabbed the logs from a recent failure and started trying to follow along with the actions to figure out what exactly is happening...
> https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20662/
> {noformat}
>    [junit4] ERROR   20.3s J1 | TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete <<<
>    [junit4]    > Throwable #1: org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at https://127.0.0.1:42959/solr/testcollection_shard1_replica_n3: Expected mime type application/octet-stream but got text/html. <html>
>    [junit4]    > <head>
>    [junit4]    > <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
>    [junit4]    > <title>Error 404 </title>
> {noformat}
> The crux of this failure appears to be a genuine bug in how CloudSolrClient uses its cached ClusterState info when doing (direct) updates.  The key bits seem to be:
> * CloudSolrClient does _something_ (update, query, etc.) with a collection, causing the current cluster state for the collection to be cached
> * The actual collection changes such that a Solr node/core no longer exists as part of the collection
> * CloudSolrClient is asked to process an UpdateRequest which triggers the code paths for the {{directUpdate()}} method -- which attempts to route the updates directly to a replica of the appropriate shard using the (cached) collection state info
> * CloudSolrClient (may) attempt to send that UpdateRequest to a node/core that doesn't exist, getting a 404 -- which does not (seem to) trigger a state refresh, or a retry to find a correct URL to resend the update to.
> Details to follow in comment....
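
The sequence in the list above can be sketched with a toy model. This is only an illustration and does not use SolrJ: the class and method names (Cluster, CachingClient, directUpdateNoRefresh, directUpdateWithRefresh) and the replica name testcollection_shard1_replica_n5 are hypothetical, and the refresh-and-retry step is one plausible remedy, not necessarily what the attached SOLR-11484.patch does.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleStateSketch {

    /** Stand-in for the real cluster: which core currently serves each collection. */
    static class Cluster {
        private final Map<String, String> liveCoreByCollection = new HashMap<>();

        void setCore(String collection, String core) {
            liveCoreByCollection.put(collection, core);
        }

        String lookupCore(String collection) {
            return liveCoreByCollection.get(collection);
        }

        /** Returns an HTTP-like status: 200 if the target core still exists, 404 otherwise. */
        int sendUpdate(String collection, String core) {
            boolean exists = core != null && core.equals(liveCoreByCollection.get(collection));
            return exists ? 200 : 404;
        }
    }

    /** Hypothetical client that caches per-collection state, as in the steps above. */
    static class CachingClient {
        private final Cluster cluster;
        private final Map<String, String> cachedCore = new HashMap<>();

        CachingClient(Cluster cluster) {
            this.cluster = cluster;
        }

        /** Routes using the cached state only; a stale entry produces a 404. */
        int directUpdateNoRefresh(String collection) {
            String core = cachedCore.computeIfAbsent(collection, cluster::lookupCore);
            return cluster.sendUpdate(collection, core);
        }

        /** Assumed remedy: on 404, drop the cached entry, re-read the state, retry once. */
        int directUpdateWithRefresh(String collection) {
            int status = directUpdateNoRefresh(collection);
            if (status == 404) {
                cachedCore.remove(collection);
                status = directUpdateNoRefresh(collection);
            }
            return status;
        }
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster();
        cluster.setCore("testcollection", "testcollection_shard1_replica_n3");
        CachingClient client = new CachingClient(cluster);

        // First request populates the cache and succeeds.
        System.out.println(client.directUpdateNoRefresh("testcollection"));

        // Collection is deleted and recreated; the old core no longer exists.
        cluster.setCore("testcollection", "testcollection_shard1_replica_n5");

        // The cached route is stale, so the update hits a missing core.
        System.out.println(client.directUpdateNoRefresh("testcollection"));

        // Refreshing the cached state on 404 lets the retry succeed.
        System.out.println(client.directUpdateWithRefresh("testcollection"));
    }
}
```

In this model the second call fails only because nothing invalidates the cached entry; the third call shows that treating the 404 as a signal to re-read the collection state is enough to recover.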



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

