lucene-dev mailing list archives

From "Rob Speer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3561) Error during deletion of shard/core
Date Thu, 11 Oct 2012 18:47:03 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474421#comment-13474421 ]

Rob Speer commented on SOLR-3561:
---------------------------------

I'm still seeing this error, consistently.

I'm currently running two Solr processes on one machine to test sharding. Whenever I delete all the cores of a collection (even if I explicitly delete the collection using the cloud admin), Solr first logs an error like this (the unload calls I'm issuing are sketched after the logs below):

{noformat}
INFO: Unregistering core testdb-shard6-rep2 from cloudstate.
Oct 11, 2012 2:42:11 PM org.apache.solr.core.SolrCore close
INFO: [testdb-shard6-rep2]  CLOSING SolrCore org.apache.solr.core.SolrCore@7a0ec60b
Oct 11, 2012 2:42:11 PM org.apache.solr.core.SolrCore closeSearcher
INFO: [testdb-shard6-rep2] Closing main searcher on request.
Oct 11, 2012 2:42:11 PM org.apache.solr.update.DirectUpdateHandler2 close
INFO: closing DirectUpdateHandler2{commits=14,autocommits=0,soft autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=44,cumulative_deletesById=0,cumulative_deletesByQuery=8,cumulative_errors=0}
Oct 11, 2012 2:42:11 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for core testdb-shard6-rep2 zkNodeName=panama:8983_solr_testdb-shard6-rep2
Oct 11, 2012 2:42:11 PM org.apache.solr.cloud.LeaderElector$1 process
WARNING: 
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.rangeCheck(ArrayList.java:571)
        at java.util.ArrayList.get(ArrayList.java:349)
        at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:95)
        at org.apache.solr.cloud.LeaderElector.access$000(LeaderElector.java:57)
        at org.apache.solr.cloud.LeaderElector$1.process(LeaderElector.java:125)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:526)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)
{noformat}

Afterward it repeats this error over and over:

{noformat}
SEVERE: Error while trying to recover.
java.lang.RuntimeException: No registered leader was found, collection:lumi-test_pipeline-test slice:shard2
        at org.apache.solr.common.cloud.ZkStateReader.getLeaderProps(ZkStateReader.java:428)
        at org.apache.solr.common.cloud.ZkStateReader.getLeaderProps(ZkStateReader.java:414)
        at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:297)
        at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:211)
Oct 11, 2012 11:30:11 AM org.apache.solr.cloud.RecoveryStrategy doRecovery
SEVERE: Recovery failed - trying again...
Oct 11, 2012 11:30:11 AM org.apache.solr.cloud.RecoveryStrategy doRecovery
{noformat}
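For reference, a minimal sketch of the unload calls behind my test, going straight through the CoreAdmin UNLOAD HTTP API (the host/port and core name are from my local two-process setup; everything else here is an assumption, not the exact code I run):

{code}
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class UnloadCore {
  public static void main(String[] args) throws Exception {
    // Host/port and core name from my local setup; adjust as needed.
    String solrUrl = "http://localhost:8983/solr";
    String core = "testdb-shard6-rep2";

    // CoreAdmin UNLOAD: removes the named core from this Solr instance.
    URL url = new URL(solrUrl + "/admin/cores?action=UNLOAD&core="
        + URLEncoder.encode(core, "UTF-8"));
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    System.out.println(core + " -> HTTP " + conn.getResponseCode());
    conn.disconnect();
  }
}
{code}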
                
> Error during deletion of shard/core
> -----------------------------------
>
>                 Key: SOLR-3561
>                 URL: https://issues.apache.org/jira/browse/SOLR-3561
>             Project: Solr
>          Issue Type: Bug
>          Components: multicore, replication (java), SolrCloud
>    Affects Versions: 4.0-ALPHA
>         Environment: Solr trunk (4.0-SNAPSHOT) from 29/2-2012
>            Reporter: Per Steffensen
>            Assignee: Mark Miller
>             Fix For: 4.1, 5.0
>
>
> Running several Solr servers in a Cloud cluster (zkHost set on the Solr servers).
> Several collections with several slices and one replica for each slice (each slice has two shards).
> Basically we want to let our system delete an entire collection. We do this by trying to delete each and every shard under the collection. Each shard is deleted one by one, by doing CoreAdmin UNLOAD requests against the relevant Solr server:
> {code}
> CoreAdminRequest request = new CoreAdminRequest();
> request.setAction(CoreAdminAction.UNLOAD);
> request.setCoreName(shardName);
> CoreAdminResponse resp = request.process(new CommonsHttpSolrServer(solrUrl));
> {code}
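> The full collection delete is essentially that unload repeated for every shard. A rough sketch of such a loop (the serverUrlHosting lookup is hypothetical, only to illustrate that each UNLOAD is sent to the Solr server actually hosting that shard):
> {code}
> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
> import org.apache.solr.client.solrj.request.CoreAdminRequest;
> import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;
> 
> // Unload every shard of collection_2012_04: 28 slices, two shards per slice.
> void deleteCollection() throws Exception {
>   for (int slice = 1; slice <= 28; slice++) {
>     for (int shard = 1; shard <= 2; shard++) {
>       String shardName = "collection_2012_04_slice" + slice + "_shard" + shard;
>       String solrUrl = serverUrlHosting(shardName); // hypothetical lookup of the hosting Solr server
>       CoreAdminRequest request = new CoreAdminRequest();
>       request.setAction(CoreAdminAction.UNLOAD);
>       request.setCoreName(shardName);
>       request.process(new CommonsHttpSolrServer(solrUrl));
>     }
>   }
> }
> {code}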
> The delete/unload succeeds, but in roughly 10% of the cases we get errors on the involved Solr servers, right around the time the shards/cores are deleted, and we end up in a situation where ZK still claims (forever) that the deleted shard is present and active.
> From here the issue is more easily explained by a concrete example:
> - 7 Solr servers involved
> - Several collections, among others one called "collection_2012_04", consisting of 28 slices and 56 shards (remember: 1 replica for each slice), named "collection_2012_04_sliceX_shardY" for all pairs in {X:1..28}x{Y:1,2}
> - Each Solr server running 8 shards, e.g. Solr server #1 is running shard "collection_2012_04_slice1_shard1" and Solr server #7 is running shard "collection_2012_04_slice1_shard2", both belonging to the same slice "slice1".
> When we decide to delete the collection "collection_2012_04", we go through all 56 shards and delete/unload them one by one - including "collection_2012_04_slice1_shard1" and "collection_2012_04_slice1_shard2". At some point during or shortly after all this deletion we see the following exceptions in solr.log on Solr server #7:
> {code}
> Aug 1, 2012 12:02:50 AM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover:org.apache.solr.common.SolrException: core not found:collection_2012_04_slice1_shard1
> request: http://solr_server_1:8983/solr/admin/cores?action=PREPRECOVERY&core=collection_2012_04_slice1_shard1&nodeName=solr_server_7%3A8983_solr&coreNodeName=solr_server_7%3A8983_solr_collection_2012_04_slice1_shard2&state=recovering&checkLive=true&pauseFor=6000&wt=javabin&version=2
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at org.apache.solr.common.SolrExceptionPropagationHelper.decodeFromMsg(SolrExceptionPropagationHelper.java:29)
> at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:445)
> at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:264)
> at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:188)
> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:285)
> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:206)
> Aug 1, 2012 12:02:50 AM org.apache.solr.common.SolrException log
> SEVERE: Recovery failed - trying again...
> Aug 1, 2012 12:02:51 AM org.apache.solr.cloud.LeaderElector$1 process
> WARNING:
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> at java.util.ArrayList.get(ArrayList.java:322)
> at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:96)
> at org.apache.solr.cloud.LeaderElector.access$000(LeaderElector.java:57)
> at org.apache.solr.cloud.LeaderElector$1.process(LeaderElector.java:121)
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> Aug 1, 2012 12:02:51 AM org.apache.solr.cloud.LeaderElector$1 process
> {code}
> I'm not sure exactly how to interpret this, but it seems to me that some recovery job tries to recover collection_2012_04_slice1_shard2 on Solr server #7 from collection_2012_04_slice1_shard1 on Solr server #1, but fails because Solr server #1 answers back that it doesn't run collection_2012_04_slice1_shard1 (anymore).
> This problem occurs for several (in this concrete test, 4) of the 28 slice pairs. For those 4 shards the end result is that /node_states/solr_server_X:8983_solr in ZK still contains information saying that the shard is running and active. E.g. /node_states/solr_server_7:8983_solr still contains
> {code}
> { 
>  "shard":"slice1",
>  "state":"active",
>  "core":"collection_2012_04_slice1_shard2",
>  "collection":"collection_2012_04",
>  "node_name":"solr_server_7:8983_solr",
>  "base_url":"http://solr_server_7:8983/solr"
> } 
> {code}
> and that CloudState therefore still reports that those shards are running and active - but they are not. Among other things, I have noticed that "collection_2012_04_slice1_shard2" HAS been removed from solr.xml on Solr server #7 (we are running with persistent="true").
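> For what it's worth, the stale entry can be confirmed by reading the znode directly; a minimal sketch using the plain ZooKeeper client (the ZK connect string is an assumption, and the snippet assumes a surrounding method that declares throws Exception):
> {code}
> import org.apache.zookeeper.WatchedEvent;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
> 
> // Read the node_states entry for Solr server #7 straight from ZooKeeper to see
> // whether the unloaded shard is still listed as running and active.
> ZooKeeper zk = new ZooKeeper("zkhost:2181", 10000, new Watcher() {
>   public void process(WatchedEvent event) { /* no-op watcher */ }
> });
> byte[] data = zk.getData("/node_states/solr_server_7:8983_solr", false, null);
> System.out.println(new String(data, "UTF-8"));
> zk.close();
> {code}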
> Any chance that this bug is fixed in a later revision (than the one from 29/2-2012) of 4.0-SNAPSHOT?
> If not, we need to get it fixed, I believe.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
