lucene-dev mailing list archives

From "Per Steffensen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3721) Multiple concurrent recoveries of same shard?
Date Thu, 16 Aug 2012 13:37:38 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435953#comment-13435953 ]

Per Steffensen commented on SOLR-3721:
--------------------------------------

What if two Solrs, running the leader and the replica respectively for the same slice (only one replica),
lose their ZK connection at about the same time? Then there will be no active shard that either
of them can recover from. Could it be that in such scenarios multiple concurrent recoveries
of the same shard somehow get started?

BTW, the scenario above shouldn't end in a situation where the slice is just dead. The two
shards in the same slice ought to find out who has the newest version of the shard data (it will
probably be the one that was leader last), make that shard the leader (without recovering)
and let the other shard recover from it. Is this scenario handled (in the way I suggest or
in another way) already in Solr 4.0 (beta - tip of branch), or is that a future thing (e.g.
in 4.1 or 5.0)?
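
To make the suggestion above concrete, here is a minimal sketch in Python. It is not Solr's election code. The function name, the replica names, and the idea of comparing one integer version per shard (with "was leader last" as a tie-breaker) are all assumptions made purely for illustration.

```python
def choose_leader(replicas):
    """Pick a leader among shards of one slice when none is active.

    `replicas` maps a (hypothetical) shard name to a tuple
    (latest_version, was_leader_last). Both fields are assumptions;
    they stand in for whatever "newest shard data" means in practice.
    Returns (leader_name, names_that_must_recover_from_leader).
    """
    if not replicas:
        raise ValueError("no replicas to choose from")
    # Tuples compare element by element: highest version wins,
    # and on equal versions the previous leader is preferred.
    leader = max(replicas, key=lambda name: replicas[name])
    to_recover = sorted(name for name in replicas if name != leader)
    return leader, to_recover


# Hypothetical slice with one leader and one replica.
slice_state = {"core_a": (105, True), "core_b": (98, False)}
leader, to_recover = choose_leader(slice_state)
print(leader, to_recover)
```

The point of the sketch is only that the leader is chosen without recovering, and every other shard in the slice then has an active shard to recover from.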

Regards, Per Steffensen
                
> Multiple concurrent recoveries of same shard?
> ---------------------------------------------
>
>                 Key: SOLR-3721
>                 URL: https://issues.apache.org/jira/browse/SOLR-3721
>             Project: Solr
>          Issue Type: Bug
>          Components: multicore, SolrCloud
>    Affects Versions: 4.0
>         Environment: Using our own Solr release based on Apache revision 1355667 from
4.x branch. Our changes to the Solr version are our solutions to TLT-3178 etc., and should
have no effect on this issue.
>            Reporter: Per Steffensen
>              Labels: concurrency, multicore, recovery, solrcloud
>             Fix For: 4.0
>
>         Attachments: recovery_in_progress.png, recovery_start_finish.log
>
>
> We run a performance/endurance test on a 7 Solr instance SolrCloud setup and eventually
Solrs lose ZK connections and go into recovery. BTW the recovery often does not ever succeed,
but we are looking into that. While doing that I noticed that, according to logs, multiple
recoveries are in progress at the same time for the same shard. That cannot be intended and
I can certainly imagine that it will cause some problems.
> Is it just the logs that are wrong, did I make some mistake, or is this a real bug?
> See attached grep from log, grepping only on "Finished recovery" and "Starting recovery"
logs.
> {code}
> grep -B 1 "Finished recovery\|Starting recovery" solr9.log solr8.log solr7.log solr6.log solr5.log solr4.log solr3.log solr2.log solr1.log solr0.log > recovery_start_finish.log
> {code}
> It can be hard to get an overview of the log, but I have generated a graph showing (based
solely on "Starting recovery" and "Finished recovery" logs) how many recoveries are in progress
at any time for the different shards. See attached recovery_in_progress.png. The graph is
also a little hard to get an overview of (due to the many shards), but it is clear that for
several shards there are multiple recoveries going on at the same time, and that several recoveries
never succeed.
> Regards, Per Steffensen
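
The counting behind such a graph can be sketched as follows. This is a minimal Python illustration, not the script used to produce recovery_in_progress.png. It assumes the "Starting recovery" and "Finished recovery" log lines have already been turned into (timestamp, shard, kind) tuples; that tuple format, the shard names and the function name are hypothetical, since the actual log layout is not shown here.

```python
from collections import defaultdict


def summarize_recoveries(events):
    """Summarize recovery start/finish events per shard.

    `events` is an iterable of (timestamp, shard, kind) tuples where
    kind is "start" or "finish" (an assumed, pre-parsed format).
    Returns (peak, unfinished): the highest number of simultaneous
    recoveries seen per shard, and the number still open at the end.
    """
    in_progress = defaultdict(int)
    peak = defaultdict(int)
    # Order by time; at equal timestamps handle "finish" before "start"
    # so a back-to-back restart is not counted as an overlap.
    ordered = sorted(events, key=lambda e: (e[0], e[2] != "finish"))
    for _ts, shard, kind in ordered:
        if kind == "start":
            in_progress[shard] += 1
            peak[shard] = max(peak[shard], in_progress[shard])
        elif kind == "finish":
            # Guard against a finish whose start fell outside the log.
            in_progress[shard] = max(0, in_progress[shard] - 1)
    unfinished = {s: n for s, n in in_progress.items() if n > 0}
    return dict(peak), unfinished


# Hypothetical events: shard1 has two overlapping recoveries, one never finishes.
events = [
    (1, "shard1", "start"),
    (2, "shard1", "start"),
    (3, "shard1", "finish"),
    (4, "shard2", "start"),
    (5, "shard2", "finish"),
]
peak, unfinished = summarize_recoveries(events)
print(peak, unfinished)
```

A peak above 1 for a shard corresponds to the overlapping recoveries described above, and an entry in `unfinished` corresponds to a recovery that was started but never logged as finished.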

