lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Why do Solr nodes go into Recovery status
Date Wed, 07 Jun 2017 15:48:31 GMT
The tlogs live on the Solr nodes, not in ZooKeeper. ZooKeeper is not
involved in individual Solr operations (indexing, querying and the like);
it just keeps the state of the nodes.

While recovery is happening, updates are still forwarded to the node
that is recovering. They're written to the local tlog then replayed
after the index is copied.
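
To make the buffering-then-replay sequence concrete, here is a toy model in Python. It is only a sketch of the idea described above, not Solr code: the class and method names are made up for illustration, and a plain dict stands in for the index.

```python
class RecoveringReplica:
    """Toy model (hypothetical names) of a replica that keeps accepting
    updates while it recovers. Not Solr's actual implementation."""

    def __init__(self):
        self.index = {}          # doc id -> document; stands in for the index
        self.tlog = []           # buffered updates; stands in for the tlog
        self.recovering = False

    def start_recovery(self):
        self.recovering = True

    def receive_update(self, doc_id, doc):
        # The leader keeps forwarding updates. While recovering, they are
        # only appended to the local tlog and not applied to the index yet.
        if self.recovering:
            self.tlog.append((doc_id, doc))
        else:
            self.index[doc_id] = doc

    def finish_recovery(self, leader_index_snapshot):
        # Step 1: replace the local index with a copy of the leader's index.
        self.index = dict(leader_index_snapshot)
        # Step 2: replay the updates buffered during the copy, in order.
        for doc_id, doc in self.tlog:
            self.index[doc_id] = doc
        self.tlog.clear()
        self.recovering = False
```

An update that arrives mid-recovery is therefore not lost: it sits in the tlog until the copy finishes and is then applied on top of the copied index.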

Best,
Erick

On Tue, Jun 6, 2017 at 7:01 PM, suresh pendap <sureshforsolr@gmail.com> wrote:
> Thanks Erick for the reply.
>
> When the leader asks the follower to go into recovery status, does it stop
> sending future updates to this replica until it becomes fully in sync with
> the leader?
>
> Regards
> Suresh
>
> On Mon, Jun 5, 2017 at 8:32 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>> bq: This means that technically the replica nodes should not fall behind
>> and do not have to go into recovery mode
>>
>> Well, true if nothing weird happens. By "weird" I mean anything that
>> results in the leader getting anything other than a success code
>> back from a follower it sends a document to.
>>
>> bq: Is this the only scenario in which a node can go into recovery status?
>>
>> No, there are others. One for-instance: Leader sends a doc to the
>> follower and the request times out (huge GC pauses, the doc takes too
>> long to index for whatever reason, etc.). The leader then sends a
>> message to the follower to go directly into the recovery state since
>> the leader has no way of knowing whether the follower successfully
>> wrote the document to its transaction log. You'll see messages about
>> "leader initiated recovery" in the follower's solr log in this case.
>>
>> Two bits of pedantry:
>>
>> bq:  Down by the other replicas
>>
>> Almost. We're talking indexing here and IIUC only the leader can send
>> another node into recovery as all updates go through the leader.
>>
>> If I'm going to be nit-picky, ZooKeeper can _also_ cause a node to be
>> marked as down if its periodic ping of the node fails to return.
>> Actually I think this is done through another Solr node that ZK
>> notifies....
>>
>> bq: It goes into a recovery mode and tries to recover all the
>> documents from the leader of shard1.
>>
>> Also nit-picky. But if the follower isn't "too far" behind it can be
>> brought back into sync via "peer sync" where it gets the missed
>> docs sent to it from the tlog of a healthy replica. "Too far" is 100
>> docs by default, but can be set in solrconfig.xml if necessary. If
>> that limit is exceeded, then indeed the entire index is copied from
>> the leader.
>>
>> Best,
>> Erick
>>
>>
>>
>> On Mon, Jun 5, 2017 at 5:18 PM, suresh pendap <sureshforsolr@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > Why and in what scenarios do Solr nodes go into recovery status?
>> >
>> > Given that Solr is a CP system, writes to a document index are
>> > acknowledged only after they are propagated to and acknowledged by
>> > all the replicas of the shard.
>> >
>> > This means that technically the replica nodes should not fall behind and
>> > do not have to go into recovery mode.
>> >
>> > Is my above understanding correct?
>> >
>> > Can the scenario below happen?
>> >
>> > 1. Assume that we have 3 replicas for Shard shard1 with the names
>> > shard1_replica1, shard1_replica2 and shard1_replica3.
>> >
>> > 2. Due to some reason, network issue or something else, the
>> > shard1_replica2 is not reachable by the other replicas and it is marked
>> > as Down by the other replicas (shard1_replica1 and shard1_replica3 in
>> > this case)
>> >
>> > 3. The network issue is restored and the shard1_replica2 is reachable
>> > again. It goes into a recovery mode and tries to recover all the
>> > documents from the leader of shard1.
>> >
>> > Is this the only scenario in which a node can go into recovery status?
>> >
>> > In other words, does the node have to go into a Down status before getting
>> > back into a recovery status?
>> >
>> >
>> > Regards
>>

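On the "too far behind" rule discussed in the thread (peer sync below a limit of 100 docs by default, full index copy above it), the decision can be sketched as below. This is an illustration only: the function and constant names are hypothetical and this is not how Solr itself is written. I believe the solrconfig.xml setting in question is numRecordsToKeep in the updateLog section, but treat that as an assumption and check the reference guide for your version.

```python
DEFAULT_PEER_SYNC_LIMIT = 100  # "too far" default quoted in the thread


def choose_recovery(missed_updates, limit=DEFAULT_PEER_SYNC_LIMIT):
    """Pick a recovery strategy from the number of updates a follower missed.

    Hypothetical helper for illustration; names are not from Solr.
    """
    if missed_updates < 0:
        raise ValueError("missed_updates must be non-negative")
    if missed_updates <= limit:
        # Close enough: fetch the missed docs from a healthy replica's tlog.
        return "peer_sync"
    # Limit exceeded: copy the entire index from the leader.
    return "full_replication"
```

For example, a follower that missed 40 updates would be caught up by peer sync, while one that missed 5000 would copy the whole index from the leader.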