lucene-solr-user mailing list archives

From Jeff Wartes <jwar...@whitepages.com>
Subject Re: SolrCloud replicas out of sync
Date Tue, 26 Jan 2016 22:32:05 GMT

Ah, perhaps you fell into something like this then? https://issues.apache.org/jira/browse/SOLR-7844

That says it’s fixed in 5.4, but that would be an example of a split-brain type incident,
where different documents were accepted by different replicas that each thought they were the
leader. If that’s what happened, and you actually have different data on each replica, I’m not
aware of any way to fix the problem short of reindexing those documents. Before that, you’ll
probably need to choose one replica and force the others to get in sync with it. I’d
choose the current leader, since that’s slightly easier.

Typically, a leader writes an update to its transaction log, then sends the request to
all replicas, and acknowledges the update once they all finish. If a replica gets restarted
and is fewer than N documents behind, it can catch up just by replaying updates from the
transaction log. (Where N is the numRecordsToKeep configured in the updateLog section of
solrconfig.xml)
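For reference, that setting lives inside the updateLog block of solrconfig.xml. A sketch, with an illustrative value (the default is 100; your dir setting may differ):

```xml
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <!-- Keep enough log records that a briefly-restarted replica can
       catch up by log replay instead of a full index copy. -->
  <int name="numRecordsToKeep">500</int>
</updateLog>
```

Raising it trades transaction log disk space for a larger window in which cheap log-replay recovery is possible.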

What you want is to provoke the heavy-duty process normally invoked if a replica has missed
more than N docs, which essentially does a checksum and file copy on all the raw index files.
FetchIndex would probably work, but it’s a replication handler API originally designed for
master/slave replication, so take care: https://wiki.apache.org/solr/SolrReplication#HTTP_API
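As a sketch, a fetchindex call against the out-of-sync core would look something like this — the host, port, and core names are hypothetical, substitute your own from the Admin UI:

```
# Ask the stale replica's replication handler to pull the index
# from the current leader (masterUrl points at the leader's core).
curl 'http://replica-host:8983/solr/blah_blah_shard1_replica1/replication?command=fetchindex&masterUrl=http://leader-host:8983/solr/blah_blah_shard1_replica3/'
```

Again, this handler predates SolrCloud, so test it somewhere safe first.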
Probably a lot easier would be to just delete the replica and re-create it. That will also
trigger a full file copy of the index from the leader onto the new replica.
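The delete-and-recreate route goes through the Collections API; something like the following, where the collection name, node, and the core_node name (which you can find via CLUSTERSTATUS) are all hypothetical placeholders:

```
# Drop the out-of-sync replica.
curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node2'

# Add a fresh replica; it does a full index fetch from the
# current leader as part of coming up.
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=10.0.0.2:8983_solr'
```

Wait for the new replica to go "active" in the Admin UI before deleting the next one, so you never drop below one healthy copy.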

I think design decisions around Solr generally use CP as a goal. (I sometimes wish I could
get more AP behavior!) See posts like this: http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks/

So the fact that you encountered this sounds like a bug to me.
That said, another general recommendation (of mine) is that you not use Solr as your primary
data source, so you can rebuild your index from scratch if you really need to. 

On 1/26/16, 1:10 PM, "David Smith" <dsmithsolr@yahoo.com.INVALID> wrote:

>Thanks Jeff!  A few comments
>
>>>
>>> Although you could probably bounce a node and get your document counts back in sync (by provoking a check)
>>>
>
>If the check is a simple doc count, that will not work. We have found that replica1 and replica3, although they contain the same doc count, don’t have the SAME docs.  They each missed at least one update, but of different docs.  This also means none of our three replicas are complete.
>
>>>
>>> it’s interesting that you’re in this situation. It implies to me that at some point the leader couldn’t write a doc to one of the replicas,
>>>
>
>That is our belief as well. We experienced a datacenter-wide network disruption of a few seconds, and user complaints started the first workday after that event.
>
>The most interesting log entry during the outage is this:
>
>"1/19/2016, 5:08:07 PM ERROR null DistributedUpdateProcessor Request says it is coming from leader, but we are the leader: update.distrib=FROMLEADER&distrib.from=http://dot.dot.dot.dot:8983/solr/blah_blah_shard1_replica3/&wt=javabin&version=2"
>
>>>
>>> You might watch the achieved replication factor of your updates and see if it ever changes
>>>
>
>This is a good tip. I’m not sure I like the implication that any failure to write all 3 of our replicas must be retried at the app layer.  Is this really how SolrCloud applications must be built to survive network partitions without data loss?
>
>Regards,
>
>David
>
>
>On 1/26/16, 12:20 PM, "Jeff Wartes" <jwartes@whitepages.com> wrote:
>
>>
>>My understanding is that the "version" represents the timestamp the searcher was opened, so it doesn’t really offer any assurances about your data.
>>
>>Although you could probably bounce a node and get your document counts back in sync (by provoking a check), it’s interesting that you’re in this situation. It implies to me that at some point the leader couldn’t write a doc to one of the replicas, but that the replica didn’t consider itself down enough to check itself.
>>
>>You might watch the achieved replication factor of your updates and see if it ever changes:
>>https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance (See Achieved Replication Factor/min_rf)
>>
>>If it does, that might give you clues about how this is happening. Also, it might allow you to work around the issue by trying the write again.
>>
>>
>>On 1/22/16, 10:52 AM, "David Smith" <dsmithsolr@yahoo.com.INVALID> wrote:
>>
>>>I have a SolrCloud v5.4 collection with 3 replicas that appear to have fallen permanently out of sync.  Users started to complain that the same search, executed twice, sometimes returned different result counts.  Sure enough, our replicas are not identical:
>>>
>>>>> shard1_replica1:  89867 documents / version 1453479763194
>>>>> shard1_replica2:  89866 documents / version 1453479763194
>>>>> shard1_replica3:  89867 documents / version 1453479763191
>>>
>>>I do not think this discrepancy is going to resolve itself.  The Solr Admin screen reports all 3 replicas as “Current”.  The last modification to this collection was 2 hours before I captured this information, and our auto commit time is 60 seconds.
>>>
>>>I have a lot of concerns here, but my first question is if anyone else has had problems with out of sync replicas, and if so, what they have done to correct this?
>>>
>>>Kind Regards,
>>>
>>>David
>>>