lucene-solr-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: Solrcloud Index corruption
Date Thu, 05 Mar 2015 23:11:45 GMT
If you Google "replication can cause index corruption", you'll find the two JIRA
issues that are the most likely cause of corruption in a SolrCloud env.

- Mark

> On Mar 5, 2015, at 2:20 PM, Garth Grimm <GarthGrimm@averyranchconsulting.com> wrote:
> 
> For updates, the document will always get routed to the leader of the
> appropriate shard, no matter what server first receives the request.
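> 
> For example (the host and collection name below are just placeholders),
> an update posted to any node ends up on the shard leader:
> 
> curl 'http://anynode:8983/solr/mycoll/update?commit=true' \
>      -H 'Content-Type: application/json' -d '[{"id":"doc1"}]'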
> 
> -----Original Message-----
> From: Martin de Vries [mailto:martin@downnotifier.com] 
> Sent: Thursday, March 05, 2015 4:14 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solrcloud Index corruption
> 
> Hi Erick,
> 
> Thank you for your detailed reply.
> 
> You say in our case some docs didn't made it to the node, but that's not really true:
the docs can be found on the corrupted nodes when I search on ID. The docs are also complete.
The problem is that the docs do not appear when I filter on certain fields (however the fields
are in the doc and have the right value when I search on ID). So something seems to be corrupt
in the filter index. We will try the checkindex, hopefully it is able to identify the problematic
cores.
> 
> I understand there is not a "master" in SolrCloud. In our case we use
> haproxy as a load balancer for every request, so when indexing, each
> document is sent to a different Solr server, one immediately after the
> other. Maybe SolrCloud is not able to handle that correctly?
> 
> 
> Thanks,
> 
> Martin
> 
> 
> 
> 
> Erick Erickson wrote on 05.03.2015 19:00:
> 
>> Wait up. There's no "master" index in SolrCloud. Raw documents are 
>> forwarded to each replica, indexed and put in the local tlog. If a 
>> replica falls too far out of synch (say you take it offline), then the 
>> entire index _can_ be replicated from the leader and, if the leader's 
>> index was incomplete then that might propagate the error.
>> 
>> The practical consequence of this is that if _any_ replica has a 
>> complete index, you can recover. Before going there though, the 
>> brute-force approach is to just re-index everything from scratch.
>> That's likely easier, especially on indexes this size.
>> 
>> Here's what I'd do.
>> 
>> Assuming you have the Collections API calls for ADDREPLICA and 
>> DELETEREPLICA, then:
>> 0> Identify the complete replicas. If you're lucky you have at least
>> one for each shard.
>> 1> Copy 1 good index from each shard somewhere just to have a backup.
>> 2> DELETEREPLICA on all the incomplete replicas
>> 2.5> I might shut down all the nodes at this point and check that all 
>> the cores I'd deleted were gone. If any remnants exist, 'rm -rf 
>> deleted_core_dir'.
>> 3> ADDREPLICA to get back the ones you removed.
>> 
>> This should copy the entire index from the leader for each replica. As
>> you do, leadership will change, and after you've deleted all the
>> incomplete replicas, one of the complete ones will be the leader and
>> you should be OK.
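>> 
>> For example, the calls look roughly like this (collection, shard, and
>> replica names are placeholders; adjust to your setup):
>> 
>> # remove an incomplete replica
>> curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node3'
>> # add a fresh replica; it syncs its index from the shard leader
>> curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1'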
>> 
>> If you don't want to/can't use the Collections API, then
>> 0> Identify the complete replicas. If you're lucky you have at least
>> one for each shard.
>> 1> Shut 'em all down.
>> 2> Copy the good index somewhere just to have a backup.
>> 3> 'rm -rf data' for all the incomplete cores.
>> 4> Bring up the good cores.
>> 5> Bring up the cores that you deleted the data dirs from.
>> 
>> What this should do is replicate the entire index from the leader. When
>> you restart the good cores (step 4 above), they'll _become_ the leader.
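>> 
>> A rough sketch of those steps (paths and core names below are examples
>> only; your directories will differ):
>> 
>> # 2> back up one known-good index per shard
>> cp -r /var/solr/mycoll_shard1_replica1/data /backup/shard1_good
>> # 3> wipe the data dir of each incomplete core while Solr is down
>> rm -rf /var/solr/mycoll_shard1_replica2/data
>> # 4> and 5> start the nodes with good cores first, then the wiped ones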
>> 
>> bq: Is it possible to make Solrcloud invulnerable to network problems?
>> I'm a little surprised that this is happening. It sounds like the 
>> network problems were such that some nodes weren't out of touch long 
>> enough for Zookeeper to sense that they were down and put them into 
>> recovery. Not sure there's any way to secure against that.
>> 
>> bq: Is it possible to see if a core is corrupt?
>> There's "CheckIndex", here's at least one link:
>> http://java.dzone.com/news/lucene-and-solrs-checkindex
>> What you're describing, though, is that docs just didn't make it to
>> the node, _not_ that the index has unexpected bits, bad disk sectors
>> and the like, so CheckIndex can't detect that. How would it know what
>> _should_ have been in the index?
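>> 
>> If you do run CheckIndex anyway, the invocation is roughly like this
>> (the jar version and index path are just examples):
>> 
>> java -cp lucene-core-4.10.3.jar org.apache.lucene.index.CheckIndex \
>>      /var/solr/mycoll_shard1_replica1/data/index
>> # add -fix to drop unreadable segments (the docs in them are lost)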
>> 
>> bq: I noticed a difference in the "Gen" column on Overview - 
>> Replication. Does this mean there is something wrong?
>> You cannot infer anything from this. In particular, the merging will 
>> be significantly different between a single full-reindex and what the 
>> state of segment merges is in an incrementally built index.
>> 
>> The admin UI screen is rooted in the pre-cloud days, the Master/Slave 
>> thing is entirely misleading. In SolrCloud, since all the raw data is 
>> forwarded to all replicas, and any auto commits that happen may very 
>> well be slightly out of sync, the index size, number of segments, 
>> generations, and all that are pretty safely ignored.
>> 
>> Best,
>> Erick
>> 
>> On Thu, Mar 5, 2015 at 6:50 AM, Martin de Vries 
>> <martin@downnotifier.com>
>> wrote:
>> 
>>> Hi Andrew,
>>> 
>>> Even our master index is corrupt, so I'm afraid this won't help in our
>>> case.
>>> 
>>> Martin
>>> 
>>> Andrew Butkus wrote on 05.03.2015 16:45:
>>> 
>>>> Force a fetchindex on the slave from the master with this command:
>>>> http://slave_host:port/solr/replication?command=fetchindex - from
>>>> http://wiki.apache.org/solr/SolrReplication [1]. The above command
>>>> will download the whole index from master to slave. There are
>>>> configuration options in Solr to make this problem happen less often
>>>> (allowing it to recover from newly added documents and only send the
>>>> changes, with a wider gap) - but I can't remember what those were.
> 
> 
> 
> Links:
> ------
> [1] http://wiki.apache.org/solr/SolrReplication
