lucene-solr-user mailing list archives

From Thomas Lamy <t.l...@cytainment.de>
Subject Re: leader split-brain at least once a day - need help
Date Tue, 13 Jan 2015 08:58:09 GMT
Hi Mark,

we're currently at 4.10.2; the update to 4.10.3 is scheduled for tomorrow.

T

On 12.01.15 at 17:30, Mark Miller wrote:
> bq. ClusterState says we are the leader, but locally we don't think so
>
> Generally this is due to some bug. One bug that can lead to it was recently
> fixed in 4.10.3 I think. What version are you on?
>
> - Mark
>
> On Mon Jan 12 2015 at 7:35:47 AM Thomas Lamy <t.lamy@cytainment.de> wrote:
>
>> Hi,
>>
>> I found no big/unusual GC pauses in the log (at least by manual inspection;
>> I found no free tool to analyze the logs that worked out of the box on a
>> headless Debian Wheezy box). I then tried -Xmx8G (was 64G before) on one of
>> the nodes, after checking that heap usage was only about 2-3GB after an hour
>> of run time. That didn't change the point at which a restart was needed, so
>> I don't think Solr's JVM GC is the problem.
>> We're now trying to get all of our nodes' logs (ZooKeeper and Solr) into
>> Splunk, just to get a better-sorted view of what's going on in the cloud
>> once a problem occurs. We're also enabling GC logging for ZooKeeper; maybe
>> we were missing problems there while focusing on the Solr logs.
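>>
>> Something along the lines of the following HotSpot flags should do; the log
>> file path is just a placeholder for whatever fits our setup:
>>
>>    # GC logging options appended to the JVM that runs ZooKeeper (or Solr);
>>    # the log path below is a placeholder
>>    -verbose:gc
>>    -Xloggc:/var/log/zookeeper/gc.log
>>    -XX:+PrintGCDetails
>>    -XX:+PrintGCDateStamps
>>    -XX:+PrintGCApplicationStoppedTime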
>>
>> Thomas
>>
>>
>> On 08.01.15 at 16:33, Yonik Seeley wrote:
>>> It's worth noting that those messages alone don't necessarily signify
>>> a problem with the system (and it wouldn't be called "split brain").
>>> The async nature of updates (and thread scheduling), along with
>>> stop-the-world GC pauses that can change leadership, causes these
>>> little windows of inconsistency that we detect and log.
>>>
>>> -Yonik
>>> http://heliosearch.org - native code faceting, facet functions,
>>> sub-facets, off-heap data
>>>
>>>
>>> On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy <t.lamy@cytainment.de> wrote:
>>>> Hi there,
>>>>
>>>> We are running a 3-server cloud serving a dozen
>>>> single-shard/replicate-everywhere collections. The 2 biggest collections
>>>> are ~15M docs, and about 13GiB / 2.5GiB in size. Solr is 4.10.2, ZK 3.4.5,
>>>> Tomcat 7.0.56, Oracle Java 1.7.0_72-b14.
>>>>
>>>> 10 of the 12 collections (the small ones) get filled by a DIH full-import
>>>> once a day, starting at 1am. The second-biggest collection is updated via
>>>> a DIH delta-import every 10 minutes; the biggest one gets bulk JSON
>>>> updates with commits every 5 minutes.
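>>>>
>>>> For illustration, the imports and updates boil down to requests like these
>>>> (host, port, and core names are placeholders; the DIH handler is assumed
>>>> to be registered at the usual /dataimport path):
>>>>
>>>>    curl 'http://solr1:8080/solr/smallcoll/dataimport?command=full-import'
>>>>    curl 'http://solr1:8080/solr/midcoll/dataimport?command=delta-import'
>>>>    curl 'http://solr1:8080/solr/bigcoll/update/json?commit=true' \
>>>>      -H 'Content-Type: application/json' --data-binary @bulk-docs.json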
>>>>
>>>> On a regular basis, we have a leader information mismatch:
>>>> org.apache.solr.update.processor.DistributedUpdateProcessor; Request says
>>>> it is coming from leader, but we are the leader
>>>> or the opposite:
>>>> org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState
>>>> says we are the leader, but locally we don't think so
>>>>
>>>> One of these pops up once a day at around 8am, sending either some cores
>>>> into "recovery failed" state, or all cores of at least one cloud node into
>>>> "gone" state.
>>>> This started out of the blue about 2 weeks ago, without any changes to
>>>> software, data, or client behaviour.
>>>>
>>>> Most of the time, we get things going again by restarting Solr on the
>>>> current leader node, forcing a new election - can an election be triggered
>>>> while keeping Solr (and its caches) up?
>>>> But sometimes this doesn't help. We had an incident last weekend where our
>>>> admins didn't restart in time; millions of entries piled up in
>>>> /solr/overseer/queue, ZK closed the connection, and the leader re-election
>>>> failed. I had to flush ZK and re-upload the collection config to get Solr
>>>> up again (just like in
>>>> https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
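>>>>
>>>> Roughly, the cleanup looked like this sketch (ZK host, chroot, and config
>>>> directory/name are placeholders for our setup):
>>>>
>>>>    # drop the flooded overseer queue using ZooKeeper's own CLI
>>>>    zkCli.sh -server zk1:2181 rmr /solr/overseer/queue
>>>>
>>>>    # re-upload the collection config using Solr's cloud-scripts/zkcli.sh
>>>>    zkcli.sh -zkhost zk1:2181/solr -cmd upconfig \
>>>>      -confdir /path/to/collection/conf -confname collectionconf
>>>>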
>>>> We have a much bigger cloud (7 servers, ~50GiB of data in 8 collections,
>>>> 1500 requests/s) up and running, which has not had these problems since
>>>> upgrading to 4.10.2.
>>>>
>>>>
>>>> Any hints on where to look for a solution?
>>>>
>>>> Kind regards
>>>> Thomas
>>>>
>>>> --
>>>> Thomas Lamy
>>>> Cytainment AG & Co KG
>>>> Nordkanalstrasse 52
>>>> 20097 Hamburg
>>>>
>>>> Tel.:     +49 (40) 23 706-747
>>>> Fax:     +49 (40) 23 706-139
>>>> Sitz und Registergericht Hamburg
>>>> HRA 98121
>>>> HRB 86068
>>>> Ust-ID: DE213009476
>>>>
>>
>> --
>> Thomas Lamy
>> Cytainment AG & Co KG
>> Nordkanalstrasse 52
>> 20097 Hamburg
>>
>> Tel.:     +49 (40) 23 706-747
>> Fax:     +49 (40) 23 706-139
>>
>> Sitz und Registergericht Hamburg
>> HRA 98121
>> HRB 86068
>> Ust-ID: DE213009476
>>
>>


-- 
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.:     +49 (40) 23 706-747
Fax:     +49 (40) 23 706-139

Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476

