lucene-solr-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem
Date Mon, 22 Apr 2013 20:33:32 GMT
Bummer on the log loss :(

Good info though. Somehow that replica became active without actually syncing? This is heavily
tested (though not with OOMs, I suppose), so I'm a little surprised, but it's hard to speculate
how it happened without the logs. Specifically, the logs from the node that is off would be great
- we would see what it did when it recovered and why it might think it was in sync :(

- Mark

On Apr 22, 2013, at 2:19 PM, Timothy Potter <thelabdude@gmail.com> wrote:

> nm - can't read my own output - the leader had more docs than the replica ;-)
> 
> On Mon, Apr 22, 2013 at 11:42 AM, Timothy Potter <thelabdude@gmail.com> wrote:
>> Have a little more info about this ... the numDocs for *:* fluctuates
>> between two values (difference of 324 docs) depending on which nodes I
>> hit (distrib=true)
>> 
>> 589,674,416
>> 589,674,092
>> 
>> Using distrib=false, I found 1 shard with a mis-match:
>> 
>> shard15: {
>>  leader = 32,765,254
>>  replica = 32,764,930 diff:324
>> }
>> 
>> Interesting that the replica has more docs than the leader.
>> 
>> Unfortunately, due to some bad log management scripting on my part,
>> the logs were lost when these instances got re-started, which really
>> bums me out :-(
>> 
>> For now, I'm going to assume the replica with more docs is the one I
>> want to keep and will replicate the full index over to the other one.
>> Sorry about losing the logs :-(
>> 
>> Tim
>> 
>> 
>> 
>> 
>> On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <thelabdude@gmail.com> wrote:
>>> Thanks for responding Mark. I'll collect the information you asked
>>> about and open a JIRA once I have a little more understanding of what
>>> happened. Hopefully I can piece together some story after going over
>>> the logs.
>>> 
>>> As for replica / leader, I suspect some leaders went down but
>>> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
>>> once and continued to serve queries, which is awesome.
>>> 
>>> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <markrmiller@gmail.com> wrote:
>>>> Yeah, that's no good.
>>>> 
>>>> You might hit each node with distrib=false to get the doc counts.
>>>> 
>>>> Which ones have what you think are the right counts and which the wrong - eg is it
>>>> all replicas that are off, or leaders as well?
>>>> 
>>>> You say several replicas - do you mean no leaders went down?
>>>> 
>>>> You might look closer at the logs for a node that has its count off.
>>>> 
>>>> Finally, I guess I'd try and track it in a JIRA issue.
>>>> 
>>>> - Mark
>>>> 
>>>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <thelabdude@gmail.com> wrote:
>>>> 
>>>>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>>>>> today, due to OOMs (we use the JVM args to kill the process on OOM).
>>>>> 
>>>>> After recovering, when I execute the match all docs query (*:*), I get a
>>>>> different count each time.
>>>>> 
>>>>> In other words, if I execute q=*:* several times in a row, then I get a
>>>>> different count back for numDocs.
>>>>> 
>>>>> This was not the case prior to the failure as that is one thing we monitor
>>>>> for.
>>>>> 
>>>>> I think I should be worried ... any ideas on how to troubleshoot this? One
>>>>> thing to mention is that several of my replicas had to do full recoveries
>>>>> from the leader when they came back online. Indexing was happening when the
>>>>> replicas failed.
>>>>> 
>>>>> Thanks.
>>>>> Tim
>>>> 
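
The per-core check discussed in this thread (hit each core with distrib=false, then compare the match-all count between leader and replica) could be scripted roughly as below. This is a minimal sketch and was not posted by anyone in the thread: the function names, the shard map layout, and the core URLs are hypothetical, and it assumes each core answers a standard /select request with wt=json and reports its count in response.numFound.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_url(core_url):
    """Build a match-all query URL that asks only this core (distrib=false)."""
    # rows=0 skips document retrieval; only the hit count is needed.
    params = {"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def fetch_num_found(core_url, timeout=30):
    """Return the match-all count for one core (needs a reachable Solr node)."""
    with urlopen(count_url(core_url), timeout=timeout) as resp:
        return json.load(resp)["response"]["numFound"]


def find_mismatches(shards, fetch=fetch_num_found):
    """Return the shards whose cores disagree on the document count.

    shards maps a shard name to {role: core_url}, e.g. leader and replica.
    fetch is injectable so the comparison can be tried without a cluster.
    """
    mismatches = {}
    for shard, cores in shards.items():
        counts = {role: fetch(url) for role, url in cores.items()}
        values = list(counts.values())
        if len(set(values)) > 1:
            mismatches[shard] = {
                "counts": counts,
                "diff": max(values) - min(values),
            }
    return mismatches


if __name__ == "__main__":
    # Counts reported in the thread for shard15, served by a stand-in fetcher
    # instead of real HTTP calls. The URLs here are placeholders.
    reported = {"leader-url": 32765254, "replica-url": 32764930}
    shards = {"shard15": {"leader": "leader-url", "replica": "replica-url"}}
    print(find_mismatches(shards, fetch=lambda url: reported[url]))
```

Against a live cluster, the shard map would hold real core URLs and the default fetcher would be used. A non-empty result only shows that two cores disagree. It does not say which copy is correct, which is why the logs from the out-of-sync node matter.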

