lucene-solr-user mailing list archives

From Timothy Potter <>
Subject Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem
Date Mon, 22 Apr 2013 17:42:22 GMT
Have a little more info about this ... the numDocs for *:* fluctuates
between two values (difference of 324 docs) depending on which nodes I
hit (distrib=true)


Using distrib=false, I found 1 shard with a mismatch:

shard15: {
  leader = 32,765,254
  replica = 32,764,930 (diff: 324)
}

Interesting that the replica has more docs than the leader.
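
For anyone wanting to repeat the per-core check, here is a minimal sketch in Python. It assumes each core answers the standard /select handler with wt=json and that the per-core count is read from numFound in the response; the function names and any host or core URLs you pass in are hypothetical, not something from my cluster config.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_url(core_url):
    """Build a match-all query URL that is answered by this core only."""
    params = {"q": "*:*", "rows": 0, "wt": "json", "distrib": "false"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def fetch_count(core_url, timeout=10):
    """Ask one core for its own document count (requires a reachable Solr core)."""
    with urlopen(count_url(core_url), timeout=timeout) as resp:
        return json.load(resp)["response"]["numFound"]


def find_mismatches(counts):
    """Given {shard: {core_name: count}}, return {shard: spread} for
    every shard whose cores disagree on the count."""
    mismatches = {}
    for shard, per_core in counts.items():
        values = list(per_core.values())
        spread = max(values) - min(values)
        if spread:
            mismatches[shard] = spread
    return mismatches


if __name__ == "__main__":
    # Counts as reported above; in practice fill this dict by calling
    # fetch_count() once per core URL.
    observed = {"shard15": {"leader": 32765254, "replica": 32764930}}
    print(find_mismatches(observed))
```

Keeping the comparison separate from the HTTP call means the same check can be run against counts pulled from monitoring instead of live queries.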

Unfortunately, due to some bad log management scripting on my part,
the logs were lost when these instances got re-started, which really
bums me out :-(

For now, I'm going to assume the replica with more docs is the one I
want to keep and will replicate the full index over to the other one.
Sorry about losing the logs :-(


On Sat, Apr 20, 2013 at 10:23 AM, Timothy Potter <> wrote:
> Thanks for responding Mark. I'll collect the information you asked
> about and open a JIRA once I have a little more understanding of what
> happened. Hopefully I can piece together some story after going over
> the logs.
> As for replica / leader, I suspect some leaders went down but
> fail-over to new leaders seemed to work fine. We lost about 9 nodes at
> once and continued to serve queries, which is awesome.
> On Sat, Apr 20, 2013 at 10:11 AM, Mark Miller <> wrote:
>> Yeah, that's no good.
>> You might hit each node with distrib=false to get the doc counts.
>> Which ones have what you think are the right counts and which the wrong - eg is it
>> all replicas that are off, or leaders as well?
>> You say several replicas - do you mean no leaders went down?
>> You might look closer at the logs for a node that has its count off.
>> Finally, I guess I'd try and track it in a JIRA issue.
>> - Mark
>> On Apr 19, 2013, at 6:37 PM, Timothy Potter <> wrote:
>>> We had a rogue query take out several replicas in a large 4.2.0 cluster
>>> today, due to OOMs (we use the JVM args to kill the process on OOM).
>>> After recovering, when I execute the match all docs query (*:*), I get a
>>> different count each time.
>>> In other words, if I execute q=*:* several times in a row, then I get a
>>> different count back for numDocs.
>>> This was not the case prior to the failure as that is one thing we monitor
>>> for.
>>> I think I should be worried ... any ideas on how to troubleshoot this? One
>>> thing to mention is that several of my replicas had to do full recoveries
>>> from the leader when they came back online. Indexing was happening when the
>>> replicas failed.
>>> Thanks.
>>> Tim
