lucene-dev mailing list archives

From "Ben DeMott (JIRA)" <>
Subject [jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
Date Wed, 23 Aug 2017 22:43:00 GMT


Ben DeMott commented on SOLR-6707:

We have experienced this multiple times.  We host inside AWS, and our ZooKeeper ensemble is spread
across different availability zones...
This means the connections between ZooKeeper nodes occasionally have high latency, which ZooKeeper
does not seem to tolerate well.  I wonder if anyone else is in this situation.
We have never had so many ZooKeeper issues as we do now that we've moved our infrastructure
into AWS.

What triggered a backed-up overseer queue for us was a hung ephemeral node in ZooKeeper, which
I discuss here:

As the OP said, once this goes on long enough Solr runs out of file descriptors, and eventually
it brings down the whole cluster.
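One way to catch this before the cluster falls over is to watch the depth of /overseer/queue. Below is a minimal sketch of such a check; the helper names and the 10,000-item threshold are my own illustrative assumptions, not anything from the original report:

```python
OVERSEER_QUEUE = "/overseer/queue"

def overseer_queue_depth(zk, path=OVERSEER_QUEUE):
    """Count pending overseer items. `zk` is any object exposing a
    kazoo-style get_children(path) -> list of child znode names."""
    return len(zk.get_children(path))

def is_clogged(zk, threshold=10000):
    # 10000 is an illustrative alerting threshold; tune per cluster.
    return overseer_queue_depth(zk) > threshold
```

With a live cluster you would pass in a started `KazooClient` (from the `kazoo` package) connected to your ensemble and alert when `is_clogged` returns True.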

This bug in ZooKeeper appears to be the cause of the hung ephemeral node:

> Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
> -----------------------------------------------------------------------------------------------------
>                 Key: SOLR-6707
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>             Fix For: 5.2, 6.0
> We experienced an issue the other day that brought a production Solr server down, and
this is what we found after investigating:
> - Running Solr instance with two separate cores, one of which is perpetually down because
its configs are not yet completely updated for SolrCloud. This was thought to be harmless
since it is not currently in use.
> - Solr experienced an "internal server error", supposedly because of "No space left on
device", even though we appeared to have ~10 GB free.
> - Solr immediately went into recovery, and subsequent leader election, for each shard
of each core.
> - Our primary core recovered immediately. Our additional core, which was never active
in the first place, attempted to recover but of course couldn't due to the improper configs.
> - Solr then began rapid-fire re-attempting recovery of said node, trying maybe 20-30 times
per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
> - At some point /overseer/queue becomes so backed up that normal cluster coordination
can no longer play out, and Solr topples over.
> I know this is a bit of an unusual circumstance due to us keeping the dead core around,
and our quick solution has been to remove said core. However, I can see other potential scenarios
that might cause the same issue to arise.
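The natural mitigation for the rapid-fire retries described above is to space recovery attempts out with capped exponential backoff and jitter, rather than hammering the queue 20-30 times per second. This is only a generic sketch of that idea, not Solr's actual recovery code; the base delay, cap, and retry count are illustrative:

```python
import random

def backoff_delays(base=0.5, cap=60.0, factor=2.0, retries=10, seed=None):
    """Yield sleep intervals (seconds) for successive retry attempts.

    Each window grows exponentially up to `cap`, and the actual delay
    is drawn uniformly from [0, window] ("full jitter"), so a core that
    can never recover produces a bounded, spread-out trickle of retries
    instead of flooding /overseer/queue."""
    rng = random.Random(seed)
    window = base
    for _ in range(retries):
        yield rng.uniform(0.0, window)
        window = min(cap, window * factor)
```

A recovery loop would sleep for each yielded delay between attempts and give up (or alert) after the iterator is exhausted.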

This message was sent by Atlassian JIRA
