lucene-dev mailing list archives

From "Mark Miller (JIRA)" <>
Subject [jira] [Commented] (SOLR-6707) Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
Date Mon, 26 Jan 2015 03:58:34 GMT


Mark Miller commented on SOLR-6707:

Some of this is probably related to SOLR-7033.

> Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
> -----------------------------------------------------------------------------------------------------
>                 Key: SOLR-6707
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>             Fix For: Trunk, 5.1
> We experienced an issue the other day that brought a production Solr server down, and this is what we found after investigating:
> - Running Solr instance with two separate cores, one of which is perpetually down because its configs are not yet completely updated for SolrCloud. This was thought to be harmless since it's not currently in use.
> - Solr experienced an "internal server error", supposedly because of "No space left on device", even though we appeared to have ~10GB free.
> - Solr immediately went into recovery, followed by leader election for each shard of each core.
> - Our primary core recovered immediately. Our additional core, which was never active in the first place, attempted to recover but of course couldn't due to the improper configs.
> - Solr then began rapid-fire reattempting recovery of said node, trying maybe 20-30 times per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
> - At some point /overseer/queue became so backed up that normal cluster coordination could no longer play out, and Solr toppled over.
> I know this is a bit of an unusual circumstance due to us keeping the dead core around, and our quick solution has been to remove said core. However, I can see other potential scenarios that might cause the same issue to arise.
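
The failure mode described above (20-30 recovery attempts per second, each adding work to /overseer/queue) is what a capped exponential backoff between retries is generally meant to prevent. The sketch below is only an illustration of that general technique, not Solr's actual recovery code or the fix applied for this issue. The function names, the 0.5 s base delay, and the 60 s cap are all hypothetical.

```python
import random
import time


def backoff_delays(base_s=0.5, cap_s=60.0, attempts=8, jitter=0.0, rng=None):
    """Return the wait in seconds before each retry: base * 2**n, capped at cap_s.

    All defaults are hypothetical values chosen for illustration.
    """
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        delay = min(cap_s, base_s * (2 ** n))
        # Optional jitter spreads retries out so many cores do not retry in lockstep.
        delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


def retry_with_backoff(operation, delays, sleep=time.sleep):
    """Call operation() until it returns True or the delays run out.

    Returns (succeeded, attempts_made). `operation` stands in for a
    recovery attempt; it is a placeholder, not a Solr API.
    """
    attempts = 0
    for delay in delays:
        attempts += 1
        if operation():
            return True, attempts
        sleep(delay)  # wait before the next attempt instead of retrying immediately
    return False, attempts
```

With these assumed defaults, eight consecutive failures are spread over roughly two minutes (0.5 + 1 + 2 + 4 + 8 + 16 + 32 + 60 = 123.5 seconds) instead of happening within a fraction of a second, so a core that can never recover generates far fewer queue entries.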
