Date: Wed, 5 Sep 2012 09:31:07 +1100 (NCT)
From: "Mark Miller (JIRA)"
To: dev@lucene.apache.org
Message-ID: <2140502270.35332.1346797867889.JavaMail.jiratomcat@arcas>
In-Reply-To: <1499455442.30582.1346683387793.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (SOLR-3782) A leader going down while updates are coming in can cause shard inconsistency.

[ https://issues.apache.org/jira/browse/SOLR-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448123#comment-13448123 ]

Mark Miller commented on SOLR-3782:
-----------------------------------

I had only solved the issue for the case of stopping the leader. There was also a similar issue on session expiration: the leader's update queue could still be emptying as we elect a new leader and beyond. I fixed this as well by shutting down the executor on session expiration and creating a new one for further use.
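A minimal Java sketch of the ordering described in this comment and in the issue description quoted with it: stop the update executor before giving up leadership, and on session expiration shut the executor down and create a new one. This is not Solr's code. `LeaderSketch`, `submitUpdate`, `onSessionExpired` and the event log are hypothetical names invented for illustration, and waiting for queued updates to finish (rather than discarding them) is an assumption the issue does not confirm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a leader node; none of these names come from Solr. */
class LeaderSketch {
    // Thread-safe log so worker threads can record events without taking this object's lock.
    private final List<String> events = new CopyOnWriteArrayList<>();
    private ExecutorService updateExecutor = Executors.newFixedThreadPool(2);
    private boolean leader = true;

    /** Queues an update for distribution; returns false if the executor was already stopped. */
    synchronized boolean submitUpdate(String docId) {
        try {
            updateExecutor.execute(() -> events.add("sent " + docId));
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    /** Stops accepting work and waits for queued updates (assumed policy: finish, don't discard). */
    private static void stop(ExecutorService executor) {
        executor.shutdown();
        try {
            if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    /** Stand-in for closing the ZooKeeper controller and dropping the leader role. */
    private void releaseLeadership() {
        leader = false;
        events.add("leadership released");
    }

    /** Shutdown path: the executor is stopped BEFORE leadership is released. */
    synchronized void close() {
        stop(updateExecutor);
        releaseLeadership();
    }

    /** Session expiration path: stop the old executor, then create a new one for further use. */
    synchronized void onSessionExpired() {
        stop(updateExecutor);
        releaseLeadership();
        updateExecutor = Executors.newFixedThreadPool(2);
    }

    synchronized boolean isLeader() { return leader; }

    List<String> events() { return new ArrayList<>(events); }
}
```

With this ordering no "sent ..." event can be recorded after "leadership released", which is the property the defensive state checks alone could not guarantee.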
> A leader going down while updates are coming in can cause shard inconsistency.
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-3782
>                 URL: https://issues.apache.org/jira/browse/SOLR-3782
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>             Fix For: 4.0, 5.0
>
>
> Harpoon into the head of the great whale I have been chasing for a couple weeks now.
> ChaosMonkey test was exposing this.
> Turns out the problem was the solr cmd distrib executor - when closing the leader CoreContainer, we would close the zkController while updates can still flow through the distrib executor. The result was that we would send updates from the leader briefly even though there was a new leader.
> I had suspected something similar to this at one point in the hunt and started adding some defensive state checks that we wanted to add anyway. I don't think they caught all of this issue due to the limited tightness one of the state checks can get to (checking the cloudstate leader from a replica against the leader indicated by the request).
> So the answer is to finally work out how to stop the solr cmd distrib executor - because we need to stop it before closing zkController and giving up our role as leader.
> I've worked that all out and the issue no longer seems to be a problem.