From issues-return-5324-archive-asf-public=cust-asf.ponee.io@lucene.apache.org  Mon Nov 25 04:30:02 2019
Return-Path: <issues-return-5324-archive-asf-public=cust-asf.ponee.io@lucene.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [207.244.88.153])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 18DD5180643
	for <archive-asf-public@cust-asf.ponee.io>; Mon, 25 Nov 2019 05:30:01 +0100 (CET)
Received: (qmail 1968 invoked by uid 500); 25 Nov 2019 04:30:01 -0000
Mailing-List: contact issues-help@lucene.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:issues-help@lucene.apache.org>
List-Unsubscribe: <mailto:issues-unsubscribe@lucene.apache.org>
List-Post: <mailto:issues@lucene.apache.org>
List-Id: <issues.lucene.apache.org>
Reply-To: dev@lucene.apache.org
Delivered-To: mailing list issues@lucene.apache.org
Received: (qmail 1948 invoked by uid 99); 25 Nov 2019 04:30:01 -0000
Received: from mailrelay1-us-west.apache.org (HELO mailrelay1-us-west.apache.org) (209.188.14.139)
    by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 25 Nov 2019 04:30:01 +0000
Received: from jira-he-de.apache.org (static.172.67.40.188.clients.your-server.de [188.40.67.172])
	by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id ADE0FE2C44
	for <issues@lucene.apache.org>; Mon, 25 Nov 2019 04:30:00 +0000 (UTC)
Received: from jira-he-de.apache.org (localhost.localdomain [127.0.0.1])
	by jira-he-de.apache.org (ASF Mail Server at jira-he-de.apache.org) with ESMTP id 241FE7802F0
	for <issues@lucene.apache.org>; Mon, 25 Nov 2019 04:30:00 +0000 (UTC)
Date: Mon, 25 Nov 2019 04:30:00 +0000 (UTC)
From: "Shalin Shekhar Mangar (Jira)" <jira@apache.org>
To: issues@lucene.apache.org
Message-ID: <JIRA.13269327.1574174267000.206557.1574656200145@Atlassian.JIRA>
In-Reply-To: <JIRA.13269327.1574174267000@Atlassian.JIRA>
References: <JIRA.13269327.1574174267000@Atlassian.JIRA> <JIRA.13269327.1574174267014@jira-he-de>
Subject: [jira] [Comment Edited] (SOLR-13945) SPLITSHARD data loss due to
 "rollback"
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/SOLR-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16981287#comment-16981287 ] 

Shalin Shekhar Mangar edited comment on SOLR-13945 at 11/25/19 4:29 AM:
------------------------------------------------------------------------

[~ichattopadhyaya] - the final commit was added in SOLR-4997 so that documents are visible when the sub-shard replicas come up. -It is not necessary if there is a single replica.- (Note: this commit is necessary regardless of the replication factor.)


was (Author: shalinmangar):
[~ichattopadhyaya] - the final commit was added in SOLR-4997 so that documents are visible when the sub-shard replicas come up. It is not necessary if there is a single replica.

> SPLITSHARD data loss due to "rollback"
> --------------------------------------
>
>                 Key: SOLR-13945
>                 URL: https://issues.apache.org/jira/browse/SOLR-13945
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: SOLR-13945.patch, SOLR-13945.patch, SOLR-13945.patch
>
>
> # As per SOLR-7673, there is a commit on the parent shard *after state changes* have happened, i.e. from active/construction/construction to inactive/active/active. Please see https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L586-L588
> # Due to SOLR-12509, there's now a cleanup/rollback method called "cleanupAfterFailure" in the finally block that resets the state to active/construction/construction. Please see: https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L657
> # When step 2 runs because of a failure in step 1, any documents that went into the subshards (which are already active by then) are lost once the parent becomes active again.
> If my above understanding is correct, I am wondering:
> # Why is a commit to the parent shard needed *after* the parent shard is inactive, the subshards are active and the split operation has completed?
> # This rollback looks very suspicious. If the subshards are already active and the parent is inactive, what is the need for setting them back to construction? A crucial check seems to be missing there. Also, why do we reset the subshard status to construction instead of inactive? It is extremely misleading (and, frankly, ridiculous) for any external clusterstate monitoring tool to see the subshards go from CONSTRUCTION to ACTIVE to CONSTRUCTION and then disappear.
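
The sequence described above can be sketched as a small, self-contained simulation. This is not the code in SplitShardCmd; the class name, the shard names, the map standing in for cluster state, and the "guardRollback" flag are all hypothetical and exist only for illustration. The guard shown is just one possible form of the "missing check" the description asks about, not a confirmed fix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SplitRollbackSketch {
    enum State { ACTIVE, CONSTRUCTION, INACTIVE }

    // Hypothetical stand-in for the commit on the parent shard; fails on demand.
    static void commitOnParent(boolean fail) {
        if (fail) {
            throw new IllegalStateException("simulated commit failure");
        }
    }

    // Runs a simplified split in which the post-switch commit fails.
    // guardRollback == false: unconditional rollback, as the description reads it.
    // guardRollback == true: hypothetical check that skips the rollback once
    // the sub-shards are active and the parent is inactive.
    static Map<String, State> simulate(boolean guardRollback) {
        Map<String, State> shards = new LinkedHashMap<>();
        shards.put("parent", State.ACTIVE);
        shards.put("sub_0", State.CONSTRUCTION);
        shards.put("sub_1", State.CONSTRUCTION);

        boolean success = false;
        try {
            // State switch: active/construction/construction -> inactive/active/active.
            shards.put("parent", State.INACTIVE);
            shards.put("sub_0", State.ACTIVE);
            shards.put("sub_1", State.ACTIVE);

            // Step 1: the commit happens after the state switch.
            commitOnParent(true);
            success = true;
        } catch (IllegalStateException e) {
            // The failure is swallowed here so the sketch can return the final state.
        } finally {
            boolean alreadySwitched = shards.get("parent") == State.INACTIVE
                    && shards.get("sub_0") == State.ACTIVE
                    && shards.get("sub_1") == State.ACTIVE;
            if (!success && !(guardRollback && alreadySwitched)) {
                // Step 2: rollback to active/construction/construction.
                // Documents routed to the sub-shards while they were active are
                // no longer reachable from this point on.
                shards.put("parent", State.ACTIVE);
                shards.put("sub_0", State.CONSTRUCTION);
                shards.put("sub_1", State.CONSTRUCTION);
            }
        }
        return shards;
    }

    public static void main(String[] args) {
        System.out.println("unguarded rollback: " + simulate(false));
        System.out.println("guarded rollback:   " + simulate(true));
    }
}
```

In the unguarded run the shards end up back in active/construction/construction even though the sub-shards were briefly serving traffic, which is the window in which documents can be lost. In the guarded run the states stay at inactive/active/active.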

