lucene-issues mailing list archives

From "Noble Paul (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-13945) SPLITSHARD data loss due to "rollback"
Date Tue, 19 Nov 2019 19:34:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977776#comment-16977776
] 

Noble Paul commented on SOLR-13945:
-----------------------------------

Is there any point in doing a rollback after the subshards are published as {{ACTIVE}}? They
have already started accepting writes, and a rollback can make the user believe that the shard
split failed but that there is NO DATA LOSS. It's much better to just fail and let the user
know that the split failed and that they need to take some remedial action.
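
A minimal sketch of the check suggested above, assuming a simplified model of shard states.
The names SliceState, RollbackGuard, safeToRollback and onFailure are hypothetical and are
not taken from SplitShardCmd; the point is only that cleanup would refuse to reset states
once any subshard has been published as ACTIVE.

```java
import java.util.List;

// Hypothetical model for illustration only; these are not Solr's classes.
enum SliceState { ACTIVE, INACTIVE, CONSTRUCTION }

final class RollbackGuard {
    private RollbackGuard() {}

    /**
     * Returns true only while no subshard has been published as ACTIVE,
     * i.e. while no subshard can have accepted new writes.
     */
    static boolean safeToRollback(List<SliceState> subShardStates) {
        for (SliceState s : subShardStates) {
            if (s == SliceState.ACTIVE) {
                return false;
            }
        }
        return true;
    }

    /** Names the action a cleanup step would take for the given subshard states. */
    static String onFailure(List<SliceState> subShardStates) {
        // Assumption: failing without a rollback means surfacing the error
        // to the user and leaving the published states untouched.
        return safeToRollback(subShardStates) ? "ROLLBACK" : "FAIL_WITHOUT_ROLLBACK";
    }
}
```

With both subshards still in CONSTRUCTION the sketch chooses a rollback; once either one is
ACTIVE it only reports the failure.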

> SPLITSHARD data loss due to "rollback"
> --------------------------------------
>
>                 Key: SOLR-13945
>                 URL: https://issues.apache.org/jira/browse/SOLR-13945
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: SOLR-13945.patch, SOLR-13945.patch
>
>
> # As per SOLR-7673, there is a commit on the parent shard *after state changes* have
happened, i.e. from active/construction/construction to inactive/active/active. Please see
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L586-L588
> # Due to SOLR-12509, there's now a cleanup/rollback method called "cleanupAfterFailure"
in the finally block that resets the state to active/construction/construction. Please see:
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L657
> # When step 2 is triggered by a failure in step 1, any documents that went into the
subshards (which are already active by then) are lost once the parent becomes active
again.
> If my above understanding is correct, I am wondering:
> # Why is a commit to parent shard needed *after* the parent shard is inactive, subshards
are now active and the split operation has completed?
> # This rollback looks very suspicious. If the subshards are already active and the parent
is inactive, then what is the need for setting them back to construction? It seems like a crucial
check is missing there. Also, why do we reset the subshard status back to construction instead
of inactive? It is extremely misleading (and, frankly, ridiculous) for any external clusterstate
monitoring tools to see the subshards go from CONSTRUCTION to ACTIVE to CONSTRUCTION and
then disappear.
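
The sequence in the quoted report can be walked through with a toy model. This is only a
sketch under stated assumptions: SplitTimeline and all of its members are hypothetical, a
single state stands in for both subshards, and writes are routed to whichever side is
currently ACTIVE. It does not reproduce Solr's actual routing or commit logic.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; class and method names are hypothetical and not part of Solr.
final class SplitTimeline {
    enum State { ACTIVE, INACTIVE, CONSTRUCTION }

    State parent = State.ACTIVE;
    State sub = State.CONSTRUCTION;                      // stands in for both subshards
    final List<String> parentDocs = new ArrayList<>();
    final List<String> subDocs = new ArrayList<>();      // writes received after the switch

    /** Routes a write to whichever side is currently ACTIVE. */
    void index(String doc) {
        if (sub == State.ACTIVE) {
            subDocs.add(doc);
        } else {
            parentDocs.add(doc);
        }
    }

    /** Item 1 of the report: states flip before the commit on the parent runs. */
    void switchStates() {
        parent = State.INACTIVE;
        sub = State.ACTIVE;
    }

    /** Item 2 of the report: cleanup resets the states unconditionally. */
    void rollback() {
        parent = State.ACTIVE;
        sub = State.CONSTRUCTION;
    }

    /** Documents served to searches; subshards hold the parent's data plus new writes. */
    List<String> visibleDocs() {
        List<String> visible = new ArrayList<>(parentDocs);
        if (sub == State.ACTIVE) {
            visible.addAll(subDocs);
        }
        return visible;
    }

    /** Replays the failure path and returns the documents no longer visible. */
    static List<String> lostAfterRollback() {
        SplitTimeline t = new SplitTimeline();
        t.index("doc-before-split");   // lands in the parent
        t.switchStates();
        t.index("doc-after-switch");   // lands in the subshards
        t.rollback();                  // assumed to follow a failed parent commit
        List<String> lost = new ArrayList<>(t.subDocs);
        lost.removeAll(t.visibleDocs());
        return lost;
    }
}
```

In this model, only the document indexed between the state switch and the rollback ends up
missing, which matches item 3 of the report.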



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

