lucene-dev mailing list archives

From "Shalin Shekhar Mangar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-11661) Race condition between core creation thread and recovery request from leader causes inconsistent view of documents
Date Mon, 20 Nov 2017 08:52:00 GMT

     [ https://issues.apache.org/jira/browse/SOLR-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-11661:
-----------------------------------------
    Attachment: 11458-2-MoveReplicaHDFSTest-log.txt

Full logs attached.

Dat and I analyzed the logs and we found this problem:
{code}
# A new collection called MoveReplicaHDFSTest_failed_coll is being created. New replicas core_node7 and core_node8 for shard2 are in the process of being created.
# New core MoveReplicaHDFSTest_failed_coll_shard2_replica_n4 (core_node7) tries to become leader and asks MoveReplicaHDFSTest_failed_coll_shard2_replica_n6 (core_node8) to sync
# Sync fails because core_node8 has no versions
# core_node7 becomes leader and asks core_node8 to recover
# core_node8 gets a request to recover and starts recovery thread recoveryExecutor-53-thread-1-processing-n:127.0.0.1:61049_solr
# core_node8 enters buffering state
# core_node8 sends prep recovery command to core_node7 and publishes itself in recovery state
# core_node7 has a thread in WaitForState and currently sees core_node8 as down
# At t=70388, a DataStreamer Exception is reported from DFSClient and leader core_node7 logs that it could not close the HDFS transaction log due to no more good datanodes being available -- these do not look relevant to the problem
# core_node7 (leader) publishes itself as active
# core_node7 create core is complete
# core_node8 create thread (qtp1713789948-2124) sees that there is a leader and publishes itself as active, skipping recovery
# core_node8 create core command is successful
# collection create is finished
# core_node7 remains stuck in WaitForState because from now on it only sees core_node8 as active, never as recovering
# the recovery thread in core_node8 remains waiting in prep recovery
# New documents are added to the collection but they aren't visible to searchers because core_node8 is buffering and therefore ignores commit requests
{code}

So there is a race between the core create thread, which publishes the local core as active, and the leader's request asking that same core to recover: the create thread publishes active after the leader has already asked the core to recover. This is a side effect of SOLR-9566, which skips recovery for replicas that are being created as part of a new collection.
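
To make the ordering problem concrete, here is a minimal toy model of it in plain Java. It is not Solr code: the class ReplicaStateRaceSketch, the Replica holder, and every method name are hypothetical and exist only for illustration. The "guarded" variant shows one conceivable way to avoid the overwrite, under the assumption that the create thread can see that a recovery was requested; it is not necessarily the fix applied for this issue.

```java
/** Toy model of the publish ordering described above. Not Solr code; all names are hypothetical. */
public class ReplicaStateRaceSketch {

    enum State { DOWN, RECOVERING, ACTIVE }

    /** Stand-in for the state a replica publishes to the cluster. */
    static class Replica {
        private State published = State.DOWN;
        private boolean recoveryRequested = false;

        /** Leader asked this replica to recover: remember it and publish RECOVERING. */
        synchronized void requestRecovery() {
            recoveryRequested = true;
            published = State.RECOVERING;
        }

        /** Unguarded create path: publishes ACTIVE no matter what happened before. */
        synchronized void publishActiveAfterCreate() {
            published = State.ACTIVE;
        }

        /** Guarded create path (assumption): refuse to publish ACTIVE once recovery was requested. */
        synchronized boolean publishActiveUnlessRecovering() {
            if (recoveryRequested) {
                return false;
            }
            published = State.ACTIVE;
            return true;
        }

        synchronized State published() {
            return published;
        }
    }

    /** Replays the order seen in the logs: recovery request first, then the create thread. */
    static State runUnguarded() {
        Replica replica = new Replica();
        replica.requestRecovery();          // leader asks the replica to recover
        replica.publishActiveAfterCreate(); // create thread overwrites RECOVERING
        return replica.published();
    }

    /** Same order, but the create thread checks for a pending recovery first. */
    static State runGuarded() {
        Replica replica = new Replica();
        replica.requestRecovery();
        replica.publishActiveUnlessRecovering(); // skipped, state stays RECOVERING
        return replica.published();
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + runUnguarded()); // prints "unguarded: ACTIVE"
        System.out.println("guarded: " + runGuarded());     // prints "guarded: RECOVERING"
    }
}
```

In the unguarded run the replica ends up published as ACTIVE even though it was asked to recover, which matches what the leader's WaitForState thread observes in the logs. The sketch fixes the interleaving on purpose so the outcome is deterministic; in the real system the order depends on thread timing, which is why the failure is rare.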


> Race condition between core creation thread and recovery request from leader causes inconsistent view of documents
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11661
>                 URL: https://issues.apache.org/jira/browse/SOLR-11661
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>             Fix For: 7.2, master (8.0)
>
>         Attachments: 11458-2-MoveReplicaHDFSTest-log.txt
>
>
> While testing SOLR-11458, [~ab] ran into an interesting failure which resulted in different document counts between leader and replica. The test is MoveReplicaHDFSTest on jira/solr-11458-2 branch.
> The failure is rare but reproducible on beasting:
> {code}
> reproduce with: ant test  -Dtestcase=MoveReplicaHDFSTest -Dtests.method=testNormalFailedMove -Dtests.seed=161856CB543CD71C -Dtests.slow=true -Dtests.locale=ar-SA -Dtests.timezone=US/Michigan -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
>    [junit4] FAILURE 14.2s | MoveReplicaHDFSTest.testNormalFailedMove <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: expected:<100> but was:<56>
>    [junit4]    > 	at __randomizedtesting.SeedInfo.seed([161856CB543CD71C:31134983787E4905]:0)
>    [junit4]    > 	at org.apache.solr.cloud.MoveReplicaTest.testFailedMove(MoveReplicaTest.java:305)
>    [junit4]    > 	at org.apache.solr.cloud.MoveReplicaHDFSTest.testNormalFailedMove(MoveReplicaHDFSTest.java:69)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

