lucene-dev mailing list archives

From "Cao Manh Dat (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-12066) Autoscaling move replica can cause core initialization failure on the original JVM
Date Wed, 28 Mar 2018 04:24:00 GMT

     [ https://issues.apache.org/jira/browse/SOLR-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cao Manh Dat updated SOLR-12066:
--------------------------------
    Attachment: SOLR-12066.patch

> Autoscaling move replica can cause core initialization failure on the original JVM
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-12066
>                 URL: https://issues.apache.org/jira/browse/SOLR-12066
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: AutoScaling, SolrCloud
>            Reporter: Varun Thacker
>            Priority: Major
>             Fix For: 7.4, master (8.0)
>
>         Attachments: SOLR-12066.patch
>
>
> When SOLR-12047 was created, it initially looked like waiting only 3 seconds for a state in ZK was the culprit for cores not loading.
>  
> But it turns out to be something else. Here are the steps to reproduce the problem:
>  
>  - create a 3 node cluster
>  - create a 1 shard X 2 replica collection to use node1 and node2 ( [http://localhost:8983/solr/admin/collections?action=create&name=test_node_lost&numShards=1&nrtReplicas=2&autoAddReplicas=true] )
>  - stop node2: ./bin/solr stop -p 7574
>  - Solr will create a new replica on node3 after 30 seconds because of the ".auto_add_replicas" trigger
>  - At this point state.json has info about replicas being on node1 and node3
>  - Start node2. Bam!
> {code:java}
> java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
> ...
> Caused by: org.apache.solr.common.SolrException: Unable to create core [test_node_lost_shard1_replica_n2]
> at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1053)
> ...
> Caused by: org.apache.solr.common.SolrException: 
> at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1619)
> at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1030)
> ...
> Caused by: org.apache.solr.common.SolrException: coreNodeName core_node4 does not exist in shard shard1: DocCollection(test_node_lost//collections/test_node_lost/state.json/12)={
> ...{code}
>  
> The practical effect of this is small, since the move replica operation has already placed the replica on another JVM. But to the user it is very confusing what is happening. They can never get rid of this error unless they manually clean up the data directory on node2 and restart it.
>  
> Please note: I chose autoAddReplicas=true to reproduce this, but a user could be using a node lost trigger and run into the same issue.
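
The failure described above can be modeled in a few lines. This is only a toy sketch of the scenario, not Solr code: the class StaleCoreSketch, its methods, and the replica names core_node3 and core_node6 are made up for illustration. Only core_node4, shard1 and the error text come from the trace. It assumes, per the steps above, that state.json no longer lists node2's replica by the time node2 restarts, while the core left on node2's disk still carries its old coreNodeName.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy model of the failure: a core left on disk names a replica that the
 * cluster state no longer lists. Not Solr code; all names are hypothetical.
 */
class StaleCoreSketch {

    /** Simplified stand-in for shard1 in state.json: coreNodeName -> node. */
    static Map<String, String> initialShard1Replicas() {
        Map<String, String> replicas = new LinkedHashMap<>();
        replicas.put("core_node3", "node1"); // assumed name for node1's replica
        replicas.put("core_node4", "node2"); // name taken from the stack trace
        return replicas;
    }

    /** Mimics the startup check: the local coreNodeName must exist in the shard. */
    static void preRegister(String coreNodeName, String shard,
                            Map<String, String> replicas) {
        if (!replicas.containsKey(coreNodeName)) {
            throw new IllegalStateException(
                "coreNodeName " + coreNodeName + " does not exist in shard " + shard);
        }
    }

    /** Replays the steps; returns the error seen when node2 restarts, or null. */
    static String replay() {
        Map<String, String> replicas = initialShard1Replicas();

        // node2 stops; the trigger places a replacement on node3 and the
        // cluster state ends up listing only node1 and node3.
        replicas.remove("core_node4");
        replicas.put("core_node6", "node3"); // assumed name for the new replica

        // node2 restarts and finds the old core, still named core_node4, on disk.
        try {
            preRegister("core_node4", "shard1", replicas);
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(replay());
    }
}
```

In this model the error recurs on every restart until the stale core is removed from node2's disk, which matches the manual cleanup mentioned in the description.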



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

