lucene-solr-user mailing list archives

From Jeff Wu <wuhai...@gmail.com>
Subject Re: solr4.7: leader core does not elected to other active core after solr OS shutdown, known issue?
Date Mon, 21 Sep 2015 17:42:27 GMT
Hi Shai, still the same question: the other peer cores, which are active,
did not claim leadership even after a long time. However, some of the peer
cores did claim leadership earlier, while the server was stopping. Those
results are inconsistent.
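
To make the inconsistency concrete, here is a minimal sketch (Python, standard
library only) of the check we do by eye on clusterstate.json: list every
replica that is still flagged "leader" while its state is not "active". The
function name stale_leaders is hypothetical, the collection -> "shards" ->
shard -> "replicas" nesting is assumed from how our 4.7 clusterstate.json
looks, the collection name "collection2" is inferred from the core names, and
the sample is trimmed to the fields the check reads.

```python
import json

# Trimmed copy of the shard53 entry quoted in this thread. The outer
# "collection2" / "shards" wrapper is an assumption about the file layout.
SAMPLE = """
{"collection2": {"shards": {"shard53": {
    "range": "26660000-2998ffff",
    "state": "active",
    "replicas": {
        "core_node102": {
            "state": "down",
            "base_url": "https://solr02.myhost/solr",
            "core": "collection2_shard53_replica1",
            "leader": "true"},
        "core_node104": {
            "state": "active",
            "base_url": "https://solr04.myhost/solr",
            "core": "collection2_shard53_replica2"}}}}}}
"""


def stale_leaders(clusterstate):
    """Return (collection, shard, replica, core) for each replica that is
    flagged leader although its state is not "active"."""
    stale = []
    for coll_name, coll in clusterstate.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                # clusterstate.json stores the flag as the string "true"
                is_leader = replica.get("leader") == "true"
                if is_leader and replica.get("state") != "active":
                    stale.append((coll_name, shard_name, replica_name,
                                  replica.get("core")))
    return stale


if __name__ == "__main__":
    for entry in stale_leaders(json.loads(SAMPLE)):
        print(entry)
```

Running this against snapshots of clusterstate.json taken before and after the
ZK session timeout would show which shards never re-elected a leader.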

2015-09-21 10:52 GMT-04:00 Shai Erera <serera@gmail.com>:

> I don't think the process Shalin describes applies to clusterstate.json.
> That JSON object reflects the status Solr "knows" about, or "last known
> status". When Solr is properly shut down, I believe those attributes are
> cleared from clusterstate.json, and the leaders give up their lease as well.
>
> However, when Solr is killed, ZK takes the 30 seconds or so of session
> timeout to delete the ephemeral node and release the leader lease. ZK is unaware of
> Solr's clusterstate.json and cannot update the 'leader' property to false.
> It simply releases the lease, so that other cores may claim it.
>
> Perhaps that explains the confusion?
>
> Shai
>
> On Mon, Sep 21, 2015 at 4:36 PM, Jeff Wu <wuhaijie@gmail.com> wrote:
>
> > Hi Shalin,  thank you for the response.
> >
> > We waited longer than the ZK session timeout, and it still did not kick
> > off any leader election for these "remained down-leader" cores.
> > That's the question I'm actually asking.
> >
> > Our test scenario:
> >
> > Each Solr server has 64 cores, all of them active and all of them leaders.
> > Shut down the Linux OS.
> > Monitor clusterstate.json in ZK until well past the ZK session timeout.
> > We noticed that leader election happened for some cores, but some down
> > cores still remained leader.
> >
> > 2015-09-21 9:15 GMT-04:00 Shalin Shekhar Mangar <shalinmangar@gmail.com>:
> >
> > > Hi Jeff,
> > >
> > > The leader election relies on ephemeral nodes in Zookeeper to detect
> > > when leader or other nodes have gone down (abruptly). These ephemeral
> > > nodes are automatically deleted by ZooKeeper after the ZK session
> > > timeout which is by default 30 seconds. So if you kill a node then it
> > > can take up to 30 seconds for the cluster to detect it and start a new
> > > leader election. This won't be necessary during a graceful shutdown
> > > because on shutdown the node will give up leader position so that a
> > > new one can be elected. You could tune the zk session timeout to a
> > > lower value but then it makes the cluster more sensitive to GC pauses
> > > which can also trigger new leader elections.
> > >
> > > On Mon, Sep 21, 2015 at 5:55 PM, Jeff Wu <wuhaijie@gmail.com> wrote:
> > > > Our environment still runs Solr 4.7. Recently we noticed the following
> > > > in a test. When we stopped one Solr server (solr02, via an OS shutdown),
> > > > all the cores of solr02 were shown as "down", but a few of them still
> > > > remained leaders. After that, we quickly saw that all the other servers
> > > > were still sending requests to the down server, and therefore we saw a
> > > > lot of threads waiting on TCP in the thread pools of the other Solr
> > > > servers, since solr02 was already down.
> > > >
> > > > "shard53":{
> > > >         "range":"26660000-2998ffff",
> > > >         "state":"active",
> > > >         "replicas":{
> > > >           "core_node102":{
> > > >             "state":"down",
> > > >             "base_url":"https://solr02.myhost/solr",
> > > >             "core":"collection2_shard53_replica1",
> > > >             "node_name":"https://solr02.myhost_solr",
> > > >             "leader":"true"},
> > > >           "core_node104":{
> > > >             "state":"active",
> > > >             "base_url":"https://solr04.myhost/solr",
> > > >             "core":"collection2_shard53_replica2",
> > > >             "node_name":"https://solr04.myhost/solr_solr"}}},
> > > >
> > > > Is this a known bug in 4.7 that was fixed later on? Is there a JIRA
> > > > we can refer to? If the Solr service is stopped gracefully, we can see
> > > > leader election happen and leadership switch to another active core.
> > > > But if we shut down the OS of a Solr server directly, we can reproduce
> > > > in our environment that some "down" cores remain "leader" in the ZK
> > > > clusterstate.json.
> > >
> > >
> > >
> > > --
> > > Regards,
> > > Shalin Shekhar Mangar.
> > >
> >
>



-- 
Jeff Wu
---------------------------
CSDL Beijing, China
