lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: Zk and Solr Cloud
Date Fri, 02 Oct 2015 15:58:37 GMT

Unless nodes are going up and down or otherwise changing state, ZooKeeper
isn't involved in the normal operations of Solr (adding docs,
querying, all that). That said, things that change the state of the
Solr nodes _do_ involve ZooKeeper and the Overseer. The Overseer
serializes and controls changes to the information in
clusterstate.json (or state.json) and other ZK data; if the nodes all
tried to write to ZK directly, coordination would be hard. That's a
little simplistic, but maybe this will help.

When a Solr instance starts up, it:
1> registers itself as live with ZK
2> creates a listener that ZK pings when there's a state change (some
node goes up or down, goes into recovery, gets added, whatever).
3> gets the current cluster state from ZK.

Thereafter, this particular node doesn't need to ask ZK for anything.
It knows the current topology of the cluster and can route requests
(index or query) to the correct Solr replica, etc.

Now, let's say "something changes". Solr stops on one of the nodes. Or
someone adds a collection. Or... The Overseer usually gets involved in
recording the new state in ZK. As part of that, ZK sends an event to
all the Solr nodes that have registered themselves as listeners, which
causes them to ask ZK for the current state of the cluster, and each
Solr node adjusts its behavior based on that information. Note that
the kind of thing that changes and triggers this is a whole replica
becoming able or unable to carry out its functions, NOT some
collection getting another doc added or answering a query.
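The register/watch/re-fetch cycle described above can be sketched as a toy
model. This is plain Python standing in for both ZooKeeper and Solr; every
class and name here is invented for illustration, not real Solr or ZK code:

```python
# Toy model of the interaction described above: nodes register as live,
# set a watch, cache the cluster state locally, and only re-fetch it
# when ZK notifies them of a change.

class ToyZk:
    """Stands in for ZooKeeper: holds state and notifies watchers on change."""
    def __init__(self):
        self.cluster_state = {"live_nodes": set()}
        self.watchers = []

    def register_live(self, node):
        self.cluster_state["live_nodes"].add(node.name)
        self.watchers.append(node)          # 1> register as live, 2> listen
        self._notify()

    def set_state(self, key, value):
        self.cluster_state[key] = value     # e.g. a replica goes into recovery
        self._notify()

    def _notify(self):
        for node in self.watchers:          # ZK fires an event at every watcher
            node.on_state_change(self)


class ToySolrNode:
    def __init__(self, name):
        self.name = name
        self.cached_state = None            # 3> local copy of the cluster state

    def on_state_change(self, zk):
        self.cached_state = dict(zk.cluster_state)   # re-fetch on notification

    def handle_query(self):
        # Normal operations use only the cached state -- no ZK round trip.
        return sorted(self.cached_state["live_nodes"])


zk = ToyZk()
a, b = ToySolrNode("solr-a"), ToySolrNode("solr-b")
zk.register_live(a)
zk.register_live(b)
print(a.handle_query())   # answered entirely from the local cache
```

The point of the sketch is the last line: once the state is cached, queries
and updates never talk to ZK; only the `_notify` path does.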

ZK also periodically pings each Solr instance that's registered itself
and, if the node fails to respond, may force it into recovery, etc.
Again, though, that has nothing to do with standard Solr operations.

So a massive Overseer queue tends to indicate that there are a LOT of
state changes: lots of nodes going up and down, etc. One implication
of the above is that if you turn on all your nodes in a large cluster
at the same time, there'll be a LOT of activity; they'll all register
themselves, try to elect leaders for shards, go into/out of recovery,
and become active, and all of these trigger Overseer activity.

Or there are simply bugs in how the Overseer works in the version
you're using; I know there's been a lot of effort to harden that area
over the various versions.

Two things that are "interesting".
1> Only one of your Solr instances hosts the Overseer. If you're doing
a restart of _all_ your boxes, it's advisable to bounce the node
that's the Overseer _last_. Otherwise you risk an odd situation: the
Overseer is elected and starts to work, that node restarts, which
causes the Overseer role to switch to another node, which is
immediately bounced, and a new Overseer is elected, and so on.
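Planning the restart order is simple once you know which node currently
holds the Overseer role (the Collections API's OVERSEERSTATUS action
reports it). A minimal sketch, with invented node names:

```python
# Order a rolling restart so the current Overseer node is bounced last.
# The overseer id would come from the Collections API's OVERSEERSTATUS
# action; here it is simply passed in.

def restart_order(nodes, overseer):
    """Return the nodes with the overseer moved to the end."""
    others = [n for n in nodes if n != overseer]
    return others + [overseer]

print(restart_order(["solr1:8983", "solr2:8983", "solr3:8983"], "solr2:8983"))
# → ['solr1:8983', 'solr3:8983', 'solr2:8983']
```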

2> As of 5.x, there are two ZK formats:
a> the "old" format, where the entire cluster state for all collections
is kept in a single ZK node (/clusterstate.json)
b> the "new" format, where each collection has its own state.json that
contains only the state for that collection.

This is very helpful when you have many collections. In the <a> case,
any time _any_ node changes, _all_ nodes have to get the new state. In
<b>, only the nodes hosting a given collection need to get new
information when a node in _that_ collection changes.

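The difference in notification fan-out can be put in rough numbers. A
sketch, where the collection-to-node mapping is entirely invented:

```python
# Rough fan-out comparison: with the shared /clusterstate.json (format a),
# every node is notified of every change; with per-collection state.json
# (format b), only the nodes hosting the changed collection are.

hosting = {                      # invented collection -> nodes mapping
    "products": {"solr1", "solr2"},
    "logs":     {"solr3", "solr4", "solr5"},
}
all_nodes = set().union(*hosting.values())

def notified(fmt, changed_collection):
    if fmt == "a":
        return all_nodes                    # everyone re-reads the state
    return hosting[changed_collection]      # only that collection's hosts

print(len(notified("a", "products")))   # 5 nodes notified
print(len(notified("b", "products")))   # 2 nodes notified
```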

On Fri, Oct 2, 2015 at 8:03 AM, Ravi Solr <> wrote:
> Awesome nugget, Shawn. I also faced a similar issue a while ago while I was
> doing a full re-index. It would be great if such tips were added to FAQ-type
> documentation on cwiki. I love the SOLR forum; every day I learn
> something new :-)
> Thanks
> Ravi Kiran Bhaskar
> On Fri, Oct 2, 2015 at 1:58 AM, Shawn Heisey <> wrote:
>> On 10/1/2015 1:26 PM, Rallavagu wrote:
>> > Solr 4.6.1 single shard with 4 nodes. Zookeeper 3.4.5 ensemble of 3.
>> >
>> > I see the following errors in ZK and Solr, and they are connected.
>> >
>> > When I see the following error in Zookeeper,
>> >
>> > unexpected error, closing socket connection and attempting reconnect
>> > Packet len11823809 is out of range!
>> This is usually caused by the overseer queue (stored in zookeeper)
>> becoming extraordinarily huge, because it's being flooded with work
>> entries far faster than the overseer can process them.  This causes the
>> znode where the queue is stored to become larger than the maximum size
>> for a znode, which defaults to about 1MB.  In this case (reading your
>> log message that says len11823809), something in zookeeper has gotten to
>> be 11MB in size, so the zookeeper client cannot read it.
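A quick sanity check of the numbers in that log message against
ZooKeeper's default jute.maxbuffer of 1048575 bytes (just under 1 MB):

```python
# Check the reported packet length against ZooKeeper's default limit.

packet_len = 11823809            # from "Packet len11823809 is out of range!"
default_max = 1048575            # ZooKeeper's default jute.maxbuffer (bytes)

print(round(packet_len / 2**20, 1))   # ≈ 11.3 MB
print(packet_len > default_max)       # True -> the client refuses to read it
```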
>> I think the zookeeper server code must be handling the addition of
>> children to the queue znode through a code path that doesn't pay
>> attention to the maximum buffer size, just goes ahead and adds it,
>> probably by simply appending data.  I'm unfamiliar with how the ZK
>> database works, so I'm guessing here.
>> If I'm right about where the problem is, there are two workarounds to
>> your immediate issue.
>> 1) Delete all the entries in your overseer queue using a zookeeper
>> client that lets you edit the DB directly.  If you haven't changed the
>> cloud structure and all your servers are working, this should be safe.
>> 2) Set the jute.maxbuffer system property on the startup commandline for
>> all ZK servers and all ZK clients (Solr instances) to a size that's
>> large enough to accommodate the huge znode.  In order to do the deletion
>> mentioned in option 1 above, you might need to increase jute.maxbuffer on
>> the servers and the client you use for the deletion.
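For workaround 2, the value just has to be comfortably larger than the
observed packet. One way to pick one (a sketch; the `-D` flag syntax is
standard JVM, but check your own startup scripts for where it belongs):

```python
# Pick a jute.maxbuffer value comfortably above the observed packet size
# and print the JVM flag to add to every ZK server and Solr instance.

observed = 11823809                        # packet length from the ZK log
suggested = 1 << observed.bit_length()     # next power of two above it

print(f"-Djute.maxbuffer={suggested}")     # -Djute.maxbuffer=16777216
```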
>> These are just workarounds.  Whatever caused the huge queue in the first
>> place must be addressed.  It is frequently a performance issue.  If you
>> go to the following link, you will see that jute.maxbuffer is considered
>> an unsafe option:
>> In Jira issue SOLR-7191, I wrote the following in one of my comments:
>> "The giant queue I encountered was about 850000 entries, and resulted in
>> a packet length of a little over 14 megabytes. If I divide 850000 by 14,
>> I know that I can have about 60000 overseer queue entries in one znode
>> before jute.maxbuffer needs to be increased."
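The back-of-the-envelope math in that comment checks out; dividing the
entry count by the packet size in megabytes gives the per-MB capacity:

```python
# Shawn's estimate: 850000 queue entries produced a ~14 MB packet, so
# roughly how many entries fit under the default ~1 MB jute.maxbuffer?

entries = 850_000
packet_mb = 14                   # "a little over 14 megabytes"

print(entries // packet_mb)      # 60714, i.e. about 60000 entries per MB
```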
>> Thanks,
>> Shawn
