zookeeper-user mailing list archives

From Jim Keeney <...@fitterweb.com>
Subject Re: Ensemble fails when one node loses connectivity
Date Fri, 02 Mar 2018 02:59:03 GMT
Steph -

I read about jute.maxbuffer and am fairly sure it explains the behavior we
are seeing, since it occurs after a significant reboot of all the servers.
We have over 2 MB of config files across all of our indexes, and if all the
Solr nodes sync their configs at once, it seems that could overflow the
buffer.
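For a quick sanity check on that arithmetic (a sketch: the 2 MB figure is from above, and 0xfffff bytes is ZooKeeper's documented default for jute.maxbuffer):

```shell
# ZooKeeper's default jute.maxbuffer is 0xfffff bytes (1048575, just under 1 MiB).
default=$((0xFFFFF))
payload=$((2 * 1024 * 1024))   # the ~2 MB of index configs mentioned above
if [ "$payload" -gt "$default" ]; then
  echo "payload exceeds default jute.maxbuffer"
fi
```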

Newbie question: where would I set -Djute.maxbuffer? Should I update
the zkServer.sh file so it is applied every time ZooKeeper is started or
restarted?

Also, I noted the caution and will make sure that all of the nodes are set
to the same value. I saw some discussion about having to set zkCli's value
larger than the server's. Is that true?
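For what it's worth, a java.env sketch (assuming the stock scripts, where zkEnv.sh sources conf/java.env on both server and client start; the 4 MB value is only an example, not a recommendation):

```shell
# conf/java.env -- picked up by zkEnv.sh, so it survives edits to zkServer.sh
# 4194304 = 4 MB; an example value, size it to your largest znode payload
SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=4194304"
# zkCli.sh reads CLIENT_JVMFLAGS; keep the client at least as large as the server
CLIENT_JVMFLAGS="$CLIENT_JVMFLAGS -Djute.maxbuffer=4194304"
```

Since java.env is read on every start, this applies on each restart without touching zkServer.sh itself.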

Thanks in advance.

Jim K.

On Thu, Mar 1, 2018 at 9:13 PM, Jim Keeney <jim@fitterweb.com> wrote:

> Thanks, Yes, I have about 2MB stored in the configurations folders. I will
> increase the jute.maxbuffer and see if that helps.
>
> Jim K.
>
> On Thu, Mar 1, 2018 at 8:58 PM, Steph van Schalkwyk <
> svanschalkwyk@gmail.com> wrote:
>
>> Does the log say anything about timing out on init?
>> Your initLimit is already pretty big, but then we don't know anything
>> about
>> your setup.
>> Are you storing more than 1MB in a znode? Then increase jute.maxbuffer (in
>> java.env as a -Djute.maxbuffer=xxxxxx).
>> I've recently run into that with Fusion 3.1.
>> Post more details, if you would.
>> Good luck.
>> Steph
>>
>>
>> On Thu, Mar 1, 2018 at 7:43 PM, Jim Keeney <jim@fitterweb.com> wrote:
>>
>> > I'm using ZooKeeper with Solr to create a cluster and I have come across
>> > what seems like unexpected behavior. The cluster is set up on AWS using
>> > OpsWorks. I am using a 3-node ZooKeeper ensemble. The ZooKeeper config
>> > on all three nodes is:
>> >
>> > clientPort=2181
>> > dataDir=/var/opt/zookeeper/data
>> > tickTime=2000
>> > autopurge.purgeInterval=24
>> > initLimit=100
>> > syncLimit=5
>> > server.1=172.31.86.130:2888:3888
>> > server.2=172.31.16.234:2888:3888
>> > server.3=172.31.73.122:2888:3888
>> >
>> >
>> > Here is the issue:
>> >
>> > If one node in the ensemble fails or is shut down, the ensemble carries
>> > on. However, when the node is restarted, its attempts to reconnect to
>> > the other members of the cluster are rejected. The only way I have found
>> > to restore the ensemble is to restart all of the nodes within a short
>> > time span of each other.
>> >
>> > If I do that, they are able to discover each other, carry out a proper
>> > leader election, and restore order.
>> >
>> > Once they are restored everything is fine, but if one of the nodes goes
>> > down we are faced with the same problem.
>> >
>> > How do I ensure that if a node goes down, it can restart and rejoin the
>> > ensemble without having to manually restart all the other nodes?
>> >
>> > Any help appreciated.
>> >
>> > Thanks.
>> >
>> > Jim K.
>> >
>> >
>> >
>> >
>> > --
>> > Jim Keeney
>> > President, FitterWeb
>> > E: jim@fitterweb.com
>> > M: 703-568-5887
>> >
>> > *FitterWeb Consulting*
>> > *Are you lean and agile enough? *
>> >
>>
>
>
>



