helix-user mailing list archives

From Zhen Zhang <nehzgn...@gmail.com>
Subject Re: Question about spectator behavior whenever it is under zookeeper flapping
Date Fri, 27 Jun 2014 20:36:24 GMT
Hi Hang, may I know why the connections between router and zookeeper are
flapping? Is it caused by GC on routers?


On Fri, Jun 27, 2014 at 11:40 AM, kishore g <g.kishore@gmail.com> wrote:

> Hi Hang,
> Good point, I agree that the handling of flapping should be different
> based on the role. For now, we have focused on the participant, but as you
> have explained, it's not the right thing to do for a spectator.
> Keeping the latest information is the right thing to do in the spectator. We
> should probably create a JIRA and go over the possible solutions.
> So there are a couple of things we need to decide:
> -- keep the latest information
> -- retry connecting to Zookeeper
> -- how to provide a callback to the client if they need custom logic.
> Polling HelixManager.isConnected should work, but it's possible to miss the
> event. For example, if your polling interval is 10 seconds and the disconnect
> and reconnect both happen within that interval, the client may not notice.
> Ideally we want to avoid clients understanding the Zookeeper
> state/internals. In the long term this will allow us to plug in a different
> backend for storing state information.
> Thanks,
> Kishore G
> On Fri, Jun 27, 2014 at 11:14 AM, Hang Qi <hangq.1985@gmail.com> wrote:
>> Hi folks,
>> We are using helix 0.6.3 to build our storage system, and our clients
>> rely on the spectator to route traffic to the corresponding node.
>> It works very well. However, we recently hit an issue where almost all
>> the clients failed to route traffic, and the log shows:
>> ERROR org.apache.helix.manager.zk.ZKHelixManager) - instanceName:
>> hostname is flapping. diconnect it.  maxDisconnectThreshold: 5 disconnects
>> in 300000ms.
>> Looking at the code, there is a flapping detection mechanism in
>> ZKHelixManager. When zookeeper flaps, the manager disconnects itself,
>> and the disconnect() method calls resetHandlers, which resets the
>> routingTableProvider, so the routingTable becomes empty.
>> Browsing the jira, I found that this feature was introduced by
>> helix-31 and helix-32. I like the idea of detecting zookeeper flapping
>> and disconnecting when it happens for the participant and controller;
>> that makes the whole cluster more stable.
>> However, from the spectator's perspective, I think the more reasonable
>> behavior is to keep using the most recent state it got from zookeeper,
>> even if zookeeper is down. Besides, it should keep retrying to connect
>> to zookeeper, or provide some callback to let the client know. What
>> do you think?
>> So my question is, what is the most practical way to handle this in
>> the client? Currently we work around it by increasing the value of
>> helixmanager.maxDisconnectThreshold. Is there any callback I could register
>> to get notified about the disconnect event? Does polling
>> HelixManager#isConnected work?
>> Thanks
>> Hang Qi
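
To make the threshold in the log message above concrete ("5 disconnects in 300000ms"), here is a minimal sketch of a sliding-window flapping check. It is only an illustration and not the actual ZKHelixManager code: the class name FlappingDetector, its method names, and the choice to trip on strictly more than the threshold are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window flapping detector; NOT Helix's implementation. */
final class FlappingDetector {
    private final int maxDisconnects;   // e.g. 5, as in the log message
    private final long windowMs;        // e.g. 300000, as in the log message
    private final Deque<Long> disconnectTimes = new ArrayDeque<>();

    FlappingDetector(int maxDisconnects, long windowMs) {
        this.maxDisconnects = maxDisconnects;
        this.windowMs = windowMs;
    }

    /**
     * Records a disconnect at time nowMs and returns true if more than
     * maxDisconnects disconnects fall inside the trailing window.
     * (Assumption: "more than", not "at least".)
     */
    synchronized boolean recordDisconnect(long nowMs) {
        disconnectTimes.addLast(nowMs);
        // Evict disconnects that are older than the window.
        while (!disconnectTimes.isEmpty()
                && disconnectTimes.peekFirst() + windowMs < nowMs) {
            disconnectTimes.removeFirst();
        }
        return disconnectTimes.size() > maxDisconnects;
    }
}
```

With this shape, raising the threshold (the workaround described above) only makes the detector trip later; it does not change what happens to the routing table once it trips.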

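The point about polling missing a quick disconnect/reconnect can be sketched as follows. The idea is that a poller comparing a monotonically increasing disconnect counter notices a flap that a boolean check would miss. ConnectionTracker and its methods are hypothetical names for illustration; they are not part of the Helix API, and how onDisconnected()/onConnected() would be wired to ZooKeeper state changes is left open.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical tracker; NOT part of the Helix API. */
final class ConnectionTracker {
    private volatile boolean connected = true;
    private final AtomicLong disconnectCount = new AtomicLong();

    /** Called by whatever observes the connection going down (assumed hook). */
    void onDisconnected() {
        connected = false;
        disconnectCount.incrementAndGet();
    }

    /** Called by whatever observes the connection coming back (assumed hook). */
    void onConnected() {
        connected = true;
    }

    /** Boolean view, analogous to polling an isConnected-style method. */
    boolean isConnected() {
        return connected;
    }

    /** Monotonic counter; a change between two polls means a flap occurred. */
    long disconnectCount() {
        return disconnectCount.get();
    }
}
```

A poller would remember the counter value from its previous run and compare it with the current one; if the values differ, at least one disconnect happened in between, even when isConnected() reads true at both polls.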