helix-user mailing list archives

From kishore g <g.kish...@gmail.com>
Subject Re: Question about spectator behavior whenever it is under zookeeper flapping
Date Fri, 27 Jun 2014 18:40:26 GMT
Hi Hang,

Good point. I agree that flapping should be handled differently depending
on the role. So far we have focused on the participant, but as you have
explained, that is not the right behavior for a spectator.

Keeping the latest information is the right thing to do in the spectator. We
should probably create a JIRA and go over the possible solutions.

So there are a couple of things we need to decide:
-- Keep the latest information.
-- Retry connecting to ZooKeeper.
-- How do we provide a callback to clients that need custom logic?
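
To make the first and third items concrete, here is a minimal sketch of a client-side cache that keeps the last known routing view across a disconnect and notifies registered listeners. Everything in it (StickyRoutingCache, ConnectionListener, onUpdate, the partition and node names) is hypothetical and for illustration only; none of it is existing Helix API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

public class StickyRoutingCache {
    /** Hypothetical callback interface; not part of the Helix API. */
    public interface ConnectionListener {
        void onConnectionChange(boolean connected);
    }

    private volatile Map<String, String> lastKnown = Map.of();
    private volatile boolean connected = false;
    private final List<ConnectionListener> listeners = new CopyOnWriteArrayList<>();

    public void addListener(ConnectionListener listener) {
        listeners.add(listener);
    }

    /** Called when a fresh partition-to-instance view arrives from the store. */
    public void onUpdate(Map<String, String> partitionToInstance) {
        lastKnown = Map.copyOf(partitionToInstance);
    }

    /** Called on connect or disconnect; the cached view is kept, not cleared. */
    public void onConnectionChange(boolean nowConnected) {
        connected = nowConnected;
        for (ConnectionListener listener : listeners) {
            listener.onConnectionChange(nowConnected);
        }
    }

    /** Returns the last known instance for a partition, or null if never seen. */
    public String lookup(String partition) {
        return lastKnown.get(partition);
    }

    /** True while disconnected, meaning lookups may return outdated data. */
    public boolean isStale() {
        return !connected;
    }

    public static void main(String[] args) {
        StickyRoutingCache cache = new StickyRoutingCache();
        cache.addListener(c -> System.out.println("connected=" + c));
        cache.onConnectionChange(true);
        cache.onUpdate(Map.of("partition_0", "node1"));
        cache.onConnectionChange(false); // simulated flap
        // The route survives the disconnect and is flagged as stale.
        System.out.println(cache.lookup("partition_0") + " stale=" + cache.isStale());
    }
}
```

The listener only sees "connected" or "disconnected", so a client written against it would not need to know anything about ZooKeeper session states.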

Polling HelixManager.isConnected should work, but it is possible to miss the
event. For example, if your polling interval is 10 seconds and the disconnect
and reconnect both happen within that interval, the client may not notice.
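
A small simulation of that gap, as a sketch only: the LongPredicate below is a made-up stand-in for the result of HelixManager.isConnected at a given time, and the 3-second outage is an assumed example, not data from a real cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

public class PollingMissesFlap {
    /**
     * Samples connectedAt every intervalMs (starting at t=0, up to endMs)
     * and returns the sample times at which the observed state changed.
     */
    public static List<Long> pollTransitions(LongPredicate connectedAt,
                                             long intervalMs, long endMs) {
        List<Long> seen = new ArrayList<>();
        boolean last = connectedAt.test(0);
        for (long t = intervalMs; t <= endMs; t += intervalMs) {
            boolean now = connectedAt.test(t);
            if (now != last) {
                seen.add(t);
                last = now;
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Simulated connection: down from t=12s until t=15s, up otherwise.
        LongPredicate flap = t -> t < 12_000 || t >= 15_000;

        // 10-second polling samples at 10s, 20s, ... and never sees the outage.
        System.out.println(pollTransitions(flap, 10_000, 60_000)); // prints []

        // 1-second polling catches both the disconnect and the reconnect.
        System.out.println(pollTransitions(flap, 1_000, 60_000)); // prints [12000, 15000]
    }
}
```

Shortening the interval narrows the window but never closes it, which is why a pushed callback is the more reliable option.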

Ideally we want to avoid clients having to understand the ZooKeeper
state/internals. In the long term this will allow us to plug in a different
backend for storing state information.

Kishore G

On Fri, Jun 27, 2014 at 11:14 AM, Hang Qi <hangq.1985@gmail.com> wrote:

> Hi folks,
> We are using helix 0.6.3 to build our storage system, and our clients rely
> on the spectator to route traffic to corresponding node.
> It works very well. However, we recently hit an issue where almost
> all the clients fail to route traffic, and the log shows:
> ERROR org.apache.helix.manager.zk.ZKHelixManager) - instanceName: hostname
> is flapping. diconnect it.  maxDisconnectThreshold: 5 disconnects in
> 300000ms.
> Looking at the code, there is a flapping detection mechanism in
> ZKHelixManager. When ZooKeeper flaps, the manager disconnects itself, and
> disconnect() in turn calls resetHandlers, which resets the
> routingTableProvider, so the routing table becomes empty.
> When browsing the JIRA, I found that this feature was introduced by
> helix-31 and helix-32. I like the idea of detecting ZooKeeper flapping and
> disconnecting when it happens for participants and controllers; that makes
> the whole cluster more stable.
> However, from the spectator's perspective, I think the more reasonable
> behavior is to keep using the most recent state read from ZooKeeper even
> if ZooKeeper is down. Besides, it should keep retrying to
> connect to ZooKeeper, or provide some callback to let the client know. What
> do you think?
> So my question is: what is the most practical way to handle this in the
> client? Currently our workaround is to increase the value of
> helixmanager.maxDisconnectThreshold. Is there any callback I could register
> to get notified about the disconnect event? Does polling
> HelixManager#isConnected work?
> Thanks
> Hang Qi
