helix-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: Excessive ZooKeeper load
Date Tue, 03 Feb 2015 04:17:07 GMT
My total external view across all resources is roughly 3M in size, and there
are 100 clients downloading it twice for every node restart. That's 600M of
data for every restart, so I guess that is what is causing this issue. We are
thinking of doing some tricks to cut the # of clients from 100 to 1, which I
guess should help significantly.
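
The arithmetic behind those numbers, as a rough sketch. The class and method names are made up for illustration, and "3M" is taken to mean about 3,000,000 bytes, which is an assumption:

```java
// Back-of-the-envelope estimate of ZooKeeper read traffic per node restart.
// All names here are hypothetical; the figures are the ones quoted above.
final class ExternalViewTraffic {

    // Bytes pulled from ZooKeeper for one node restart, assuming every client
    // re-reads the full external view on each of its reads.
    static long bytesPerRestart(long externalViewBytes, int clients, int readsPerRestart) {
        return externalViewBytes * clients * readsPerRestart;
    }

    public static void main(String[] args) {
        long viewBytes = 3_000_000L; // assumption: "3M" ~ 3,000,000 bytes
        int readsPerRestart = 2;     // each client downloads the view twice per restart

        long with100 = bytesPerRestart(viewBytes, 100, readsPerRestart);
        long with1 = bytesPerRestart(viewBytes, 1, readsPerRestart);

        System.out.println("100 clients: " + with100 / 1_000_000 + "M per restart");
        System.out.println("1 client:    " + with1 / 1_000_000 + "M per restart");
    }
}
```

With the figures above this comes to 600M per restart for 100 clients and 6M for a single client.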

Varun

On Mon, Feb 2, 2015 at 7:37 PM, Zhen Zhang <zzhang@linkedin.com> wrote:

>  Hey Varun,
>
>  I guess your external view is pretty large, since each external view
> callback takes ~3s. The RoutingTableProvider is callback based, so it
> reads the entire external view from ZK only when the external view
> changes. During the rolling upgrade there are lots of live instance
> changes, which may lead to a lot of changes in the external view. One
> possible way to mitigate the issue is to smooth the traffic by adding some
> delay between bouncing nodes. We can do a rough estimate of how many
> external view changes you might have during the upgrade, how many
> listeners you have, and how large the external views are. Once we have
> these numbers, we can estimate the ZK bandwidth requirement. ZK read
> bandwidth can be scaled by adding ZK observers.
>
>  A ZK watcher is one-time only, so every time a listener receives a
> callback, it re-registers its watcher with ZK.
>
>  It's normally unreliable to depend on delta changes instead of reading
> the entire znode. There might be some corner cases where you would lose
> delta changes if you depend on that.
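
A toy model of the two points above (one-time watches, and why a full re-read is safer than deltas), as a sketch only. This is plain in-memory Java, not the ZooKeeper or Helix API, and every class and method name in it is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of a one-time watch. NOT the ZooKeeper or Helix API;
// all names are hypothetical and exist only for illustration.
final class OneShotWatchDemo {

    interface Watcher { void changed(); }

    static final class Node {
        private String data = "";
        private final List<Watcher> watchers = new ArrayList<>();

        // Read the full value and leave a watch that fires at most once.
        String readAndWatch(Watcher w) {
            watchers.add(w);
            return data;
        }

        void write(String newData) {
            data = newData;
            // Fire and clear: a watch is consumed by the first change after it was set.
            List<Watcher> toFire = new ArrayList<>(watchers);
            watchers.clear();
            for (Watcher w : toFire) w.changed();
        }
    }

    static final class Listener implements Watcher {
        private final Node node;
        private boolean pending = false;
        int notifications = 0;
        int fullReads = 0;
        String lastSeen;

        Listener(Node node) { this.node = node; refresh(); }

        @Override public void changed() { notifications++; pending = true; }

        boolean hasPending() { return pending; }

        // Re-register the watch and re-read the whole value in one step.
        void refresh() {
            pending = false;
            lastSeen = node.readAndWatch(this);
            fullReads++;
        }
    }

    public static void main(String[] args) {
        Node externalView = new Node();
        Listener listener = new Listener(externalView); // initial read + watch

        externalView.write("v1"); // fires the one-time watch
        externalView.write("v2"); // no watch registered, nothing fires
        externalView.write("v3"); // same

        if (listener.hasPending()) listener.refresh(); // re-read + re-register

        System.out.println("notifications=" + listener.notifications
                + " fullReads=" + listener.fullReads
                + " lastSeen=" + listener.lastSeen);
    }
}
```

In this model three writes produce a single notification, and the intermediate values v1 and v2 are never delivered individually. A listener that re-reads the whole value still ends up at v3, while one that relied on per-change deltas would have missed two of them.
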
>
>  For the ZK connection issue, do you have any log on the ZK server side
> regarding this connection?
>
>  Thanks,
> Jason
>
>   ------------------------------
> *From:* Varun Sharma [varun@pinterest.com]
> *Sent:* Monday, February 02, 2015 4:41 PM
> *To:* user@helix.apache.org
> *Subject:* Re: Excessive ZooKeeper load
>
>   I believe there is a misbehaving client. Here is a stack trace - it
> probably lost its connection and is now stampeding ZK:
>
>  "ZkClient-EventThread-104-terrapinzk001a:2181,terrapinzk002b:2181,terrapinzk003e:2181" daemon prio=10 tid=0x00007f534144b800 nid=0x7db5 in Object.wait() [0x00007f52ca9c3000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:503)
>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
>         - locked <0x00000004fb0d8c38> (a org.apache.zookeeper.ClientCnxn$Packet)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1036)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
>         at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
>         at org.I0Itec.zkclient.ZkClient$11.call(ZkClient.java:823)
> *        at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)*
> *        at org.I0Itec.zkclient.ZkClient.watchForData(ZkClient.java:820)*
> *        at org.I0Itec.zkclient.ZkClient.subscribeDataChanges(ZkClient.java:136)*
>         at org.apache.helix.manager.zk.CallbackHandler.subscribeDataChange(CallbackHandler.java:241)
>         at org.apache.helix.manager.zk.CallbackHandler.subscribeForChanges(CallbackHandler.java:287)
>         at org.apache.helix.manager.zk.CallbackHandler.invoke(CallbackHandler.java:202)
>         - locked <0x000000056b75a948> (a org.apache.helix.manager.zk.ZKHelixManager)
>         at org.apache.helix.manager.zk.CallbackHandler.handleDataChange(CallbackHandler.java:338)
>         at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:547)
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
>
> On Mon, Feb 2, 2015 at 4:28 PM, Varun Sharma <varun@pinterest.com> wrote:
>
>> I am wondering what is causing the zk subscription to happen every 2-3
>> seconds. Is a new watch being established every 3 seconds?
>>
>>  Thanks
>>  Varun
>>
>> On Mon, Feb 2, 2015 at 4:23 PM, Varun Sharma <varun@pinterest.com> wrote:
>>
>>> Hi,
>>>
>>>  We are serving a few different resources whose total # of partitions
>>> is ~30K. We just did a rolling restart of the cluster, and the clients
>>> that use the RoutingTableProvider are stuck in a bad state where they are
>>> constantly subscribing to changes in the external view of a cluster. Here
>>> is the helix log on the client after our rolling restart was finished - the
>>> client is constantly polling ZK. The zookeeper node is pushing 300mbps
>>> right now and most of the traffic is being pulled by clients. Is this a
>>> race condition? Also, is there an easy way to make the clients not poll so
>>> aggressively? We restarted one of the clients and we don't see these same
>>> messages anymore. Also, is it possible to just propagate external view
>>> diffs instead of the whole big znode?
>>>
>>>  15/02/03 00:21:18 INFO zk.CallbackHandler: 104 END:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider Took: 3340ms
>>> 15/02/03 00:21:18 INFO zk.CallbackHandler: 104 START:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider
>>> 15/02/03 00:21:18 INFO zk.CallbackHandler: pinacle2084 subscribes child-change. path: /main_a/EXTERNALVIEW, listener: org.apache.helix.spectator.RoutingTableProvider@76984879
>>> 15/02/03 00:21:22 INFO zk.CallbackHandler: 104 END:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider Took: 3371ms
>>> 15/02/03 00:21:22 INFO zk.CallbackHandler: 104 START:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider
>>> 15/02/03 00:21:22 INFO zk.CallbackHandler: pinacle2084 subscribes child-change. path: /main_a/EXTERNALVIEW, listener: org.apache.helix.spectator.RoutingTableProvider@76984879
>>> 15/02/03 00:21:25 INFO zk.CallbackHandler: 104 END:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider Took: 3281ms
>>> 15/02/03 00:21:25 INFO zk.CallbackHandler: 104 START:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider
>>>
>>>
>>>
>>
>
