helix-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: Excessive ZooKeeper load
Date Thu, 05 Feb 2015 21:53:44 GMT
I assume that it also gets called when external views get modified? How
can I distinguish whether there was an add, a modify, or a delete?

Thanks
Varun
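
Since the thread below notes that listeners read the whole external view
rather than relying on deltas, one way to tell an add from a modify or a
delete is to keep the previous snapshot and diff it against the new one.
The following is a minimal sketch in plain Java, not Helix code. The class
name SnapshotDiff and the string fingerprints (for example a znode version
or a content hash) are hypothetical, and it assumes the listener can build
a map from resource name to fingerprint on every callback.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Helix: classifies the changes between two
// full snapshots, each mapping a resource name to a non-null fingerprint
// (assumed here to be something like a znode version or a content hash).
final class SnapshotDiff {
    final Set<String> added = new TreeSet<>();
    final Set<String> modified = new TreeSet<>();
    final Set<String> deleted = new TreeSet<>();

    static SnapshotDiff between(Map<String, String> previous, Map<String, String> current) {
        SnapshotDiff diff = new SnapshotDiff();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String oldFingerprint = previous.get(entry.getKey());
            if (oldFingerprint == null) {
                // Present now, absent before.
                diff.added.add(entry.getKey());
            } else if (!oldFingerprint.equals(entry.getValue())) {
                // Present in both snapshots, but the fingerprint changed.
                diff.modified.add(entry.getKey());
            }
        }
        for (String name : previous.keySet()) {
            if (!current.containsKey(name)) {
                // Present before, absent now.
                diff.deleted.add(name);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        Map<String, String> before = new HashMap<>();
        before.put("resourceA", "v1");
        before.put("resourceB", "v1");

        Map<String, String> after = new HashMap<>();
        after.put("resourceB", "v2");
        after.put("resourceC", "v1");

        SnapshotDiff diff = between(before, after);
        System.out.println("added=" + diff.added
                + " modified=" + diff.modified
                + " deleted=" + diff.deleted);
    }
}
```

Because the diff is computed from two complete snapshots, a missed or
coalesced notification only delays the classification instead of losing it.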

On Thu, Feb 5, 2015 at 9:27 AM, Zhen Zhang <zzhang@linkedin.com> wrote:

>  Yes. It will get invoked when external views are added or deleted.
>  ------------------------------
> *From:* Varun Sharma [varun@pinterest.com]
> *Sent:* Thursday, February 05, 2015 1:27 AM
>
> *To:* user@helix.apache.org
> *Subject:* Re: Excessive ZooKeeper load
>
>   I had another question - does the RoutingTableProvider
> onExternalViewChange call get invoked when a resource gets deleted (and
> hence its external view znode)?
>
> On Wed, Feb 4, 2015 at 10:54 PM, Zhen Zhang <zzhang@linkedin.com> wrote:
>
>>  Yes. I think we did this in the incubating stage or even before. It's
>> probably in a separate branch for some performance evaluation.
>>
>>  ------------------------------
>> *From:* kishore g [g.kishore@gmail.com]
>> *Sent:* Wednesday, February 04, 2015 9:54 PM
>>
>> *To:* user@helix.apache.org
>> *Subject:* Re: Excessive ZooKeeper load
>>
>>    Jason, I remember having the ability to compress/decompress, and
>> before we added the support to bucketize, compression was used to support
>> a large number of partitions. However, I don't see the code anywhere. Did we do
>> this on a separate branch?
>>
>>  thanks,
>> Kishore G
>>
>> On Wed, Feb 4, 2015 at 3:30 PM, Zhen Zhang <zzhang@linkedin.com> wrote:
>>
>>>  Hi Varun, we can certainly add compression and have a config for
>>> turning it on/off. We have implemented compression in our own zkclient
>>> before. The issues with compression might be:
>>> 1) CPU consumption on the controller will increase.
>>> 2) It is hard to debug.
>>>
>>>  Thanks,
>>> Jason
>>>  ------------------------------
>>> *From:* kishore g [g.kishore@gmail.com]
>>> *Sent:* Wednesday, February 04, 2015 3:08 PM
>>>
>>> *To:* user@helix.apache.org
>>> *Subject:* Re: Excessive ZooKeeper load
>>>
>>>    We do have the ability to compress the data. I am not sure if there
>>> is an easy way to turn the compression on/off.
>>>
>>> On Wed, Feb 4, 2015 at 2:49 PM, Varun Sharma <varun@pinterest.com>
>>> wrote:
>>>
>>>> I am wondering if it's possible to gzip the external view znode - a
>>>> simple gzip cut down the data size by 25X. Is it possible to plug in
>>>> compression/decompression as zookeeper nodes are read?
>>>>
>>>>  Varun
>>>>
>>>> On Mon, Feb 2, 2015 at 8:53 PM, kishore g <g.kishore@gmail.com> wrote:
>>>>
>>>>> There are multiple options we can try here.
>>>>> What if we used cacheddataaccessor for this use case? Clients will
>>>>> only read if the node has changed. This optimization can benefit all
>>>>> use cases.
>>>>>
>>>>> What about batching the watch triggers? Not sure which version of
>>>>> helix has this option.
>>>>>
>>>>> Another option is to use a poll based routing table instead of a watch
>>>>> based one. This, coupled with cacheddataaccessor, can be very efficient.
>>>>>
>>>>> Thanks,
>>>>> Kishore G
>>>>>  On Feb 2, 2015 8:17 PM, "Varun Sharma" <varun@pinterest.com> wrote:
>>>>>
>>>>>> My total external view across all resources is roughly 3M in size and
>>>>>> there are 100 clients downloading it twice for every node restart -
>>>>>> that's 600M of data for every restart. So I guess that is causing this
>>>>>> issue. We are thinking of doing some tricks to limit the # of clients
>>>>>> to 1 from 100. I guess that should help significantly.
>>>>>>
>>>>>>  Varun
>>>>>>
>>>>>> On Mon, Feb 2, 2015 at 7:37 PM, Zhen Zhang <zzhang@linkedin.com>
>>>>>> wrote:
>>>>>>
>>>>>>>  Hey Varun,
>>>>>>>
>>>>>>>  I guess your external view is pretty large, since each external
>>>>>>> view callback takes ~3s. The RoutingTableProvider is callback
>>>>>>> based, so only when there is a change in the external view will
>>>>>>> RoutingTableProvider read the entire external view from ZK. During
>>>>>>> the rolling upgrade, there are lots of live instance changes, which
>>>>>>> may lead to a lot of changes in the external view. One possible way
>>>>>>> to mitigate the issue is to smooth the traffic by having some delays
>>>>>>> in between bouncing nodes. We can do a rough estimation of how many
>>>>>>> external view changes you might have during the upgrade, how many
>>>>>>> listeners you have, and how large the external views are. Once we
>>>>>>> have these numbers, we might know the ZK bandwidth requirement. ZK
>>>>>>> read bandwidth can be scaled by adding ZK observers.
>>>>>>>
>>>>>>>  ZK watcher is one time only, so every time a listener receives a
>>>>>>> callback, it will re-register its watcher again to ZK.
>>>>>>>
>>>>>>>  It's normally unreliable to depend on delta changes instead of
>>>>>>> reading the entire znode. There might be some corner cases where you
>>>>>>> would lose delta changes if you depend on that.
>>>>>>>
>>>>>>>  For the ZK connection issue, do you have any log on the ZK server
>>>>>>> side regarding this connection?
>>>>>>>
>>>>>>>  Thanks,
>>>>>>> Jason
>>>>>>>
>>>>>>>   ------------------------------
>>>>>>> *From:* Varun Sharma [varun@pinterest.com]
>>>>>>> *Sent:* Monday, February 02, 2015 4:41 PM
>>>>>>> *To:* user@helix.apache.org
>>>>>>> *Subject:* Re: Excessive ZooKeeper load
>>>>>>>
>>>>>>>    I believe there is a misbehaving client. Here is a stack trace -
>>>>>>> it probably lost connection and is now stampeding it:
>>>>>>>
>>>>>>>  "ZkClient-EventThread-104-terrapinzk001a:2181,terrapinzk002b:2181,terrapinzk003e:2181" daemon prio=10 tid=0x00007f534144b800 nid=0x7db5 in Object.wait() [0x00007f52ca9c3000]
>>>>>>>    java.lang.Thread.State: WAITING (on object monitor)
>>>>>>>         at java.lang.Object.wait(Native Method)
>>>>>>>         at java.lang.Object.wait(Object.java:503)
>>>>>>>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
>>>>>>>         - locked <0x00000004fb0d8c38> (a org.apache.zookeeper.ClientCnxn$Packet)
>>>>>>>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1036)
>>>>>>>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
>>>>>>>         at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
>>>>>>>         at org.I0Itec.zkclient.ZkClient$11.call(ZkClient.java:823)
>>>>>>> *        at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)*
>>>>>>> *        at org.I0Itec.zkclient.ZkClient.watchForData(ZkClient.java:820)*
>>>>>>> *        at org.I0Itec.zkclient.ZkClient.subscribeDataChanges(ZkClient.java:136)*
>>>>>>>         at org.apache.helix.manager.zk.CallbackHandler.subscribeDataChange(CallbackHandler.java:241)
>>>>>>>         at org.apache.helix.manager.zk.CallbackHandler.subscribeForChanges(CallbackHandler.java:287)
>>>>>>>         at org.apache.helix.manager.zk.CallbackHandler.invoke(CallbackHandler.java:202)
>>>>>>>         - locked <0x000000056b75a948> (a org.apache.helix.manager.zk.ZKHelixManager)
>>>>>>>         at org.apache.helix.manager.zk.CallbackHandler.handleDataChange(CallbackHandler.java:338)
>>>>>>>         at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:547)
>>>>>>>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
>>>>>>>
>>>>>>> On Mon, Feb 2, 2015 at 4:28 PM, Varun Sharma <varun@pinterest.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am wondering what is causing the zk subscription to happen every
>>>>>>>> 2-3 seconds - is this a new watch being established every 3 seconds?
>>>>>>>>
>>>>>>>>  Thanks
>>>>>>>>  Varun
>>>>>>>>
>>>>>>>> On Mon, Feb 2, 2015 at 4:23 PM, Varun Sharma <varun@pinterest.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>  We are serving a few different resources whose total # of
>>>>>>>>> partitions is ~ 30K. We just did a rolling restart of the cluster and
>>>>>>>>> the clients which use the RoutingTableProvider are stuck in a bad state
>>>>>>>>> where they are constantly subscribing to changes in the external view
>>>>>>>>> of a cluster. Here is the helix log on the client after our rolling
>>>>>>>>> restart was finished - the client is constantly polling ZK. The
>>>>>>>>> zookeeper node is pushing 300mbps right now and most of the traffic is
>>>>>>>>> being pulled by clients. Is this a race condition - also is there an
>>>>>>>>> easy way to make the clients not poll so aggressively? We restarted one
>>>>>>>>> of the clients and we don't see these same messages anymore. Also is it
>>>>>>>>> possible to just propagate external view diffs instead of the whole big
>>>>>>>>> znode?
>>>>>>>>>
>>>>>>>>>  15/02/03 00:21:18 INFO zk.CallbackHandler: 104 END:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider Took: 3340ms
>>>>>>>>>
>>>>>>>>> 15/02/03 00:21:18 INFO zk.CallbackHandler: 104 START:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider
>>>>>>>>>
>>>>>>>>> 15/02/03 00:21:18 INFO zk.CallbackHandler: pinacle2084 subscribes child-change. path: /main_a/EXTERNALVIEW, listener: org.apache.helix.spectator.RoutingTableProvider@76984879
>>>>>>>>>
>>>>>>>>> 15/02/03 00:21:22 INFO zk.CallbackHandler: 104 END:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider Took: 3371ms
>>>>>>>>>
>>>>>>>>> 15/02/03 00:21:22 INFO zk.CallbackHandler: 104 START:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider
>>>>>>>>>
>>>>>>>>> 15/02/03 00:21:22 INFO zk.CallbackHandler: pinacle2084 subscribes child-change. path: /main_a/EXTERNALVIEW, listener: org.apache.helix.spectator.RoutingTableProvider@76984879
>>>>>>>>>
>>>>>>>>> 15/02/03 00:21:25 INFO zk.CallbackHandler: 104 END:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider Took: 3281ms
>>>>>>>>>
>>>>>>>>> 15/02/03 00:21:25 INFO zk.CallbackHandler: 104 START:INVOKE /main_a/EXTERNALVIEW listener:org.apache.helix.spectator.RoutingTableProvider
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>
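
For the gzip idea raised in the thread, a compression round trip can be
sketched with the JDK's java.util.zip classes. This is a minimal sketch
under stated assumptions, not Helix's or zkclient's implementation: the
class name ZnodeGzip and the sample payload (resource, host, and state
strings) are made up for illustration, and where such a codec would plug
into the znode serializer is left open, as it is in the thread. The 25X
ratio reported above depends on the real data and is not reproduced here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical codec for znode payloads; the name is illustrative only.
final class ZnodeGzip {
    // Gzip the raw bytes that would otherwise be written to the znode.
    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(raw);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return out.toByteArray();
    }

    // Inflate bytes read from the znode back into the original payload.
    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up, highly repetitive payload standing in for an external view.
        StringBuilder sb = new StringBuilder();
        for (int p = 0; p < 1000; p++) {
            sb.append("\"partition_").append(p).append("\":{\"host1_9090\":\"ONLINE\"},");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes");
    }
}
```

Compression trades CPU on the writer and on every reader for ZK bandwidth,
and the stored bytes are no longer human readable, which matches the two
concerns listed in the thread.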
