Hi,
Do you know if any of the services that use your ZK create ACLs that are potentially unique
and one-time-ish? I recently hit a similar problem and discovered that the DataTree has an
ACL cache that never gets anything removed from it. That was by far and away the largest
memory consumer I found when analysing the heap dump. If this is the case then you should
see lots of ACL objects on the heap.
I filed a JIRA for this and keep meaning to submit a patch but sadly haven't got round to
it. As an interim solution, I wrote a tool which uses the DataTree class and the serialisation
utils to purge this cache of unused entries. In my case it shrank the snapshot from 500MB
to 12MB! The time to write the snapshot went from 40 seconds to less than 1 second as a result.
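
To illustrate the idea of purging unused entries, here is a simplified stand-alone sketch in Python. It is not the actual tool and does not use ZooKeeper's DataTree API; the function name, the dictionaries and the sample paths are all made up for illustration. It assumes each node refers to a cached ACL list by a numeric id, so a cache entry is "unused" when no node holds its id any more.

```python
from typing import Dict, List, Tuple

# (scheme, id, perms): a hypothetical stand-in for a ZooKeeper ACL entry
Acl = Tuple[str, str, int]


def purge_unused_acls(
    node_acl_ids: Dict[str, int],
    acl_cache: Dict[int, List[Acl]],
) -> Dict[int, List[Acl]]:
    """Return a copy of acl_cache keeping only ids that some node still references."""
    live_ids = set(node_acl_ids.values())
    return {acl_id: acls for acl_id, acls in acl_cache.items() if acl_id in live_ids}


if __name__ == "__main__":
    # Hypothetical node path -> ACL id mapping
    nodes = {"/kafka": 1, "/hdfs-ha": 1, "/custom/job-42": 3}
    # Hypothetical ACL cache: ids 2 and 4 are no longer referenced by any node
    cache = {
        1: [("world", "anyone", 31)],
        2: [("digest", "tmp-user-a:hash", 31)],
        3: [("digest", "svc:hash", 31)],
        4: [("digest", "tmp-user-b:hash", 31)],
    }
    purged = purge_unused_acls(nodes, cache)
    print(sorted(purged))  # [1, 3]
```

A real purge would have to walk the deserialised snapshot rather than plain dictionaries, but the principle is the same: collect the ids still in use, then drop everything else.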
Thanks,
Karol
> On 24 Apr 2015, at 18:45, CP Mishra <mishracp@gmail.com> wrote:
>
> Hi,
>
> I am running a 3 node ZK ensemble on 3 VMs (2 CPU, 32GB RAM) in the test
> environment. Lately, I have been getting OutOfMemoryError on all three ZK
> nodes. ZK has been configured with 6GB heap size. The same ZK ensemble is
> shared between Kafka, HDFS HA and another custom service.
>
> I analyzed the heap dump and 5.8+ GB is being used by DataTree. I don't
> have a purge policy in place and size of ZK data directory stands at ~14 GB
> now. There is enough space on the disk holding ZK data (20% used).
>
> As soon as I restart a ZK node, it grows to use all 6GB and starts Full GC
> every 1-2 sec. In 3-5 minutes, it throws OOM: GC Overhead exceeded.
>
> I would appreciate any help in diagnosing the issue.
>
> Thanks,
> CP Mishra