zookeeper-user mailing list archives

From Henry Robinson <he...@cloudera.com>
Subject Re: ZooKeeper's resident set size grows but does not shrink
Date Tue, 29 May 2012 20:04:38 GMT
Hi Brian -

(Copying the list as well for general interest)

I dug into this a bit this weekend. The heap dump does show that heap usage
is unexpectedly high, and that ZK is using more memory than you might think
it should.

The root cause is that each server maintains a 'committed log' of the last
500 proposals in memory. This speeds up the case where another server is
catching up and is fewer than 500 proposals behind - the up-to-date server
can send the missing proposals to it directly.
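
The behaviour can be sketched as a bounded buffer (a rough Python sketch,
not the actual ZooKeeper code - the class and names are illustrative):

```python
from collections import deque

COMMIT_LOG_SIZE = 500  # ZK retains the last 500 committed proposals

class CommitLog:
    """Illustrative sketch of the in-memory committed log."""
    def __init__(self, maxsize=COMMIT_LOG_SIZE):
        # oldest proposals fall off automatically once the log is full
        self.proposals = deque(maxlen=maxsize)

    def add(self, zxid, data):
        # each committed proposal is retained, payload and all, so a
        # follower that is < 500 txns behind can be fed from memory
        self.proposals.append((zxid, data))

    def heap_bytes(self):
        # rough payload footprint pinned by the log
        return sum(len(data) for _, data in self.proposals)

log = CommitLog()
for zxid in range(1000):
    # alternate 250 KB creates with tiny deletes, as in my experiment
    payload = b"x" * 250_000 if zxid % 2 == 0 else b""
    log.add(zxid, payload)

print(len(log.proposals))   # 500
print(log.heap_bytes())     # 62500000, i.e. ~62.5 MB pinned by the log
```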

So the last 500 proposals, and their associated data, are kept in memory.
For the experiments I ran, I created and then deleted a znode 1000 times
with 250 KB of data, so 50% of the last 500 transactions were 'large'. As
a result, I expected to see ~66MB of extra data in the Java heap.

What I actually saw was ~192MB taken up by byte arrays. Some digging into
the heap showed that the data in the commit log is actually copied into
*three* different places.
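
The arithmetic, under my assumptions (250 of the last 500 retained
proposals carry a 250 KB payload, and each payload is held three times),
lines up roughly with what the dump showed:

```python
# Back-of-the-envelope estimate of byte[] usage from the commit log.
large_proposals = 250      # half of the 500 retained proposals were 'large'
payload_bytes = 250_000    # data per large znode write
copies = 3                 # the payload shows up in three places in the heap

one_copy = large_proposals * payload_bytes   # expected if data were held once
total = one_copy * copies                    # estimate with the triple copy

print(one_copy / 1e6, "MB")   # 62.5 MB - close to my ~66MB expectation
print(total / 1e6, "MB")      # 187.5 MB - close to the ~192MB observed
```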

This doesn't fully explain the 1.5G byte[] usage that you're seeing. It
might be worth forcing a full GC from jvisualvm or similar and seeing if
anything gets cleaned up. Another way to test my hypothesis is to 'flush'
the commit log with 500 small transactions - repeatedly setting a znode's
data to "", for example - this should free up the commit log and you should
see heap usage drop significantly. Of course, the RSS will still remain
high, for reasons discussed earlier. I'd love to see the results if you
still have the machines available to try these two things on.
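
The flush could look something like this (a sketch assuming the kazoo
Python client; the znode path and helper name are arbitrary):

```python
def flush_commit_log(client, path="/flush-test", count=500):
    """Issue `count` tiny transactions so the 500-entry committed log
    no longer references any large payloads."""
    client.ensure_path(path)
    for _ in range(count):
        client.set(path, b"")  # tiny txn; evicts one older proposal
    return count

# Usage against a real ensemble (requires running ZooKeeper servers):
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="host1:2181,host2:2181,host3:2181")
#   zk.start()
#   flush_commit_log(zk)
#   zk.stop()
```

After this, heap usage (as opposed to RSS) should drop once a full GC
runs, if the commit-log hypothesis is right.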

I've filed https://issues.apache.org/jira/browse/ZOOKEEPER-1473 to track
the triple-memory issue. The maximum exposure for a single instance is an
extra 1G in heap - not good, but not disastrous, and it only shows up under
a particular workload. Still, it would be good to get it fixed.

Thanks,
Henry


On 23 May 2012 18:14, Brian Oki <brian@nimbula.com> wrote:

> Henry,
>
> Thanks for the quick reply. Here's the output of jmap -heap:
>
> using thread-local object allocation.
> Parallel GC with 4 thread(s)
>
> Heap Configuration:
>    MinHeapFreeRatio = 40
>    MaxHeapFreeRatio = 70
>    MaxHeapSize      = 3221225472 (3072.0MB)
>    NewSize          = 1310720 (1.25MB)
>    MaxNewSize       = 17592186044415 MB
>    OldSize          = 5439488 (5.1875MB)
>    NewRatio         = 2
>    SurvivorRatio    = 8
>    PermSize         = 21757952 (20.75MB)
>    MaxPermSize      = 174063616 (166.0MB)
>
> Heap Usage:
> PS Young Generation
> Eden Space:
>    capacity = 641073152 (611.375MB)
>    used     = 367863672 (350.82213592529297MB)
>    free     = 273209480 (260.55286407470703MB)
>    57.38247980785818% used
> From Space:
>    capacity = 212992000 (203.125MB)
>    used     = 188128576 (179.41339111328125MB)
>    free     = 24863424 (23.71160888671875MB)
>    88.32659254807692% used
> To Space:
>    capacity = 216334336 (206.3125MB)
>    used     = 0 (0.0MB)
>    free     = 216334336 (206.3125MB)
>    0.0% used
> PS Old Generation
>    capacity = 1739915264 (1659.3125MB)
>    used     = 1083663768 (1033.462303161621MB)
>    free     = 656251496 (625.8501968383789MB)
>    62.282559985633874% used
> PS Perm Generation
>    capacity = 21757952 (20.75MB)
>    used     = 9952064 (9.49102783203125MB)
>    free     = 11805888 (11.25897216796875MB)
>    45.73989316641566% used
>
> A portion of the histogram is shown below, if that helps.  You can see
> that there's ~1.5 GB of byte[] in the heap that we can't account for.
> I didn't bother to dump the heap in binary format.  All of the znodes
> and data created by the test have been deleted.
>
>
> Object Histogram:
>
> num       #instances    #bytes    Class description
> --------------------------------------------------------------------------
> 1:        112235    1546244888    byte[]
> 2:        20804    69853352    int[]
> 3:        40920    4718272    char[]
> 4:        14659    2296304    * ConstMethodKlass
> 5:        39018    1872864    java.nio.HeapByteBuffer
> 6:        14659    1767608    * MethodKlass
> 7:        37364    1494560    java.util.HashMap$KeyIterator
> 8:        1268    1475520    * ConstantPoolKlass
> 9:        24692    1144072    * SymbolKlass
> 10:        31844    1019008    java.lang.String
> 11:        1268    928960    * InstanceKlassKlass
> 12:        1181    895016    * ConstantPoolCacheKlass
> 13:        36811    883464    java.lang.Long
> 14:        19442    777680    java.lang.ref.SoftReference
> 15:        22814    730048    java.util.HashMap$Entry
> 16:        19781    633088    java.lang.Object[]
>
> Sincerely,
>
> Brian
>
>
>
> On Wed, May 23, 2012 at 6:01 PM, Henry Robinson <henry@cloudera.com> wrote:
>
>> Although the amount of Java heap that ZK is using may go down, the JVM
>> process will still hang on to the physical memory allocated for it, and
>> if there is no external pressure from other processes, Linux will not
>> need to swap it out - hence the RSS will remain roughly constant.
>>
>> That is, the amount of 'real' memory used by a JVM doesn't tell you how
>> much of the JVM's heap is actually in use. If you believe ZK's heap
>> usage is too high - that is, GC is not finding enough dead objects to
>> reclaim - then that is a real problem: if you ever do come under memory
>> pressure, ZK will start swapping, which is bad.
>>
>> In general, processes on Linux don't give memory back - their footprint
>> reflects the most they have ever needed at once, and the operating
>> system eventually swaps out the unused pages if it needs to.
>>
>> Can you paste the output of jmap -heap <zk-pid> into a reply? That will
>> allow us to see how much of the heap is really being used.
>>
>> Thanks,
>> Henry
>>
>>
>> On 23 May 2012 17:41, Brian Oki <brian@nimbula.com> wrote:
>>
>>> Hello,
>>>
>>> We use ZooKeeper 3.3.3.  On a 3-node site, we've been using Patrick
>>> Hunt's publicly available latencies test suite to create scenarios
>>> that will help us understand the memory, CPU and disk requirements
>>> for a deployment of ZooKeeper under our type of workload.  We use a
>>> fourth node as the ZooKeeper (ZK) client to conduct the tests.
>>>
>>> We modified zk-latencies.py slightly to do create-set-delete on znodes
>>> only.  In particular, we create 1000 permanent znodes, each written
>>> with 250,000 bytes of data.  We do this create-set-delete in a loop,
>>> sleeping for 5 seconds between iterations.
>>>
>>> We observe at the ZK leader that the Resident Set Size (RSS) memory
>>> climbs rapidly to 2.6 GB on an 8 GB RAM node.  The Java heap size of
>>> each ZK server daemon is 3 GB.
>>>
>>> Further, once the test has gone through 15 iterations, all the znodes
>>> created on behalf of the test have been deleted.  There is no further
>>> write activity to ZK, and no read activity at all.  The system is
>>> quiesced.  No other services are competing for the disk, CPU or RAM
>>> during the test.
>>>
>>> Our question is this: the RSS of the ZK leader (and the followers)
>>> seems to remain at 2.6 GB after the test has completed.  Why?
>>>
>>> We would expect that since all relevant znodes for the test have been
>>> deleted, the leader's RSS should have shrunk considerably, even after 1
>>> hour has passed.  Are we missing something?
>>>
>>> We have used jmap to inspect the heap.  Understanding the heap
>>> contents requires detailed implementation knowledge that we don't
>>> have, so we didn't pursue this avenue any further.
>>>
>>> Configuration:
>>>   3 node servers running ZK daemons as a 3-server ensemble
>>>   1 client machine
>>>   each node has 8 GB RAM
>>>   each node has 4 cores
>>>   each node has a 465 GB disk
>>>   ZK release: 3.3.3
>>>   ZK server Java heap size: 3 GB
>>>   GC: concurrent low-pause garbage collector
>>>   NIC: bonded 1 Gb NIC
>>>
>>> Thank you.
>>>
>>> Sincerely,
>>>
>>> Brian
>>>
>>
>>
>>
>> --
>> Henry Robinson
>> Software Engineer
>> Cloudera
>> 415-994-6679
>>
>
>


-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679
