cassandra-commits mailing list archives

From "Robert Stupp (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
Date Wed, 24 Dec 2014 14:21:15 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257721#comment-14257721
] 

Robert Stupp edited comment on CASSANDRA-7438 at 12/24/14 2:21 PM:
-------------------------------------------------------------------

I had the opportunity to test OHC on a big machine.
First: it works - very happy about that :)

Some things I want to note:
* a high number of segments does not have any really measurable influence (the default of 2 * the
number of cores is fine)
* throughput heavily depends on serialization (hash entry size) - Java 8 gave about a 10% to
15% improvement in some tests (either in {{Unsafe.copyMemory}} or something related like the JNI
barrier)
* the number of entries per bucket stays pretty low with the default load factor of .75 - the
vast majority of buckets have 0 or 1 entries, some have 2 or 3, and a few have up to 8
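That occupancy pattern is what uniform hashing predicts: at load factor α, per-bucket depth is approximately Poisson-distributed. A small back-of-the-envelope sketch (not part of OHC, just illustrating the expectation):

```java
public class BucketOccupancy {
    // Poisson probability of a bucket holding exactly k entries at load factor alpha
    static double poisson(double alpha, int k) {
        double p = Math.exp(-alpha);
        for (int i = 1; i <= k; i++)
            p *= alpha / i;
        return p;
    }

    public static void main(String[] args) {
        double alpha = 0.75; // the default load factor mentioned above
        for (int k = 0; k <= 8; k++)
            System.out.printf("depth %d: %.4f%n", k, poisson(alpha, k));
        // depth 0 comes out near 0.47 and depth 1 near 0.35, so roughly
        // 80% of buckets should hold 0 or 1 entries - matching the observation
    }
}
```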

Issue (not solved yet):
It works great for hash entries up to approx. 64kB, with good to great throughput. Above that
barrier it works well at first, but after some time the system spends a huge amount of CPU time
(~95%) in {{malloc()}} / {{free()}} (with jemalloc; {{Unsafe.allocateMemory}} is not worth discussing
at all on Linux).
I tried to add a "memory buffer cache" that caches freed hash entries for reuse, but it turned
out that it would be too complex to do right. The current implementation is still in the code,
but must be explicitly enabled with a system property: workloads with small entries and a high
number of threads easily trigger the Linux OOM killer (which kills the process). Note that it
does work with large hash entries - but throughput drops dramatically, to just a few thousand
writes per second.

Some numbers (value sizes follow a Gaussian distribution). I had to do these tests in a hurry
because I had to give back the machine. The code used during these tests is tagged as {{0.1-SNAP-Bench}}
in git. Throughput is limited by {{malloc()}} / {{free()}}, and most tests used only 50%
of the available CPU capacity (on _c3.8xlarge_ - 32 cores, Intel Xeon E5-2680 v2 @2.8GHz, 64GB).
* 1k..200k value size, 32 threads, 1M keys, 90% read ratio, 32GB: 22k writes/sec, 200k reads/sec,
~8k evictions/sec, write: 8ms (99perc), read: 3ms(99perc)
* 1k..64k value size, 500 threads, 1M keys, 90% read ratio, 32GB: 55k writes/sec, 499k reads/sec,
~2k evictions/sec, write: .1ms (99perc), read: .03ms(99perc)
* 1k..64k value size, 500 threads, 1M keys, 50% read ratio, 32GB: 195k writes/sec, 195k reads/sec,
~9k evictions/sec, write: .2ms (99perc), read: .1ms(99perc)
* 1k..64k value size, 500 threads, 1M keys, 10% read ratio, 32GB: 185k writes/sec, 20k reads/sec,
~7k evictions/sec, write: 4ms (99perc), read: .07ms(99perc)
* 1k..16k value size, 500 threads, 5M keys, 90% read ratio, 32GB: 110k writes/sec, 1M reads/sec,
30k evictions/sec, write: .04ms (99perc), read: .01ms(99perc)
* 1k..16k value size, 500 threads, 5M keys, 50% read ratio, 32GB: 420k writes/sec, 420k reads/sec,
125k evictions/sec, write: .06ms (99perc), read: .01ms(99perc)
* 1k..16k value size, 500 threads, 5M keys, 10% read ratio, 32GB: 435k writes/sec, 48k reads/sec,
130k evictions/sec, write: .06ms (99perc), read: .01ms(99perc)
* 1k..4k value size, 500 threads, 20M keys, 90% read ratio, 32GB: 140k writes/sec, 1.25M reads/sec,
50k evictions/sec, write: .02ms (99perc), read: .005ms(99perc)
* 1k..4k value size, 500 threads, 20M keys, 50% read ratio, 32GB: 530k writes/sec, 530k reads/sec,
220k evictions/sec, write: .04ms (99perc), read: .005ms(99perc)
* 1k..4k value size, 500 threads, 20M keys, 10% read ratio, 32GB: 665k writes/sec, 74k reads/sec,
250k evictions/sec, write: .04ms (99perc), read: .005ms(99perc)

Command line to execute the benchmark:
{code}
java -jar ohc-benchmark/target/ohc-benchmark-0.1-SNAPSHOT.jar -rkd 'uniform(1..20000000)'
-wkd 'uniform(1..20000000)' -vs 'gaussian(1024..4096,2)' -r .1 -cap 32000000000 -d 86400 -t
500 -dr 8

-r = read rate
-d = duration
-t = # of threads
-dr = # of driver threads that feed the worker threads
-rkd = read key distribution
-wkd = write key distribution
-vs = value size
-cap = capacity
{code}

Sample bucket histogram from 20M test:
{code}
    [0..0]: 8118604
    [1..1]: 5892298
    [2..2]: 2138308
    [3..3]: 518089
    [4..4]: 94441
    [5..5]: 13672
    [6..6]: 1599
    [7..7]: 189
    [8..9]: 16
{code}
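As a quick cross-check of that histogram (counting the {{[8..9]}} bin at its lower bound of 8, an assumption on my part), the bucket counts sum to exactly 2^24 buckets, and the implied load factor lands close to the configured .75:

```java
public class HistogramCheck {
    // bucket-depth histogram from the 20M-key run above;
    // the [8..9] bin is counted at its lower bound (8)
    static final int[]  DEPTH = {0, 1, 2, 3, 4, 5, 6, 7, 8};
    static final long[] COUNT = {8118604L, 5892298L, 2138308L, 518089L,
                                 94441L, 13672L, 1599L, 189L, 16L};

    static long totalBuckets() {
        long s = 0;
        for (long c : COUNT) s += c;
        return s;
    }

    static long totalEntries() {
        long s = 0;
        for (int i = 0; i < DEPTH.length; i++) s += DEPTH[i] * COUNT[i];
        return s;
    }

    public static void main(String[] args) {
        System.out.println("buckets     = " + totalBuckets());   // 16777216 = 2^24
        System.out.println("entries     = " + totalEntries());
        System.out.printf("load factor = %.2f%n",
                          (double) totalEntries() / totalBuckets());
    }
}
```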

After running into that memory-management issue with allocation sizes varying from a few
kB to several MB, I think it's still worth working on our own off-heap memory management.
Maybe some block-based approach (fixed or variable block sizes). But that's out of the scope
of this ticket.
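To make the block-based idea concrete: the rough shape would be to carve one large off-heap region into fixed-size slots and recycle them through a free list, so steady-state operation never calls {{malloc()}} / {{free()}} at all. A minimal single-threaded sketch (purely hypothetical, not OHC code; a real version would need per-segment pools, locking, and a fallback for oversized entries):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Hypothetical fixed-block pool: one large direct allocation split into
// equal slots, recycled via a free list of slot offsets.
final class FixedBlockPool {
    private final ByteBuffer region;
    private final int blockSize;
    private final ArrayDeque<Integer> free = new ArrayDeque<>();

    FixedBlockPool(int blockSize, int blocks) {
        this.blockSize = blockSize;
        this.region = ByteBuffer.allocateDirect(blockSize * blocks);
        for (int i = 0; i < blocks; i++)
            free.push(i * blockSize);
    }

    // returns a block offset, or -1 when exhausted
    // (a real implementation could fall back to malloc() here)
    int allocate() {
        return free.isEmpty() ? -1 : free.pop();
    }

    // returning a block is just pushing its offset back - no free() call
    void release(int offset) {
        free.push(offset);
    }

    // view of one block for reading/writing an entry
    ByteBuffer slice(int offset) {
        ByteBuffer b = region.duplicate();
        b.position(offset);
        b.limit(offset + blockSize);
        return b.slice();
    }
}
```

The trade-off is internal fragmentation (a 65kB entry burns a whole 128kB block, say) in exchange for allocation cost that stays flat no matter how hot the write path gets.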

EDIT: The problem with high system CPU usage only occurs on systems with multiple CPU sockets.
A cross-check with the second CPU socket disabled - calling the benchmark with {{taskset 0xffff
java -jar ...}} - does not show 95% system CPU usage.



> Serializing Row cache alternative (Fully off heap)
> --------------------------------------------------
>
>                 Key: CASSANDRA-7438
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>         Environment: Linux
>            Reporter: Vijay
>            Assignee: Vijay
>              Labels: performance
>             Fix For: 3.0
>
>         Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
>
> Currently SerializingCache is partially off heap; keys are still stored in the JVM heap as
BB. 
> * There is a higher GC cost for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better results, but
this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off heap and
use JNI to interact with the cache. We might want to ensure that the new implementation matches
the existing APIs (ICache), and the implementation needs to have safe memory access, low
memory overhead, and as few memcpys as possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
