cassandra-commits mailing list archives

From "Ariel Weisberg (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
Date Tue, 04 Nov 2014 16:13:35 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196296#comment-14196296 ]

Ariel Weisberg edited comment on CASSANDRA-7438 at 11/4/14 4:13 PM:
--------------------------------------------------------------------

bq. No we don't. We have locks per Segment; this is very similar to lock striping/Java's
concurrent hash map.
Thanks for clearing that up.
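
For reference, a minimal Java sketch of per-segment locking as I understand it (hypothetical names, not the lruc code): each key hashes to one segment, and only that segment's lock is taken, so operations on different segments proceed concurrently.

{code:java}
import java.util.concurrent.locks.ReentrantLock;

final class StripedCache {
    private static final int SEGMENTS = 16;                 // power of two
    private final ReentrantLock[] locks = new ReentrantLock[SEGMENTS];

    StripedCache() {
        for (int i = 0; i < SEGMENTS; i++)
            locks[i] = new ReentrantLock();
    }

    // Lock only the segment that owns this key; other segments stay available.
    void withSegmentLock(long keyHash, Runnable action) {
        ReentrantLock lock = locks[(int) (keyHash & (SEGMENTS - 1))];
        lock.lock();
        try {
            action.run();
        } finally {
            lock.unlock();
        }
    }
}
{code}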

bq. Not really; we lock globally when we reach 100% of the space and we free up to 80% of
the space, and we spread the overhead to other threads based on whoever has the item partition
lock. It won't be hard to make this part of the queue thread, and I will try it for the next
release of lruc.

OK, that makes sense. Freeing 20% of the cache could be many milliseconds of work if you are
using many gigabytes of cache, and that's not a great thing to foist on a random victim thread.
If you handed that work to the queue thread, I think you would run into another issue: the
ring buffer doesn't appear to check whether the queue is full, so the queue thread could go
out to lunch for a while. Not a big deal, but finer-grained scheduling will probably be
necessary.
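
For concreteness, here is a minimal Java sketch of the finer-grained scheduling I have in mind: the eviction thread frees space in small bounded batches, and producers see a queue-full condition instead of assuming unbounded capacity. All names and sizes are hypothetical, not lruc's.

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class BatchedEvictor implements Runnable {
    private static final int BATCH_SIZE = 128;            // entries freed per pass
    private final BlockingQueue<Long> pending = new ArrayBlockingQueue<>(4096);
    private final AtomicLong bytesUsed;
    private final long lowWatermark;                      // e.g. 80% of capacity

    BatchedEvictor(AtomicLong bytesUsed, long lowWatermark) {
        this.bytesUsed = bytesUsed;
        this.lowWatermark = lowWatermark;
    }

    // Bounded queue: callers learn the queue is full instead of blocking forever.
    boolean enqueue(long entryAddress) {
        return pending.offer(entryAddress);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Free at most BATCH_SIZE entries per pass so the eviction thread
            // never disappears into a multi-millisecond 20%-of-the-cache purge.
            int freed = 0;
            while (bytesUsed.get() > lowWatermark && freed < BATCH_SIZE) {
                Long address = pending.poll();
                if (address == null)
                    break;
                bytesUsed.addAndGet(-freeEntry(address));
                freed++;
            }
            Thread.yield();                               // let other work run
        }
    }

    // Placeholder: release the entry's memory and return its size in bytes.
    private long freeEntry(long address) {
        return 0;
    }
}
{code}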

bq. If you look at the code, it is closer to memcached. Actually, I started off stripping
the memcached code so we could run it in process instead of as a separate process, removing
the global locks in queue reallocation etc., and it eventually diverged too much from it.
The other reason it doesn't use slab allocators is that we wanted the memory allocators to
do the right thing; we have already tested Cassandra with jemalloc.

Ah very cool.

jemalloc is not a moving allocator, whereas it looks like memcached's slabs implement rebalancing
to accommodate changes in the size distribution. That would actually be one of the really nice
things to keep, IMO. On large-memory systems with a cache that scales and performs you would
end up dedicating as much RAM as possible to the row cache/key cache and not the page cache,
since the page cache is not as granular (correct me if the story for C* is different). If
you dedicate 80% of RAM to the cache, that doesn't leave a lot of headroom for fragmentation.
By using a heap allocator you also lose the ability to enforce hard, predictable limits on
memory used by the cache, since you didn't map the memory yourself. I could be totally off
base and jemalloc might be good enough.

bq. There are some comments above which have the reasoning for it (please see the above comments).
PS: I believe there were some tickets on the current RowCache complaining about the overhead.
I don't have a performance beef with JNI, especially the way you have done it, which I think
is pretty efficient. I think the overhead of JNI (one or two slightly more expensive function
calls) would be eclipsed by things like the cache misses, coherence traffic, and pipeline
stalls that are part of accessing and maintaining a concurrent cache (Java or C++). It's all
just intuition without comparative microbenchmarks of the two caches. Java might look a little
faster just due to allocator performance, but we know you pay for that in other ways.
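
If someone does want to benchmark them head to head, a JMH harness along these lines would be a starting point. This is only a sketch; CacheUnderTest is a hypothetical wrapper that could front either the JNI cache or an on-heap one.

{code:java}
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.ThreadLocalRandom;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@Threads(8)                                   // contended access is the point
public class RowCacheBench {
    static final long KEYS = 1_000_000;
    CacheUnderTest cache;                     // hypothetical: wraps either impl

    @Setup
    public void setup() {
        cache = CacheUnderTest.create(1L << 30);      // 1 GiB of cache
        for (long k = 0; k < KEYS; k++)
            cache.put(k, new byte[256]);              // 256-byte values
    }

    @Benchmark
    public byte[] get() {
        // Random reads across the whole key space to defeat CPU cache reuse.
        return cache.get(ThreadLocalRandom.current().nextLong(KEYS));
    }
}
{code}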

I think what you have made scratches the itch for a large cache quite well and beats the
status quo. I don't agree that Unsafe couldn't do the exact same thing with no on-heap references.

The hash table, ring buffer, and individual item entries are all malloced, and you can do
that from Java using Unsafe. You don't need to implement a ring buffer because you can use
the Disruptor. I also wonder whether splitting the cache into several instances, each with
a coarse lock per instance, wouldn't result in simpler code that is still fast enough (I know
performance is not the issue here). I don't want to advocate doing something different for
performance, but rather to point out that there is the possibility of a relatively simple
implementation via Unsafe.
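
To illustrate, a minimal sketch of malloc-style entry allocation via Unsafe, where an entry is just an address (a long) with no on-heap reference. The entry layout here is made up for the example.

{code:java}
import sun.misc.Unsafe;
import java.lang.reflect.Field;

final class OffHeapEntries {
    private static final Unsafe UNSAFE = loadUnsafe();

    // Entry layout (illustrative): 8-byte key hash, 4-byte length, payload.
    // The returned address is invisible to the garbage collector.
    static long allocateEntry(long keyHash, byte[] value) {
        long address = UNSAFE.allocateMemory(12 + value.length);
        UNSAFE.putLong(address, keyHash);
        UNSAFE.putInt(address + 8, value.length);
        UNSAFE.copyMemory(value, Unsafe.ARRAY_BYTE_BASE_OFFSET,
                          null, address + 12, value.length);
        return address;
    }

    static void freeEntry(long address) {
        UNSAFE.freeMemory(address);
    }

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }
}
{code}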

You could coalesce all the contended fields for each instance (stats, lock field, LRU head)
into a single cache line and then rely on a single barrier when releasing the coarse-grained
lock. The fine-grained locking and CASing result in several pipeline stalls, because the
memory barriers implicit in each one require the store buffers to drain. There may even be
a suitable off-heap map implementation out there already.
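
A hedged sketch of what that could look like, with illustrative field offsets: one CAS to acquire, plain stores while holding the lock, and a single ordered (release) store to publish everything at unlock.

{code:java}
import sun.misc.Unsafe;
import java.lang.reflect.Field;

final class SegmentHeader {
    private static final Unsafe UNSAFE = loadUnsafe();
    // All contended fields packed into one 64-byte cache line.
    private static final int LOCK = 0, HITS = 8, LRU_HEAD = 16;

    private final long base = UNSAFE.allocateMemory(64);

    void lock() {
        // Acquire: one CAS on the lock word. Sketch only; a real version
        // would back off instead of spinning.
        while (!UNSAFE.compareAndSwapLong(null, base + LOCK, 0L, 1L))
            ;
    }

    // Plain stores while holding the lock: no per-field barriers needed.
    void recordHit(long newLruHead) {
        UNSAFE.putLong(base + HITS, UNSAFE.getLong(base + HITS) + 1);
        UNSAFE.putLong(base + LRU_HEAD, newLruHead);
    }

    void unlock() {
        // Release: one ordered store publishes all the plain writes above.
        UNSAFE.putOrderedLong(null, base + LOCK, 0L);
    }

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }
}
{code}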


> Serializing Row cache alternative (Fully off heap)
> --------------------------------------------------
>
>                 Key: CASSANDRA-7438
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>         Environment: Linux
>            Reporter: Vijay
>            Assignee: Vijay
>              Labels: performance
>             Fix For: 3.0
>
>         Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is partially off heap; keys are still stored in the JVM heap
> as BB.
> * There are higher GC costs for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better results, but
> this requires careful tuning.
> * Memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off heap and
> use JNI to interact with the cache. We might want to ensure that the new implementation
> matches the existing APIs (ICache), and the implementation needs to have safe memory
> access, low memory overhead, and as few memcpys as possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
