cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format
Date Wed, 02 Mar 2016 14:07:18 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175634#comment-15175634 ]

Jonathan Ellis edited comment on CASSANDRA-11206 at 3/2/16 2:06 PM:
--------------------------------------------------------------------

bq. For partitions < 64k (partitions without an IndexInfo object) we could skip the indirection
during reads via RowIndexEntry at all by extending the IndexSummary and directly store the
offset into the data file

Since the idea here is to do something simple that we can be confident about shipping in 3.6
if CASSANDRA-9754 isn't ready, let's avoid making changes to the on disk layout.

To clarify for others following along,

bq. Remove IndexInfo from the key cache (not from the index file on disk, of course)

This sounds scary but it's core to the goal here: if we're going to support large partitions,
we can't afford the overhead either of keeping the entire summary on heap, or of reading it
from disk in the first place.  (If we're reading a 1KB row, then reading 2MB of summary first
on a cache miss is a huge overhead.)  Moving the key cache off heap (CASSANDRA-9738) would
have helped with the first but not the second.

So one approach is to go back to the old strategy of only caching the partition key location,
and then go through the index bsearch using the offsets map every time.  For small partitions
this will be trivial and I hope negligible to the performance story vs the current cache.
 (If not, we can look at a hybrid strategy, but I'd like to avoid that complexity if possible.)
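To make the "bsearch using the offsets map" idea concrete, here is a minimal sketch in Python (not Cassandra's Java). The entry layout and the names `serialize_index`, `read_entry`, and `find_block` are hypothetical and do not reflect the real on-disk format; the only point is that, given a table of entry offsets, a lookup deserializes roughly log2(N) entries rather than all N.

```python
import struct

def serialize_index(entries):
    """Pack (first_name, last_name, data_offset) tuples into one blob.

    Hypothetical layout: two length-prefixed names followed by an 8-byte
    data offset. Returns the blob plus the start offset of every entry.
    """
    blob = bytearray()
    offsets = []
    for first, last, data_offset in entries:
        offsets.append(len(blob))
        for name in (first, last):
            blob += struct.pack(">H", len(name)) + name
        blob += struct.pack(">Q", data_offset)
    return bytes(blob), offsets

def read_entry(blob, pos):
    """Deserialize the single entry that starts at byte position pos."""
    names = []
    for _ in range(2):
        (n,) = struct.unpack_from(">H", blob, pos)
        pos += 2
        names.append(blob[pos:pos + n])
        pos += n
    (data_offset,) = struct.unpack_from(">Q", blob, pos)
    return names[0], names[1], data_offset

def find_block(blob, offsets, key):
    """Binary search over the offsets table, deserializing only probed entries.

    Returns (data_offset or None, number of entries deserialized).
    """
    lo, hi = 0, len(offsets) - 1
    deserialized = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        first, last, data_offset = read_entry(blob, offsets[mid])
        deserialized += 1
        if key < first:
            hi = mid - 1
        elif key > last:
            lo = mid + 1
        else:
            return data_offset, deserialized
    return None, deserialized

# 1024 index entries, each standing in for one 64KB range of rows.
entries = [(b"row%06d" % (i * 100), b"row%06d" % (i * 100 + 99), i * 65536)
           for i in range(1024)]
blob, offsets = serialize_index(entries)
offset, touched = find_block(blob, offsets, b"row012345")
```

With 1024 entries the search touches at most 11 of them, which is why the cost should stay small for small partitions and grow only logarithmically for large ones.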

bq. what I was thinking was that the key cache instead of storing a copy of the RIE it would
store an offset into the index that is the location of the RIE. Then the RIE could be accessed
off heap via a memory mapping without doing any allocations or copies
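As a rough illustration of that proposal, the sketch below keeps only an offset per key in the cache and reads the entry in place from a memory-mapped index file. It is written in Python, so it cannot show the allocation-free access that the Java version would aim for; the fixed-width entry format (`ENTRY`), the `key_cache` dict, and `lookup` are all assumptions made for the example, not Cassandra structures.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width entry: data-file position (8 bytes), IndexInfo count (4 bytes).
ENTRY = struct.Struct(">QI")

# Write a stand-in index file, recording where each key's entry starts.
key_cache = {}  # partition key -> byte offset of its entry in the index file
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i, key in enumerate([b"k1", b"k2", b"k3"]):
        key_cache[key] = f.tell()
        f.write(ENTRY.pack(i * 4096, i + 1))
    index_path = f.name

def lookup(index_map, key):
    """Follow the cached offset and decode the entry directly from the mapping."""
    pos = key_cache.get(key)
    if pos is None:
        return None  # cache miss: a real reader would fall back to the index summary
    return ENTRY.unpack_from(index_map, pos)

with open(index_path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as index_map:
    entry_k2 = lookup(index_map, b"k2")
    entry_missing = lookup(index_map, b"nope")

os.remove(index_path)
```

The cache holds one small integer per key regardless of partition size; everything else stays in the mapped file.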

I was thinking that even the offsets alone for a 4GB partition are going to be 256KB, so we
don't want to cache the entire offsets map.  But the other side there is that if you have
a bunch of 4GB partitions you won't have very many of them.  16TB of data would be 1GB of
offsets which is within the bounds of reasonable when off heap.  And your approach may require
fewer logic changes than the one above, since we're still "caching" the entire summary, sort
of; only adding an extra indirection to read the IndexInfo entries.  So that might well be
simpler.
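The figures above can be checked with a few lines of Python. This assumes one 4-byte offset per IndexInfo entry (an assumption, though it is the width that reproduces the 256KB figure) and the default 64KB range from the issue description.

```python
KB, MB, GB, TB = 1024, 1024**2, 1024**3, 1024**4

INDEX_INTERVAL = 64 * KB  # default size of the row range covered by one IndexInfo
OFFSET_WIDTH = 4          # assumption: one 4-byte offset per IndexInfo entry

def offsets_bytes(partition_size):
    """Size of the offsets map for a partition of the given size."""
    entries = partition_size // INDEX_INTERVAL
    return entries * OFFSET_WIDTH

per_partition = offsets_bytes(4 * GB)   # offsets for one 4GB partition
partitions = (16 * TB) // (4 * GB)      # number of 4GB partitions in 16TB
total = partitions * per_partition      # offsets for all of them

print(per_partition // KB, "KB per partition")
print(total // GB, "GB total for", partitions, "partitions")
```

A 4GB partition has 65,536 index entries, so its offsets take 256KB; 16TB holds 4,096 such partitions, for 1GB of offsets in total.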


was (Author: jbellis):
bq. For partitions < 64k (partitions without an IndexInfo object) we could skip the indirection
during reads via RowIndexEntry at all by extending the IndexSummary and directly store the
offset into the data file

Since the idea here is to do something simple that we can be confident about shipping in 3.6
if CASSANDRA-9754 isn't ready, let's avoid making changes to the on disk layout, i.e., your
Plan B.

To clarify for others following along,

bq. Remove IndexInfo from the key cache (not from the index file on disk, of course)

This sounds scary but it's core to the goal here: if we're going to support large partitions,
we can't afford the overhead either of keeping the entire summary on heap, or of reading it
from disk in the first place.  (If we're reading a 1KB row, then reading 2MB of summary first
on a cache miss is a huge overhead.)  Moving the key cache off heap (CASSANDRA-9738) would
have helped with the first but not the second.

So one approach is to go back to the old strategy of only caching the partition key location,
and then go through the index bsearch using the offsets map every time.  For small partitions
this will be trivial and I hope negligible to the performance story vs the current cache.
 (If not, we can look at a hybrid strategy, but I'd like to avoid that complexity if possible.)

bq. what I was thinking was that the key cache instead of storing a copy of the RIE it would
store an offset into the index that is the location of the RIE. Then the RIE could be accessed
off heap via a memory mapping without doing any allocations or copies

I was thinking that even the offsets alone for a 4GB partition are going to be 256KB, so we
don't want to cache the entire offsets map.  But the other side there is that if you have
a bunch of 4GB partitions you won't have very many of them.  16TB of data would be 1GB of
offsets which is within the bounds of reasonable when off heap.  And your approach may require
fewer logic changes than the one above, since we're still "caching" the entire summary, sort
of; only adding an extra indirection to read the IndexInfo entries.  So that might well be
simpler.

> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>
>                 Key: CASSANDRA-11206
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within each partition
> of every 64KB (by default) range of rows.  To find a row, we binary search this sample, then
> scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, we deserialize
> the entire set of IndexInfo, which both creates a lot of GC overhead (as noted in CASSANDRA-9754)
> and is non-negligible i/o activity (relative to reading a single 64KB row range) as partitions
> get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform the IndexInfo
> bsearch while only deserializing IndexInfo that we need to compare against, i.e. log(N) deserializations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
