cassandra-commits mailing list archives

From "Michael Kjellman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
Date Fri, 28 Aug 2015 17:46:47 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720314#comment-14720314
] 

Michael Kjellman commented on CASSANDRA-9754:
---------------------------------------------

We've had a bunch of discussions around this over the past few weeks and I think I finally
have a grasp of the entire issue. The issue here is that large CQL partitions (4GB, 75GB, etc.)
end up with large (200MB+) serialized indexes. On a cache miss, the current logic deserializes
the entire index and splits it into IndexInfo objects, each of which contains 2 ByteBuffers
(first and last key) and 2 longs (offset and width). This means we create a very large
number of small, most likely very short-lived objects that become garbage on the heap, and
with high probability they will be evicted from the cache anyway. On disk, we lay the
objects out on the assumption that the entire index will always be deserialized when it's
needed and never accessed from disk without deserializing the entire thing.
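
To make the allocation pattern concrete, here is a minimal sketch of the eager path described
above. The class names (IndexInfoSketch, EagerIndexReader) and the entry layout are assumptions
made up for illustration; this is not Cassandra's actual IndexInfo class or on-disk format.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for IndexInfo: 2 ByteBuffers and 2 longs per entry.
final class IndexInfoSketch {
    final ByteBuffer firstName;
    final ByteBuffer lastName;
    final long offset;
    final long width;

    IndexInfoSketch(ByteBuffer firstName, ByteBuffer lastName, long offset, long width) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.offset = offset;
        this.width = width;
    }
}

final class EagerIndexReader {
    // Assumed (illustrative) entry layout:
    // [short len][first name][short len][last name][long offset][long width]
    static List<IndexInfoSketch> deserializeAll(ByteBuffer serialized) {
        ByteBuffer in = serialized.duplicate(); // leave the caller's position untouched
        List<IndexInfoSketch> entries = new ArrayList<>();
        while (in.hasRemaining()) {
            ByteBuffer first = readWithShortLength(in);
            ByteBuffer last = readWithShortLength(in);
            long offset = in.getLong();
            long width = in.getLong();
            // Per entry: one IndexInfoSketch, two ByteBuffers and two byte[] land on the heap.
            entries.add(new IndexInfoSketch(first, last, offset, width));
        }
        return entries;
    }

    private static ByteBuffer readWithShortLength(ByteBuffer in) {
        int length = in.getShort() & 0xFFFF;
        byte[] bytes = new byte[length];
        in.get(bytes);
        return ByteBuffer.wrap(bytes);
    }
}
```

Every entry is materialized even if the read only needs one of them, which is where the
garbage comes from.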

I think the only option here is to make a change to the actual way we lay things out on disk.
Two options would be a Skip List or a B+ Tree where we mmap the pages of the index and try
to do something intelligent to avoid actually bringing objects onto the heap as much as possible.
The downside of a B+ Tree would be the overhead of creating it on flush, and lookups are log(n)
(although the current code is log(n) too, as we binary search over the objects we deserialized
into the List; we just do it on the heap).
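
As a rough illustration of "search without bringing objects onto the heap", the sketch below
binary searches a serialized index in place using only absolute reads on a ByteBuffer (which
could be a MappedByteBuffer obtained from FileChannel.map). The layout is an assumption invented
for this example: a count, a table of fixed-size entry positions, then variable-length entries.
It is neither a Skip List nor a B+ Tree and is not a proposed format.

```java
import java.nio.ByteBuffer;

final class InPlaceIndexSearch {
    // Assumed (hypothetical) layout:
    //   [int count][int entryPos * count][entries...]
    //   entry = [short nameLen][last name bytes][long offset][long width]
    // Returns the offset of the first entry whose last name is >= target, or -1 if none.
    static long findOffset(ByteBuffer index, byte[] target) {
        int count = index.getInt(0);
        int lo = 0;
        int hi = count - 1;
        long result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int entryPos = index.getInt(4 + 4 * mid);
            int nameLen = index.getShort(entryPos) & 0xFFFF;
            if (compareUnsigned(index, entryPos + 2, nameLen, target) >= 0) {
                result = index.getLong(entryPos + 2 + nameLen);
                hi = mid - 1; // keep looking for an earlier match
            } else {
                lo = mid + 1;
            }
        }
        return result;
    }

    // Compares bytes in the buffer against target as unsigned bytes, without allocating.
    static int compareUnsigned(ByteBuffer buf, int pos, int len, byte[] target) {
        int n = Math.min(len, target.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(buf.get(pos + i) & 0xFF, target[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(len, target.length);
    }
}
```

The lookup is still log(n), but it touches only the entries it visits and allocates nothing
per entry. A real comparison would have to use the column comparator rather than raw bytes.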

The only references I could find to B+ Trees in this project were CASSANDRA-6709 and CASSANDRA-7447.
I don't think we need to reinvent the wheel here and entirely change the storage format, but
I think if we use a targeted data structure *just* for the index we might get something
nice. The question would be what impact this will have on "normal" rows/partitions.

Any input on other on-disk data structures we might want to consider would be great.

The other issue is that I'd love to be able to cache only the column that we got a hit on.
Unfortunately that might be difficult. Today we binary search over the entire
List<IndexInfo> to find hits. If you get a column that's in between the first and last
name, you return the left node and go and check, and hopefully it's actually there. As we essentially
have interval-ish objects here along with non-fixed-length values, it does make things a bit
more fun.
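
A minimal sketch of that kind of interval lookup, to show why a hit on the index is only a
candidate. The IndexBlock and IndexSearch names are hypothetical, and String names stand in
for the real column comparator to keep the example short.

```java
import java.util.List;

// Hypothetical index entry: covers names in [first, last] for one block of the partition.
final class IndexBlock {
    final String first;
    final String last;
    final long offset;
    final long width;

    IndexBlock(String first, String last, long offset, long width) {
        this.first = first;
        this.last = last;
        this.offset = offset;
        this.width = width;
    }
}

final class IndexSearch {
    // Binary search for the first block whose last name is >= target.
    // Returns -1 if target sorts after every block.
    static int candidateBlock(List<IndexBlock> blocks, String target) {
        int lo = 0;
        int hi = blocks.size() - 1;
        int result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (blocks.get(mid).last.compareTo(target) >= 0) {
                result = mid;
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return result;
    }

    // True only if target falls inside the block's [first, last] range. Even then the
    // column may be absent; the block itself still has to be read and checked.
    static boolean rangeCovers(IndexBlock block, String target) {
        return block.first.compareTo(target) <= 0 && block.last.compareTo(target) >= 0;
    }
}
```

Because the search lands on a block rather than on a column, caching "just the column we hit"
would need a key other than the index entry itself.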

> Make index info heap friendly for large CQL partitions
> ------------------------------------------------------
>
>                 Key: CASSANDRA-9754
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9754
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: Michael Kjellman
>            Priority: Minor
>
>  Looking at a heap dump of a 2.0 cluster, I found that the majority of the objects are IndexInfo
> and its ByteBuffers. This is especially bad in endpoints with large CQL partitions. If a CQL
> partition is say 6,4GB, it will have 100K IndexInfo objects and 200K ByteBuffers. This will
> create a lot of churn for GC. Can this be improved by not creating so many objects?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
