cassandra-commits mailing list archives

From "Michael Kjellman (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
Date Wed, 08 Jun 2016 19:59:21 GMT


Michael Kjellman commented on CASSANDRA-9754:

Alright, happy to finally be able to write this. :) I'm attaching a v1 diff containing Birch!

h4. Why is it named Birch?
B+ Tree -> Trees that start with the letter B -> Birch... get it? haha...

h4. Description
Birch is a B+ish/inspired tree aimed at improving the performance of the SSTable index in
Cassandra (especially with large partitions).

The existing implementation scales poorly with the size of the index/row, as the entire index
must be deserialized onto the heap even to look up a single element. This puts significant
pressure on the heap, where one read of a large partition will cause, at minimum, a long
painful CMS GC pause or -- in the worst case -- an OOM.
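To make the heap cost concrete, here is a minimal sketch of the pre-Birch read path's behavior; the class and field names are illustrative stand-ins, not Cassandra's actual code, and the 64KB sampling interval is the default index granularity mentioned below:

```java
import java.util.ArrayList;
import java.util.List;

public class EagerIndexSketch {
    // Stand-in for Cassandra's IndexInfo: one entry per ~64KB of the partition.
    record IndexInfo(long offset, long width) {}

    // The old read path materializes every index entry for the partition on
    // the heap before it can search for the single one it actually needs.
    static List<IndexInfo> deserializeAll(long partitionBytes, long sampleEvery) {
        List<IndexInfo> entries = new ArrayList<>();
        for (long off = 0; off < partitionBytes; off += sampleEvery) {
            entries.add(new IndexInfo(off, sampleEvery));
        }
        return entries;
    }

    public static void main(String[] args) {
        long partition = 6_400L * 1024 * 1024;   // a 6.4GB partition
        long sample = 64 * 1024;                 // 64KB index interval
        List<IndexInfo> all = deserializeAll(partition, sample);
        System.out.println(all.size());          // ~100K entries on the heap
    }
}
```

Even before any of those entries is used, roughly a hundred thousand objects (plus their backing ByteBuffers) are live on the heap for a single read, which is the GC churn the original report describes.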

The Birch implementation has a predictable, fixed cost for reads at the expense of additional
on-disk overhead for the tree itself, while retaining the same O(log(n)) complexity as the
existing implementation. Every row added to the SSTable is also added to the primary index.
If the size of the row is greater than 64KB we build an index (otherwise we just encode the
row's position in the SSTable). All entries encoded into the index are page aligned and padded
to the nearest boundary (4096 bytes by default). Every segment can be marked as either
internally padded/aligned along a boundary or non-padded/non-aligned (up to 2GB). Birch
indexes are split into 4096-byte nodes (both leaf and inner). Keys are encoded inside the node
itself unless they exceed half the node size; in that case, the first node-size/2 bytes are
encoded into the node itself along with the offset of the remaining bytes in the overflow
page. This gives the tree predictable, fixed performance while still accommodating
variable-length keys/elements.
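The overflow rule above can be sketched roughly as follows. This is a hedged illustration under assumed encodings (a 2-byte length header and an 8-byte overflow offset), not the patch's actual serialization format:

```java
import java.nio.ByteBuffer;

public class BirchKeySketch {
    static final int NODE_SIZE = 4096;           // default page-aligned node size
    static final int MAX_INLINE = NODE_SIZE / 2; // keys beyond this spill over

    // Encode a key into a node; returns the number of bytes spilled to the
    // overflow page (0 if the key fit inline).
    static int encodeKey(ByteBuffer node, ByteBuffer overflow, byte[] key) {
        if (key.length <= MAX_INLINE) {
            node.putShort((short) key.length);   // assumed 2-byte length header
            node.put(key);
            return 0;
        }
        // Inline only the first NODE_SIZE/2 bytes, then record where the rest
        // lives in the overflow page so node size stays bounded and fixed.
        node.putShort((short) MAX_INLINE);
        node.put(key, 0, MAX_INLINE);
        node.putLong(overflow.position());       // assumed 8-byte overflow offset
        int spilled = key.length - MAX_INLINE;
        overflow.put(key, MAX_INLINE, spilled);
        return spilled;
    }

    public static void main(String[] args) {
        ByteBuffer node = ByteBuffer.allocate(NODE_SIZE);
        ByteBuffer overflow = ByteBuffer.allocate(64 * 1024);

        System.out.println(encodeKey(node, overflow, new byte[100]));  // fits inline
        System.out.println(encodeKey(node, overflow, new byte[3000])); // tail spills
    }
}
```

Because a node never holds more than node-size/2 bytes of any single key, a lookup touches a bounded number of fixed-size pages regardless of key length, which is where the predictable read cost comes from.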

h4. Notes on v1 of the diff (in no particular order)
 * I broke the changes into two logical parts: the first abstracts out the existing Index
implementation and adds no new logic; the second adds an IndexedEntry implementation backed
by a Birch tree.
 * The attached v1 patch is written for 2.1. I have already started rebasing the patch onto
trunk and hope to finish that shortly and post the trunk-based patch
 * There's some high level Javadoc documentation in BirchWriter and PageAlignedWriter on the
layout of the tree on disk, serialization and deserialization paths, and higher level goals
of the classes
 * The next steps are to start getting feedback from reviewers and the community. I have also
profiled the tree itself, but profiling the tree integrated into the full stack and optimizing
non-performant code paths is next (after the immediate task of rebasing the change onto trunk)
 * There are still a few TODOs I've left regarding handling backwards compatibility, parts
of the code I expect might be non-performant, and things I'd like to discuss regarding the
"correct" implementation/behavior etc
 * I have a few unit tests that still don't pass and still need to be root-caused... I've
taken the approach this entire time that the unit tests shouldn't be modified just to make
them pass, so there are still a few behavioral regressions I've accidentally introduced. The
currently failing tests:
 ** AutoSavingCacheTest
 ** SecondaryIndexTest
 ** BatchlogManagerTest
 ** KeyCacheTest
 ** ScrubTest
 ** IndexSummaryManagerTest
 ** LegacySSTableTest
 ** MultiSliceTest
 * I need to write a unit test covering reads of the legacy/existing primary index implementation
 * By the nature of the index's role in the database, the unit test coverage is actually pretty
extensive, as any read or write touches the index in some capacity

I'll be giving a talk at NGCC tomorrow (Thursday the 9th) to go over the high level design
I ended up with and considerations I had to take into account once I actually got deep inside
this part of the code.

Looking forward to feedback!

> Make index info heap friendly for large CQL partitions
> ------------------------------------------------------
>                 Key: CASSANDRA-9754
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: Michael Kjellman
>            Priority: Minor
>  Looking at a heap dump of a 2.0 cluster, I found that the majority of the objects are IndexInfo
> and its ByteBuffers. This is especially bad in endpoints with large CQL partitions. If a CQL
> partition is, say, 6.4GB, it will have 100K IndexInfo objects and 200K ByteBuffers. This will
> create a lot of churn for GC. Can this be improved by not creating so many objects?
