cassandra-commits mailing list archives

From "Michael Kjellman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions
Date Tue, 09 Feb 2016 20:54:18 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139735#comment-15139735
] 

Michael Kjellman commented on CASSANDRA-9754:
---------------------------------------------

[~jkrupan] From experience, ~2GB is the maximum partition size I'd recommend targeting at the moment.

The current implementation creates an IndexInfo entry for every 64KB of data in the partition (the default, which I highly doubt anyone actually changes). Each IndexInfo object contains the offset into the sstable where the partition/row starts, the length to read, and the name. These IndexInfo objects are placed into a list and binary searched to find the name closest to the query; we then seek to that offset in the sstable and start reading the actual data.
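
The lookup described above can be sketched roughly as follows. This is a simplified illustration, not Cassandra's actual IndexInfo class; the field and method names here are assumptions chosen for clarity:

```java
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-in for the per-64KB index entries
// described above; names and fields are illustrative only.
final class IndexEntry implements Comparable<IndexEntry> {
    final ByteBuffer name;   // name at the start of this 64KB block
    final long offset;       // offset into the sstable data file
    final long width;        // number of bytes to read from that offset

    IndexEntry(ByteBuffer name, long offset, long width) {
        this.name = name;
        this.offset = offset;
        this.width = width;
    }

    @Override
    public int compareTo(IndexEntry other) {
        return name.compareTo(other.name);
    }
}

final class IndexLookup {
    // Binary search the in-memory entry list for the block whose starting
    // name is at or immediately before the queried name.
    static IndexEntry floorEntry(List<IndexEntry> entries, ByteBuffer queryName) {
        int idx = Collections.binarySearch(entries,
                new IndexEntry(queryName, -1, -1));
        if (idx < 0)
            idx = -idx - 2;   // insertion point minus one gives the floor
        return idx < 0 ? null : entries.get(idx);
    }
}
```

The returned entry's offset is where the read of the actual data would begin.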


The issue that makes things so bad with large partitions is that when doing an indexed read across a given partition, the entire list of IndexInfo objects is currently serialized one after another into the index file on disk. To use it, we have to read the entire thing off disk, deserialize every IndexInfo object, place them all into a list, and then binary search across it. This creates a ton of small objects very quickly, which are likely to be promoted and thus create a lot of GC pressure.

If you take the average size of each column in a row, you can figure out how many index entry objects will be created (one for every 64KB of data in that partition). I've found that once the IndexInfo array contains > 300k objects, things get bad.
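
The back-of-envelope math above is just partition size divided by the 64KB index block size. A minimal sketch, with an assumed helper name:

```java
// Hypothetical estimate of IndexInfo object count for a partition,
// assuming the default 64KB column index block size described above.
final class IndexEntryEstimate {
    static final long INDEX_BLOCK_BYTES = 64 * 1024;  // default 64KB

    static long entriesFor(long partitionBytes) {
        // Round up: any trailing partial block still gets an entry.
        return (partitionBytes + INDEX_BLOCK_BYTES - 1) / INDEX_BLOCK_BYTES;
    }
}
```

For example, the ~2GB recommended ceiling works out to roughly 32k entries, while a 6.4GB partition (as in the issue description below) lands around the 100k mark.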

The implementation I'm *almost* done with has the same big-O complexity (O(log n)) as the current implementation, but the index is instead backed by page-cache-aligned mmap'ed segments (B+ tree-ish, with an overflow page implementation similar to SQLite's). This means we can now walk the IndexEntry objects and bring onto the heap only the 4KB chunks actually involved in the binary search for the correct entry.
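
The core idea can be illustrated with a much simpler structure than the actual tree: fixed-width records searched in place over a memory-mapped region, so each probe faults in only the pages it touches. This sketch is my own simplification under assumed record layout (8-byte key plus 8-byte offset), not the patch itself:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: binary search over fixed-width records in an
// mmap'ed index file, instead of deserializing every entry onto the heap.
final class MappedIndexSearch {
    static final int RECORD_BYTES = 16;  // assumed: 8-byte key + 8-byte offset

    // Each probe reads directly from the mapped region, so only the
    // pages actually touched by the search are brought into memory.
    static long lookup(MappedByteBuffer map, long key) {
        int lo = 0, hi = map.capacity() / RECORD_BYTES - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long midKey = map.getLong(mid * RECORD_BYTES);
            if (midKey < key)      lo = mid + 1;
            else if (midKey > key) hi = mid - 1;
            else return map.getLong(mid * RECORD_BYTES + 8);  // payload offset
        }
        return -1;  // not found
    }

    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }
}
```

A real B+ tree adds interior pages and overflow handling, but the heap behavior is the same: the search cost stays O(log n) while only the touched pages are resident.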

The tree itself is finished and heavily tested. I've also already abstracted out the index implementation in Cassandra, so that the current implementation and the new one I'll be proposing and contributing here can be dropped in easily, without special-casing code all over the place to check the SSTable descriptor for which index implementation was used. All the unit tests and dtests pass after my abstraction work. The final thing I'm almost done with is refactoring my Page Cache Aligned/Aware File Writer to be SegmentedFile aware (and making sure all the math works when the offset into the actual file differs depending on the segment, etc.).

> Make index info heap friendly for large CQL partitions
> ------------------------------------------------------
>
>                 Key: CASSANDRA-9754
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9754
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: Michael Kjellman
>            Priority: Minor
>
>  Looking at a heap dump of a 2.0 cluster, I found that the majority of the objects are IndexInfo
> and its ByteBuffers. This is especially bad in endpoints with large CQL partitions. If a CQL
> partition is, say, 6.4GB, it will have 100K IndexInfo objects and 200K ByteBuffers. This will
> create a lot of churn for GC. Can this be improved by not creating so many objects?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
