cassandra-commits mailing list archives

From "Oleg Anastasyev (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6446) Faster range tombstones on wide partitions
Date Thu, 23 Jan 2014 10:14:41 GMT


Oleg Anastasyev commented on CASSANDRA-6446:

One more bug found in v2 and, probably, in v3: RangeTombstoneList.iterator does not search
for the appropriate tombstones if from or till is given as an empty filter bound in SliceQueryFilter.
For v2 we fixed it with:
diff --git a/src/java/org/apache/cassandra/db/ b/src/java/org/apache/cassandra/db/
index 35af5f8..eae65e9 100644
--- a/src/java/org/apache/cassandra/db/
+++ b/src/java/org/apache/cassandra/db/
@@ -353,13 +353,13 @@ public class RangeTombstoneList implements Iterable<RangeTombstone>

     public Iterator<RangeTombstone> iterator(ByteBuffer from, ByteBuffer till)
     {
-        int startIdx = searchInternal(from, 0);
+        int startIdx = from.equals(ByteBufferUtil.EMPTY_BYTE_BUFFER) ? 0 : searchInternal(from, 0);
         final int start = startIdx < 0 ? -startIdx-1 : startIdx;

         if (start >= size)
             return Iterators.<RangeTombstone>emptyIterator();

-        int finishIdx = searchInternal(till, start);
+        int finishIdx = till.equals(ByteBufferUtil.EMPTY_BYTE_BUFFER) ? size : searchInternal(till, start);
         // if stopIdx is the first range after 'till' we care only until the previous range
         final int finish = finishIdx < 0 ? -finishIdx-2 : finishIdx;
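The effect of the guard can be illustrated with a standalone sketch. Everything here is simplified for illustration: tombstone starts are plain ints, a null bound stands in for ByteBufferUtil.EMPTY_BYTE_BUFFER, and Arrays.binarySearch plays the role of searchInternal (same convention: exact-match index, or -insertionPoint-1 when absent).

```java
import java.util.Arrays;

// Simplified stand-in for RangeTombstoneList: tombstone starts are ints,
// and a null bound plays the role of ByteBufferUtil.EMPTY_BYTE_BUFFER.
public class RangeSliceSketch {
    static final int[] STARTS = {10, 20, 30, 40};
    static final int SIZE = STARTS.length;

    // Binary search standing in for searchInternal: returns the index of an
    // exact match, or (-insertionPoint - 1) when the key is absent.
    static int searchInternal(int key) {
        return Arrays.binarySearch(STARTS, key);
    }

    // Mirrors the patched iterator(from, till): an empty ("null") bound means
    // the slice is unbounded on that side, so the search is skipped entirely.
    static int[] sliceBounds(Integer from, Integer till) {
        int startIdx = (from == null) ? 0 : searchInternal(from);
        int start = startIdx < 0 ? -startIdx - 1 : startIdx;

        int finishIdx = (till == null) ? SIZE : searchInternal(till);
        // If finishIdx points at the first range after 'till', only the
        // previous range is relevant.
        int finish = finishIdx < 0 ? -finishIdx - 2 : finishIdx;
        return new int[]{start, finish};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sliceBounds(null, null))); // [0, 4] - whole list
        System.out.println(Arrays.toString(sliceBounds(20, 30)));     // [1, 2] - exact bounds
        System.out.println(Arrays.toString(sliceBounds(15, 35)));     // [1, 2] - between entries
    }
}
```

Without the null guard, an empty (unbounded) from would be searched like an ordinary key and could land the slice at the wrong end of the list, which is the bug described above.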



> Faster range tombstones on wide partitions
> ------------------------------------------
>                 Key: CASSANDRA-6446
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Oleg Anastasyev
>            Assignee: Oleg Anastasyev
>             Fix For: 2.1
>         Attachments: 0001-6446-write-path-v2.txt, 0002-6446-Read-patch-v2.txt, 6446-Read-patch-v3.txt,
> 6446-write-path-v3.txt, RangeTombstonesReadOptimization.diff, RangeTombstonesWriteOptimization.diff
> Having wide CQL rows (~1M in a single partition) and after deleting some of them, we found
> inefficiencies in the handling of range tombstones on both the write and read paths.
> I attached 2 patches here, one for the write path (RangeTombstonesWriteOptimization.diff)
> and another for the read path (RangeTombstonesReadOptimization.diff).
> On the write path, when some CQL rows are deleted by primary key, each deletion is
> represented by a range tombstone. On putting this tombstone into the memtable, the original
> code takes all columns of the partition from the memtable and checks DeletionInfo.isDeleted
> in a brute-force loop to decide whether each column should stay in the memtable or was
> deleted by the new tombstone. Needless to say, the more columns you have in a partition,
> the slower deletions become, heating your CPU with brute-force range tombstone checks.
> The RangeTombstonesWriteOptimization.diff patch, for partitions with more than 10000 columns,
> loops over the tombstones instead and checks the existence of columns for each of them. It also
> copies the whole memtable range tombstone list only if there are changes to be made there (the
> original code copies the range tombstone list on every write).
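As a rough illustration of the strategy change on the write path (hypothetical simplified types, not Cassandra's actual Memtable code: column names are ints in a sorted map, a tombstone is an inclusive [start, end] range):

```java
import java.util.*;

// Hypothetical simplified model: column names are ints in a sorted map and a
// tombstone is an inclusive [start, end] range. Not Cassandra's actual types.
public class TombstoneApplySketch {
    static class Range {
        final int start, end;
        Range(int start, int end) { this.start = start; this.end = end; }
    }

    // Original approach: for every column, scan the tombstone list.
    // Cost grows with (columns * tombstones) even when few columns are covered.
    static int removeByColumns(NavigableMap<Integer, String> columns, List<Range> tombstones) {
        int removed = 0;
        for (Iterator<Integer> it = columns.keySet().iterator(); it.hasNext(); ) {
            int name = it.next();
            for (Range t : tombstones) {
                if (name >= t.start && name <= t.end) { it.remove(); removed++; break; }
            }
        }
        return removed;
    }

    // Patched idea for wide partitions: loop over the tombstones instead, and use
    // the sorted column index to touch only the columns each tombstone covers.
    static int removeByTombstones(NavigableMap<Integer, String> columns, List<Range> tombstones) {
        int removed = 0;
        for (Range t : tombstones) {
            NavigableMap<Integer, String> covered = columns.subMap(t.start, true, t.end, true);
            removed += covered.size();
            covered.clear(); // removes the covered columns from the backing map
        }
        return removed;
    }
}
```

With a partition of 1M columns and a handful of tombstones, the second strategy does a few log-time lookups instead of scanning the tombstone list a million times.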
> On the read path, the original code scans the whole range tombstone list of a partition to
> match sstable columns to their range tombstones. The RangeTombstonesReadOptimization.diff
> patch scans only the necessary range of tombstones, according to the filter used for the read.
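A hypothetical sketch of the read-path idea: since both the queried columns and the tombstone list are sorted, a single forward cursor over the tombstones visits only the ranges that overlap the slice, instead of rescanning the whole list for every column (simplified int-based types, not Cassandra's actual code):

```java
import java.util.List;

// Simplified model: sorted int column names matched against a sorted list of
// inclusive [start, end] tombstone ranges using one forward-only cursor.
public class ReadPathSketch {
    static class Range {
        final int start, end;
        Range(int start, int end) { this.start = start; this.end = end; }
    }

    static boolean[] deleted(int[] sortedColumns, List<Range> sortedTombstones) {
        boolean[] out = new boolean[sortedColumns.length];
        int t = 0; // cursor into the tombstone list; only moves forward
        for (int i = 0; i < sortedColumns.length; i++) {
            int name = sortedColumns[i];
            // Skip tombstones that end before this column; they can never
            // cover any later (larger) column either.
            while (t < sortedTombstones.size() && sortedTombstones.get(t).end < name)
                t++;
            out[i] = t < sortedTombstones.size() && sortedTombstones.get(t).start <= name;
        }
        return out;
    }
}
```

The cursor makes the matching cost proportional to the columns in the slice plus the tombstones it overlaps, rather than columns times the full tombstone list.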

This message was sent by Atlassian JIRA
