cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shawn Walker (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-8542) Make get_range_slices and related succeed most of the time on tombstone heavy column families
Date Sun, 28 Dec 2014 20:41:13 GMT
Shawn Walker created CASSANDRA-8542:
---------------------------------------

             Summary: Make get_range_slices and related succeed most of the time on tombstone
heavy column families
                 Key: CASSANDRA-8542
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8542
             Project: Cassandra
          Issue Type: Wish
          Components: Core
            Reporter: Shawn Walker
            Priority: Minor


I've got a ColumnFamily in which some rows end up being used in a queue-like fashion, and
where many tombstones tend to pile up at the left end of the row.  As a result, I run into
{{TombstoneOverwhelming}} (e.g. CASSANDRA-6117) when trying to list the columns of said rows.

Please don't yell at me too loudly. I know the issue I'm describing will generally be considered
as being due to poor use of Cassandra.  I understand the rationale of the current behavior,
and am hesitant to play with fire by increasing {{tombstone_fail_threshold}} to a high value.
 Instead, what I propose is an alternate method of resolving such queries on the read path.

This is based on the following observation: on the coordinator node, when {{RangeSliceResponseResolver}}
is resolving a range slice query, it needn't be aware of all tombstones that all responding
nodes have within the specified range.  Instead, it would suffice if it could determine only
those tombstones which aren't shared by all responding nodes. E.g. if node #1 responds with
tombstones {A, B, D}, node #2 responds with tombstones {A, B, C}, and node #3 responds with
tombstones {A, B, C, D}, {{RangeSliceResponseResolver}} need only actually know about the
tombstones {C, D}: All of the responses will already have removed any relevant data for the
tombstones {A, B} from their individual responses.

Arranging for {{RangeSliceResolver}} to discover only the non-common tombstones can be accomplished
by using a variant of the "invertible bloom filter" data structure described in e.g. http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p218.pdf.

I see accomplishing this in a few discrete steps:
1. Implement an appropriate variant of the "invertible bloom filter".  I've started this already,
and will shortly upload a patch including my {{InvertibleBloomReconciler}} implementation.
 From a black box perspective, {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}}
demonstrates how this data structure and algorithm could be used.
2. ("db layer") Wrap {{InvertibleBloomReconciler}} into {{TombstoneReconciler}}, a structure
for spilling tombstones into.  Refacter large swaths of {{org.apache.cassandra.db}}'s read
path to accomodate the possibility of placing tombstones discovered during a read into a{{TombstoneReconciler}}
instead of returning them within a {{Row}}, {{List<Row>}}, {{ColumnFamily}}, etc.  I've
attempted a start on this, though this means carrying the {{TombstoneReconciler}} around through
most of {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other places
I haven't yet discovered.  I'm wondering if I wouldn't be better off waiting for CASSANDRA-8099
before starting this step -- a fully iterator flow through {{org.apache.cassandra.db}} might
make this easier.
3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters for the IBR (table
size, seed, hash count), possibly by making these part of CFMetaData.  Allow a {{RangeSliceResponse}}
to optionally return a {{TombstoneReconciler}} in addition to its {{List<Row>}}.  Refactor
{{RangeSliceResponseResolver}} to be capable of handling {{TombstoneReconciler}} s.  Possibly
refactor {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses contained
a {{TombstoneReconciler}}.

Since the invertible bloom filter is a probabilistic data structure, it is possible that resolving
a query in this manner could fail.  As such, I'm proposing that the current read path not
make use of the {{TombstoneReconciler}} idea unless it would otherwise encounter a {{TombstoneOverwhelming}}
situation.

Also, there is an astronomically minute possibility (on the order if 1 / 2^120 ) that tombstones
reconciled via a {{TombstoneReconciler}} could be bad data.  Possibly less astronomical if
someone were attempting to maliciously abuse this functionality, given the fact that I'm using
the Murmur3 hash instead of a cryptographic hash within {{InvertibleBloomReconciler}}.  As
such, I'm suggesting that read repair not be enabled on this path, just in case.

Any thoughts?  Would there be any interest in an implementation as I've described?  Any suggestions
on who I might ask for help navigating {{org.apache.cassandra.db}} if/when I run into trouble
there?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message