cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shawn Walker (JIRA)" <>
Subject [jira] [Updated] (CASSANDRA-8542) Make get_range_slices and related succeed most of the time on tombstone heavy column families
Date Sun, 28 Dec 2014 20:57:13 GMT


Shawn Walker updated CASSANDRA-8542:
    Issue Type: New Feature  (was: Wish)

> Make get_range_slices and related succeed most of the time on tombstone heavy column
> ---------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-8542
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Shawn Walker
>            Priority: Minor
>         Attachments: trunk-8542-InvertibleBloomReconciler.txt
> I've got a ColumnFamily in which some rows end up being used in a queue-like fashion,
and where many tombstones tend to pile up at the left end of the row.  As a result, I run
into {{TombstoneOverwhelming}} (e.g. CASSANDRA-6117) when trying to list the columns of said
> Please don't yell at me too loudly. I know the issue I'm describing will generally be
considered as being due to poor use of Cassandra.  I understand the rationale of the current
behavior, and am hesitant to play with fire by increasing {{tombstone_fail_threshold}} to
a high value.  Instead, what I propose is an alternate method of resolving such queries on
the read path.
> ----
> This is based on the following observation: on the coordinator node, when {{RangeSliceResponseResolver}}
is resolving a range slice query, it needn't be aware of all tombstones that all responding
nodes have within the specified range.  Instead, it would suffice if it could determine only
those tombstones which aren't shared by all responding nodes. E.g. if node #1 responds with
tombstones (A, B, D), node #2 responds with tombstones (A, B, C), and node #3 responds with
tombstones (A, B, C, D), {{RangeSliceResponseResolver}} need only actually know about the
tombstones (C, D): All of the responders will already have removed any relevant data for the
tombstones (A, B) from their individual responses.
> Arranging for {{RangeSliceResponseResolver}} to discover only the non-common tombstones
can be accomplished by using a variant of the "invertible bloom filter" data structure described
in e.g.
> I see accomplishing this in a few discrete steps:
> 1. Implement an appropriate variant of the "invertible bloom filter".  I've started this
already, and will shortly upload a patch including my {{InvertibleBloomReconciler}} implementation.
 From a black box perspective, {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}}
demonstrates how this data structure and algorithm could be used.
> 2. ("db layer") Wrap {{InvertibleBloomReconciler}} into {{TombstoneReconciler}}, a structure
for spilling tombstones into.  Refacter large swaths of {{org.apache.cassandra.db}}'s read
path to accomodate the possibility of placing tombstones discovered during a read into a{{TombstoneReconciler}}
instead of returning them within a {{Row}}, {{List<Row>}}, {{ColumnFamily}}, etc.  I've
attempted a start on this, though this means carrying the {{TombstoneReconciler}} around through
most of {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other places
I haven't yet discovered.  I'm wondering if I wouldn't be better off waiting for CASSANDRA-8099
before starting this step -- a fully iterator flow through {{org.apache.cassandra.db}} might
make this easier.
> 3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters for the IBR
(table size, seed, hash count), possibly by making these part of CFMetaData.  Allow a {{RangeSliceResponse}}
to optionally return a {{TombstoneReconciler}} in addition to its {{List<Row>}}.  Refactor
{{RangeSliceResponseResolver}} to be capable of handling {{TombstoneReconciler}} s.  Possibly
refactor {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses contained
a {{TombstoneReconciler}}.
> Since the invertible bloom filter is a probabilistic data structure, it is possible that
resolving a query in this manner could fail.  As such, I'm proposing that the current read
path not make use of the {{TombstoneReconciler}} idea unless it would otherwise encounter
a {{TombstoneOverwhelming}} situation.
> Also, there is an astronomically minute possibility (on the order if 1 / 2^120 ) that
tombstones reconciled via a {{TombstoneReconciler}} could be bad data.  Possibly less astronomical
if someone were attempting to maliciously abuse this functionality, given the fact that I'm
using the Murmur3 hash instead of a cryptographic hash within {{InvertibleBloomReconciler}}.
 As such, I'm suggesting that read repair not be enabled on this path, just in case.
> ----
> Any thoughts?  Would there be any interest in an implementation as I've described?  Any
suggestions on who I might ask for help navigating {{org.apache.cassandra.db}} if/when I run
into trouble there?

This message was sent by Atlassian JIRA

View raw message