cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benjamin Janssen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8542) Make get_range_slices and related succeed most of the time on tombstone heavy column families
Date Wed, 28 Jan 2015 04:15:35 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294688#comment-14294688
] 

Benjamin Janssen commented on CASSANDRA-8542:
---------------------------------------------

Anyone had a chance to look into this ticket yet?

> Make get_range_slices and related succeed most of the time on tombstone heavy column
families
> ---------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8542
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8542
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Shawn Walker
>            Priority: Minor
>         Attachments: trunk-8542-InvertibleBloomReconciler.txt
>
>
> I've got a ColumnFamily in which some rows end up being used in a queue-like fashion,
and where many tombstones tend to pile up at the left end of the row.  As a result, I run
into {{TombstoneOverwhelming}} (e.g. CASSANDRA-6117) when trying to list the columns of said
rows.
> Please don't yell at me too loudly. I know the issue I'm describing will generally be
considered as being due to poor use of Cassandra.  I understand the rationale of the current
behavior, and am hesitant to play with fire by increasing {{tombstone_fail_threshold}} to
a high value.  Instead, what I propose is an alternate method of resolving such queries on
the read path.
> ----
> This is based on the following observation: on the coordinator node, when {{RangeSliceResponseResolver}}
is resolving a range slice query, it needn't be aware of all tombstones that all responding
nodes have within the specified range.  Instead, it would suffice if it could determine only
those tombstones which aren't shared by all responding nodes. E.g. if node #1 responds with
tombstones (A, B, D), node #2 responds with tombstones (A, B, C), and node #3 responds with
tombstones (A, B, C, D), {{RangeSliceResponseResolver}} need only actually know about the
tombstones (C, D): All of the responders will already have removed any relevant data for the
tombstones (A, B) from their individual responses.
> Arranging for {{RangeSliceResponseResolver}} to discover only the non-common tombstones
can be accomplished by using a variant of the "invertible bloom filter" data structure described
in e.g. http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p218.pdf.  Using an invertible
Bloom filter, each responding node can build a (roughly) fixed size data structure holding
a representation of all the tombstones that node encounters.  The coordinator node can then
combine the invertible Bloom filters.  If there aren't too many non-common tombstones, the
coordinator node will be able to enumerate all of them, and so resolve the range slice query.
> I see accomplishing this in a few discrete steps:
> 1. Implement an appropriate variant of the "invertible bloom filter".  I've started this
already, and will shortly upload a patch including my {{InvertibleBloomReconciler}} implementation.
 From a black box perspective, {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}}
demonstrates how this data structure and algorithm could be used.
> 2. ("db layer") Wrap {{InvertibleBloomReconciler}} into {{TombstoneReconciler}}, a structure
for spilling tombstones into.  Refactor large swaths of {{org.apache.cassandra.db}}'s read
path to accomodate the possibility of placing tombstones discovered during a read into a {{TombstoneReconciler}}
instead of returning them within a {{Row}}, {{List<Row>}}, {{ColumnFamily}}, etc.  I've
attempted a start on this, though this means carrying the {{TombstoneReconciler}} around through
most of {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other places
I haven't yet discovered.  I'm wondering if I wouldn't be better off waiting for CASSANDRA-8099
before starting this step -- a fully iterator flow through {{org.apache.cassandra.db}} could
make this easier, cleaner, and have significantly lower code impact.
> 3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters for the IBR
(table size, hash count, seed), possibly by making these part of {{CFMetaData}}.  Allow a
{{RangeSliceResponse}} to optionally return a {{TombstoneReconciler}} in addition to its {{List<Row>}}.
 Refactor {{RangeSliceResponseResolver}} to be capable of handling {{TombstoneReconciler}}
s.  Possibly refactor {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses
contained a {{TombstoneReconciler}}.
> Since the invertible bloom filter is a probabilistic data structure, it is possible that
resolving a query in this manner could fail.  As such, I'm proposing that the current read
path not make use of the {{TombstoneReconciler}} idea unless it would otherwise encounter
a {{TombstoneOverwhelming}} situation.
> Also, there is an astronomically minute possibility (on the order if 1 / 2^120 ) that
tombstones reconciled via a {{TombstoneReconciler}} could be bad data.  Possibly less astronomical
if someone were attempting to maliciously abuse this functionality, given the fact that I'm
using the Murmur3 hash instead of a cryptographic hash within {{InvertibleBloomReconciler}}.
 As such, I'm suggesting that read repair not be enabled on this path, just in case.
> ----
> Any thoughts?  Would there be any interest in an implementation as I've described?  Any
suggestions on who I might ask for help navigating {{org.apache.cassandra.db}} if/when I run
into trouble there?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message