cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shawn Walker (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-8542) Make get_range_slices and related succeed most of the time on tombstone heavy column families
Date Sun, 28 Dec 2014 20:52:13 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shawn Walker updated CASSANDRA-8542:
------------------------------------
    Description: 
I've got a ColumnFamily in which some rows end up being used in a queue-like fashion, and
where many tombstones tend to pile up at the left end of the row.  As a result, I run into
{{TombstoneOverwhelming}} (e.g. CASSANDRA-6117) when trying to list the columns of said rows.

Please don't yell at me too loudly. I know the issue I'm describing will generally be considered
as being due to poor use of Cassandra.  I understand the rationale of the current behavior,
and am hesitant to play with fire by increasing {{tombstone_fail_threshold}} to a high value.
 Instead, what I propose is an alternate method of resolving such queries on the read path.

This is based on the following observation: on the coordinator node, when {{RangeSliceResponseResolver}}
is resolving a range slice query, it needn't be aware of all tombstones that all responding
nodes have within the specified range.  Instead, it would suffice if it could determine only
those tombstones which aren't shared by all responding nodes. E.g. if node #1 responds with
tombstones (A, B, D), node #2 responds with tombstones (A, B, C), and node #3 responds with
tombstones (A, B, C, D), {{RangeSliceResponseResolver}} need only actually know about the
tombstones (C, D): All of the responses will already have removed any relevant data for the
tombstones (A, B) from their individual responses.

Arranging for {{RangeSliceResolver}} to discover only the non-common tombstones can be accomplished
by using a variant of the "invertible bloom filter" data structure described in e.g. http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p218.pdf.

I see accomplishing this in a few discrete steps:

1. Implement an appropriate variant of the "invertible bloom filter".  I've started this already,
and will shortly upload a patch including my {{InvertibleBloomReconciler}} implementation.
 From a black box perspective, {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}}
demonstrates how this data structure and algorithm could be used.

2. ("db layer") Wrap {{InvertibleBloomReconciler}} into {{TombstoneReconciler}}, a structure
for spilling tombstones into.  Refacter large swaths of {{org.apache.cassandra.db}}'s read
path to accomodate the possibility of placing tombstones discovered during a read into a{{TombstoneReconciler}}
instead of returning them within a {{Row}}, {{List<Row>}}, {{ColumnFamily}}, etc.  I've
attempted a start on this, though this means carrying the {{TombstoneReconciler}} around through
most of {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other places
I haven't yet discovered.  I'm wondering if I wouldn't be better off waiting for CASSANDRA-8099
before starting this step -- a fully iterator flow through {{org.apache.cassandra.db}} might
make this easier.

3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters for the IBR (table
size, seed, hash count), possibly by making these part of CFMetaData.  Allow a {{RangeSliceResponse}}
to optionally return a {{TombstoneReconciler}} in addition to its {{List<Row>}}.  Refactor
{{RangeSliceResponseResolver}} to be capable of handling {{TombstoneReconciler}} s.  Possibly
refactor {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses contained
a {{TombstoneReconciler}}.

Since the invertible bloom filter is a probabilistic data structure, it is possible that resolving
a query in this manner could fail.  As such, I'm proposing that the current read path not
make use of the {{TombstoneReconciler}} idea unless it would otherwise encounter a {{TombstoneOverwhelming}}
situation.

Also, there is an astronomically minute possibility (on the order if 1 / 2^120 ) that tombstones
reconciled via a {{TombstoneReconciler}} could be bad data.  Possibly less astronomical if
someone were attempting to maliciously abuse this functionality, given the fact that I'm using
the Murmur3 hash instead of a cryptographic hash within {{InvertibleBloomReconciler}}.  As
such, I'm suggesting that read repair not be enabled on this path, just in case.

----

Any thoughts?  Would there be any interest in an implementation as I've described?  Any suggestions
on who I might ask for help navigating {{org.apache.cassandra.db}} if/when I run into trouble
there?


  was:
I've got a ColumnFamily in which some rows end up being used in a queue-like fashion, and
where many tombstones tend to pile up at the left end of the row.  As a result, I run into
{{TombstoneOverwhelming}} (e.g. CASSANDRA-6117) when trying to list the columns of said rows.

Please don't yell at me too loudly. I know the issue I'm describing will generally be considered
as being due to poor use of Cassandra.  I understand the rationale of the current behavior,
and am hesitant to play with fire by increasing {{tombstone_fail_threshold}} to a high value.
 Instead, what I propose is an alternate method of resolving such queries on the read path.

This is based on the following observation: on the coordinator node, when {{RangeSliceResponseResolver}}
is resolving a range slice query, it needn't be aware of all tombstones that all responding
nodes have within the specified range.  Instead, it would suffice if it could determine only
those tombstones which aren't shared by all responding nodes. E.g. if node #1 responds with
tombstones (A, B, D), node #2 responds with tombstones (A, B, C), and node #3 responds with
tombstones (A, B, C, D), {{RangeSliceResponseResolver}} need only actually know about the
tombstones (C, D): All of the responses will already have removed any relevant data for the
tombstones (A, B) from their individual responses.

Arranging for {{RangeSliceResolver}} to discover only the non-common tombstones can be accomplished
by using a variant of the "invertible bloom filter" data structure described in e.g. http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p218.pdf.

I see accomplishing this in a few discrete steps:

1. Implement an appropriate variant of the "invertible bloom filter".  I've started this already,
and will shortly upload a patch including my {{InvertibleBloomReconciler}} implementation.
 From a black box perspective, {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}}
demonstrates how this data structure and algorithm could be used.

2. ("db layer") Wrap {{InvertibleBloomReconciler}} into {{TombstoneReconciler}}, a structure
for spilling tombstones into.  Refacter large swaths of {{org.apache.cassandra.db}}'s read
path to accomodate the possibility of placing tombstones discovered during a read into a{{TombstoneReconciler}}
instead of returning them within a {{Row}}, {{List<Row>}}, {{ColumnFamily}}, etc.  I've
attempted a start on this, though this means carrying the {{TombstoneReconciler}} around through
most of {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other places
I haven't yet discovered.  I'm wondering if I wouldn't be better off waiting for CASSANDRA-8099
before starting this step -- a fully iterator flow through {{org.apache.cassandra.db}} might
make this easier.

3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters for the IBR (table
size, seed, hash count), possibly by making these part of CFMetaData.  Allow a {{RangeSliceResponse}}
to optionally return a {{TombstoneReconciler}} in addition to its {{List<Row>}}.  Refactor
{{RangeSliceResponseResolver}} to be capable of handling {{TombstoneReconciler}} s.  Possibly
refactor {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses contained
a {{TombstoneReconciler}}.

Since the invertible bloom filter is a probabilistic data structure, it is possible that resolving
a query in this manner could fail.  As such, I'm proposing that the current read path not
make use of the {{TombstoneReconciler}} idea unless it would otherwise encounter a {{TombstoneOverwhelming}}
situation.

Also, there is an astronomically minute possibility (on the order if 1 / 2^120 ) that tombstones
reconciled via a {{TombstoneReconciler}} could be bad data.  Possibly less astronomical if
someone were attempting to maliciously abuse this functionality, given the fact that I'm using
the Murmur3 hash instead of a cryptographic hash within {{InvertibleBloomReconciler}}.  As
such, I'm suggesting that read repair not be enabled on this path, just in case.

Any thoughts?  Would there be any interest in an implementation as I've described?  Any suggestions
on who I might ask for help navigating {{org.apache.cassandra.db}} if/when I run into trouble
there?



> Make get_range_slices and related succeed most of the time on tombstone heavy column
families
> ---------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8542
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8542
>             Project: Cassandra
>          Issue Type: Wish
>          Components: Core
>            Reporter: Shawn Walker
>            Priority: Minor
>         Attachments: trunk-8542-InvertibleBloomReconciler.txt
>
>
> I've got a ColumnFamily in which some rows end up being used in a queue-like fashion,
and where many tombstones tend to pile up at the left end of the row.  As a result, I run
into {{TombstoneOverwhelming}} (e.g. CASSANDRA-6117) when trying to list the columns of said
rows.
> Please don't yell at me too loudly. I know the issue I'm describing will generally be
considered as being due to poor use of Cassandra.  I understand the rationale of the current
behavior, and am hesitant to play with fire by increasing {{tombstone_fail_threshold}} to
a high value.  Instead, what I propose is an alternate method of resolving such queries on
the read path.
> This is based on the following observation: on the coordinator node, when {{RangeSliceResponseResolver}}
is resolving a range slice query, it needn't be aware of all tombstones that all responding
nodes have within the specified range.  Instead, it would suffice if it could determine only
those tombstones which aren't shared by all responding nodes. E.g. if node #1 responds with
tombstones (A, B, D), node #2 responds with tombstones (A, B, C), and node #3 responds with
tombstones (A, B, C, D), {{RangeSliceResponseResolver}} need only actually know about the
tombstones (C, D): All of the responses will already have removed any relevant data for the
tombstones (A, B) from their individual responses.
> Arranging for {{RangeSliceResolver}} to discover only the non-common tombstones can be
accomplished by using a variant of the "invertible bloom filter" data structure described
in e.g. http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p218.pdf.
> I see accomplishing this in a few discrete steps:
> 1. Implement an appropriate variant of the "invertible bloom filter".  I've started this
already, and will shortly upload a patch including my {{InvertibleBloomReconciler}} implementation.
 From a black box perspective, {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}}
demonstrates how this data structure and algorithm could be used.
> 2. ("db layer") Wrap {{InvertibleBloomReconciler}} into {{TombstoneReconciler}}, a structure
for spilling tombstones into.  Refacter large swaths of {{org.apache.cassandra.db}}'s read
path to accomodate the possibility of placing tombstones discovered during a read into a{{TombstoneReconciler}}
instead of returning them within a {{Row}}, {{List<Row>}}, {{ColumnFamily}}, etc.  I've
attempted a start on this, though this means carrying the {{TombstoneReconciler}} around through
most of {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other places
I haven't yet discovered.  I'm wondering if I wouldn't be better off waiting for CASSANDRA-8099
before starting this step -- a fully iterator flow through {{org.apache.cassandra.db}} might
make this easier.
> 3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters for the IBR
(table size, seed, hash count), possibly by making these part of CFMetaData.  Allow a {{RangeSliceResponse}}
to optionally return a {{TombstoneReconciler}} in addition to its {{List<Row>}}.  Refactor
{{RangeSliceResponseResolver}} to be capable of handling {{TombstoneReconciler}} s.  Possibly
refactor {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses contained
a {{TombstoneReconciler}}.
> Since the invertible bloom filter is a probabilistic data structure, it is possible that
resolving a query in this manner could fail.  As such, I'm proposing that the current read
path not make use of the {{TombstoneReconciler}} idea unless it would otherwise encounter
a {{TombstoneOverwhelming}} situation.
> Also, there is an astronomically minute possibility (on the order if 1 / 2^120 ) that
tombstones reconciled via a {{TombstoneReconciler}} could be bad data.  Possibly less astronomical
if someone were attempting to maliciously abuse this functionality, given the fact that I'm
using the Murmur3 hash instead of a cryptographic hash within {{InvertibleBloomReconciler}}.
 As such, I'm suggesting that read repair not be enabled on this path, just in case.
> ----
> Any thoughts?  Would there be any interest in an implementation as I've described?  Any
suggestions on who I might ask for help navigating {{org.apache.cassandra.db}} if/when I run
into trouble there?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message