Return-Path: X-Original-To: apmail-cassandra-commits-archive@www.apache.org Delivered-To: apmail-cassandra-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E55B71792A for ; Wed, 28 Jan 2015 15:30:34 +0000 (UTC) Received: (qmail 26989 invoked by uid 500); 28 Jan 2015 15:30:35 -0000 Delivered-To: apmail-cassandra-commits-archive@cassandra.apache.org Received: (qmail 26895 invoked by uid 500); 28 Jan 2015 15:30:35 -0000 Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@cassandra.apache.org Delivered-To: mailing list commits@cassandra.apache.org Received: (qmail 26589 invoked by uid 99); 28 Jan 2015 15:30:35 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 28 Jan 2015 15:30:35 +0000 Date: Wed, 28 Jan 2015 15:30:35 +0000 (UTC) From: "Aleksey Yeschenko (JIRA)" To: commits@cassandra.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (CASSANDRA-8542) Make get_range_slices and related succeed most of the time on tombstone heavy column families MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/CASSANDRA-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295289#comment-14295289 ] Aleksey Yeschenko commented on CASSANDRA-8542: ---------------------------------------------- It's 3.1 territory, at best, if we ever decided to do this, so no, not yet. > Make get_range_slices and related succeed most of the time on tombstone heavy column families > --------------------------------------------------------------------------------------------- > > Key: CASSANDRA-8542 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8542 > Project: Cassandra > Issue Type: New Feature > Components: Core > Reporter: Shawn Walker > Priority: Minor > Attachments: trunk-8542-InvertibleBloomReconciler.txt > > > I've got a ColumnFamily in which some rows end up being used in a queue-like fashion, and where many tombstones tend to pile up at the left end of the row. As a result, I run into {{TombstoneOverwhelming}} (e.g. CASSANDRA-6117) when trying to list the columns of said rows. > Please don't yell at me too loudly. I know the issue I'm describing will generally be considered as being due to poor use of Cassandra. I understand the rationale of the current behavior, and am hesitant to play with fire by increasing {{tombstone_fail_threshold}} to a high value. Instead, what I propose is an alternate method of resolving such queries on the read path. > ---- > This is based on the following observation: on the coordinator node, when {{RangeSliceResponseResolver}} is resolving a range slice query, it needn't be aware of all tombstones that all responding nodes have within the specified range. Instead, it would suffice if it could determine only those tombstones which aren't shared by all responding nodes. E.g. if node #1 responds with tombstones (A, B, D), node #2 responds with tombstones (A, B, C), and node #3 responds with tombstones (A, B, C, D), {{RangeSliceResponseResolver}} need only actually know about the tombstones (C, D): All of the responders will already have removed any relevant data for the tombstones (A, B) from their individual responses. > Arranging for {{RangeSliceResponseResolver}} to discover only the non-common tombstones can be accomplished by using a variant of the "invertible bloom filter" data structure described in e.g. http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p218.pdf. Using an invertible Bloom filter, each responding node can build a (roughly) fixed size data structure holding a representation of all the tombstones that node encounters. The coordinator node can then combine the invertible Bloom filters. If there aren't too many non-common tombstones, the coordinator node will be able to enumerate all of them, and so resolve the range slice query. > I see accomplishing this in a few discrete steps: > 1. Implement an appropriate variant of the "invertible bloom filter". I've started this already, and will shortly upload a patch including my {{InvertibleBloomReconciler}} implementation. From a black box perspective, {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}} demonstrates how this data structure and algorithm could be used. > 2. ("db layer") Wrap {{InvertibleBloomReconciler}} into {{TombstoneReconciler}}, a structure for spilling tombstones into. Refactor large swaths of {{org.apache.cassandra.db}}'s read path to accomodate the possibility of placing tombstones discovered during a read into a {{TombstoneReconciler}} instead of returning them within a {{Row}}, {{List}}, {{ColumnFamily}}, etc. I've attempted a start on this, though this means carrying the {{TombstoneReconciler}} around through most of {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other places I haven't yet discovered. I'm wondering if I wouldn't be better off waiting for CASSANDRA-8099 before starting this step -- a fully iterator flow through {{org.apache.cassandra.db}} could make this easier, cleaner, and have significantly lower code impact. > 3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters for the IBR (table size, hash count, seed), possibly by making these part of {{CFMetaData}}. Allow a {{RangeSliceResponse}} to optionally return a {{TombstoneReconciler}} in addition to its {{List}}. Refactor {{RangeSliceResponseResolver}} to be capable of handling {{TombstoneReconciler}} s. Possibly refactor {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses contained a {{TombstoneReconciler}}. > Since the invertible bloom filter is a probabilistic data structure, it is possible that resolving a query in this manner could fail. As such, I'm proposing that the current read path not make use of the {{TombstoneReconciler}} idea unless it would otherwise encounter a {{TombstoneOverwhelming}} situation. > Also, there is an astronomically minute possibility (on the order if 1 / 2^120 ) that tombstones reconciled via a {{TombstoneReconciler}} could be bad data. Possibly less astronomical if someone were attempting to maliciously abuse this functionality, given the fact that I'm using the Murmur3 hash instead of a cryptographic hash within {{InvertibleBloomReconciler}}. As such, I'm suggesting that read repair not be enabled on this path, just in case. > ---- > Any thoughts? Would there be any interest in an implementation as I've described? Any suggestions on who I might ask for help navigating {{org.apache.cassandra.db}} if/when I run into trouble there? -- This message was sent by Atlassian JIRA (v6.3.4#6332)