Date: Fri, 29 Jul 2011 06:26:10 +0000 (UTC)
From: "Byron Clark (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Message-ID: <1071547888.17728.1311920770260.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <249081620.7169.1305217667563.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (CASSANDRA-2643) read repair/reconciliation breaks slice based iteration at QUORUM

     [ https://issues.apache.org/jira/browse/CASSANDRA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Byron Clark updated CASSANDRA-2643:
-----------------------------------

    Attachment: CASSANDRA-2643-poc.patch

The attached [^CASSANDRA-2643-poc.patch], while extremely ugly, serves as a proof of concept showing that all the data is available and the short read problem can be corrected.

> read repair/reconciliation breaks slice based iteration at QUORUM
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-2643
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2643
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.5
>            Reporter: Peter Schuller
>            Assignee: Brandon Williams
>            Priority: Critical
>             Fix For: 1.0
>
>         Attachments: CASSANDRA-2643-poc.patch, reliable_short_read_0.8.sh, short_read.sh, short_read_0.8.sh, slicetest.py
>
>
> In short, I believe iterating over columns is impossible to do reliably with QUORUM due to the way reconciliation works.
> The problem is that the SliceQueryFilter executes locally when reading on a node, but no attempt seems to be made to take the slice limit into account when doing reconciliation and/or read repair (RowRepairResolver.resolveSuperset() and ColumnFamily.resolve()).
> If one node slices and comes up with 100 columns, and another node slices and comes up with 100 columns, some of which are unique to each side, reconciliation results in more than 100 columns in the result set.
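For illustration, here is a minimal, hypothetical Python sketch of the behaviour described above. It is not Cassandra's actual RowRepairResolver/ColumnFamily code; it simply assumes each replica applies the slice limit locally and that the coordinator unions the per-replica results without re-applying that limit.

    LIMIT = 100

    # Replica A saw writes col000..col119; replica B missed some of them and
    # instead saw col000..col049 plus col120..col189, so their first-LIMIT
    # slices only partially overlap.
    replica_a = ["col%03d" % i for i in range(120)][:LIMIT]
    replica_b = sorted(["col%03d" % i for i in range(50)] +
                       ["col%03d" % i for i in range(120, 190)])[:LIMIT]

    # "Reconciliation" here is just a union of what the replicas returned;
    # the slice limit is never re-applied.
    reconciled = sorted(set(replica_a) | set(replica_b))

    print(len(replica_a), len(replica_b), len(reconciled))  # 100 100 150

The client asked for 100 columns and gets 150 back, and any column that only one replica returned has not actually been read at QUORUM.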
> In this case the effect is limited to "client gets more than asked for", but the columns still accurately represent the range. This is easily triggered by my test case.
> In addition to the client receiving "too many" columns, I believe some of them will not satisfy the QUORUM consistency level, for the same reasons as with deletions (see discussion below).
> Now, there *should* be a problem for tombstones as well, but it's more subtle. Suppose A has:
>   1
>   2
>   3
>   4
>   5
>   6
> and B has:
>   1
>   del 2
>   del 3
>   del 4
>   5
>   6
> If you now slice 1-6 with count=3, the tombstones from B will reconcile away the matching columns from A - fine. So you end up getting 1, 5 and 6 back. This made it a bit difficult to trigger in a test case until I realized what was going on. At first I was "hoping" to see a "short" iteration result, which would mean that the process of iterating until you get a short result causes spurious "end of columns" and thus makes it impossible to iterate correctly.
> So, because 5 and 6 exist (and if they didn't, you legitimately reached end-of-columns), we do indeed get a result of size 3 which contains 1, 5 and 6. However, only node B would have contributed columns 5 and 6; so there is actually no QUORUM consistency on the co-ordinating node with respect to these columns. If nodes A and C also had 5 and 6, they would not have been considered.
> Am I wrong?
> In any case, using the script I'm about to attach, you can trigger the over-delivery case very easily:
> (0) disable hinted hand-off to avoid it interacting with the test
> (1) start three nodes
> (2) create ks 'test' with rf=3 and cf 'slicetest'
> (3) ./slicetest.py hostname_of_node_C insert   # let it run for a few seconds, then ctrl-c
> (4) stop node A
> (5) ./slicetest.py hostname_of_node_C insert   # let it run for a few seconds, then ctrl-c
> (6) start node A, wait for B and C to consider it up
> (7) ./slicetest.py hostname_of_node_A slice    # make A the co-ordinator, though it doesn't necessarily matter
> You can also pass 'delete' (random deletion of 50% of contents) or 'deleterange' (delete everything in [0.2, 0.8]) to slicetest, but you don't trigger a short read by doing that (see the discussion above).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
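The tombstone scenario described in the quoted report can be sketched in the same hypothetical style. Again, this is not the real ColumnFamily.resolve(); it assumes each replica stops its local slice after `count` live columns and that reconciliation lets the newest timestamp win, with tombstones dropped from the final result.

    COUNT = 3

    def local_slice(columns, count):
        """Return columns in order until `count` live (non-deleted) ones are seen."""
        out, live = [], 0
        for name, ts, deleted in columns:
            out.append((name, ts, deleted))
            if not deleted:
                live += 1
            if live == count:
                break
        return out

    # (name, timestamp, is_tombstone)
    replica_a = [(n, 1, False) for n in "123456"]
    replica_b = [("1", 1, False), ("2", 2, True), ("3", 2, True),
                 ("4", 2, True), ("5", 1, False), ("6", 1, False)]

    def reconcile(slices):
        merged = {}
        for s in slices:
            for name, ts, deleted in s:
                if name not in merged or ts > merged[name][1]:
                    merged[name] = (name, ts, deleted)
        # drop tombstones from what is handed back to the client
        return [name for name, ts, deleted in sorted(merged.values()) if not deleted]

    print(reconcile([local_slice(replica_a, COUNT), local_slice(replica_b, COUNT)]))
    # ['1', '5', '6']

The result has size 3, so the iteration does not look short, yet columns 5 and 6 were contributed only by replica B and therefore never met QUORUM at the coordinator.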