Message-ID: <2982625.222261278447532385.JavaMail.jira@thor>
Date: Tue, 6 Jul 2010 16:18:52 -0400 (EDT)
From: "Jeremy Hanna (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] Commented: (CASSANDRA-1042) ColumnFamilyRecordReader returns duplicate rows
In-Reply-To: <22188148.291272730736008.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/CASSANDRA-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885669#action_12885669 ]

Jeremy Hanna commented on
CASSANDRA-1042:
-----------------------------------------

Sorry if this is redundant, but I'm pasting in a thought we had a while ago that motivated the attached patch. If we make sure that the splits are always in ring order and never wrap, it solves the problem.

"Token ranges may also wrap -- that is, the end token may be less than the start one. Thus, a range from keyX to keyX is a one-element range, but a range from tokenY to tokenY is the full ring."

This description does not say what order the results will be in when the range wraps. Some clients assume that the ordering is natural order, while the hadoop client interactions assume that it will be ring order.

For example:
-- a list of tokens (1,2,3,4,5,6,7,8,9)
-- a get_range_slice call with start_token = 5, end_token = 5

Natural order means token order from start to finish, returning the results (1,2,3,4,5,6,7,8,9).
Ring order, or wrapping order, means it would return the results (5,6,7,8,9,1,2,3,4).

> ColumnFamilyRecordReader returns duplicate rows
> -----------------------------------------------
>
>                 Key: CASSANDRA-1042
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1042
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.6
>            Reporter: Joost Ouwerkerk
>            Assignee: Jeremy Hanna
>             Fix For: 0.6.4
>
>         Attachments: 1042-0_6.txt, Cassandra-1042-0_6-branch.patch.txt, CASSANDRA-1042-trunk.patch.txt, cassandra.tar.gz, duplicate_keys.rtf
>
>
> There's a bug in ColumnFamilyRecordReader that appears when processing a single split (which happens in most tests that have a small number of rows), and potentially in other cases. When the start and end tokens of the split are equal, duplicate rows can be returned.
> Example with 5 rows:
> token (start and end) = 53193025635115934196771903670925341736
> Tokens returned by first get_range_slices iteration (all 5 rows):
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
>  99079589977253916124855502156832923443
>  144992942750327304334463589818972416113
>  166860289390734216023086131251507064403
> Tokens returned by next iteration (first token is last token from previous, end token is unchanged)
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
> Tokens returned by final iteration (first token is last token from previous, end token is unchanged)
>  [] (empty)
> In this example, the mapper has processed 7 rows in total, 2 of which were duplicates.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
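
To make the quoted example concrete, here is a minimal Python sketch of the paging behaviour. It is not the Cassandra or Hadoop code: `in_range`, `range_slice` and `read_split` are hypothetical helpers, and the sketch assumes that a range is start-exclusive and end-inclusive and that the reader pages by reusing the last returned token as the next start token, as the issue describes.

```python
# Tokens and split token copied from the example in the issue.
TOKENS = [
    16955237001963240173058271559858726497,
    40670782773005619916245995581909898190,
    99079589977253916124855502156832923443,
    144992942750327304334463589818972416113,
    166860289390734216023086131251507064403,
]
SPLIT_TOKEN = 53193025635115934196771903670925341736


def in_range(token, start, end):
    """Assumed semantics: (start, end], wrapping when start > end;
    start == end means the full ring."""
    if start == end:
        return True
    if start < end:
        return start < token <= end
    return token > start or token <= end


def range_slice(tokens, start, end, count, ring_order):
    """Hypothetical stand-in for a range slice call over a token list."""
    matching = [t for t in sorted(tokens) if in_range(t, start, end)]
    if ring_order:
        # Walk the ring from the start token: tokens after it come first.
        matching = ([t for t in matching if t > start]
                    + [t for t in matching if t <= start])
    return matching[:count]


def read_split(tokens, split_start, split_end, count, ring_order):
    """Page through one split, restarting each call at the last token seen."""
    rows, start = [], split_start
    while True:
        batch = range_slice(tokens, start, split_end, count, ring_order)
        if not batch:
            break
        rows.extend(batch)
        if batch[-1] == split_end:
            break  # guard for this sketch: avoid re-reading the full ring
        start = batch[-1]
    return rows


natural = read_split(TOKENS, SPLIT_TOKEN, SPLIT_TOKEN, 5, ring_order=False)
ring = read_split(TOKENS, SPLIT_TOKEN, SPLIT_TOKEN, 5, ring_order=True)
print(len(natural), len(set(natural)))  # 7 5
print(len(ring), len(set(ring)))        # 5 5
```

Under these assumptions, natural ordering reproduces the 7 rows with 2 duplicates reported above, while ring ordering returns each of the 5 rows once.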