cassandra-commits mailing list archives

From "Jeremy Hanna (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-1042) ColumnFamilyRecordReader returns duplicate rows
Date Tue, 06 Jul 2010 20:18:52 GMT


Jeremy Hanna commented on CASSANDRA-1042:

Sorry if this is redundant, but I'm pasting in a thought we had a while ago that motivated the
attached patch.  If we make sure that the splits are always in ring order and never wrap,
it solves the problem.

"Token ranges may also wrap -- that is, the end token may be less than the start one. Thus,
a range from keyX to keyX is a one-element range, but a range from tokenY to tokenY is the
full ring."

The passage does not say what order the results will be in when the range wraps.  Some clients
assume natural order, while the Hadoop client interactions assume ring order.

For example:
-- a list of tokens (1,2,3,4,5,6,7,8,9)
-- a get_range_slices call with start_token = 5, end_token = 5
Natural order means token order from start to finish, returning the results (1,2,3,4,5,6,7,8,9).
Ring order (or wrapping order) means it would return the results (5,6,7,8,9,1,2,3,4).
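
To make the two orderings in the example above concrete, here is a minimal Python sketch.
The function names natural_order and ring_order are made up for illustration and are not
part of Cassandra or its Thrift API; the sketch only reproduces the two result lists given
above for start_token = end_token = 5.

```python
def natural_order(tokens):
    """Return tokens sorted from smallest to largest, ignoring the start token."""
    return sorted(tokens)


def ring_order(tokens, start_token):
    """Return tokens walking the ring from start_token and wrapping past the end.

    The start token is listed first here only to match the example in this
    message; whether a real range query includes it is not something this
    sketch claims.
    """
    ordered = sorted(tokens)
    from_start = [t for t in ordered if t >= start_token]
    wrapped = [t for t in ordered if t < start_token]
    return from_start + wrapped


tokens = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(natural_order(tokens))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(ring_order(tokens, 5))   # [5, 6, 7, 8, 9, 1, 2, 3, 4]
```

A client that pages by taking the last returned token as the next start token gets very
different follow-up ranges under the two orderings, which is why the ambiguity matters.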

> ColumnFamilyRecordReader returns duplicate rows
> -----------------------------------------------
>                 Key: CASSANDRA-1042
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.6
>            Reporter: Joost Ouwerkerk
>            Assignee: Jeremy Hanna
>             Fix For: 0.6.4
>         Attachments: 1042-0_6.txt, Cassandra-1042-0_6-branch.patch.txt, CASSANDRA-1042-trunk.patch.txt,
> cassandra.tar.gz, duplicate_keys.rtf
> There's a bug in ColumnFamilyRecordReader that appears when processing a single split
> (which happens in most tests that have a small number of rows), and potentially in other cases.
> When the start and end tokens of the split are equal, duplicate rows can be returned.
> Example with 5 rows:
> token (start and end) = 53193025635115934196771903670925341736
> Tokens returned by first get_range_slices iteration (all 5 rows):
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
>  99079589977253916124855502156832923443
>  144992942750327304334463589818972416113
>  166860289390734216023086131251507064403
> Tokens returned by next iteration (first token is last token from
> previous, end token is unchanged)
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
> Tokens returned by final iteration  (first token is last token from
> previous, end token is unchanged)
>  [] (empty)
> In this example, the mapper has processed 7 rows in total, 2 of which
> were duplicates.
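
The iteration described in the report can be replayed with a small Python sketch. This is
a toy model, not Cassandra's code: toy_get_range_slices and read_split are hypothetical
names, and the model assumes that a start == end range returns the whole ring in natural
order while any other range is (start, end] and wraps when end is less than start. Those
assumptions are chosen only because they reproduce the token lists quoted above.

```python
def toy_get_range_slices(tokens, start, end):
    """Toy stand-in for a range query (assumed semantics, see lead-in)."""
    ordered = sorted(tokens)
    if start == end:
        # Full ring, returned in natural order.
        return ordered
    if start < end:
        # Ordinary non-wrapping range (start, end].
        return [t for t in ordered if start < t <= end]
    # Wrapping range: everything after start, then everything up to end.
    return [t for t in ordered if t > start] + [t for t in ordered if t <= end]


def read_split(tokens, start, end, max_iterations=10):
    """Page through a split the way the report describes: the next start
    token is the last token of the previous batch; the end token is unchanged."""
    rows = []
    for _ in range(max_iterations):  # guard against looping forever in the toy
        batch = toy_get_range_slices(tokens, start, end)
        if not batch:
            break
        rows.extend(batch)
        start = batch[-1]
    return rows


tokens = [
    16955237001963240173058271559858726497,
    40670782773005619916245995581909898190,
    99079589977253916124855502156832923443,
    144992942750327304334463589818972416113,
    166860289390734216023086131251507064403,
]
split_token = 53193025635115934196771903670925341736

rows = read_split(tokens, split_token, split_token)
print(len(rows))                    # rows handed to the mapper
print(len(rows) - len(set(rows)))   # how many of them are duplicates
```

Under these assumptions the first batch holds all 5 rows, the second batch wraps and
returns the first 2 again, and the third is empty, matching the 7 rows and 2 duplicates
in the report.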

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
