cassandra-commits mailing list archives

From "Jeremy Hanna (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-1042) ColumnFamilyRecordReader returns duplicate rows
Date Sat, 26 Jun 2010 00:48:53 GMT


Jeremy Hanna commented on CASSANDRA-1042:

Good point.

From what I could tell in this instance, it would go through the input splits, and the
last input split would have an incorrect last value.  It would then go back through and
read from that value to the end of the input list.  I would imagine that is where it had wrapped.
I'm not sure why the last input split ended on that incorrect last value,
though.  If someone is wiser than I in these matters, please chime in.  But it appears that
normalizing how the splits are done, so that no single split wraps internally, solves the problem.
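
To make the idea of "one split does not wrap internally" concrete, here is a minimal sketch.
It is not the attached patch. The names normalize_split and read_piece are hypothetical, and
the token bounds assume a RandomPartitioner-style token space of 0 to 2**127. The row tokens
and the split token are the ones from the issue description.

```python
# Toy illustration only; none of this is Cassandra code.
MIN_TOKEN = 0          # assumed lower bound of the token space
MAX_TOKEN = 2 ** 127   # assumed upper bound of the token space

SPLIT_TOKEN = 53193025635115934196771903670925341736
ROW_TOKENS = [
    16955237001963240173058271559858726497,
    40670782773005619916245995581909898190,
    99079589977253916124855502156832923443,
    144992942750327304334463589818972416113,
    166860289390734216023086131251507064403,
]


def normalize_split(start, end):
    """Turn the range (start, end] into pieces that never wrap (hypothetical helper)."""
    if start < end:
        return [(start, end)]
    # start >= end means the range wraps around the ring, so cut it at the boundary.
    pieces = []
    if start < MAX_TOKEN:
        pieces.append((start, MAX_TOKEN))
    if end > MIN_TOKEN:
        # A row sitting exactly on MIN_TOKEN would need separate handling.
        pieces.append((MIN_TOKEN, end))
    return pieces


def read_piece(tokens, start, end, batch_size=2):
    """Page through a non-wrapping range (start, end], restarting after the last token seen."""
    rows = []
    while True:
        batch = sorted(t for t in tokens if start < t <= end)[:batch_size]
        if not batch:
            return rows
        rows.extend(batch)
        start = batch[-1]


pieces = normalize_split(SPLIT_TOKEN, SPLIT_TOKEN)
rows = [t for start, end in pieces for t in read_piece(ROW_TOKENS, start, end)]
print(len(pieces), len(rows), len(set(rows)))  # 2 5 5
```

Because each piece has start < end, paging forward from the last token seen can never come
back around the ring, so each of the 5 rows is read exactly once.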

To reproduce easily and with a small dataset: If you don't apply the patch and run the word_count_setup
with only 10 values for text3, usually that will be enough to manifest the problem when running

Also, I would think that if the wrap can be detected when creating the splits, as with this
patch, then wrapping could also be detected when reading the rows in the ColumnFamilyRecordReader.
That could be another way to resolve it.  But I think it's six of one, half a dozen of the
other when it comes to the solution.

Like I said, I'm not certain why that incorrect ordering happens on the last split.

> ColumnFamilyRecordReader returns duplicate rows
> -----------------------------------------------
>                 Key: CASSANDRA-1042
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.6
>            Reporter: Joost Ouwerkerk
>            Assignee: Jeremy Hanna
>             Fix For: 0.6.4
>         Attachments: 1042-0_6.txt, Cassandra-1042-0_6-branch.patch.txt, CASSANDRA-1042-trunk.patch.txt,
> There's a bug in ColumnFamilyRecordReader that appears when processing a single split
> (which happens in most tests that have a small number of rows), and potentially in other cases.
> When the start and end tokens of the split are equal, duplicate rows can be returned.
> Example with 5 rows:
> token (start and end) = 53193025635115934196771903670925341736
> Tokens returned by first get_range_slices iteration (all 5 rows):
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
>  99079589977253916124855502156832923443
>  144992942750327304334463589818972416113
>  166860289390734216023086131251507064403
> Tokens returned by next iteration (first token is last token from
> previous, end token is unchanged)
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
> Tokens returned by final iteration  (first token is last token from
> previous, end token is unchanged)
>  [] (empty)
> In this example, the mapper has processed 7 rows in total, 2 of which
> were duplicates.
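
The example in the issue description can be replayed with a small toy model. This is a sketch
under stated assumptions, not the actual ColumnFamilyRecordReader logic: get_range_slice is a
hypothetical stand-in for get_range_slices, and it assumes that a wrapping range (start >= end)
returns its rows sorted by token, which is the order the example shows.

```python
# Toy model of the reader loop from the example; none of this is Cassandra code.
SPLIT_TOKEN = 53193025635115934196771903670925341736
ROW_TOKENS = [
    16955237001963240173058271559858726497,
    40670782773005619916245995581909898190,
    99079589977253916124855502156832923443,
    144992942750327304334463589818972416113,
    166860289390734216023086131251507064403,
]


def get_range_slice(tokens, start, end, count):
    """Hypothetical stand-in for get_range_slices over the range (start, end]."""
    if start < end:
        matches = [t for t in tokens if start < t <= end]
    else:
        # start >= end: the range wraps around the ring.
        matches = [t for t in tokens if t > start or t <= end]
    # Assumption taken from the example: rows come back sorted by token.
    return sorted(matches)[:count]


def read_split(tokens, start, end, batch_size=100):
    """Iterate as the example describes: the next start token is the last token
    of the previous batch, and the end token is unchanged."""
    batches = []
    while True:
        batch = get_range_slice(tokens, start, end, batch_size)
        if not batch:
            break
        batches.append(batch)
        start = batch[-1]
    return batches


batches = read_split(ROW_TOKENS, SPLIT_TOKEN, SPLIT_TOKEN)
rows = [t for batch in batches for t in batch]
print([len(b) for b in batches])              # [5, 2]
print(len(rows), len(rows) - len(set(rows)))  # 7 2
```

In this model the second iteration starts from the largest token, which is greater than the
end token, so the range wraps again and the two smallest tokens are returned a second time.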

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.