cassandra-user mailing list archives

From Corey Hulen
Subject Re: Possible bug in Cassandra MapReduce
Date Fri, 18 Jun 2010 22:57:12 GMT
OK...I just verified on a clean EC2 small single instance box using
 I'm pretty sure the Cassandra MapReduce functionality is broken.

If your MapReduce jobs are idempotent then you are OK, but if you are doing
things like word count (as in the supplied example) or key count you will
get double counts.
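
To illustrate why idempotence matters here, the following is a minimal sketch in plain Python, not the contrib word_count code. The data, the function names, and the assumption that some rows reach the mapper twice are all hypothetical; the thread does not establish why the extra counts appear.

```python
from collections import Counter


def word_count(rows):
    """Non-idempotent aggregation: every delivered row adds to the total."""
    counts = Counter()
    for _key, text in rows:
        counts.update(text.split())
    return counts


def distinct_keys(rows):
    """Idempotent aggregation: seeing the same row twice changes nothing."""
    return {key for key, _text in rows}


# Hypothetical data: 1000 rows, each holding the single word "word1".
rows = [(f"key{i}", "word1") for i in range(1000)]

# Assumption for illustration only: the first 200 rows are delivered twice.
delivered = rows + rows[:200]

print(word_count(rows)["word1"])       # 1000
print(word_count(delivered)["word1"])  # 1200
print(len(distinct_keys(delivered)))   # 1000
```

The distinct-key result is unaffected by the repeated rows, while the word count is inflated by exactly the number of repeats.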


On Fri, Jun 18, 2010 at 3:15 PM, Corey Hulen <> wrote:

> I thought the same thing, but using the supplied contrib example I just
> delete the /var/lib/data dirs and commit log.
> -Corey
> On Fri, Jun 18, 2010 at 3:11 PM, Phil Stanhope <> wrote:
>> "blow all the data away" ... how do you do that? What is the timestamp
>> precision that you are using when creating key/col or key/supercol/col
>> items?
>> I have seen a write to a key fail when the timestamp is identical to the
>> previous timestamp of a deleted key/col. While I didn't examine the source
>> code, I'm certain that this is due to delete tombstones.
>> I view this as an application error because I was attempting to do this
>> within the GCGraceSeconds time period. If I, however, stopped cassandra,
>> blew away data & commitlogs and restarted the write always succeeds (no
>> surprise there).
>> I turned this behavior into a feature (of sorts). When this happens I
>> increment a normally zero portion of the timestamp (the last digit of
>> precision which was always zero) and use this as a counter to track how many
>> times a key/col was updated (max 9 for my purposes).
>> -phil
>> On Jun 18, 2010, at 5:49 PM, Corey Hulen wrote:
>> >
>> > We are using MapReduce to periodically verify and rebuild our secondary
>> indexes along with counting total records.  We started to notice double
>> counting of unique keys on single machine standalone tests. We were finally
>> able to reproduce the problem using the
>> apache-cassandra-0.6.2-src/contrib/word_count example and just re-running it
>> multiple times.  We are hoping someone can verify the bug.
>> >
>> > re-run the tests and the word count for /tmp/word_count3/part-r-00000
>> will be 1000 +~200  and will change if you blow the data away and re-run.
>>  Notice the setup script loops and only inserts 1000 records so we expect
>> count to be 1000.  Once the data is generated then re-running the setup
>> script and/or mapreduce doesn't change the number (still off).  The key is
>> to blow all the data away and start over which will cause it to change.
>> >
>> > Can someone please verify this behavior?
>> >
>> > -Corey
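
Phil's timestamp observation in the thread above can be modelled with a toy reconciliation rule. This is only a sketch: it assumes the higher timestamp wins and that a tombstone wins a tie, which is consistent with what Phil describes but is not taken from the Cassandra source. The names reconcile and bump_until_visible and the timestamp value are hypothetical.

```python
def reconcile(a, b):
    """Pick the winner between two versions of a column.

    Each version is a tuple (timestamp, is_tombstone, value).
    Assumption: the higher timestamp wins; on a tie the tombstone wins.
    """
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    if a[1] != b[1]:
        return a if a[1] else b
    return a


def bump_until_visible(existing, timestamp, value, max_bumps=9):
    """Raise the low digit of the timestamp until the write wins.

    Mirrors the workaround described in the thread: the last digit is
    assumed to be zero initially, so it doubles as an update counter.
    """
    for bumps in range(max_bumps + 1):
        candidate = (timestamp + bumps, False, value)
        if reconcile(existing, candidate) is candidate:
            return candidate, bumps
    raise RuntimeError("write still shadowed after max_bumps increments")


# Hypothetical timestamp whose last digit is zero.
tombstone = (1276901820000, True, None)

# A write with the same timestamp loses to the tombstone.
print(reconcile(tombstone, (1276901820000, False, "v")) is tombstone)  # True

# Bumping the last digit once makes the write visible.
write, bumps = bump_until_visible(tombstone, 1276901820000, "v")
print(write, bumps)  # (1276901820001, False, 'v') 1
```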
