Date: Fri, 18 Jun 2010 15:57:12 -0700
Subject: Re: Possible bug in Cassandra MapReduce
From: Corey Hulen
To: user@cassandra.apache.org

OK... I just verified on a clean EC2 small single-instance box using
apache-cassandra-0.6.2-src. I'm pretty sure the Cassandra MapReduce
functionality is broken. If your MapReduce jobs are idempotent then you are
OK, but if you are doing things like word count (as in the supplied example)
or key count you will get double counts.

-Corey

On Fri, Jun 18, 2010 at 3:15 PM, Corey Hulen wrote:

> I thought the same thing, but using the supplied contrib example I just
> delete the /var/lib/data dirs and commit log.
>
> -Corey
>
> On Fri, Jun 18, 2010 at 3:11 PM, Phil Stanhope wrote:
>
>> "blow all the data away" ... how do you do that? What is the timestamp
>> precision that you are using when creating key/col or key/supercol/col
>> items?
>>
>> I have seen a write to a key fail when the timestamp is identical to the
>> previous timestamp of a deleted key/col. While I didn't examine the
>> source code, I'm certain that this is due to delete tombstones.
>>
>> I view this as an application error because I was attempting to do this
>> within the GCGraceSeconds time period. If, however, I stopped Cassandra,
>> blew away the data and commit logs, and restarted, the write always
>> succeeded (no surprise there).
>>
>> I turned this behavior into a feature (of sorts). When this happens I
>> increment a formerly zero portion of the timestamp (the last digit of
>> precision, which was always zero) and use this as a counter to track how
>> many times a key/col was updated (max 9 for my purposes).
>>
>> -phil
>>
>> On Jun 18, 2010, at 5:49 PM, Corey Hulen wrote:
>>
>> > We are using MapReduce to periodically verify and rebuild our
>> > secondary indexes, along with counting total records. We started to
>> > notice double counting of unique keys on single-machine standalone
>> > tests. We were finally able to reproduce the problem using the
>> > apache-cassandra-0.6.2-src/contrib/word_count example and just
>> > re-running it multiple times. We are hoping someone can verify the
>> > bug.
>> >
>> > Re-run the tests and the word count for /tmp/word_count3/part-r-00000
>> > will be 1000 + ~200 and will change if you blow the data away and
>> > re-run. Notice the setup script loops and only inserts 1000 records,
>> > so we expect the count to be 1000. Once the data is generated,
>> > re-running the setup script and/or MapReduce doesn't change the number
>> > (still off). The key is to blow all the data away and start over,
>> > which will cause it to change.
>> >
>> > Can someone please verify this behavior?
>> >
>> > -Corey
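
To make the idempotency point concrete, here is a minimal sketch in plain Python (not Hadoop or Cassandra code). The thread does not establish why the count is off, so the sketch simply assumes, hypothetically, that some input rows reach the job more than once. The row keys, the word "word", and the 200 repeated rows are made up for illustration. It only shows why a summing job is sensitive to repeated input while a set-based job is not.

```python
from collections import Counter


def count_words(rows):
    """Summing job: every delivered row adds to the tally, repeats included."""
    counts = Counter()
    for _key, text in rows:
        counts.update(text.split())
    return counts


def distinct_keys(rows):
    """Set-based job: seeing the same key twice does not change the result."""
    return {key for key, _text in rows}


# 1000 rows, each holding the same single word (illustrative data only).
rows = [("key%d" % i, "word") for i in range(1000)]

# Hypothetical fault: the first 200 rows are delivered to the job twice.
rows_with_repeats = rows + rows[:200]

print(count_words(rows)["word"])
print(count_words(rows_with_repeats)["word"])
print(len(distinct_keys(rows_with_repeats)))
```

Under that assumption the summing job reports more than 1000 while the set of distinct keys still has 1000 entries, which is consistent with idempotent jobs being unaffected.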

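
Phil's timestamp trick can be sketched roughly as follows. This is an assumption-laden illustration in plain Python, not his code and not a Cassandra API: the function names `next_timestamp` and `update_count` are hypothetical, and the sketch assumes integer timestamps whose last decimal digit is normally zero, so that digit can serve as a per-slot update counter with a maximum of 9.

```python
def next_timestamp(base_ts, previous_ts=None):
    """Pick a write timestamp strictly newer than previous_ts.

    base_ts is the clock-derived timestamp and is assumed to end in 0;
    the last digit is reserved as an update counter (0 through 9).
    previous_ts is the timestamp of the last write or delete, if known.
    """
    if base_ts % 10 != 0:
        raise ValueError("base timestamp must end in 0")
    if previous_ts is None or previous_ts < base_ts:
        # No collision: the clock-derived timestamp is already newer.
        return base_ts
    if previous_ts // 10 != base_ts // 10:
        raise ValueError("previous write belongs to a later time slot")
    if previous_ts % 10 == 9:
        raise OverflowError("more than 9 updates in one time slot")
    # Collision within the same slot: bump the counter digit.
    return previous_ts + 1


def update_count(ts):
    """Read back how many times the counter digit was bumped."""
    return ts % 10


first = next_timestamp(1276901832290)
second = next_timestamp(1276901832290, previous_ts=first)
print(first, second, update_count(second))
```

Here the second write collides with the first, so it is issued with the last digit bumped to 1, and that digit doubles as the update count.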