Subject: Re: Differences in row iteration behavior
From: Jeremy Hanna
In-Reply-To: <5053F0DF.8050500@conga.com>
Date: Fri, 14 Sep 2012 22:21:29 -0500
To: user@cassandra.apache.org

Are there any deletions in your data? The Hadoop support doesn't filter out tombstones, though you may not be filtering them out in your code either. I've used the Hadoop support for doing a lot of data validation in the past and as long as you're sure that the code is sound, I'm pretty confident in it.

On Sep 14, 2012, at 10:07 PM, Todd Fast wrote:

> Hi--
>
> We are iterating rows in a column family two different ways and are seeing radically different row counts. We are using 1.0.8 and RandomPartitioner on a 3-node cluster.
>
> In the first case, we have a trivial Hadoop job that counts 29M rows using the standard MR pattern for counting (mapper outputs a single key with a value of 1, reducer adds up all the values).
>
> In the second case, we have a simple Quartz batch job which counts only 10M rows. We are iterating using chained calls to get_row_slices, as described on the wiki: http://wiki.apache.org/cassandra/FAQ#iter_world We've also implemented the batch job using Pelops, with and without chaining. In all cases, the job counts just 10M rows, and it is not encountering any errors.
>
> We are confident that we are doing everything right in both cases (no bugs), yet the results are baffling. Tests in smaller, single-node environments result in consistent counts between the two methods, but we don't have the same amount of data nor the same topology.
>
> Is the right answer 29M or 10M? Any clues to what we're seeing?
>
> Todd
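
For anyone comparing the two counts, here is a rough sketch of how a chained row iteration could report both "every key seen" and "keys that still have columns", so the effect of tombstones on a count becomes visible. It is not the Cassandra or Pelops API: fetch_page, make_fetch_page, count_rows and the sample rows are all hypothetical stand-ins that run against an in-memory list, and the sketch assumes that a deleted row shows up in a range scan as a key with no columns.

```python
from typing import Callable, Dict, List, Tuple

# (row key, columns). An empty dict stands for a deleted row that a range
# scan still returns as a bare key (assumption made for this sketch).
Row = Tuple[str, Dict[str, str]]


def make_fetch_page(rows: List[Row]) -> Callable[[str, int], List[Row]]:
    """Build a hypothetical stand-in for a range-slice call over an ordered row list."""
    index = {key: i for i, (key, _) in enumerate(rows)}

    def fetch_page(start_key: str, count: int) -> List[Row]:
        # An empty start key means "from the beginning"; the start key is inclusive.
        start = index[start_key] if start_key else 0
        return rows[start:start + count]

    return fetch_page


def count_rows(fetch_page: Callable[[str, int], List[Row]],
               page_size: int = 3) -> Tuple[int, int]:
    """Return (all keys seen, keys that still have at least one column)."""
    if page_size < 2:
        # With an inclusive start key, a page of 1 would never advance.
        raise ValueError("page_size must be at least 2")
    total = live = 0
    start_key = ""
    while True:
        page = fetch_page(start_key, page_size)
        # After the first page, the first row repeats the previous page's last row.
        if start_key:
            page = page[1:]
        if not page:
            break
        for _key, columns in page:
            total += 1
            if columns:
                live += 1
        start_key = page[-1][0]
    return total, live


# Made-up sample data: two of the five rows have been deleted.
example_rows: List[Row] = [
    ("a", {"col": "1"}),
    ("b", {}),
    ("c", {"col": "1"}),
    ("d", {}),
    ("e", {"col": "2"}),
]

print(count_rows(make_fetch_page(example_rows)))  # prints (5, 3)
```

If one job effectively reports the first number and the other reports the second, the two totals will disagree even though neither job hits an error.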