From: Ben Hood <0x6e6562@gmail.com>
To: user@cassandra.apache.org
Date: Tue, 2 Oct 2012 18:18:25 +0100
Subject: Re: 1000's of column families
Message-ID: <61089BF884104AE0BC5A1EEA7654E0D3@gmail.com>
In-Reply-To: <0F7F79E7-2CFB-4568-BDD2-2A56CC054B62@gmail.com>

Jeremy,

On Tuesday, October 2, 2012 at 17:06, Jeremy Hanna wrote:
> Another option that may or may not work for you is the support in
> Cassandra 1.1+ to use a secondary index as an input to your mapreduce
> job. What you might do is add a field to the column family that
> represents which virtual column family that it is part of. Then when
> doing mapreduce jobs, you could use that field as the secondary index
> limiter. Secondary index mapreduce is not as efficient since you first
> get all of the keys and then do multigets to get the data that you need
> for the mapreduce job. However, it's another option for not scanning
> the whole column family.

Interesting. This is probably a stupid question but why shouldn't you be
able to use the secondary index to go straight to the slices that belong
to the attribute you are searching by? Is this something to do with the
way Cassandra is exposed as an InputFormat for Hadoop or is this a
general property for searching by secondary index?

Ben
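
The two-step access pattern described in the quoted paragraph (ask the index for the matching keys first, then multiget the rows) can be sketched as a toy in-memory model. This is only an illustration of the pattern, not Cassandra's or Hadoop's API: `ToyColumnFamily`, `keys_for`, `multiget`, `full_scan`, `rows_via_index` and the `virtual_cf` field name are all hypothetical names chosen for this sketch.

```python
from collections import defaultdict


class ToyColumnFamily:
    """In-memory stand-in for one physical column family (hypothetical, not Cassandra's API)."""

    def __init__(self):
        self.rows = {}                 # row key -> dict of columns
        self.index = defaultdict(set)  # value of "virtual_cf" -> set of row keys

    def insert(self, key, columns):
        # Keep the toy index consistent when a row is overwritten.
        old = self.rows.get(key)
        if old is not None:
            self.index[old["virtual_cf"]].discard(key)
        self.rows[key] = dict(columns)
        self.index[columns["virtual_cf"]].add(key)

    def keys_for(self, virtual_cf):
        """Step 1: the index only yields the matching row keys."""
        return sorted(self.index.get(virtual_cf, ()))

    def multiget(self, keys):
        """Step 2: fetch every matching row by key."""
        return {key: self.rows[key] for key in keys}

    def full_scan(self, virtual_cf):
        """The alternative: read every row and filter on the field."""
        return {k: r for k, r in self.rows.items() if r["virtual_cf"] == virtual_cf}


def rows_via_index(cf, virtual_cf):
    # Two round trips in this model: index lookup, then multiget.
    return cf.multiget(cf.keys_for(virtual_cf))


if __name__ == "__main__":
    cf = ToyColumnFamily()
    cf.insert("u1", {"virtual_cf": "users", "name": "ann"})
    cf.insert("o1", {"virtual_cf": "orders", "total": 12})
    cf.insert("u2", {"virtual_cf": "users", "name": "bob"})
    print(rows_via_index(cf, "users"))
```

In this model both paths return the same rows; the index path touches only the matching keys but pays for a second lookup per row, which is the trade-off the quoted paragraph points at.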