Date: Fri, 11 Nov 2011 00:20:31 -0500
Subject: Efficient map reduce over ranges of Cassandra data
From: Edward Capriolo
To: user@cassandra.apache.org

Hey all,

I know there are several tickets in the pipe that should make it possible to use secondary indexes to run map reduce jobs that do not have to ingest the entire dataset, such as:

https://issues.apache.org/jira/browse/CASSANDRA-1600

I ended up creating a sharded secondary index in user space (I just call it ordered buckets), described here:

http://www.slideshare.net/edwardcapriolo/casbase-presentation/27

Looking at the ordered buckets implementation, I realized it is a perfect candidate for "efficient map reduce" since it is easy to split.

A unit test of that implementation is here:

https://github.com/edwardcapriolo/casbase/blob/master/src/test/java/com/jointhegrid/casbase/hadoop/OrderedBucketInputFormatTest.java

With this you can currently do efficient map reduce on Cassandra data while waiting for other integrated solutions to come along.
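
To illustrate why a bucketed index is "easy to split", here is a rough in-memory sketch. It is not the casbase code or its API: the class OrderedBucketSketch, its methods, and the hash-based bucket assignment are all assumptions made up for illustration. The idea it models is that each bucket keeps its entries sorted by the indexed value, so a job over a value range can hand one split per bucket to a mapper, and each mapper does a sorted range read inside its own bucket instead of scanning everything.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model of a bucketed index; NOT the casbase API.
class OrderedBucketSketch {
    // One unit of work for a mapper: a single bucket plus a value range.
    static final class Split {
        final int bucket;
        final long start; // inclusive
        final long end;   // exclusive

        Split(int bucket, long start, long end) {
            this.bucket = bucket;
            this.start = start;
            this.end = end;
        }
    }

    private final int numBuckets;
    // One sorted map per bucket (indexed value -> row keys), standing in
    // for one wide row per bucket with sorted column names (an assumption).
    private final List<TreeMap<Long, List<String>>> buckets = new ArrayList<>();

    OrderedBucketSketch(int numBuckets) {
        this.numBuckets = numBuckets;
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new TreeMap<>());
        }
    }

    // Index a row under its value; the bucket is chosen by hashing the row key.
    void put(String rowKey, long value) {
        int bucket = Math.floorMod(rowKey.hashCode(), numBuckets);
        buckets.get(bucket).computeIfAbsent(value, v -> new ArrayList<>()).add(rowKey);
    }

    // Splitting is trivial: one split per bucket, all sharing the same range.
    List<Split> splits(long start, long end) {
        List<Split> out = new ArrayList<>();
        for (int b = 0; b < numBuckets; b++) {
            out.add(new Split(b, start, end));
        }
        return out;
    }

    // What a mapper would do: a sorted range read within one bucket only.
    List<String> read(Split s) {
        List<String> keys = new ArrayList<>();
        for (List<String> hit : buckets.get(s.bucket).subMap(s.start, s.end).values()) {
            keys.addAll(hit);
        }
        return keys;
    }

    public static void main(String[] args) {
        OrderedBucketSketch index = new OrderedBucketSketch(4);
        for (int i = 0; i < 100; i++) {
            index.put("row" + i, i); // indexed value equals i
        }
        int total = 0;
        for (Split s : index.splits(10, 20)) {
            total += index.read(s).size();
        }
        System.out.println("rows in [10, 20): " + total); // prints 10
    }
}
```

In this sketch the number of splits equals the number of buckets, and no split ever touches values outside the requested range. A real input format would also have to deal with data locality and paging through large buckets, which are left out here.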
