cassandra-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Efficient map reduce over ranges of Cassandra data
Date Fri, 11 Nov 2011 05:20:31 GMT
Hey all,

I know there are several tickets in the pipe that should make it possible
to use secondary indexes to run map reduce jobs that do not have to ingest
the entire dataset, such as:

https://issues.apache.org/jira/browse/CASSANDRA-1600

I ended up creating a sharded secondary index in user space (I just
call it ordered buckets), described here:

http://www.slideshare.net/edwardcapriolo/casbase-presentation/27

Looking at the ordered buckets implementation I realized it is a perfect
candidate for "efficient map reduce" since it is easy to split.
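
To illustrate why a bucketed index is easy to split, here is a minimal
sketch. It is not the casbase code. It assumes the index is stored as a
fixed number of bucket rows numbered 0..bucketCount-1, and that one map
task can be handed a contiguous run of those rows. The names
OrderedBucketSplits, BucketSplit, and splits are hypothetical and exist
only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: divides a fixed set of index bucket rows into
// contiguous ranges, one range per map task. Assumes buckets are numbered
// 0..bucketCount-1; the real casbase layout may differ.
final class OrderedBucketSplits {

    // One unit of map work: bucket rows firstBucket..lastBucket, inclusive.
    static final class BucketSplit {
        final int firstBucket;
        final int lastBucket;

        BucketSplit(int firstBucket, int lastBucket) {
            this.firstBucket = firstBucket;
            this.lastBucket = lastBucket;
        }

        @Override
        public String toString() {
            return "[" + firstBucket + ".." + lastBucket + "]";
        }
    }

    // Spreads bucketCount buckets over at most desiredSplits ranges.
    // Range sizes differ by at most one bucket.
    static List<BucketSplit> splits(int bucketCount, int desiredSplits) {
        if (bucketCount < 1 || desiredSplits < 1) {
            throw new IllegalArgumentException("counts must be positive");
        }
        int n = Math.min(desiredSplits, bucketCount);
        int base = bucketCount / n;   // minimum buckets per split
        int extra = bucketCount % n;  // the first 'extra' splits get one more
        List<BucketSplit> result = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < n; i++) {
            int size = base + (i < extra ? 1 : 0);
            result.add(new BucketSplit(start, start + size - 1));
            start += size;
        }
        return result;
    }
}
```

Each range would then be read by one mapper, which only touches the index
rows in its range (and the data rows they point to) rather than the whole
column family.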

A unit test of that implementation is here:

https://github.com/edwardcapriolo/casbase/blob/master/src/test/java/com/jointhegrid/casbase/hadoop/OrderedBucketInputFormatTest.java

With this you can currently do efficient map reduce on Cassandra data, while
waiting for other integrated solutions to come along.
