cassandra-user mailing list archives

From Matt Kennedy <stinkym...@gmail.com>
Subject Re: map reduce job over indexed range of keys
Date Fri, 25 Feb 2011 00:45:52 GMT
Right, so I'm interpreting silence as a confirmation on all points. I
opened:
https://issues.apache.org/jira/browse/CASSANDRA-2245
https://issues.apache.org/jira/browse/CASSANDRA-2246

to work on these.

On Wed, Feb 23, 2011 at 5:31 PM, Matt Kennedy <stinkymatt@gmail.com> wrote:

> Let me start out by saying that I think I'm going to have to write a patch
> to get what I want, but I'm fine with that.  I just wanted to check here
> first to make sure that I'm not missing something obvious.
>
> I'd like to be able to run a MapReduce job that takes a value in an indexed
> column as a parameter, and use that to select the data that the MapReduce
> job operates on.  Right now, it looks like this isn't possible because
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader will only fetch data
> with get_range_slices, not get_indexed_slices.
>
> An example might be useful.  Let's say I want to run a map reduce job over
> all the data for a particular country.  Right now I can do this in
> MapReduce by simply discarding all the data that is not from the country I
> want to process.  I suspect it will be faster if I can reduce the size of
> the MapReduce job by using secondary indexes in Cassandra to select only
> the data I want.
>
> So, first question: Am I wrong?  Is there some clever way to enable the
> behavior I'm looking for (without modifying the cassandra codebase)?
>
> Second question: If I'm not wrong, should I open a JIRA issue for this and
> start coding up this feature?
>
> Finally, the real reason that I want to get this working is so that I can
> enhance the CassandraStorage Pig LoadFunc so that it can take query
> parameters in the URL string that is used to specify the keyspace and
> column family.  So for example, you might load data into Pig with this
> syntax:
>
> rows = LOAD 'cassandra://mykeyspace/mycolumnfamily?country=UK' using
> CassandraStorage();
>
> I'd like to get some feedback on that syntax.
>
> Thanks,
> Matt Kennedy
>
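The trade-off described in the quoted message (read every row and discard the ones you don't want, versus selecting rows through a secondary index) can be illustrated with a small in-memory sketch. This is plain Python over a dict, not Cassandra or Hadoop code; the sample rows and the function names `scan_and_discard`, `build_index`, and `select_by_index` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: row key -> columns.
rows = {
    "user1": {"country": "UK", "visits": 3},
    "user2": {"country": "US", "visits": 5},
    "user3": {"country": "UK", "visits": 7},
}


def scan_and_discard(rows, country):
    """Read every row and keep only the ones from the wanted country.

    Returns the kept rows and the number of rows that had to be read.
    """
    rows_read = 0
    kept = {}
    for key, columns in rows.items():
        rows_read += 1
        if columns.get("country") == country:
            kept[key] = columns
    return kept, rows_read


def build_index(rows, column):
    """Build a toy secondary index: column value -> list of row keys."""
    index = defaultdict(list)
    for key, columns in rows.items():
        if column in columns:
            index[columns[column]].append(key)
    return index


def select_by_index(rows, index, value):
    """Fetch only the rows the index lists for the given value."""
    keys = index.get(value, [])
    return {key: rows[key] for key in keys}, len(keys)


country_index = build_index(rows, "country")
scanned, scanned_reads = scan_and_discard(rows, "UK")
indexed, indexed_reads = select_by_index(rows, country_index, "UK")
```

Both approaches return the same rows; the difference is how many rows the job has to read to get them, which is the saving the message hopes for.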

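For feedback on the proposed URL syntax, here is a minimal sketch of how such a location string might be split into keyspace, column family, and index filters. It uses only the Python standard library; `parse_location` is a hypothetical helper, not part of CassandraStorage, and treating every query parameter as an equality filter on an indexed column is an assumption.

```python
from urllib.parse import urlsplit, parse_qsl


def parse_location(location):
    """Split 'cassandra://keyspace/columnfamily?col=value' into its parts.

    Returns (keyspace, column_family, filters), where filters maps
    column names to the values they are assumed to be matched against.
    """
    parts = urlsplit(location)
    if parts.scheme != "cassandra":
        raise ValueError("expected a cassandra:// location: %r" % location)
    keyspace = parts.netloc
    column_family = parts.path.strip("/")
    if not keyspace or not column_family:
        raise ValueError("location must name a keyspace and a column family")
    filters = dict(parse_qsl(parts.query))
    return keyspace, column_family, filters


keyspace, column_family, filters = parse_location(
    "cassandra://mykeyspace/mycolumnfamily?country=UK"
)
```

A location without a query string yields an empty filter dict, so the existing `cassandra://keyspace/columnfamily` form would keep working unchanged under this scheme.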