incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From aaron morton <aa...@thelastpickle.com>
Subject Re: Hadoop Integration: Limiting scan to a range of keys
Date Mon, 03 Dec 2012 21:04:53 GMT
For background, you may find the wide row setting useful: http://www.datastax.com/docs/1.1/cluster_architecture/hadoop_integration
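
For example, it is enabled when configuring the input column family (a sketch, not tested; the keyspace and column family names here are made up):

ConfigHelper.setInputColumnFamily(job.getConfiguration(), "MyKeyspace", "Events", true); // the boolean enables wide row support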

AFAIK all of the input row readers for Hadoop do range scans, and I think support for setting
the start and end token exists so that jobs only select data which is local to the node.
It's not really possible to select individual rows by token.

If you had a secondary index on a column you could use the setInputRange overload that takes
an index expression.
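
Something like this (a sketch, not tested; "bucket" is just an illustrative name for an indexed column):

import java.util.Arrays;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.IndexExpression;
import org.apache.cassandra.thrift.IndexOperator;
import org.apache.cassandra.utils.ByteBufferUtil;

// Restrict the input to rows whose indexed "bucket" column equals the hour value.
IndexExpression expr = new IndexExpression(ByteBufferUtil.bytes("bucket"),
                                           IndexOperator.EQ,
                                           ByteBufferUtil.bytes(1353456000));
ConfigHelper.setInputRange(job.getConfiguration(), Arrays.asList(expr));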

Or it may be easier to use Hive.

Hope that helps. 
 
-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 1/12/2012, at 3:04 PM, Jamie Rothfeder <jamie.rothfeder@gmail.com> wrote:

> Hey All,
> 
> I have a bunch of time-series data stored in a cluster using a ByteOrderedPartitioner.
> My keys are time buckets representing events that occurred in an hour. I've been trying to
> write a MapReduce job that considers only events within a certain time range by specifying
> an input range, but this doesn't seem to be working.
> 
> I expect the following code to scan data for a single key (1353456000), but it is scanning
> all keys.
> 
> int key = 1353456000;
> IPartitioner part = ConfigHelper.getInputPartitioner(job.getConfiguration());
> Token token = part.getToken(ByteBufferUtil.bytes(key));
> ConfigHelper.setInputRange(job.getConfiguration(),
>                            part.getTokenFactory().toString(token),
>                            part.getTokenFactory().toString(token));
> 
> Any idea what I'm doing wrong?
> 
> Thanks,
> Jamie

